Getting Started with HDFS

Learning to work with the Hadoop Distributed File System (HDFS) is a baseline skill for anyone administering or developing in the Hadoop ecosystem. In this course, you will learn how to work with HDFS, Hive, Pig, Sqoop, and HBase from the command line.
Course info
Rating: (113)
Level: Beginner
Updated: Feb 16, 2016
Duration: 2h 48m
Table of contents
Understanding HDFS
Creating, Manipulating, and Retrieving HDFS Files
Transferring Relational Data to HDFS Using Sqoop
Querying Data with Pig and Hive
Processing Sparse Data with HBase
Automating Basic HDFS Operations
Description

Getting Started with Hadoop Distributed File System (HDFS) is designed to give you everything you need to learn how to use HDFS to read, store, and remove files. In addition to working with files in Hadoop, you will learn how to take data from relational databases and import it into HDFS using Sqoop. After we have our data inside HDFS, we will learn how to use Pig and Hive to query that data. Building on our HDFS skills, we will look at how to use HBase for near real-time data processing. Whether you are a developer, administrator, or data analyst, the concepts in this course are essential to getting started with HDFS.
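
To give a flavor of the querying step mentioned above, here is a minimal sketch of running a Hive query over data already sitting in HDFS, straight from the command line. The table name, column layout, and HDFS path are illustrative assumptions, not material from the course.

    # Hypothetical example: expose a tab-delimited HDFS file to Hive, then query it.
    hive -e "CREATE EXTERNAL TABLE IF NOT EXISTS sales (id INT, item STRING, amount DOUBLE)
             ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
             LOCATION '/user/demo/sales';"
    hive -e "SELECT item, COUNT(*) AS orders FROM sales GROUP BY item;"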

About the author

Thomas is a Senior Software Engineer and Certified ScrumMaster. He spends most of his time working with the Hortonworks Data Platform and on Agile coaching.

More from the author
Enterprise Skills in Hortonworks Data Platform (Intermediate, 1h 37m, 21 Sep 2018)
Analyzing Machine Data with Splunk (Beginner, 2h 38m, 4 Nov 2016)
Section Introduction Transcripts

Creating, Manipulating, and Retrieving HDFS Files
Welcome to module 2 of Getting Started with HDFS. By now we have a good understanding of the Hadoop Distributed File System and we have our development environment set up. Now we're going to shift our focus to learning how to navigate HDFS and picking up some of the basic commands. This module is key to understanding how to interact with the HDFS command shell. It's going to set the tone for the next modules, where we'll actually be accessing the data that we have in HDFS. Let's get an overview of some of the topics that we're going to discuss in this module. First we'll look at specific actions we can perform in HDFS, things like reading, creating, and deleting files. There are also a few key points we want to cover in this section, because there are a few things we can't do with HDFS. Next, we want to look at how to interact with HDFS from the command line. We're going to look at a few key ways that we can access the shell and even learn how to structure our commands. Then we're going to jump in and look at the basic HDFS commands, like list and touch, and then we'll begin a walkthrough of how to move data around in HDFS. We'll look at where we can find sample data and how we can move it from a Windows machine into our Linux cluster. In the last part of this module we want to discuss a few maintenance and administrative commands. It won't be a deep dive, but it'll be enough to learn how to take out the trash in HDFS. Now let's look and see what we can do from the HDFS command line.
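
As a quick preview, here is a minimal sketch of the kind of commands this module walks through. The directory and file names are illustrative assumptions, not files from the course.

    # List, create, read, and remove files in HDFS (paths are hypothetical).
    hdfs dfs -ls /                                # list the root of HDFS
    hdfs dfs -mkdir -p /user/demo/input           # create a directory
    hdfs dfs -touchz /user/demo/input/empty.txt   # "touch" a zero-length file
    hdfs dfs -put sample.txt /user/demo/input/    # copy a local file into HDFS
    hdfs dfs -cat /user/demo/input/sample.txt     # read it back
    hdfs dfs -rm /user/demo/input/sample.txt      # delete it (moved to trash if enabled)
    hdfs dfs -expunge                             # take out the trash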

Transferring Relational Data to HDFS Using Sqoop
So far we've used the command line to move data into HDFS one file at a time. But what happens when you want to move an entire database with hundreds of tables? You'd better grab some coffee, because it's going to be a long day trying to move all of those with hdfs dfs commands. That's where Sqoop comes in. It's an application in the Hadoop ecosystem that can automate the process for you. In this module we're going to talk about Sqoop and even walk through a demo of moving data from HDFS into MySQL and vice versa. Let's get a good overview of what we're going to look at in this module. First we're going to define Sqoop and talk about how a tool that's spelled in a funny way can actually save developers hours in a day. We'll also highlight the benefits of using this open source tool and talk about the many use cases, which will show you how to use Sqoop in your own projects. Next we'll walk through the documentation, which is where you'll find the source code and even some third-party extensions that'll help you use Sqoop in your projects. And lastly we're going to walk through a demo where we'll transfer an entire table from MySQL into HDFS with Sqoop. It'll all be done from the command line, and you won't even have to write a MapReduce job to do it; Sqoop will do it for you. Now let's define what Sqoop is.
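
For reference, here is a minimal sketch of the kind of Sqoop import the demo builds toward, assuming a local MySQL database named shop with a customers table; the connection details and paths are illustrative assumptions. Running sqoop export with an --export-dir applies the same idea in the other direction.

    # Import an entire MySQL table into HDFS; Sqoop generates the MapReduce job.
    # Database name, credentials, and target directory are hypothetical.
    sqoop import \
      --connect jdbc:mysql://localhost/shop \
      --username demo -P \
      --table customers \
      --target-dir /user/demo/customers \
      --num-mappers 1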

Processing Sparse Data with HBase
Hadoop is all about unstructured data. It's about being able to take our favorite book or even a dictionary, throw it into HDFS, run a MapReduce job over it to do some analysis, and then have it produce the results for us. All this is done in a distributed environment where we're not really concerned about the key-value pairs, where the data is stored, or how it's distributed across the cluster. But what happens when we do care about that structure? What happens when we need a NoSQL database? What happens when you need access to your data in real time versus batch? That's what this module is all about: processing sparse data with HBase. How are we going to learn about HBase in this module? We'll start off by explaining what HBase is and where it sits in the Hadoop architecture. Next we'll build the case for HBase and define some of the projects where HBase is a good fit and those where it's not. And since this course is all about how to do things from the command line, we're going to jump into the HBase shell, then we'll learn to ingest data into HBase in a real-world project. Finally we'll look at some of the resources for HBase. Now let's start off by defining HBase.
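
To make that concrete, here is a minimal sketch of the sort of HBase shell session this module works up to. The table name, column family, and values are illustrative assumptions, not examples from the course.

    # Commands typed inside the HBase shell (start it with: hbase shell).
    # Table and column family names are hypothetical.
    create 'sensor_readings', 'metrics'
    put 'sensor_readings', 'row1', 'metrics:temperature', '21.5'
    get 'sensor_readings', 'row1'
    scan 'sensor_readings'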