Understanding Hadoop's Distributed File System

Sep 3, 2020 • 5 Minute Read

Introduction

The resource demand on big data apps is usually very large. Since this data is also valuable, it is important that the data is stored in replicas to reduce the risk of data loss by hardware failure. The need for scalability and reliability in big, data-intensive apps has given rise to the distributed storage and processing architecture.

The Hadoop Distributed File System (HDFS) is a descendant of the Google File System, which was developed to solve the problem of big data processing at scale. HDFS is simply a distributed file system. This means that a single large dataset can be stored in several different storage nodes within a compute cluster. HDFS is how Hadoop is able to offer scalability and reliability for the storage of large datasets in a distributed fashion.

Motivation for a Distributed Approach

Why would a company choose a distributed architecture over the usual single-node architecture?

Fault Tolerance: A distributed architecture is more immune to failures due to the existence of replicas.
Scalability: HDFS can allow adding more nodes to a cluster. It is not limited. This allows a business to easily scale with demand.
Access speeds: A distributed architecture offers fast data retrieval from storage compared to traditional relational databases.
Cost effectiveness: At the scale of big data, a distributed approach is cheaper in terms of storage and retrieval computations since the extra processing required to read and write into a relational database would cost so much more.

With all its advantages, HDFS and Hadoop work best for use cases with very large amounts of data. If your use case has relatively small and manageable data, an HDFS would not be ideal.

HDFS Architecture

Components include:

Namenode: The master node. It manages access to files on the system and also manages the namespace of the file system.

DataNode: The slave node. It performs read/write operations on the file system as well as block operations such as creation, deletion, and replication.

Block: The unit of storage. A file is broken up into blocks, and different blocks are stored on different datanodes as per instructions of the namenode.

Several datanodes can be grouped into a single unit and referred to as a Rack. The rack design depends on the big data architect and what they wish to achieve. A typical architecture may have a main rack and some replication racks for purposes of redundancy in case of failures.

Basic Usage Commands

To use the Hadoop file system, you need some background knowledge in Linux or command line usage since most of the commands are similar to those in Linux.

To start up the HDFS, assuming you have Hadoop installed on your machine, run the command

      start-dfs.sh

The general pattern is to use basic Linux file system commands but preceded by the command hadoop fs so they run on HDFS and not the local file system.

To create a new directory within the HDFS, use the command

      hadoop fs -mkdir <directory_name>

To copy a file from your host machine onto HDFS, use the command

      hadoop fs -put ~/path/to/localfile /path/to/hadoop/directory

The reverse of put is get, which copies files from HDFS back to host file system.

      hadoop fs -get /hadoop/path/to/file ~/path/to/local/storage

More Linux file system commands such as chown, rm, mv, cp, ls, and others are also available in HDFS.

Conclusion

You should now have a clear understanding of one of the fundamental technologies of Hadoop. With advancements in technology, Hadoop and HDFS have faced competition from alternatives such as Google Big Query, Cloudera and others. Further understanding Hadoop and its technologies in general is vital for anyone who wishes to work as a big data engineer or big data architect.

To further build on this guide, learn more about the other fundamental components in the Hadoop ecosystem. These include but are not limited to Apache Ambari for management of Hadoop clusters, Hive for data querying and analysis, MapReduce for distributed processing, Pig for writing data transformations using an SQL-like language, and YARN for resource management and scheduling.