An Introduction to Hadoop

Dec 18, 2020 • 7 Minute Read

Introduction

With the advent of big data and cloud computing, the need to efficiently store and process large datasets is constantly increasing. Using a single resource for storage and processing is unviable due to the cost, the sheer amount of data being generated, and the risk of failure with probable loss of valuable data.

This is the situation that gave rise to Hadoop, an open-source platform for distributed storage and processing of large datasets on compute clusters. For distributed computing, Hadoop uses MapReduce, and for distributed storage, it uses the Hadoop Distributed File System (HDFS).
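To give a rough feel for the MapReduce model, the classic word-count job can be imitated on a single machine with a shell pipeline. This is only a local analogy, not how Hadoop executes jobs: the function name wordcount is made up for illustration, and real MapReduce spreads the same three phases across many nodes.

```shell
# Local analogy of a MapReduce word count (hypothetical helper, not a Hadoop API).
wordcount() {
    tr -s '[:space:]' '\n' |            # "map": emit one word per line
        sort |                          # "shuffle": bring identical words together
        uniq -c |                       # "reduce": count each group of equal words
        awk 'NF == 2 { print $2, $1 }'  # print "word count", skipping blank entries
}
```

For example, printf 'a b a\n' | wordcount prints one line per distinct word followed by its count.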

Use Cases and Application Areas

Hadoop is mainly applied in areas where there is a large amount of data that requires processing and storage. Because Hadoop scales horizontally, capacity grows by adding more machines to the cluster, and work is processed in parallel across them.

Common business use cases and applications of this technology include:

Commerce and Finance: The Hadoop ecosystem can be applied to store and process the data generated by commerce and trading activities for analysis and derivation of value.
Cyber Security and Fraud Detection: Hadoop can be employed in distributed processing of network data to identify potential threats to enterprise or large government networks in real time.
Healthcare: Healthcare data is sensitive, and it can be stored in a distributed fashion using Hadoop, which also provides fail-safes in case of failure.
Marketing Analysis and Targeting: Hadoop is being deployed to capture and analyze clickstream and multimedia data generated by social media users, which can then be further processed for purposes of marketing.

More real-world applications can be found here.

Download and Setup

A prerequisite for this process is to have openssh-server and a Java Development Kit (JDK) installed on your machine.

This guide explores installing on Ubuntu. To install on Windows, follow this guide. Mac users can follow this guide.

Below are the steps to follow when installing Hadoop:

Create a hadoop group and a hadoop user that belongs to it.

      sudo addgroup hadoop
sudo adduser --ingroup hadoop hadoop
    

Switch user to the newly created hadoop user.

      sudo su - hadoop

Set up and test passwordless SSH keys, since Hadoop requires SSH access for node management.

      ssh-keygen -t rsa -P ""
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
ssh localhost
    
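If ssh localhost still prompts for a password after these steps, one common cause is overly permissive file modes, since sshd by default ignores keys when ~/.ssh or authorized_keys is writable by other users. The sketch below tightens those permissions. The function name fix_ssh_permissions is hypothetical, and it assumes the default ~/.ssh location unless another directory is passed.

```shell
# Hypothetical helper: restrict permissions on an SSH directory and its
# authorized_keys file so that sshd accepts the keys.
fix_ssh_permissions() {
    ssh_dir="${1:-$HOME/.ssh}"                 # default to the current user's ~/.ssh
    chmod 700 "$ssh_dir" &&                    # only the owner may enter the directory
        chmod 600 "$ssh_dir/authorized_keys"   # only the owner may read or write the keys
}
```

Run fix_ssh_permissions as the hadoop user, then try ssh localhost again.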

Download Hadoop from the official Apache site. Then navigate to your downloaded file, extract it, and move it to the /usr/local folder.

      tar -xzvf hadoop-3.3.0.tar.gz
sudo mv hadoop-3.3.0 /usr/local/hadoop
    

Grant ownership of the folder to the hadoop user you created.

      sudo chown -R hadoop:hadoop /usr/local/hadoop

Note: The folder hadoop-3.3.0 depends on the version you download. It is not fixed.

To configure your system for Hadoop, open the ~/.bashrc file in an editor, for example with the command nano ~/.bashrc, and add the following lines at the end. Once done, save and close the file.

      export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
    

Load the new settings into the current shell using the command source ~/.bashrc. At this point, Hadoop is installed.
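A quick way to confirm that the exports took effect is to check each variable in the current shell. The following is a minimal sketch: the function name check_hadoop_env is hypothetical, and the list of variables simply mirrors the ones added to ~/.bashrc above.

```shell
# Hypothetical helper: report which of the Hadoop-related variables are unset.
check_hadoop_env() {
    status=0
    for name in JAVA_HOME HADOOP_HOME HADOOP_MAPRED_HOME \
                HADOOP_COMMON_HOME HADOOP_HDFS_HOME YARN_HOME; do
        eval "value=\${$name:-}"        # look up the variable by name (POSIX sh)
        if [ -z "$value" ]; then
            echo "$name is not set"
            status=1
        fi
    done
    return "$status"                    # 0 when every variable is set, 1 otherwise
}
```

Calling check_hadoop_env prints nothing and returns 0 when everything is in place.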

Configure Hadoop's hadoop-env.sh file to set the JAVA_HOME variable. To open the file, run the command:

      sudo nano /usr/local/hadoop/etc/hadoop/hadoop-env.sh

Add the line export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/. This path should be the root directory of the Java installation on your machine, so adjust it if your installation lives elsewhere.
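If you are unsure where Java is installed, the path can usually be derived from the java binary itself, which on Ubuntu is typically a chain of symbolic links. The sketch below resolves those links and strips the trailing bin/java. The function name java_home_for is hypothetical, and it assumes GNU readlink with the -f option, as shipped on Ubuntu.

```shell
# Hypothetical helper: given the path to a java binary, print the
# installation root, i.e. the directory two levels above the real binary.
java_home_for() {
    java_bin="$(readlink -f "$1")" || return 1   # follow all symlinks to the real file
    dirname "$(dirname "$java_bin")"             # strip "/bin/java"
}
```

A typical call is java_home_for "$(command -v java)"; the printed directory is a candidate value for JAVA_HOME.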

Configure Hadoop's core-site.xml by adding the following lines:

      <configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hadoop/hdata</value>
</property>
</configuration>
    

To open the core-site.xml file, run the command:

      sudo nano /usr/local/hadoop/etc/hadoop/core-site.xml
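After editing a configuration file, it can help to read a property back and confirm that the value was saved as intended. The following is a rough sketch rather than a real XML parser: the function name get_property is hypothetical, and it assumes each name and value element sits on a single line, as in the snippets in this guide.

```shell
# Hypothetical helper: print the value of one property from a Hadoop
# *-site.xml file. Usage: get_property <file> <property-name>
get_property() {
    awk -v key="$2" '
        /<name>/ {
            name = $0                   # remember the most recent property name
            sub(/.*<name>/, "", name)
            sub(/<\/name>.*/, "", name)
        }
        /<value>/ && name == key {
            value = $0                  # the value that belongs to that name
            sub(/.*<value>/, "", value)
            sub(/<\/value>.*/, "", value)
            print value
            exit
        }
    ' "$1"
}
```

For example, get_property /usr/local/hadoop/etc/hadoop/core-site.xml fs.defaultFS should print the address configured above. Once Hadoop is on your PATH, the command hdfs getconf -confKey fs.defaultFS reports the value that Hadoop itself resolves.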

Configure Hadoop's hdfs-site.xml by adding the following lines:

      <configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
    

To open the hdfs-site.xml file, run the command:

      sudo nano /usr/local/hadoop/etc/hadoop/hdfs-site.xml

Configure Hadoop's mapred-site.xml by adding the following lines. Hadoop 3.x ships this file directly; on Hadoop 2.x, you first create it by copying the existing template mapred-site.xml.template.

      <configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
    

To edit the file, run the commands below. The cp step applies only to Hadoop 2.x, where the template file exists; skip it on Hadoop 3.x.

      cp /usr/local/hadoop/etc/hadoop/mapred-site.xml.template /usr/local/hadoop/etc/hadoop/mapred-site.xml
sudo nano /usr/local/hadoop/etc/hadoop/mapred-site.xml
    

Configure Hadoop's yarn-site.xml by adding the following lines:

      <configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>
    

To open the yarn-site.xml file, run the command:

      sudo nano /usr/local/hadoop/etc/hadoop/yarn-site.xml

Starting Hadoop Services

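Before starting the Hadoop cluster for the first time, format the NameNode. Formatting erases existing HDFS metadata, so do this only on a fresh installation. Run the command: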

      hdfs namenode -format

To start the HDFS service:

      start-dfs.sh

To start the YARN service:

      start-yarn.sh

To verify all services are running, run the command jps and visit the NameNode web interface at http://localhost:9870 (Hadoop 3.x; Hadoop 2.x uses port 50070). The ResourceManager interface can be found at http://localhost:8088.
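On a single-node setup like this one, jps should list the daemons started by start-dfs.sh and start-yarn.sh: NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager. The sketch below reads jps output and prints any of those that are missing. The function name missing_daemons is hypothetical, and it assumes the usual jps output of one "pid name" pair per line.

```shell
# Hypothetical helper: read `jps` output on stdin and print the name of
# every expected Hadoop daemon that does not appear in it.
missing_daemons() {
    jps_output="$(cat)"
    for daemon in NameNode DataNode SecondaryNameNode ResourceManager NodeManager; do
        # Match " <name>" at the end of a line so that, for example,
        # SecondaryNameNode is not mistaken for NameNode.
        if ! printf '%s\n' "$jps_output" | grep -q " $daemon\$"; then
            echo "$daemon"
        fi
    done
}
```

Run it as jps | missing_daemons; empty output means all five daemons were found.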

Conclusion

This guide introduced what Hadoop is, where it is used, and how to set it up on a single machine. As big data becomes more and more ubiquitous, the need for technologies like Hadoop and for big data engineers is growing quickly. To build on the knowledge acquired in this guide, you can further explore other technologies in the ecosystem such as Apache Ambari, Hive, Pig, and YARN, among others.