An Introduction to Hadoop

Kimaru Thagana

  • Aug 31, 2020
  • 7 Min read
  • 546 Views
Data
Hadoop
Big Data
Data Analytics

Introduction

With the advent of big data and cloud computing, the need to efficiently store and process large datasets is constantly increasing. Using a single resource for storage and processing is unviable due to the cost, the sheer amount of data being generated, and the risk of failure with probable loss of valuable data.

This is the situation that gave rise to Hadoop, an open-source platform for distributed storage and processing of large datasets on compute clusters. For distributed computing, Hadoop uses MapReduce, and for distributed storage, it uses the Hadoop Distributed File System (HDFS).
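
To make the MapReduce model more concrete, the sketch below imitates its three phases (map, shuffle, reduce) on a single machine using standard Unix tools. It is only an analogy and not Hadoop code: the function names map_words, reduce_counts, and word_count are made up for illustration, and on a real cluster each phase would run in parallel across many nodes.

```shell
# Map: emit one lowercase word per line (one key per record).
map_words() {
  tr -cs '[:alpha:]' '\n' | tr '[:upper:]' '[:lower:]' | sed '/^$/d'
}

# Reduce: count the occurrences of each key. Input must already be grouped.
reduce_counts() {
  uniq -c | awk '{ print $2 "\t" $1 }'
}

# Shuffle: sort brings identical keys together between the two phases.
word_count() {
  map_words | sort | reduce_counts
}

printf 'big data needs big clusters\n' | word_count
```

The pipeline prints each distinct word followed by its count. The key idea carried over to Hadoop is that the map and reduce steps only ever see independent records or groups, which is what allows the work to be split across machines.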

Use Cases and Application Areas

Hadoop is mainly applied in areas where large amounts of data require storage and processing. Because Hadoop scales horizontally, capacity grows by adding nodes to the cluster, and processing runs in parallel across those nodes.

Common business use cases and applications of this technology include:

  1. Commerce and Finance: The Hadoop ecosystem can be applied to store and process the data generated by commerce and trading activities for analysis and derivation of value.
  2. Cyber Security and Fraud Detection: Hadoop can be employed in distributed processing of network data to identify potential threats to enterprise or large government networks in real time.
  3. Healthcare: Healthcare data is sensitive, and it can be stored in a distributed fashion using Hadoop, which also provides fail-safes in case of failure.
  4. Marketing Analysis and Targeting: Hadoop is being deployed to capture and analyze clickstream and multimedia data generated by social media users, which can then be further processed for purposes of marketing.

More real-world applications can be found here.

Download and Setup

A prerequisite for this process is to have openssh-server and a Java Development Kit (JDK) installed on your machine. The paths used in this guide assume OpenJDK 8.
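
As a quick sanity check before continuing, a small script along the following lines can report which command-line tools are missing. This is a sketch under assumptions: the helper name missing_tools is hypothetical, and finding the ssh client on the PATH does not prove that the openssh-server service is installed or running (the ssh localhost test later in this guide covers that).

```shell
# Report which of the given commands are missing from PATH.
# Prints one name per missing command and returns non-zero if any is absent.
missing_tools() {
  status=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
      status=1
    fi
  done
  return "$status"
}

# java and javac come from the JDK; ssh and ssh-keygen come from OpenSSH.
if missing=$(missing_tools java javac ssh ssh-keygen); then
  echo "All prerequisites found"
else
  echo "Missing tools:" $missing
fi
```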

This guide explores installing on Ubuntu. To install on Windows, follow this guide. Mac users can follow this guide.

Below are the steps to follow when installing Hadoop:

Create a hadoop group and a hadoop user that belongs to it.

sudo addgroup hadoop
sudo adduser --ingroup hadoop hadoop

Switch user to the newly created hadoop user.

sudo su - hadoop

Set up and test passwordless SSH keys, since Hadoop requires SSH access for node management.

ssh-keygen -t rsa -P ""
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
ssh localhost

Download Hadoop from the official Apache site. Then navigate to the downloaded archive, extract it, and move the extracted folder to /usr/local/hadoop.

tar -xzvf hadoop-3.3.0.tar.gz
sudo mv hadoop-3.3.0 /usr/local/hadoop

Grant ownership of the folder to the hadoop user you created.

sudo chown -R hadoop:hadoop /usr/local/hadoop

Note: The folder name hadoop-3.3.0 depends on the version you download. It is not fixed, so adjust the commands above to match your download.
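
One way to keep the archive and folder names in sync is to set the version once and derive everything else from it. The sketch below is a dry run that only prints the commands; the variable names HADOOP_VERSION, HADOOP_DIR, and HADOOP_ARCHIVE are chosen for illustration and are not read by Hadoop itself.

```shell
# Set the version once; every other name is derived from it.
HADOOP_VERSION="3.3.0"
HADOOP_DIR="hadoop-${HADOOP_VERSION}"
HADOOP_ARCHIVE="${HADOOP_DIR}.tar.gz"

# Dry run: print the commands instead of executing them.
echo "tar -xzvf ${HADOOP_ARCHIVE}"
echo "sudo mv ${HADOOP_DIR} /usr/local/hadoop"
```

Once the printed commands look right for your download, they can be run directly.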

To configure your system for Hadoop, open the ~/.bashrc file in an editor using the command nano ~/.bashrc and add the following lines at the end. Once done, save and close the file.

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"

Reload the ~/.bashrc file in the current shell using the command source ~/.bashrc. At this point, Hadoop is installed but not yet configured.

Configure Hadoop's hadoop-env.sh file to set the JAVA_HOME variable. To open the file, run the command:

sudo nano /usr/local/hadoop/etc/hadoop/hadoop-env.sh

Add the line export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64, the same value used in ~/.bashrc. This is typically the root directory of the Java installation.
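
If Java is installed somewhere else on your machine, a candidate value for JAVA_HOME can be derived from the java binary found on the PATH. The following is a minimal sketch, assuming a Linux system where readlink -f is available; the helper name java_home_from_binary is hypothetical.

```shell
# Derive a JAVA_HOME candidate from the full path of a java binary by
# removing the trailing /bin/java and, if present, a trailing /jre.
java_home_from_binary() {
  home=${1%/bin/java}
  home=${home%/jre}
  echo "$home"
}

# Resolve symlinks (for example /usr/bin/java) to the real binary first.
if command -v java >/dev/null 2>&1; then
  java_home_from_binary "$(readlink -f "$(command -v java)")"
fi
```

Treat the printed path as a suggestion and confirm that it contains a bin/java executable before using it.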

Configure Hadoop's core-site.xml by adding the following lines:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hadoop/hdata</value>
  </property>
</configuration>

To open the core-site.xml file, run the command:

sudo nano /usr/local/hadoop/etc/hadoop/core-site.xml

Configure Hadoop's hdfs-site.xml by adding the following lines:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

To open the hdfs-site.xml file, run the command:

sudo nano /usr/local/hadoop/etc/hadoop/hdfs-site.xml

Configure Hadoop's mapred-site.xml by adding the following lines. Hadoop 3.x releases ship this file directly, while Hadoop 2.x releases provide only the template mapred-site.xml.template, which must be copied first.

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

To create the file from the template if it does not already exist (needed only on Hadoop 2.x) and then edit it, run the commands:

[ -f /usr/local/hadoop/etc/hadoop/mapred-site.xml ] || cp /usr/local/hadoop/etc/hadoop/mapred-site.xml.template /usr/local/hadoop/etc/hadoop/mapred-site.xml
sudo nano /usr/local/hadoop/etc/hadoop/mapred-site.xml

Configure Hadoop's yarn-site.xml by adding the following lines:

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>

To open the yarn-site.xml file, run the command:

sudo nano /usr/local/hadoop/etc/hadoop/yarn-site.xml
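
After editing the configuration files, it can help to read a value back and confirm that it was saved as intended. With Hadoop on the PATH, hdfs getconf -confKey fs.defaultFS does this against the live configuration. The sketch below is a standalone alternative that needs only awk; the helper name get_prop is hypothetical, and it assumes each name and value element sits on its own line, as in the snippets in this guide.

```shell
# Print the <value> of a named property in a Hadoop-style *-site.xml file.
# Assumes every <name> and <value> element is on a single line of its own.
get_prop() {
  awk -v want="$2" '
    /<name>/  { n = $0; gsub(/.*<name>|<\/name>.*/, "", n) }
    /<value>/ { v = $0; gsub(/.*<value>|<\/value>.*/, "", v)
                if (n == want) print v }
  ' "$1"
}

# Usage on a machine configured as in this guide:
#   get_prop /usr/local/hadoop/etc/hadoop/core-site.xml fs.defaultFS
```

An empty result means the property is not set in that file, in which case Hadoop falls back to its built-in default.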

Starting Hadoop Services

Before starting the Hadoop cluster for the first time, format the namenode. Do this only once, because reformatting erases the existing HDFS metadata. Run the command:

hdfs namenode -format

To start the HDFS service:

start-dfs.sh

To start the YARN service:

start-yarn.sh

To verify that all services are running, run the command jps and visit the NameNode web interface at http://localhost:9870 (Hadoop 2.x releases use http://localhost:50070 instead). The resource manager interface can be found at http://localhost:8088.
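
On a single-node setup like this one, jps should list five Hadoop daemons: NameNode, DataNode, and SecondaryNameNode from start-dfs.sh, plus ResourceManager and NodeManager from start-yarn.sh. The sketch below checks jps output for those names; the helper name missing_daemons is hypothetical, and it reads from standard input so that it can be tried without a running cluster.

```shell
# Read jps output on stdin and print any expected daemon that is not listed.
# jps prints one "<pid> <ClassName>" pair per line.
missing_daemons() {
  running=$(awk '{ print $2 }')
  status=0
  for daemon in NameNode DataNode SecondaryNameNode ResourceManager NodeManager; do
    # -x matches whole lines, so NameNode does not match SecondaryNameNode.
    if ! printf '%s\n' "$running" | grep -qx "$daemon"; then
      echo "$daemon"
      status=1
    fi
  done
  return "$status"
}

# Usage on a machine with Hadoop running:
#   jps | missing_daemons && echo "All daemons are up"
```

If a daemon is reported missing, the log files under /usr/local/hadoop/logs are usually the first place to look.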

Conclusion

This guide gives you an initial feel for what Hadoop is and where it is used. As big data becomes more ubiquitous, the need for technologies like Hadoop, and for big data engineers, is growing quickly. To build on the knowledge acquired in this guide, you can explore other technologies in the ecosystem such as Apache Ambari, Hive, Pig, and YARN, among others.
