With the advent of big data and cloud computing, the need to efficiently store and process large datasets is constantly increasing. Using a single machine for storage and processing is not viable, given the cost, the sheer amount of data being generated, and the risk of failure with probable loss of valuable data.
Hadoop is mainly applied in areas where a large amount of data requires processing and storage. Because Hadoop scales horizontally, it can grow quickly by adding commodity nodes and process data in parallel across them.
Common business use cases and applications of this technology include:
Marketing Analysis and Targeting: Hadoop is deployed to capture and analyze clickstream and multimedia data generated by social media users, which can then be processed further for marketing purposes.
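The parallel-processing model behind Hadoop (MapReduce) can be illustrated with a short, Hadoop-free sketch in plain Python: input records are mapped to key/value pairs, the pairs are grouped (shuffled) by key, and each group is reduced to a result. On a real cluster, each phase runs in parallel across many nodes.

```python
from collections import defaultdict

# A Hadoop-free sketch of the MapReduce word-count pattern:
# each input line is mapped to (word, 1) pairs, pairs are
# grouped (shuffled) by key, and each group is reduced to a sum.

def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def shuffle_phase(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data needs big tools", "hadoop processes big data"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts["big"])  # each phase could run in parallel across nodes
```

This is only the programming model, not Hadoop code; the framework adds distribution, fault tolerance, and data locality on top of it.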
A prerequisite for this process is to have an OpenSSH server (openssh-server) and a Java Development Kit (JDK) installed on your machine.
This guide explores installing on Ubuntu. To install on Windows, follow this guide. Mac users can follow this guide.
Below are the steps to follow when installing Hadoop:
Create a hadoop group and a hadoop user belonging to it.
```shell
sudo addgroup hadoop
sudo adduser --ingroup hadoop hadoop
```
Switch user to the newly created hadoop user.
```shell
sudo su - hadoop
```
Set up and test passwordless SSH Keys since Hadoop requires SSH access for node management.
```shell
ssh-keygen -t rsa -P ""
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
ssh localhost
```
Download Hadoop from the official Apache site. Then navigate to your downloaded file, extract it, and move it to the appropriate directory.
```shell
tar -xzvf hadoop-3.3.0.tar.gz
sudo mv hadoop-3.3.0 /usr/local/hadoop
```
Grant ownership of the folder to the hadoop user you created.
```shell
sudo chown -R hadoop:hadoop /usr/local/hadoop
```
Note: The folder name hadoop-3.3.0 depends on the version you download; it is not fixed.
To configure your system for Hadoop, open the ~/.bashrc file in a text editor (for example, with nano ~/.bashrc) and add the following lines at the end. Once done, save and close the file.
```shell
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/native"
```
Apply the changes by running source ~/.bashrc. At this point, Hadoop is installed.

Next, edit the hadoop-env.sh file to set the JAVA_HOME variable.
To open the file, run the command:
```shell
sudo nano /usr/local/hadoop/etc/hadoop/hadoop-env.sh
```
Add the line:

```shell
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
```

This is typically the root directory of the Java installation.
Next, configure the core-site.xml file by adding the following lines:
```xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hdadmin/hdata</value>
  </property>
</configuration>
```
To open the core-site.xml file, run the command:
```shell
sudo nano /usr/local/hadoop/etc/hadoop/core-site.xml
```
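All of the *-site.xml files share the same property/name/value layout. As a quick sanity check that your edits are well-formed, such a file can be parsed with any XML library; below is a minimal sketch using only Python's standard library (not Hadoop's own configuration loader), applied to the core-site.xml fragment above:

```python
import xml.etree.ElementTree as ET

# A Hadoop-style configuration fragment, as written to core-site.xml.
# hdfs-site.xml, mapred-site.xml, and yarn-site.xml use the same layout.
CORE_SITE = """
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hdadmin/hdata</value>
  </property>
</configuration>
"""

def parse_hadoop_conf(xml_text):
    """Map each <property> to a name -> value dict entry."""
    root = ET.fromstring(xml_text)
    return {p.findtext("name"): p.findtext("value")
            for p in root.iter("property")}

props = parse_hadoop_conf(CORE_SITE)
print(props["fs.defaultFS"])
```

If the file is malformed (for example, a missing closing tag), the parse raises an error, which is exactly the kind of mistake that otherwise only surfaces when the daemons fail to start.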
Configure the hdfs-site.xml file by adding the following lines:
```xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```
To open the hdfs-site.xml file, run the command:
```shell
sudo nano /usr/local/hadoop/etc/hadoop/hdfs-site.xml
```
Configure the mapred-site.xml file by copying the existing template mapred-site.xml.template and adding the following lines:
```xml
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
```
To copy the template file into a new one and edit it, run the commands:
```shell
cp /usr/local/hadoop/etc/hadoop/mapred-site.xml.template /usr/local/hadoop/etc/hadoop/mapred-site.xml
sudo nano /usr/local/hadoop/etc/hadoop/mapred-site.xml
```
Configure the yarn-site.xml file by adding the following lines:
```xml
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>
```
To open the yarn-site.xml file, run the command:
```shell
sudo nano /usr/local/hadoop/etc/hadoop/yarn-site.xml
```
Before starting the Hadoop cluster for the first time, format the NameNode by running the command:
```shell
hdfs namenode -format
```
To start the HDFS service:

```shell
start-dfs.sh
```
To start the YARN service:

```shell
start-yarn.sh
```
To verify that all services are running, run the command jps and visit the Hadoop web interface at http://localhost:9870 (the default NameNode UI port for Hadoop 3.x; 2.x releases used port 50070).
The resource manager interface can be found at http://localhost:8088.
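Once the daemons are up, a small script can confirm that the web interfaces respond. Below is a sketch in Python; the ports assume default Hadoop 3.x settings, and the check simply reports unreachable endpoints rather than failing:

```python
from urllib.request import urlopen
from urllib.error import URLError

def is_up(url, timeout=2):
    """Return True if the URL answers an HTTP request, False otherwise."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.status < 500
    except (URLError, OSError):
        return False

# Default web UI ports for Hadoop 3.x: NameNode and ResourceManager.
for name, url in [("NameNode UI", "http://localhost:9870"),
                  ("ResourceManager UI", "http://localhost:8088")]:
    state = "reachable" if is_up(url) else "unreachable"
    print(f"{name}: {state}")
```

If either endpoint reports unreachable, check the output of jps to see which daemon failed to start, and consult its log under $HADOOP_HOME/logs.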
This guide gives you a good initial feel of what Hadoop is all about and its use cases. As big data becomes more and more ubiquitous, the need for technologies like Hadoop and for big data engineers is growing at a fast rate. To build on the knowledge acquired in this guide, you can further explore other technologies in the ecosystem such as Apache Ambari, Hive, Pig, and YARN, among others.