With the advent of big data and cloud computing, the need to efficiently store and process large datasets is constantly increasing. Using a single machine for storage and processing is not viable, given the cost, the sheer amount of data being generated, and the risk of failure with probable loss of valuable data.
Hadoop is mainly applied in areas where a large amount of data requires processing and storage. Because Hadoop scales horizontally, it can grow quickly by adding commodity nodes and process data in parallel across them.
Common business use cases and applications of this technology include:
Marketing Analysis and Targeting: Hadoop is deployed to capture and analyze clickstream and multimedia data generated by social media users, which can then be processed further for marketing purposes.
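The parallel-processing model behind Hadoop (MapReduce) can be illustrated with a short, Hadoop-free sketch in plain Python: input records are mapped to key/value pairs, the pairs are grouped (shuffled) by key, and each group is reduced to a result. On a real cluster, each phase runs in parallel across many nodes.

```python
from collections import defaultdict

# A Hadoop-free sketch of the MapReduce word-count pattern:
# each input line is mapped to (word, 1) pairs, pairs are
# grouped (shuffled) by key, and each group is reduced to a sum.

def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def shuffle_phase(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data needs big tools", "hadoop processes big data"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts["big"])  # each phase could run in parallel across nodes
```

This is only the programming model, not Hadoop code; the framework adds distribution, fault tolerance, and data locality on top of it.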
A prerequisite for this process is to have an OpenSSH server (openssh-server) and a Java Development Kit (JDK) installed on your machine.
This guide explores installing on Ubuntu. To install on Windows, follow this guide. Mac users can follow this guide.
Below are the steps to follow when installing Hadoop:
Create a hadoop group and a hadoop user belonging to it.
```shell
sudo addgroup hadoop
sudo adduser --ingroup hadoop hadoop
```
Switch user to the newly created hadoop user.
```shell
sudo su - hadoop
```
Set up and test passwordless SSH Keys since Hadoop requires SSH access for node management.
```shell
ssh-keygen -t rsa -P ""
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
ssh localhost
```
Download Hadoop from the official Apache site. Then navigate to your downloaded file, extract it, and move it to the appropriate directory.
```shell
tar -xzvf hadoop-3.3.0.tar.gz
sudo mv hadoop-3.3.0 /usr/local/hadoop
```
Grant ownership of the folder to the hadoop user you created.
```shell
sudo chown -R hadoop:hadoop /usr/local/hadoop
```
Note: The folder name hadoop-3.3.0 depends on the version you download; it is not fixed.
To configure your system for Hadoop, open the ~/.bashrc file in a text editor (for example, with nano ~/.bashrc) and add the following lines at the end. Once done, save and close the file.
```shell
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/native"
```
Apply the changes by running source ~/.bashrc. At this point, Hadoop is installed.

Next, edit the hadoop-env.sh file to set the JAVA_HOME variable.
To open the file, run the command:
```shell
sudo nano /usr/local/hadoop/etc/hadoop/hadoop-env.sh
```
Add the line:

```shell
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
```

This is typically the root directory of the Java installation.
Next, configure the core-site.xml file by adding the following lines:
```xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hdadmin/hdata</value>
  </property>
</configuration>
```
To open the core-site.xml file, run the command:
```shell
sudo nano /usr/local/hadoop/etc/hadoop/core-site.xml
```
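All of the *-site.xml files share the same property/name/value layout. As a quick sanity check that your edits are well-formed, such a file can be parsed with any XML library; below is a minimal sketch using only Python's standard library (not Hadoop's own configuration loader), applied to the core-site.xml fragment above:

```python
import xml.etree.ElementTree as ET

# A Hadoop-style configuration fragment, as written to core-site.xml.
# hdfs-site.xml, mapred-site.xml, and yarn-site.xml use the same layout.
CORE_SITE = """
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hdadmin/hdata</value>
  </property>
</configuration>
"""

def parse_hadoop_conf(xml_text):
    """Map each <property> to a name -> value dict entry."""
    root = ET.fromstring(xml_text)
    return {p.findtext("name"): p.findtext("value")
            for p in root.iter("property")}

props = parse_hadoop_conf(CORE_SITE)
print(props["fs.defaultFS"])
```

If the file is malformed (for example, a missing closing tag), the parse raises an error, which is exactly the kind of mistake that otherwise only surfaces when the daemons fail to start.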
Configure the hdfs-site.xml file by adding the following lines:
```xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```
To open the hdfs-site.xml file, run the command:
```shell
sudo nano /usr/local/hadoop/etc/hadoop/hdfs-site.xml
```
Configure the mapred-site.xml file by copying the existing template mapred-site.xml.template and adding the following lines:
```xml
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
```
To copy the template file into a new one and edit it, run the commands:
```shell
cp /usr/local/hadoop/etc/hadoop/mapred-site.xml.template /usr/local/hadoop/etc/hadoop/mapred-site.xml
sudo nano /usr/local/hadoop/etc/hadoop/mapred-site.xml
```
Configure the yarn-site.xml file by adding the following lines:
```xml
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>
```
To open the yarn-site.xml file, run the command:
```shell
sudo nano /usr/local/hadoop/etc/hadoop/yarn-site.xml
```
Before starting the Hadoop cluster for the first time, format the NameNode by running the command:
```shell
hdfs namenode -format
```
To start the HDFS service:

```shell
start-dfs.sh
```
To start the YARN service:

```shell
start-yarn.sh
```
To verify that all services are running, run the command jps and visit the Hadoop web interface at http://localhost:9870 (the default NameNode UI port for Hadoop 3.x; 2.x releases used port 50070).
The resource manager interface can be found at http://localhost:8088.
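Once the daemons are up, a small script can confirm that the web interfaces respond. Below is a sketch in Python; the ports assume default Hadoop 3.x settings, and the check simply reports unreachable endpoints rather than failing:

```python
from urllib.request import urlopen
from urllib.error import URLError

def is_up(url, timeout=2):
    """Return True if the URL answers an HTTP request, False otherwise."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.status < 500
    except (URLError, OSError):
        return False

# Default web UI ports for Hadoop 3.x: NameNode and ResourceManager.
for name, url in [("NameNode UI", "http://localhost:9870"),
                  ("ResourceManager UI", "http://localhost:8088")]:
    state = "reachable" if is_up(url) else "unreachable"
    print(f"{name}: {state}")
```

If either endpoint reports unreachable, check the output of jps to see which daemon failed to start, and consult its log under $HADOOP_HOME/logs.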
This guide gives you a good initial feel of what Hadoop is all about and its use cases. As big data becomes more and more ubiquitous, the need for technologies like Hadoop and for big data engineers is growing at a fast rate. To build on the knowledge acquired in this guide, you can further explore other technologies in the ecosystem such as Apache Ambari, Hive, Pig, and YARN, among others.