Distributed systems have gained popularity for their efficiency in processing big data. Several approaches exist for processing large volumes of data by leveraging a compute cluster. Hadoop uses a divide-and conquer-paradigm. It breaks big problems into smaller ones, distributes the tasks across nodes, and later aggregates individual results to form the final output.
The component that performs processing in Hadoop is MapReduce, a software framework designed to process large data in parallel on a distributed cluster.
MapReduce works by splitting the input data into smaller chunks and feeding them to processing components referred to as mappers. The mappers then pass on their processed output to reducers, which produce the final output.
The guide assumes that you have a basic understanding of the overall Hadoop architecture. An introductory guide to Hadoop can be found here.
To perform its functions, the architecture contains two main processing elements designed in a controller-operator fashion. These are:
1. Job Tracker: The controller. Responsible for job scheduling and issuing execution commands to the operator component that resides in each node. It is also responsible for re-executing failed tasks.
2. Task Tracker: The operator. Exists on each node and executes instructions and relays feedback back to the controller component.
MapReduce Architecture Source: A4academics website
As demonstrated above, the functioning of MapReduce occurs in two phases: the map phase and the reduce phase.
This is the first phase that occurs during the execution of a processing task in the Hadoop system. The framework understands data in the form of <key,value> pairs, and hence, the input data has to be pre-processed to match this expected format. The input dataset is broken down into smaller chunks via a method referred to as logical splitting.
The logical chunks are then assigned to available mappers, which process each input record into <key,value> pairs. The output of this phase is considered intermediate output.
The output can undergo an intermediate process where the mapper output data is further processed before being fed into the reducers. The processes include combining, sorting, partitioning, and shuffling. More about the significance of these processes can be found here. The intermediate data from these processes are stored in a local file system within the respective processing node.
The number of reducers for each job is configurable and can be set within the
mapred-site.xml configuration file.
The reducer phase of processing takes the mapper phase output and processes the data to generate the final output, which is recorded in an output file within the Hadoop Distributed File System (HDFS) by a function referred to as record writer.
Consider the example of a word count task. A document with several words is submitted, and the MapReduce framework is required to produce a word count list for all the available words. The diagram below gives a visual explanation of how the task is processed. Source: A4Academics Tutorial
You have now gained knowledge of the MapReduce programming paradigm and how it is implemented in distributed computing to facilitating distributed parallel processing of large datasets. The main approach is to divide and conquer. This skill is vital for any developer holding the role of big data engineer or distributed compute architect in their organization.
To further build on this guide, study more about distributed and parallel computing, especially how resources are negotiated and allocated in a distributed cluster to handle processing and storage. In Hadoop, resource negotiation and allocation are performed by YARN. An introductory guide to Hadoop's YARN can be found here.