Kimaru Thagana

Introduction to Hadoop YARN

  • Sep 17, 2020
  • 4 Min read
  • 114 Views
Data
Hadoop
Big Data
Data Analytics

Introduction

In big data processing, as in distributed processing generally, there is a critical need to manage resources within the compute cluster. The component that manages the resources must do so efficiently and independently. In the first version of Hadoop, resource management was assigned to MapReduce. In this setup, a single component, the Job Tracker, allocated tasks to subordinate processes called Task Trackers in a controller-operator fashion. These tasks were mainly map and reduce tasks. The single controller was a bottleneck: it limited how many nodes could be added to the compute cluster.

This led to the birth of Hadoop YARN, a component whose main aim is to take up the resource management tasks from MapReduce, allow MapReduce to stick to processing, and split resource management into job scheduling, resource negotiations, and allocations. Decoupling from MapReduce gave Hadoop a large advantage since it could now run jobs that were not within the MapReduce paradigm. These include graph processing, batch processing, stream processing, and interactive processing.

This guide explores YARN (Yet Another Resource Negotiator), its architecture, and how it achieves its purpose. The guide assumes that you are familiar with the general Hadoop architecture and have a basic understanding of its components. An introductory guide to Hadoop can be found here.

Resource Utilization in a Distributed System

In a distributed system, resources, mainly compute power and storage, are usually located and accessed remotely. This means there is a need for a central head to coordinate how the remote resources are managed. There is also a need for a corresponding resource utilization component on each node to communicate with the controller on matters of resource management and scheduling. This is known as a controller-operator architecture. The controller is the central component that issues directives on resource management; the operator nodes are the subsidiary components that receive instructions from the controller and report back. This architecture lets a distributed system scale as long as the controller can efficiently handle all the operator nodes.
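The controller-operator exchange described above can be sketched in a few lines of Python. This is a toy model rather than anything taken from Hadoop: the class names `Operator` and `Controller`, the `heartbeat` report format, and the "most free memory" placement rule are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Operator:
    """A worker node that reports its free capacity (hypothetical model)."""
    name: str
    free_memory_mb: int

    def heartbeat(self) -> dict:
        # Feedback sent to the controller: who I am and what I have free.
        return {"node": self.name, "free_memory_mb": self.free_memory_mb}


@dataclass
class Controller:
    """The central component that tracks operators and places work."""
    # node name -> free memory according to the last report, minus placements
    view: dict = field(default_factory=dict)

    def receive(self, report: dict) -> None:
        self.view[report["node"]] = report["free_memory_mb"]

    def place(self, memory_mb: int) -> Optional[str]:
        # Assumed policy: choose the node with the most free memory that fits.
        candidates = [(free, node) for node, free in self.view.items()
                      if free >= memory_mb]
        if not candidates:
            return None  # no operator can take the task right now
        free, node = max(candidates)
        self.view[node] = free - memory_mb
        return node


controller = Controller()
for op in (Operator("node-a", 4096), Operator("node-b", 8192)):
    controller.receive(op.heartbeat())

print(controller.place(6000))  # node-b
print(controller.place(6000))  # None
```

The second request fails because the controller has already committed most of node-b's memory. Every placement decision passes through the one controller, which is why its capacity bounds how far the cluster can scale.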

Hadoop YARN

This component is considered the "brain" of the Hadoop architecture. Apart from resource management and allocation, it also performs job scheduling. As the architecture diagram shows, YARN follows a controller-operator paradigm.

[Figure: Hadoop YARN Architecture. Source: Hadoop Official Site]

Components

  • Resource Manager: The controller. Manages resource allocation within the compute cluster.

  • Node Manager: The operator. Responsible for executing commands from the Resource Manager. One runs on each data node within a Hadoop cluster.

  • Application Master: One per application. Responsible for managing that application's jobs or tasks, negotiating resources with the Resource Manager, and monitoring the status of the tasks it runs.

  • Container: A collection of resources such as CPU, RAM, and storage that are provided by a single node.
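To make the division of labor between the four components concrete, here is a minimal Python sketch. It is a simulation under stated assumptions, not the Hadoop API: the method names `reserve`, `allocate`, and `request` are hypothetical, containers track only memory and vcores, and the first-fit placement rule stands in for YARN's real schedulers. It also collapses the protocol into direct method calls.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Container:
    """A slice of a single node's resources (toy model: memory and vcores)."""
    node: str
    memory_mb: int
    vcores: int


@dataclass
class NodeManager:
    """Per-node operator: tracks what its node still has free."""
    name: str
    memory_mb: int
    vcores: int

    def reserve(self, memory_mb: int, vcores: int) -> Container:
        # Carve the container out of this node's remaining capacity.
        self.memory_mb -= memory_mb
        self.vcores -= vcores
        return Container(self.name, memory_mb, vcores)


@dataclass
class ResourceManager:
    """Cluster-wide controller: decides which node serves each request."""
    nodes: List[NodeManager]

    def allocate(self, memory_mb: int, vcores: int) -> Optional[Container]:
        for nm in self.nodes:  # first fit, an assumption for illustration
            if nm.memory_mb >= memory_mb and nm.vcores >= vcores:
                return nm.reserve(memory_mb, vcores)
        return None  # nothing fits right now


@dataclass
class ApplicationMaster:
    """Per-application negotiator: asks the ResourceManager for containers."""
    rm: ResourceManager
    containers: List[Container] = field(default_factory=list)

    def request(self, count: int, memory_mb: int, vcores: int) -> int:
        # Ask for up to `count` containers; return the total now held.
        for _ in range(count):
            container = self.rm.allocate(memory_mb, vcores)
            if container is None:
                break
            self.containers.append(container)
        return len(self.containers)


rm = ResourceManager([NodeManager("node-a", 4096, 2),
                      NodeManager("node-b", 8192, 4)])
am = ApplicationMaster(rm)
granted = am.request(count=4, memory_mb=2048, vcores=1)
print(granted, [c.node for c in am.containers])
```

In this sketch the Application Master never touches node capacity directly. It only negotiates with the Resource Manager, and each container it receives belongs to exactly one node, matching the roles listed above.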

Conclusion

You should now understand the inner workings of the resource negotiator in Hadoop's distributed system, and have a better understanding of how several nodes in a distributed architecture can be managed.

From here, you can further explore research on resource negotiation, conflict resolution, and failure tolerance in distributed systems other than Hadoop. To build on the knowledge acquired in this guide, you can also explore other technologies in the ecosystem such as Apache Ambari, Hive, and Pig, among others.