Hadoop 2 Architecture – Key Design Concepts, YARN, Summary

Hadoop 2 came up to overcome the limitations of Hadoop 1.x. Hadoop 2 architecture overcomes previous limitations and meets the current data processing requirements.

Hadoop 2 Architecture – Key Design Concepts

  • Split up the two major functions of job tracker
  • Cluster resource management
  • Application life-cycle management
  • MapReduce becomes user library or one of the applications residing in Hadoop.

Key concepts to understand before getting into Hadoop 2 Architecture details.

Application

The application is the job submitted to the framework. For example – MapReduce Jobs

Container

A container is a basic unit of allocation which represents a physical resource such as Container A = 2GB, 1CPU.

The node manager provides containers to an application. Each mapper and reducer runs in its own container to be accurate. The AppMaster allocates these containers for the mappers and reducers tasks.

Therefore, a container is supervised by node manager and scheduled by a resource manager.

Resource Manager

It is global resource scheduler which tracks resource usages and node health. The resource manager has two main components: scheduler and application manager.

The responsibility of scheduler is to allocate resources to various running applications.

It is important to note that resource manager doesn’t facilitate any monitoring of the applications.

Application Manager is responsible for negotiating the first container for executing the application specific Application Master and provide the service for restarting the containers on failure.

Node Manager

Node Manager manages the life cycle of the container. It also monitors the health of a node.

Application Master

It manages application scheduling and task execution. It interacts with the scheduler to acquire required resources and it interacts with node manager to execute and monitor the assigned tasks.

YARN (Yet Another Resource Negotiator)

YARN is the key component of Hadoop 2 Architecture. HDFS is the underlying file system.

It is a framework to develop and/or execute distributed processing applications. For Example MapReduce, Spark, Apache Giraph etc.

Please go to post YARN Architecture to understand Hadoop 2 architecture in detail.

Hadoop 2 Summary

Apache Hadoop 2.0 made a generational shift in architecture with YARN being integrated to whole Hadoop eco-system.

It allows multiple applications to run on the same platform. YARN is not only the major feature on Hadoop 2.0. HDFS has undergone major enhancement in terms of high availability (HA), snapshot and federation.

Name node high availability facilitates automated failover with a hot standby and resiliency for name node master service.

In high availability (HA) cluster, two separate machines are configured as Namenode. One Name node will be in the active state while as other will be in standby state.

All client operation is done by active node while as other acts as the slave node. Slave node is always in sync with the active node by having access to the directory on a shared storage device.

Snapshots create the recovery for backups at a given point of time which we call it the snapshot at that point.

To improve scalability and isolation, it does federation by creating multiple namespaces.