Hadoop 1 Architecture – Step by Step Description, Limitations

In this post, we will learn Hadoop 1 Architecture and step by step description of the architecture. Hadoop 1 Architecture had some limitations which have been addressed in Hadoop 2.x.

Hadoop 1 Architecture
[Figure: Hadoop 1.x Architecture]

Hadoop 1 Architecture Description

  • One or more HDFS clients submit a job to the Hadoop system.
  • When the Hadoop system receives a client request, it first contacts the Master Node. The Name Node and Secondary Name Node on the Master Node, together with the Data Nodes on the slave nodes, form the HDFS layer, while the Job Tracker on the Master Node provides the MapReduce layer.
  • Once you write a MapReduce Java program, say using the Eclipse IDE, you package it as a runnable jar file (a minimal example is sketched after this list). The Job Tracker then receives this runnable jar.
  • Now, the Job Tracker needs to know on which commodity machines the data blocks reside. The Name Node provides this information, i.e., the IP addresses of the commodity machines where the data blocks are located.
  • The MapReduce component on each slave node, i.e., the Task Tracker, receives the runnable jar from the Job Tracker and performs the assigned map and reduce tasks.
  • The Task Tracker creates a JVM (Java Virtual Machine) to execute the runnable jar. The program first runs the mapper routine, which needs key/value pairs; the Task Tracker fetches these by internally accessing the data blocks residing on the slave nodes.
  • The mapper routine writes its result set to the context as intermediate output (the HDFS blocks the mappers read are 64 MB each by default).
  • The Task Tracker creates another JVM in which the reducer routine runs. The reducer takes the mapper output as its input; after the shuffle and sort phases, it reduces the data and finally produces the summarized information as output.
  • Once all Task Trackers finish their tasks, the Job Tracker collects those results and combines them into the final result set.
  • The Hadoop client then receives the final result.
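
The flow above is easier to follow next to a concrete job. Below is a minimal WordCount-style sketch using the Hadoop 1.x Java API; the class names, job name, and input/output paths are illustrative placeholders, not taken from any particular cluster. The mapper and reducer classes correspond to the routines the Task Trackers run in their JVMs, and the driver's main method is what gets packaged into the runnable jar that the Job Tracker receives.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper routine: receives key/value pairs (byte offset, line of text)
  // read from the data blocks and writes intermediate (word, 1) pairs
  // to the context.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer routine: after shuffle and sort, sums the counts for each word
  // and emits the summarized result.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // Driver: packaged into the runnable jar; in Hadoop 1.x the Job Tracker
  // schedules the resulting map and reduce tasks on the Task Trackers.
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count"); // Job.getInstance(conf) in later APIs
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /user/demo/input
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. /user/demo/output
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

With example paths, such a jar would be submitted as `hadoop jar wordcount.jar WordCount /user/demo/input /user/demo/output`; the map tasks then run next to the blocks holding the input, and the reduce tasks write the summarized output back to HDFS.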

Hadoop 1.x Limitations

  • Hadoop 1.x supports only MapReduce-based batch data-processing applications.
  • It does not support Real-time data processing.
  • Only one Name Node and one namespace can be configured per cluster, i.e., it does not support a federated architecture (see the sketch after this list). The entire Hadoop cluster goes down if the Name Node fails.
  • The Job Tracker is a single point of failure, and it has to perform multiple activities such as resource management, job scheduling, and job monitoring.
  • Hadoop 1.x offers limited horizontal scalability.
  • A cluster can support a maximum of about 4,000 nodes and 40,000 concurrent tasks.
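
The single-namespace limitation shows up directly in client code and configuration: every HDFS client in a Hadoop 1.x cluster resolves the filesystem through a single fs.default.name entry, which points at the lone Name Node. The sketch below is illustrative only; the host name and port are hypothetical, and this value would normally be set once in core-site.xml rather than in code.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Illustrative sketch: in Hadoop 1.x there is exactly one fs.default.name
// (one Name Node, one namespace) per cluster, so this single entry is also
// the cluster's single point of failure for HDFS.
public class SingleNameNodeDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical Name Node address; if this one node is down,
    // no client in the cluster can reach HDFS.
    conf.set("fs.default.name", "hdfs://namenode.example.com:9000");
    FileSystem fs = FileSystem.get(conf);
    System.out.println("Default filesystem: " + fs.getUri());
    System.out.println("/user exists: " + fs.exists(new Path("/user")));
  }
}
```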