YARN Architecture – Yet Another Resource Negotiator, Hadoop 2

In this post, we will learn about YARN architecture. YARN (Yet Another Resource Negotiator) is the key component of Hadoop 2.x.

The underlying file system continues to be HDFS. YARN is basically a framework to develop and/or execute distributed processing applications, for example MapReduce, Spark, and Apache Giraph.

Let us look at one of the scenarios to understand the YARN architecture better.

 

YARN Architecture

Suppose we have two client requests. One wants to run a MapReduce job while the other wants to execute a shell script.

The MapReduce job is represented in blue, while the shell script job is represented in green.

The Resource Manager has two main components: the Application Manager and the Scheduler. The Scheduler is responsible for allocating resources to the various running applications. It is a pure scheduler in the sense that it performs no monitoring or tracking of application status.

The Scheduler also offers no guarantees about restarting tasks that fail due to hardware or application failures. It performs its scheduling function based on the resource requirements of the applications. It does so using the abstract notion of a resource container, which incorporates elements such as memory, CPU, disk, and network.

The Application Manager is responsible for accepting job submissions, negotiating the first container for executing the application-specific Application Master, and providing the service for restarting the Application Master container on failure.

The Node Manager is the per-machine framework agent responsible for containers, monitoring their resource usage (CPU, memory, disk, network) and reporting it to the Resource Manager/Scheduler.

The Application Master is responsible for negotiating appropriate resource containers from the Scheduler, tracking their status and monitoring their progress.

The green job in the diagram has its own Application Master, and the blue job has its own Application Master. Each Application Master manages the containers of its application.
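
To make the client side of this flow concrete, here is a minimal, hedged sketch (not from the original article) of how a client could submit an application to the Resource Manager through the Java YarnClient API. The application name, the launch command, and the container sizes below are illustrative assumptions, not values from this post.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
    import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.client.api.YarnClientApplication;
    import org.apache.hadoop.yarn.util.Records;

    import java.util.Collections;

    public class SubmitToYarn {
      public static void main(String[] args) throws Exception {
        // Connect to the Resource Manager configured in yarn-site.xml
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new Configuration());
        yarnClient.start();

        // Ask the Resource Manager (Application Manager component) for a new application
        YarnClientApplication app = yarnClient.createApplication();
        ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
        appContext.setApplicationName("demo-shell-job"); // illustrative name

        // Describe the first container, which will run the Application Master
        ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
        amContainer.setCommands(Collections.singletonList("/bin/date")); // placeholder command

        // A resource container is expressed as memory plus vCPUs (values are illustrative)
        appContext.setResource(Resource.newInstance(512, 1));
        appContext.setAMContainerSpec(amContainer);

        // Hand the application over to the scheduler via the Resource Manager
        yarnClient.submitApplication(appContext);
        System.out.println("Submitted application " + appContext.getApplicationId());
      }
    }

In a real job, the Application Master command would launch a framework-specific master (for example the MapReduce Application Master) rather than the placeholder shown here.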

YARN architecture - another view

Another view of the YARN architecture is one where the Resource Manager handles the job queue, resource allocation, and job scheduling.

It allocates resources against the list of available resources. Each slave node runs the Application Master for its job, which handles the task queue and the job lifecycle logic.
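
On the Application Master side, the container negotiation described above can be sketched with the AMRMClient API. This is a simplified illustration (not from the original article): the memory, vCPU, and priority values are assumed, and in practice this code runs inside the Application Master container that YARN itself launches.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
    import org.apache.hadoop.yarn.api.records.Container;
    import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
    import org.apache.hadoop.yarn.api.records.Priority;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.client.api.AMRMClient;
    import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

    public class SimpleAppMaster {
      public static void main(String[] args) throws Exception {
        AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
        rmClient.init(new Configuration());
        rmClient.start();

        // Register this Application Master with the Resource Manager
        rmClient.registerApplicationMaster("", 0, "");

        // Negotiate one container: 1 GB of memory and 1 vCPU (illustrative sizes)
        Resource capability = Resource.newInstance(1024, 1);
        ContainerRequest request =
            new ContainerRequest(capability, null, null, Priority.newInstance(0));
        rmClient.addContainerRequest(request);

        // Heartbeat to the scheduler and collect any containers it has granted
        AllocateResponse response = rmClient.allocate(0.0f);
        for (Container container : response.getAllocatedContainers()) {
          System.out.println("Granted container " + container.getId()
              + " on " + container.getNodeId());
        }

        // Tell the Resource Manager this application is finished
        rmClient.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "", "");
      }
    }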

Hadoop Commands – HDFS dfs commands, Hadoop Linux commands

The full list of Hadoop commands is much bigger than the one demonstrated here; however, we have explained some of the most useful Hadoop commands below.

“hadoop fs” lists all the Hadoop commands that can be run in FsShell.

“hadoop fs -help <command>” will display help for that command, where <command> is the actual name of the command.

Hadoop Commands and HDFS Commands

All HDFS commands are invoked by the “bin/hdfs” script. If we run the hdfs script without any arguments, it prints the description of all commands.

Command | Usage | Description
classpath | hdfs classpath | Prints the class path needed to get the Hadoop jar and the required libraries.
ls | hadoop fs -ls / | List the contents of the root directory in HDFS.
version | hadoop version | Print the Hadoop version.
df | hadoop fs -df hdfs:/ | Show the amount of space used and available on the currently mounted filesystem.
balancer | hadoop balancer | Run a cluster balancing utility.
mkdir | hadoop fs -mkdir /usr/training/hadoop_files | Create a new directory hadoop_files below the /usr/training directory in HDFS.
put | hadoop fs -put /data/myfile.txt /usr/training/hadoop_files | Add a sample text file from the local Unix directory (/data/myfile.txt) to the HDFS directory /usr/training/hadoop_files.
ls | hadoop fs -ls /usr/training/hadoop_files | List the contents of this new directory in HDFS.
put | hadoop fs -put /data/finance /usr/training/hadoop_files | Add the entire local Unix directory /data/finance to the HDFS directory /usr/training/hadoop_files.
du | hadoop fs -du -s -h hadoop_files/finance | See how much space a given directory occupies in HDFS.
rm | hadoop fs -rm hadoop_files/finance/myfile.txt | Delete the file “myfile.txt” from the “finance” directory.
rm | hadoop fs -rm hadoop_files/finance/* | Delete all files from the “finance” directory using a wildcard.
expunge | hadoop fs -expunge | Empty the trash.
cat | hadoop fs -cat hadoop_files/myfile.txt | See the content of “myfile.txt” present in the hadoop_files directory.
copyToLocal | hadoop fs -copyToLocal hadoop_files/myfile.txt /scratch/data | Copy myfile.txt from the HDFS directory “hadoop_files” to the local directory /scratch/data.
get | hadoop fs -get hadoop_files/myfile.txt /scratch/data | The get command can be used as an alternative to “copyToLocal”.
chmod | sudo -u hdfs hadoop fs -chmod 600 hadoop_files/myfile.txt | Use “-chmod” to change the permissions of a file. Default file permissions are 666 in HDFS.
mv | hadoop fs -mv hadoop_files apache_hadoop | Move or rename a directory from one location to another.
dfsadmin | hdfs dfsadmin -safemode leave | Make the NameNode leave safe mode.
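
The same operations can also be performed programmatically. The following is a small, hedged sketch (not part of the original table) that uses the Hadoop FileSystem Java API; the paths simply mirror the /usr/training/hadoop_files examples above.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsShellEquivalents {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        // hadoop fs -mkdir /usr/training/hadoop_files
        fs.mkdirs(new Path("/usr/training/hadoop_files"));

        // hadoop fs -put /data/myfile.txt /usr/training/hadoop_files
        fs.copyFromLocalFile(new Path("/data/myfile.txt"),
            new Path("/usr/training/hadoop_files"));

        // hadoop fs -ls /usr/training/hadoop_files
        for (FileStatus status : fs.listStatus(new Path("/usr/training/hadoop_files"))) {
          System.out.println(status.getPath() + "\t" + status.getLen());
        }

        // hadoop fs -rm /usr/training/hadoop_files/myfile.txt
        fs.delete(new Path("/usr/training/hadoop_files/myfile.txt"), false);

        fs.close();
      }
    }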

Hadoop fs commands – HDFS dfs commands

Command | Usage | Description
fs | hadoop fs | List all the Hadoop file system shell commands.
help | hadoop fs -help | Display help for any command.
touchz | hdfs dfs -touchz /hadoop_files/myfile.txt | Create a file in HDFS with a file size of 0 bytes.
rmr | hdfs dfs -rmr /hadoop_files/ | Remove the directory from HDFS recursively.
count | hdfs dfs -count /user | Count the number of directories, files, and bytes under the paths that match the specified file pattern.

Hadoop Linux commands

Command | Example | Description
ls | ls -l; ls -a; ls -l /etc | Lists files in the current directory. Run without any additional parameters, ls lists the contents of the current directory in short form; -l gives a detailed list and -a also displays hidden files.
cp | cp [option(s)] <sourcefile> <targetfile>; cp file1 new-file2; cp -r dir1 dir2 | Copies sourcefile to targetfile. -i waits for confirmation, if necessary, before an existing targetfile is overwritten; -r copies recursively (includes subdirectories).
mv | mv file_1.txt /scratch/kmak | Moves or renames files: it copies the sourcefile to the targetfile and then deletes the original sourcefile.
rm | rm myfile.txt; rm -r mydirectory | Removes the specified files from the file system. Directories are not removed by rm unless the -r option is used.
ln | ln file1.txt file2.txt | Creates links between files.
cd | cd /scratch/kmak/bi | Changes the shell’s current working directory.
pwd | pwd | Prints the working directory, i.e. writes the full pathname of the current working directory to standard output.
mkdir | mkdir <mydir> | Creates directories on a file system.
rmdir | rmdir <emptydir> | Deletes the specified directory, provided it is already empty.
nl | nl myfile.txt | Numbers the lines in a file.
gedit | gedit myfile.txt | Opens the file in the gedit text editor.
stat | stat myfile.txt | Displays detailed status information about a file (or, with -f, about a file system).
wc | wc myfile.txt; wc -l myfile.txt; wc -c myfile.txt | Reports the newline, word, byte, and character counts for the specified files.
chown | chown chope file.txt; chown -R chope /scratch/work | Changes the owner (and optionally the owning group) of files.
chgrp | chgrp oracle myfile.txt | Changes the group ownership of a file or files.
ifconfig | ifconfig | Views and changes the configuration of the network interfaces on your system.

Hive Introduction – Benefits and Limitations, Principles

In this post, we will cover an introduction to Hive, its benefits and limitations, and its key principles.

Hive Introduction – Benefits and Limitations

Hive is a data warehouse tool developed on top of Hadoop to process structured data. It is basically a wrapper written on top of the MapReduce programming layer that makes querying and analysis easy.

It facilitates analysis of large data sets, ad-hoc queries, and easy data summarization through a query language named HQL (Hive Query Language) for data residing on HDFS.

Because of its SQL-like language, Hive is a popular choice for Hadoop analytics. Hive’s SQL dialect gives users multiple places to integrate their own functionality for custom analysis, such as User Defined Functions (UDFs).
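
As an illustration of this extension point, here is a minimal, hedged sketch (not from the original post) of a Hive UDF written against the classic org.apache.hadoop.hive.ql.exec.UDF API; the class name is an assumption.

    import org.apache.hadoop.hive.ql.exec.UDF;
    import org.apache.hadoop.io.Text;

    // A tiny UDF that lower-cases a string column.
    // After packaging it into a jar, it would typically be registered with
    // ADD JAR and CREATE TEMPORARY FUNCTION before being used in a query.
    public final class LowerCaseUdf extends UDF {
      public Text evaluate(final Text input) {
        if (input == null) {
          return null;
        }
        return new Text(input.toString().toLowerCase());
      }
    }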

It provides massive scale-out and fault-tolerance capabilities for data storage and processing on commodity hardware.

Hive was originally developed by Facebook in 2007 to handle massive volumes of data, and later the Apache Software Foundation took it up and developed it further as an open source under the name Apache Hive.

It is nowadays used by many companies. For example, Amazon uses it in Elastic MapReduce.

It is important to note that Hive is not a relational database and does not support low-level insert, update, or delete operations.

It is not used for real-time data processing and is not designed for online transaction processing. Instead, it is best suited for traditional data warehousing.

Hive relies on MapReduce for execution and on HDFS for storage and retrieval of data; as a result, it is batch-oriented and has high latency for query execution.

Principles of Hive

  1. Hive commands are similar to SQL commands, which makes Hive easy to pick up for anyone familiar with relational databases and data warehousing tools.
  2. It is an extensible framework which supports different file and data formats.
  3. We can easily plug in MapReduce code in the language of our choice using user-defined functions.
  4. Performance is good because the Hive engine applies built-in optimizations to reduce execution time while enabling high throughput.

Hive Components – Metastore, UI, Driver, Compiler and Execution Engine

Some of the key Hive components that we are going to learn in this post are UI, Driver, Compiler, Metastore, and Execution engine. Let us understand these Hive components one by one in detail below.

Apache Hive components

Hive User Interfaces (UI)

The user interface is where users submit queries and other operations to the system. Hive mainly provides three ways to communicate with the Hive driver.

  • CLI (Command Line Interface)

This is the most common way of interacting with Hive, where we use the Linux terminal to issue queries directly to the Hive driver.

  • HWI (Hive Web Interface)

It is an alternative to the CLI where we use the web browser to interact with Hive.

  • JDBC/ODBC/Thrift Server

This allows remote clients to submit requests to Hive and retrieve results. The HIVE_PORT environment variable needs to be set to an available port number for the server to listen on.

It is important to note that the CLI is a fat client which requires a local copy of all the Hive components and configurations as well as the Hadoop client and its configuration.
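
For the JDBC route, a minimal sketch could look like the following in Java. This is not from the original post: it assumes a HiveServer2 endpoint, and the host, port (10000 is the usual HiveServer2 default), database, and credentials are placeholders.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveJdbcClient {
      public static void main(String[] args) throws Exception {
        // jdbc:hive2://<host>:<port>/<database> -- the values below are placeholders
        String url = "jdbc:hive2://localhost:10000/default";

        try (Connection connection = DriverManager.getConnection(url, "hiveuser", "");
             Statement statement = connection.createStatement();
             ResultSet results = statement.executeQuery("SHOW TABLES")) {
          while (results.next()) {
            System.out.println(results.getString(1));
          }
        }
      }
    }

The Hive JDBC driver jar must be on the client classpath for DriverManager to resolve the jdbc:hive2 URL.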

Hive Driver

This component receives the queries from the user interfaces (UI) and provides execute and fetch APIs modeled on JDBC/ODBC drivers.

Hive Compiler

This component parses the query, performs semantic analysis on the different query blocks, and finally generates the execution plan.

This is done with the help of table and partition metadata that is looked up in the Metastore.

Hive Metastore

A Metastore is a component that stores the system catalog and metadata about tables, columns, partitions and so on.

For example, a CREATE TABLE definition is stored here. The Metastore uses a relational database to store its metadata.

Apache Hive uses the Derby database by default. However, Derby has limitations, such as a lack of multi-user access.

Any JDBC-compliant database, such as MySQL or Oracle, can be used for the Metastore. The key attributes that should be configured for the Hive Metastore are the connection URL, driver class, user name, and password (the javax.jdo.option.* properties).

HIVE Components
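
These connection settings normally live in hive-site.xml. The following hedged sketch (not from the original post) shows the same javax.jdo.option.* properties being set programmatically through HiveConf; the MySQL host, database name, and credentials are hypothetical.

    import org.apache.hadoop.hive.conf.HiveConf;

    public class MetastoreConfigSketch {
      public static void main(String[] args) {
        HiveConf conf = new HiveConf();

        // JDBC connection string for the Metastore database (hypothetical MySQL host)
        conf.set("javax.jdo.option.ConnectionURL",
            "jdbc:mysql://dbhost:3306/metastore?createDatabaseIfNotExist=true");
        // JDBC driver class for that database
        conf.set("javax.jdo.option.ConnectionDriverName", "com.mysql.jdbc.Driver");
        // Credentials used by Hive to reach the Metastore database (placeholders)
        conf.set("javax.jdo.option.ConnectionUserName", "hiveuser");
        conf.set("javax.jdo.option.ConnectionPassword", "hivepassword");

        System.out.println("Metastore URL: " + conf.get("javax.jdo.option.ConnectionURL"));
      }
    }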

Hive Execution Engine

This component is responsible for executing the execution plan created by the compiler.

The Hive Execution Engine is the bridge between the HiveQL processing layer and MapReduce. It processes the query and generates results in the same way as MapReduce does, essentially running the plan as MapReduce jobs.

HIVE Architecture – Hadoop, HIVE Query Flow

The diagram below represents the Hadoop Hive architecture and the flow of a typical query through the Hive system.

HIVE Architecture

The UI calls the execute-query interface of the Driver. The Driver creates a session handle for the query and sends the query to the Compiler to generate an execution plan.

The Compiler needs the metadata, so it sends a “getMetaData” request to the Metastore and receives the metadata back in the “sendMetaData” response.

This metadata is used to typecheck the query expressions and to prune partitions based on the query predicates.

The plan generated by the compiler is a sequence of steps where each step is either a MapReduce job, a metadata operation or an operation on HDFS.

The execution engine submits these stages to the appropriate components (steps 6, 6.1, 6.2 and 6.3). Once the output is generated, it is written to a temporary HDFS file through the serializer.

The contents of the file are read by the execution engine directly from HDFS and displayed to the UI clients.