NoSQL Column Family Database – Cloud BigTable, NoSQL Database

A NoSQL column family database is another aggregate-oriented database. It has a single key, also known as the row key, and under that key we can store multiple column families, where each column family is a group of columns that belong together.

A column family as a whole is effectively your aggregate. We use the row key and the column family name to address a column family.
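As a minimal sketch, assuming an HBase table named orders with a column family lineitems and a row key order-1001 (all illustrative names), fetching a whole column family by row key and family name with the HBase Java client might look like this:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ReadOrderAggregate {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("orders"))) {
            // Address the aggregate by row key and column family name
            Get get = new Get(Bytes.toBytes("order-1001"));
            get.addFamily(Bytes.toBytes("lineitems")); // fetch the whole column family in one go
            Result result = table.get(get);
            for (Cell cell : result.rawCells()) {
                System.out.println(Bytes.toString(CellUtil.cloneQualifier(cell))
                        + " = " + Bytes.toString(CellUtil.cloneValue(cell)));
            }
        }
    }
}
```

Because the entire aggregate lives under one row key, this read is served by the single region server that hosts that row rather than by many nodes.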

It is, however, one of the more complex aggregate-oriented databases, but the gain is in the retrieval time of aggregate rows. When we load these aggregates into memory, instead of being spread across a lot of individual records, the whole thing is stored and read from the database in one go.

The database is designed in such a way that it clearly knows what the aggregate boundaries are. This is very useful when we run the database on a cluster.

Since the aggregate binds the related data together, different aggregates can be spread across different nodes in the cluster.

Therefore, if somebody wants to retrieve the data about, say, a particular order, they need to go to only one node in the cluster instead of querying all the other nodes to pick up individual rows and assemble them.

Among the most popular column family NoSQL databases are Apache HBase and Cassandra.

Column Family Database

Aggregate orientation is not always a good thing.

Let us consider a user who needs the revenue details by product and does not care about the revenue by order.

Effectively, they want to change the aggregate structure from an order aggregating line items to a product aggregating line items. Therefore, the product becomes the root of the aggregate.

In a relational database, this is straightforward: we query a few tables, join them, and the result is there on your screen. But when it comes to an aggregate-oriented database, it is a pain.

We have to run separate MapReduce jobs to rearrange the data into the different aggregate form and keep applying incremental updates to the aggregated data in order to serve this business requirement, which is complicated.

Therefore, an aggregate-oriented database has an advantage if most of the time you push the same aggregate back and forth into the system. It is a disadvantage if you want to slice and dice the data in different ways.
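This re-aggregation is easier to see with a concrete job. As a minimal sketch, assuming the order line items have already been exported as plain text lines of the form orderId,productId,revenue (an illustrative layout, not something the database gives you for free), a MapReduce job that re-keys every line item by product and sums the revenue could look like this:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RevenueByProduct {

    // Mapper: each input line is assumed to be "orderId,productId,revenue"
    public static class LineItemMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");
            if (fields.length == 3) {
                // Re-key each line item by product instead of by order
                context.write(new Text(fields[1]), new DoubleWritable(Double.parseDouble(fields[2])));
            }
        }
    }

    // Reducer: sum the revenue for each product
    public static class RevenueReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        @Override
        protected void reduce(Text product, Iterable<DoubleWritable> revenues, Context context)
                throws IOException, InterruptedException {
            double total = 0;
            for (DoubleWritable r : revenues) {
                total += r.get();
            }
            context.write(product, new DoubleWritable(total));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "revenue by product");
        job.setJarByClass(RevenueByProduct.class);
        job.setMapperClass(LineItemMapper.class);
        job.setReducerClass(RevenueReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The output is the product-rooted aggregate (product, total revenue); keeping it current as new orders arrive is exactly the incremental-update burden described above.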

Applications of the Column Family NoSQL Database

Let us understand the key applications of the column family NoSQL database in real-world scenarios.

BigTable (Column Family Database) to Store Sparse Data

We know that a NULL value in a relational database still consumes space, typically a couple of bytes, depending on the database and the column type.

When a table contains many NULL values, this adds up to a significant amount of wasted space.

Let us suppose we have a “Contact Application” that stores a username and contact details for every type of contact channel, such as Home-Phone, Cell-Phone, Email, Facebook, Twitter, etc.

If, say, only the Cell-Phone detail is available for some of the users, then hundreds of bytes will be wasted per record.

Below is a sample contact table in an RDBMS which clearly shows the waste of space per record.

| ContactID | Home-Phone | Cell-Phone | Email1 | Email2 | Facebook | Twitter   |
|-----------|------------|------------|--------|--------|----------|-----------|
| 1X2B      | NULL       | 9867       | x@abc  | NULL   | NULL     | NULL      |
| 2X2B      | 1234       | NULL       | NULL   | NULL   | NULL     | #bigtable |
| 3X3Y      | 3456       | 9845       | NULL   | y@wqa  | a@fb.com | #hadoop   |

Contact Table in RDBMS

The storage issue can be fixed by using BigTable, which manages sparse data very well, instead of an RDBMS.

BigTable stores only the columns that have values for each record instance.

If, for ContactID ‘1X2B’, only the Cell-Phone and Email1 values are present, then only those columns are stored; NULL columns are simply not written, so no space is wasted.

| RowKey | Column Values                            |
|--------|------------------------------------------|
| 1234   | ph:cell=9867, email:1=x@abc              |
| 3678   | social:twitter=#bigtable, ph:home=1234   |
| 5987   | email:2=y@wqa, social:facebook=a@fb.com  |

Contact Table in BigTable Storage
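As a minimal sketch, assuming an HBase-style table named contacts with column families ph and email (illustrative names), the write for ContactID 1X2B stores only the columns that actually have values:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class SparseContactWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("contacts"))) {
            // Only the columns that actually have values are written;
            // absent columns (Email2, Facebook, Twitter, ...) simply do not exist for this row.
            Put put = new Put(Bytes.toBytes("1X2B"));
            put.addColumn(Bytes.toBytes("ph"), Bytes.toBytes("cell"), Bytes.toBytes("9867"));
            put.addColumn(Bytes.toBytes("email"), Bytes.toBytes("1"), Bytes.toBytes("x@abc"));
            table.put(put);
        }
    }
}
```

Columns that are NULL in the relational version are never written, so they occupy no space at all.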

Analysing Log Files Using BigTable

Log analysis is a common use case for any Big Data project. The data generated by your IT infrastructure in log files is often referred to as data exhaust.

This vast amount of log information is stored in BigTable and analysed in near real time in order to track the most up-to-date information.

The reason log files fit BigTable well is that they have flexible columns with varying structures.

| HostName | IP Address  | Event DT            | Type    | Duration | Description     |
|----------|-------------|---------------------|---------|----------|-----------------|
| server1  | 10.219.12.3 | 4-Feb-15 1:04:21 PM | dwn.exe | 15       | Desktop Manager |
| server2  | 10.112.3.45 | 4-May-15 1:04:21 PM | jboss   | 45       |                 |

Typical Log File

| RowKey | Column Values |
|--------|---------------|
| 12AE23 | host:name=server1, ip:address=10.219.12.3, event:DT=4-Feb-15 1:04:21 PM, Type=dwn.exe, Dur=15, Desc=Desktop Manager |

Log File stored in BigTable

Flume Installation – Apache Flume Agent

Flume installation is a very simple process. Go through the following steps to install and configure Apache Flume.

Steps for Flume Installation

  1. Download Apache Flume from the Apache webpage. (Apache Flume Download Url)
  2. Extract the downloaded archive and point to the Flume folder in your bash profile. The bash profile entry ensures that you can start the Flume agent from any directory. For example: export FLUME_HOME=$HOME/apache-flume-1.6.0-bin
  3. Append FLUME_HOME to the PATH variable. For example: export PATH=$PATH:$FLUME_HOME/bin

Flume Agent

A Flume agent is an independent Java daemon (JVM) that receives events (data) from an external source and passes them on to the next destination. The next destination can be another agent or a sink. A Flume agent can connect any number of sources to any number of data stores in big data.

Let’s understand this with an example:

Apache Flume Agent

Suppose you have two data sources, say DS1 and DS2, and you want to write DS1 data into HDFS and DS2 data into Cassandra.

For this scenario, one Flume agent is enough to complete the job. A Flume agent works hop by hop, i.e. writing the data of DS1 and DS2 to HDFS and Cassandra is one complete hop.

Now suppose the data has another hop: the data written to HDFS is read by some other application and finally needs to go to another data store, say Hive.

Here, two Flume agents are required, since there are two hops of data: one agent writes DS1 and DS2 data to HDFS and Cassandra respectively, and the other moves the data from HDFS to Hive.

There are three basic components of a Flume agent:

  1. The first component receives data; it is called the source.
  2. The second component buffers data; it is called the channel.
  3. The third component writes data out; it is called the sink.

A Flume agent is set up using a configuration file, in which we configure the sources and the format in which data is sent to each source.
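As a minimal sketch, assuming an agent named a1 with a netcat source, a memory channel, and an HDFS sink (agent name, port, and path are illustrative), the configuration file could look like this:

```properties
# flume-conf.properties (illustrative): one source, one channel, one sink
a1.sources  = r1
a1.channels = c1
a1.sinks    = k1

# Source: listens for newline-terminated events on a TCP port
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Channel: in-memory buffer; size it according to the source and sink rates
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 100

# Sink: writes events to HDFS
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events

# Wire the components together
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```

The agent can then be started with a command along the lines of flume-ng agent --conf conf --conf-file flume-conf.properties --name a1.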

It is important to configure the channel capacity according to the rates of the source and the sink. There are many types of channels, but two are most commonly used.

The first is the in-memory channel, which buffers the data in the memory of the system where the Flume agent is running. The in-memory channel basically acts like an in-memory queue.

The source will write to the tail of the queue and the sink will read from the head of the queue.

But there are issues with the memory channel. The primary one is that its capacity is constrained by the amount of memory the system has, and it is not persistent in case of a crash.

Therefore, all the data present in the buffer might be lost. File channels are better in this respect because they give you fault tolerance and lossless delivery, i.e. a guarantee of no data loss.

Since a file channel buffers the data on disk, you can have a larger buffer capacity as per your requirement. Channels are continuously polled by sink components, which write the data to the endpoints.

Multiple sources can write to a single channel, and one source can write to multiple channels, i.e. there is a many-to-many relationship between sources and channels.

However, the channel-to-sink relationship is one-to-one. The channel does not delete data immediately as it is written to the sink; it waits for an acknowledgement from the sink before deleting anything.

Hadoop 2 Architecture – Key Design Concepts, YARN, Summary

Hadoop 2 came about to overcome the limitations of Hadoop 1.x. The Hadoop 2 architecture removes those limitations and meets current data processing requirements.

Hadoop 2 Architecture – Key Design Concepts

  • Split up the two major functions of the job tracker:
    • Cluster resource management
    • Application life-cycle management
  • MapReduce becomes a user library, i.e. just one of the applications running on Hadoop.

Here are the key concepts to understand before getting into the details of the Hadoop 2 architecture.

Application

An application is a job submitted to the framework, for example a MapReduce job.

Container

A container is the basic unit of allocation and represents a slice of physical resources, for example Container A = 2 GB memory, 1 CPU.

The node manager provides containers to an application; to be precise, each mapper and reducer runs in its own container. The Application Master negotiates these containers for the mapper and reducer tasks.

Therefore, a container is supervised by the node manager and scheduled by the resource manager.
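As a minimal sketch, assuming the hadoop-yarn-client library is on the classpath, this is roughly how an Application Master describes a container such as "Container A = 2 GB, 1 CPU" when asking the resource manager for it (the values are illustrative):

```java
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;

public class ContainerRequestSketch {
    public static void main(String[] args) {
        // A container is described by its resource capability: 2048 MB of memory, 1 virtual core
        Resource capability = Resource.newInstance(2048, 1);
        Priority priority = Priority.newInstance(0);

        // The Application Master hands requests like this to the resource manager's scheduler;
        // hosts and racks are left null, i.e. the container may be placed on any node.
        AMRMClient.ContainerRequest request =
                new AMRMClient.ContainerRequest(capability, null, null, priority);

        System.out.println("Requested capability: " + request.getCapability());
    }
}
```

The resource manager's scheduler answers such requests with allocated containers, which the node managers then launch and supervise.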

Resource Manager

The resource manager is the global resource scheduler; it tracks resource usage and node health. It has two main components: the scheduler and the application manager.

The responsibility of the scheduler is to allocate resources to the various running applications.

It is important to note that the scheduler does not monitor or track the status of the applications.

The application manager is responsible for negotiating the first container for executing the application-specific Application Master and for providing the service of restarting that container on failure.

Node Manager

The node manager manages the life cycle of containers and also monitors the health of its node.

Application Master

It manages application scheduling and task execution. It interacts with the scheduler to acquire the required resources, and with the node managers to execute and monitor the assigned tasks.

YARN (Yet Another Resource Negotiator)

YARN is the key component of the Hadoop 2 architecture, with HDFS as the underlying file system.

It is a framework for developing and/or executing distributed processing applications, for example MapReduce, Spark, Apache Giraph, etc.

Please go to the post YARN Architecture to understand the Hadoop 2 architecture in detail.

Hadoop 2 Summary

Apache Hadoop 2.0 made a generational shift in architecture, with YARN integrated into the whole Hadoop ecosystem.

It allows multiple applications to run on the same platform. YARN is not the only major feature of Hadoop 2.0; HDFS has also undergone major enhancements in terms of high availability (HA), snapshots, and federation.

NameNode high availability provides automated failover with a hot standby and resiliency for the NameNode master service.

In a high availability (HA) cluster, two separate machines are configured as NameNodes. One NameNode is in the active state while the other is in the standby state.

All client operations are handled by the active node, while the standby acts as a slave node that stays in sync with the active node by having access to a directory on a shared storage device.
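As a minimal sketch of such an HA setup, assuming a logical nameservice called mycluster, two NameNode hosts, and a quorum of JournalNodes as the shared edits storage (all hostnames are illustrative), the relevant hdfs-site.xml entries could look like this:

```xml
<!-- hdfs-site.xml (illustrative excerpt): two NameNodes behind one logical nameservice -->
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>namenode1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn2</name>
  <value>namenode2.example.com:8020</value>
</property>
<property>
  <!-- Shared edit log that keeps the standby in sync with the active NameNode -->
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://jn1.example.com:8485;jn2.example.com:8485;jn3.example.com:8485/mycluster</value>
</property>
<property>
  <name>dfs.client.failover.proxy.provider.mycluster</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
<property>
  <name>dfs.ha.automatic-failover.enabled</name>
  <value>true</value>
</property>
```

With automatic failover enabled, ZooKeeper-based failover controllers promote the standby when the active NameNode fails.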

Snapshots provide point-in-time recovery for backups; the saved state at a given moment is called the snapshot for that point.

To improve scalability and isolation, HDFS federation creates multiple namespaces.

YARN Architecture – Yet Another Resource Negotiator, Hadoop 2

In this post, we will learn about YARN architecture. YARN (Yet Another Resource Negotiator) is the key component of Hadoop 2.x.

The underlying file system continues to be HDFS. YARN is basically a framework for developing and/or executing distributed processing applications, for example MapReduce, Spark, Apache Giraph, etc.

Let us look at a scenario to understand the YARN architecture better.

YARN Architecture

Suppose we have two client requests: one wants to run a MapReduce job, while the other wants to execute a shell script.

In the diagram, the MapReduce job is represented in blue, while the shell script job is represented in green.

The resource manager has two main components: the application manager and the scheduler. The scheduler is responsible for allocating resources to the various running applications. It is a pure scheduler in the sense that it performs no monitoring or tracking of application status.

The scheduler also offers no guarantees about restarting tasks that fail due to hardware or application failures. It performs its scheduling function based on the resource requirements of the applications, using the abstract notion of a resource container, which incorporates elements such as memory, CPU, disk, and network.

The application manager is responsible for accepting job submissions, negotiating the first container for executing the application-specific Application Master, and providing the service of restarting the Application Master container on failure.

The node manager is the per-machine framework agent responsible for containers, monitoring their resource usage (CPU, memory, network, etc.) and reporting it to the resource manager/scheduler.

The Application Master is responsible for negotiating appropriate resource containers from the scheduler, tracking their status, and monitoring their progress.

The green job in the diagram has its own Application Master, and the blue job has its own Application Master. Each Application Master handles its own containers.

YARN Architecture – Another View

Another view of the YARN architecture is one where the resource manager handles the job queue, resource allocation, and job scheduling.

It allocates resources against the list of available resources. The slave node has an application master that handles the task queue and the job life-cycle logic.

Hadoop Commands – HDFS dfs commands, Hadoop Linux commands

The full list of Hadoop commands is much bigger than the one shown here; below we explain some of the most useful Hadoop commands.

“hadoop fs” lists all the Hadoop commands that can be run in FsShell.

“hadoop fs -help <command>” will display help for that command, where <command> is the actual name of the command.

Hadoop Commands and HDFS Commands

All HDFS commands are invoked by the “bin/hdfs” script. Running the hdfs script without any arguments prints the description of all commands.

| Command | Usage | Description |
|---------|-------|-------------|
| classpath | hdfs classpath | Prints the class path needed to get the Hadoop jar and the required libraries. |
| ls | hadoop fs -ls / | Lists the contents of the root directory in HDFS. |
| version | hadoop version | Prints the Hadoop version. |
| df | hadoop fs -df hdfs:/ | Shows the amount of space used and available on the currently mounted filesystem. |
| balancer | hadoop balancer | Runs the cluster balancing utility. |
| mkdir | hadoop fs -mkdir /usr/training/hadoop_files | Creates a new directory hadoop_files below the /usr/training directory in HDFS. |
| put | hadoop fs -put /data/myfile.txt /usr/training/hadoop_files | Adds a sample text file from the local Unix directory (/data/myfile.txt) to the HDFS directory /usr/training/hadoop_files. |
| ls | hadoop fs -ls /usr/training/hadoop_files | Lists the contents of this new directory in HDFS. |
| put | hadoop fs -put /data/finance /usr/training/hadoop_files | Adds the entire local Unix directory /data/finance to the HDFS directory /usr/training/hadoop_files. |
| du | hadoop fs -du -s -h hadoop_files/finance | Shows how much space a given directory occupies in HDFS. |
| rm | hadoop fs -rm hadoop_files/finance/myfile.txt | Deletes the file “myfile.txt” from the “finance” directory. |
| rm | hadoop fs -rm hadoop_files/finance/* | Deletes all files from the “finance” directory using a wildcard. |
| expunge | hadoop fs -expunge | Empties the trash. |
| cat | hadoop fs -cat hadoop_files/myfile.txt | Shows the content of “myfile.txt” in the hadoop_files directory. |
| copyToLocal | hadoop fs -copyToLocal hadoop_files/myfile.txt /scratch/data | Copies myfile.txt from the HDFS directory hadoop_files to the local directory /scratch/data. |
| get | hadoop fs -get hadoop_files/myfile.txt /scratch/data | get can be used as an alternative to copyToLocal. |
| chmod | sudo -u hdfs hadoop fs -chmod 600 hadoop_files/myfiles.txt | Changes the permissions of a file with -chmod. Default file permissions are 666 in HDFS. |
| mv | hadoop fs -mv hadoop_files apache_hadoop | Moves a directory from one location to another. |
| dfsadmin | hdfs dfsadmin -safemode leave | Makes the NameNode leave safe mode. |

Hadoop fs commands – HDFS dfs commands

| Command | Usage | Description |
|---------|-------|-------------|
| fs | hadoop fs | Lists all the Hadoop file system shell commands. |
| help | hadoop fs -help | Shows help for any command. |
| touchz | hdfs dfs -touchz /hadoop_files/myfile.txt | Creates a file in HDFS with a file size of 0 bytes. |
| rmr | hdfs dfs -rmr /hadoop_files/ | Removes the directory from HDFS recursively. |
| count | hdfs dfs -count /user | Counts the number of directories, files, and bytes under the paths that match the specified file pattern. |

Hadoop Linux commands

| Command | Example | Description |
|---------|---------|-------------|
| ls | ls -l; ls -a; ls -l /etc | Lists files in the current directory. Without additional parameters, the contents of the current directory are listed in short form. -l gives a detailed list; -a also displays hidden files. |
| cp | cp [option(s)] <sourcefile> <targetfile>; cp file1 new-file2; cp -r dir1 dir2 | Copies sourcefile to targetfile. -i waits for confirmation, if necessary, before an existing targetfile is overwritten; -r copies recursively (includes subdirectories). |
| mv | mv file_1.txt /scratch/kmak | Moves or renames files: the sourcefile is copied to the targetfile and the original is deleted. |
| rm | rm myfile.txt; rm -r mydirectory | Removes the specified files from the file system. Directories are not removed by rm unless the -r option is used. |
| ln | ln file1.txt file2.txt | Creates links between files. |
| cd | cd /scratch/kmak/bi | Changes the shell’s current working directory. |
| pwd | pwd | Prints the working directory: it writes the full pathname of the current working directory to standard output. |
| mkdir | mkdir <mydir> | Creates directories on a file system. |
| rmdir | rmdir <emptydir> | Deletes the specified directory, provided it is already empty. |
| nl | nl myfile.txt | Numbers the lines in a file. |
| gedit | gedit myfile.txt | Opens the file in the gedit text editor. |
| stat | stat myfile.txt | Displays detailed status information about a file. |
| wc | wc myfile.txt; wc -l myfile.txt; wc -c myfile.txt | Reports the newline, word, and byte/character counts for the specified file. |
| chown | chown chope file.txt; chown -R chope /scratch/work | Changes the owner (and optionally the owning group) of files. |
| chgrp | chgrp oracle myfile.txt | Changes the group ownership of a file or files. |
| ifconfig | ifconfig | Views and changes the configuration of the network interfaces on your system. |