NoSQL Column Family Database – Cloud BigTable, NoSQL Database

A NoSQL column family database is another aggregate-oriented database. It has a single key, also known as the row key, and under that key we can store multiple column families, where each column family is a group of columns that belong together.

A column family as a whole is effectively your aggregate. We use the row key and the column family name to address a column family.
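As a minimal sketch, assuming an HBase table named orders with a column family lineitems and a row key order-1001 (all illustrative names), fetching a whole column family by row key and family name with the HBase Java client might look like this:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ReadOrderAggregate {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("orders"))) {
            // Address the aggregate by row key and column family name
            Get get = new Get(Bytes.toBytes("order-1001"));
            get.addFamily(Bytes.toBytes("lineitems")); // fetch the whole column family in one go
            Result result = table.get(get);
            for (Cell cell : result.rawCells()) {
                System.out.println(Bytes.toString(CellUtil.cloneQualifier(cell))
                        + " = " + Bytes.toString(CellUtil.cloneValue(cell)));
            }
        }
    }
}
```

Because the entire aggregate lives under one row key, this read is served by the single region server that hosts that row rather than by many nodes.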

It is, however, one of the more complex aggregate-oriented databases, but the gain is in the retrieval time of aggregate rows. When we load these aggregates into memory, instead of being spread across a lot of individual records, the whole thing is stored and read from the database in one go.

The database is designed in such a way that it clearly knows what the aggregate boundaries are. This is very useful when we run the database on a cluster.

Since the aggregate binds the related data together, different aggregates can be spread across different nodes in the cluster.

Therefore, if somebody wants to retrieve the data about, say, a particular order, they need to go to only one node in the cluster instead of querying all the other nodes to pick up individual rows and assemble them.

Among the most popular column family NoSQL databases are Apache HBase and Cassandra.

Column Family Database

Aggregate orientation is not always a good thing.

Let us consider a user who needs the revenue details by product and does not care about the revenue by order.

Effectively, they want to change the aggregate structure from an order aggregating line items to a product aggregating line items. Therefore, the product becomes the root of the aggregate.

In a relational database, this is straightforward: we query a few tables, join them, and the result is there on your screen. But when it comes to an aggregate-oriented database, it is a pain.

We have to run separate MapReduce jobs to rearrange the data into the different aggregate form and keep applying incremental updates to the aggregated data in order to serve this business requirement, which is complicated.

Therefore, an aggregate-oriented database has an advantage if most of the time you push the same aggregate back and forth into the system. It is a disadvantage if you want to slice and dice the data in different ways.
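This re-aggregation is easier to see with a concrete job. As a minimal sketch, assuming the order line items have already been exported as plain text lines of the form orderId,productId,revenue (an illustrative layout, not something the database gives you for free), a MapReduce job that re-keys every line item by product and sums the revenue could look like this:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RevenueByProduct {

    // Mapper: each input line is assumed to be "orderId,productId,revenue"
    public static class LineItemMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");
            if (fields.length == 3) {
                // Re-key each line item by product instead of by order
                context.write(new Text(fields[1]), new DoubleWritable(Double.parseDouble(fields[2])));
            }
        }
    }

    // Reducer: sum the revenue for each product
    public static class RevenueReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        @Override
        protected void reduce(Text product, Iterable<DoubleWritable> revenues, Context context)
                throws IOException, InterruptedException {
            double total = 0;
            for (DoubleWritable r : revenues) {
                total += r.get();
            }
            context.write(product, new DoubleWritable(total));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "revenue by product");
        job.setJarByClass(RevenueByProduct.class);
        job.setMapperClass(LineItemMapper.class);
        job.setReducerClass(RevenueReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The output is the product-rooted aggregate (product, total revenue); keeping it current as new orders arrive is exactly the incremental-update burden described above.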

Applications of the Column Family NoSQL Database

Let us understand the key applications of the column family NoSQL database in real-world scenarios.

BigTable (Column Family Database) to Store Sparse Data

We know that a NULL value in a relational database still consumes space, typically a couple of bytes, depending on the database and the column type.

When a table contains many NULL values, this adds up to a significant amount of wasted space.

Let us suppose we have a “Contact Application” that stores a username and contact details for every type of contact channel, such as Home-Phone, Cell-Phone, Email, Facebook, Twitter, etc.

If, say, only the Cell-Phone detail is available for some of the users, then hundreds of bytes will be wasted per record.

Below is a sample contact table in an RDBMS which clearly shows the waste of space per record.

| ContactID | Home-Phone | Cell-Phone | Email1 | Email2 | Facebook | Twitter   |
|-----------|------------|------------|--------|--------|----------|-----------|
| 1X2B      | NULL       | 9867       | x@abc  | NULL   | NULL     | NULL      |
| 2X2B      | 1234       | NULL       | NULL   | NULL   | NULL     | #bigtable |
| 3X3Y      | 3456       | 9845       | NULL   | y@wqa  | a@fb.com | #hadoop   |

Contact Table in RDBMS

The storage issue can be fixed by using BigTable, which manages sparse data very well, instead of an RDBMS.

BigTable stores only the columns that have values for each record instance.

If, for ContactID ‘1X2B’, only the Cell-Phone and Email1 values are present, then only those columns are stored; NULL columns are simply not written, so no space is wasted.

| RowKey | Column Values                            |
|--------|------------------------------------------|
| 1234   | ph:cell=9867, email:1=x@abc              |
| 3678   | social:twitter=#bigtable, ph:home=1234   |
| 5987   | email:2=y@wqa, social:facebook=a@fb.com  |

Contact Table in BigTable Storage
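As a minimal sketch, assuming an HBase-style table named contacts with column families ph and email (illustrative names), the write for ContactID 1X2B stores only the columns that actually have values:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class SparseContactWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("contacts"))) {
            // Only the columns that actually have values are written;
            // absent columns (Email2, Facebook, Twitter, ...) simply do not exist for this row.
            Put put = new Put(Bytes.toBytes("1X2B"));
            put.addColumn(Bytes.toBytes("ph"), Bytes.toBytes("cell"), Bytes.toBytes("9867"));
            put.addColumn(Bytes.toBytes("email"), Bytes.toBytes("1"), Bytes.toBytes("x@abc"));
            table.put(put);
        }
    }
}
```

Columns that are NULL in the relational version are never written, so they occupy no space at all.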

Analysing Log Files Using BigTable

Log analysis is a common use case for any Big Data project. The data generated by your IT infrastructure in log files is often referred to as data exhaust.

This vast amount of log information is stored in BigTable and analysed in near real time in order to track the most up-to-date information.

The reason log files fit BigTable well is that they have flexible columns with varying structures.

| HostName | IP Address  | Event DT            | Type    | Duration | Description     |
|----------|-------------|---------------------|---------|----------|-----------------|
| server1  | 10.219.12.3 | 4-Feb-15 1:04:21 PM | dwn.exe | 15       | Desktop Manager |
| server2  | 10.112.3.45 | 4-May-15 1:04:21 PM | jboss   | 45       |                 |

Typical Log File

| RowKey | Column Values |
|--------|---------------|
| 12AE23 | host:name=server1, ip:address=10.219.12.3, event:DT=4-Feb-15 1:04:21 PM, Type=dwn.exe, Dur=15, Desc=Desktop Manager |

Log File stored in BigTable

Flume Installation – Apache Flume Agent

Flume installation is a very simple process. Go through the following steps to install and configure Apache Flume.

Steps for Flume Installation

  1. Download Apache Flume from the Apache webpage. (Apache Flume Download Url)
  2. Extract the downloaded archive and point to the Flume folder in your bash profile. The bash profile entry ensures that you can start the Flume agent from any directory. For example: export FLUME_HOME=$HOME/apache-flume-1.6.0-bin
  3. Append FLUME_HOME to the PATH variable. For example: export PATH=$PATH:$FLUME_HOME/bin

Flume Agent

A Flume agent is an independent Java daemon (JVM) that receives events (data) from an external source and passes them on to the next destination. The next destination can be another agent or a sink. A Flume agent can connect any number of sources to any number of data stores in big data.

Let’s understand this with an example:

Apache Flume Agent

Suppose you have two data sources, say DS1 and DS2, and you want to write DS1 data into HDFS and DS2 data into Cassandra.

For this scenario, one Flume agent is enough to complete the job. A Flume agent works hop by hop, i.e. writing the data of DS1 and DS2 to HDFS and Cassandra is one complete hop.

Now suppose the data has another hop: the data written to HDFS is read by some other application and finally needs to go to another data store, say Hive.

Here, two Flume agents are required, since there are two hops of data: one agent writes DS1 and DS2 data to HDFS and Cassandra respectively, and the other moves the data from HDFS to Hive.

There are three basic components of a Flume agent:

  1. The first component receives data; it is called the source.
  2. The second component buffers data; it is called the channel.
  3. The third component writes data out; it is called the sink.

A Flume agent is set up using a configuration file, in which we configure the sources and the format in which data is sent to each source.
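As a minimal sketch, assuming an agent named a1 with a netcat source, a memory channel, and an HDFS sink (agent name, port, and path are illustrative), the configuration file could look like this:

```properties
# flume-conf.properties (illustrative): one source, one channel, one sink
a1.sources  = r1
a1.channels = c1
a1.sinks    = k1

# Source: listens for newline-terminated events on a TCP port
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Channel: in-memory buffer; size it according to the source and sink rates
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 100

# Sink: writes events to HDFS
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events

# Wire the components together
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```

The agent can then be started with a command along the lines of flume-ng agent --conf conf --conf-file flume-conf.properties --name a1.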

It is important to configure the channel capacity according to the rates of the source and the sink. There are many types of channels, but two are most commonly used.

The first is the in-memory channel, which buffers the data in the memory of the system where the Flume agent is running. The in-memory channel basically acts like an in-memory queue.

The source will write to the tail of the queue and the sink will read from the head of the queue.

But there are issues with the memory channel. The primary one is that its capacity is constrained by the amount of memory the system has, and it is not persistent in case of a crash.

Therefore, all the data present in the buffer might be lost. File channels are better in this respect because they give you fault tolerance and lossless delivery, i.e. a guarantee of no data loss.

Since a file channel buffers the data on disk, you can have a larger buffer capacity as per your requirement. Channels are continuously polled by sink components, which write the data to the endpoints.

Multiple sources can write to a single channel, and one source can write to multiple channels, i.e. there is a many-to-many relationship between sources and channels.

However, the channel-to-sink relationship is one-to-one. The channel does not delete data immediately as it is written to the sink; it waits for an acknowledgement from the sink before deleting anything.

Hadoop 2 Architecture – Key Design Concepts, YARN, Summary

Hadoop 2 came about to overcome the limitations of Hadoop 1.x. The Hadoop 2 architecture removes those limitations and meets current data processing requirements.

Hadoop 2 Architecture – Key Design Concepts

  • Split up the two major functions of the job tracker:
    • Cluster resource management
    • Application life-cycle management
  • MapReduce becomes a user library, i.e. just one of the applications running on Hadoop.

Here are the key concepts to understand before getting into the details of the Hadoop 2 architecture.

Application

An application is a job submitted to the framework, for example a MapReduce job.

Container

A container is the basic unit of allocation and represents a slice of physical resources, for example Container A = 2 GB memory, 1 CPU.

The node manager provides containers to an application; to be precise, each mapper and reducer runs in its own container. The Application Master negotiates these containers for the mapper and reducer tasks.

Therefore, a container is supervised by the node manager and scheduled by the resource manager.
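As a minimal sketch, assuming the hadoop-yarn-client library is on the classpath, this is roughly how an Application Master describes a container such as "Container A = 2 GB, 1 CPU" when asking the resource manager for it (the values are illustrative):

```java
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;

public class ContainerRequestSketch {
    public static void main(String[] args) {
        // A container is described by its resource capability: 2048 MB of memory, 1 virtual core
        Resource capability = Resource.newInstance(2048, 1);
        Priority priority = Priority.newInstance(0);

        // The Application Master hands requests like this to the resource manager's scheduler;
        // hosts and racks are left null, i.e. the container may be placed on any node.
        AMRMClient.ContainerRequest request =
                new AMRMClient.ContainerRequest(capability, null, null, priority);

        System.out.println("Requested capability: " + request.getCapability());
    }
}
```

The resource manager's scheduler answers such requests with allocated containers, which the node managers then launch and supervise.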

Resource Manager

The resource manager is the global resource scheduler; it tracks resource usage and node health. It has two main components: the scheduler and the application manager.

The responsibility of the scheduler is to allocate resources to the various running applications.

It is important to note that the scheduler does not monitor or track the status of the applications.

The application manager is responsible for negotiating the first container for executing the application-specific Application Master and for providing the service of restarting that container on failure.

Node Manager

The node manager manages the life cycle of containers and also monitors the health of its node.

Application Master

It manages application scheduling and task execution. It interacts with the scheduler to acquire the required resources, and with the node managers to execute and monitor the assigned tasks.

YARN (Yet Another Resource Negotiator)

YARN is the key component of the Hadoop 2 architecture, with HDFS as the underlying file system.

It is a framework for developing and/or executing distributed processing applications, for example MapReduce, Spark, Apache Giraph, etc.

Please go to the post YARN Architecture to understand the Hadoop 2 architecture in detail.

Hadoop 2 Summary

Apache Hadoop 2.0 made a generational shift in architecture, with YARN integrated into the whole Hadoop ecosystem.

It allows multiple applications to run on the same platform. YARN is not the only major feature of Hadoop 2.0; HDFS has also undergone major enhancements in terms of high availability (HA), snapshots, and federation.

NameNode high availability provides automated failover with a hot standby and resiliency for the NameNode master service.

In a high availability (HA) cluster, two separate machines are configured as NameNodes. One NameNode is in the active state while the other is in the standby state.

All client operations are handled by the active node, while the standby acts as a slave node that stays in sync with the active node by having access to a directory on a shared storage device.
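As a minimal sketch of such an HA setup, assuming a logical nameservice called mycluster, two NameNode hosts, and a quorum of JournalNodes as the shared edits storage (all hostnames are illustrative), the relevant hdfs-site.xml entries could look like this:

```xml
<!-- hdfs-site.xml (illustrative excerpt): two NameNodes behind one logical nameservice -->
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>namenode1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn2</name>
  <value>namenode2.example.com:8020</value>
</property>
<property>
  <!-- Shared edit log that keeps the standby in sync with the active NameNode -->
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://jn1.example.com:8485;jn2.example.com:8485;jn3.example.com:8485/mycluster</value>
</property>
<property>
  <name>dfs.client.failover.proxy.provider.mycluster</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
<property>
  <name>dfs.ha.automatic-failover.enabled</name>
  <value>true</value>
</property>
```

With automatic failover enabled, ZooKeeper-based failover controllers promote the standby when the active NameNode fails.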

Snapshots provide point-in-time recovery for backups; the saved state at a given moment is called the snapshot for that point.

To improve scalability and isolation, HDFS federation creates multiple namespaces.

YARN Architecture – Yet Another Resource Negotiator, Hadoop 2

In this post, we will learn about YARN architecture. YARN (Yet Another Resource Negotiator) is the key component of Hadoop 2.x.

The underlying file system continues to be HDFS. YARN is basically a framework for developing and/or executing distributed processing applications, for example MapReduce, Spark, Apache Giraph, etc.

Let us look at a scenario to understand the YARN architecture better.

YARN Architecture

Suppose we have two client requests: one wants to run a MapReduce job, while the other wants to execute a shell script.

In the diagram, the MapReduce job is represented in blue, while the shell script job is represented in green.

The resource manager has two main components: the application manager and the scheduler. The scheduler is responsible for allocating resources to the various running applications. It is a pure scheduler in the sense that it performs no monitoring or tracking of application status.

The scheduler also offers no guarantees about restarting tasks that fail due to hardware or application failures. It performs its scheduling function based on the resource requirements of the applications, using the abstract notion of a resource container, which incorporates elements such as memory, CPU, disk, and network.

The application manager is responsible for accepting job submissions, negotiating the first container for executing the application-specific Application Master, and providing the service of restarting the Application Master container on failure.

The node manager is the per-machine framework agent responsible for containers, monitoring their resource usage (CPU, memory, network, etc.) and reporting it to the resource manager/scheduler.

The Application Master is responsible for negotiating appropriate resource containers from the scheduler, tracking their status, and monitoring their progress.

The green job in the diagram has its own Application Master, and the blue job has its own Application Master. Each Application Master handles its own containers.

YARN Architecture – Another View

Another view of the YARN architecture is one where the resource manager handles the job queue, resource allocation, and job scheduling.

It allocates resources against the list of available resources. The slave node has an application master that handles the task queue and the job life-cycle logic.

Hadoop Commands – HDFS dfs commands, Hadoop Linux commands

The full list of Hadoop commands is much bigger than the one shown here; below we explain some of the most useful Hadoop commands.

“hadoop fs” lists all the Hadoop commands that can be run in FsShell.

“hadoop fs -help <command>” will display help for that command, where <command> is the actual name of the command.

Hadoop Commands and HDFS Commands

All HDFS commands are invoked by the “bin/hdfs” script. Running the hdfs script without any arguments prints the description of all commands.

| Command | Usage | Description |
|---------|-------|-------------|
| classpath | hdfs classpath | Prints the class path needed to get the Hadoop jar and the required libraries. |
| ls | hadoop fs -ls / | Lists the contents of the root directory in HDFS. |
| version | hadoop version | Prints the Hadoop version. |
| df | hadoop fs -df hdfs:/ | Shows the amount of space used and available on the currently mounted filesystem. |
| balancer | hadoop balancer | Runs the cluster balancing utility. |
| mkdir | hadoop fs -mkdir /usr/training/hadoop_files | Creates a new directory hadoop_files below the /usr/training directory in HDFS. |
| put | hadoop fs -put /data/myfile.txt /usr/training/hadoop_files | Adds a sample text file from the local Unix directory (/data/myfile.txt) to the HDFS directory /usr/training/hadoop_files. |
| ls | hadoop fs -ls /usr/training/hadoop_files | Lists the contents of this new directory in HDFS. |
| put | hadoop fs -put /data/finance /usr/training/hadoop_files | Adds the entire local Unix directory /data/finance to the HDFS directory /usr/training/hadoop_files. |
| du | hadoop fs -du -s -h hadoop_files/finance | Shows how much space a given directory occupies in HDFS. |
| rm | hadoop fs -rm hadoop_files/finance/myfile.txt | Deletes the file “myfile.txt” from the “finance” directory. |
| rm | hadoop fs -rm hadoop_files/finance/* | Deletes all files from the “finance” directory using a wildcard. |
| expunge | hadoop fs -expunge | Empties the trash. |
| cat | hadoop fs -cat hadoop_files/myfile.txt | Shows the content of “myfile.txt” in the hadoop_files directory. |
| copyToLocal | hadoop fs -copyToLocal hadoop_files/myfile.txt /scratch/data | Copies myfile.txt from the HDFS directory hadoop_files to the local directory /scratch/data. |
| get | hadoop fs -get hadoop_files/myfile.txt /scratch/data | get can be used as an alternative to copyToLocal. |
| chmod | sudo -u hdfs hadoop fs -chmod 600 hadoop_files/myfiles.txt | Changes the permissions of a file with -chmod. Default file permissions are 666 in HDFS. |
| mv | hadoop fs -mv hadoop_files apache_hadoop | Moves a directory from one location to another. |
| dfsadmin | hdfs dfsadmin -safemode leave | Makes the NameNode leave safe mode. |

Hadoop fs commands – HDFS dfs commands

| Command | Usage | Description |
|---------|-------|-------------|
| fs | hadoop fs | Lists all the Hadoop file system shell commands. |
| help | hadoop fs -help | Shows help for any command. |
| touchz | hdfs dfs -touchz /hadoop_files/myfile.txt | Creates a file in HDFS with a file size of 0 bytes. |
| rmr | hdfs dfs -rmr /hadoop_files/ | Removes the directory from HDFS recursively. |
| count | hdfs dfs -count /user | Counts the number of directories, files, and bytes under the paths that match the specified file pattern. |

Hadoop Linux commands

| Command | Example | Description |
|---------|---------|-------------|
| ls | ls -l; ls -a; ls -l /etc | Lists files in the current directory. Without additional parameters, the contents of the current directory are listed in short form. -l gives a detailed list; -a also displays hidden files. |
| cp | cp [option(s)] <sourcefile> <targetfile>; cp file1 new-file2; cp -r dir1 dir2 | Copies sourcefile to targetfile. -i waits for confirmation, if necessary, before an existing targetfile is overwritten; -r copies recursively (includes subdirectories). |
| mv | mv file_1.txt /scratch/kmak | Moves or renames files: the sourcefile is copied to the targetfile and the original is deleted. |
| rm | rm myfile.txt; rm -r mydirectory | Removes the specified files from the file system. Directories are not removed by rm unless the -r option is used. |
| ln | ln file1.txt file2.txt | Creates links between files. |
| cd | cd /scratch/kmak/bi | Changes the shell’s current working directory. |
| pwd | pwd | Prints the working directory: it writes the full pathname of the current working directory to standard output. |
| mkdir | mkdir <mydir> | Creates directories on a file system. |
| rmdir | rmdir <emptydir> | Deletes the specified directory, provided it is already empty. |
| nl | nl myfile.txt | Numbers the lines in a file. |
| gedit | gedit myfile.txt | Opens the file in the gedit text editor. |
| stat | stat myfile.txt | Displays detailed status information about a file. |
| wc | wc myfile.txt; wc -l myfile.txt; wc -c myfile.txt | Reports the newline, word, and byte/character counts for the specified file. |
| chown | chown chope file.txt; chown -R chope /scratch/work | Changes the owner (and optionally the owning group) of files. |
| chgrp | chgrp oracle myfile.txt | Changes the group ownership of a file or files. |
| ifconfig | ifconfig | Views and changes the configuration of the network interfaces on your system. |