NoSQL databases – Introduction, features, NoSQL vs SQL

NoSQL databases are non-relational database management systems that are cluster-friendly and designed for large volumes of data held in distributed data stores.

The relational model is the de facto standard for database design. It uses primary-key and foreign-key relationships to store and manipulate data.

However, with the growing volume of unstructured data in distributed computing environments, the relational model does not fit well. Relational databases were not built to take advantage of the commodity storage and processing power available today.

As data volume grows, it becomes difficult to store the data on the single-node systems that the relational model typically assumes. This gave rise to commodity storage, where a large cluster of commodity machines interact with each other in a distributed fashion.

Traditional SQL (Structured Query Language) databases were designed to work on a single-node system and do not work well across a large storage cluster. Therefore, top internet companies such as Google, Facebook and Amazon started looking for solutions to overcome this drawback of the RDBMS.

This inspired a whole new movement of databases: the “NoSQL” movement.

NoSQL databases do not require a fixed schema, and they typically scale horizontally, i.e. extra commodity machines are added to the resource pool so that the load can be distributed easily.

Sometimes we create data with several levels of nesting, which is complicated to understand. Examples are geo-spatial and molecular modeling data.

Many NoSQL databases ease the representation of nested or multi-level hierarchical data by using the JSON (JavaScript Object Notation) format.

NoSQL Databases Features

Let's go through some of the key NoSQL database features and how they differ from traditional databases.

Schemaless databases

Traditional databases require a pre-defined schema.

A database schema is the structure that describes how the data is organized, the relations among database objects and the constraints applied to the data.

In a NoSQL database, by contrast, there is no need to define the structure up front. This gives us the flexibility to store information without doing upfront schema design.

Therefore, a user of a NoSQL database can store data of different structures in the same database table. That is why these databases are sometimes referred to as “schema on read” databases.

That means a plan or schema is applied to the data only when it is read or pulled out of its stored location.
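
The schema-on-read idea can be sketched in a few lines of Python. This is only an illustration, assuming records are kept as JSON strings in a plain list; the `store` list and the `read_customer` helper are hypothetical names, not part of any NoSQL product.

```python
import json

# Records with different structures are stored side by side;
# nothing is validated when they are written.
store = [
    json.dumps({"id": 1, "name": "Asha", "email": "asha@example.com"}),
    json.dumps({"id": 2, "name": "Ravi", "phones": ["555-0100", "555-0101"]}),
]

def read_customer(raw):
    """Apply the schema at read time: pick the expected fields, default the rest."""
    doc = json.loads(raw)
    return {
        "id": doc["id"],
        "name": doc["name"],
        "email": doc.get("email"),        # missing in some records
        "phones": doc.get("phones", []),  # missing in others
    }

customers = [read_customer(raw) for raw in store]
for customer in customers:
    print(customer)
```

The writer never had to agree on a structure; the reader decides which fields it expects and how to treat the ones that are absent.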

Non-Relational

NoSQL databases are non-relational in nature. They can store any type of content.

There is no need to apply data modeling techniques such as ER modeling or star-schema modeling.

A single record can accommodate transaction details as well as attribute details such as address, account and cost center.

Non-relational data does not fit into rows and columns, and these databases are mainly designed to handle unstructured data.

Distributed Computing

You can scale your system horizontally by taking advantage of low-end commodity servers.

Distributing the processing load and scaling the data sets are common features of many NoSQL databases.

Data is automatically distributed over the cluster of commodity servers, and if you need more scalability, you can keep adding commodity servers to the cluster.
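
One simple way to picture how data gets spread over a cluster is hash partitioning. The sketch below is an assumption made for illustration, not the placement scheme of any particular database; the node names and the `node_for` and `distribute` helpers are hypothetical.

```python
import hashlib

def node_for(key, nodes):
    """Pick a node by hashing the key; the same key always maps to the same node."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]

def distribute(keys, nodes):
    """Group keys by the node that would store them."""
    placement = {node: [] for node in nodes}
    for key in keys:
        placement[node_for(key, nodes)].append(key)
    return placement

keys = [f"user:{i}" for i in range(1000)]
three_nodes = distribute(keys, ["node-a", "node-b", "node-c"])
four_nodes = distribute(keys, ["node-a", "node-b", "node-c", "node-d"])

# Adding a fourth machine spreads the same keys over more nodes.
for node, stored in four_nodes.items():
    print(node, len(stored))
```

Plain modulo placement moves many keys when a node is added, which is why real systems often use consistent hashing or range partitioning instead.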

Aggregate Data Models

The aggregate model treats related data as a single unit, which makes it easier to manage data over a cluster.

When a unit of data is retrieved from a NoSQL database, all the related data comes along with it.

Let us say we need to find products by category. In the relational model, we normalize the data and create two tables, Product and Category. Whenever we need to retrieve the products in a category, we perform a join operation.

In a NoSQL database, by contrast, we create one document which holds the product as well as the category information.

Product =
{
    sku: 321342,
    name: "book",
    price: 50.00,
    subject: "mathematics",
    item_in_stocks: 5000,
    category: [{id: 1, name: "math5"}, {id: 2, name: "math6"}]
}
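
The difference between the two approaches can be sketched with plain Python data structures. This is a rough illustration only: the link table `product_category` is an assumption added here to model a product that belongs to several categories, and the function names are hypothetical.

```python
# Relational style: normalized "tables" that have to be combined with a join.
products = [{"sku": 321342, "name": "book", "price": 50.00}]
categories = [{"id": 1, "name": "math5"}, {"id": 2, "name": "math6"}]
product_category = [
    {"sku": 321342, "category_id": 1},
    {"sku": 321342, "category_id": 2},
]

def products_in_category_joined(category_name):
    """Emulate a join across the three tables."""
    ids = {c["id"] for c in categories if c["name"] == category_name}
    skus = {pc["sku"] for pc in product_category if pc["category_id"] in ids}
    return [p for p in products if p["sku"] in skus]

# Aggregate style: one document carries the product and its categories together.
product_docs = [{
    "sku": 321342, "name": "book", "price": 50.00, "subject": "mathematics",
    "item_in_stocks": 5000,
    "category": [{"id": 1, "name": "math5"}, {"id": 2, "name": "math6"}],
}]

def products_in_category_aggregate(category_name):
    """No join: each document already holds its category details."""
    return [d for d in product_docs
            if any(c["name"] == category_name for c in d["category"])]

print(products_in_category_joined("math5"))
print(products_in_category_aggregate("math5"))
```

Both lookups find the same product, but the aggregate version reads a single unit of data, which is what makes it easy to keep that unit together on one node of a cluster.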

Flume Hadoop Agent – Spool directory to HDFS

In the previous post, we saw how to write data from a spooling directory to the console log. In this post, we will learn how a Flume Hadoop agent can write data from a spooling directory to HDFS.

To write the data from the spooling directory to HDFS, we need to edit the properties file and change the sink information to HDFS.

spool-to-hdfs.properties

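# Name the components of this agent
agent1.sources = datasource1
agent1.sinks = datastore1
agent1.channels = ch1

# A source can feed several channels ("channels"); a sink reads from exactly one ("channel")
agent1.sources.datasource1.channels = ch1
agent1.sinks.datastore1.channel = ch1

# Spooling directory source; property names are case-sensitive (spoolDir)
agent1.sources.datasource1.type = spooldir
agent1.sources.datasource1.spoolDir = /usr/kmayank/spooldir

# HDFS sink writing plain, uncompressed files
agent1.sinks.datastore1.type = hdfs
agent1.sinks.datastore1.hdfs.path = /temp/flume
agent1.sinks.datastore1.hdfs.filePrefix = events
agent1.sinks.datastore1.hdfs.fileSuffix = .log
agent1.sinks.datastore1.hdfs.inUsePrefix = _
agent1.sinks.datastore1.hdfs.fileType = DataStream

# Durable, file-backed channel
agent1.channels.ch1.type = file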

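With the above properties file, a file that is still being written in HDFS carries the in-use prefix “_”. Once the file is rolled, the prefix is dropped and the file name consists of “events”, a counter generated by Flume, and the suffix “.log”.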

How many events read from the source will go into one HDFS file?

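The files in HDFS are rolled over every 30 seconds by default. You can change the interval by setting the “hdfs.rollInterval” property, whose value is specified in seconds. You can also roll over the files by event count (“hdfs.rollCount”) or by cumulative file size (“hdfs.rollSize”). A file is rolled as soon as any one of these thresholds is reached, so the number of events in a file depends on which threshold is hit first.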
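
The roll decision can be modeled in Python as follows. This is a simplified sketch, not Flume's actual implementation; the `should_roll` function is hypothetical, and its defaults mirror the documented defaults of the HDFS sink (30 seconds, 10 events, 1024 bytes), where a value of 0 switches a threshold off.

```python
def should_roll(elapsed_seconds, event_count, bytes_written,
                roll_interval=30, roll_count=10, roll_size=1024):
    """Return True when any enabled threshold is reached; 0 disables a threshold."""
    if roll_interval > 0 and elapsed_seconds >= roll_interval:
        return True
    if roll_count > 0 and event_count >= roll_count:
        return True
    if roll_size > 0 and bytes_written >= roll_size:
        return True
    return False

# With the defaults, the event-count threshold closes the file after 10 events.
print(should_roll(elapsed_seconds=5, event_count=3, bytes_written=200))
print(should_roll(elapsed_seconds=5, event_count=10, bytes_written=200))

# Time-based rolling only: switch off the count and size thresholds.
print(should_roll(5, 5000, 10_000_000, roll_interval=300, roll_count=0, roll_size=0))
```

With the default settings, a file is therefore closed after 10 events unless 1024 bytes or 30 seconds are reached first, which is why the defaults tend to produce many small files.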

The “hdfs.fileType” property can take three values: SequenceFile, DataStream (for example a text file) or CompressedStream.

The default option is SequenceFile, which is a binary file format. The event body is a byte array, and that byte array is written to the sequence file.

It is expected that whoever reads the sequence file knows how to deserialize the binary data back into objects.

DataStream writes the events uncompressed, for example as a text file, whereas CompressedStream writes them with a compression codec such as gzip or bzip2, selected through the “hdfs.codeC” property.

Now, start the Flume agent using the command below:

flume-ng agent \
--conf-file spool-to-hdfs.properties \
--name agent1 \
-Dflume.root.logger=WARN,console

Once the Flume Hadoop agent is running, start putting files into the spooling directory. This triggers the agent to process them.

Once you see that the files in the spooling directory carry the “.COMPLETED” suffix, go to HDFS and check whether the files have arrived. Use the command below to list the files in the HDFS directory.

hadoop fs -ls /temp/flume

Use the ‘cat’ command to print the content of the file.

hadoop fs -cat /temp/flume/<file_name>

Apache Flume – Apache Sqoop | Need and Importance

This tutorial outlines the need for and importance of Apache Flume and Apache Sqoop. There are quite a few data stores in the market today. Among them, the most popular ones are HDFS, Hive, HBase, MongoDB and Cassandra.

Advantages of Distributed Data Stores

All of them are open source technologies, but the key question is: why are these data stores so popular?

All of these data stores are distributed in nature, i.e. they use a cluster of machines so that they can scale linearly.

The other advantage of these data stores is that you can use a single system for both transactional and analytical data.

But the main question here is: where does this data come from? Generally, the data comes from two kinds of sources.

The first kind is an application that produces a large amount of data on a regular basis, such as user notifications, weblogs or sales records.

Therefore, we need a data store to hold this kind of data.

The other kind of data source is a traditional RDBMS, such as Oracle Database, MySQL, IBM DB2 or SQL Server.

Steps to port RDBMS data to HDFS

Let us suppose we have a requirement to port archived RDBMS data to the HDFS data store. The important thing to understand is how we get the RDBMS or application data into HDFS.

Normally, the Hadoop ecosystem exposes Java APIs for writing data into HDFS and other data stores. Therefore, we have Java APIs for HDFS, HBase, Cassandra etc.

We can use these APIs directly to write data to HDFS, HBase, Cassandra etc. But there are problems when you are streaming data from an application or bulk transferring data from tables in an RDBMS.

Let us suppose we have an application that sends “user notifications” for a professional network and then tracks metrics such as the number of views or number of clicks on user notifications.

Several events produce data, for example:

Creating a notification
The user reading a notification
The user clicking on a notification

This data needs to be stored as the event occurs. The moment this data gets generated, it needs to be sent to the data store. This is called streaming data.

Let us suppose we want to write this data to HDFS. For this, we first need to integrate our application with the Java APIs of the HDFS data store.

Secondly, we need to devise a mechanism to buffer the streaming data. It is important to note that HDFS stores data as files, which are distributed across the cluster of commodity servers in blocks of, say, 64 MB or 128 MB.

The metadata about these data blocks is kept on one high-end server, which we call the NameNode. If there are a large number of small files, we create extra overhead for the NameNode, which has to keep metadata for many more files.

If each file holds only a small amount of data, we are not taking advantage of HDFS and its cluster of commodity servers.
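
A quick back-of-the-envelope calculation shows how much the file size matters. The sketch below is an illustration under stated assumptions: 10 GB of data, a 128 MB block size, and equally sized files; the `hdfs_objects` helper is hypothetical.

```python
import math

def hdfs_objects(total_bytes, file_bytes, block_bytes=128 * 1024 * 1024):
    """Count the files and blocks the NameNode must track for equally sized files."""
    files = math.ceil(total_bytes / file_bytes)
    blocks_per_file = math.ceil(file_bytes / block_bytes)
    return files, files * blocks_per_file

GB = 1024 ** 3
MB = 1024 ** 2
KB = 1024

small = hdfs_objects(10 * GB, 10 * KB)   # many tiny files
large = hdfs_objects(10 * GB, 128 * MB)  # files sized to one full block
print("10 KB files :", small)
print("128 MB files:", large)
```

Under these assumptions, the same 10 GB needs metadata for 1,048,576 files when written as 10 KB files, but only for 80 files when each file fills one block.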

Now we have a challenge: how do we create a small number of large files from streaming data? One way is to keep an HDFS file open and keep writing into the same file until it is big enough.

But this is not a good idea, because if you keep the file open for a long duration, there is a good chance that the file becomes corrupt or that data is lost in case of a failure.

Therefore, the solution is a buffer, i.e. either an in-memory buffer or an intermediate file that collects the data before we write to HDFS.

The buffer is not just a mechanism for creating a single large file; the buffer layer should also be fault tolerant and non-lossy.
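
The intermediate-file approach can be sketched as follows. This is a minimal illustration, not production code: a local temporary directory stands in for HDFS, and the `SpillBuffer` class, its file names and the 1024-byte threshold are all hypothetical choices.

```python
import os
import tempfile

class SpillBuffer:
    """Append small events to a local spill file; hand off a large file once full."""

    def __init__(self, spill_dir, target_dir, max_bytes):
        self.spill_path = os.path.join(spill_dir, "current.spill")
        self.target_dir = target_dir
        self.max_bytes = max_bytes
        self.flushed = 0

    def append(self, event):
        # Events go to a file on disk, so they survive a restart of the process.
        with open(self.spill_path, "ab") as spill:
            spill.write(event.encode("utf-8") + b"\n")
        if os.path.getsize(self.spill_path) >= self.max_bytes:
            self.flush()

    def flush(self):
        """Move the spill file to the target directory (a stand-in for an HDFS write)."""
        if not os.path.exists(self.spill_path):
            return
        target = os.path.join(self.target_dir, f"events-{self.flushed:05d}.log")
        os.replace(self.spill_path, target)
        self.flushed += 1

spill_dir = tempfile.mkdtemp()
target_dir = tempfile.mkdtemp()
buffer = SpillBuffer(spill_dir, target_dir, max_bytes=1024)
for i in range(200):
    buffer.append(f"notification_viewed user={i}")
buffer.flush()  # hand off whatever is left over
print(sorted(os.listdir(target_dir)))
```

Two hundred small events end up in a handful of larger files. A real buffer layer would also have to handle concurrent writers, partial writes and retries, which is a large part of the effort described here.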

Therefore, we can say that writing an application which buffers the data and integrates with the Java APIs of the various data stores is a time-consuming and costly process.

Similarly, when you want to port your RDBMS data to HDFS, similar challenges occur.

So, using the Java APIs directly is a very tedious process. The solution is Apache Flume and Apache Sqoop, technologies developed to isolate and abstract the transport of data between a source and a data store.

Apache Flume

Apache Flume and Apache Sqoop are open source technologies developed and maintained by the Apache Software Foundation.

Apache Flume acts as a buffer between your streaming application and data store. Apache Flume can read data from different types of sources such as HTTP, a file directory, Syslog messages etc. and write data to many types of sinks e.g. HDFS, HBase, Cassandra etc.

Once we define the source and the sink, we put Apache Flume in between to transport the data between the application and the data store.

Apache Flume buffers the data based on the requirements of the source and the sink.

The source might write data at one rate while the sink reads data at a different rate.

Different sources might produce data at different rates and in different formats. Flume can read from different sources at the same time, and is therefore able to deal with different rates and different formats.
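
The rate mismatch between a source and a sink can be pictured with a small queue sitting between them. The sketch below is a toy model in Python, not Flume's channel implementation; the `Channel` class and the rates used are assumptions chosen for illustration.

```python
from collections import deque

class Channel:
    """A bounded in-memory queue sitting between a source and a sink."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.events = deque()

    def put(self, event):
        """Source side: accept an event, or report back-pressure when full."""
        if len(self.events) >= self.capacity:
            return False
        self.events.append(event)
        return True

    def take(self, batch_size):
        """Sink side: remove up to batch_size events in arrival order."""
        batch = []
        while self.events and len(batch) < batch_size:
            batch.append(self.events.popleft())
        return batch

channel = Channel(capacity=100)
delivered = []
for tick in range(10):
    # The source produces 3 events on every tick ...
    for n in range(3):
        channel.put(f"event-{tick}-{n}")
    # ... while the sink drains a batch of up to 5 only on every second tick.
    if tick % 2 == 1:
        delivered.extend(channel.take(batch_size=5))

print(len(delivered), "delivered,", len(channel.events), "still buffered")
```

The source and the sink never talk to each other directly; the queue absorbs the difference in rates, and events that have not been delivered yet simply wait in the buffer.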

A sink might require data to be written in a particular format at a particular rate. Apache Flume comes with built-in handlers for common sources and sinks.

Apache Flume takes care of fault tolerance and guarantees no data loss for certain configurations. Apache Flume is a push-based system, meaning that it decides when the data should be written to the sink.

Apache Sqoop

In a pull-based system, the sink subscribes to the system that produces the data and pulls the data at a regular frequency.

Whenever you add another sink, the configuration of Flume must change. In a pull-based system, the configuration of the system that writes the data does not change; the new sink simply starts subscribing when it is added. Apache Kafka is an example of a pull-based system.
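
The contrast between the two models can be sketched in Python. This is a conceptual illustration only and reflects neither Flume's nor Kafka's actual APIs; the `PushSource` and `Log` classes are hypothetical.

```python
class PushSource:
    """Push model: the producer holds the sink list and decides when to deliver."""

    def __init__(self, sinks):
        # Adding a sink means changing the producer's configuration.
        self.sinks = sinks

    def emit(self, event):
        for sink in self.sinks:
            sink.append(event)

class Log:
    """Pull model: the producer only appends; each consumer tracks its own offset."""

    def __init__(self):
        self.events = []

    def append(self, event):
        self.events.append(event)

    def read_from(self, offset):
        return self.events[offset:]

# Push: both sinks must be known to the source up front.
hdfs_sink, hbase_sink = [], []
source = PushSource([hdfs_sink, hbase_sink])
source.emit("click")

# Pull: a new consumer can start reading without the producer changing.
log = Log()
log.append("click")
log.append("view")
late_consumer = log.read_from(0)
print(hdfs_sink, hbase_sink, late_consumer)
```

In the push version the list of sinks lives inside the producer, while in the pull version the producer knows nothing about who reads the data.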

Apache Sqoop is a command line tool which can import data from an RDBMS directly into the Hadoop layer.

Sqoop is therefore a pull-based system used for bulk imports, not for streaming data.

Sqoop comes with connectors for many popular RDBMSs. With the help of Sqoop, you can import entire tables from an RDBMS or the result of a specific SQL query. You can also schedule periodic imports using Sqoop jobs.
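
The idea of pulling either a whole table or the result of a query can be illustrated with Python's built-in sqlite3 module. This is not Sqoop and does not use its API; an in-memory SQLite database stands in for the source RDBMS, CSV text stands in for the files that would land in HDFS, and the `bulk_export` helper and the `orders` table are hypothetical.

```python
import csv
import io
import sqlite3

def bulk_export(connection, query):
    """Run a query and return the full result set as CSV text (header plus rows)."""
    cursor = connection.execute(query)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([column[0] for column in cursor.description])
    writer.writerows(cursor)
    return out.getvalue()

# An in-memory SQLite database stands in for the source RDBMS.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)")
db.executemany("INSERT INTO orders (customer, amount) VALUES (?, ?)",
               [("asha", 120.0), ("ravi", 75.5), ("asha", 30.0)])

# Import an entire table ...
whole_table = bulk_export(db, "SELECT * FROM orders")
# ... or only the result of a specific SQL query.
query_result = bulk_export(
    db,
    "SELECT customer, SUM(amount) AS total FROM orders "
    "GROUP BY customer ORDER BY customer",
)
print(whole_table)
print(query_result)
```

In both cases the importing side initiates the transfer and reads the data in bulk, which is what distinguishes this style from the event-by-event streaming handled by Flume.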

Big Data Application in Businesses – Using big data to improve business

Efficient data analysis enables companies to optimize everything in the value chain – from sales to order delivery, to optimal store hours.

The table below shows the areas in which various businesses use big data applications to improve their business models.

Big data enables organizations to define key marketing strategies and is used in almost every industry sector.

Domain	Applications
Retail / ecommerce	01. Market basket analysis 02. Campaign & customer loyalty mgmt program 03. Supply chain management & analytics 04. Behavior tracking 05. Market and consumer segmentation 06. Recommendation engines (to increase order size through complementary products) 07. Cross-channel analytics 08. Individual targeting with right offer at right time
Financial Services	01. Real-time customer insights 02. Risk analysis and management 03. Fraud detection 04. Customer loyalty management 05. Credit risk modeling/analysis 06. Trade surveillance, detecting abnormal activities
IT Operations	01. Log analysis for pattern identification/process analysis 02. Massive storage and parallel processing 03. Data mashup to extract intelligence from data
Health & Life Sciences	01. Health-insurance fraud detection 02. Campaign management 03. Brand & reputation management 04. Patient care and service quality management 05. Gene mapping and analytics 06. Drug discovery
Communication, Media & Technology	01. Real-time calls analysis 02. Network performance management 03. Social graph analysis 04. Mobile user usage analysis
Governance	01. Compliance and regulatory analysis 02. Threat detection, crime prediction 03. Smart cities and e-governance 04. Energy management

Big Data Challenges – Top challenges in big data analytics

There are multiple big data challenges that this great opportunity has thrown at us.

With the advent of the “Internet of Things (IoT)”, efficient analytics and increased connectivity through new technology and software bring significant opportunities for companies.

However, we do see companies facing challenges in leveraging the value that data has to offer. Below are a few of the major big data challenges:

Meeting the need for speed (Processing Capabilities)

The first challenge is how to match the processing speed with the speed at which the data is being generated.

The second is how to extract useful information out of the heap. One possible solution is hardware.

Some customers use increased memory and powerful parallel processing to crunch large volumes of data quickly.

Another method is putting data in-memory. This allows organizations to explore huge data volumes and gain business insights in near-real time.

Understanding the data

One of the basic challenges is to understand and prioritize the data coming from a variety of sources, where ninety percent of the data is noise.

We have to filter the valuable data out from the noise. This requires a good understanding of the data, so that you can use visualization as part of data analysis.

One solution to this challenge is to have the proper domain expertise in place. The people who are analyzing the data should have a deep understanding of where the data comes from, what audience will be consuming the data and how that audience will interpret the information.

Addressing data quality and consistency

Even if you put the data in the proper context for the audience who will be consuming the information, the value of data for decision-making purposes will be jeopardized if the data is not accurate or timely.

Again, data visualization tools and techniques play an important role in assuring data quality.

Data access and connectivity

Data access and connectivity can be another obstacle.

Companies often do not have the right platforms to aggregate and manage data across the enterprise, as the majority of data points are not yet connected.

To overcome this obstacle of a growing volume of data which is not yet connected, companies such as Accenture and Siemens have formed a joint venture which focuses on solutions and services for system integration and data management.