NoSQL Column Family Database – Cloud BigTable, NoSQL Database

A NoSQL column family database is another aggregate-oriented database. It has a single key, known as the row key, and under that key we can store multiple column families, where each column family is a group of columns that fit together.

The column family as a whole is effectively the aggregate. We use the row key and the column family name to address a column family.
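As a rough illustration of this addressing scheme, the sketch below models a table as nested dictionaries: row key, then column family, then column. The names `rows` and `get_column_family` and the sample data are assumptions made for the example, not part of any real database API.

```python
# Hypothetical in-memory model: row key -> column family -> column -> value.
rows = {
    "order-1001": {
        "customer": {"name": "Ann", "city": "Pune"},
        "line_items": {"sku-1": "2", "sku-7": "1"},
    },
}

def get_column_family(table, row_key, family):
    """Return one column family (the aggregate) for a row, or {} if absent."""
    return table.get(row_key, {}).get(family, {})

line_items = get_column_family(rows, "order-1001", "line_items")
print(line_items)
```

The point of the sketch is only that two pieces of information, the row key and the family name, are enough to reach the whole aggregate.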

It is one of the more complicated aggregate-oriented databases, but the payoff is in the retrieval time of aggregate rows. Instead of spreading an aggregate across many individual records, we store the whole thing together, so it can be brought into memory in one go.

The database is designed so that it knows exactly where the aggregate boundaries are. This is very useful when we run the database on a cluster.

Because an aggregate binds its data together, each aggregate is kept whole on one node, and different aggregates are spread across different nodes in the cluster.

Therefore, to retrieve the data about, say, a particular order, we need to go to only one node in the cluster instead of querying all the nodes to pick up different rows and aggregate them.
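The single-node lookup can be sketched with a stable hash of the row key. This is a simplified assumption for illustration: the node names and the `node_for` helper are hypothetical, and real systems use more elaborate partitioning schemes than a plain modulo.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical cluster members

def node_for(row_key, nodes=NODES):
    """Pick the node that owns a row key using a stable hash."""
    digest = hashlib.md5(row_key.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]

owner = node_for("order-1001")
print(owner)
```

Because the whole aggregate lives under one row key, one call to `node_for` tells the client which single node to ask.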

Among the most popular column family NoSQL databases are Apache HBase and Cassandra.

Column Family Database

Aggregate orientation is not always a good thing.

Suppose a user needs revenue details by product and does not care about revenue by order.

Effectively, the user wants to change the aggregate structure from orders that aggregate line items to products that aggregate line items. The product becomes the root of the aggregate.

In a relational database, this is straightforward: we query a few tables, join them, and the result is on the screen. In an aggregate-oriented database, it is painful.

We have to run MapReduce jobs to rearrange the data into different aggregate forms, and then keep applying incremental updates to the aggregated data to serve the business requirement, which is complicated.
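To make the rearrangement concrete, here is a minimal single-process sketch of the map and reduce steps that turn order aggregates into revenue by product. The data and the function names `map_order` and `reduce_revenue` are made up for the example; a real MapReduce job would run these steps distributed over the cluster.

```python
from collections import defaultdict

# Order aggregates: each order embeds its line items (illustrative data).
orders = [
    {"order_id": 1, "items": [{"product": "pen", "price": 2.0, "qty": 3},
                              {"product": "ink", "price": 5.0, "qty": 1}]},
    {"order_id": 2, "items": [{"product": "pen", "price": 2.0, "qty": 2}]},
]

def map_order(order):
    """Map step: emit (product, revenue) pairs from one order aggregate."""
    for item in order["items"]:
        yield item["product"], item["price"] * item["qty"]

def reduce_revenue(pairs):
    """Reduce step: sum revenue per product."""
    totals = defaultdict(float)
    for product, revenue in pairs:
        totals[product] += revenue
    return dict(totals)

revenue_by_product = reduce_revenue(
    pair for order in orders for pair in map_order(order)
)
print(revenue_by_product)
```

Every order has to be read to answer a question about products, which is the cost of having chosen the order as the aggregate root.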

Therefore, an aggregate-oriented database is an advantage if most of the time you use the same aggregate to push data back and forth into the system. It is a disadvantage if you want to slice and dice the data in different ways.

Application of Column family NoSQL Database

Let us look at the key applications of column family NoSQL databases in real-world scenarios.

BigTable (Column Family Database) to store sparse data

In a relational database, a NULL value can still consume storage, often cited as around 2 bytes, although the exact cost depends on the database engine and the column type.

This adds up to a significant amount of wasted space when a table holds many NULL values.

Suppose we have a “Contact Application” that stores a username and contact details for every type of network, such as Home-Phone, Cell-Phone, Zynga, etc.

If, for some users, only the Cell-Phone detail is available, then hundreds of bytes can be wasted per record.

Below is a sample contact table in an RDBMS, which shows the space wasted per record.

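| ContactID | Home-Phone | Cell-Phone | Email1 | Email2 | Facebook | Twitter |
|---|---|---|---|---|---|---|
| 1X2B | NULL | 9867 | x@abc | NULL | NULL | NULL |
| 2X2B | 1234 | NULL | NULL | NULL | NULL | #bigtable |
| 3X3Y | 3456 | 9845 | NULL | y@wqa | a@fb.com | #hadoop |

Contact Table in RDBMS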

The storage issue can be fixed by using BigTable, which manages sparse data very well, instead of an RDBMS.

BigTable stores only the columns that have values for each record instance.

For example, if only the Cell-Phone and Email1 details need to be stored for ContactID ‘1X2B’, then only those two column values are stored. The remaining columns are simply absent: no NULL is written, and hence no space is wasted.

| RowKey | Column Values |
|---|---|
| 1234 | ph:cell=9867, email:1=x@abc |
| 3678 | social:twitter=#bigtable, ph:home=1234 |
| 5987 | email:2=y@wqa, social:facebook=a@fb.com |

Contact Table in Big Table Storage
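The move from a NULL-padded row to sparse column values can be sketched as a simple filter. The `COLUMN_MAP` dictionary and the `to_sparse` function are illustrative assumptions that follow the family:qualifier names used in the contact example; they are not an actual BigTable API.

```python
# One RDBMS-style row per contact; None stands in for NULL.
rdbms_row = {
    "ContactID": "1X2B", "Home-Phone": None, "Cell-Phone": "9867",
    "Email1": "x@abc", "Email2": None, "Facebook": None, "Twitter": None,
}

# Hypothetical mapping from relational column names to family:qualifier names.
COLUMN_MAP = {
    "Home-Phone": "ph:home", "Cell-Phone": "ph:cell",
    "Email1": "email:1", "Email2": "email:2",
    "Facebook": "social:facebook", "Twitter": "social:twitter",
}

def to_sparse(row):
    """Keep only the columns that actually hold a value."""
    return {COLUMN_MAP[col]: val for col, val in row.items()
            if col in COLUMN_MAP and val is not None}

sparse = to_sparse(rdbms_row)
print(sparse)
```

Seven relational columns shrink to the two that carry data, and nothing is stored for the rest.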

Analysing Log Files using BigTable

Log analysis is a common use case for any Big Data project. The data that your IT infrastructure generates in log files is often referred to as data exhaust.

This large volume of log information is stored in BigTable and then analyzed in near real time in order to track the most up-to-date information.

Log files are stored in BigTable because it supports flexible columns, so records with varying structures can be stored side by side.

| HostName | IP Address | Event DT | Type | Duration | Description |
|---|---|---|---|---|---|
| server1 | 10.219.12.3 | 4-Feb-15 1:04:21 PM | dwn.exe | 15 | Desktop Manager |
| server2 | 10.112.3.45 | 4-May-15 1:04:21 PM | jboss | 45 | |

Typical Log File

 

| RowKey | Column Values |
|---|---|
| 12AE23 | host:name=server1, ip:address=10.219.12.3, event:DT=4-Feb-15 1:04:21 PM, Type=dwn.exe, Dur=15, Desc=Desktop Manager |

Log File stored in BigTable

Document Database – Application and Example, NoSQL Database

The central concept of a document database is the document, which can be JSON, BSON, XML, and so on. A document database stores and retrieves documents.

The data structure inside a document is hierarchical and can consist of scalar values, maps, or collections. A document database is similar to a key-value database; the difference is that the document database stores the data in the form of a document that embeds attribute metadata associated with the stored content.
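A small sketch may help show what "hierarchical" means here. The document below and the `get_path` helper are invented for illustration; real document databases expose their own query languages for this kind of attribute access.

```python
import json

# An illustrative document: scalars, a nested map, and a collection.
doc_text = """
{
  "_id": "cust-42",
  "name": "Ann",
  "address": {"city": "Pune", "zip": "411001"},
  "phones": ["9867", "1234"]
}
"""
document = json.loads(doc_text)

def get_path(doc, dotted_path):
    """Follow a dotted path such as 'address.city' through nested maps."""
    current = doc
    for part in dotted_path.split("."):
        current = current[part]
    return current

city = get_path(document, "address.city")
print(city)
```

Because the attribute names travel with the data, the store can look inside the value, which a plain key-value store cannot do.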

Document Database

Every document database uses its own file structure to store data. For example, Apache CouchDB uses JSON to store data, JavaScript as its query language, and HTTP for its API.

Document databases are one of the main categories of NoSQL databases. XML databases are a subclass of document databases and are optimized to work with XML documents.

Graph databases are similar to document databases, but they add one more layer, the relationship, which allows them to link documents for rapid traversal.

Among the most popular document databases are MongoDB, Informix, DocumentDB, CouchDB, BaseX.

Application of Document Databases

Let us see an example of the application of a document database in a real-world scenario.

Financial derivatives trading service management

Financial products Markup Language (FpML) is a widely used document type in the financial services industry.

FpML documents follow an XML schema structure and are mainly used for trading in derivatives markets.

Financial institutions need to analyze multiple risk metrics and build a single view of the customer from FpML documents.

These documents are fed into a document database, whose dynamic query language allows granular access to any data attribute. These databases also allow grouping and reshaping of data through their aggregation framework for intraday analysis.


Key-Value Database – NoSQL Key Value, Application and Examples

A key-value database is a big hash table of keys and values, distributed across a cluster of commodity servers. Key-value databases typically guarantee Availability and Partition Tolerance.

A key-value database trades off consistency of data in order to improve write time.

Key-Value Database

The key in a key-value database can be synthetic or auto-generated, and it uniquely identifies a single record in the database. The values can be strings, JSON, BLOBs, etc.
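The following toy store sketches these ideas: a hash table split into partitions, opaque values, and an auto-generated key when the caller does not supply one. The class `KeyValueStore` and its methods are assumptions for illustration and do not correspond to any product's API.

```python
import uuid

class KeyValueStore:
    """Toy store: a fixed number of partitions, each a plain dict."""

    def __init__(self, partitions=4):
        self._partitions = [{} for _ in range(partitions)]

    def _partition(self, key):
        # hash() is stable within a single process, which is enough here.
        return self._partitions[hash(key) % len(self._partitions)]

    def put(self, value, key=None):
        """Store a value; generate a key when the caller supplies none."""
        key = key or uuid.uuid4().hex
        self._partition(key)[key] = value
        return key

    def get(self, key):
        """Return the stored value, or None for an unknown key."""
        return self._partition(key).get(key)

store = KeyValueStore()
generated_key = store.put('{"name": "Ann"}')  # JSON text as the value
store.put(b"\x00\x01", key="blob-1")          # raw bytes as the value
```

The store never inspects the value, so a string, a JSON document, and a binary blob are all handled the same way.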

Among the most popular key-value databases are Amazon DynamoDB, Oracle NoSQL Database, Riak, Berkeley DB, Aerospike, Project Voldemort, and IBM Informix C-ISAM.

Application of Key-Value Database – NoSQL Key Value

Let us take some real-life examples where key-value databases are used and look at the benefits they provide.

Managing Web Advertisements

Key-value databases are widely used by web advertising companies.

A user’s activity is tracked based on web behavior, language, and location. On the basis of the user’s online activity, web advertising companies decide which advertisement to show to the user.

It is also important that advertisements are served quickly.

It is important to target the right advertisement to the right customer in order to receive more clicks and hence maximize profits.

The combination of factors that determine what a user is interested in, such as the user’s tracked online activity, language, and location, forms the key, whereas all the other information needed to serve the advertisement better is kept as the value in the key-value database.
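A composite key of this kind can be sketched with a tuple. The ad records, the `DEFAULT_AD` fallback, and the `pick_ad` function are hypothetical and exist only to show the shape of the lookup.

```python
# Key: (interest, language, location); value: data needed to serve the ad.
ads = {
    ("sports", "en", "IN"): {"ad_id": "ad-17", "banner": "cricket-gear.png"},
    ("travel", "fr", "FR"): {"ad_id": "ad-52", "banner": "paris-hotels.png"},
}

DEFAULT_AD = {"ad_id": "ad-0", "banner": "generic.png"}

def pick_ad(interest, language, location):
    """Single hash lookup on the composite key, with a fallback."""
    return ads.get((interest, language, location), DEFAULT_AD)

chosen = pick_ad("sports", "en", "IN")
print(chosen["ad_id"])
```

The decision costs one lookup regardless of how many ads are stored, which is what makes this model attractive when the ad has to be served quickly.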

User’s session data retrieval

Your website needs to be efficient and fast to give a user the best service.

However efficient your database is, if your website runs slowly, then from the user’s perspective your entire service is slow.

Websites often run slowly because user sessions are handled poorly. If every request requires opening a new session instead of reusing cached information, the website will be slow.

User interactions with the website are tracked by the website cookies.

A cookie is a small file that holds a unique ID, which can act as a key in a key-value database. The server uses cookies to distinguish returning users from new users.

The server needs to fetch the data quickly by doing a lookup on the cookie. The lookup returns information such as which pages the user visits, what information the user is looking for, and the user’s profile.

Key-value stores are, therefore, ideal for storing and retrieving session data at high speed. The unique ID stored in the cookie acts as the key, whereas the other information, such as the user profile, acts as the value.
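The session lookup can be sketched as follows. The `sessions` dictionary stands in for the key-value store, and `start_session` and `track_page` are hypothetical helpers written for this example, not part of any web framework.

```python
import secrets

sessions = {}  # stands in for the key-value store

def start_session(profile):
    """Create a session ID (the cookie value) and store the session data."""
    session_id = secrets.token_hex(16)
    sessions[session_id] = {"profile": profile, "pages": []}
    return session_id

def track_page(session_id, page):
    """Look up the session by cookie value and record the visited page."""
    session = sessions.get(session_id)
    if session is None:
        return False  # unknown cookie: treat the visitor as a new user
    session["pages"].append(page)
    return True

cookie = start_session({"name": "Ann"})
track_page(cookie, "/pricing")
```

Each request needs only the cookie value to reach the whole session record, so no new session has to be opened per request.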

NoSQL Database Types – Introduction, Example, Comparison and List

In this post, you will learn about NoSQL database types and the basic features of each type. NoSQL databases can be broadly grouped into four categories.

  1. Key-Value databases
  2. Document Databases
  3. Column family NoSQL Database
  4. Graph Databases

NoSQL Database Types Introduction

Let’s go through a short introduction to the features of each of these NoSQL database types. NoSQL databases are widely used in Big Data and provide operational intelligence to users.

Key-Value databases

A key-value database is a big hash table of keys and values, distributed across a cluster of commodity servers. Key-value databases typically guarantee Availability and Partition Tolerance.

Key-value databases trade off consistency of data in order to improve write time.

The key can be synthetic or auto-generated, and it uniquely identifies a single record in the database. The values can be strings, JSON, BLOBs, etc.

Among the most popular key-value databases are Amazon DynamoDB, Oracle NoSQL Database, Riak, Berkeley DB, Aerospike, Project Voldemort, IBM Informix C-ISAM.

Document Databases

The central concept of a document database is the document, which can be JSON, BSON, XML, and so on. Document databases store and retrieve documents.

The data structure inside a document is hierarchical and can consist of scalar values, maps, or collections. A document database is similar to a key-value database; the difference is that the document database stores the data in the form of a document that embeds attribute metadata associated with the stored content.

Every document database uses its own file structure to store data. For example, Apache CouchDB uses JSON to store data, JavaScript as its query language, and HTTP for its API.

Among the most popular document databases are MongoDB, Informix, DocumentDB, CouchDB, BaseX.

Column family NoSQL Database

A column family NoSQL database is another aggregate-oriented database.

It has a single key, known as the row key, and under that key we can store multiple column families, where each column family is a group of columns that fit together. The column family as a whole is effectively the aggregate. We use the row key and the column family name to address a column family.

It is one of the more complicated aggregate-oriented databases, but the payoff is in the retrieval time of aggregate rows. Instead of spreading an aggregate across many individual records, we store the whole thing together, so it can be brought into memory in one go.

The database is designed so that it knows exactly where the aggregate boundaries are. This is very useful when we run the database on a cluster.

Because an aggregate binds its data together, each aggregate is kept whole on one node, and different aggregates are spread across different nodes in the cluster.

Therefore, to retrieve the data about, say, a particular order, we need to go to only one node in the cluster instead of querying all the nodes to pick up different rows and aggregate them.

Among the most popular column family NoSQL databases are Apache HBase and Cassandra.

Graph Databases

Graph databases store data in the form of a graph.

Let us first understand what a graph is. A graph is a mathematical model of relationships between objects: the objects are represented as nodes, and a relationship between two objects is represented as an edge.
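As a minimal sketch of this model, the adjacency list below stores labelled relationships and walks them breadth-first. The sample people, the relationship labels, and the `reachable` function are invented for illustration; this is plain Python, not the API or query language of any graph database.

```python
from collections import deque

# Adjacency list: node -> list of (relationship, neighbour). Illustrative data.
graph = {
    "Ann": [("FRIEND", "Bob")],
    "Bob": [("FRIEND", "Cid"), ("WORKS_AT", "Acme")],
    "Cid": [],
    "Acme": [],
}

def reachable(start, relationship):
    """Breadth-first traversal following only one relationship type."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for rel, neighbour in graph.get(node, []):
            if rel == relationship and neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen - {start}

friends = reachable("Ann", "FRIEND")
print(sorted(friends))
```

Because relationships are stored directly with the nodes, following them does not require joining separate tables.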

We will discuss the concept of a graph database using Neo4j as the reference database.

Neo4j is an open source NoSQL graph database implemented in Java and Scala. The source code is available on GitHub, and the database is used by companies such as Walmart, eBay, and LinkedIn.

CAP Theorem – Brewer’s Theorem | Hadoop HBase

In this post, we will look at the CAP theorem, also called Brewer’s theorem. It was proposed by Eric Brewer of the University of California, Berkeley.

CAP Theorem or Brewer’s Theorem

The CAP theorem, also known as Brewer’s theorem, states that it is impossible for a distributed computing system to simultaneously provide all three of the following guarantees: Consistency, Availability, and Partition tolerance.

Therefore, at any point in time, a distributed system can provide only two of consistency, availability, and partition tolerance.

Availability

Even if one of the nodes goes down, we can still access the data.

Consistency

Every read returns the most recent data.

Partition Tolerance

The system should keep working despite a network outage between the nodes.

These three guarantees are shown as the three vertices of a triangle, and we are free to choose any one side of the triangle.

Therefore, we can choose (Availability and Consistency) or (Availability and Partition Tolerance) or (Consistency and Partition Tolerance).

Please refer to figure below:

CAP Theorem

Relational databases such as Oracle and MySQL choose Availability and Consistency, databases such as Cassandra, CouchDB, and DynamoDB choose Availability and Partition Tolerance, and databases such as HBase and MongoDB choose Consistency and Partition Tolerance.

CAP Theorem Example 1:  Consistency and Partition Tolerance

Let us take an example to understand one of the combinations: Consistency and Partition Tolerance.

These databases are usually sharded or distributed, and they tend to have a master or primary node that handles the write requests. A good example is MongoDB.

What happens when the master goes down?

In this case, another master is usually elected, and until then data cannot be read from the other nodes because it may not be consistent. Therefore, availability is sacrificed.

However, if the write operation succeeded and there is a network outage between the nodes, there is no problem because a secondary node can serve the data. Therefore, partition tolerance is achieved.
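The trade-off described in this example can be sketched with a toy model. The `ReplicaSet` class below is an assumption made for illustration and is not how MongoDB or any other database is implemented: it simply refuses requests while no primary is elected.

```python
class ReplicaSet:
    """Toy CP system: one primary accepts writes; no primary means no service."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.primary = self.nodes[0]
        self.data = {}

    def fail_primary(self):
        """Simulate the primary going down before an election completes."""
        self.nodes.remove(self.primary)
        self.primary = None

    def elect(self):
        """Promote the first remaining node to primary."""
        self.primary = self.nodes[0] if self.nodes else None

    def write(self, key, value):
        if self.primary is None:
            raise RuntimeError("no primary: request rejected")
        self.data[key] = value

    def read(self, key):
        if self.primary is None:
            raise RuntimeError("no primary: request rejected")
        return self.data.get(key)

rs = ReplicaSet(["n1", "n2", "n3"])
rs.write("k", "v1")
rs.fail_primary()
try:
    rs.read("k")
    unavailable = False
except RuntimeError:
    unavailable = True  # availability is given up during the election
rs.elect()
```

Between `fail_primary()` and `elect()` the system answers nothing rather than risk answering with inconsistent data, which is the consistency-over-availability choice.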

CAP Theorem Example 2: Availability and Partition Tolerance

Let us look at an example of Availability and Partition Tolerance.

These databases are also sharded and distributed in nature and are usually master-less, which means every node is equal. Cassandra is a good example of this kind of database.

Suppose we have an overnight batch job that writes data from a mainframe to a Cassandra database, and the same database is read throughout the day. If we read the data as and when it is written, we might get stale data, and hence consistency is sacrificed.

Since this is a read-heavy, write-once use case, we do not care about reading the data immediately. We only care that, once the write has happened, we can read from any of the nodes.

Availability, however, is an important parameter: if one of the nodes goes down, we are still able to read the data from another node. The system as a whole remains available.

Partition tolerance helps us during any network outage between the nodes. If a node becomes unreachable due to a network issue, another node can take over.
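The stale-read behaviour in this example can be sketched with another toy model. The `MasterlessCluster` class is a deliberately simplified assumption and does not reflect how Cassandra replicates data: every node answers immediately, and copies catch up only when `replicate()` runs.

```python
class MasterlessCluster:
    """Toy AP system: any node answers; replication happens later."""

    def __init__(self, names):
        self.replicas = {name: {} for name in names}
        self.pending = []  # writes not yet copied to every replica

    def write(self, node, key, value):
        """Accept the write on one node and queue it for the others."""
        self.replicas[node][key] = value
        self.pending.append((key, value))

    def read(self, node, key):
        """Always answers, but the answer may be stale."""
        return self.replicas[node].get(key)

    def replicate(self):
        """Copy queued writes to every replica."""
        for key, value in self.pending:
            for store in self.replicas.values():
                store[key] = value
        self.pending.clear()

cluster = MasterlessCluster(["n1", "n2"])
cluster.write("n1", "balance", 100)
stale = cluster.read("n2", "balance")  # read before replication
cluster.replicate()
fresh = cluster.read("n2", "balance")  # read after replication
```

The read on `n2` succeeds both times, but only the second one sees the write, which matches the overnight-batch scenario: reads are always served, and they become correct once the write has propagated.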