Big Data technologies provide a way of utilizing all available data through an integrated system. A big data management architecture should be able to incorporate all possible data sources while keeping the Total Cost of Ownership (TCO) low.
You can choose either open-source frameworks or packaged licensed products to take full advantage of the functionality of the various components in the stack. There are four main big data layers, discussed below:
Data Sources Layer
The data sources layer is the most obvious one, but its scale varies widely: many companies now work in the multi-terabyte and even petabyte arena.
It incorporates structured, semi-structured, and unstructured data captured from transaction, interaction, and observation systems such as Facebook and Twitter.
This wide variety of data, arriving in huge volumes at high velocity, has to be seamlessly merged and consolidated so that the analytics engines, as well as the visualization tools, can operate on it as one single big data set.
The responsibility of this layer is to separate the relevant information from the noise in the humongous data sets present at different data access points.
This layer should be able to validate, cleanse, transform, reduce, and integrate the data into the big data stack for further processing.
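As a rough illustration, the validate/cleanse/transform step described above can be sketched in plain Python. The record shape and the rules applied (required fields, whitespace trimming, type coercion) are illustrative assumptions, not part of any particular big data product.

```python
# Hedged sketch of a validate/cleanse/transform step for raw records.
# The field names ("user_id", "amount") and rules are hypothetical.

def cleanse(records):
    """Validate, cleanse, and normalize raw records before ingestion."""
    clean = []
    for rec in records:
        # Validate: drop records missing required fields (the "noise").
        if not rec.get("user_id") or rec.get("amount") is None:
            continue
        clean.append({
            "user_id": str(rec["user_id"]).strip(),  # cleanse stray whitespace
            "amount": float(rec["amount"]),          # transform string to number
        })
    return clean

raw = [
    {"user_id": " u1 ", "amount": "9.99"},
    {"user_id": None, "amount": "1.00"},  # invalid: filtered out as noise
]
print(cleanse(raw))  # → [{'user_id': 'u1', 'amount': 9.99}]
```

In a production stack this logic would typically run inside an ingestion framework rather than a standalone script, but the validate-then-normalize pattern is the same.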
Once the relevant information is captured, it is sent to the management layer, where the Hadoop Distributed File System (HDFS) stores it across multiple commodity servers.
This layer is supported by the storage layer: the robust and inexpensive physical infrastructure that is fundamental to the operation and scalability of a big data architecture.
This layer also provides the tools and query languages for accessing NoSQL databases through the HDFS storage file system, which sits on top of the Hadoop physical infrastructure layer.
The data is no longer stored in a monolithic server where the SQL functions are applied to crunch it.
Redundancy is built into this infrastructure for the very simple reason that we are dealing with large volumes of data from different sources.
The key building block of the Hadoop platform management layer is the MapReduce programming model, which executes a set of functions against a large amount of data in batch mode.
The map function performs the distributed computation, while the reduce function combines all the partial results to produce the final output.
An example of a MapReduce program is one that determines how many times a particular word appears in a document.
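The word-count example can be sketched as a single-process Python program. A real Hadoop job would express these phases as Mapper and Reducer classes distributed across nodes; here we only imitate the three phases (map, shuffle, reduce) to show the shape of the computation.

```python
# Minimal single-process sketch of the MapReduce word-count example.
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in the input split."""
    for word in document.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped

def reduce_phase(grouped):
    """Reduce: combine each word's counts into a single total."""
    return {word: sum(counts) for word, counts in grouped.items()}

doc = "big data needs big ideas and big infrastructure"
counts = reduce_phase(shuffle(map_phase(doc)))
print(counts["big"])  # → 3
```

In Hadoop, the shuffle step is handled by the framework itself; the programmer supplies only the map and reduce functions.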
Analyze & Visualize Layer
This layer provides data discovery mechanisms over the huge volume of data. For such volumes, we need fast search engines with iterative and cognitive approaches. Search engine results can be presented in various forms using “new age” visualization tools and methods.
Real-time analysis can leverage NoSQL stores (for example, Cassandra, MongoDB, and others) to analyze data produced by web-facing apps.
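As a hedged sketch of what such real-time analysis looks like, the snippet below builds a MongoDB-style aggregation pipeline in plain Python. The collection and field names (`events`, `page`, `ts`) are hypothetical; with pymongo you would pass the resulting pipeline to `db.events.aggregate(...)` against a running MongoDB instance.

```python
# Sketch of a MongoDB-style aggregation pipeline for near-real-time
# page-view analysis. Field names ("ts", "page") are illustrative only.
from datetime import datetime, timedelta, timezone

def views_per_page_pipeline(window_minutes):
    """Build a pipeline that counts events per page over a recent window."""
    cutoff = datetime.now(timezone.utc) - timedelta(minutes=window_minutes)
    return [
        {"$match": {"ts": {"$gte": cutoff}}},                # recent events only
        {"$group": {"_id": "$page", "views": {"$sum": 1}}},  # count per page
        {"$sort": {"views": -1}},                            # busiest pages first
    ]

pipeline = views_per_page_pipeline(5)
```

The same match/group/sort pattern applies, with different syntax, in other NoSQL stores such as Cassandra (via CQL) when analyzing data from web-facing apps.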
Big data is important because it enables companies to gather, store, manage, and manipulate vast amounts of data at the right speed and at the right time to gain the right insights. To achieve this, the dimensions of big data must be considered in any big data solution.
A big data solution must address the four V’s mentioned below, also known as the attributes of big data.
Volume refers to the sheer quantity of data being generated. In addition to volume, velocity defines the speed with which different types of data are generated every second.
Variety refers to the structured, semi-structured, and unstructured nature of data such as web logs, sensor data, radio-frequency identification (RFID) readings, meter data, stock ticker data, tweets, images, and video files on the Internet.
Veracity plays an important role in addressing whether data can be trusted when decisions need to be taken.
The variety of data is the first big data dimension.
Variety refers to collecting data from various sources, human and machine, and includes data from social media, credit card usage, website visits, retail shops, hospitals, mobile devices, sensors, log files, security cameras, and so on.
Because data is captured from this variety of sources, and in multiple types (structured, semi-structured, and unstructured) from both internal and external systems, it becomes very important to integrate these multiple data types.
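To make the integration problem concrete, here is a minimal Python sketch that maps a structured CSV export and a semi-structured JSON feed onto one common record shape. The field names (`name`, `spend`, `user`, `amount`) and source labels are invented for illustration.

```python
# Illustrative sketch: integrating two data types into one common schema.
import csv
import io
import json

def from_csv(text):
    """Structured input: a CSV export from an internal system (hypothetical)."""
    for row in csv.DictReader(io.StringIO(text)):
        yield {"source": "crm", "customer": row["name"], "value": float(row["spend"])}

def from_json(text):
    """Semi-structured input: a JSON feed from an external system (hypothetical)."""
    for rec in json.loads(text):
        yield {"source": "web", "customer": rec.get("user", "unknown"),
               "value": float(rec.get("amount", 0))}

csv_data = "name,spend\nalice,12.5\n"
json_data = '[{"user": "bob", "amount": 3}]'

# Both feeds now share the same {source, customer, value} record shape.
merged = list(from_csv(csv_data)) + list(from_json(json_data))
```

At big data scale this normalization is done by ingestion pipelines rather than hand-written scripts, but the underlying idea, mapping heterogeneous inputs onto a shared schema, is the same.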
Volume, the second dimension of big data, refers to the quantity of data.
In the internet era, data is generated by machines and by human interaction on social sites and other platforms, so the volume of data generated every day is humongous.
IBM estimates that 2.5 quintillion bytes of data is created each day.
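To put that figure in perspective, a quick back-of-the-envelope conversion (2.5 quintillion = 2.5 × 10¹⁸):

```python
# 2.5 quintillion bytes per day, expressed in more familiar units.
bytes_per_day = 2.5e18                                # 2.5 exabytes per day
gigabytes_per_day = bytes_per_day / 1e9               # = 2.5 billion GB per day
terabytes_per_second = bytes_per_day / 1e12 / 86_400  # ≈ 29 TB every second
print(f"{gigabytes_per_day:.2e} GB/day, ~{terabytes_per_second:.0f} TB/s")
```

That is, the estimate corresponds to roughly 29 terabytes of new data created every single second, worldwide.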
The third dimension, velocity, deals with the speed at which data flows from various sources such as social media and internal business processes.
In the internet era, the flow of data from social media is massive and continuous, so handling data arriving at such velocity, and turning it into meaningful information, helps organizations make key business decisions.
Veracity, the fourth attribute, refers to the abnormality of data: how much of it can be trusted as-is when decisions have to be taken.
This dimension focuses on how to integrate data from different sources into consistent, high-quality data that can support meaningful business decisions.
Before we move to the definition and introduction of big data, we need to understand why we need big data technology when we already have high-performance, reliable relational database management systems (RDBMS).
Why Big Data
In relational databases, data is stored in a structured format using data modeling techniques such as entity-relationship modeling, star schema modeling, or snowflake schema modeling.
Initially this was just transactional data; as the data grew over time, organizations started analyzing it using data marts and data warehouses.
Business intelligence done on top of data marts and data warehouses became the key driver for CxOs to make forecasts, define budgets, and determine new market drivers of growth.
Until the internet era, business intelligence analysis was done on enterprise data. In the internet era, however, data existing outside the enterprise became key to strategic decisions.
Things started getting more complex in terms of the variety, velocity, and volume of data with the advent of social networking sites and search engines such as Google, Yahoo, and Bing.
Businesses need a pragmatic approach to capturing this information in order to survive, or to gain a competitive advantage over other vendors. That means collecting data generated from a variety of sources, such as images, streaming video, social media feeds, text files, documents, and sensor data, so they can respond and innovate quickly to customer needs.
The solution to this problem is big data; however, the unstructured or semi-structured nature of the data, together with the velocity at which it is created, is the real challenge for big data technology.
Big Data Definition
Let us go through the definition below to understand big data.
Big data is a term that describes the large volume of data, both structured and unstructured, that a business generates on a day-to-day basis.
However, it’s not the amount of data that’s important. The idea behind big data is essentially: how do I use this data to maximize sales and minimize costs, and thereby increase the profit margin?
Organizations are discovering that important predictions can be made by sorting through and analyzing Big Data.
Data is the new oil. There are thousands of companies that work solely on collecting data: no manufacturing plants, no supply chain strategies; they just collect data.
Big Data in Action – Examples of Big Data Analytics
American retail company Walmart collects 2.5 petabytes of unstructured data from 1 million customers every hour, which is equivalent to 167 times the books in the US Library of Congress.
With tons of unstructured data being generated every hour, Walmart is improving its operational efficiency by leveraging big data analytics.
One of the finest applications Walmart has is the Savings Catcher application, which alerts a customer whenever a neighboring competitor reduces the price of an item the customer has already bought.
The application then sends the customer a gift voucher to compensate for the price difference. It runs on top of the tons and tons of data that Walmart collects every hour.
The universe of big data is also full of customer reviews and feedback about particular products, shared through communication channels such as Facebook, Twitter, and product review forums.
It is important for organizations to understand and analyze what customers say about their goods and/or services to ensure customer satisfaction.
By sorting through and analyzing big data, organizations can make important predictions, such as gauging customer sentiment, that give them a clear picture of what they need to do to outperform their competitors.
Therefore, big data can be analyzed for insights that lead to better decisions and strategic business moves.