A big data management architecture should be able to incorporate all possible data sources while keeping the total cost of ownership (TCO) low.
Big Data technologies provide a way to utilize all available data through an integrated system.
You can choose either open source frameworks or packaged licensed products to take full advantage of the functionality of the various components in the stack. The four main Big Data layers are discussed below:
Data Sources Layer
The Data Sources layer is perhaps the most obvious, and its scale varies widely: many companies now work in the multi-terabyte and even petabyte arena.
It incorporates structured, semi-structured, and unstructured data captured from transaction, interaction, and observation systems such as Facebook and Twitter.
This wide variety of data, arriving in huge volumes at high velocity, has to be seamlessly merged and consolidated so that the analytics engines, as well as the visualization tools, can operate on it as one single big data set.
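As a small illustration of this consolidation step, the following Python sketch merges a structured CSV extract (for example, transactions) with semi-structured JSON records (for example, social interactions) into one data set using pandas; the file names and field names are assumptions made for the example.

```python
import pandas as pd

# Structured data: a CSV extract of transactions (hypothetical file and columns).
transactions = pd.read_csv("transactions.csv")                 # e.g., user_id, amount, timestamp

# Semi-structured data: JSON lines of social interactions (hypothetical file).
interactions = pd.read_json("interactions.json", lines=True)   # e.g., user_id, event, timestamp

# Normalize both sources to a common schema before consolidation.
transactions["source"] = "transaction"
interactions["source"] = "interaction"
common_columns = ["user_id", "timestamp", "source"]

# Consolidate into one data set that downstream analytics and
# visualization tools can treat as a single collection of events.
unified = pd.concat(
    [transactions[common_columns], interactions[common_columns]],
    ignore_index=True,
)
print(unified.head())
```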
Acquire/Ingestion Layer
The responsibility of this layer is to separate the relevant information from the noise in the huge data sets arriving from the various data access points.
This layer should have the ability to validate, cleanse, transform, reduce, and integrate the data into the big data tech stack for further processing.
Once the relevant information is captured, it is passed to the Manage layer, where the Hadoop Distributed File System (HDFS) stores it across multiple commodity servers.
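A minimal sketch of this ingestion step is shown below in Python: it validates, cleanses, and reduces raw records before they are handed off to the Manage layer. The record fields and the validation rules are assumptions made for illustration.

```python
import json

# Hypothetical validation rule: keep only records that carry the fields
# downstream processing needs, and drop obvious noise.
REQUIRED_FIELDS = {"user_id", "event", "timestamp"}

def is_valid(record):
    return REQUIRED_FIELDS.issubset(record) and record["user_id"] is not None

def cleanse(record):
    # Transform/normalize values so all sources share one representation.
    record["event"] = record["event"].strip().lower()
    return record

def ingest(raw_lines):
    """Validate, cleanse, and reduce raw JSON lines to the relevant records."""
    for line in raw_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # noise: skip records that cannot be parsed
        if is_valid(record):
            yield cleanse(record)

# Example: feed a few raw lines through the pipeline.
raw = [
    '{"user_id": 1, "event": " Login ", "timestamp": "2015-01-01T10:00:00"}',
    'not-json-noise',
    '{"event": "click"}',  # missing user_id and timestamp: rejected
]
print(list(ingest(raw)))
```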
Manage Layer
This layer is supported by the storage layer: robust, inexpensive physical infrastructure is fundamental to the operation and scalability of a big data architecture.
This layer also provides the tools and query languages to access the NoSQL databases that use HDFS as their storage file system, sitting on top of the Hadoop physical infrastructure layer.
The data is no longer stored in a monolithic server where the SQL functions are applied to crunch it.
Redundancy is built into this infrastructure for the very simple reason that we are dealing with large volumes of data from different sources.
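As a simple illustration of programmatic access to data held on this HDFS-based storage layer, the sketch below reads a file over WebHDFS using the Python hdfs client; the namenode URL, user, and file path are assumptions made for the example.

```python
from hdfs import InsecureClient  # Python client for WebHDFS (pip install hdfs)

# Hypothetical namenode address and HDFS path.
client = InsecureClient("http://namenode:9870", user="hadoop")

# Read a file that the ingestion layer previously stored on HDFS.
with client.read("/data/ingested/events.jsonl", encoding="utf-8") as reader:
    content = reader.read()

print(content[:200])  # show the first few characters of the stored data
```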
The key building block of the Hadoop platform management layer is MapReduce programming, which executes a set of functions against a large amount of data in batch mode.
The map function does the distributed computation task, while the reduce function combines all the elements back together to provide a result.
An example of a MapReduce program would be to determine how many times a particular word appears in a document.
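The word-count example can be sketched in plain Python to show the two phases working together; this is a local simulation of the MapReduce model rather than Hadoop's actual API.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in the input."""
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle/sort: group all emitted values by key before the reduce phase."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: combine the values for each key into a single result."""
    return {word: sum(counts) for word, counts in grouped.items()}

document = "big data needs big storage and big compute"
counts = reduce_phase(shuffle(map_phase(document)))
print(counts)  # {'big': 3, 'data': 1, 'needs': 1, 'storage': 1, 'and': 1, 'compute': 1}
```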
Analyze & Visualize Layer
This layer provides the data discovery mechanisms for the huge volume of data.
Such volumes call for fast search engines with iterative and cognitive approaches. Search engine results can be presented in various forms using “new age” visualization tools and methods.
Real-time analysis can leverage NoSQL stores (for example, Cassandra, MongoDB, and others) to analyze data produced by web-facing apps.
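As one example of such analysis, the sketch below uses pymongo to aggregate events written by a web-facing application into a MongoDB store; the connection string, database, collection, and field names are assumptions made for the example.

```python
from pymongo import MongoClient

# Hypothetical MongoDB instance fed by a web-facing application.
client = MongoClient("mongodb://localhost:27017")
events = client["webapp"]["events"]

# Count page-view events per page over the data currently in the store.
pipeline = [
    {"$match": {"type": "page_view"}},
    {"$group": {"_id": "$page", "views": {"$sum": 1}}},
    {"$sort": {"views": -1}},
    {"$limit": 10},
]
for row in events.aggregate(pipeline):
    print(row["_id"], row["views"])
```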