Big Data Innovations – Google File System, MapReduce, Bigtable

In this post, you will learn about some of the key big data innovations: the Google File System, the MapReduce framework, and Bigtable.

The Google File System

The Google File System (GFS) is one of these key innovations: a scalable distributed file system for data-intensive applications.

GFS delivers high aggregate performance and provides fault tolerance while running on inexpensive commodity hardware.

It has successfully met Google's storage needs, both for the data processed by Google's services and for research and development efforts that require large data sets.

It provides hundreds of terabytes of storage across thousands of disks on over a thousand machines.

Constant monitoring, error detection, fault tolerance, and automatic recovery are integral to the Google File System.

Since GFS deals with multi-GB files, each typically containing many application objects such as web documents, I/O operation and block sizes were revisited so that multi-gigabyte and terabyte files can be processed efficiently.

Appending, rather than overwriting, became the focus of performance optimization and atomicity guarantees, while client-side caching of data blocks lost its appeal for files of this size.

The architecture of GFS consists of a single master and multiple chunk servers and is accessed by multiple clients.

Files are divided into fixed-size chunks of 64 MB. Each chunk is identified by a globally unique 64-bit chunk handle assigned by the master at the time of chunk creation, and the master also decides which commodity chunkservers store its replicas.
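As a rough illustration of this division of labor, the sketch below shows how a client-side read could be translated into a chunk lookup: the byte offset is converted into a chunk index, and the master is asked for the chunk handle and replica locations. The names GFSMaster, ChunkLocation-style bookkeeping, and locate are hypothetical; only the 64 MB chunk size and the single-master metadata lookup come from the GFS design.

```python
# Toy sketch of how a GFS client might translate a read request into a
# chunk lookup. GFSMaster and locate are hypothetical names used for
# illustration only.

CHUNK_SIZE = 64 * 1024 * 1024  # fixed 64 MB chunks


def chunk_index(file_offset: int) -> int:
    """Translate a byte offset within a file into a chunk index."""
    return file_offset // CHUNK_SIZE


def offset_within_chunk(file_offset: int) -> int:
    """Byte offset of the requested data inside its chunk."""
    return file_offset % CHUNK_SIZE


class GFSMaster:
    """Hypothetical stand-in for the single master, which keeps all
    metadata (namespace, chunk handles, chunk locations) in memory."""

    def __init__(self):
        # (file_name, chunk_index) -> (chunk_handle, [chunkserver addresses])
        self.chunk_table = {}

    def locate(self, file_name: str, index: int):
        """Return the chunk handle and replica locations for a chunk, so
        the client can fetch the data directly from a chunkserver."""
        return self.chunk_table.get((file_name, index))


# Example: a client wanting byte 200,000,000 of a file asks the master
# for chunk index 2 and then reads the data from one of the replicas.
idx = chunk_index(200_000_000)            # -> 2
inner = offset_within_chunk(200_000_000)  # offset inside that chunk
```

The point of the sketch is that the master only hands out metadata; the actual file data always flows directly between the client and the chunkservers.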

MapReduce Framework

Google uses this programming framework for a wide variety of large-scale data processing tasks.

The framework can be used by programmers with no experience of parallel and distributed systems, since it hides the details of parallelization.

In a traditional programming environment, data is moved to the program, which is time-consuming for large data sets; in the MapReduce framework, the program is moved to the data, which makes processing both faster and more scalable.

Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and thousands of MapReduce jobs are executed on Google’s clusters every day.

Since the programs are written in a functional style, they are automatically parallelized and executed on a large cluster of commodity servers.
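To make that functional style concrete, here is a minimal word-count sketch in Python. The map and reduce functions mirror the user-supplied functions in the classic word-count example; the small sequential driver beneath them only imitates, on a single machine, what the framework does in parallel across a cluster.

```python
from collections import defaultdict

# User-supplied functions in the spirit of the classic word-count example.
# In the real framework these run in parallel on many machines; the driver
# below is a single-machine stand-in for illustration.


def map_fn(doc_name: str, contents: str):
    """Map: emit (word, 1) for every word in the document."""
    for word in contents.split():
        yield word.lower(), 1


def reduce_fn(word: str, counts):
    """Reduce: sum all partial counts for a word."""
    return word, sum(counts)


def run_mapreduce(documents: dict):
    """Illustrative sequential driver: group intermediate pairs by key
    (the shuffle), then apply the reduce function to each group."""
    intermediate = defaultdict(list)
    for name, text in documents.items():
        for key, value in map_fn(name, text):
            intermediate[key].append(value)
    return dict(reduce_fn(k, v) for k, v in intermediate.items())


docs = {"doc1": "the quick brown fox", "doc2": "the lazy dog"}
print(run_mapreduce(docs))  # {'the': 2, 'quick': 1, 'brown': 1, ...}
```

In the actual framework the map calls run on machines that already hold the input splits and the reduce calls run after a distributed shuffle, which is exactly the "program goes to data" behaviour described above.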

Bigtable

Bigtable was developed by Google as a distributed storage system for managing structured data that scales to a very large size.

Databases of this kind are commonly referred to as NoSQL databases.

It is designed to store huge volumes of data across a cluster of machines. Data is organized into tables with rows and columns, but unlike a traditional relational database, Bigtable does not enforce a fixed schema on those columns.

Bigtable is a sparse, distributed, persistent, multidimensional sorted map.
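That one-line definition can be pictured as a map keyed by (row key, column key, timestamp). The toy class below is only a sketch of this data model, not of Bigtable's storage engine, and the class and method names are made up for illustration; the example row and column values follow the familiar web-page example.

```python
from bisect import insort

# Toy model of the Bigtable data model: a sparse, sorted map from
# (row key, column key, timestamp) to an uninterpreted byte string.
# This illustrates the abstraction only, not the real storage layer.


class ToyBigtable:
    def __init__(self):
        self.cells = {}     # (row, column, timestamp) -> value
        self.row_keys = []  # kept in sorted order, as rows are in Bigtable

    def put(self, row: str, column: str, timestamp: int, value: bytes):
        if row not in self.row_keys:
            insort(self.row_keys, row)  # rows stay lexicographically sorted
        self.cells[(row, column, timestamp)] = value

    def get(self, row: str, column: str):
        """Return the most recent version of a cell, if any."""
        versions = [(ts, v) for (r, c, ts), v in self.cells.items()
                    if r == row and c == column]
        return max(versions)[1] if versions else None


t = ToyBigtable()
t.put("com.cnn.www", "contents:", 3, b"<html>...")
t.put("com.cnn.www", "anchor:cnnsi.com", 9, b"CNN")
print(t.get("com.cnn.www", "contents:"))  # b'<html>...'
```

The map is "sparse" because most rows only populate a handful of columns, and "multidimensional" because every cell is addressed by row, column, and timestamp rather than by row alone.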