Big Data Job Roles – Roles and Responsibilities in Big Data Jobs

Big data drove an estimated $200 billion in IT spending in 2015. The main reason for this growth is the potential Chief Information Officers (CIOs) see in the greater insight and intelligence contained in huge volumes of unstructured data.

Analysis of unstructured data requires new systems of record—for example, NoSQL databases—which help an organization forecast better and align its strategic plans and initiatives.

Big Data job opportunities attract many experienced and talented software engineers who are technically proficient and, most importantly, passionate about what they do. Here are some of the job opportunities in the Big Data space:

Big Data Architect

A Big Data Architect is expected to organize, administer, manage, and govern Big Data on large clusters.

He also documents Big Data production environments involving petabytes of data. A Big Data Architect needs rich experience in Java, MapReduce, Hive, HBase, Pig, Sqoop, and so on.

He also administers Linux/Unix environments and designs Big Data architecture, covering cluster node configuration, NameNode/DataNode layout, connectivity, etc.

Big Data Developer

A Big Data developer is someone who enjoys programming and wants to make the most of it.

He needs hands-on experience in SQL, core Java, and a scripting language. Working knowledge of Big Data technologies such as Pig, Hive, Python, NoSQL databases, and Flume also helps accelerate career growth.

Data Scientists

Data scientist is another tech-savvy title of this century, one that is slowly replacing the title of Business Analyst.

Data scientists generate, evaluate, disseminate, and integrate the knowledge gathered and stored in big data environments; therefore they need in-depth knowledge of the business as well as the data.

They basically design intelligent analytic models, write algorithms, work with databases, write complex queries, and so on.

Data scientists differ from traditional data analysts in that they analyze data from many sources rather than relying on a single source.

They are also expected to have experience with SAS, SPSS, and programming languages such as R.

Hadoop Administrator

The prime role of a Hadoop administrator is administering Hadoop and its database systems.

A Hadoop Administrator should have extensive knowledge of hardware systems and Hadoop design principles.

They are responsible for maintaining large hardware clusters and should have strong scripting skills. Their core technologies include Hadoop, MapReduce, Hive, Linux, Java, and database administration.


Apart from the above key roles in the big data space, there are several other titles, ranging from Hadoop Analyst to Hadoop Engineer, Hadoop Trainer, Hadoop Consultant, and so on.

Big Data Applications in Business – Using big data to improve business

Efficient data analysis enables companies to optimize everything in the value chain – from sales to order delivery, to optimal store hours.

The tabular chart below shows the areas in which various businesses use big data applications to improve their business models.

Big data enables organizations to define key marketing strategies and is utilized in almost every industry sector.

Retail / E-commerce
01. Market basket analysis
02. Campaign & customer loyalty management programs
03. Supply chain management & analytics
04. Behavior tracking
05. Market and consumer segmentation
06. Recommendation engines (to increase order size through complementary products)
07. Cross-channel analytics
08. Individual targeting with the right offer at the right time

Financial Services
01. Real-time customer insights
02. Risk analysis and management
03. Fraud detection
04. Customer loyalty management
05. Credit risk modeling/analysis
06. Trade surveillance, detecting abnormal activities

IT Operations
01. Log analysis for pattern identification/process analysis
02. Massive storage and parallel processing
03. Data mashup to extract intelligence from data

Health & Life Sciences
01. Health-insurance fraud detection
02. Campaign management
03. Brand & reputation management
04. Patient care and service quality management
05. Gene mapping and analytics
06. Drug discovery

Communication, Media & Technology
01. Real-time call analysis
02. Network performance management
03. Social graph analysis
04. Mobile usage analysis

Governance
01. Compliance and regulatory analysis
02. Threat detection, crime prediction
03. Smart cities and e-governance
04. Energy management
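To make the first retail use case above concrete, here is a minimal sketch of market basket analysis: counting how often pairs of items are bought together. The basket data is invented for illustration; real systems would mine much larger transaction logs and compute support/confidence metrics on top of counts like these.

```python
from collections import Counter
from itertools import combinations

def pair_counts(baskets):
    """Count how often each pair of items appears in the same basket --
    the core computation behind a simple market basket analysis."""
    counts = Counter()
    for basket in baskets:
        # sorted(set(...)) dedupes items and gives a canonical pair order
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts

# Hypothetical transaction log: one list of items per checkout.
baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk", "bread"],
]
counts = pair_counts(baskets)
print(counts[("bread", "butter")])  # 2
```

Pairs with high counts relative to each item's individual frequency are candidates for the "complementary products" recommendations mentioned in the table.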

Big Data Innovations – Google File System, MapReduce, Big Table

In this post, you will learn about some of the key big data innovations: the Google File System, the MapReduce framework, and Big Table.

The Google File System

The Google File System (GFS) is one of the important big data innovations: a scalable distributed file system for data-intensive applications.

The Google File System delivers high aggregate performance by providing fault-tolerance mechanisms while running on inexpensive commodity hardware.

It has successfully met the storage needs within Google for processing data used by Google services as well as research and development efforts that require large data sets.

It provides hundreds of terabytes of storage across thousands of disks on over a thousand machines.

Constant monitoring, error detection, fault tolerance, and automatic recovery are integral to the google file system.

Since it deals with multi-GB files, each typically containing many application objects such as web documents, GFS revisits its assumptions about I/O operations and block sizes so that multi-gigabyte and terabyte files can be processed efficiently.

Appending becomes the focus of performance optimization and atomicity guarantees, while caching data blocks of huge files loses its appeal.

The architecture of GFS consists of a single master and multiple chunkservers, and it is accessed by multiple clients.

A file is divided into fixed-size chunks (64 MB). Each chunk is identified by a globally unique 64-bit chunk handle assigned by the master at the time of chunk creation, and the chunks are stored on commodity chunkservers.
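The chunking scheme can be sketched in a few lines. This is a toy model, not GFS itself: the handle here is a simple counter, whereas the real master assigns immutable, globally unique handles and also tracks replica locations.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # GFS uses a fixed 64 MB chunk size

def split_into_chunks(data: bytes, chunk_size: int = CHUNK_SIZE):
    """Split a file's bytes into fixed-size chunks, pairing each with a
    64-bit chunk handle (here a plain counter; in GFS the master assigns
    a globally unique handle at chunk-creation time)."""
    return [
        (handle, data[offset:offset + chunk_size])
        for handle, offset in enumerate(range(0, len(data), chunk_size))
    ]

# Toy file with a tiny chunk size so the split is easy to see.
chunks = split_into_chunks(b"abcdefghij", chunk_size=4)
print(chunks)  # [(0, b'abcd'), (1, b'efgh'), (2, b'ij')]
```

Note that only the last chunk may be shorter than the fixed size, which is why huge files waste little space despite the large chunk size.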

MapReduce Framework

Google uses this programming framework for different purposes.

Programmers can easily use this framework without knowledge of parallel and distributed systems, since it hides the details of parallelization.

In a traditional programming environment, data moves to the program, which is time-consuming; in the MapReduce framework, the program moves to the data, which is more performant and scalable.

Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and thousands of MapReduce jobs are executed on Google’s clusters every day.

Since the programs are written in a functional style, they are automatically parallelized and executed on a large cluster of commodity servers.

Big Table

Big Table was developed by Google as a distributed storage system intended to manage highly scalable structured data.

Databases that follow the Big Table data model are commonly called NoSQL databases.

It is intended to store huge volumes of data across a cluster. Data is organized into tables with rows and columns, but unlike a traditional relational model, the table is sparse and columns can vary from row to row.

Big Table is a sparse, distributed, persistent multidimensional sorted map.
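That definition can be made concrete with a toy model: a map keyed by (row key, column key, timestamp). The row and column names below are illustrative, loosely echoing the webtable example from Google's description; a real Big Table also shards the sorted map into tablets across many servers.

```python
# Toy model of Big Table's abstraction: a sparse, sorted map from
# (row key, column key, timestamp) to an uninterpreted value.
table = {}

def put(row, column, timestamp, value):
    table[(row, column, timestamp)] = value

def read_row(row):
    """Return all cells of one row, sorted by (column, timestamp),
    mirroring how Big Table keeps data sorted for efficient scans."""
    return sorted((key, val) for key, val in table.items() if key[0] == row)

put("com.example/index", "contents:", 1, "<html>v1</html>")
put("com.example/index", "contents:", 2, "<html>v2</html>")
put("com.example/index", "anchor:news.example", 1, "Example News")

for (row, col, ts), val in read_row("com.example/index"):
    print(col, ts, val)
```

The map is "sparse" because absent cells simply have no entry, and "multidimensional" because each value is addressed by row, column, and timestamp rather than by row alone.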

Big Data Layers – Data Source, Ingestion, Manage and Analyze Layer

There are four main big data layers.

A big data management architecture should be able to incorporate all possible data sources and provide a low Total Cost of Ownership (TCO).

Big Data technologies provide a concept of utilizing all available data through an integrated system.

You can choose either open source frameworks or packaged licensed products to take full advantage of the functionality of the various components in the stack. The various Big Data layers are discussed below:

[Figure: Different Layers of Big Data]

Data Sources Layer

The Data Sources layer comes at different scales – most notably, many companies now work in the multi-terabyte and even petabyte arena.

It incorporates structured, unstructured, and/or semi-structured data captured from transaction, interaction, and observation systems such as Facebook and Twitter.

This wide variety of data, arriving in huge volumes at high velocity, has to be seamlessly merged and consolidated so that the analytics engines, as well as the visualization tools, can operate on it as one single big data set.

Acquire/Ingestion Layer

The responsibility of this layer is to separate relevant information from the noise in the humongous data sets present at the different data access points.

This layer should have the ability to validate, cleanse, transform, reduce, and integrate the data into the big data tech stack for further processing.
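A minimal sketch of the validate/cleanse/reduce steps described above might look as follows. The record shape (dicts with `user` and `action` fields) is a hypothetical stand-in for whatever events an organization actually ingests; production pipelines would use tools such as Flume or Sqoop rather than hand-rolled code.

```python
def ingest(records):
    """Validate, cleanse, and reduce raw records before they enter the
    big data tech stack (hypothetical record shape for illustration)."""
    seen = set()
    clean = []
    for rec in records:
        user = rec.get("user", "").strip().lower()
        if not user:                 # validate: required field must exist
            continue
        if user in seen:             # reduce: drop duplicate users
            continue
        seen.add(user)
        # transform: normalized fields, defaults filled in
        clean.append({"user": user, "action": rec.get("action", "unknown")})
    return clean

raw = [
    {"user": "  Alice ", "action": "click"},   # needs cleansing
    {"user": "alice", "action": "click"},      # duplicate after cleansing
    {"action": "view"},                        # fails validation
]
print(ingest(raw))  # [{'user': 'alice', 'action': 'click'}]
```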

Once the relevant information is captured, it is sent to the manage layer, where the Hadoop Distributed File System (HDFS) stores it across multiple commodity servers.

Manage Layer

This layer is supported by the storage layer – robust and inexpensive physical infrastructure is fundamental to the operation and scalability of a big data architecture.

This layer also provides the tools and query languages to access the NoSQL databases using the HDFS storage file system sitting on top of the Hadoop physical infrastructure layer.

The data is no longer stored in a monolithic server where the SQL functions are applied to crunch it.

Redundancy is built into this infrastructure for the very simple reason that we are dealing with large volume of data from different sources.

The key building block of the Hadoop platform management layer is MapReduce programming, which executes a set of functions against a large amount of data in batch mode.

The map function does the distributed computation task while the reduce function combines all the elements back together to provide a result.

An example of a MapReduce program would be to determine how many times a particular word appears in a document.
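The word-count example can be sketched in plain Python. This single-process sketch only illustrates the map and reduce roles; in Hadoop the map tasks run in parallel where the data lives, and a shuffle phase groups intermediate pairs by key before the reducers run.

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit an intermediate (word, 1) pair for every word.
    for word in document.split():
        yield (word.lower(), 1)

def reduce_phase(pairs):
    # Reduce: sum the counts for each word after grouping by key.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

doc = "the quick brown fox jumps over the lazy dog the end"
counts = reduce_phase(map_phase(doc))
print(counts["the"])  # 3
```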

Analyze & Visualize Layer

This layer provides the data discovery mechanisms from the huge volume of data.

For the huge volume of data, we need fast search engines with iterative and cognitive approaches. Search engine results can be presented in various forms using “new age” visualization tools and methods.

Real-time analysis can leverage NoSQL stores (for example, Cassandra, MongoDB, and others) to analyze data produced by web-facing apps.

Big Data Challenges – Top challenges in big data analytics

There are multiple big data challenges that this great opportunity has thrown at us.

With the advent of the Internet of Things (IoT), efficient analytics and increased connectivity through new technology and software bring significant opportunities for companies.

However, we do see companies facing challenges in leveraging the value that data has to offer. Below are a few of the major Big Data challenges:

Meeting the need for speed (Processing Capabilities)

How do we match processing speed with the speed at which data is being generated?

How do we extract useful information out of the heap? One possible solution is hardware.

Some customers use increased memory and powerful parallel processing to crunch large volumes of data quickly.

Another method is putting data in-memory. This allows organizations to explore huge data volumes and gain business insights in near-real time.

Understanding the data

One of the basic challenges is to understand and prioritize the data coming from a variety of sources, where ninety percent of the data is noise.

We have to filter the valuable data out of the noise. This requires a good understanding of the data, so that visualization can be used effectively as part of data analysis.

One solution to this challenge is to have the proper domain expertise in place. The people who are analyzing the data should have a deep understanding of where the data comes from, what audience will be consuming the data and how that audience will interpret the information.

Addressing data quality and consistency

Even if you put the data in the proper context for the audience who will be consuming the information, the value of data for decision-making purposes will be jeopardized if the data is not accurate or timely.

Again, data visualization tools and techniques play an important role in assuring data quality.

Data access and connectivity

Data access and connectivity can be another obstacle.

Companies often do not have the right platforms to aggregate and manage the data across the enterprise as the majority of data points are not yet connected.

To overcome this obstacle of growing volumes of not-yet-connected data, companies like Accenture and Siemens have formed a joint venture that focuses on solutions and services for systems integration and data management.