Pig Tutorial – Hadoop Pig Introduction, Pig Latin, Use Cases, Examples

In this series, we will cover Pig tutorial. Apache Pig provides a platform for executing large data sets in a distributed fashion on the cluster of commodity machines.

Pig tutorial – Pig Latin Introduction

The language which is used to execute the data sets is called Pig Latin. It allows you to perform data transformation such as join, sort, filter, and grouping of records etc.

It is sort of ETL process for Big Data environment. It also facilitates users to create their own functions for reading, processing, and writing data.

Pig is an open source project developed by Apache consortium (http://pig.apache.org). Therefore, users are free to download it as the source or binary code.

Pig Latin programs run on Hadoop cluster and it makes use of both Hadoop distributed file system, as well as MapReduce programming layer.

However, for prototyping Pig Latin programs can also run in “local mode” without a cluster. All the process invoked during running the program in local mode resides in single local JVM.

Using Pig Latin, programmers can perform MapReduce tasks easily without having to type complex Java codes. The map, sort, shuffle and reduce phase while using pig Latin language can be taken care internally by the operators and functions you will use in pig script.

Basic “hello world program” using Apache Pig

The basic “hello world program” in Hadoop is the word count program.

The same example is explained in “Hadoop and HDFS” tutorial using JAVA map-reduce program.

Now, let’s look at using Pig Latin program. Let us consider our input is a text file with words delimited by space and lines terminated by ‘\n’ stored in “src.txt” file. Sample data of the file as below:

Old MacDonald had a farm
And on his farm he had some chicks
With a chick chick here
And a chick chick there
Here a chick there a chick
Everywhere a chick chick

The word count program for above sample data using Pig Latin is as below:

A = load 'src.txt' as line;

--TOKENIZE splits the line into a field for each word. e.g. ({(Old),(MacDonald),(had),(a),(farm)})

B = foreach A generate TOKENIZE(line) as tokens;

--Convert the output of TOKENIZE into separate record. e.g. (Old)

-- (MacDonald)

-- (had)

-- (a)

-- (farm)

C = foreach B generate FLATTEN(tokens) as words;

--We have to count each word occurrences, for that we have to group all the words.

D = group C by words;

E = foreach D generate group, COUNT(C);

F = order E by $1;

-- We can print the word count on console using Dump.

dump F;

Sample output will be as follows:

(MacDonald,1)
(had,1)

Pig Use Cases – Examples

Pig Use Case#1

The weblog can be processed using pig because it has a goldmine of information. Using this information we can analyze the overall server usages and improve the server performance.

We can create the Usage Tracking mechanism such as monitoring users, processes, preempting security attacks on your server.

We can also analyze the frequent errors and take a corrective measure to enhance user experience.

Pig Use Case#2

To know about the effectiveness of an advertisement is one of the important goals for any companies.

Many companies invest millions of dollars in buying the ads space and for them and it is critical to know how popular their advertisement both in physical and virtual space.

Gathering advertising information from multiple sources and analysis to understand the customer behavior and its effectiveness is one of the important goals for many companies. This can be easily achieved by using Pig Latin language.

Pig Use Case#3

Processing the healthcare information is one of the important use cases of Pig.

Neural Network for Breast Cancer Data Built on Google App Engine is one of the important application developed using pig and Hadoop.

Pig Use Case#4

Stock analysis using Pig Latin such as to calculate average dividend, Total trade estimation, Group the stocks and Join them.