Pig Operators – Pig Input, Output Operators, Pig Relational Operators

Input and output operators, relational operators, and the bincond operator are some of the Pig operators. Let us understand each of these one by one.

Pig Input Output Operators

Pig LOAD Operator (Input)

The first task for any data flow language is to provide the input. The LOAD operator in Pig is used for the input operation; it reads data from HDFS or the local file system.

By default, it expects a tab-delimited file.

For example, X = load '/data/hdfs/emp'; will look for the "emp" file in the directory "/data/hdfs/". If the directory path is not specified, Pig will look in your home directory on the HDFS file system.

If you are loading the data from another storage system, say HBase, then you need to specify the loader function for that particular storage system.

X = load 'emp' using HBaseStorage();

If we do not specify the loader function, Pig will use the default "PigStorage", which assumes a tab-delimited file.

If we have a file with fields separated by something other than a tab, then we need to explicitly pass the delimiter as an argument to the load function. Example as below:

X = load 'emp' using PigStorage(',');

Pig Latin also allows you to specify the schema of the data you are loading by using the "as" clause in the load statement.

X = load 'emp' as (ename, eno, sal, dno);

If we load the data without specifying a schema, the columns are addressed positionally as $0, $1, etc. While specifying the schema, we can also specify the datatype along with the column name. For Example:

X = load 'emp' as (ename:chararray, eno:int, sal:float, dno:int);

X = load 'hdfs://localhost:9000/pig_data/emp_data.txt' USING PigStorage(',') as (ename:chararray, eno:int, sal:float, dno:int);

Pig STORE Operator (Output)

Once the data is processed, you will want to write it somewhere. The "store" operator is used for this purpose. By default, Pig stores the processed data into HDFS in tab-delimited format.

store processed into '/data/hdfs/emp';

PigStorage is used as the default store function; otherwise, we can specify a store function explicitly depending upon the target storage.

store emp into 'emp' using HBaseStorage();

We can also specify the file delimiter while writing the data.

store emp into 'emp' using PigStorage(',');

Pig DUMP Operator (on command window)

If you wish to see the data on the screen or command window (grunt prompt), you can use the DUMP operator.

dump emp;

Pig Relational Operators

Pig FOREACH Operator

FOREACH loops through each tuple and generates new tuple(s). Let us suppose we have a file emp.txt kept in an HDFS directory. Sample data of emp.txt as below:

mak,101,5000.0,500.0,10
ronning,102,6000.0,300.0,20
puru,103,6500.0,700.0,10

First, we have to load the data into Pig, say through a relation named "emp_details".

grunt> emp_details = LOAD 'emp' USING PigStorage(',') as (ename:chararray, eno:int, sal:float, bonus:float, dno:int);

Now we need to get the ename, eno and dno for each employee from the relation emp_details and store it into another relation named employee_foreach.

grunt> employee_foreach = FOREACH emp_details GENERATE ename,eno,dno;

Verify the relation "employee_foreach" using the DUMP operator.

grunt> DUMP employee_foreach;

Standard arithmetic operations for integers and floating point numbers are supported in the FOREACH relational operator.

grunt> emp_details = LOAD 'emp' USING PigStorage(',') as (ename:chararray, eno:int, sal:float, bonus:float, dno:int);

grunt> emp_total_sal = foreach emp_details GENERATE sal+bonus;

grunt> emp_total_sal1 = foreach emp_details GENERATE $2+$3;

emp_total_sal and emp_total_sal1 give you the same output. References by position are useful when the schema is unknown or undeclared. Positional references start from 0 and are preceded by the $ symbol.
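
For instance, a minimal sketch (assuming the same comma-delimited emp file) that loads the data without a schema and projects fields by position:

no_schema = load 'emp' using PigStorage(',');
names_and_dno = foreach no_schema generate $0, $4;  -- ename and dno by position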

A range of fields can also be accessed by using the double dot (..) notation. For Example:

grunt> emp_details = LOAD 'emp' USING PigStorage(',') as (ename:chararray, eno:int, sal:float, bonus:float, dno:int);

grunt> beginning = FOREACH emp_details GENERATE ..sal;

The output of the above statement will generate the values for the columns ename, eno, sal.

grunt> middle = FOREACH emp_details GENERATE eno..bonus;

The output of the above statement will generate the values for the columns eno, sal, bonus.

grunt> end = FOREACH emp_details GENERATE bonus..;

The output of the above statement will generate the values for the columns bonus, dno.

Bincond or Boolean Test

The binary conditional operator is also referred to as the "bincond" operator. Let us understand it with the help of an example.

Consider 5 == 5 ? 1 : 2. It begins with the Boolean test followed by the symbol "?". If the Boolean condition is true, it returns the first value after "?"; otherwise, it returns the value after the ":". Here the Boolean condition is true, hence the output will be "1".

For 5 == 6 ? 1 : 2, the output will be "2".
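
As a sketch of how bincond is typically used inside a FOREACH (assuming the emp_details relation loaded earlier in this post; the 6000.0 salary threshold is just an illustrative value):

sal_band = FOREACH emp_details GENERATE ename, (sal > 6000.0f ? 'high' : 'low');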

We have to use the projection operator for complex data types. If you reference a key that does not exist in the map, the result is null. For Example:

student_details = LOAD 'student' as (sname:chararray, sclass:chararray, rollnum:int, stud:map[]);

avg = FOREACH student_details GENERATE stud#'student_avg';

For maps, the projection operator is # (the hash), followed by the name of the key as a string. Here, 'student_avg' is the name of the key and 'stud' is the name of the column/field.

Pig FILTER Operator

The FILTER operator allows you to select the required tuples based on a predicate clause. Let us consider the same emp file. Our requirement is to filter the records with department number (dno) = 10.

grunt> filter_data = FILTER emp_details BY dno == 10;

If you dump the "filter_data" relation, the output on your screen will be as below:

mak,101,5000.0,500.0,10

puru,103,6500.0,700.0,10

We can combine multiple filter conditions using the Boolean operators "and" and "or". Pig also supports regular expressions to match values present in the file. For example, if we want all the records whose ename starts with 'ma', we can use the expression:

grunt> filter_ma = FILTER emp_details by ename matches 'ma.*';

The filter passes only those records for which the predicate evaluates to 'true'.

It is important to note that if, say, z == null, then the result is null, which is neither true nor false.

Let us suppose a field x holds the values 1, 8 and null. If the filter is x == 8, only 8 is returned. If the filter is x != 8, only 1 is returned.

We can see that null is not passed through in either case. Therefore, to work with null values we use the 'is null' or 'is not null' operator.
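
A minimal sketch (assuming the emp_details relation from above, where some rows might carry a null bonus):

with_bonus = FILTER emp_details BY bonus is not null;  -- null-bonus rows are dropped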

Pig GROUP Operator

The Pig GROUP operator works fundamentally differently from the GROUP BY we use in SQL.

It basically collects records with the same key together in one bag. In SQL, the group by clause creates groups of values that are fed into one or more aggregate functions, whereas in Pig Latin it just groups all the records together and puts them into one bag.

Hence, in Pig Latin there is no direct connection between group and the aggregate functions.

grunt> emp_details = LOAD 'emp' USING PigStorage(',') as (ename:chararray, eno:int, sal:float, bonus:float, dno:int);

grunt> grpd = GROUP emp_details BY dno;

grunt> cnt = FOREACH grpd GENERATE group, COUNT(emp_details);
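
With the three sample emp records shown earlier (two rows in department 10, one in department 20), dumping cnt should produce output along these lines:

grunt> DUMP cnt;
(10,2)
(20,1)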

Pig ORDER BY Operator

The Pig ORDER BY operator is used to display the result of a relation in sorted order based on one or more fields. For Example:

grunt> Order_by_ename = ORDER emp_details BY ename ASC;
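
We can also sort on multiple fields with mixed directions; a sketch against the same relation:

grunt> order_by_sal = ORDER emp_details BY sal DESC, ename ASC;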

Pig DISTINCT Operator

This is used to remove duplicate records from the file. It doesn't work on individual fields; rather, it works on entire records.

grunt> unique_records = distinct emp_details;

Pig LIMIT Operator

LIMIT allows you to restrict the number of records displayed from a file.

grunt> emp_details = LOAD 'emp';

grunt> first50 = LIMIT emp_details 50;
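
Since Pig gives no guarantee about which rows LIMIT returns, it is often combined with ORDER to get a deterministic top-N; a sketch assuming emp_details was loaded with the schema used earlier:

grunt> by_sal = ORDER emp_details BY sal DESC;
grunt> top3 = LIMIT by_sal 3;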

Pig SAMPLE Operator

The SAMPLE operator allows you to get a sample of data from the whole data set, i.e. it returns a percentage of rows. It takes a value between 0 and 1; if it is 0.2, it indicates 20% of the data.

grunt> emp_details = LOAD 'emp';

grunt> sample20 = SAMPLE emp_details 0.2;

Pig PARALLEL

The Pig PARALLEL clause is used for parallel data processing. It sets the number of reducers at the operator level.

We can include the PARALLEL clause with any operator that has a reduce phase, such as DISTINCT, JOIN, GROUP, COGROUP, ORDER BY, etc.

For Example: SET DEFAULT_PARALLEL 10; 

This means that all MapReduce jobs launched by the script will run with 10 parallel reducers at a time.

It is important to note that PARALLEL only sets the reducer parallelism, whereas mapper parallelism is controlled by the MapReduce engine itself.
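
As a sketch, the clause can also be attached to a single operator instead of being set globally:

grunt> grpd = GROUP emp_details BY dno PARALLEL 10;  -- 10 reducers for this GROUP only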

Pig FLATTEN Operator

Pig FLATTEN removes a level of nesting from tuples as well as bags. For Example: we have a tuple in the form (1, (2,3)).

GENERATE $0, FLATTEN($1) will transform the tuple into (1,2,3).

When we un-nest a bag using the FLATTEN operator, it creates new tuples, one per element of the bag. For Example: we have the bag (1,{(2,3),(4,5)}).

GENERATE $0, FLATTEN($1) then creates the tuples (1,2,3) and (1,4,5).
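
As a sketch, flattening after a GROUP (reusing the grpd relation from the GROUP section above) turns each (key, bag) row back into one row per employee name:

flat_names = FOREACH grpd GENERATE group, FLATTEN(emp_details.ename);  -- one (dno, ename) row per employee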

Pig COGROUP Operator

The Pig COGROUP operator works in the same way as the GROUP operator. The only difference is that GROUP works on a single relation, while COGROUP is used when we have more than one relation.

Let us suppose we have the below two files with their data sets:

student_details.txt 

101,Kum May,29,9010101010,Bangalore
102,Abh Nig,24,9020202020,Delhi
103,Sum Nig,24,9030303030,Delhi

employee_details.txt 

101,Nancy,22,London
102,Martin,24,Newyork
103,Romi,23,Tokyo

Now, let us load these two files into the relations student_details and employee_details and cogroup the records by age.

grunt> cogroup_final = COGROUP employee_details by age, student_details by age; 

Output as below:

(22, {(101, Nancy, 22, London)}, {})

(23, {(103, Romi, 23, Tokyo)}, {})

(24, {(102, Martin, 24, Newyork)}, {(102, Abh Nig, 24, 9020202020, Delhi), (103, Sum Nig, 24, 9030303030, Delhi)})

(29, {}, {(101, Kum May, 29, 9010101010, Bangalore)})

Pig SPLIT Operator

The Pig SPLIT operator is used to split a single relation into more than one relation, depending upon the conditions you provide.

Let us suppose we have emp_details as one relation. We have to split the relation based on the department number (dno). Sample data of emp_details as below:

mak,101,5000.0,500.0,10

ronning,102,6000.0,300.0,20

puru,103,6500.0,700.0,10

jetha,103,6500.0,700.0,30

grunt> SPLIT emp_details INTO emp_details1 IF dno == 10, emp_details2 IF (dno == 20 OR dno == 30);

grunt> DUMP emp_details1;

mak,101,5000.0,500.0,10

puru,103,6500.0,700.0,10

grunt> DUMP emp_details2;

ronning,102,6000.0,300.0,20

jetha,103,6500.0,700.0,30

Pig Latin Introduction – Examples, Pig Data Types

In the following post, we will learn about Pig Latin and Pig Data types in detail.

Pig Latin Overview

Pig Latin provides a platform to non-Java programmers where each processing step results in a new data set, or relation.

For example, X = load 'emp'; Here "X" is the name of the relation or new data set that is fed by loading the data set "emp". "X", the name of the relation, is not a variable, although it seems to act like one.

Once the assignment is made to a given relation, say "X", it is permanent. We can reuse the relation name in other steps as well, but it is not advisable to do so, for the sake of script readability.

For Example:

X = load 'emp';
X = filter X by sal > 10000.0;
X = foreach X generate Ename;

Here, at each step, "X" is not reassigned; rather, a new data set is created at each step.

Pig Latin also has a concept of fields or columns. In the above example, "sal" and "Ename" are termed fields or columns.

It is also important to know that keywords in Apache Pig Latin are not case sensitive.

For example, LOAD is equivalent to load. But relation and column names are case sensitive. For example, X = load 'emp'; is not equivalent to x = load 'emp';

For multi-line comments in Apache Pig scripts we use "/* … */", and for single-line comments we use "--".
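
A tiny sketch showing both comment styles:

/* a multi-line comment:
   this script loads the comma-delimited employee file */
emp = load 'emp' using PigStorage(',');  -- a single-line comment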

Pig Data Types

Pig Scalar Data Types

  • Int (signed 32-bit integer)
  • Long (signed 64-bit integer)
  • Float (32-bit floating point)
  • Double (64-bit floating point)
  • Chararray (character array (string) in UTF-8)
  • Bytearray (binary object)

Pig Complex Data Types

Map

A map is a collection of key-value pairs.

Key-value pairs are separated by the pound sign #. The "key" must be a chararray datatype and should be a unique value, whereas the "value" can be of any datatype.

For example:

[1#Honda, 2#Toyota, 3#Suzuki], [name#Mak, phone#99845, age#29]

Tuple

A tuple is similar to a row in SQL, with the fields resembling SQL columns.

In other words, tuples are an ordered set of fields formed by grouping scalar data types. Two consecutive tuples need not contain the same number of fields.

For example:

(mak, 29, 4000.0)

Bag

A bag is formed by the collection of tuples. A bag can have duplicate tuples.

If Pig tries to access a field that does not exist, a null value is substituted.

For Example:

{(a), (b), (mak, 29, 4000.0)}

(BigData, {(Hadoop), (Mapreduce), (Pig), (Hive)})
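
As a sketch, a hypothetical load statement declaring all three complex types in one schema (the 'student' file and its field names are illustrative):

stud = LOAD 'student' AS (sname:chararray,
                          scores:bag{t:(subject:chararray, marks:int)},
                          info:map[]);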

NULLS

A null data element in Apache Pig is just the same as the SQL null data element. A null value in Apache Pig means the value is unknown.

Apache Pig Installation – Execution, Configuration and Utility Commands

Apache Pig Installation can be done on a local machine or on a Hadoop cluster. To install Apache Pig, download the package from the Apache Pig releases page.

You can also download the Pig package from Cloudera or from Apache's Maven repository.

Pig does not need the Hadoop cluster for installation; it runs on the machine from which we launch Hadoop jobs. It can also be installed on your local desktop or laptop for prototyping purposes, in which case Apache Pig runs in local mode. If your desktop or laptop can access the Hadoop cluster, then you can install the Pig package there as well.

The Pig package is written in Java and is hence portable across operating systems, but the pig launch script is a bash script, so it requires a UNIX/Linux operating system.

Once you have downloaded Pig, place it in the directory of your choice and untar it using the below command:

tar -xvf <pig_downloaded_tar_file>

Apache Pig Execution

You can execute Pig in the following three ways:

  1. Interactive mode (local)

pig -x local

Runs in a single JVM and all files are on the local file system.

  2. Interactive mode in the Hadoop File System

pig -x mapreduce

Runs on the Hadoop cluster; this is the default mode, so running just pig is equivalent.

  3. Script mode

pig -x local myscript.pig
pig myscript.pig

A script is a text file of Pig Latin statements and can be run in local or MapReduce mode.

Command Line Options and their Configurations

Pig provides a wide variety of command line options. Below are a few of them:

-h or --help

It will list all the available command line options.

-e or --execute

If you want to execute a single command through Pig, you can use this option, e.g. pig -e fs -ls will list the home directory.

-P or --propertyfile

It is used to specify a property file that a pig script should read.

The below tabular chart shows the return codes used by pig along with their description.

Value  Description
0      Success
1      Retriable failure
2      Failure
3      Partial failure – used with multi-query
4      Illegal arguments passed to Pig
5      IOException thrown – usually thrown by a UDF
6      PigException thrown – usually thrown by a Python UDF
7      ParseException thrown – in case of variable substitution
8      An unexpected exception

Grunt

The interactive shell of Apache Pig is called Grunt. It provides a shell for users to interact with HDFS using Pig Latin.

Once you enter the below command in the Unix environment where the Pig package is installed,

pig -x local

The output will be:

grunt>

To exit the grunt shell, type 'quit' or press Ctrl-D.

HDFS commands in the grunt shell can be accessed using the keyword 'fs'. The dash (-) is a mandatory part of the subcommand, just as when Hadoop fs is used.

grunt> fs -ls
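
A few more sketches of the same pattern, using standard hadoop fs subcommands:

grunt> fs -mkdir /data
grunt> fs -copyFromLocal emp.txt /data/emp
grunt> fs -cat /data/emp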

Utility Commands for controlling Pig from grunt shell

kill jobid

You can find the job's ID by looking at Hadoop's JobTracker GUI. The above command can be used to kill a Pig job based on its job id.

exec

The exec command runs a Pig script in batch mode, with no interaction between the script and the Grunt shell.

Example as below:

grunt> exec script.pig

grunt> exec -param p1=myparam1 -param p2=myparam2 script.pig

run

Issuing a run command on the grunt shell has basically the same effect as typing the statements manually.

Run and exec commands are useful for debugging because you can modify a Pig script in an editor and then rerun the script in the Grunt shell without leaving the shell.
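
For example, a sketch of running a script interactively; unlike exec, the aliases defined inside script.pig remain available in the shell afterwards:

grunt> run script.pig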

Pig Tutorial – Hadoop Pig Introduction, Pig Latin, Use Cases, Examples

In this series, we will cover the Pig tutorial. Apache Pig provides a platform for processing large data sets in a distributed fashion on a cluster of commodity machines.

Pig tutorial – Pig Latin Introduction

The language used to process the data sets is called Pig Latin. It allows you to perform data transformations such as join, sort, filter, and grouping of records.

It is a sort of ETL process for the Big Data environment. It also allows users to create their own functions for reading, processing, and writing data.

Pig is an open source project developed by the Apache Software Foundation (http://pig.apache.org). Therefore, users are free to download it as source or binary code.

Pig Latin programs run on a Hadoop cluster and make use of both the Hadoop distributed file system and the MapReduce programming layer.

However, for prototyping, Pig Latin programs can also run in "local mode" without a cluster. In this mode, all the processes invoked while running the program reside in a single local JVM.

Using Pig Latin, programmers can perform MapReduce tasks easily without having to write complex Java code. The map, sort, shuffle, and reduce phases are taken care of internally by the operators and functions you use in the Pig script.

Basic “hello world program” using Apache Pig

The basic “hello world program” in Hadoop is the word count program.

The same example is explained in the "Hadoop and HDFS" tutorial using a Java MapReduce program.

Now, let's look at the Pig Latin version. Let us consider our input to be a text file with words delimited by spaces and lines terminated by '\n', stored in the file "src.txt". Sample data of the file as below:

Old MacDonald had a farm
And on his farm he had some chicks
With a chick chick here
And a chick chick there
Here a chick there a chick
Everywhere a chick chick

The word count program for the above sample data using Pig Latin is as below:

A = load 'src.txt' as (line:chararray);

--TOKENIZE splits the line into a field for each word. e.g. ({(Old),(MacDonald),(had),(a),(farm)})

B = foreach A generate TOKENIZE(line) as tokens;

--Convert the output of TOKENIZE into separate record. e.g. (Old)

-- (MacDonald)

-- (had)

-- (a)

-- (farm)

C = foreach B generate FLATTEN(tokens) as words;

--To count each word's occurrences, we have to group all the words.

D = group C by words;

E = foreach D generate group, COUNT(C);

F = order E by $1;

-- We can print the word count on console using Dump.

dump F;

Sample output will be as follows:

(MacDonald,1)
(had,1)

Pig Use Cases – Examples

Pig Use Case#1

Weblogs can be processed using Pig because they contain a goldmine of information. Using this information, we can analyze overall server usage and improve server performance.

We can create usage tracking mechanisms, such as monitoring users and processes, and preempting security attacks on the server.

We can also analyze frequent errors and take corrective measures to enhance the user experience.

Pig Use Case#2

Knowing the effectiveness of an advertisement is one of the important goals for any company.

Many companies invest millions of dollars in buying ad space, and it is critical for them to know how popular their advertisements are, both in physical and virtual space.

Gathering advertising information from multiple sources and analyzing it to understand customer behavior and ad effectiveness is an important goal for many companies. This can be easily achieved using the Pig Latin language.

Pig Use Case#3

Processing healthcare information is one of the important use cases of Pig.

The Neural Network for Breast Cancer Data built on Google App Engine is one notable application developed using Pig and Hadoop.

Pig Use Case#4

Stock analysis using Pig Latin, such as calculating the average dividend, estimating total trades, grouping stocks, and joining them.