You can also download the Pig package from Cloudera's or Apache's Maven repository.
Pig does not need to be installed on the Hadoop cluster itself; it runs on the machine from which you launch Hadoop jobs. It can also be installed on your local desktop or laptop for prototyping, in which case Apache Pig runs in local mode. If your desktop or laptop can access the Hadoop cluster, you can install the Pig package there as well.
The Pig package is written in Java and is therefore portable across operating systems; however, its launcher is a bash script, so it requires a UNIX/Linux operating system.
Once you have downloaded Pig, place the tarball in a directory of your choice and untar it with the command below:
tar -xvf <pig_downloaded_tar_file>
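After untarring, it is common to add Pig's bin directory to your PATH so the `pig` command is available everywhere. A minimal sketch, assuming the tarball was extracted to `/opt/pig-0.17.0` (a placeholder path; adjust it to your own location):

```shell
# Placeholder install location -- point this at wherever you extracted the tarball.
export PIG_HOME=/opt/pig-0.17.0
# Make the pig launcher script available on the command line.
export PATH="$PATH:$PIG_HOME/bin"
# Running `pig -version` should now print the installed version.
```

Add these lines to your shell profile (e.g. `~/.bashrc`) to make the setting permanent.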
Apache Pig Execution
You can execute a Pig script in the following three ways:

Batch mode in local mode:

pig -x local <script.pig>

Interactive mode (the Grunt shell), from which you can also work with the Hadoop file system:

pig -x local

Batch mode on the Hadoop cluster (MapReduce mode):

pig myscript.pig
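To try the batch modes above you need a script file. A minimal sketch, assuming a hypothetical `wordcount.pig` and `input.txt` (both names are examples, not from the original text):

```shell
# Create a tiny Pig Latin script (hypothetical file names).
cat > wordcount.pig <<'EOF'
-- Load each line of the input file as a single chararray field.
lines = LOAD 'input.txt' AS (line:chararray);
-- Keep only non-empty lines.
nonempty = FILTER lines BY line != '';
-- Print the surviving tuples to the console.
DUMP nonempty;
EOF

# Run it in local mode (no Hadoop cluster needed):
#   pig -x local wordcount.pig
```

The same script runs unchanged in MapReduce mode with `pig wordcount.pig`, provided `input.txt` exists in HDFS.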
Command-Line Options and Configuration
Pig provides a wide variety of command-line options. Below are a few of them:
-h or -help
It will list all the available command line options.
-e or -execute
If you want to execute a single command through Pig, use this option. For example, pig -e fs -ls will list the contents of your home directory.
-P or -propertyFile

It is used to specify a property file that Pig should read.
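A sketch of what such a property file might look like. The file name `pig.properties` is an assumption, and the keys shown (`stop.on.failure`, `pig.logfile`) are standard Pig properties, but verify them against your Pig version's conf/pig.properties:

```shell
# Create a hypothetical property file for Pig.
cat > pig.properties <<'EOF'
# Stop processing remaining queries when one fails (multiquery mode).
stop.on.failure=true
# Where Pig writes its error log.
pig.logfile=/tmp/pig-err.log
EOF

# Pass the file to Pig with -P:
#   pig -P pig.properties script.pig
```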
The table below shows the return codes used by Pig, along with their descriptions:

| Return code | Description |
| --- | --- |
| 3 | Partial failure; used with multiquery |
| 4 | Illegal arguments passed to Pig |
| 5 | IOException thrown; usually thrown by a UDF |
| 6 | PigException thrown; usually thrown by a Python UDF |
| 7 | ParseException thrown; in case of variable substitution |
| 8 | An unexpected exception |
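These return codes are useful for wrapping Pig in automation. A minimal sketch of a wrapper that interprets the documented codes; the helper name `describe_pig_rc` is hypothetical:

```shell
# Map a Pig exit code to its documented meaning (hypothetical helper).
describe_pig_rc() {
  case "$1" in
    3) echo "Partial failure (multiquery)" ;;
    4) echo "Illegal arguments passed to Pig" ;;
    5) echo "IOException thrown (usually by a UDF)" ;;
    6) echo "PigException thrown (usually by a Python UDF)" ;;
    7) echo "ParseException thrown (variable substitution)" ;;
    8) echo "Unexpected exception" ;;
    *) echo "Unknown return code: $1" ;;
  esac
}

# Typical usage after a batch run:
#   pig myscript.pig
#   describe_pig_rc $?
```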
Apache Pig's interactive shell is called Grunt. It provides a shell for users to interact with HDFS and run Pig Latin statements.
Enter the command below in the Unix environment where the Pig package is installed:

pig -x local

You will be dropped into the Grunt shell at the grunt> prompt. To exit the Grunt shell, type 'quit' or press Ctrl-D.
HDFS commands can be run in the Grunt shell using the keyword 'fs'. The dash (-) is mandatory, just as it is with hadoop fs, e.g. fs -ls.
Utility Commands for Controlling Pig from the Grunt Shell
You can find a job's ID in Hadoop's JobTracker GUI. The Grunt shell's kill command, kill <job_id>, can then be used to kill a Pig job based on that job ID.
Use the exec command to run a Pig script in batch mode, with no interaction between the script and the Grunt shell.
For example:

grunt> exec script.pig

grunt> exec -param p1=myparam1 -param p2=myparam2 script.pig
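For the parameterized form to do anything, the script must reference the parameters with `$name` placeholders. A sketch of such a script; the file name `script.pig` matches the example above, but the contents and parameter uses (`$p1` as input path, `$p2` as output path) are assumptions:

```shell
# Hypothetical parameterized Pig Latin script: $p1 and $p2 are filled in
# by the -param options passed to exec.
cat > script.pig <<'EOF'
-- $p1 is substituted with the input path at run time.
data = LOAD '$p1' AS (line:chararray);
-- $p2 is substituted with the output directory.
STORE data INTO '$p2';
EOF

# From the Grunt shell:
#   grunt> exec -param p1=input.txt -param p2=outdir script.pig
```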
Issuing a run command in the Grunt shell has essentially the same effect as typing the script's statements in manually.
The run and exec commands are useful for debugging because you can modify a Pig script in an editor and then rerun it in the Grunt shell without ever leaving the shell.