Tuesday, 25 December 2012

Big Data - An Introduction to Pig

In my previous article, I explained Big Data and Hadoop in detail. In this article I would like to go a little deeper into Pig. Pig is a high-level platform for creating MapReduce programs used with Hadoop. Pig is made up of two components: the first is the language itself, which is called Pig Latin, and the second is a runtime environment where Pig Latin programs are executed. Pig Latin can be extended using UDFs (User Defined Functions), which the user can write in Java, Python, or JavaScript and then call directly from the language.
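To give a feel for what a Python UDF might look like, here is a minimal sketch. The function name normalize_language and the file name lang_udf.py are hypothetical. As far as I know, Pig's Python UDF support uses an outputSchema decorator to declare the return type; the fallback decorator below is my own stand-in so the file also runs as plain Python outside Pig.

```python
try:
    # Provided by Pig when the file is registered as a streaming Python UDF.
    from pig_util import outputSchema
except ImportError:
    def outputSchema(schema):
        """Stand-in decorator (assumption) so this sketch runs outside Pig."""
        def decorator(func):
            func.outputSchema = schema
            return func
        return decorator


@outputSchema("lang:chararray")
def normalize_language(code):
    """Return a trimmed, lower-cased language code, or None for missing input."""
    if code is None:
        return None
    return code.strip().lower()


# Hypothetical use from Pig Latin (not executed here):
#   REGISTER 'lang_udf.py' USING streaming_python AS lang_udf;
#   clean = FOREACH L GENERATE lang_udf.normalize_language(iso_language_code);
```

Outside Pig, the function can be called like any other Python function, which makes it easy to test before registering it.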

Pig was initially developed at Yahoo! Research around 2006. The whole intention behind developing Pig was to allow people using Hadoop to focus more on analyzing large data sets and spend less time writing mapper and reducer programs. Just as pigs eat almost anything, the Pig programming language is designed to handle any kind of data, and that is the reason Yahoo! named it Pig.

The first step in a Pig program is to LOAD the data you want to manipulate from HDFS. Then you run the data through a set of transformations (which, under the covers, are translated into a set of mapper and reducer tasks). Finally, you DUMP the data to the screen or you STORE the results in a file somewhere.

Let us talk about LOAD, TRANSFORM, DUMP and STORE in detail.

The objects that Hadoop works on are stored in HDFS. For a Pig program to access this data, the program must first tell Pig what file (or files) it will use, which is done with the LOAD 'data_file' command (where 'data_file' specifies either an HDFS file or a directory). If a directory is specified, all the files in that directory are loaded into the program. If the data is stored in a file format that is not natively accessible to Pig, you can optionally add a USING clause to the LOAD statement to specify a user-defined function that can read in and interpret the data.

The transformation logic is where all the data manipulation happens. Here we can FILTER out rows that are not of interest, JOIN two sets of data files, GROUP data to build aggregations, ORDER results, and much more. The following is an example of a Pig program that takes a file composed of Facebook comments, selects only those comments that are in English, then groups them by the user who is commenting, and displays the sum of the number of re-comments of that user’s comments.

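-- Load the comments; the AS schema (field order and types) is an assumption about the file layout.
L = LOAD 'hdfs://node/facebook_comment' AS (from_user:chararray, iso_language_code:chararray, recomments:int);
-- Keep only the comments written in English.
FL = FILTER L BY iso_language_code == 'en';
-- Group by the commenting user and total the re-comments per user.
G = GROUP FL BY from_user;
RT = FOREACH G GENERATE group, SUM(FL.recomments);
DUMP RT;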
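For readers who find it easier to follow in a general-purpose language, here is a rough plain-Python sketch of what the script above computes. The sample rows and the field order (from_user, iso_language_code, recomments) are made up for illustration; real Pig would run the same logic as distributed map and reduce tasks rather than a single loop.

```python
from collections import defaultdict

# Hypothetical sample rows: (from_user, iso_language_code, recomments)
comments = [
    ("alice", "en", 3),
    ("bob", "fr", 5),
    ("alice", "en", 2),
    ("carol", "en", 4),
]


def sum_recomments_by_user(rows, language="en"):
    """Filter rows by language, group by user, and sum re-comments per user."""
    totals = defaultdict(int)
    for from_user, iso_language_code, recomments in rows:
        if iso_language_code == language:    # the FILTER step
            totals[from_user] += recomments  # the GROUP and SUM steps
    return dict(totals)


result = sum_recomments_by_user(comments)
print(result)  # {'alice': 5, 'carol': 4}
```

Bob's French comment is dropped by the filter, and Alice's two English comments are added together.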

If we don’t specify a DUMP or STORE command, the results of a Pig program are not generated. While debugging our Pig programs, we typically use the DUMP command to send the output to the screen. When we go into production, we simply change the DUMP call to a STORE call so that the results of running our programs are stored in a file for further processing or analysis. Please note that the DUMP command can be used anywhere in our program to dump intermediate result sets to the screen, which is very helpful for debugging.
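The idea that nothing runs until DUMP or STORE is reached can be sketched with a toy Python class. This is only an illustration of lazy execution, not how Pig is implemented; the class LazyRelation and its methods are hypothetical names.

```python
class LazyRelation:
    """Toy stand-in for a Pig relation: records steps, runs them only on dump/store."""

    def __init__(self, rows, steps=()):
        self._rows = rows
        self._steps = tuple(steps)

    def filter(self, predicate):
        # Only record the step; nothing is computed yet.
        return LazyRelation(self._rows, self._steps + (predicate,))

    def _run(self):
        # Execute every recorded step, in order.
        rows = list(self._rows)
        for predicate in self._steps:
            rows = [row for row in rows if predicate(row)]
        return rows

    def dump(self):
        """Like DUMP: run the recorded steps and print the rows."""
        rows = self._run()
        for row in rows:
            print(row)
        return rows

    def store(self, path):
        """Like STORE: run the recorded steps and write tab-separated rows."""
        rows = self._run()
        with open(path, "w") as handle:
            for row in rows:
                handle.write("\t".join(str(field) for field in row) + "\n")
        return len(rows)


rows = [("alice", "en"), ("bob", "fr")]
english = LazyRelation(rows).filter(lambda row: row[1] == "en")  # nothing runs yet
english.dump()  # the filter executes only now
```

Swapping the final dump() for store(path) mirrors the move from debugging to production described above.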

How to run a Pig program
Once our Pig program is ready, we need to run it in the Hadoop environment. There are three ways to run a Pig program:
  1.  Embedded in a script
  2.  Embedded in a Java program
  3.  From Grunt (the Pig command line)

It doesn’t matter which of the three ways we choose to run the program: the Pig runtime environment translates the program into a set of map and reduce tasks and runs them on our behalf. I will talk about Python in my next blog; till then, happy reading.
