In my previous article, I explained Big Data and Hadoop in detail. In this article I would like to go a little deeper with Pig. Pig is a high-level platform for creating MapReduce programs used with Hadoop. Pig is made up of two components: the first is the language itself, which is called Pig Latin, and the second is a runtime environment where Pig Latin programs are executed. Pig Latin can be extended using UDFs (User Defined Functions), which the user can write in Java, Python or JavaScript and then call directly from the language.
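For example, a Java UDF packaged in a jar can be registered and then called just like a built-in function. The jar name and the ToUpper class below are hypothetical, purely to sketch the mechanism:
REGISTER myudfs.jar;                                                -- hypothetical jar holding the compiled Java UDF
comments = LOAD 'facebook_comment' AS (comment_text:chararray);     -- assumed single-column input
upper_case = FOREACH comments GENERATE myudfs.ToUpper(comment_text); -- the UDF is called like any built-in function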
We know Pig was initially developed at Yahoo! Research in 2006. The whole intention behind developing Pig was to allow people using Hadoop to focus more on analyzing large data sets and spend less time writing mapper and reducer programs. Just as pigs eat almost anything, the Pig programming language is designed to handle any kind of data, and for that very same reason Yahoo! named it Pig.
The first step in a Pig program is to LOAD the data you want to manipulate from HDFS. Then you run the data through a set of transformations (which, under the covers, are translated into a set of mapper and reducer tasks). Finally, you DUMP the data to the screen or you STORE the results in a file somewhere.
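As a minimal sketch of that three-phase shape (the input path and field references are placeholders, not from a real data set):
raw      = LOAD 'hdfs://node/some_input';        -- 1. load the data from HDFS
filtered = FILTER raw BY $0 IS NOT NULL;         -- 2. transform it (translated into map and reduce tasks)
DUMP filtered;                                   -- 3. dump to the screen (or STORE into a file)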
Let us talk about LOAD, TRANSFORM, DUMP and STORE in detail.
LOAD
The objects that are being worked on by Hadoop are stored in HDFS. In order for a Pig program to access this data, the program must first tell Pig what file (or files) it will use, and that is done through the LOAD 'data_file' command (where 'data_file' specifies either an HDFS file or directory). If a directory is specified, all the files in that directory will be loaded into the program. If the data is stored in a file format that is not natively accessible to Pig, you can optionally add the USING clause to the LOAD statement to specify a user-defined function that can read in and interpret the data.
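As a rough sketch (the paths, delimiter and loader class below are assumptions, purely for illustration):
comments = LOAD 'hdfs://node/facebook_comment';                              -- one HDFS file, read with the default loader (tab-delimited PigStorage)
all_data = LOAD 'hdfs://node/comments_dir';                                  -- a directory: every file inside it is loaded
csv_data = LOAD 'hdfs://node/comments.csv' USING PigStorage(',');            -- built-in loader with a custom delimiter
custom   = LOAD 'hdfs://node/comments.bin' USING com.example.CommentLoader(); -- hypothetical user-defined load function for a non-native format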
TRANSFORM
The transformation logic is where all the data manipulation happens. Here we can FILTER out rows that are not of interest, JOIN two sets of data files, GROUP data to build aggregations, ORDER results, and much more. The following is an example of a Pig program that takes a file composed of Facebook comments, selects only those comments that are in English, then groups them by the user who is commenting, and displays the sum of the number of re-comments of that user's comments.
-- the field names come from the original example; their schema is assumed, since the file layout is not shown
L  = LOAD 'hdfs://node/facebook_comment'
        AS (from_user:chararray, iso_language_code:chararray, recomments:int);
FL = FILTER L BY iso_language_code == 'en';
G  = GROUP FL BY from_user;
RT = FOREACH G GENERATE group, SUM(FL.recomments);
DUMP and STORE
If we don't specify the DUMP or STORE command, the results of a Pig program are not generated. When we are debugging our Pig programs, we typically use the DUMP command to send the output to the screen. When we go into production, we simply change the DUMP call to a STORE call so that the results are written to a file for further processing or analysis. Note that the DUMP command can be used anywhere in our program to dump intermediate result sets to the screen, which is very helpful for debugging.
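Continuing the example above, the two calls look like this (the output path is an assumption):
DUMP RT;                                        -- debugging: print the result (or any intermediate relation) to the screen
STORE RT INTO 'hdfs://node/recomment_counts';   -- production: write the result to a file in HDFS instead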
How to run a Pig program
Now that we are ready with our Pig program, we need to run it in the Hadoop environment. There are three ways to run a Pig program (see the short examples after this list):
- Embedded in a script
- Embedded in a Java program
- From Grunt (the Pig command line)
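For instance, assuming the statements above are saved in a file called myscript.pig (a hypothetical name), the first and third options look like this:
pig -x local myscript.pig     # run the script locally, handy for testing on small data
pig myscript.pig              # run the script on the Hadoop cluster (MapReduce mode)
pig                           # start the Grunt shell and enter Pig Latin statements interactively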
It doesn't matter which of the three ways we run the program. The Pig runtime environment translates the program into a set of map and reduce tasks and runs them on our behalf. I will talk about Python in my next blog; till then, happy reading.