Monday, 24 December 2012

Big Data And Hadoop Part 2

In my previous blog I have given more insight on Big Data and now we will take a closer look at Hadoop.


Hadoop is a software framework that supports data-intensive distributed applications under a free license. It enables applications to work with thousands of nodes and petabytes (a thousand terabytes) of data. It is a top-level Apache project which is written in Java. Hadoop was inspired by Google’s work on its Google (distributed) File System (GFS) and the Map-Reduce programming paradigm.

Hadoop is designed to scan through large data sets to produce its results through a highly scalable, distributed batch processing system. People, who always misunderstand Hadoop, Please take a note that 
  • It is not about speed-of-thought response times
  • It is not real-time warehousing 

Hadoop a computing environment built on a top of distributed cluster system designed specifically for very large scale data operation. It is about discovery and making the once near-impossible possible from a scalability and analysis perspective. The Hadoop methodology is built around a function-to-data model where analysis programs are sent to data. One of the key components of Hadoop is the redundancy built into the environment and this redundancy provides fault tolerance to it.

Hadoop Component
Hadoop has three below component:

  1. Hadoop Distributed File System
  2. Programming paradigm – Map Reduce pattern
  3. Hadoop Common

Hadoop is generally seen as having two parts: a file system (the Hadoop Distributed File System) and a programming paradigm (Map-Reduce).

Components of Hadoop
Now let us take a closer look at Hadoop components.

The Hadoop Distributed File System (HDFS)
It’s possible to scale a Hadoop cluster to hundreds and even to thousands of nodes. Data in a Hadoop cluster is broken down into smaller pieces (called blocks) and distributed throughout the cluster. The map and reduce functions execute on smaller subsets of larger data sets. Hadoop uses commonly available servers in a very large cluster, where each server has a set of inexpensive internal disk drives. For higher performance, MapReduce tries to assign workloads to these servers where the data to be processed is stored. This is known as data locality. For the very same reason SAN (storage area network) or NAS (network attached storage), is not recommended in Hadoop Environment.

The strength of Hadoop is that it has built-in fault tolerance and fault compensation capabilities. This is the same for HDFS, in that data is divided into blocks, and copies of these blocks are stored on other servers in the Hadoop cluster. That is, an individual file is actually stored as smaller blocks that are replicated across multiple servers in the entire cluster.
Let us take an example we have to store the Unique Registration Numbers (URN) for everyone in the India; then people with a last name starting with A might be stored on server 1, B on server 2, C on Server 3 and so on. In a Hadoop pieces of this URN would be stored across the cluster. In order to reconstruct the entire URN information, we would need the blocks from every server in the Hadoop cluster. To achieve availability as components fail, HDFS replicates these smaller pieces on to two additional servers by default. A data file in HDFS is divided into blocks, and the default size of these blocks for Apache Hadoop is 64 MB.

All of Hadoop’s data placement logic is managed by a special server called NameNode. It keeps the directory tree of all files in the file system, and tracks where across the cluster the file data is kept. It does not store the data of these files itself.All of the NameNode's information is stored in memory, which allows it to provide quick response times to storage manipulation or read requests. Client applications talk to the NameNode whenever they wish to locate a file, or when they want to add/copy/move/delete a file. Backup Node is available from Hadoop 0.21+ version and used for HA (High Availability) in case NameNode fails.

Please take a note that there’s no need for MapReduce code to directly reference the NameNode. Interaction with the NameNode is mostly done when the jobs are scheduled on various servers in the Hadoop cluster. This greatly reduces communications to the NameNode during job execution, which helps to improve scalability of the solution. In summary, the NameNode deals with cluster metadata describing where files are stored;actual data being processed by MapReduce jobs never flows through the NameNode.

HDFS is not fully POSIX compliant because the requirements for a POSIX filesystem differ from the target goals for a Hadoop application. The tradeoff of not having a fully POSIX compliant filesystem is increased performance for data throughput. It means that commands you might use in interacting with files (copying, deleting, etc.) are available in a different form with HDFS. There may be syntactical differences or limitations in functionality. To work around this, you may write your own Java applications to perform some of the functions, or learn the different HDFS commands to manage and manipulate files in the file system. 

Basics of MapReduce
MapReduce is the heart of Hadoop. It is this programming paradigm in which work is broken down into mapper and reducer tasks to manipulate data that is stored across a cluster of servers for massive parallelism. The term MapReduce actually refers to two separate and distinct tasks that Hadoop programs perform. The first is the map job, which takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). The reduce job takes the output from a map as input and combines those data tuples into a smaller set of tuples. As the sequence of the name MapReduce implies, the reduce job is always performed after the map job.

The people are generally confused about MapReduce so I am trying to explain it by giving a simple example but in real world it is not so simple J

Assume we have three files, and each file contains two columns {a key (city) and a value (population) in Hadoop term} for the various months. Please see below:

Mumbai, 20000
New Delhi, 30000
Bangalore, 15000
Mumbai, 22000
New Delhi, 31000

Now we want to find the minimum population for each city across all of the data files. There may be the case that each file might have the same city represented multiple times. Using the MapReduce paradigm, we can break this down into three map tasks, where each mapper works on one of the three files and the mapper task goes through the data and returns the minimum population for each city.

For example, the results produced from one mapper task for the data above would look like this:
(Mumbai, 20000) (New Delhi, 30000) (Bangalore, 15000)

Let’s assume the other 2 mapper tasks produced the following intermediate results:

(Mumbai, 20000) (New Delhi, 31000) (Bangalore, 15000)
(Mumbai, 22000) (New Delhi, 30000) (Bangalore, 14000)

All three of these output streams would be fed into the reduce tasks, which combine the input results and output a single value for each city, producing a final result set as follows:
(Mumbai, 20000) (New Delhi, 30000) (Bangalore, 15000)
The JobTracker is the service within Hadoop that farms out MapReduce tasks to specific nodes in the cluster, ideally the nodes that have the data, or at least are in the same rack.
  1. Client applications submit jobs to the Job tracker.
  2. The JobTracker talks to the NameNode to determine the location of the data
  3. The JobTracker locates TaskTracker nodes with available slots at or near the data
  4. The JobTracker submits the work to the chosen TaskTracker nodes.
  5. The TaskTracker nodes are monitored. If they do not submit heartbeat signals often enough, they are deemed to have failed and the work is scheduled on a different TaskTracker.
  6. TaskTracker will notify the JobTracker when a task fails. The JobTracker decides what to do then: it may resubmit the job elsewhere, it may mark that specific record as something to avoid, and it may may even blacklist the TaskTracker as unreliable.
  7. When the work is completed, the JobTracker updates its status.
  8. Client applications can poll the JobTracker for information.
The JobTracker is a point of failure for the Hadoop MapReduce service. If it goes down, all running jobs are halted.All MapReduce programs that run natively under Hadoop are written in Java, and it is the Java Archive file (jar) that’s distributed by the JobTracker to the various Hadoop cluster nodes to execute the map and reduce tasks.
Hadoop Common Components
The Hadoop Common Components are a set of libraries that support the various Hadoop subprojects. To interact with files in HDFS, we need to use the /bin/hdfs dfs <args> file system shell command interface, where args represents the command arguments we want to use on files in the file system.

Here are some examples of HDFS shell commands:

cat Copies the file to standard output (stdout).
chmod Changes the permissions for reading and writing to a given file or set of files.
chown Changes the owner of a given file or set of files.
copyFromLocal Copies a file from the local file system into HDFS.
copyToLocal Copies a file from HDFS to the local file system.
cp Copies HDFS files from one directory to another.
expunge Empties all of the files that are in the trash.
ls Displays a listing of files in a given directory.
mkdir Creates a directory in HDFS.
mv Moves files from one directory to another.
rm Deletes a file and sends it to the trash

Application Development in Hadoop
In below section, we will cover three Pig, Hive, and Jaql which are top 3 application development language currently.

Pig is a high-level platform for creating Map Reduce programs used with Hadoop. Pig was originally developed at Yahoo Research for researchers to have an ad-hoc way of creating and executing map-reduce jobs on very large data sets. Pig is made up of two components: the first is the language itself, which is called PigLatin and the second is a runtime environment where PigLatin programs are executed.

The first step in a Pig program is to LOAD the data you want to manipulate from HDFS. Then you run the data through a set of transformations (which, under the covers, are translated into a set of mapper and reducer tasks). Finally, you DUMP the data to the screen or you STORE the results in a file somewhere.

Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.

As with any database management system (DBMS), we can run our Hive queries in many ways. You can run them from a command line interface (known as the Hive shell), from a Java Database Connectivity (JDBC) or Open Database Connectivity (ODBC) application leveraging the Hive JDBC/ODBC drivers, or from what is called a HiveThrift Client. The Hive Thrift Client is much like any database client that gets installed on a user’s client machine (or in a middle tier of 3-tier architecture): it communicates with the Hive services running on the server. You can use the Hive Thrift Client within applications written in C++, Java, PHP, Python, or Ruby (much like you can use these client-side languages with embedded SQL to access a database such as DB2 or Informix.

Jaql has been designed to flexibly read and write data from a variety of data stores and formats. Its core IO functions are read and write functions that are parameterized for specific data stores (e.g., file system, database, or a web service) and formats (e.g., JSON, XML, or CSV). Jaql infrastructure is extremely flexible and extensible, and allows for the passing of data between the query interface and the application language like Java, JavaScript, Python, Perl, Ruby, etc.

Specifically, Jaql allows us to select, join, group, and filter data that is stored in HDFS, much like a blend of Pig and Hive. Jaql’s query language was inspired by many programming and query languages, including Lisp, SQL, XQuery, and Pig. Jaql is a functional, declarative query language that is designed to process large data sets. For parallelism, Jaql rewrites high-level queries, when appropriate, into “low-level” queries consisting of MapReduce jobs.

Before we move further on JAQL let us talk about JSON first. Now day’s developers prefer JSON as their choice for a data interchange format. There are 2 reasons behind that; first it’s easy for humans to read due to its structure, secondly it’s easy for applications to parse it. JSON is built on top of two types of structures. The first is a collection of name/value pairs. These name/value pairs can represent anything since they are simply text strings that could represent a record in a database, an object, an associative array, and more. The second JSON structure is the ability to create an ordered list of values much like an array, list, or sequence you might have in our existing applications.

Internally, the Jaql engines transforms the query into map and reduce tasks that can significantly reduce the application development time associated with analyzing massive amounts of data in Hadoop.

Other Hadoop Related Projects
Please see below the list of some other important Hadoop related projects, for further readings please refer to ( or google:

ZooKeeper  - It provides coordination services for distributed applications.
HBase - It is a column-oriented database management system that runs on top of HDFS. It is well suited for sparse data sets.
Cassandra – It is essentially a hybrid between a key-value and a row-oriented database.
Oozie - It is an open source project that simplifies workflow and coordination between jobs.
LuceneIt is an extremely popular open source Apache project for text search and is included in many open source projects.
Apache Avro -It is mainly used for data serialization.
Chukwa -A monitoring system specifically used for large distributed system.
Mahout - It is a machine learning library.
Flume -A flume is a channel that directs water from a source to some other location where water is needed.

If I am ending this article without putting a single word about BigQuery, than I am not giving justice to it. BigQuery for Developers is a RESTful web service that enables interactive analysis of massively large datasets working in conjunction with Google Storage. Please take a note that it is an Infrastructure as a Service (IaaS) that may be used complementary with MapReduce.

I will try to bring out more and more, hidden but known truth of Big Data and Hadoop in my subsequent articles. Till than keep reading!


  1. This blog is gives great information on big data hadoop online training in hyderabad, uk, usa, canada.

    best online hadoop training in hyderabad.
    hadoop online training in usa, uk, canada.

  2. This blog is gives great information on big data hadoop online training in hyderabad, uk, usa, canada.

    best online hadoop training in hyderabad.
    hadoop online training in usa, uk, canada.

  3. This idea is mind blowing. I think everyone should know such information like you have described on this post. Thank you for sharing this explanation.Your final conclusion was good. We are sowing seeds and need to be patiently wait till it blossoms.

    Web Designing Training in Chennai

    Java Training in Chennai

    Salesforce Training in Chennai