In my previous blog post I gave an overview of Big Data; now we will take a closer look at Hadoop.
Hadoop
Hadoop is a software
framework that supports data-intensive distributed applications under a free
license. It enables applications to work with thousands of nodes and petabytes (a petabyte is a thousand terabytes) of data. It is a top-level Apache project written in Java. Hadoop was inspired by Google's work on the Google File System (GFS), a distributed file system, and the MapReduce programming paradigm.
Hadoop is designed to scan through large data sets and produce results through a highly scalable, distributed batch processing system. Hadoop is often misunderstood, so please take note that:
- It is not about speed-of-thought response times
- It is not real-time warehousing
Hadoop is a computing environment built on top of a distributed cluster system designed specifically for very large-scale data operations. It is about discovery, and about making the once near-impossible possible from a scalability and analysis perspective. The Hadoop methodology is built around a function-to-data model, in which analysis programs are sent to the data. One of Hadoop's key characteristics is the redundancy built into the environment, and this redundancy provides its fault tolerance.
Hadoop Components
Hadoop has the three components below:
- Hadoop Distributed File System (HDFS)
- Programming paradigm – the MapReduce pattern
- Hadoop Common
Hadoop is generally
seen as having two parts: a file system (the Hadoop Distributed File System) and a programming paradigm (MapReduce).
Components of Hadoop
Now let us take a
closer look at Hadoop components.
The Hadoop Distributed File System
(HDFS)
It’s possible to
scale a Hadoop cluster to hundreds and even to thousands of nodes. Data in a
Hadoop cluster is broken down into smaller pieces (called blocks) and distributed throughout
the cluster. The map and reduce functions execute on smaller subsets of larger
data sets. Hadoop uses commonly available servers in a very large cluster,
where each server has a set of inexpensive internal disk drives. For higher
performance, MapReduce tries to assign workloads to these servers where the
data to be processed is stored. This is known as data locality. For the very same reason, SAN (storage area network) and NAS (network attached storage) are not recommended in a Hadoop environment.
The strength of
Hadoop is that it has built-in fault tolerance and fault compensation
capabilities. This is the same for HDFS, in that data is divided into blocks,
and copies of these blocks are stored on other servers in the Hadoop cluster.
That is, an individual file is actually stored as smaller blocks that are
replicated across multiple servers in the entire cluster.
Let us take an example: we have to store the Unique Registration Numbers (URN) for everyone in India. In a traditional setup, people with a last name starting with A might be stored on server 1, B on server 2, C on server 3, and so on. In a Hadoop environment, pieces of this URN data would be stored across the cluster. In order to reconstruct the entire URN information, we would need the blocks from every server in the Hadoop cluster. To achieve availability as components fail, HDFS replicates these smaller pieces onto two additional servers by default. A data file in HDFS is divided into blocks, and the default size of these blocks for Apache Hadoop is 64 MB.
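Both the replication factor and the block size are configurable. Here is a minimal hdfs-site.xml sketch; the values shown simply restate the defaults mentioned above, and the property names are the ones used in the Hadoop 1.x era (where the 64 MB default applies):
<configuration>
  <!-- Number of copies HDFS keeps of each block (default is 3: the original plus two replicas) -->
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <!-- Block size in bytes; 67108864 bytes = 64 MB -->
  <property>
    <name>dfs.block.size</name>
    <value>67108864</value>
  </property>
</configuration>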
All of Hadoop’s data placement logic is managed by a special server
called NameNode. It keeps the directory tree of all files in the file system,
and tracks where across the cluster the file data is kept. It does not store
the data of these files itself. All of the NameNode's information is stored in memory, which allows it to provide quick response times to storage manipulation or read requests. Client applications talk to the NameNode whenever they wish to locate a file, or when they want to add/copy/move/delete a file. A Backup Node, available from Hadoop 0.21 onwards, keeps an up-to-date copy of the NameNode's information and can be used for high availability (HA) if the NameNode fails.
Please take a note that there’s no need for MapReduce code to directly
reference the NameNode. Interaction with the NameNode is mostly done when the
jobs are scheduled on various servers in the Hadoop cluster. This greatly
reduces communications to the NameNode during job execution, which helps to
improve scalability of the solution. In summary, the NameNode deals with
cluster metadata describing where files are stored; the actual data being processed
by MapReduce jobs never flows through the NameNode.
HDFS is not fully POSIX compliant because the requirements for a POSIX
filesystem differ from the target goals for a Hadoop application. The tradeoff
of not having a fully POSIX compliant filesystem is increased performance for
data throughput. It means that commands
you might use in interacting with files (copying, deleting, etc.) are available
in a different form with HDFS. There may be syntactical differences or limitations
in functionality. To work around this, you may write your own Java applications
to perform some of the functions, or learn the different HDFS commands to
manage and manipulate files in the file system.
Basics of MapReduce
MapReduce is the
heart of Hadoop. It is this programming paradigm in which work is broken down
into mapper and reducer tasks to manipulate data that is stored across a
cluster of servers for massive parallelism. The term MapReduce actually refers to two separate and distinct
tasks that Hadoop programs perform. The first is the map job, which takes a set
of data and converts it into another set of data, where individual elements are
broken down into tuples (key/value pairs). The reduce job takes
the output from a map as input and combines those data tuples into a smaller
set of tuples. As the sequence of the name MapReduce implies, the reduce job is
always performed after the map job.
People are often confused about MapReduce, so I will try to explain it with a simple example (although in the real world it is not quite so simple).
Assume we have three files, and each file contains two columns {a key (city) and a value (population), in Hadoop terms} for various months. Please see below the contents of one such file:
Mumbai, 20000
New Delhi, 30000
Bangalore, 15000
Mumbai, 22000
New Delhi, 31000
Now we want to find the minimum population for each city across all of the data files. It may be the case that a file has the same city represented multiple times.
Using the MapReduce paradigm, we can break this down into three map tasks,
where each mapper works on one of the three files and the mapper task goes
through the data and returns the minimum population for each city.
For example, the
results produced from one mapper task for the data above would look like this:
(Mumbai, 20000) (New Delhi, 30000) (Bangalore, 15000)
Let’s assume the
other 2 mapper tasks produced the following intermediate results:
(Mumbai, 20000) (New Delhi, 31000) (Bangalore, 15000)
(Mumbai, 22000) (New Delhi, 30000) (Bangalore, 14000)
All three of these
output streams would be fed into the reduce tasks, which combine the input
results and output a single value for each city, producing a final result set
as follows:
(Mumbai, 20000) (New Delhi, 30000) (Bangalore, 14000)
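To make this concrete, here is a minimal sketch of how such a minimum-population job could be written against Hadoop's Java MapReduce API. It assumes a reasonably recent Hadoop release, that each input line looks like "Mumbai, 20000", and that the class names and paths are my own choices for illustration:
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MinPopulation {

  // Mapper: turns each "city, population" line into a (city, population) pair
  public static class MinMapper extends Mapper<Object, Text, Text, IntWritable> {
    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] parts = value.toString().split(",");
      if (parts.length == 2) {
        context.write(new Text(parts[0].trim()),
                      new IntWritable(Integer.parseInt(parts[1].trim())));
      }
    }
  }

  // Reducer: receives all populations for one city and emits the minimum
  public static class MinReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int min = Integer.MAX_VALUE;
      for (IntWritable v : values) {
        min = Math.min(min, v.get());
      }
      context.write(key, new IntWritable(min));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "min city population");
    job.setJarByClass(MinPopulation.class);
    job.setCombinerClass(MinReducer.class); // safe, because the min of minimums is still the minimum
    job.setMapperClass(MinMapper.class);
    job.setReducerClass(MinReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. an HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
The job would typically be packaged into a jar file and submitted to the cluster, which brings us to the JobTracker.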
The
JobTracker is the service within Hadoop that farms out MapReduce tasks to
specific nodes in the cluster, ideally the nodes that have the data, or at
least are in the same rack.
- Client applications submit jobs to the JobTracker.
- The JobTracker talks to the NameNode to determine the location of the data.
- The JobTracker locates TaskTracker nodes with available slots at or near the data.
- The JobTracker submits the work to the chosen TaskTracker nodes.
- The TaskTracker nodes are monitored. If they do not submit heartbeat signals often enough, they are deemed to have failed and the work is scheduled on a different TaskTracker.
- A TaskTracker will notify the JobTracker when a task fails. The JobTracker decides what to do then: it may resubmit the job elsewhere, it may mark that specific record as something to avoid, and it may even blacklist the TaskTracker as unreliable.
- When the work is completed, the JobTracker updates its status.
- Client applications can poll the JobTracker for information.
The JobTracker is a single point of failure for the Hadoop MapReduce service. If it goes down, all running jobs are halted.
All MapReduce programs that run natively under Hadoop are written in Java, and it is the Java Archive file (jar) that is distributed by the JobTracker to the various Hadoop cluster nodes to execute the map and reduce tasks.
Hadoop Common Components
The Hadoop Common
Components are a set of libraries that support the various Hadoop subprojects. To
interact with files in HDFS, we need to use the /bin/hdfs dfs <args> file
system shell command interface, where args represents the command arguments we
want to use on files in the file system.
Here are some examples of HDFS shell commands:
- cat: Copies the file to standard output (stdout).
- chmod: Changes the permissions for reading and writing to a given file or set of files.
- chown: Changes the owner of a given file or set of files.
- copyFromLocal: Copies a file from the local file system into HDFS.
- copyToLocal: Copies a file from HDFS to the local file system.
- cp: Copies HDFS files from one directory to another.
- expunge: Empties all of the files that are in the trash.
- ls: Displays a listing of files in a given directory.
- mkdir: Creates a directory in HDFS.
- mv: Moves files from one directory to another.
- rm: Deletes a file and sends it to the trash.
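As mentioned earlier, you can also perform the same kinds of operations from your own Java applications. Here is a minimal sketch using the org.apache.hadoop.fs.FileSystem API; the NameNode address and paths below are purely illustrative:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode:9000"); // illustrative NameNode address
    FileSystem fs = FileSystem.get(conf);

    fs.mkdirs(new Path("/user/demo"));                                 // like: mkdir
    fs.copyFromLocalFile(new Path("cities.txt"),
                         new Path("/user/demo/cities.txt"));           // like: copyFromLocal
    for (FileStatus status : fs.listStatus(new Path("/user/demo"))) {  // like: ls
      System.out.println(status.getPath());
    }
    fs.delete(new Path("/user/demo/cities.txt"), false);               // like: rm (but bypasses the trash)
    fs.close();
  }
}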
Application Development in Hadoop
In the sections below, we will cover Pig, Hive, and Jaql, three of the most popular application development languages for Hadoop today.
Pig
Pig is a high-level platform for creating MapReduce programs used with Hadoop. Pig was originally developed at Yahoo Research to give researchers an ad-hoc way of creating and executing MapReduce jobs on very large data sets. Pig is made up of two components: the first is the language itself, which is called PigLatin, and the second is a runtime environment where PigLatin programs are executed.
The first step in a
Pig program is to LOAD the data you
want to manipulate from HDFS. Then you run the data through a set of transformations (which, under the covers,
are translated into a set of mapper and reducer tasks). Finally, you DUMP the data to the screen or you STORE the results in a file somewhere.
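For illustration only, here is a minimal sketch of that LOAD, transform, STORE flow, written by embedding PigLatin in Java through Pig's PigServer class. The file names and aliases are made up, and it assumes Pig is installed and configured to talk to the cluster:
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class MinPopulationPig {
  public static void main(String[] args) throws Exception {
    // Execute the PigLatin statements as MapReduce jobs on the cluster
    PigServer pig = new PigServer(ExecType.MAPREDUCE);

    // LOAD the comma-separated city/population data from HDFS
    pig.registerQuery("raw = LOAD 'cities.txt' USING PigStorage(',') AS (city:chararray, population:int);");

    // Transformations: group by city and take the minimum population per group
    pig.registerQuery("grouped = GROUP raw BY city;");
    pig.registerQuery("mins = FOREACH grouped GENERATE group AS city, MIN(raw.population) AS min_population;");

    // STORE the result back into HDFS (DUMP would print it to the screen instead)
    pig.store("mins", "min_population_output");
  }
}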
Hive
Hive is a data
warehouse system for Hadoop that facilitates easy data summarization, ad-hoc
queries, and the analysis of large datasets stored in Hadoop compatible file
systems. Hive provides a mechanism to project structure onto this data and
query the data using a SQL-like language called HiveQL. At the same time this
language also allows traditional map/reduce programmers to plug in their custom
mappers and reducers when it is inconvenient or inefficient to express this
logic in HiveQL.
As with any database
management system (DBMS), we can run our Hive queries in many ways. You can run
them from a command line interface (known as the Hive shell), from a
Java Database Connectivity (JDBC) or Open Database Connectivity (ODBC)
application leveraging the Hive JDBC/ODBC drivers, or from what is called a Hive Thrift
Client. The Hive Thrift Client is much like any database client that gets installed
on a user’s client machine (or in a middle tier of 3-tier architecture): it
communicates with the Hive services running on the server. You can use the Hive
Thrift Client within applications written in C++, Java, PHP, Python, or Ruby
(much like you can use these client-side languages with embedded SQL to access
a database such as DB2 or Informix).
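As a quick illustration of the JDBC route, here is a minimal sketch that runs a HiveQL query from Java. It assumes a HiveServer2 endpoint on the hypothetical host shown, the standard Hive JDBC driver on the classpath, and a made-up table called city_population:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
  public static void main(String[] args) throws Exception {
    // Load the Hive JDBC driver for HiveServer2
    Class.forName("org.apache.hive.jdbc.HiveDriver");

    // Host, port, and credentials here are illustrative
    try (Connection con = DriverManager.getConnection("jdbc:hive2://hiveserver:10000/default", "", "");
         Statement stmt = con.createStatement();
         ResultSet rs = stmt.executeQuery(
             "SELECT city, MIN(population) FROM city_population GROUP BY city")) {
      while (rs.next()) {
        System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
      }
    }
  }
}
Behind the scenes, Hive compiles the HiveQL into MapReduce jobs, which is exactly the point: you get SQL-like access without writing mappers and reducers by hand.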
Jaql
Jaql has been
designed to flexibly read and write data from a variety of data stores and
formats. Its core IO functions are read and write functions that are parameterized
for specific data stores (e.g., file system, database, or a web service) and
formats (e.g., JSON, XML, or CSV). Jaql infrastructure is extremely flexible
and extensible, and allows for the passing of data between the query interface
and application languages such as Java, JavaScript, Python, Perl, and Ruby.
Specifically, Jaql
allows us to select, join, group, and filter data that is stored in HDFS, much
like a blend of Pig and Hive. Jaql’s query language was inspired by many
programming and query languages, including Lisp, SQL, XQuery, and Pig. Jaql is
a functional, declarative query language that is designed to process large data
sets. For parallelism, Jaql rewrites high-level queries, when appropriate, into
“low-level” queries consisting of MapReduce jobs.
Before we move further on Jaql, let us talk about JSON first. Nowadays, developers prefer JSON as their data interchange format of choice. There are two reasons behind that: first, its structure makes it easy for humans to read; second, it is easy for applications to parse. JSON is built on top of two types of structures. The first is a collection of name/value pairs. These name/value pairs can represent anything, since they are simply text strings that could represent a record in a database, an object, an associative array, and more. The second JSON structure is the ability to create an ordered list of values, much like an array, list, or sequence you might have in your existing applications.
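For illustration, here is a tiny JSON fragment (with made-up field names) showing both structures: the name/value pairs of an object, and an ordered list of values:
{
  "city": "Mumbai",
  "country": "India",
  "monthly_population": [20000, 20000, 22000]
}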
Internally, the Jaql engine transforms the query into map and reduce tasks, which can significantly reduce
the application development time associated with analyzing massive amounts of
data in Hadoop.
Other Hadoop Related Projects
Please see below a list of some other important Hadoop-related projects; for further reading, please refer to hadoop.apache.org or a quick Google search:
- ZooKeeper – It provides coordination services for distributed applications.
- HBase – It is a column-oriented database management system that runs on top of HDFS. It is well suited for sparse data sets.
- Cassandra – It is essentially a hybrid between a key-value and a row-oriented database.
- Oozie – It is an open source project that simplifies workflow and coordination between jobs.
- Lucene – It is an extremely popular open source Apache project for text search and is included in many open source projects.
- Apache Avro – It is mainly used for data serialization.
- Chukwa – It is a monitoring system specifically built for large distributed systems.
- Mahout – It is a machine learning library.
- Flume – A flume is a channel that directs water from a source to some other location where it is needed; similarly, Apache Flume channels large amounts of streaming data (such as log data) into Hadoop.
I would not be doing justice to this article if I ended it without a single word about BigQuery. BigQuery is a RESTful web service from Google that enables interactive analysis of massively large datasets, working in conjunction with Google Storage. Please take note that it is delivered as a managed cloud service (rather than software you install yourself) and may be used as a complement to MapReduce.
I will try to bring out more of the hidden but important truths of Big Data and Hadoop in my subsequent articles.
Till then, keep reading!