Wednesday 26 December 2012

Big Data: An Introduction with Hive


Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems. Hive provides a mechanism to project structure onto this data and to query it using a SQL-like language called HiveQL. At the same time, the language allows traditional MapReduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express that logic in HiveQL.
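For instance, a custom script can be hooked into a query through HiveQL's TRANSFORM clause. Below is a minimal sketch; the page_views table and the my_mapper.py script are hypothetical stand-ins for your own data and mapper code:

-- Ship the custom mapper script to the cluster along with the query
ADD FILE my_mapper.py;

-- Stream each row through the script: it reads tab-separated
-- (userid, url) rows on stdin and writes (userid, domain) rows to stdout
SELECT TRANSFORM (userid, url)
USING 'python my_mapper.py'
AS (userid, domain)
FROM page_views;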

As with any database management system (DBMS), we can run our Hive queries in many ways. You can run them from a command-line interface (known as the Hive shell), from a Java Database Connectivity (JDBC) or Open Database Connectivity (ODBC) application leveraging the Hive JDBC/ODBC drivers, or from the Hive Thrift Client. The Hive Thrift Client is much like any database client that gets installed on a user's client machine (or in the middle tier of a three-tier architecture): it communicates with the Hive services running on the server. You can use the Hive Thrift Client within applications written in C++, Java, PHP, Python, or Ruby (much like you can use these client-side languages with embedded SQL to access a database such as DB2 or Informix).


At this point you might ask: we already have Pig, a powerful and simple language, so why look to Hive? The downside of Pig is that it is something new, and we need to learn it and then master it.
The folks at Facebook developed Hive, a runtime support structure for Hadoop that allows anyone who is already familiar with SQL to control the Hadoop platform right out of the gate. Hive is a platform where SQL developers write HQL (Hive Query Language) statements that are similar to standard SQL statements. HQL is limited in the commands it understands, but it is still pretty useful.

HQL statements are broken down by the Hive service into MapReduce jobs and executed across a Hadoop cluster. We can run Hive queries in several ways. We can run them from

  1. Hive Shell (a command line interface)
  2. Java Database Connectivity (JDBC)
  3. Open Database Connectivity (ODBC)
  4. Hive Thrift Client (much like any database client that gets installed on a user's machine, for applications written in C++, Java, PHP, Python, or Ruby)


Through the Hive Thrift Client, these client-side languages can work with Hive much as they would use embedded SQL to access a database such as DB2 or Informix.
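As an illustration of the JDBC route, here is a minimal sketch, assuming a Hive server listening on localhost port 10000 and the Hive JDBC driver on the classpath; the host, port, and database are placeholders to adjust for your cluster. It runs the same aggregation used in the FBComments example further below:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // Register the Hive JDBC driver
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");

        // Connect to the Hive server (host and port are assumptions)
        Connection con = DriverManager.getConnection(
                "jdbc:hive://localhost:10000/default", "", "");
        Statement stmt = con.createStatement();

        // Hive turns this HiveQL into MapReduce jobs behind the scenes
        ResultSet rs = stmt.executeQuery(
                "SELECT from_user, SUM(recomments) FROM FBComments GROUP BY from_user");
        while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
        }
        con.close();
    }
}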

Let us consider the simple example below. Here we create an FBComments table, populate it, and then query it using Hive:

-- Create a table for Facebook comments, stored as Hadoop SequenceFiles
CREATE TABLE FBComments(from_user STRING, userid BIGINT,
                        commenttext STRING, recomments INT)
COMMENT 'This is Facebook comments table and a simple example'
STORED AS SEQUENCEFILE;

-- Populate the table from files already sitting in HDFS
LOAD DATA INPATH 'hdfs://node/fbcommentdata' INTO TABLE FBComments;

-- Total up the re-comments for each user
SELECT from_user, SUM(recomments)
FROM FBComments
GROUP BY from_user;
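Statements like these can be typed interactively at the Hive shell prompt, or saved to a script file and run in one shot with the shell's -f option (for example, hive -f fbcomments.q, where the file name is just an illustration).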

Looking at the code above, Hive (HQL) appears very similar to traditional database SQL code. There are small differences that any SQL developer can point out. Because Hive sits on top of Hadoop and MapReduce operations, there are a few key differences:

High latency - Because Hadoop is intended for long sequential scans, we can expect queries to have very high latency (many minutes). Hive would not be appropriate for applications that need very fast response times.

Not suitable for transaction processing - Hive is read-based and therefore not suitable for transaction processing where you expect a high percentage of write operations.

