Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems. Hive provides a mechanism to project structure onto this data and to query the data using a SQL-like language called HiveQL. At the same time, this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.
As with any database management system (DBMS), you can run your Hive queries in many ways. You can run them from a command line interface (known as the Hive shell), from a Java Database Connectivity (JDBC) or Open Database Connectivity (ODBC) application leveraging the Hive JDBC/ODBC drivers, or from what is called a Hive Thrift Client. The Hive Thrift Client is much like any database client that gets installed on a user's client machine (or in the middle tier of a three-tier architecture): it communicates with the Hive services running on the server. You can use the Hive Thrift Client within applications written in C++, Java, PHP, Python, or Ruby (much like you can use these client-side languages with embedded SQL to access a database such as DB2 or Informix).
The above sections are taken from http://manishku.blogspot.in/2012/12/big-data-and-hadoop-part-2.html
Now you might think: we already have Pig, a powerful and simple language, so why should we look at Hive? The downside of Pig is that it is something new, so we would need to learn it and then master it.
Folks at Facebook developed a runtime Hadoop support structure called Hive, which allows anyone who is already familiar with SQL to control the Hadoop platform right out of the gate. Hive is a platform where SQL developers can write HQL (Hive Query Language) statements that are similar to standard SQL statements. HQL is limited in the commands it understands, but it is still pretty useful.
HQL statements are broken down by the Hive service into MapReduce jobs and executed across a Hadoop cluster. We can run Hive queries in many ways. We can run them from
- the Hive shell (a command line interface)
- a Java Database Connectivity (JDBC) application
- an Open Database Connectivity (ODBC) application
- the Hive Thrift Client (much like any database client that gets installed on a user's client machine, with bindings for C++, Java, PHP, Python, and Ruby)
A Hive Thrift Client application written in these client-side languages can use embedded HQL much as it would use embedded SQL to access a database such as DB2 or Informix.
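For example, here is a minimal sketch of the JDBC route. It assumes a HiveServer2 instance is reachable at a hypothetical host and port (localhost:10000 below) with a placeholder user name, and that the Hive JDBC driver is on the classpath; adjust the connection URL and credentials for your own cluster.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // Register the Hive JDBC driver (older JVMs need this explicitly).
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Connect to a (hypothetical) HiveServer2 instance.
        Connection con = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "hive", "");
        Statement stmt = con.createStatement();

        // Any HQL statement can be submitted this way; the Hive service
        // turns it into MapReduce jobs behind the scenes.
        ResultSet rs = stmt.executeQuery("SHOW TABLES");
        while (rs.next()) {
            System.out.println(rs.getString(1));
        }

        rs.close();
        stmt.close();
        con.close();
    }
}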
Let us consider the simple example below. Here we create an FBComments table, populate it, and then query that table using Hive:
CREATE TABLE FBComments(from_user STRING, userid BIGINT, commenttext STRING, recomments INT)
COMMENT 'This is Facebook comments table and a simple example'
STORED AS SEQUENCEFILE;

LOAD DATA INPATH 'hdfs://node/fbcommentdata' INTO TABLE FBComments;

SELECT from_user, SUM(recomments)
FROM FBComments
GROUP BY from_user;
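The same aggregation can also be submitted from a Java application instead of the Hive shell. Below is a hedged fragment that reuses the JDBC connection (con) from the earlier sketch and assumes the FBComments table above has been created and loaded:

// Assumes the Connection con from the earlier JDBC sketch and the
// FBComments table created above.
Statement stmt = con.createStatement();
ResultSet rs = stmt.executeQuery(
        "SELECT from_user, SUM(recomments) AS total " +
        "FROM FBComments GROUP BY from_user");
while (rs.next()) {
    // One row per distinct from_user; SUM over an INT column comes
    // back as a BIGINT, so we read it with getLong.
    System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
}
rs.close();
stmt.close();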
Looking at the code above, Hive's HQL appears very similar to traditional database SQL code, with only small differences that any SQL developer could point out. However, because Hive is built on Hadoop and MapReduce operations, there are a few key differences:
High latency - Hadoop is intended for long sequential scans, so you can expect queries to have very high latency (many minutes). Hive would not be appropriate for applications that need very fast response times.
Not suitable for transaction processing - Hive is read-based and therefore not suitable for transaction processing, where you expect a high percentage of write operations.