Wednesday, 26 December 2012

Big Data: An Introduction with Hive

Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.

As with any database management system (DBMS), we can run our Hive queries in many ways. You can run them from a command line interface (known as the Hive shell), from a Java Database Connectivity (JDBC) or Open Database Connectivity (ODBC) application leveraging the Hive JDBC/ODBC drivers, or from what is called a HiveThrift Client. The Hive Thrift Client is much like any database client that gets installed on a user’s client machine (or in a middle tier of 3-tier architecture): it communicates with the Hive services running on the server. You can use the Hive Thrift Client within applications written in C++, Java, PHP, Python, or Ruby (much like you can use these client-side languages with embedded SQL to access a database such as DB2 or Informix.

Now you will think that we already have Pig a powerful and simple language than why should be look for Hive? The downside of Pig is that it is something new and we need to learn it and then master it.
Facebook folks have developed a runtime Hadoop support structure which is called Hive which allows anyone who is already familiar with SQL to control the Hadoop platform right out of the gate. Hive is a platform where SQL developers are allowed to write HQL (Hive Query Language) which are similar to standard SQL statements. HQL is limited in the commands it understands but it is still pretty useful.

HQL statements are broken down by the Hive service into MapReduce jobs and executed across a Hadoop cluster. We can run Hive queries in many ways. We can run them from

  1. Hive Shell (a command line interface)
  2. Java Database Connectivity (JDBC)
  3. Open Database Connectivity (ODBC)
  4. HiveThrift Client (The Hive Thrift Client is much like any database client that gets installed on a user’s client machine written in C++, Java, PHP, Python, or Ruby)

Hive Thrift Client can use these client-side languages with embedded SQL to access a database such as DB2 or Informix.

Let us consider below simple example. Here we create a FBComments table, populate it, and then query that table using Hive:

CREATE TABLE FBComments(from_user STRING, userid BIGINT, commenttext STRING,
recomments INT)
COMMENT 'This is Facebook comments table and a simple example'
LOAD DATA INPATH 'hdfs://node/fbcommentdata' INTO TABLE FBComments;
SELECT from_user, SUM(recomments)
FROM FBComments
GROUP BY from_user;

After looking above code, Hive (HQL) looks very much similar to traditional database SQL code. There exist small difference and any SQL developer can point it out.Hive is based on Hadoop and MapReduce operations, there are few key differences.

 High Latency - Hadoop is intended for long sequential scans, we can expect queries to have a very high latency (many minutes). Hive would not be appropriate for applications that need very fast response times.

 Not Suitable for Transaction processing - Hive is read-based and therefore not suitable for transaction processing where you expect high percentage of write operations.

Tuesday, 25 December 2012

Big Data - An Introduction to Pig

In my previous article, I have explained Big Data and Hadoop in details. In this article I would like to go little deeper with Pig. Pig is a high-level platform for creating Map Reduce programs used with Hadoop. Pig is made up of two components: the first is the language itself, which is called PigLatin and the second is a runtime environment where PigLatin programs are executed. Pig Latin can be extended using UDF (User Defined Functions) which the user can write in JavaPython or JavaScript and then call directly from the language.

We know Pig was initially developed at Yahoo research at 2006. The whole intension behind developing Pig was to allow people using Hadoop to focus more on analyzing large data sets and spend less time to write mapper and reducer programs. Pigs eat almost anything, the Pig programming language is designed to handle any kind of data and for the very same reason Yahoo! named it Pig.

The first step in a Pig program is to LOAD the data you want to manipulate from HDFS. Then you run the data through a set of transformations (which, under the covers, are translated into a set of mapper and reducer tasks). Finally, you DUMP the data to the screen or you STORE the results in a file somewhere.

Let us talk about LOAD, TRANSFORM, DUMP and STORE in details.

The objects that are being worked on by Hadoop are stored in HDFS. In order for a Pig program to access this data, the program must first tell Pig what file (or files) it will use, and it is done through the LOAD 'data_file' command (where 'data_file' specifies either an HDFS file or directory). If a directory is specified, all the files in that directory will be loaded into the program. If the data is stored in a file format that is not natively accessible to Pig, you can optionally add the USING function to the LOAD statement to specify a user-defined function that can read in and interpret the data.

The transformation logic is where all the data manipulation happens. Here we can FILTER out rows that are not of interest, JOIN two sets of data files, GROUP data to build aggregations, ORDER results, and much more. The following is an example of a Pig program that takes a file composed of Facebook comments, selects only those comments that are in English, then groups them by the user who is commenting, and displays the sum of the number of re-comments of that user’s comments.

L = LOAD 'hdfs//node/facebook_comment';
FL = FILTER L BY iso_language_code EQ 'en';
G = GROUP FL BY from_user;
RT = FOREACH G GENERATE group, SUM(recomments);

If we don’t specify the DUMP or STORE command, the results of a Pig program are not generated. When we are debugging our Pig programs, we typically use the DUMP command to send the output to the screen. We simply change the DUMP call to a STORE call so that any results from running your programs are stored in a file for further processing or analysis when we go into production. Please note that DUMP command can be used anywhere in our program to dump intermediate result sets to the screen and actually we need it because it helps us in debugging.

How to run Pig program
Now when we are ready with our Pig program, than we need to run in the Hadoop environment. There are three ways to run a Pig program: 
  1.  Embedded in a script
  2.  Embedded in a Java program
  3.  From the GRUNT(Pig command line)

It doesn’t matter which of the three ways we run the program. The Pig runtime environment translates the program into a set of map and reduces tasks and runs them under our behalf. I will talk about Python in my next blog, till than happy reading.

Monday, 24 December 2012

Big Data And Hadoop Part 2

In my previous blog I have given more insight on Big Data and now we will take a closer look at Hadoop.


Hadoop is a software framework that supports data-intensive distributed applications under a free license. It enables applications to work with thousands of nodes and petabytes (a thousand terabytes) of data. It is a top-level Apache project which is written in Java. Hadoop was inspired by Google’s work on its Google (distributed) File System (GFS) and the Map-Reduce programming paradigm.

Hadoop is designed to scan through large data sets to produce its results through a highly scalable, distributed batch processing system. People, who always misunderstand Hadoop, Please take a note that 
  • It is not about speed-of-thought response times
  • It is not real-time warehousing 

Hadoop a computing environment built on a top of distributed cluster system designed specifically for very large scale data operation. It is about discovery and making the once near-impossible possible from a scalability and analysis perspective. The Hadoop methodology is built around a function-to-data model where analysis programs are sent to data. One of the key components of Hadoop is the redundancy built into the environment and this redundancy provides fault tolerance to it.

Hadoop Component
Hadoop has three below component:

  1. Hadoop Distributed File System
  2. Programming paradigm – Map Reduce pattern
  3. Hadoop Common

Hadoop is generally seen as having two parts: a file system (the Hadoop Distributed File System) and a programming paradigm (Map-Reduce).

Components of Hadoop
Now let us take a closer look at Hadoop components.

The Hadoop Distributed File System (HDFS)
It’s possible to scale a Hadoop cluster to hundreds and even to thousands of nodes. Data in a Hadoop cluster is broken down into smaller pieces (called blocks) and distributed throughout the cluster. The map and reduce functions execute on smaller subsets of larger data sets. Hadoop uses commonly available servers in a very large cluster, where each server has a set of inexpensive internal disk drives. For higher performance, MapReduce tries to assign workloads to these servers where the data to be processed is stored. This is known as data locality. For the very same reason SAN (storage area network) or NAS (network attached storage), is not recommended in Hadoop Environment.

The strength of Hadoop is that it has built-in fault tolerance and fault compensation capabilities. This is the same for HDFS, in that data is divided into blocks, and copies of these blocks are stored on other servers in the Hadoop cluster. That is, an individual file is actually stored as smaller blocks that are replicated across multiple servers in the entire cluster.
Let us take an example we have to store the Unique Registration Numbers (URN) for everyone in the India; then people with a last name starting with A might be stored on server 1, B on server 2, C on Server 3 and so on. In a Hadoop pieces of this URN would be stored across the cluster. In order to reconstruct the entire URN information, we would need the blocks from every server in the Hadoop cluster. To achieve availability as components fail, HDFS replicates these smaller pieces on to two additional servers by default. A data file in HDFS is divided into blocks, and the default size of these blocks for Apache Hadoop is 64 MB.

All of Hadoop’s data placement logic is managed by a special server called NameNode. It keeps the directory tree of all files in the file system, and tracks where across the cluster the file data is kept. It does not store the data of these files itself.All of the NameNode's information is stored in memory, which allows it to provide quick response times to storage manipulation or read requests. Client applications talk to the NameNode whenever they wish to locate a file, or when they want to add/copy/move/delete a file. Backup Node is available from Hadoop 0.21+ version and used for HA (High Availability) in case NameNode fails.

Please take a note that there’s no need for MapReduce code to directly reference the NameNode. Interaction with the NameNode is mostly done when the jobs are scheduled on various servers in the Hadoop cluster. This greatly reduces communications to the NameNode during job execution, which helps to improve scalability of the solution. In summary, the NameNode deals with cluster metadata describing where files are stored;actual data being processed by MapReduce jobs never flows through the NameNode.

HDFS is not fully POSIX compliant because the requirements for a POSIX filesystem differ from the target goals for a Hadoop application. The tradeoff of not having a fully POSIX compliant filesystem is increased performance for data throughput. It means that commands you might use in interacting with files (copying, deleting, etc.) are available in a different form with HDFS. There may be syntactical differences or limitations in functionality. To work around this, you may write your own Java applications to perform some of the functions, or learn the different HDFS commands to manage and manipulate files in the file system. 

Basics of MapReduce
MapReduce is the heart of Hadoop. It is this programming paradigm in which work is broken down into mapper and reducer tasks to manipulate data that is stored across a cluster of servers for massive parallelism. The term MapReduce actually refers to two separate and distinct tasks that Hadoop programs perform. The first is the map job, which takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). The reduce job takes the output from a map as input and combines those data tuples into a smaller set of tuples. As the sequence of the name MapReduce implies, the reduce job is always performed after the map job.

The people are generally confused about MapReduce so I am trying to explain it by giving a simple example but in real world it is not so simple J

Assume we have three files, and each file contains two columns {a key (city) and a value (population) in Hadoop term} for the various months. Please see below:

Mumbai, 20000
New Delhi, 30000
Bangalore, 15000
Mumbai, 22000
New Delhi, 31000

Now we want to find the minimum population for each city across all of the data files. There may be the case that each file might have the same city represented multiple times. Using the MapReduce paradigm, we can break this down into three map tasks, where each mapper works on one of the three files and the mapper task goes through the data and returns the minimum population for each city.

For example, the results produced from one mapper task for the data above would look like this:
(Mumbai, 20000) (New Delhi, 30000) (Bangalore, 15000)

Let’s assume the other 2 mapper tasks produced the following intermediate results:

(Mumbai, 20000) (New Delhi, 31000) (Bangalore, 15000)
(Mumbai, 22000) (New Delhi, 30000) (Bangalore, 14000)

All three of these output streams would be fed into the reduce tasks, which combine the input results and output a single value for each city, producing a final result set as follows:
(Mumbai, 20000) (New Delhi, 30000) (Bangalore, 15000)
The JobTracker is the service within Hadoop that farms out MapReduce tasks to specific nodes in the cluster, ideally the nodes that have the data, or at least are in the same rack.
  1. Client applications submit jobs to the Job tracker.
  2. The JobTracker talks to the NameNode to determine the location of the data
  3. The JobTracker locates TaskTracker nodes with available slots at or near the data
  4. The JobTracker submits the work to the chosen TaskTracker nodes.
  5. The TaskTracker nodes are monitored. If they do not submit heartbeat signals often enough, they are deemed to have failed and the work is scheduled on a different TaskTracker.
  6. TaskTracker will notify the JobTracker when a task fails. The JobTracker decides what to do then: it may resubmit the job elsewhere, it may mark that specific record as something to avoid, and it may may even blacklist the TaskTracker as unreliable.
  7. When the work is completed, the JobTracker updates its status.
  8. Client applications can poll the JobTracker for information.
The JobTracker is a point of failure for the Hadoop MapReduce service. If it goes down, all running jobs are halted.All MapReduce programs that run natively under Hadoop are written in Java, and it is the Java Archive file (jar) that’s distributed by the JobTracker to the various Hadoop cluster nodes to execute the map and reduce tasks.
Hadoop Common Components
The Hadoop Common Components are a set of libraries that support the various Hadoop subprojects. To interact with files in HDFS, we need to use the /bin/hdfs dfs <args> file system shell command interface, where args represents the command arguments we want to use on files in the file system.

Here are some examples of HDFS shell commands:

cat Copies the file to standard output (stdout).
chmod Changes the permissions for reading and writing to a given file or set of files.
chown Changes the owner of a given file or set of files.
copyFromLocal Copies a file from the local file system into HDFS.
copyToLocal Copies a file from HDFS to the local file system.
cp Copies HDFS files from one directory to another.
expunge Empties all of the files that are in the trash.
ls Displays a listing of files in a given directory.
mkdir Creates a directory in HDFS.
mv Moves files from one directory to another.
rm Deletes a file and sends it to the trash

Application Development in Hadoop
In below section, we will cover three Pig, Hive, and Jaql which are top 3 application development language currently.

Pig is a high-level platform for creating Map Reduce programs used with Hadoop. Pig was originally developed at Yahoo Research for researchers to have an ad-hoc way of creating and executing map-reduce jobs on very large data sets. Pig is made up of two components: the first is the language itself, which is called PigLatin and the second is a runtime environment where PigLatin programs are executed.

The first step in a Pig program is to LOAD the data you want to manipulate from HDFS. Then you run the data through a set of transformations (which, under the covers, are translated into a set of mapper and reducer tasks). Finally, you DUMP the data to the screen or you STORE the results in a file somewhere.

Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.

As with any database management system (DBMS), we can run our Hive queries in many ways. You can run them from a command line interface (known as the Hive shell), from a Java Database Connectivity (JDBC) or Open Database Connectivity (ODBC) application leveraging the Hive JDBC/ODBC drivers, or from what is called a HiveThrift Client. The Hive Thrift Client is much like any database client that gets installed on a user’s client machine (or in a middle tier of 3-tier architecture): it communicates with the Hive services running on the server. You can use the Hive Thrift Client within applications written in C++, Java, PHP, Python, or Ruby (much like you can use these client-side languages with embedded SQL to access a database such as DB2 or Informix.

Jaql has been designed to flexibly read and write data from a variety of data stores and formats. Its core IO functions are read and write functions that are parameterized for specific data stores (e.g., file system, database, or a web service) and formats (e.g., JSON, XML, or CSV). Jaql infrastructure is extremely flexible and extensible, and allows for the passing of data between the query interface and the application language like Java, JavaScript, Python, Perl, Ruby, etc.

Specifically, Jaql allows us to select, join, group, and filter data that is stored in HDFS, much like a blend of Pig and Hive. Jaql’s query language was inspired by many programming and query languages, including Lisp, SQL, XQuery, and Pig. Jaql is a functional, declarative query language that is designed to process large data sets. For parallelism, Jaql rewrites high-level queries, when appropriate, into “low-level” queries consisting of MapReduce jobs.

Before we move further on JAQL let us talk about JSON first. Now day’s developers prefer JSON as their choice for a data interchange format. There are 2 reasons behind that; first it’s easy for humans to read due to its structure, secondly it’s easy for applications to parse it. JSON is built on top of two types of structures. The first is a collection of name/value pairs. These name/value pairs can represent anything since they are simply text strings that could represent a record in a database, an object, an associative array, and more. The second JSON structure is the ability to create an ordered list of values much like an array, list, or sequence you might have in our existing applications.

Internally, the Jaql engines transforms the query into map and reduce tasks that can significantly reduce the application development time associated with analyzing massive amounts of data in Hadoop.

Other Hadoop Related Projects
Please see below the list of some other important Hadoop related projects, for further readings please refer to ( or google:

ZooKeeper  - It provides coordination services for distributed applications.
HBase - It is a column-oriented database management system that runs on top of HDFS. It is well suited for sparse data sets.
Cassandra – It is essentially a hybrid between a key-value and a row-oriented database.
Oozie - It is an open source project that simplifies workflow and coordination between jobs.
LuceneIt is an extremely popular open source Apache project for text search and is included in many open source projects.
Apache Avro -It is mainly used for data serialization.
Chukwa -A monitoring system specifically used for large distributed system.
Mahout - It is a machine learning library.
Flume -A flume is a channel that directs water from a source to some other location where water is needed.

If I am ending this article without putting a single word about BigQuery, than I am not giving justice to it. BigQuery for Developers is a RESTful web service that enables interactive analysis of massively large datasets working in conjunction with Google Storage. Please take a note that it is an Infrastructure as a Service (IaaS) that may be used complementary with MapReduce.

I will try to bring out more and more, hidden but known truth of Big Data and Hadoop in my subsequent articles. Till than keep reading!

Friday, 14 December 2012

Big Data and Hadoop Part 1

Big Data Definition

As per wiki Big Data usually includes data sets with sizes beyond the ability of commonly-used software tools to capture, curate, manage, and process the data within a tolerable elapsed time.

Big data is an opportunity to find visions in new and emerging types of data and content. It helps to make business more agile. Sometime we know that we can find answer to questions, but we can’t because of limitation of current infrastructure or data is not being captured in the current traditional systems. We can also find an answer to those questions which were not considered previously and beyond our reach. Now, Big Data gives this opportunity. It also enables us to identify explanation and pattern across industries and sectors.
Big Data stores everything: environmental data, financial data, medical data, surveillance data, log data, Insurance data, digital pictures and videos, cell phone GPS signals, etc. This list can go on and on and on.

Characteristics of Big Data

There are three characteristics of Big Data which are described below:

Enterprise data is growing and growing from Terabytes to Petabytes and from Petabytes to Zettabytes. Twitter alone generates more than 7 terabytes (TB) of data every day, Facebook 10 TB, and some enterprises generate terabytes of data every hour.

Year - Data Stored Volume
2000 - 800,000 Petabytes (PB)

Year - Expected Data Stored Volume
2020 35 Zettabytes (ZB)

Now days, Enterprises are facing massive volumes of data. Organizations don’t know how to manage this data and to get right meaning out of it. If we use the right technology, right platform and analyze almost all of the data than it can gain a better understanding of business, customers, as well as the marketplace. We can at least use the right technology to identify the useful data for business and customers. The data volumes has changed from terabytes to petabytes and in couple of years it will shift to zettabytes, and for sure all this data can’t be stored in traditional database systems. So the question is where we will store high volume of daily data and make it meaningful for business decision and the answer is move quickly to Big Data.

Big data can store any type of data, structured, semi-structured and unstructured data. All of it can be included in part of the decision-making. The volume associated with the Big Data brings new challenges. It is a challenge for data centers how to deal with it. It is not only the volume, its variety also a matter of concern. The data associated with sensors, smart devices and social media have made enterprise data very complex. It includes not only traditional relational data, but also raw, semi-structured, and unstructured data. Examples are like data from web pages, web log files (including click-stream data), search indexes, social media forums, e-mail, documents, sensor data, and so on. The biggest struggle for traditional systems is that to store and perform the required analytic to gain understanding from the contents of these logs. Much of the information being generated doesn’t fit to traditional database technologies. Very few companies currently understand it and moving forward and starting to understand the opportunities of Big Data.

Traditional structured data is not meant to handle variety of data. Now days, organization’s success depends upon its ability to draw perceptions from traditional and nontraditional data. Traditional data is structured, neatly formatted, and relational and fits into exact schema and we do all analysis on it. But do we know that it is just the 15~20 percent of the day to day data of any business. Earlier we didn't think of 80~85 percent of data which is unstructured or semi-structured. But now world is looking into this high volume of data as this may be responsible to bring out the truth that may help in taking business decision at the right time. Twitter and Facebook are the best example of it. Twitter feeds are stored as JSON format but the actual text is unstructured and it is difficult to make meaning out of it. Facebook uses video and images too often and it is not easy to store them in traditional databases. Some event changes dynamically which will not fit right into relational databases. The biggest advantages of Big data is that it enables us to analyze structured (traditional relational databases), semi-structured and unstructured data like text, sensor data, audio, video, transactional, and more and more.

As per our traditional understanding we say velocity is considered as how quickly the data is arrived, stored and retrieved. Now days, volume and variety of data have changed in collecting and storing manner. Sometimes even few seconds are too late for time sensitive process like catching fraud.

We need a new thinking way to define velocity. We should define velocity as the speed at which the data is flowing.

Sometimes, getting an edge over your competition can mean identifying a trend, problem, or opportunity within small fraction of time. Organizations must be able to analyze this data in near real time. Big Data scale streams computing is a new concept to go beyond traditional databases limitation of running queries against relatively static data. Let us take an example, the query “Show me all people living in the Asia Pacific Sunami region” would result in a single result set. This list can be used to give warning. With streams computing, we can execute a query like people who are currently “in the Asia Pacific Sunami region” and this can be continuously updated result set, because location information from GPS system is refreshed in real time.

Big Data requires perform analytic against the volume and variety of data while it is still in motion, not just after it is at rest. To conclude on Big Data characteristic we can say velocity has same importance as volume. It should combine both how fast data is generated as well how fast analytic is done.

Big Data & Hadoop
The Traditional warehouses are mostly ideal for analyzing structured data from various systems and producing visibility with known and relatively stable measurements. Hadoop based platform is well suited to deal with semi-structured and unstructured data, as well as when a data discovery process is needed. In addition Hadoop can be used for structured data too.
The truth and accuracy is the most desired feature of data. Traditional DW is meant to store structured data. DW departments pick and choose high-valued data and put it through rigorous cleansing and transformation processes as they know that data has a high known value per byte. In contrast, Big Data repositories rarely undergo (at least initially) the full quality control rigors of data before being injected into DW. The Unstructured data can’t be easily stored in a warehouse. A Big Data platform lets you store all of the data in its native business object format and get value out of it through massive parallelism on readily available components. Hadoop based repository scheme store the entire business entity and the fidelity of the Tweet, transaction, Facebook post. Data in Hadoop might seem of low value today but in fact in can be the key to bring out the answer to question which is not asked yet. So let data sit in Hadoop for a while until you discover its value this can be migrate to DW once its value is proven.

Please take a point that Big Data & Hadoop platform is not a replacement for warehouse. Actually both supplement each other.

When Big Data Is Suitable

  1. Big Data solutions are ideal for analyzing not only raw structured data, but semi-structured and unstructured data from a wide variety of sources.
  2. Big Data solutions are ideal when all, or most, of the data needs to be analyzed versus a sample of the data; or a sampling of data isn’t nearly as effective as a larger set of data from which to derive analysis.
  3. Big Data solutions are ideal for iterative and exploratory analysis when business measures on data are not predetermined.
  4. Is the reciprocal of the traditional analysis paradigm appropriate for the business task at hand? Better yet, can you see a Big Data platform complementing what you currently have in place for analysis and achieving synergy with existing solutions for better business outcomes?
  5. Big Data is well suited for solving information challenges that don’t natively fit within a traditional relational database approach for handling the problem at hand. It’s important that we understand that conventional database technologies are an important, and relevant, part of an overall analytic solution. In fact, they become even more vital when used in conjunction with your Big Data platform.

Big Data - Where It Can Fit In

IT Log Analytics
Log analytic is a common use case for Big Data project. We like to refer to all those logs and trace data that are generated by the operation of your IT solutions as data exhaust. Data exhaust has concentrated value, and we need to figure out a way to store and extract value from it. Some of the value derived from data exhaust is obvious and has been transformed into value-added click-stream data that records every gesture, click, and movement made on a web site. Google and Bing search use search based log data to do various analytical reporting. It is not limited to search log, it can be applied to firewall log and to also IT logs which are unreported and uncovered.

Trying to find correlation across massive (gigabytes) of data is a difficult task, but Big Data makes it easy. Now it is possible to identify previously unreported areas for performance optimization tuning. Big Data platform can be used to analyze 1TB of log data each day, with less than 5 minutes latency. Big Data platform finally offers the opportunity to get some new and better insights into the problems at hand.

The Fraud Detection Pattern
Fraud detection comes up a lot in the financial services vertical. Online auctions, insurance claims, any sort of financial transaction is involved, there we see a potential of misusing the data and it may lead to fraud. Here Big Data can help us to identify it and fix it quickly.

Conventional technologies are facing several challenges to detect fraud .The most common and repeating challenge is to store and compute fraud patterns data. Traditional Data Mining technique is used to detect fraud but they have limitation. First it works on less data. The fraud patterns can be cyclic and that can come and go in hours/days/weeks/months. By the time we discover new patterns, it’s too late and some damage has already been done.

Traditionally, in fraud cases, samples and models are used to identify customers that characterize a certain kind of profile by profiling a segment and not the granularity at an individual transaction or person level. We can forecast based on a segment, but it is obviously better to take decision based on an individual transaction. We need to work upon a larger set of data to achieve it and that is not possible with the traditional approach.

Typically, fraud detection works after a transaction gets stored only to get pulled out of storage and analyzed. With Big Data Streams, we can apply fraud detection models as the transaction is happening. It means when data is in motion.
The Social Media
Perhaps the most talked about Big Data usage pattern is social media and customer sentiment. It is very hot and happening topic in global market now days.

We can use Big Data to figure out what customers are saying about you your competitor. Furthermore, we can use it to figure out how customer sentiment impacts the business decisions. It can also be used to review product, price, campaign and placement.

Why are people saying what they are saying and behaving in the way they are behaving? Big Data can answer this question by linking behavior and the driver of that behavior which can’t be answered by traditional data. Big data can look at the interaction of what people are doing with their behaviors, current financial trends and actual transactions. For example the number of Twitter tweets per second per topic can give different insight of customer behavior.

Call and Chat Details
The time and quality resolution metrics and trending discontent patterns for a call center can show up weeks after the fact if we just depend upon traditional DW. This latency can be reduced by using Big Data.

Call centers with text and voice support always want to find better ways to process information to address business issue with lower latency. Simply it is a Big Data use case, where we can use analytics-in-motion and analytics-at-rest. Using in-motion analytics (Streams) means that we basically build our models and find out what’s interesting based upon the conversations that have been converted from voice to text or with voice analysis as the call is happening. We can build model by using rest analytics, after that promote them back into Streams. In this fashion we can examine and analyze the calls that are actually happening in real time. Big Data can also be used not only to see the customer churn rate but also to identify the vulnerable customers and reason for churn. Business can use this to reduce customer churn rate.

Risk: Patterns for Modeling and Management
Risk modeling and management is another big opportunity and common Big Data usage pattern. As per today’s financial markets, a lack of understanding risk can have devastating wealth creation effects. In addition, newly legislated regulatory requirements affect financial institutions worldwide to ensure that their risk levels fall within acceptable thresholds.

Traditional model uses 15 to 20 percent of the available structured data in their risk modeling which is similar as in fraud detection model. Traditional model experts know that there’s a lot of data that’s potentially underutilized and it can be used to determine business rules in risk model. The issue is that they don’t know where the relevant information can be found in the rest of the data. Additionally it can be too much expensive to bring underutilized data into current infrastructure. We already know that some data won’t even fit into traditional systems, so we have to look for new approach. Let’s think about what happens at the end of a trading day in a financial firm. We get a closing snapshot of the positions at the end of the day. Companies use this snapshot, and can derive insight and identify issues and concentrations using their models within a couple of hour. This can be reported back to regulators for internal risk control.

Let’s consider that financial services trend to move their risk model and dashboards to inter-day positions rather than just close-of-day positions. This is again a new challenge to traditional systems and it can’t be solved alone. Big Data can join hands with traditional system to solve this problem.

Energy Sector
The energy sector provides many Big Data use case challenges in how to deal with the massive volumes of sensor data from remote installations.

Many companies are using only a fraction of the data being collected, because they lack the infrastructure to store or analyze the available scale of data. Take for example of a typical oil drilling platform that can have 20,000 to 40,000 sensors on board. All of these sensors are streaming data about the health of the oil rig, quality of operations, and so on. Not every sensor is actively broadcasting at all times, but some are reporting back many times per second. Only 5~10 percent of those sensors are actively utilized. So 90~95 percentage of data are not being utilized in decision-making process. Analyze million and millions of electric meter data to predict better power consumption is also a difficult task. I see Big Data has a great opportunity to deal with this kind of difficult challenge.

Vestas, Denmark is an energy sector global leader whose vision is about the generation of clean energy, are using the IBM BigInsights (Big Data) platform as a method by which they can more profitably and efficiently generate even more clean energy.

This is part 1 on Big Data and I believe it is a good start to step forward with Big Data. The part 2 of current topic will take complete inside of Hadoop and its components. Stay Tuned.