Friday 14 December 2012

Big Data and Hadoop Part 1


Big Data Definition

As per Wikipedia, Big Data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process the data within a tolerable elapsed time.

Big Data is an opportunity to find insights in new and emerging types of data and content. It helps make a business more agile. Sometimes we know we could find the answer to a question, but we can't because of the limitations of the current infrastructure, or because the data is not being captured by the current traditional systems. We can also find answers to questions that were not considered previously and were beyond our reach. Big Data gives us this opportunity. It also enables us to identify explanations and patterns across industries and sectors.
           
Big Data stores everything: environmental data, financial data, medical data, surveillance data, log data, insurance data, digital pictures and videos, cell phone GPS signals, and so on. The list can go on and on.

Characteristics of Big Data

There are three characteristics of Big Data, described below:

Volume
Enterprise data is growing from terabytes to petabytes, and from petabytes toward zettabytes. Twitter alone generates more than 7 terabytes (TB) of data every day, Facebook 10 TB, and some enterprises generate terabytes of data every hour.

Year - Data Stored Volume
2000 - 800,000 Petabytes (PB)
2020 - 35 Zettabytes (ZB) (expected)

Nowadays, enterprises are facing massive volumes of data. Organizations don't know how to manage this data and get the right meaning out of it. If we use the right technology and the right platform and analyze almost all of the data, we can gain a better understanding of the business, the customers, and the marketplace. At the very least, we can use the right technology to identify the data that is useful for the business and its customers. Data volumes have grown from terabytes to petabytes, and in a couple of years they will shift to zettabytes; all of this data certainly cannot be stored in traditional database systems. So the question is where we will store this high volume of daily data and make it meaningful for business decisions, and the answer is to move quickly to Big Data.

Variety
Big Data can store any type of data: structured, semi-structured, and unstructured. All of it can be included in decision-making. The volume associated with Big Data brings new challenges for data centers, but it is not only the volume; the variety is also a matter of concern. Data from sensors, smart devices, and social media has made enterprise data very complex. It includes not only traditional relational data, but also raw, semi-structured, and unstructured data: data from web pages, web log files (including click-stream data), search indexes, social media forums, e-mail, documents, sensor data, and so on. The biggest struggle for traditional systems is to store this data and perform the analytics required to gain understanding from the contents of these logs. Much of the information being generated does not fit traditional database technologies. Very few companies currently understand this and are moving forward, starting to grasp the opportunities of Big Data.

Traditional systems are not meant to handle this variety of data. Nowadays, an organization's success depends on its ability to draw insights from traditional and nontraditional data. Traditional data is structured, neatly formatted, and relational; it fits into an exact schema, and we do all our analysis on it. But it is only 15~20 percent of the day-to-day data of any business. Earlier we did not think about the 80~85 percent of data that is unstructured or semi-structured. Now the world is looking into this high volume of data, because it may bring out the truths that help in taking business decisions at the right time. Twitter and Facebook are the best examples. Twitter feeds are stored in JSON format, but the actual text is unstructured and it is difficult to extract meaning from it. Facebook uses video and images heavily, and it is not easy to store them in traditional databases. Some events change dynamically and do not fit well into relational databases. The biggest advantage of Big Data is that it enables us to analyze structured data (traditional relational databases) alongside semi-structured and unstructured data such as text, sensor data, audio, video, transactional data, and more.

Velocity
In our traditional understanding, velocity is considered to be how quickly data arrives, is stored, and is retrieved. Nowadays, the volume and variety of data have changed the way it is collected and stored. Sometimes even a few seconds is too late for a time-sensitive process like catching fraud.

We need a new way of thinking to define velocity: the speed at which the data is flowing.

Sometimes, getting an edge over your competition can mean identifying a trend, problem, or opportunity within a small fraction of time. Organizations must be able to analyze this data in near real time. Big Data-scale streams computing is a new concept that goes beyond the traditional database limitation of running queries against relatively static data. Let us take an example: the query “show me all people living in the Asia Pacific tsunami region” would return a single result set, which could be used to issue warnings. With streams computing, we can instead run a continuous query for people who are currently in the Asia Pacific tsunami region, and the result set is continuously updated, because location information from GPS systems is refreshed in real time.
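
As a minimal sketch of this idea in Java (conceptual only, not the API of any particular streams product; the GPS feed, the person IDs, and the region bounds are all assumptions for illustration), a continuous query can be modeled as a predicate that is re-evaluated on every incoming location update, so the result set is always current:

import java.util.HashSet;
import java.util.Set;

// Conceptual sketch of a continuous query over a stream of GPS updates.
public class RegionWatch {

  // Assumed region test; a real system would use proper geo-fencing.
  static boolean inTsunamiRegion(double lat, double lon) {
    return lat > -10 && lat < 10 && lon > 90 && lon < 160; // illustrative bounds only
  }

  private final Set<String> peopleInRegion = new HashSet<String>();

  // Called for every incoming GPS update.
  public void onGpsUpdate(String personId, double lat, double lon) {
    if (inTsunamiRegion(lat, lon)) {
      peopleInRegion.add(personId);    // entered, or still inside, the region
    } else {
      peopleInRegion.remove(personId); // moved out of the region
    }
  }

  // The continuously updated answer to "who is in the region right now?"
  public Set<String> currentResultSet() {
    return peopleInRegion;
  }
}

Every update adjusts the answer immediately; that is the essential difference from a one-time query against data at rest.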

Big Data requires performing analytics against the volume and variety of data while it is still in motion, not just after it is at rest. To conclude the Big Data characteristics: velocity is as important as volume, and it combines both how fast data is generated and how fast the analytics are done.

Big Data & Hadoop
Traditional warehouses are mostly ideal for analyzing structured data from various systems and producing insights with known and relatively stable measurements. A Hadoop-based platform is well suited to dealing with semi-structured and unstructured data, as well as to situations where a data discovery process is needed. In addition, Hadoop can be used for structured data too.
Truth and accuracy are the most desired features of data. A traditional DW is meant to store structured data. DW departments pick and choose high-value data and put it through rigorous cleansing and transformation processes, because they know that data has a high known value per byte. In contrast, data entering a Big Data repository rarely undergoes (at least initially) the full quality-control rigor applied before data is loaded into a DW. Unstructured data cannot easily be stored in a warehouse. A Big Data platform lets you store all of the data in its native business-object format and get value out of it through massive parallelism on readily available components. A Hadoop-based repository stores the entire business entity with full fidelity: the tweet, the transaction, the Facebook post. Data in Hadoop might seem of low value today, but it can be the key to answering a question that has not even been asked yet. So let data sit in Hadoop for a while until you discover its value; it can be migrated to the DW once its value is proven.
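
As a small sketch of this “store it in its native format first, get value out of it later” idea, the following Java snippet writes a raw record into HDFS through the standard Hadoop FileSystem API (the path and the record content are made up for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StoreRawTweet {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration(); // picks up core-site.xml and friends
    FileSystem fs = FileSystem.get(conf);

    // Store the record exactly as received: native JSON, no cleansing, no schema.
    Path path = new Path("/raw/tweets/2012-12-14/batch-001.json"); // illustrative path
    FSDataOutputStream out = fs.create(path);
    out.writeBytes("{\"user\":\"someone\",\"text\":\"an example tweet\"}\n");
    out.close();
    fs.close();
  }
}

Nothing has to be cleansed or transformed up front; the record can be parsed and valued later, once we know which questions to ask of it.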

Please note that a Big Data & Hadoop platform is not a replacement for the warehouse; the two actually supplement each other.

When Big Data Is Suitable

  1. Big Data solutions are ideal for analyzing not only raw structured data, but also semi-structured and unstructured data from a wide variety of sources.
  2. Big Data solutions are ideal when all, or most, of the data needs to be analyzed rather than a sample of the data, or when a sample is not nearly as effective as the larger data set from which analysis can be derived.
  3. Big Data solutions are ideal for iterative and exploratory analysis when business measures on data are not predetermined.
  4. Big Data solutions are ideal when the reciprocal of the traditional analysis paradigm is appropriate for the business task at hand; better yet, when a Big Data platform can complement what you currently have in place for analysis and achieve synergy with existing solutions for better business outcomes.
  5. Big Data is well suited for solving information challenges that don’t natively fit within a traditional relational database approach for handling the problem at hand. It’s important that we understand that conventional database technologies are an important, and relevant, part of an overall analytic solution. In fact, they become even more vital when used in conjunction with your Big Data platform.

Big Data - Where It Can Fit In

IT Log Analytics
Log analytics is a common use case for a Big Data project. We like to refer to all the logs and trace data generated by the operation of your IT solutions as data exhaust. Data exhaust has concentrated value, and we need to figure out a way to store it and extract that value. Some of the value derived from data exhaust is obvious and has been transformed into value-added click-stream data that records every gesture, click, and movement made on a web site. Google and Bing use search log data for various kinds of analytical reporting. And it is not limited to search logs; the same approach can be applied to firewall logs and to the IT logs that usually go unreported and unexamined.

Trying to find correlations across massive amounts (gigabytes and more) of log data is a difficult task, but Big Data makes it easier. It becomes possible to identify previously unreported areas for performance optimization and tuning. A Big Data platform can be used to analyze 1 TB of log data each day with less than 5 minutes of latency, and it finally offers the opportunity to get new and better insights into the problems at hand.
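
To illustrate the kind of job involved, here is a minimal Hadoop MapReduce sketch in Java that counts log lines per severity level; the log layout (severity as the third whitespace-separated field) is an assumption made for the example:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LogSeverityCount {

  public static class SeverityMapper
      extends Mapper<LongWritable, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);
    private final Text severity = new Text();

    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      // Assumed format: "2012-12-14 10:15:00 ERROR some message ..."
      String[] fields = line.toString().split("\\s+");
      if (fields.length > 2) {
        severity.set(fields[2]);   // e.g. INFO, WARN, ERROR
        context.write(severity, ONE);
      }
    }
  }

  public static class SumReducer
      extends Reducer<Text, LongWritable, Text, LongWritable> {
    protected void reduce(Text key, Iterable<LongWritable> counts, Context context)
        throws IOException, InterruptedException {
      long total = 0;
      for (LongWritable c : counts) total += c.get();
      context.write(key, new LongWritable(total)); // severity -> total occurrences
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "log severity count");
    job.setJarByClass(LogSeverityCount.class);
    job.setMapperClass(SeverityMapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /logs/2012-12-14
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not exist yet
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Because the input files are simply split across the cluster, the same job runs unchanged whether the day's logs are a few gigabytes or a full terabyte.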

The Fraud Detection Pattern
Fraud detection comes up a lot in the financial services vertical. In online auctions, insurance claims, and any scenario where a financial transaction is involved, there is potential for misuse of the data, and that can lead to fraud. Here Big Data can help us identify it and fix it quickly.

Conventional technologies face several challenges in detecting fraud. The most common and recurring challenge is storing and computing over fraud-pattern data. Traditional data mining techniques are used to detect fraud, but they have limitations. First, they work on less data. Second, fraud patterns can be cyclic; they come and go over hours, days, weeks, or months. By the time we discover the new patterns, it is too late and some damage has already been done.

Traditionally, fraud cases use samples and models to identify customers that fit a certain kind of profile, profiling a segment rather than working at the granularity of an individual transaction or person. We can forecast based on a segment, but it is obviously better to make decisions based on individual transactions. To achieve that, we need to work on a much larger set of data, and that is not possible with the traditional approach.

Typically, fraud detection works after a transaction gets stored, only to be pulled out of storage and analyzed. With Big Data streams, we can apply fraud detection models as the transaction is happening, that is, while the data is in motion.
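
Here is a conceptual Java sketch of that idea (the transaction fields, rules, and thresholds are invented for illustration; a real deployment would use a proper statistical model running on a streams platform):

// Conceptual sketch: scoring each transaction while it is in motion.
public class InMotionFraudCheck {

  public static class Txn {
    String accountId;
    double amount;
    String country;     // country the transaction originates from
    String homeCountry; // country on file for the account
  }

  // A stand-in for a real model: simple additive rules with made-up thresholds.
  static double score(Txn t) {
    double s = 0.0;
    if (t.amount > 10000) s += 0.5;                 // unusually large amount
    if (!t.country.equals(t.homeCountry)) s += 0.3; // foreign-country usage
    return s;
  }

  // Called for every transaction as it flows through the stream,
  // before it is ever written to storage.
  public void onTransaction(Txn t) {
    if (score(t) >= 0.7) {
      System.out.println("HOLD txn on account " + t.accountId + " for review");
    }
    // Otherwise the transaction continues downstream and is stored as usual.
  }
}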
  
The Social Media Pattern
Perhaps the most talked-about Big Data usage pattern is social media and customer sentiment. It is a very hot topic in the global market these days.

We can use Big Data to figure out what customers are saying about you and your competitors. Furthermore, we can use it to understand how customer sentiment impacts business decisions. It can also be used to review product, price, campaign, and placement.

Why are people saying what they are saying and behaving the way they are behaving? Big Data can answer this question by linking behavior with the drivers of that behavior, something that cannot be answered with traditional data alone. Big Data can look at the interaction between what people are doing, their behaviors, current financial trends, and actual transactions. For example, the number of Twitter tweets per second per topic can give different insights into customer behavior.
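
As a minimal Java sketch of that last metric (the tweet feed and the topic extraction are assumed to happen upstream, and timestamps are assumed to arrive roughly in order), we can bucket incoming tweets per second per topic:

import java.util.HashMap;
import java.util.Map;

// Sketch: counting tweets per second per topic from a live feed.
public class TopicRateCounter {

  private final Map<String, Integer> counts = new HashMap<String, Integer>();
  private long currentSecond = -1;

  // Called once per incoming tweet.
  public void onTweet(String topic, long timestampMillis) {
    long second = timestampMillis / 1000;
    if (second != currentSecond) {
      emit(currentSecond, counts); // hand the finished one-second bucket downstream
      counts.clear();
      currentSecond = second;
    }
    Integer c = counts.get(topic);
    counts.put(topic, c == null ? 1 : c + 1);
  }

  private void emit(long second, Map<String, Integer> bucket) {
    if (second < 0 || bucket.isEmpty()) return;
    System.out.println("t=" + second + "s " + bucket); // e.g. t=1355480000s {#sale=42}
  }
}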

Call and Chat Details
For a call center, time-to-resolution and quality metrics, and trending discontent patterns, can show up weeks after the fact if we depend on a traditional DW alone. This latency can be reduced by using Big Data.

Call centers with text and voice support always want better ways to process information so they can address business issues with lower latency. This is a natural Big Data use case, where we can use both analytics-in-motion and analytics-at-rest. Using in-motion analytics (streams) means that we build our models and find what is interesting based on the conversations, converted from voice to text or analyzed as voice, while the call is happening. We can also build models using at-rest analytics and then promote them back into the streams, so that we can examine and analyze the calls that are actually happening in real time. Big Data can also be used not only to see the customer churn rate, but also to identify the vulnerable customers and the reasons for churn; the business can use this to reduce the churn rate.

Risk: Patterns for Modeling and Management
Risk modeling and management is another big opportunity and a common Big Data usage pattern. In today's financial markets, a lack of understanding of risk can have a devastating effect on wealth creation. In addition, newly legislated regulatory requirements affect financial institutions worldwide, requiring them to ensure that their risk levels fall within acceptable thresholds.

Traditional models use only 15 to 20 percent of the available structured data in their risk modeling, much as in the fraud detection case. Modeling experts know that there is a lot of potentially underutilized data that could be used to determine business rules in a risk model. The issue is that they do not know where the relevant information can be found in the rest of the data. Additionally, it can be too expensive to bring the underutilized data into the current infrastructure, and we already know that some of it will not even fit into traditional systems, so we have to look for a new approach. Think about what happens at the end of a trading day in a financial firm: we get a closing snapshot of the positions. Companies use this snapshot to derive insight and identify issues and concentrations, using their models, within a couple of hours, and this can be reported back to regulators and used for internal risk control.

Now consider that financial services firms are trying to move their risk models and dashboards to intra-day positions rather than just close-of-day positions. This is again a new challenge that traditional systems cannot solve alone; Big Data can join hands with traditional systems to solve this problem.

Energy Sector
The energy sector provides many Big Data use-case challenges around dealing with the massive volumes of sensor data from remote installations.

Many companies use only a fraction of the data being collected, because they lack the infrastructure to store or analyze it at the available scale. Take, for example, a typical oil drilling platform, which can have 20,000 to 40,000 sensors on board. All of these sensors stream data about the health of the oil rig, the quality of operations, and so on. Not every sensor is actively broadcasting at all times, but some report back many times per second. Only 5~10 percent of those sensors are actively utilized, so 90~95 percent of the data is not being used in the decision-making process. Analyzing millions and millions of electric meter readings to better predict power consumption is also a difficult task. Big Data has a great opportunity to deal with these kinds of difficult challenges.

Vestas of Denmark, a global leader in the energy sector whose vision is the generation of clean energy, is using the IBM BigInsights (Big Data) platform as a method by which it can more profitably and efficiently generate even more clean energy.

This is part 1 on Big Data, and I believe it is a good start for stepping forward with Big Data. Part 2 of this topic will take a complete look inside Hadoop and its components. Stay tuned.

