Data Mining in the Swamp: Taming Unruly Data With Cloud Computing

Business Intelligence is all about making better decisions from the data you have. All too often, however, that data is difficult for typical BI tools to process. The difficulties generally fall into two areas:

  1. The data is too voluminous to be properly digested by your BI system.
  2. The data records are messy, inconsistent and difficult to join together.

Each of these problems is commonplace, and relatively easy to solve in isolation. High‐volume datasets can be mastered by simply (if expensively) throwing more hardware and software at the problem—larger servers, cluster licenses, faster networks, bigger memories, faster disks, etc. Messy data can be cleansed with the appropriate use of script and SQL logic to make records consistent and well‐defined.

But what do you do when datasets are both large and messy? As we learned from a recent project with a large financial institution, large amounts of data that are difficult to correlate can bring down even a state-of-the-art BI system. Yet large volumes of messy data are a fact of life; indeed, they are probably the bulk of the data in the enterprise. Add the complexity and cost of trying to tame that data, and the business case for analyzing it is quickly overwhelmed.

Enter the Cloud

One of the primary concepts in cloud computing is low‐cost scalability—systems that can grow to handle larger volumes of users and data by adding more low‐cost hardware. Google's entire infrastructure is built on this approach of distributing work out to thousands of inexpensive servers, instead of relying on centralized "supercomputers" to provide the horsepower.

The scalability strategy that Google uses is called MapReduce. The MapReduce model provides a conceptual framework for dividing work up into small, manageable sets that can be distributed across 1 or 10 or 100 or 1000 or even 10000 servers, which can all work in parallel. This technology can be used with BI to meet the challenge of large-scale, messy data. You can't use Google's own infrastructure to run your MapReduce jobs, but luckily there is Hadoop, an open source implementation of the MapReduce system that Google described.

Even though it’s technically still in “Beta”, Hadoop is in use at many large organizations, including:

  • Amazon
  • Yahoo
  • Facebook
  • Adobe
  • The New York Times
  • AOL
  • Twitter
  • Rackspace

Introducing Hadoop

In 2003 and 2004, Google published papers describing its Google File System (GFS) and its MapReduce algorithms. Doug Cutting, a Yahoo employee and Open Source Evangelist, partnered with a friend to create Hadoop (named after his son’s stuffed elephant), an open source implementation of GFS and MapReduce.

In essence, Hadoop is a software system that stores arbitrarily large amounts of data in a distributed file system and hands that data out to an arbitrary number of workers using MapReduce. Adding more storage or more workers is simply a matter of connecting new machines to the network; there is no need for larger devices, specialized disks or specialized networking.

The two main parts of Hadoop are:

  1. HDFS
  2. MapReduce


HDFS (Hadoop Distributed File System) is a system for managing files that runs "on top of" standard computers and standard operating systems. When a file is loaded into HDFS, the master “Name Node” invisibly breaks it into large chunks and stores each chunk in multiple places (for redundancy) on the native file systems of the computers in the "cluster".

There's no requirement that the disks be the same size or that the computers in the "cluster" be identical. When a file is retrieved from HDFS, the Name Node fetches the chunks from the appropriate machines, reassembles them and delivers the file to the caller. (This is a simplification of the actual process, which is considerably more sophisticated.)
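
The chunk-and-replicate idea can be illustrated with a toy model. This is only a sketch under stated assumptions, not how HDFS is implemented: the 8-byte block size, the replication factor of 3, the node names and the round-robin placement are all invented for illustration (real HDFS uses much larger blocks and smarter placement).

```python
from itertools import cycle


def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Cut a file's bytes into fixed-size blocks (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: list[str], replication: int) -> dict[int, list[str]]:
    """Assign every block to `replication` distinct nodes, round-robin.

    Round-robin is an illustrative simplification of real placement policies.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    ring = cycle(nodes)
    return {block_id: [next(ring) for _ in range(replication)]
            for block_id in range(num_blocks)}


def reassemble(stored: dict[int, bytes]) -> bytes:
    """Join blocks back together in block-id order."""
    return b"".join(stored[i] for i in sorted(stored))


data = b"0123456789" * 3                         # a 30-byte "file"
blocks = split_into_blocks(data, block_size=8)   # 8 + 8 + 8 + 6 bytes
placement = place_replicas(len(blocks),
                           ["node-a", "node-b", "node-c", "node-d"],
                           replication=3)
restored = reassemble(dict(enumerate(blocks)))
```

Because every block lives on three different machines in this model, losing any single node still leaves at least two copies of each block.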


MapReduce is a three‐step process that provides a structure for analyzing data and manipulating it in a scalable way. The three steps are:

  1. Map
  2. Shuffle/Sort
  3. Reduce


In the Map step, raw data is translated, standardized or otherwise manipulated, usually in fairly "lightweight" ways. The output of the Map step is a set of key-value pairs. The key is a unique or nearly unique identifier (in many cases, you want the same key to be used for multiple records, for grouping purposes), and the value is whatever data the Reduce step will need later.
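
A minimal sketch of what a Map function might look like, written in plain Python rather than against the Hadoop API. The input layout (`account,date,amount`), the key format and the function name `map_record` are assumptions made up for this example.

```python
def map_record(line: str):
    """Yield (key, value) pairs for one raw input line; skip lines that don't parse."""
    parts = [p.strip() for p in line.split(",")]
    if len(parts) != 3:
        return                      # wrong number of fields: drop the record
    account, date, amount = parts
    try:
        value = float(amount)
    except ValueError:
        return                      # amount is not numeric: drop the record
    # Key on account and month so records for the same month group together.
    key = f"{account.upper()}|{date[:7]}"
    yield key, value


raw = [
    " acct-1 , 2009-03-14, 20.50",
    "ACCT-1,2009-03-30,4.50",
    "garbage line",
    "acct-2,2009-04-01,n/a",
]
pairs = [kv for line in raw for kv in map_record(line)]
```

Note that the two valid records deliberately share a key: that is what lets the later steps bring them together.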


The "secret sauce" of Hadoop is the distributed sort performed in the Shuffle/Sort step, where the records are all sorted by key (using either a default alphabetical sort or another comparator of your choice). Once the records are sorted by key, all the records with the same key are sent to the same Reduce processor. Essentially, this is a way to group data intelligently.
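
The grouping behaviour can be sketched in a few lines. This is a single-process imitation, assuming a simple "hash of the key modulo number of reducers" partitioning rule; the CRC32 hash, the function names and the sample data are illustrative choices, not Hadoop's code.

```python
import zlib
from itertools import groupby
from operator import itemgetter


def partition(key: str, num_reducers: int) -> int:
    """Pick a reducer for a key; the same key always maps to the same reducer."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle_sort(pairs, num_reducers: int):
    """Route pairs to reducers, sort each reducer's input by key, then group by key."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    reducer_inputs = []
    for bucket in buckets:
        bucket.sort(key=itemgetter(0))          # stable sort keeps value order per key
        reducer_inputs.append(
            [(k, [v for _, v in grp]) for k, grp in groupby(bucket, key=itemgetter(0))]
        )
    return reducer_inputs


pairs = [("b", 1), ("a", 2), ("b", 3), ("c", 4), ("a", 5)]
reducer_inputs = shuffle_sort(pairs, num_reducers=2)
merged = {k: vs for one_reducer in reducer_inputs for k, vs in one_reducer}
```

Whichever reducer a key lands on, all of its values arrive there together, which is the property the Reduce step relies on.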


In the last step, Reduce, these groups of records with the same keys are handed one by one to the Reduce task. Sometimes the Reduce step will cache all of the records in a group so that it can operate on the entire group at one time (often to perform aggregations). Other translations and manipulations may occur here; for example, the data might be output in a format that's easier to import into a database. Finally, the resulting data is written back out to HDFS, and the job is done.
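
The difference between reducing record by record and caching a whole group can be sketched as follows. The two reducer functions, the driver `run_reduce` and the sample data are hypothetical; the input is assumed to be already sorted by key, as it would be after the sort.

```python
from itertools import groupby
from operator import itemgetter
from statistics import median


def reduce_streaming(key, values):
    """Aggregate one record at a time: only a running sum and count are kept."""
    total, count = 0.0, 0
    for v in values:
        total += v
        count += 1
    return key, total, total / count


def reduce_cached(key, values):
    """Cache the whole group first, for aggregates (like a median) that need every value."""
    cached = list(values)
    return key, median(cached)


def run_reduce(sorted_pairs, reducer):
    """Hand each group of same-key records to the reducer, one group at a time."""
    return [reducer(k, (v for _, v in grp))
            for k, grp in groupby(sorted_pairs, key=itemgetter(0))]


sorted_pairs = [("2009-03", 10.0), ("2009-03", 20.0), ("2009-03", 60.0), ("2009-04", 5.0)]
sums_and_means = run_reduce(sorted_pairs, reduce_streaming)
medians = run_reduce(sorted_pairs, reduce_cached)
```

Caching is convenient but costs memory in proportion to the largest group, so the streaming form is usually preferable when the aggregate allows it.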




How Does This Help BI?

Imagine you have data where some of the dimensions are well defined, but others change over time, in non‐trivial ways. Imagine that the data is in many different places, in many different formats, and you want to create a "holistic" view of the data. Last, but not least, imagine that the size of the overall dataset is so large that it will swamp the capabilities of your BI tool. How do you solve this problem?

The Traditional Way (Take a deep breath)

You can write translators for the different datasets ‐ but if the dataset is large, those translators will take a long time to run. So you consider manually splitting these large datasets into smaller sets, but then, of course, you have to get the data onto all of the computers, get the scripts running, and the computers need to have enough storage for the subset of the data. Once you clean the data, you have to rejoin the cleansed data together again from the multiple machines, and if you need to sort the data to help you aggregate it, you're going to have to find a sort solution that works on the huge volumes of data that you're dealing with. Odds are, you'll have to sort smaller subsets of the overall dataset, and then find ways to merge the subsets back together. Then you still haven't dealt with the fact that you need to aggregate it, so you have to write more scripts, divide the data into subsets again, and make sure you got all the records that belong in a group on the same machine.

Every step in the above scenario is error‐prone, complex, difficult to predict and, if you have to do this process on a regular basis, probably maddeningly tedious.

Hadoop to the Rescue

Instead, consider this option:

  1. Load all the datasets into HDFS. Hadoop takes care of how to partition the data and where to put it, and it also handles redundancy. (In other words, you don't need to use RAID in your Hadoop solution.)
  2. Write Map jobs that take the data from each of the formats and clean it, organizing it into a common format.
  3. Specify how to sort the data so that it is grouped properly.
  4. Write Reduce jobs to aggregate the data, averaging and summing columns as needed, and then output the final aggregated data in a SQL-friendly format.
  5. Run the Map/Sort/Reduce process on the data.
  6. At the end of this process, pull the aggregated data out of HDFS and load it into your BI system.

You'll need to perform some quality checking on the output, but if the steps have been done properly, you have a repeatable, scalable process for generating aggregated data, with minimal manual intervention.
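
Steps 2 through 5 can be sketched end to end in the line-oriented style used by Hadoop Streaming, where each record is a text line with a tab between key and value. Everything specific here is an assumption for illustration: the two source formats, the account names and the CSV output layout are invented, and `sorted()` merely stands in for Hadoop's distributed sort.

```python
from itertools import groupby


def mapper(lines):
    """Clean records from two hypothetical source formats into 'key<TAB>amount' lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if "|" in line:                     # format B: MM/DD/YYYY|account|amount
            date, account, amount = line.split("|")
            month, _day, year = date.split("/")
            period = f"{year}-{month}"
        else:                               # format A: account,YYYY-MM-DD,amount
            account, date, amount = line.split(",")
            period = date[:7]
        yield f"{account.strip().upper()},{period}\t{float(amount):.2f}"


def reducer(sorted_lines):
    """Sum amounts per key and emit CSV rows: account,period,total,count."""
    keyed = (line.split("\t") for line in sorted_lines)
    for key, group in groupby(keyed, key=lambda kv: kv[0]):
        amounts = [float(amount) for _, amount in group]
        yield f"{key},{sum(amounts):.2f},{len(amounts)}"


source_a = ["acct-1,2009-03-14,20.50", "acct-2,2009-03-02,7.00"]
source_b = ["03/30/2009|ACCT-1|4.50", "04/01/2009|ACCT-1|1.25"]

mapped = list(mapper(source_a + source_b))
rows = list(reducer(sorted(mapped)))        # sorted() plays the role of the sort step
```

The resulting rows are plain comma-separated text, which most databases and BI tools can bulk-load directly.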

Some Current Uses of Hadoop:

  • Product Search Index generation
  • Data Aggregation & Rollup
  • Data mining for ad targeting
  • ETL
  • Analyzing & Storing Logs
  • Data Analytics
  • RDF Indexing

Where Hadoop Doesn't Fit

Hadoop is a tool like any other, and it is not applicable to every problem. Some areas where Hadoop is not the right solution:

Highly Interdependent Data

Hadoop is not well suited for data where each record is heavily dependent on a number of other records. For example, consider weather forecasting ‐ predicting what's going to happen to a storm front over time requires a view of pretty much the entire dataset at once. This is the realm for supercomputers.

Ad‐Hoc, "Casual" BI

Hadoop provides a framework for data analysis, but it usually requires a fairly sophisticated user to create queries and aggregations. And, of course, there's little-to-no support for visualizations. Hadoop's query-oriented sub-projects, such as Hive and Pig, help mitigate some of this, but casual reporting is still somewhat difficult.

Real‐time Processing

Hadoop is designed to trade off startup speed for scalability and parallelism. In other words, Hadoop is more like a locomotive than a sports car: it takes a fairly long time to get everything set up, but once it's moving, it's doing a lot of work.

Dependencies on Other Systems

If you set up a 10,000 node Hadoop cluster to process a huge dataset, and one of the steps involves a query to a database on an old computer in a dusty corner of the datacenter, that database server is going to be a bottleneck for the entire cluster. Hadoop jobs work best when they have few (or best case, none) dependencies on external systems. There are various tricks and strategies that can mitigate this problem, but in general, remove as much external dependency as possible from your Hadoop jobs.
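
One commonly used mitigation, when the external data is small enough, is to copy it to the workers once instead of querying the remote system for every record. The sketch below only counts queries to show the difference; `SlowDatabase`, its methods and the sample records are hypothetical stand-ins, not a real client library.

```python
class SlowDatabase:
    """Stand-in for the old external server; counts the queries it receives."""

    def __init__(self, table):
        self._table = dict(table)
        self.queries = 0

    def lookup(self, key):
        self.queries += 1
        return self._table.get(key)

    def dump(self):
        """Return the whole reference table in a single query."""
        self.queries += 1
        return dict(self._table)


def map_with_remote_lookup(records, db):
    """One remote query per record: the database becomes the bottleneck."""
    return [(acct, db.lookup(acct), amount) for acct, amount in records]


def map_with_local_copy(records, db):
    """Fetch the reference table once, then join against the in-memory copy."""
    branches = db.dump()
    return [(acct, branches.get(acct), amount) for acct, amount in records]


records = [("A1", 10.0), ("A2", 5.0), ("A1", 2.5), ("A3", 1.0)]
table = {"A1": "Atlanta", "A2": "Boston"}

remote_db = SlowDatabase(table)
remote_result = map_with_remote_lookup(records, remote_db)

local_db = SlowDatabase(table)
local_result = map_with_local_copy(records, local_db)
```

Both versions produce the same joined records, but the first issues a query per record while the second issues one per worker, regardless of how many records that worker processes.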


In terms of Business Intelligence, Hadoop is a tool that makes the ETL process easier, and can bring the size and quality of data under control. It provides low‐cost scalability, a reliable and easily expandable file system, and a framework for dramatically increasing the scope and robustness of your data‐mining and data‐analysis business strategies. In situations where your data is too large, too messy or both, Hadoop can help you get it under control, and focus on your business, instead of focusing on one‐off IT infrastructure data analysis projects.

Want to Learn More?

Case Study

MATRIX provided technical and architectural support for a large financial institution that was attempting to reconcile data between multiple bank accounts, representing seven (7) years' worth of account history.

The Scenario

The system had more than 50 servers, more than 80 cores, and contained over two (2) petabytes of accessible HDFS storage.

Multiple Hadoop Map/Reduce jobs were run in series to manipulate the data:

  • Cleanse and Align the different account formats
  • Add additional information to each record
  • Reconcile duplicate accounts over time using a fuzzy logic subsystem
  • Aggregate data from the accounts on a month‐by‐month basis

The Results

Without any optimization, the system was capable of processing one (1) month of data in approximately 30 minutes. Including data loading and testing, the full run took approximately ten (10) days of computing time, over the course of several weeks.
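
As a rough back-of-the-envelope check on these figures, assuming that the 30 minutes covers the whole chain of jobs for one month and that each of the seven years' months is processed exactly once (neither assumption is stated in the case study):

```python
YEARS = 7
MINUTES_PER_MONTH = 30          # approximate figure from the case study
TOTAL_RUN_DAYS = 10             # approximate figure from the case study

months = YEARS * 12                                   # months of account history
processing_hours = months * MINUTES_PER_MONTH / 60    # pure processing time
total_hours = TOTAL_RUN_DAYS * 24
processing_share = processing_hours / total_hours     # fraction of the full run
```

Under those assumptions, a single pass over the data accounts for roughly 42 of the 240 hours, which is consistent with the statement that the ten days also include data loading and testing.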

About the Author – John Brothers

John is a veteran software architect/developer with 18 years of professional software development experience in multiple industries, including high‐energy physics, telephony, Internet, transaction management, health care and data visualizations. He has worked for large and small organizations in a number of roles, including developer, sales engineer, architect, director of development and CTO. He holds two patents in the area of Visitor‐based networking.

John is an experienced "hands‐on" agile coach and engineer, with a strong background in the integration of agile tools for CI, automated testing, etc. He is expert with development in Java/J2EE, Ruby on Rails, Groovy, Grails and Flex.


MATRIX is a leading full‐service IT staffing and professional services firm, providing top quality IT candidates to fill both contract consulting and permanent positions, and professional services engagements. Privately‐held, MATRIX last year had revenues of $165 million. Headquartered in Atlanta, we have offices nationwide with more than 200 internal employees and 1,400 staff contract consultants. In 2008, MATRIX was named one of the “Best 50 Small and Medium Companies” to Work for in America.
