Hadoop-as-a-Service from Amazon, Cloudera, Microsoft and IBM
Companies rely more and more on big data when making their decisions. Amazon, Cloudera, and IBM have announced their Hadoop-as-a-Service offerings, while Microsoft promises to do the same next year.
Amazon was the first, offering AWS Elastic MapReduce back in 2009 to run Apache Hadoop on EC2 and S3. Like many other IaaS offerings from Amazon, this service provides the minimum hardware and software necessary to run analytics on big data, leaving much of the configuration and programming against the framework to the customer, a daunting task that requires considerable expertise. Provided that the required skill is available, a company can set up and successfully run Hadoop jobs, as The New York Times demonstrated: it converted 11 million images, representing public articles published between 1851 and 1922, into 1.5 TB of PDF documents by running a 24-hour Hadoop job on 100 Amazon EC2 instances at a very low price.
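To put the New York Times figures in perspective, the following minimal Python sketch derives a few throughput numbers from the values quoted above. It assumes decimal terabytes (1 TB = 10^12 bytes) and an evenly loaded cluster; the function and variable names are illustrative only and do not come from any Amazon or Hadoop API.

```python
# Figures quoted in the article for the New York Times conversion job.
IMAGES = 11_000_000      # source images converted
INSTANCES = 100          # EC2 instances used
HOURS = 24               # wall-clock duration of the job
OUTPUT_BYTES = 1.5e12    # 1.5 TB of PDFs (assumption: decimal terabytes)


def job_summary(images, instances, hours, output_bytes):
    """Return basic throughput figures for a batch job (illustrative helper)."""
    instance_hours = instances * hours
    return {
        "instance_hours": instance_hours,
        "images_per_instance_hour": images / instance_hours,
        "images_per_second_overall": images / (hours * 3600),
        "avg_output_kb_per_image": output_bytes / images / 1000,
    }


summary = job_summary(IMAGES, INSTANCES, HOURS, OUTPUT_BYTES)
for name, value in summary.items():
    print(f"{name}: {value:,.1f}")
```

Multiplying the resulting instance-hours by a provider's hourly instance rate gives a rough estimate of the compute cost of such a job; the article does not state the rate that applied, so none is assumed here.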
Cloudera takes Amazon’s MapReduce service a step further by offering CDH3, a tuned Hadoop AMI that includes many additional software products that help with administering and running complex jobs on Hadoop, such as Apache Mahout, Flume, Sqoop, Pig, Oozie, Hive, HBase, ZooKeeper, Whirr, and others, most if not all of them open source projects. One remaining problem is the sheer amount of expertise and resources needed to install, configure and run this package: the CDH3 Installation Guide (PDF) has no fewer than 175 pages of guidelines on setting up all sorts of components, from the JDK to CDH3, Snappy and all the other parts of the system.
Microsoft recently announced at PASS Summit 2011 that it will provide Hadoop-as-a-Service integrated into Windows Azure and SQL Server some time in 2012 for companies interested in crunching large amounts of data on its platform. Few details are available, except that Microsoft has promised to maintain compatibility with the Apache Hadoop codebase and to contribute back to the open source project. The company has also made available a Sqoop-based SQL Server-Hadoop Connector, which enables bidirectional data transfer between SQL tables and Hadoop’s HDFS. Such transfer is necessary because Hadoop needs to hold data in its own file system in order to process large volumes efficiently.
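As a rough illustration of what bidirectional transfer looks like with Sqoop in general, the Python sketch below assembles one `sqoop import` and one `sqoop export` command line. It only builds and prints the commands; it does not run them. The host, database, table, user and HDFS paths are hypothetical, and Microsoft's connector may require additional setup or options that are not shown here.

```python
import shlex


def sqlserver_jdbc_url(host, database, port=1433):
    """Build a SQL Server JDBC URL (host and database are placeholders)."""
    return f"jdbc:sqlserver://{host}:{port};databaseName={database}"


def sqoop_import_cmd(jdbc_url, table, target_dir, username):
    """Command that copies a SQL table into an HDFS directory."""
    return ["sqoop", "import", "--connect", jdbc_url,
            "--username", username, "-P",      # -P prompts for the password
            "--table", table, "--target-dir", target_dir]


def sqoop_export_cmd(jdbc_url, table, export_dir, username):
    """Command that copies files from an HDFS directory into a SQL table."""
    return ["sqoop", "export", "--connect", jdbc_url,
            "--username", username, "-P",
            "--table", table, "--export-dir", export_dir]


# Hypothetical names, used for illustration only.
url = sqlserver_jdbc_url("sqlhost.example.com", "sales")
import_cmd = sqoop_import_cmd(url, "orders", "/data/orders", "etl_user")
export_cmd = sqoop_export_cmd(url, "order_summary", "/data/summary", "etl_user")

print(shlex.join(import_cmd))
print(shlex.join(export_cmd))
```

The import direction brings relational rows into HDFS so that Hadoop jobs can process them; the export direction writes job results back into a table that existing SQL tools can query.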
Another player, which made its announcement this month, is IBM, which offers to run Hadoop on its SmartCloud Enterprise using IBM InfoSphere BigInsights software. BigInsights comes in two editions: Basic, which is free and useful for evaluation projects, and Enterprise, for production purposes. IBM’s solution seems to be the most mature so far, being based on technology from Watson, the AI system that beat two of the best Jeopardy! players this year. Watson does not just answer questions by running Hadoop on a large cluster of nodes; it includes over 100 techniques to “analyze natural language, identify sources, find and generate hypotheses, find and score evidence, and merge and rank hypotheses”. It is therefore not just a platform for running big data jobs but also provides intelligence on how to address and interpret data, which is one of the most difficult parts of dealing with it.
Like Cloudera’s solution, IBM’s BigInsights includes, besides Hadoop, a number of open source programs, such as:
- Pig, a high-level programming language and runtime environment for Hadoop
- Hive, a data warehouse infrastructure designed to support batch queries and analysis of files managed by Hadoop
- HBase, a column-oriented data storage environment designed to support large, sparsely populated tables in Hadoop
- Flume, a facility for collecting and loading data into Hadoop
- Lucene, text search and indexing technology
- Avro, data serialization technology
- ZooKeeper, a coordination service for distributed applications
- Oozie, workflow/job orchestration technology
BigInsights also includes custom-made technology developed by IBM: a text analysis engine, a data exploration tool for business analysts, integration with enterprise software, and Hadoop enhancements that make it simpler to administer and improve its performance.
BigInsights does not replace online analytical processing (OLAP) or online transaction processing (OLTP) applications, but it can be integrated with them in order to “filter through high volumes of raw data and combine the results with structured data stored in your DBMS or warehouse”.
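The "filter, then combine" pattern described in that quote can be sketched in a few lines of plain Python. This is only a toy illustration of the idea, not BigInsights or Hadoop code: the log format, the customer table and all function names are invented for the example, with an in-memory dictionary standing in for rows held in a DBMS.

```python
from collections import Counter

# Hypothetical raw event lines, standing in for high-volume unstructured input.
RAW_EVENTS = [
    "2011-10-01 cust=17 status=ERROR",
    "2011-10-01 cust=17 status=OK",
    "2011-10-02 cust=42 status=ERROR",
    "2011-10-02 cust=17 status=ERROR",
    "malformed line",
]

# Hypothetical structured data, standing in for a table in a DBMS or warehouse.
CUSTOMERS = {17: "Acme Corp", 42: "Globex"}


def parse(line):
    """Extract (customer_id, status) from a line, or None if it is unusable."""
    fields = dict(part.split("=", 1) for part in line.split() if "=" in part)
    if "cust" not in fields or "status" not in fields:
        return None
    return int(fields["cust"]), fields["status"]


def error_counts(lines):
    """Filter step: keep only ERROR events and count them per customer."""
    counts = Counter()
    for line in lines:
        parsed = parse(line)
        if parsed and parsed[1] == "ERROR":
            counts[parsed[0]] += 1
    return counts


def combine(counts, customers):
    """Combine step: attach customer names, most errors first."""
    rows = ((customers.get(cid, "unknown"), n) for cid, n in counts.items())
    return sorted(rows, key=lambda row: (-row[1], row[0]))


report = combine(error_counts(RAW_EVENTS), CUSTOMERS)
print(report)
```

In a real deployment the filter step would run as a distributed job over the raw data, and only the much smaller aggregated result would be joined with the structured records.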
IBM’s Hadoop solution is up and running and can be tested by customers.
Another solution worth mentioning is EMC Greenplum Analytics Workbench, a 1,000+ node cluster for running Hadoop integration tests, provided by EMC in partnership with Intel, Mellanox Technologies, Micron, Seagate, SuperMicro, Switch, and VMware. Greenplum is not offering Hadoop-as-a-Service but rather a platform of over 10,000 virtual nodes and 24 PB of storage on which to test Hadoop itself.
According to a 2011 TDWI survey, 34% of companies use big data analytics to help them make decisions. Big data and Hadoop seem set to play an important role in the future.