Competition between Real-time Hadoop Implementations Heats Up

by Boris Lublinsky on Feb 25, 2013. Estimated reading time: 1 minute

Ever since the publication of Google’s Dremel paper, the Hadoop community has been trying to build similar functionality around Hadoop. First came OpenDremel, which is now part of Apache Drill, which became an Apache Incubator project last year. Then, also last year, Cloudera introduced Impala, which is currently in beta and is part of the Cloudera Hadoop distribution, CDH 4.1.

The new contender in this space is the Stinger Initiative, which Hortonworks introduced into Apache's incubation process last week. Hortonworks describes the motivation as follows:

Enabling Hive to answer human-time use cases (i.e. queries in the 5-30 second range) such as big data exploration, visualization, and parameterized reports without needing to resort to yet another tool to install, maintain and learn can deliver a lot of value to the large community of users with existing Hive skills and investments.

Hortonworks aims to reach this goal by:

·         Making Hive more SQL compliant, including supporting SQL types which are missing in Hive and subqueries in the ‘where’ clause.

·         Optimizing Hive’s execution plan

·         Supporting a new Hadoop columnar format called ORCFile (similar to the formats used by Dremel, Drill and Cloudera’s Trevni)

·         Introducing a new runtime framework, Tez, based on YARN

Tez, Hindi for “speed”, currently under incubation vote in Apache, is:

… general-purpose, highly customizable framework that simplifies creation of data-processing tasks across both small scale (low-latency) and large-scale (high throughput) workloads in Hadoop. It generalizes the MapReduce paradigm to a more powerful framework by providing the ability to execute a complex DAG (directed acyclic graph) of tasks for a single job so that projects in the Apache Hadoop ecosystem such as Apache Hive, Apache Pig and Cascading can meet requirements for human-interactive response times and extreme throughput at petabyte scale (clearly MapReduce has been a key driver in achieving this).
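To make the DAG idea in the quote above concrete, here is a minimal in-memory sketch in Python. It is not the Tez API and makes no claim about how Tez is implemented; the task names (`scan_orders`, `scan_users`, `join`, `aggregate`), the sample data and the `run_dag` helper are all hypothetical. It only illustrates the general pattern: a single job described as a graph of tasks, where each task runs once its upstream tasks have finished and receives their results directly, rather than as a chain of separate MapReduce jobs.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_dag(tasks, deps):
    """Run every task after all of its upstream tasks have finished.

    tasks: task name -> function taking {upstream name: upstream result}
    deps:  task name -> set of upstream task names (empty set for sources)
    Returns the results of all tasks and the order in which they ran.
    """
    results = {}
    order = []
    # static_order() yields each node only after all of its predecessors.
    for name in TopologicalSorter(deps).static_order():
        inputs = {upstream: results[upstream] for upstream in deps[name]}
        results[name] = tasks[name](inputs)
        order.append(name)
    return results, order


def sum_by_key(pairs):
    """Sum the values of (key, value) pairs per key."""
    totals = {}
    for key, value in pairs:
        totals[key] = totals.get(key, 0) + value
    return totals


# Hypothetical job: two scans feed a join, which feeds an aggregation.
tasks = {
    "scan_orders": lambda _: [("alice", 30), ("bob", 20), ("alice", 5)],
    "scan_users": lambda _: {"alice": "US", "bob": "DE"},
    "join": lambda inp: [
        (inp["scan_users"][user], amount) for user, amount in inp["scan_orders"]
    ],
    "aggregate": lambda inp: sum_by_key(inp["join"]),
}
deps = {
    "scan_orders": set(),
    "scan_users": set(),
    "join": {"scan_orders", "scan_users"},
    "aggregate": {"join"},
}

results, order = run_dag(tasks, deps)
print(order)
print(results["aggregate"])
```

In this toy version the intermediate results simply stay in a dictionary. A real engine would also have to decide where tasks run and how data moves between them, which the sketch deliberately ignores.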

All three current “real-time” Hadoop query implementations (Drill, Impala and now Stinger) either already are or will shortly become open source projects, and will be able to leverage community support and input to solve the important problem of real-time Hadoop querying.



What about the new Greenplum Hadoop distribution (PivotalHD)? by Christian Tzolov

Hi Boris,

How would you compare the new Greenplum Hadoop platform to the rest of the Dremel-inspired projects? Greenplum claims to have 10x better performance (for real-time queries) over the other Hadoop distributions:

"Based on its tests, EMC is claiming response time improvements ranging from 10x to 600x faster than SQL interfaces for Hadoop. EMC provided its own benchmarks comparing Hawq to Hive as well as Cloudera's Impala."

What about Berkeley shark and spark? by Michael Kimber

Two questions:

1. Why does no one mention Berkeley shark and spark?

Both seem to be in active commercial usage and improve Hadoop performance to real-time levels, from the looks of it.

2. Why so little collaboration in this space? If you add in Salesforce's recent open sourcing of Phoenix (HBase SQL), that's about 5 attempts to do a similar thing!

SQL compliant? by 臧 秀涛

SQL complaint->SQL compliant?

Re: What about the new Greenplum Hadoop distribution (PivotalHD)? by Boris Lublinsky

It's a timing issue, sorry.
Abel put a post here

Re: What about Berkeley shark and spark? by Boris Lublinsky

There are 4 purely on Hadoop, plus Berkeley's, which is not Hadoop but Spark, plus Amazon's new offering, Redshift. This space is really crowded. I think the reason is that many people see this as a crown jewel, and everyone is using it as a main differentiator.

Re: SQL compliant? by Boris Lublinsky

Thanks. Fixed

Re: What about Berkeley shark and spark? by Faisal Waris

I believe Spark leverages Scala's delimited continuations to skip materialization of intermediate results for much faster throughput than Hadoop. This is a case where a language feature makes a significant difference to the final product. Another reason to move up from Java to something better.
