Competition between Real-time Hadoop Implementations Heats Up
Ever since the publication of Google’s Dremel paper, the Hadoop community has been trying to build similar functionality around Hadoop. First came OpenDremel, which is now part of Apache Drill, a project that entered the Apache Incubator last year. Then, also last year, Cloudera introduced Impala, which is currently in beta and is part of the Cloudera Hadoop distribution, CDH 4.1.
The new contender in this space is the Stinger Initiative, introduced by Hortonworks last week into Apache's incubation process, with the following aim:
Enabling Hive to answer human-time use cases (i.e. queries in the 5-30 second range) such as big data exploration, visualization, and parameterized reports without needing to resort to yet another tool to install, maintain and learn can deliver a lot of value to the large community of users with existing Hive skills and investments.
Hortonworks aims to reach this goal by:
· Making Hive more SQL compliant, including supporting SQL types which are missing in Hive and subqueries in the ‘where’ clause.
· Optimizing Hive’s execution plan
· Supporting a new Hadoop columnar format called ORCFile (similar to the columnar formats used by Dremel, Drill and Cloudera’s Trevni)
· Introducing a new runtime framework, Tez, based on YARN
Tez, Hindi for “speed”, currently under incubation vote in Apache, is:
… general-purpose, highly customizable framework that simplifies creation of data-processing tasks across both small scale (low-latency) and large-scale (high throughput) workloads in Hadoop. It generalizes the MapReduce paradigm to a more powerful framework by providing the ability to execute a complex DAG (directed acyclic graph) of tasks for a single job so that projects in the Apache Hadoop ecosystem such as Apache Hive, Apache Pig and Cascading can meet requirements for human-interactive response times and extreme throughput at petabyte scale (clearly MapReduce has been a key driver in achieving this).
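To make the idea of "a complex DAG of tasks for a single job" concrete, here is a minimal sketch in Python. It is not the Tez API: the function `run_dag`, the task names, and the sample data are all hypothetical, chosen only to show how a query plan with two scans, a join and an aggregation can run as one job in dependency order, instead of as a chain of separate MapReduce jobs.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_dag(tasks, deps):
    """Run every task once, after all tasks it depends on.

    tasks: task name -> function taking a dict {upstream name: upstream result}
    deps:  task name -> set of upstream task names (may omit tasks with none)
    Returns (results by task name, execution order).
    """
    graph = {name: set(deps.get(name, ())) for name in tasks}
    results, order = {}, []
    for name in TopologicalSorter(graph).static_order():
        upstream = {d: results[d] for d in graph[name]}
        results[name] = tasks[name](upstream)
        order.append(name)
    return results, order


# Hypothetical query plan: scan two tables, join them, aggregate the result.
def scan_orders(_):
    return [("a", 10), ("b", 5), ("a", 7)]  # (user id, amount)


def scan_users(_):
    return {"a": "Alice", "b": "Bob"}  # user id -> name


def join(up):
    users = up["scan_users"]
    return [(users[uid], amount) for uid, amount in up["scan_orders"]]


def aggregate(up):
    totals = {}
    for name, amount in up["join"]:
        totals[name] = totals.get(name, 0) + amount
    return totals


TASKS = {
    "scan_orders": scan_orders,
    "scan_users": scan_users,
    "join": join,
    "aggregate": aggregate,
}
DEPS = {"join": {"scan_orders", "scan_users"}, "aggregate": {"join"}}

if __name__ == "__main__":
    results, order = run_dag(TASKS, DEPS)
    print(order)
    print(results["aggregate"])
```

In this sketch intermediate results are passed directly between tasks in memory. A real engine would distribute the tasks across a cluster, but the dependency structure is the part the quote describes.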
All three current “real-time” Hadoop query implementations (Drill, Impala and now Stinger) either already are or will shortly become open source projects, and will be able to leverage community support and input to solve the important problem of real-time Hadoop querying.
What about the new Greenplum Hadoop distribution (PivotalHD)?
by
Christian Tzolov
How would you compare the new Greenplum Hadoop platform (www.greenplum.com/blog/topics/hadoop/introducin...) to the rest of the Dremel-inspired projects? Greenplum claims to have 10x better performance (for real-time queries) over the other Hadoop distributions:
"Based on its tests, EMC is claiming response time improvements ranging from 10x to 600x faster than SQL interfaces for Hadoop. EMC provided its own benchmarks comparing Hawq to Hive as well as Cloudera's Impala."
What about Berkeley shark and spark?
by
Michael Kimber
1. Why does no one mention Berkeley shark and spark?
shark.cs.berkeley.edu/
Both seem to be in active commercial use and, from the looks of it, improve Hadoop performance to real-time levels.
2. Why so little collaboration in this space? If you add in Salesforce's recent open sourcing of Phoenix (HBase SQL), that's about 5 attempts to do a similar thing!
Re: What about the new Greenplum Hadoop distribution (PivotalHD)?
by
Boris Lublinsky
Abel posted about it here: www.infoq.com/news/2013/02/Pivotal-HD-SQL-Hadoop
Re: What about Berkeley shark and spark?
by
Boris Lublinsky
Re: What about Berkeley shark and spark?
by
Faisal Waris