
Competition between Real-time Hadoop Implementations Heats Up


Ever since the publication of Google’s Dremel paper, the Hadoop community has been trying to build similar functionality around Hadoop. The first attempt was OpenDremel, which is now part of Apache Drill; Drill became an Apache Incubator project last year. Also last year, Cloudera introduced Impala, which is currently in beta and is part of the Cloudera Hadoop distribution, CDH 4.1.

The newest contender in this space is the Stinger Initiative, which Hortonworks introduced into Apache's incubation process last week. Its stated aim:

Enabling Hive to answer human-time use cases (i.e. queries in the 5-30 second range) such as big data exploration, visualization, and parameterized reports without needing to resort to yet another tool to install, maintain and learn can deliver a lot of value to the large community of users with existing Hive skills and investments.

Hortonworks aims to reach this goal by:

· Making Hive more SQL compliant, including support for SQL types that are missing in Hive and for subqueries in the ‘where’ clause

· Optimizing Hive’s execution plan

· Supporting a new Hadoop columnar format called ORCFile (similar to the formats used by Dremel, Drill and Cloudera’s Trevni)

· Introducing a new runtime framework, Tez, based on YARN
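To illustrate why a columnar layout can help the analytic queries Stinger targets, here is a minimal Python sketch. It is not ORCFile's actual on-disk format; the sample records and the `to_columns` helper are hypothetical and only show the general idea that an aggregate over one field needs to touch just that field's values.

```python
# Hypothetical sample data in a row-oriented layout: one record per row.
rows = [
    {"user": "alice", "country": "US", "amount": 30},
    {"user": "bob", "country": "DE", "amount": 20},
    {"user": "alice", "country": "US", "amount": 5},
]


def to_columns(records):
    """Pivot row-oriented records into one list of values per field."""
    columns = {}
    for record in records:
        for field, value in record.items():
            columns.setdefault(field, []).append(value)
    return columns


columns = to_columns(rows)

# An aggregate over a single field reads only that column,
# instead of scanning every field of every row.
total_amount = sum(columns["amount"])
```

In a real columnar file the values of each column are also stored contiguously, which is what makes compression and skipping of unneeded columns effective; this sketch only models the logical layout.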

Tez (Hindi for “speed”), currently under an incubation vote at Apache, is described as:

… general-purpose, highly customizable framework that simplifies creation of data-processing tasks across both small scale (low-latency) and large-scale (high throughput) workloads in Hadoop. It generalizes the MapReduce paradigm to a more powerful framework by providing the ability to execute a complex DAG (directed acyclic graph) of tasks for a single job so that projects in the Apache Hadoop ecosystem such as Apache Hive, Apache Pig and Cascading can meet requirements for human-interactive response times and extreme throughput at petabyte scale (clearly MapReduce has been a key driver in achieving this).
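The idea of running a DAG of tasks as a single job can be sketched in a few lines of Python. This is an illustration of the concept only, not the Tez API: the `run_dag` function, the task names and the sample data are all hypothetical, and tasks run sequentially here, whereas a real framework would schedule independent tasks in parallel across a cluster.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_dag(tasks, deps):
    """Run each task after all of its upstream tasks have finished.

    tasks: task name -> function taking a dict of upstream results
    deps:  task name -> list of upstream task names
    """
    results = {}
    order = []
    for name in TopologicalSorter(deps).static_order():
        upstream = {dep: results[dep] for dep in deps.get(name, [])}
        results[name] = tasks[name](upstream)
        order.append(name)
    return results, order


def sum_by_key(pairs):
    """Sum the values for each key in a list of (key, value) pairs."""
    totals = {}
    for key, value in pairs:
        totals[key] = totals.get(key, 0) + value
    return totals


# Hypothetical query: total order amount per customer country.
orders = [("alice", 30), ("bob", 20), ("alice", 5)]
customers = {"alice": "US", "bob": "DE"}

tasks = {
    "scan_orders": lambda up: orders,
    "scan_customers": lambda up: customers,
    "join": lambda up: [
        (up["scan_customers"][user], amount)
        for user, amount in up["scan_orders"]
    ],
    "aggregate": lambda up: sum_by_key(up["join"]),
}
deps = {
    "scan_orders": [],
    "scan_customers": [],
    "join": ["scan_orders", "scan_customers"],
    "aggregate": ["join"],
}

results, order = run_dag(tasks, deps)
```

The join feeds the aggregate directly within one job. Expressed as chained MapReduce jobs, the same query would typically write the join output to HDFS and read it back before aggregating, which is the kind of overhead a DAG-based runtime is meant to avoid.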

All three current “real-time” Hadoop query implementations (Drill, Impala and now Stinger) either already are open source projects or will shortly become ones, so each will be able to leverage community support and input to solve the important problem of real-time Hadoop querying.

