Interactive SQL in Apache Hadoop with Impala and Hive

by Alex Giamas on Feb 07, 2014

Two open source projects, Impala and Hive (the latter as part of the Stinger project), are competing for the top spot in the race for interactive SQL in Big Data deployments. Cloudera recently announced that Impala is 6 to 69 times faster than Hive 0.12 and that it outperformed an unnamed DBMS by a factor of two on average. Being able to use interactive SQL in Hadoop clusters could mean that data only needs to enter HDFS once and can then be processed and analyzed there, without further data transfer.

Using a modified subset of the industry-standard TPC-DS benchmark, Cloudera claims that Impala is faster not only than Hive, which also runs on Hadoop, but also than a DBMS that uses native columnar storage. Dirk de Roos of IBM, among others, has criticized these results for using a subset of the TPC-DS query set instead of the full set. He also pointed out that using a single fact table in Cloudera’s tests, instead of the six that TPC-DS uses, could have helped achieve better performance.

Beyond raw speed, business users of Hadoop need the flexibility that SQL tools and standards offer. Supporting ANSI SQL, as Cascading Lingual does, can help business intelligence tools that use ODBC work interchangeably with Hive and Impala. Impala’s support for sub-queries, aggregate functions and windowed functions lags behind Hive’s. Installing Impala in an existing Hadoop cluster also means running a whole new set of processes in the cluster, whereas Hive, which lives in the JVM, can coexist in the same environment Hadoop runs on.

Hortonworks refers to Hive 0.12 as Stinger phase 2. Early adopters can install a technical preview of Stinger phase 3, in which Hive works with Tez, the application framework built on top of Apache Hadoop NextGen MapReduce (YARN). Another performance improvement is vectorized query execution, which is analogous to Impala’s runtime code generation. As Microsoft HDInsight engineer Eric Hanson explains, vectorized query execution can improve performance in CPU-intensive query scenarios. Stinger phase 3 can provide a performance boost and better resource utilization in several use cases.
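
To make the idea of vectorized execution more concrete, here is a minimal sketch in Python. It is not Hive's implementation: the function names, the batch size and the sample query (sum of price times quantity for rows meeting a minimum quantity) are all assumptions made for illustration. It only shows the structural difference between evaluating a query one row at a time and evaluating each operator over a whole batch of column values; in pure Python it will not reproduce the speedup, which in a real engine comes from tight loops over primitive arrays.

```python
from typing import List

BATCH_SIZE = 1024  # hypothetical batch size, chosen only for illustration


def row_at_a_time(prices: List[float], quantities: List[int], min_qty: int) -> float:
    """Evaluate filter and aggregate together, one row per iteration."""
    total = 0.0
    for price, qty in zip(prices, quantities):
        if qty >= min_qty:          # filter operator applied to a single row
            total += price * qty    # aggregate operator applied to a single row
    return total


def vectorized(prices: List[float], quantities: List[int], min_qty: int,
               batch_size: int = BATCH_SIZE) -> float:
    """Evaluate each operator over a whole batch of column values at once."""
    total = 0.0
    for start in range(0, len(prices), batch_size):
        price_batch = prices[start:start + batch_size]
        qty_batch = quantities[start:start + batch_size]
        # Filter operator: one loop over the quantity column that records
        # the positions of the rows that pass the predicate.
        selected = [i for i, qty in enumerate(qty_batch) if qty >= min_qty]
        # Aggregate operator: one loop over the selected positions only.
        total += sum(price_batch[i] * qty_batch[i] for i in selected)
    return total
```

Both functions return the same result for the same input; only the order in which the work is done differs.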

In the open source interactive SQL landscape, a new contender recently emerged from the same company that open sourced Hive five years ago. Facebook's Presto is based on ANSI SQL and promises ad hoc analysis at interactive speed at petabyte scale. The code is available on GitHub.
