Open Source SQL-in-Hadoop Solutions: Where Are We?

by Michael Hausenblas on Dec 10, 2013

With Facebook recently releasing Presto as open source, the already crowded SQL-in-Hadoop market just became a tad more intricate. A number of open source tools are competing for the attention of developers: Hortonworks’ Stinger initiative around Hive, Apache Drill, Apache Tajo, Cloudera’s Impala, Salesforce’s Phoenix (for HBase) and now Facebook’s Presto.

Organizations already using Hadoop in production are demanding interactive SQL query support and smooth integration with existing BI tools. Vijay Madhavan (eBay) states in his blog post "SQL in Hadoop landscape":

Most of the current map-reduce based systems for analysis including current versions of Hive, Pig, Cascading work well in the non-interactive and batch SLA domain. Many products are attempting to support real-time and interactive SLAs by offering interactive "SQL in Hadoop" solutions.

Use cases for SQL-in-Hadoop solutions include interactive ad-hoc queries, reporting and visualization with BI systems such as MicroStrategy or Tableau, and joins across multiple data sources (e.g., behavioral data in HDFS that must be joined to demographic data in an RDBMS or other source).
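To make the multi-source use case more concrete, here is a minimal sketch of such a join. It does not use any of the engines discussed here: Python's built-in sqlite3 module stands in for both sources, with one in-memory database playing the role of the behavioral data (which would live in HDFS) and a second, attached database playing the role of the demographic RDBMS. All table, column and schema names are hypothetical and chosen for illustration only.

```python
import sqlite3


def clicks_per_region():
    """Join behavioral data with demographic data held in a separate database."""
    # "main" stands in for the behavioral data that would live in HDFS.
    conn = sqlite3.connect(":memory:")
    # A second, independent in-memory database stands in for the RDBMS.
    # ATTACH must run before any INSERT opens an implicit transaction.
    conn.execute("ATTACH DATABASE ':memory:' AS demographics")

    conn.execute("CREATE TABLE main.clicks (user_id INTEGER, page TEXT)")
    conn.execute("CREATE TABLE demographics.users (user_id INTEGER, region TEXT)")

    # Hypothetical sample rows, for illustration only.
    conn.executemany(
        "INSERT INTO main.clicks VALUES (?, ?)",
        [(1, "home"), (1, "cart"), (2, "home"), (3, "search")],
    )
    conn.executemany(
        "INSERT INTO demographics.users VALUES (?, ?)",
        [(1, "EU"), (2, "US"), (3, "EU")],
    )

    # One ad-hoc SQL statement spans both sources via schema-qualified names.
    rows = conn.execute(
        """
        SELECT u.region, COUNT(*) AS clicks
        FROM main.clicks AS c
        JOIN demographics.users AS u ON c.user_id = u.user_id
        GROUP BY u.region
        ORDER BY u.region
        """
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for region, clicks in clicks_per_region():
        print(region, clicks)
```

The point of the sketch is the shape of the query: a single SQL statement addresses tables in two different stores through qualified names, which is roughly what a user would expect from an engine that can query several sources in place.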

Many of these SQL-in-Hadoop solutions have certain aspects in common:

  1. At the metadata level, HCatalog/Hive Metastore appears to be establishing itself as the de facto standard for managing schemata across different data sources.
  2. Certain data formats, such as Parquet and ORC, are becoming increasingly popular for selected workloads.
  3. Most of the solutions seem to support a wide range of ANSI SQL (in different versions: 1992, 1999, 2003).

The points above should help users move between different SQL-in-Hadoop solutions without too much migration headache.

There are, however, also some notable differences:

  • Some of the solutions are Apache-backed and thus community-based (Stinger, Drill, Tajo), while others are owned by single entities (Impala, Phoenix, Presto).
  • Some are limited to querying data sources within the Hadoop ecosystem, while others are architecturally more flexible and can also query relational databases and NoSQL data stores in situ (Presto, Drill).
  • The operations allowed on the data also differ: some are pure (distributed) query engines, while others permit update operations.

In the past 10 to 18 months, more and more people and commercial entities have set out to provide low-latency, ad-hoc SQL access to data stored in Hadoop. Given the overlapping use cases and differing preferences in terms of environments, there is likely room for more than one SQL-in-Hadoop solution in the long run.

InfoQ.com and all content copyright © 2006-2014 C4Media Inc.