Open Source SQL-in-Hadoop Solutions: Where Are We?


With Facebook recently releasing Presto as open source, the already crowded SQL-in-Hadoop market has become a little more complicated. A number of open source tools are competing for developers' attention: Hortonworks' Stinger initiative around Hive, Apache Drill, Apache Tajo, Cloudera’s Impala, Salesforce’s Phoenix (for HBase) and now Facebook’s Presto.

Organizations already using Hadoop in production are demanding interactive SQL query support and smooth integration with existing BI tools. Vijay Madhavan (eBay) states in his blog post "SQL in Hadoop landscape":

Most of the current map-reduce based systems for analysis including current versions of Hive, Pig, Cascading work well in the non-interactive and batch SLA domain. Many products are attempting to support real-time and interactive SLAs by offering interactive "SQL in Hadoop" solutions.

Use cases for SQL-in-Hadoop solutions include interactive ad-hoc queries, reporting and visualization with BI systems such as MicroStrategy or Tableau, and joins across multiple data sources (e.g., behavioral data in HDFS that must be joined to demographic data in an RDBMS or other source).
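To make the multi-source use case concrete, here is a minimal sketch of the kind of join it describes. It does not talk to Hadoop or to any of the engines named in this article: two in-memory SQLite tables stand in for the behavioral data and the demographic data, and every table and column name is a hypothetical chosen for illustration.

```python
import sqlite3

# Stand-in setup: in a real deployment "clicks" would live in HDFS and
# "demographics" in an RDBMS. All names here are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE clicks (user_id INTEGER, page TEXT);
    CREATE TABLE demographics (user_id INTEGER, country TEXT);
    INSERT INTO clicks VALUES (1, 'home'), (1, 'search'), (2, 'home'), (3, 'checkout');
    INSERT INTO demographics VALUES (1, 'DE'), (2, 'US'), (3, 'US');
""")

# Join behavioral rows to demographic rows and count page views per country.
query = """
    SELECT d.country, COUNT(*) AS page_views
    FROM clicks AS c
    JOIN demographics AS d ON d.user_id = c.user_id
    GROUP BY d.country
    ORDER BY d.country
"""
rows = conn.execute(query).fetchall()
print(rows)
```

An engine that can query several kinds of data sources would let an analyst write a query of this shape directly, rather than first copying one data set into the other system.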

Many of these SQL-in-Hadoop solutions have certain aspects in common:

  1. At the metadata level, HCatalog/Hive Metastore appears to be establishing itself as the de facto standard for managing schemata across different data sources.
  2. Certain data formats, such as Parquet and ORC, are becoming increasingly popular for selected workloads.
  3. Most of the solutions seem to support a wide range of ANSI SQL (in different versions: 1992, 1999, 2003).

These commonalities should help users move between different SQL-in-Hadoop solutions without too much migration headache.
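One way to picture this portability is application code that keeps its SQL to a common subset and depends only on a generic connection interface. The sketch below assumes a Python DB-API 2.0 style connection and uses SQLite as a stand-in; whether a particular engine's Python driver follows that interface is an assumption to verify, and the `page_views` table and the `views_per_page` helper are hypothetical.

```python
import sqlite3

def views_per_page(conn):
    """Run a plain aggregate query through any DB-API 2.0 style connection.

    The query avoids dialect-specific features (such as LIMIT) and bind
    parameters, since parameter styles differ between drivers.
    """
    cur = conn.cursor()
    cur.execute(
        "SELECT page, COUNT(*) AS views "
        "FROM page_views GROUP BY page ORDER BY page"
    )
    return cur.fetchall()

# SQLite stands in for a SQL-in-Hadoop engine; the table is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (page TEXT)")
conn.executemany(
    "INSERT INTO page_views VALUES (?)",
    [("home",), ("home",), ("search",)],
)
result = views_per_page(conn)
print(result)
```

Under these assumptions, switching engines would mean changing how `conn` is created while `views_per_page` stays the same. In practice, differences in SQL coverage between engines still need to be checked.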

However, there are also some notable differences:

  • Some of the solutions are Apache-backed and therefore community-based (Stinger, Drill, Tajo), while others are owned by single entities (Impala, Phoenix, Presto).
  • Some can only query data sources within the Hadoop ecosystem, while others are architecturally more flexible and can also query relational databases and NoSQL data stores in situ (Presto, Drill).
  • Another difference is the set of operations allowed on the data: some are pure (distributed) query engines, while others permit update operations.

Over the past 10 to 18 months, more and more people and commercial entities have set out to provide low-latency, ad-hoc SQL access to data stored in Hadoop. Although their use cases overlap, preferences in terms of environments differ, so there is likely room for more than one SQL-in-Hadoop solution in the long run.


Community comments

  • Sql on Hadoop

    by Sonam Gupta,

    Thanks for your post! With SQL-on-Hadoop technologies, it's possible to access big data stored in Hadoop by using the familiar SQL language. Users can plug in almost any reporting or analytical tool to analyze and study the data. Before SQL-on-Hadoop, accessing big data was restricted to the happy few. You had to have in-depth knowledge of technical application programming interfaces, such as the ones for the Hadoop Distributed File System, MapReduce or HBase, to work with the data. Now, thanks to SQL-on-Hadoop, everyone can use their favorite tool. For an organization, that opens up big data to a much larger audience, which can increase the return on its big data investment.