Open Source SQL-in-Hadoop Solutions: Where Are We?
With Facebook recently releasing Presto as open source, the already crowded SQL-in-Hadoop market just became a tad more intricate. A number of open source tools are competing for the attention of developers: Hortonworks Stinger initiative around Hive, Apache Drill, Apache Tajo, Cloudera’s Impala, Salesforce’s Phoenix (for HBase) and now Facebook’s Presto.
Organizations already using Hadoop in production are demanding interactive SQL query support and a smooth integration with existing BI tools. Vijay Madhavan (eBay) states in his blog post SQL in Hadoop landscape:
Most of the current map-reduce based systems for analysis including current versions of Hive, Pig, Cascading work well in the non-interactive and batch SLA domain. Many products are attempting to support real-time and interactive SLAs by offering interactive "SQL in Hadoop" solutions.
Use cases for SQL-in-Hadoop solutions include supporting interactive ad-hoc queries, supporting reporting/visualization using BI systems like MicroStrategy/Tableau, and multi-source data (e.g.: behavioral data in HDFS must be joined to demographic data in an RDBMS or other source).
Many of these SQL-in-Hadoop solutions have certain aspects in common:
- On the metadata level it seems that HCatalog/Hive Metastore establishes itself as the de-facto standard for managing schemata across different datasources.
- Then, there are certain data formats, such as Parquet and ORC, which—for selected workloads—are becoming increasingly popular and more widely used in the wild.
- Most of the solutions seem to support a wide range of ANSI SQL (in different versions: 1992, 1999, 2003).
Above points should help users to move between different SQL-in-Hadoop solutions without too much migration headache.
But, there are also some notable differences as shown below:
- Some of the solutions are Apache-backed and with that community-based (Stinger, Drill, Tajo) while others are owned by single entities (Impala, Phoenix, Presto).
- Further, some are limited in terms of datasources they can query to the Hadoop ecosystem, while others are from an architectural perspective more flexible and also allow to query relational databases and NoSQL data stores in-situ (Presto, Drill).
- Another difference is the operations allowed on the data: some are pure (distributed) query engines while others permit update operations.
In the past 10 to 18 months more and more people and commercial entities have decided to give it a try and realised a low-latency, ad-hoc SQL access to data stored in Hadoop. However, due to overlapping use cases and preferences in terms of environments there is likely room for more than one SQL-in-Hadoop solution, in the long run.
Martin Thompson Jul 27, 2014