Phoenix: Running SQL Queries on Apache HBase [Updated]
Salesforce.com has open sourced Phoenix, a Java layer enabling developers to run SQL queries on Apache HBase. Update: An interview on Phoenix with James Taylor, Lead Developer at Salesforce.com, has been appended.
In addition to myriads of SQL, NoSQL and NewSQL databases, Salesforce.com has announced project Phoenix, an SQL layer on top of Apache HBase, a column-oriented Big Data store. Entirely written in Java, the code is available on GitHub, and provides a client-embeddable JDBC driver.
Phoenix is used internally by Salesforce.com for low latency queries in the order of milliseconds for simple queries or seconds when tens of millions of rows are processed, according to the project’s description. Phoenix it is not used for map-reduce jobs as HBase is made for, but rather for accessing HBase data through a standardized language.
According to project creators, Phoenix beats Hive for a simple query spanning 10M-100M rows. It is also much faster than Impala and OpenTSDB for similar queries using HBase’s API, co-processors and custom filters:
The Phoenix query engine transforms your SQL query into one or more HBase scans, and orchestrates their execution to produce standard JDBC result sets. Direct use of the HBase API, along with coprocessors and custom filters, results in performance on the order of milliseconds for small queries, or seconds for tens of millions of rows.
Some of the most interesting features are:
- Embedded JDBC driver implements the majority of java.sql interfaces, including the metadata APIs.
- Allows columns to be modeled as a multi-part row key or key/value cells.
- Full query support with predicate push down and optimal scan key formation.
- DDL support: CREATE TABLE, DROP TABLE, and ALTER TABLE for adding/removing columns.
- Versioned schema repository. Snapshot queries use the schema that was in place when data was written.
- DML support: UPSERT VALUES for row-by-row insertion, UPSERT SELECT for mass data transfer between the same or different tables, and DELETE for deleting rows.
- Limited transaction support through client-side batching.
- Single table only - no joins yet and secondary indexes are a work in progress.
- Follows ANSI SQL standards whenever possible
Phoenix’ code was open sourced under a BSD-like license.
Following is an interview with James Taylor, the lead developer of Phoenix.
InfoQ: Why would you provide an SQL interface to a Non-SQL data store? There are lots of other SQL solutions out there.
JT: Existing SQL solution aren't typically horizontally scalable, so they hit a wall when the data size gets too big. As far as why we put a SQL interface on HBase, a NoSQL data store, there are several reasons:
- Use of a well-understood language like SQL makes it easier for people to use HBase. Rather than learn another proprietary API, they can just use the language they're used to to read and write their data.
- Writing in a higher level language like SQL reduces the amount of code you have to write. For example, using Phoenix you could write a query like this to get at your web stats data (and I wouldn't want to guess at how many lines that would be using the native HBase APIs, but it would be a lot):
- TRUNC(DATE,'DAY') DAY,
- SUM(CORE) TOTAL_CPU_Usage,
- MIN(CORE) MIN_CPU_Usage,
- MAX(CORE) MAX_CPU_Usage
- FROM WEB_STAT
- WHERE DOMAIN LIKE 'Salesforce%'
- GROUP BY TRUNC(DATE,'DAY');
- Putting an abstract like SQL between data access and runtime execution allows for lots of optimizations to take place when the query is executed. For example, in the case of a GROUP BY query, we can take advantage of a nifty feature in HBase called coprocessors. This feature allows us to execute Phoenix code on the HBase server. So the aggregation can be done on the server side, instead of the client side, thus reducing the amount of data exchanged between the client and server dramatically. In addition, Phoenix will parallelize the running of the GROUP BY on the client, by chunking up the scans based on ranges of your row key. By doing the work in parallel, you'll get your results back much quicker. All these optimizations take place without the user having to do anything but issue a query.
- Using an industry standard API like JDBC for Phoenix allows us to take advantage of existing tooling that works through this API. For example, you can use an off-the-shelf SQL client like SQuirrel (http://squirrel-sql.sourceforge.net/) to connection to and run SQL against your HBase cluster. See our Getting Started section here for more information: https://github.com/forcedotcom/phoenix/blob/master/README.md
InfoQ: Is there any performance evaluation? Does it provide faster response times? Does it scale better?
JT: You can see our performance comparison against some other NoSQL products/projects here: https://github.com/forcedotcom/phoenix/wiki/Performance
We haven't published any benchmarks of Phoenix versus existing relational technology (though you'd likely find others that have compared HBase against them), but as the number of rows and the width of rows increases, the NoSQL solutions will shine. It also depends a lot on *how* you're using a relational database: in a multi-tenant manner as is the case at Salesforce.com or a single tenant manner. HBase is very good at co-locating related data, depending on how you form your row keys, so this can have clear advantages for some multi-tenant use cases.
InfoQ: When do you think you will add joins?
JT: Joins are on our roadmap here: https://github.com/forcedotcom/phoenix/wiki#wiki-roadmap. We have some ground work already in place. Can't give you a firm timeline, because they are too many variables in-play, but we'll get them in there.