Databricks Announces Spark SQL for Manipulating Structured Data Using Spark
With the advent of Hadoop, building a data warehouse for processing big data became easier. However, querying Hadoop still required writing MapReduce jobs, which in turn required specialized development skills and a non-trivial amount of effort. Hive solved this problem by providing a familiar SQL query engine on top of Hadoop that translates SQL into MapReduce jobs. Spark provides a similar SQL query engine called Shark. Shark still relies on Hive for query planning, but uses Spark instead of Hadoop MapReduce during the physical execution phase. With Spark SQL, Databricks is investing in an alternative SQL engine, one that is divorced from Hive.
InfoQ reached out to Michael Armbrust and Reynold Xin, software engineers at Databricks, to learn more about Spark SQL.
InfoQ: When did the development of Spark SQL start?
Michael and Reynold: It all started when Michael joined Databricks back in November. He is the one spearheading the effort.
InfoQ: What were the main reasons you decided to shift away from Hive and Shark in favor of a native Spark SQL implementation?
Michael and Reynold: We are moving away from Shark’s current execution engine, which depends on Hive, and will use Spark SQL as the new execution engine for Shark. The main reason for the shift is to improve the maintainability of the code base and in turn enable rapid development, so our users can benefit from both performance optimizations and a better feature set.
Together the team has written five or six query engines, including ones that are powering the most important databases in the world. This is our attempt to “do it right.”
The current Hive query optimizer has a lot of complexity built-in to address the limitations of MapReduce, and many of those simply don’t apply in the context of Spark. By designing our own query optimizer and execution framework, we have substantially simplified the design. Spark SQL presents a very clean abstraction for building query optimizers and distributed query engines, allowing us to rapidly improve the performance of the engine.
InfoQ: Tell us more about Catalyst. Is this the SQL execution engine behind Spark SQL?
Michael and Reynold: Catalyst is actually an implementation-agnostic optimizer framework, not an execution engine. It is developed as part of Spark SQL, but we envision other query engines might use it in the future. Spark SQL is a complete execution engine that leverages Catalyst.
Building a query optimizer is a pretty daunting task. Catalyst incorporates years of advanced research in building simpler yet more powerful query optimizers and makes this daunting task much easier. In Catalyst, optimization rules that would take thousands of lines in other systems are now in the tens of lines. Using simple, composable rules makes it much easier to reason about the correctness of the system. This simplicity also makes the system easier to maintain, and more importantly, to improve.
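To give a flavor of what rules "in the tens of lines" look like, here is a hedged sketch of the style Catalyst enables: query plans are trees, and a rule is a small Scala pattern match that rewrites subtrees. The `Expr`, `Literal`, and `Add` names below are illustrative stand-ins, not Catalyst's actual expression classes.

```scala
// Illustrative stand-ins for a query engine's expression tree.
sealed trait Expr
case class Literal(value: Int) extends Expr
case class Add(left: Expr, right: Expr) extends Expr

// A constant-folding "rule" in a handful of lines: recursively
// collapse Add(Literal, Literal) into a single Literal.
def fold(e: Expr): Expr = e match {
  case Add(l, r) =>
    (fold(l), fold(r)) match {
      case (Literal(a), Literal(b)) => Literal(a + b)
      case (fl, fr)                 => Add(fl, fr)
    }
  case other => other
}

// fold(Add(Literal(1), Add(Literal(2), Literal(3)))) yields Literal(6)
```

Because each rule is a self-contained pattern match like this one, rules compose and can be reasoned about (and tested) in isolation, which is the maintainability win the team describes.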
Michael and Reynold: Spark SQL enables several new ways for developers to interact with their data. First, it allows them to apply schemas to their existing RDDs. This addition of schema information makes it possible to run SQL queries over RDDs, in addition to Spark’s already powerful repertoire of transformations.
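In the Spark 1.0 alpha API, attaching a schema to an RDD looks roughly like the following sketch (method names such as `registerAsTable` reflect that early release and may change):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// The case class supplies the schema for the RDD.
case class Person(name: String, age: Int)

object SchemaRddExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("SchemaRddExample").setMaster("local"))
    val sqlContext = new SQLContext(sc)
    import sqlContext._  // implicit conversion from RDDs of case classes

    // An ordinary RDD of case-class instances, registered as a table.
    val people = sc.parallelize(Seq(Person("Alice", 34), Person("Bob", 19)))
    people.registerAsTable("people")

    // SQL now runs over the RDD; the result is itself an RDD.
    val adults = sql("SELECT name FROM people WHERE age >= 21")
    adults.collect().foreach(println)

    sc.stop()
  }
}
```

The result of `sql(...)` is just another RDD, so SQL queries and regular Spark transformations can be freely mixed in the same program.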
This functionality is only one part of the story. Spark SQL also allows developers to read and write data from existing sources (Hive and Parquet now; Avro, HBase, Cassandra, etc in the future).
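For the external-source side, the 1.0 alpha exposes Parquet support along these lines (a sketch; `parquetFile` and `saveAsParquetFile` are the method names in that release, and `people.parquet` is a hypothetical path):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object ParquetExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("ParquetExample").setMaster("local"))
    val sqlContext = new SQLContext(sc)
    import sqlContext._

    // Load a Parquet file; the schema is read from the file itself.
    val parquetData = sqlContext.parquetFile("people.parquet")
    parquetData.registerAsTable("parquetPeople")

    // Query it like any other table, then write results back as Parquet.
    val names = sql("SELECT name FROM parquetPeople")
    names.saveAsParquetFile("names.parquet")

    sc.stop()
  }
}
```

Hive tables are accessed analogously through a Hive-aware context, with additional sources planned as the interview notes.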
InfoQ: According to your blog post, Spark SQL is available as an alpha component in Spark 1.0. When do you anticipate 1.0 will be released?
Michael and Reynold: We are getting really close to the code freeze for Spark 1.0. If the QA period goes smoothly, Spark 1.0 should be released towards the end of this month.