

Databricks Announces Spark SQL for Manipulating Structured Data Using Spark

by Matt Kapilevich on Apr 19, 2014 | Estimated reading time: 3 minutes

Databricks, the company behind Apache Spark, has announced a new addition to the Spark ecosystem called Spark SQL. Spark SQL is separate from Shark and does not use Hive under the hood.

With the advent of Hadoop, building a data warehouse for processing big data became easier. However, querying Hadoop still required writing MapReduce jobs, which in turn required specialized development skills and a non-trivial amount of effort. Hive solved this problem by providing a familiar SQL query engine on top of Hadoop that translates SQL into MapReduce jobs. The Spark ecosystem provides a similar SQL query engine called Shark. Shark still relies on Hive for query planning, but uses Spark instead of Hadoop during the physical execution phase. With Spark SQL, Databricks is investing in an alternative SQL engine, one that is divorced from Hive.

InfoQ reached out to Michael Armbrust and Reynold Xin, software engineers at Databricks, to learn more about Spark SQL.

InfoQ: When did the development of Spark SQL start?

Michael and Reynold: It all started when Michael joined Databricks back in November. He is the one spearheading the effort.

InfoQ: What were the main reasons that you decided to shift away from Hive and Shark, in favor of a native Spark SQL implementation?

Michael and Reynold: We are moving away from Shark’s current execution engine that depends on Hive and will use Spark SQL as the new execution engine for Shark. The main reason for the shift is to improve maintainability of the code base and in turn enable rapid development so our users can benefit from both performance optimizations and a better feature set.

Together, the team has written five or six query engines, including ones that power the most important databases in the world. This is our attempt to “do it right.”

The current Hive query optimizer has a lot of complexity built in to address the limitations of MapReduce, and much of that complexity simply doesn’t apply in the context of Spark. By designing our own query optimizer and execution framework, we have substantially simplified the design. Spark SQL presents a very clean abstraction for building query optimizers and distributed query engines, allowing us to rapidly improve the performance of the engine.

InfoQ: Tell us more about Catalyst. Is this the SQL execution engine behind Spark SQL?

Michael and Reynold: Catalyst is actually an implementation-agnostic optimizer framework, not an execution engine. It is developed as part of Spark SQL, but we envision other query engines might use it in the future. Spark SQL is a complete execution engine that leverages Catalyst.

Building a query optimizer is a pretty daunting task. Catalyst incorporates years of advanced research in building simpler yet more powerful query optimizers and makes this daunting task much easier. In Catalyst, optimization rules that would take thousands of lines in other systems are now in the tens of lines. Using simple, composable rules makes it much easier to reason about the correctness of the system.  This simplicity also makes the system easier to maintain, and more importantly, to improve.
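To make the flavor of these rules concrete, here is a minimal sketch of a Catalyst-style optimization rule written as a Scala pattern match over an expression tree. The names (Rule, LogicalPlan, Add, Literal, transformAllExpressions) follow the Catalyst alpha source, but the import paths are an assumption and changed in later Spark releases; treat this as an illustration, not the project's canonical example.

    import org.apache.spark.sql.catalyst.expressions.{Add, Literal}
    import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
    import org.apache.spark.sql.catalyst.rules.Rule
    import org.apache.spark.sql.catalyst.types.IntegerType

    // A tiny constant-folding rule: wherever a plan adds two integer
    // literals, replace the addition with a single precomputed literal.
    object FoldIntAddition extends Rule[LogicalPlan] {
      def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
        case Add(Literal(a: Int, IntegerType), Literal(b: Int, IntegerType)) =>
          Literal(a + b, IntegerType)
      }
    }

Because each rule is just a partial function handed to a generic tree-transformation framework, rules compose naturally and can be inspected and tested in isolation, which is what keeps them in the tens of lines rather than the thousands.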

InfoQ: You can create Spark RDDs with Spark SQL natively. Do you anticipate this becoming one of the primary methods for creating an input RDD?

Michael and Reynold: Spark SQL enables several new ways for developers to interact with their data. First, it allows them to apply schemas to their existing RDDs. This addition of schema information makes it possible to run SQL queries over RDDs, in addition to Spark’s already powerful repertoire of transformations.
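As a rough illustration of this schema-on-RDD workflow, the Spark 1.0 alpha documentation sketched a pattern along the following lines (run from the Spark shell, where sc is an existing SparkContext; method names such as registerAsTable belong to the alpha API and were renamed in later releases, and the input path is a placeholder):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    import sqlContext._  // implicit RDD-to-SchemaRDD conversion and sql()

    // Apply a schema to an ordinary RDD by mapping its rows onto a case class.
    case class Person(name: String, age: Int)
    val people = sc.textFile("people.txt")
      .map(_.split(","))
      .map(p => Person(p(0), p(1).trim.toInt))

    // Register the schema-carrying RDD as a table and query it with SQL.
    people.registerAsTable("people")
    val teenagers = sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")

    // The result is itself an RDD, so the usual transformations still apply.
    teenagers.map(t => "Name: " + t(0)).collect().foreach(println)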

This functionality is only one part of the story. Spark SQL also allows developers to read and write data from existing sources (Hive and Parquet now; Avro, HBase, Cassandra, etc. in the future).
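Continuing the sketch above, the alpha API exposed the Parquet path through methods like saveAsParquetFile and parquetFile (again Spark 1.0-era names, later replaced by a generic reader/writer interface):

    // Write the schema-carrying RDD out as Parquet; the schema travels
    // with the data in the file's metadata.
    people.saveAsParquetFile("people.parquet")

    // Load it back and query it like any other table.
    val parquetPeople = sqlContext.parquetFile("people.parquet")
    parquetPeople.registerAsTable("parquetPeople")
    val names = sql("SELECT name FROM parquetPeople")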

InfoQ: According to your blog post, Spark SQL is available in Spark 1.0 as an alpha component. When do you anticipate 1.0 will be released?

Michael and Reynold: We are getting really close to code freeze for Spark 1.0. If the QA period goes smoothly, Spark 1.0 should be released towards the end of this month.

 
