“Drilling” Through the Big Data
With the recent explosion of everything related to Hadoop, it is no surprise that new projects and implementations related to the Hadoop ecosystem keep appearing. The Apache Drill project is a distributed system for interactive analysis of large-scale datasets, inspired by Google's Dremel. Drill is not trying to replace existing Big Data batch processing frameworks, such as Hadoop MapReduce, or stream processing frameworks, such as S4 or Storm. Rather, it fills an existing void: real-time interactive processing of large data sets.
Similar to Dremel, the Drill implementation is based on the processing of nested, tree-like data. In Dremel this data is based on protocol buffers, a nested, schema-based data model. Drill is planning to extend this data model by adding other schema-based implementations, for example Apache Avro, as well as schema-less data models such as JSON and BSON. In addition to querying a single data structure, Drill is also planning to support “baby joins”: joins against data structures small enough to be loaded into memory.
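To make the data model concrete, here is a minimal Python sketch (not Drill code) of nested, schema-less JSON records and a “baby join” against a small in-memory lookup table. The record layout, the field names, and the `baby_join` helper are all hypothetical and chosen purely for illustration.

```python
import json

# Hypothetical nested, schema-less records (JSON), one per line.
RAW = """\
{"doc_id": 1, "country": "us", "links": {"forward": [20, 40]}}
{"doc_id": 2, "country": "il", "links": {"forward": [80]}}
{"doc_id": 3, "country": "us", "links": {"forward": []}}
"""

# A small side table that comfortably fits in memory.
COUNTRY_NAMES = {"us": "United States", "il": "Israel"}


def baby_join(records, lookup, key, out_field):
    """Enrich each record from a small in-memory lookup table.

    This is a hash join whose build side is the small table.
    """
    for rec in records:
        joined = dict(rec)
        joined[out_field] = lookup.get(rec[key])
        yield joined


records = [json.loads(line) for line in RAW.splitlines()]
joined = list(baby_join(records, COUNTRY_NAMES, "country", "country_name"))

# Aggregate over a nested, repeated field: forward links per country.
links_per_country = {}
for rec in joined:
    name = rec["country_name"]
    count = len(rec["links"]["forward"])
    links_per_country[name] = links_per_country.get(name, 0) + count
```

Because the lookup table is held entirely in memory, the large data set is scanned only once and never has to be shuffled or sorted, which is what makes this kind of join cheap.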
While Dremel is based on a single SQL-like query language with nested data support, Drill is planning to introduce several (pluggable) query languages. The Drill implementation will include:
· DrQL – a SQL-like query language for nested data that is compatible with Google BigQuery/Dremel. As a result, BigQuery applications will work with Drill.
· Mongo Query Language
To achieve high-performance query processing, Drill is planning to introduce a specialized, distributed, extensible execution engine (similar, for example, to Dryad). This engine will provide data locality, fault tolerance, and both column-based and row-based hierarchical processing.
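The difference between row-based and column-based processing can be illustrated with a small Python sketch. It is not Drill's engine: the sample rows and the `flatten` and `to_columns` helpers are hypothetical, and the sketch assumes every record has the same fields. Dremel's actual columnar encoding also handles repeated and optional fields, which this sketch does not attempt.

```python
from collections import defaultdict

# Hypothetical row-oriented records with one level of nesting.
ROWS = [
    {"id": 1, "user": {"name": "ann", "age": 31}, "bytes": 120},
    {"id": 2, "user": {"name": "bob", "age": 45}, "bytes": 300},
    {"id": 3, "user": {"name": "cat", "age": 27}, "bytes": 80},
]


def flatten(record, prefix=""):
    """Yield (dotted_path, value) pairs for every leaf field of a record."""
    for key, value in record.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            yield from flatten(value, prefix=path + ".")
        else:
            yield path, value


def to_columns(rows):
    """Pivot rows into one list per leaf path (a column-based layout)."""
    columns = defaultdict(list)
    for row in rows:
        for path, value in flatten(row):
            columns[path].append(value)
    return dict(columns)


columns = to_columns(ROWS)

# Row-based scan: every full record is visited to read one field.
total_row_based = sum(row["bytes"] for row in ROWS)

# Column-based scan: only the "bytes" column is touched.
total_column_based = sum(columns["bytes"])
```

Both scans produce the same total, but the column-based one reads only the values it needs, which is why columnar layouts suit aggregations over a few fields of wide records.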
Drill’s initial goals are to specify the detailed requirements and architecture, and then develop the initial implementation including the execution engine and DrQL.
Significant work has already been completed to identify the initial requirements and define the overall system architecture. The next step is to implement Drill’s major components. Several components, such as the parser from OpenDremel, have been prototyped as part of other projects.
InfoQ had a chance to discuss Drill with one of the project’s core developers, Ted Dunning.
InfoQ: What is it about Drill that made you interested in this new project?
Dunning: This new project is an opportunity to build a community around a computational model that has been neglected so far in the open source community. This community will be based on the consensus of those interested in this kind of computing and is really even more important than any of the code ultimately produced by the project. There is an Apache saying that community is more important than code. We have definitely seen this with Hadoop where having the community and the consensus around interfaces has allowed a vibrant marketplace to spring up. We need to do the same with Dremel-like computations. It will not do for a single company to develop this sort of capability in private and just spring the result on the world. We need to build the consensus and community in the same way that it was built with Hadoop.
InfoQ: Which main applications do you see for Drill?
Dunning: I think that Drill will be highly complementary to map-reduce and real-time computation. These other kinds of computations can produce flat tables that are ready for the sort of ad hoc exploratory analysis that Drill style queries do well. This will facilitate the creation of dashboards with strong drill-down capabilities as well as nice visualizations.
I also think that running Drill on a single machine against data in non-columnar formats makes a lot of sense as well. This could wind up being an exciting alternative to scripting languages like AWK or Perl or Python where all you want to do is do aggregations or simple analyses on flat files.
InfoQ: Where do you see the main complexities/challenges in Drill implementation?
Dunning: Well, there are three main phases involved in executing a Drill query. First, you need to parse the query. This is more than just syntactic parsing since you also have to deal with some of the semantic subtleties that inhere in any real-world parsing language. Secondly, you need to change the logical plan produced by the parser into a physical plan that takes into account how many workers are available, what format the files are in, what storage substrate they are on, and what sorts of capabilities the execution engine has. Finally, you have to actually execute the physical plan efficiently. This may include compilation of parts of the physical plan into native code using some sort of JIT compiler. Cross-cutting all of these phases, you have the problem of how to coordinate the execution of these phases in a clustered environment.
In my view, the parser is probably nearly solved. An advanced query optimizer called Optiq is also already available, although it doesn’t understand nested data yet. That leaves the execution engine as the remaining major part of code, along with a lot of glue along the way. My guess is that we will have a very naïve execution engine at first but will have progressively more advanced engines over time.
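As an editorial illustration (not part of the interview and not Drill code), the three phases Dunning describes can be sketched as a toy pipeline in Python. The tiny query grammar, the `LogicalPlan` and `PhysicalPlan` classes, and the round-robin partitioning are assumptions made for this example only; a real planner would also account for file formats, storage, and engine capabilities.

```python
import re
from dataclasses import dataclass

QUERY_RE = re.compile(
    r"\s*SELECT\s+(SUM|COUNT)\((\w+)\)\s+FROM\s+(\w+)\s*", re.IGNORECASE
)


@dataclass
class LogicalPlan:
    """What to compute, independent of where or how."""
    aggregate: str
    column: str
    table: str


@dataclass
class PhysicalPlan:
    """How to compute it: one data partition per worker."""
    logical: LogicalPlan
    partitions: list


def parse(query):
    """Phase 1: turn a (toy) query string into a logical plan."""
    match = QUERY_RE.fullmatch(query)
    if match is None:
        raise ValueError(f"unsupported query: {query!r}")
    return LogicalPlan(match.group(1).upper(), match.group(2), match.group(3))


def plan(logical, tables, workers):
    """Phase 2: split the table's rows round-robin across the workers."""
    rows = tables[logical.table]
    partitions = [rows[i::workers] for i in range(workers)]
    return PhysicalPlan(logical, partitions)


def execute(physical):
    """Phase 3: compute a partial result per partition, then combine them."""
    col = physical.logical.column
    if physical.logical.aggregate == "SUM":
        partials = [sum(r[col] for r in part) for part in physical.partitions]
    else:  # COUNT of non-null values
        partials = [
            sum(1 for r in part if r.get(col) is not None)
            for part in physical.partitions
        ]
    return sum(partials)


TABLES = {"events": [{"bytes": 120}, {"bytes": 300}, {"bytes": 80}, {"bytes": 50}]}
physical = plan(parse("SELECT SUM(bytes) FROM events"), TABLES, workers=2)
result = execute(physical)
```

Here the partitions are processed sequentially in one process. In a clustered setting each partition would go to a separate worker, and the final combining step is where the coordination problem mentioned above shows up.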
InfoQ: Which companies are currently interested in Drill development?
Dunning: It is quite a list and growing rapidly. We have had participation on the mailing list and in the meetups from employees of all of the major makers of commercial Hadoop distributions. We have also had interest from other companies like Concurrent and Drawn to Scale and Big Data Craft and Hadapt. In fact, Big Data Craft even sent a representative from Israel to the first Drill User Group to present the progress of the OpenDremel project (which is now being folded into the Drill project). We have also had participation and interest from large companies like Intel and Twitter. The Bay Area Drill User group now has over 200 members and the developer mailing list is at about the same level.
So the interest has been very high from all kinds of companies.
I should point out, however, that participation in an Apache project is always by individuals. Who employs or supports the efforts of those individuals is outside the scope of Apache contribution. Some companies like to give the impression that they own or control certain Apache projects. Apache Drill is not going to be like that and will welcome the participation of all individuals.
InfoQ: When do you think the first Drill code will be available?
Dunning: That is up to the community. We will have the parser from OpenDremel in place very shortly. A modified planner and query optimizer should follow shortly after that. When a usable release is available is anyone’s guess. That said, there is lots for anyone to do if they would like to contribute or participate in the project.