Concurrent Releases Lingual, a SQL DSL for Hadoop
Cascading is a popular application framework, a pattern language for enterprise data workflows. Cascading allows developers to define complex data processing flows and create sophisticated data-oriented frameworks. These frameworks can be used as Domain Specific Languages (DSLs) for scripting.
The latest addition to the Cascading extensions is Lingual, a new SQL-based DSL combining the power of Optiq, a dynamic data management framework, with Cascading's Hadoop-based execution. The purpose of Lingual is to lower the barrier to entry to Hadoop for developers and data analysts familiar with SQL, JDBC and traditional BI tools. It provides what the company calls "true SQL for Cascading and Hadoop".
According to Concurrent CTO and Founder Chris Wensel, Lingual's goal is to provide an ANSI-standard SQL interface designed to play well with all of the big-name Hadoop distributions, whether running on-site or in cloud environments. This enables a "cut and paste" capability for existing ANSI SQL code from traditional data warehouses, so users can access data that is locked away on a Hadoop cluster. It is also possible to query and export data from Hadoop directly into a wide range of BI tools.
With Lingual, companies can leverage existing skill sets and product investments by carrying them over to Hadoop via a standards-based technology. Analysts and developers familiar with SQL, JDBC or traditional BI tools can now quickly create and run Big Data applications on Hadoop, gaining significant productivity and time-to-market benefits.
Lingual is not going to provide sub-second response times on a petabyte of data on a Hadoop cluster. Rather, the company’s goal is to provide the ability to easily move applications onto Hadoop—the challenge there is really around moving from a relational or MPP database over to Hadoop.
The Lingual distribution includes:
· An ANSI-standard SQL parser and optimizer built on top of the Cascading framework
· A relational catalog view into large-scale unstructured data
· A SQL shell to test and submit queries to Hadoop
· A JDBC driver to integrate with existing BI tools and application servers
InfoQ had a chance to discuss Lingual with Chris K. Wensel, CTO and Founder of Concurrent, Inc.
InfoQ: Lingual looks very similar to Apache Hive. Can you describe the main advantages of Lingual compared to Hive?
Wensel: The primary goal of Lingual is to focus on ANSI compatibility. Hadoop is never used alone: you either need to get data off the HDFS bit-bucket into alternative tools like R or Mondrian, or you need to move existing workloads onto Hadoop to leverage its cost/performance benefits. In both cases you likely already know SQL, the "app" or query you are migrating is in SQL or, perhaps more importantly, the tools you are using only know SQL. So it's very important to offer a standards-based SQL interface.
To achieve this we have lots of tests. We currently have extracted 6,000 complex SQL queries from the Mondrian test suite, and we already have 90% coverage and plan to add more from popular tools.
Lingual is not intended to be an ad-hoc query tool that provides human scale response times. For that, we recommend using a proper distributed MPP style database. I don’t recommend using Hadoop for what it wasn’t intended for.
That said, we do offer a standards-compliant JDBC driver, and ways to test queries against local data using Cascading's "local mode", which has no Hadoop dependencies, to speed up testing.
Beyond the goal of ANSI compliance, Lingual runs on top of Cascading, so any improvements to Cascading, or any new "planners" other than those provided for Hadoop and local in-memory processing, will be inherited by Lingual, along with Cascading's existing robustness, flexibility, extensibility, and familiarity to those companies that have already standardized on Cascading for computation.
InfoQ: From the existing description it is not clear how Lingual defines and maintains a relational catalog. Can you describe some of the implementation details? Does it require specially prepared files, or does it use mechanisms similar to Hive's SerDe to provide a mapping between existing data and table definitions?
Wensel: There will be a built-in "single user" catalog in the initial release. HCatalog integration and/or an alternative will be provided in the near term. Currently the metadata catalog is a trivial (optionally human-editable) JSON document that can be stored on a local file system or on HDFS (even S3), which allows for basic sharing.
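To make the idea concrete, here is a minimal Python sketch of what working with such a shareable JSON catalog document could look like. Note that the field names and layout below are hypothetical illustrations, not Lingual's actual catalog schema, which is not documented in this article.

```python
import json

# Hypothetical catalog layout; the real Lingual JSON schema may differ.
# A table entry maps a name to a data location plus column metadata.
catalog = {
    "schemas": {
        "working": {
            "tables": {
                "titles": {
                    # The identifier could point at local disk, HDFS, or S3.
                    "identifier": "hdfs://namenode/data/titles.csv",
                    "columns": ["TITLE", "CNT"],
                    "types": ["string", "int"],
                }
            }
        }
    }
}

# Serialize to a human-editable JSON document that can be shared...
doc = json.dumps(catalog, indent=2)

# ...and read it back, e.g. after another user edits the file by hand.
loaded = json.loads(doc)
table = loaded["schemas"]["working"]["tables"]["titles"]
print(table["columns"])  # ['TITLE', 'CNT']
```

The appeal of such a plain-JSON approach is that the catalog needs no server process: sharing it is just copying a file.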
As for reading/writing data, Lingual will support all the integrations Cascading supports (as will Cascalog, Scalding, etc.). This is all managed by the Lingual "catalog" command-line interface.
There is no need to use any "proprietary to Cascading" format to query data.
A user simply registers a file from the command line as a table, and any metadata (columns and types) will be discovered from the file if possible. Moving forward we will make it trivial to add newly supported data formats from the catalog tool as well.
InfoQ: What is the security model for Lingual? Is it based on file access permissions? Is it supported by the JDBC driver?
Wensel: There are no current plans to extend Hadoop’s current security model.