Concurrent Releases Pattern, a Machine Learning DSL for Hadoop
Cascading is a popular application framework, a pattern language for enterprise data workflows. Cascading allows users to define complex data processing flows and create sophisticated data-oriented frameworks. These frameworks can be used as domain-specific languages (DSLs) for scripting.
The latest addition to the Cascading extensions is Pattern, a new machine learning DSL that combines the power of PMML with Cascading's Hadoop-based workflow execution. PMML (Predictive Model Markup Language) is an XML-based markup language developed by the Data Mining Group (DMG) that lets applications define predictive analytics and data mining models and share those models with other PMML-compliant applications. The purpose of Pattern is to provide a common execution platform for the many popular analytics frameworks, such as SAS, R, MicroStrategy, and Oracle, that can export predictive models in PMML.
According to Cascading:
Currently, Pattern is supported both on local Hadoop clusters and on the AWS cloud, using the Elastic MapReduce (EMR) version of Hadoop.
The machine learning algorithms supported by Pattern include:
- Random Forest algorithm
- K-Means Clustering
- Hierarchical Clustering
- Linear Regression
- Logistic Regression
Other work in progress includes support for:
InfoQ had a chance to discuss Pattern with Chris K. Wensel, CTO and Founder of Concurrent, Inc.
InfoQ: Can you outline the major differences between Pattern and Apache Mahout, which is currently one of the most popular machine learning libraries for Apache Hadoop?
Wensel: First, Pattern supports PMML. That is, you can export a model from R as PMML, and Pattern will convert the PMML to a Cascading application.
Second, Pattern is based on Cascading. So debugging a Pattern application is the same as debugging a Cascading application. And when dealing with data at scale, you do a lot of debugging.
Third, you can include custom Cascading and Lingual (ANSI SQL on Cascading) workloads in the same application as the one running your PMML or custom ML models crafted by hand.
Just one application that can do ETL, data cleansing via SQL, scoring, and integrate with remote data sources that you can hand off to operations to put into production, along with unit tests, and the built in safety nets Cascading provides.
Seriously, it can’t get simpler than that. Well, it can, but we will announce that later this year.
InfoQ: Can you explain in more detail how Pattern works with R? What would one do in R, and at which point would one switch to Pattern?
Wensel: R is great for creating models. R does not run efficiently on Hadoop, but it does support PMML, a standards-based XML language for representing complex machine learning models.
So, export your model from R into PMML, pass the PMML to Pattern.
Additionally, R works great with the Lingual JDBC driver, so you can actually pull data out of Hadoop into R, using Lingual, to help test and craft models.
Here we have closed the loop. Hadoop -> Lingual -> SQL -> R -> PMML -> Pattern -> Hadoop. The parts you have to manage on a day-to-day basis are based on Cascading.
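To make the PMML hand-off more concrete, the following is a minimal sketch of what a PMML regression model carries and how a consumer could score a record from it. It does not use Pattern or Cascading: the class name `PmmlRegressionSketch`, the field names, and the coefficients are all hypothetical, the PMML is hand-written rather than exported from R, and the scorer assumes every predictor has the default exponent of 1.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

final class PmmlRegressionSketch {

    // Hand-written PMML with made-up coefficients (assumption for illustration);
    // a real file would be exported by the modeling tool.
    static final String PMML =
          "<PMML version=\"4.1\" xmlns=\"http://www.dmg.org/PMML-4_1\">"
        + "<Header description=\"hand-written example\"/>"
        + "<DataDictionary numberOfFields=\"3\">"
        + "<DataField name=\"x1\" optype=\"continuous\" dataType=\"double\"/>"
        + "<DataField name=\"x2\" optype=\"continuous\" dataType=\"double\"/>"
        + "<DataField name=\"y\" optype=\"continuous\" dataType=\"double\"/>"
        + "</DataDictionary>"
        + "<RegressionModel functionName=\"regression\">"
        + "<MiningSchema>"
        + "<MiningField name=\"x1\"/><MiningField name=\"x2\"/>"
        + "<MiningField name=\"y\" usageType=\"predicted\"/>"
        + "</MiningSchema>"
        + "<RegressionTable intercept=\"1.0\">"
        + "<NumericPredictor name=\"x1\" coefficient=\"2.0\"/>"
        + "<NumericPredictor name=\"x2\" coefficient=\"0.5\"/>"
        + "</RegressionTable>"
        + "</RegressionModel>"
        + "</PMML>";

    /** Scores one record: intercept + sum(coefficient * value) over the numeric predictors. */
    static double score(String pmml, Map<String, Double> row) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(pmml.getBytes(StandardCharsets.UTF_8)));
            Element table = (Element) doc.getElementsByTagName("RegressionTable").item(0);
            double result = Double.parseDouble(table.getAttribute("intercept"));
            NodeList predictors = table.getElementsByTagName("NumericPredictor");
            for (int i = 0; i < predictors.getLength(); i++) {
                Element predictor = (Element) predictors.item(i);
                String name = predictor.getAttribute("name");
                Double value = row.get(name);
                if (value == null) {
                    throw new IllegalArgumentException("missing input field: " + name);
                }
                result += Double.parseDouble(predictor.getAttribute("coefficient")) * value;
            }
            return result;
        } catch (IllegalArgumentException e) {
            throw e;
        } catch (Exception e) {
            // Wrap parser errors so callers do not need checked-exception handling.
            throw new IllegalStateException("could not read PMML", e);
        }
    }

    public static void main(String[] args) {
        Map<String, Double> row = new HashMap<>();
        row.put("x1", 2.0);
        row.put("x2", 4.0);
        // 1.0 + 2.0 * 2.0 + 0.5 * 4.0 = 7.0
        System.out.println(score(PMML, row));
    }
}
```

The point of the sketch is that the model travels as data: the tool that fitted the coefficients and the system that applies them only need to agree on the PMML document.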
InfoQ: You emphasize the test-driven development (TDD) nature of Pattern. Can you elaborate on specific TDD support in Pattern?
Wensel: Pattern consists of the PMML to Cascading parser, and a set of machine learning APIs for different model types. The models themselves can be run independently of Hadoop at speed from a JUnit test. Or, the PMML can be read and the results compared against a known data set to confirm the scores are as expected.
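The idea of comparing scores against a known data set can be sketched without Hadoop in a few lines. The example below is an illustration only, not Pattern's API and not a JUnit test: the class `KnownDataSetCheck`, the logistic regression stand-in model, its weights, and the reference rows are all hypothetical.

```java
import java.util.function.ToDoubleFunction;

final class KnownDataSetCheck {

    /** Hypothetical stand-in for a model: logistic regression with fixed weights. */
    static ToDoubleFunction<double[]> logistic(double intercept, double[] weights) {
        return row -> {
            double z = intercept;
            for (int i = 0; i < weights.length; i++) {
                z += weights[i] * row[i];
            }
            return 1.0 / (1.0 + Math.exp(-z));
        };
    }

    /** Counts the rows whose score differs from the expected score by more than the tolerance. */
    static int countMismatches(ToDoubleFunction<double[]> model,
                               double[][] rows, double[] expected, double tolerance) {
        int mismatches = 0;
        for (int i = 0; i < rows.length; i++) {
            double actual = model.applyAsDouble(rows[i]);
            if (Math.abs(actual - expected[i]) > tolerance) {
                mismatches++;
            }
        }
        return mismatches;
    }

    public static void main(String[] args) {
        ToDoubleFunction<double[]> model = logistic(0.0, new double[] {1.0, -1.0});
        // Reference rows and the scores recorded for them (made up for this sketch).
        double[][] rows = {{0.0, 0.0}, {1.0, 0.0}, {0.0, 1.0}};
        double[] expected = {0.5, 0.7310585786, 0.2689414214};
        System.out.println("mismatches: " + countMismatches(model, rows, expected, 1e-6));
    }
}
```

In a real test suite the same comparison would sit inside a test method, with the reference scores produced by the tool that trained the model, so that any drift between the exported model and its scoring code shows up as a failing test.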