Mahout to Get Self-Optimizing Matrix Algebra Interface with Pluggable Backends for Spark and Flink

At the recent GOTO conference in Berlin, Mahout committer Sebastian Schelter outlined recent advances in Mahout's ongoing effort to create a scalable foundation for data analysis that is as easy to use as R or Python.

Schelter said the main goal is to provide a simple Scala-based DSL (domain-specific language) that resembles the matrix notation of languages like R, but with the ability to store large matrices distributed across a cluster and to run computations on them in parallel.
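
The talk summary does not walk through the DSL itself, but the demo page shows notation along the following lines. This is a minimal sketch based on Mahout's Scala bindings; the package names and the exact operator set are taken from the then-current development branch and may differ between versions.

    import org.apache.mahout.math._
    import org.apache.mahout.math.scalabindings._
    import org.apache.mahout.math.scalabindings.RLikeOps._

    // Create a dense 2x3 matrix, one tuple per row, much like R's matrix()
    val a = dense((1, 2, 3), (4, 5, 6))

    // R-like operators: t is transpose, %*% is matrix multiplication
    val ata = a.t %*% a    // 3x3 Gram matrix
    val twice = a * 2      // elementwise scaling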

The resulting library, Schelter said, will provide local and distributed matrices that can be used with one another seamlessly. The Mahout team designed the library so that it is not tied to a specific platform; instead, it has a pluggable backend that can target different execution engines.
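
In this design, moving from a local to a distributed computation is mostly a matter of wrapping a matrix in a distributed row matrix (DRM) and supplying a backend context. The sketch below follows the DRM API of Mahout's Scala bindings at the time; how the implicit DistributedContext is obtained is backend-specific (on Spark, for example, via the Spark bindings) and is left as a placeholder here.

    import org.apache.mahout.math.drm._
    import org.apache.mahout.math.drm.RLikeDrmOps._

    // A backend context is supplied implicitly; on Spark it would come from
    // the Spark bindings, on Flink from the Flink bindings. The algebra
    // below stays the same either way.
    implicit val ctx: DistributedContext = ???  // backend-specific setup

    // Promote the in-core matrix `a` from the previous sketch to a
    // distributed row matrix (DRM) partitioned across the cluster
    val drmA = drmParallelize(a, numPartitions = 2)

    // Same R-like notation as for in-core matrices
    val drmAtA = drmA.t %*% drmA

    // Materialize the (small) result back on the driver as an in-core matrix
    val ataFromCluster = drmAtA.collect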

Currently, development of the Apache Spark backend has progressed furthest, but Apache Flink, another incubating next-generation Big Data platform, will also be considered, Schelter said.

A key aspect of the new architecture is that operations can be handled differently depending, for example, on the sizes of the matrices involved, which opens up far-reaching optimization potential. The main design goal, according to Schelter, is to let data scientists write scalable code without having to worry much about parallelization. A demo page gives a first impression of the resulting interface.
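
To make the size-based optimization idea concrete, the following is a purely hypothetical sketch of how a planner might pick a physical strategy for a distributed multiplication from the operand dimensions alone. It is not Mahout's actual optimizer; the names and the threshold are invented for illustration.

    // Hypothetical size-based operator selection (names and threshold invented)
    sealed trait MatMulPlan
    case object LocalInCore    extends MatMulPlan  // both operands fit in driver memory
    case object BroadcastRight extends MatMulPlan  // ship the small right operand to every worker
    case object JoinOnKeys     extends MatMulPlan  // full distributed join of two large operands

    def planMultiply(rowsA: Long, colsA: Long, colsB: Long,
                     inCoreThreshold: Long = 1000000L): MatMulPlan = {
      val sizeA = rowsA * colsA
      val sizeB = colsA * colsB  // B must be colsA x colsB for A %*% B
      if (sizeA <= inCoreThreshold && sizeB <= inCoreThreshold) LocalInCore
      else if (sizeB <= inCoreThreshold) BroadcastRight
      else JoinOnKeys
    }

    // e.g. a tall-and-skinny A times a small B favors broadcasting B:
    planMultiply(rowsA = 10000000L, colsA = 100, colsB = 100)  // BroadcastRight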

Apache Mahout originally started as a project to implement a number of machine learning algorithms on top of Hadoop. It covers algorithms for classification, clustering, recommendation, and learning document models. Until now, these algorithms have been based on Hadoop's MapReduce computation model rather than on more flexible models like the one provided by the Apache Spark project. Apache Spark has begun to develop its own machine learning library, MLlib, which currently covers fewer algorithms than Mahout but, according to its project homepage, is much faster thanks to improvements such as in-memory computation and better support for iterative algorithms.

Mahout's move away from relying on MapReduce alone comes at a time when a wide variety of alternative approaches to distributed computing are emerging.

Google itself began exploring alternative computation schemes some time ago, among them Percolator, which allows for incremental updates of Google's search index, and Pregel, a system built for the distributed execution of graph computations. Pregel has in turn led to open-source projects like Apache Giraph and Stanford's GPS.

GraphLab, developed at Carnegie Mellon University, is another toolbox that provides a wide variety of distributed implementations of machine learning algorithms.
