Mahout to Get Self-Optimizing Matrix Algebra Interface with Pluggable Backends for Spark and Flink

by Mikio Braun on Nov 21, 2014

At the recent GOTO conference in Berlin, Mahout committer Sebastian Schelter outlined recent advances in Mahout's ongoing effort to create a scalable foundation for data analysis that is as easy to use as R or Python.

Schelter said the main goal is to provide a simple Scala-based domain-specific language (DSL) similar to the matrix notation in languages like R, but with the ability to store large matrices distributed across a cluster and to run computations in parallel.
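To give a flavor of the style of notation described, here is a minimal, self-contained Scala sketch. This is not the actual Mahout API; the `Matrix` type below is hypothetical and exists only to illustrate how R-like operators such as `t` (transpose) and `%*%` (matrix multiplication) read in Scala.

```scala
// Hypothetical sketch, NOT the Mahout API: a tiny local matrix type that
// mimics R's notation, with `t` for transpose and `%*%` for matrix product.
case class Matrix(rows: Vector[Vector[Double]]) {
  def t: Matrix = Matrix(rows.transpose)
  def %*%(other: Matrix): Matrix = Matrix(
    rows.map(r =>
      other.rows.transpose.map(c =>
        r.zip(c).map(p => p._1 * p._2).sum))  // dot product of row and column
  )
}

val a = Matrix(Vector(Vector(1.0, 2.0), Vector(3.0, 4.0)))

// The Gram matrix A^T A, a common building block in machine learning,
// reads almost exactly like the corresponding mathematical expression.
val gram = a.t %*% a
println(gram.rows)  // Vector(Vector(10.0, 14.0), Vector(14.0, 20.0))
```

The point of such a DSL is that the same expression could transparently dispatch to a local or a distributed implementation, which is what Mahout's design aims for.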

The resulting library, Schelter said, will provide local and distributed matrices that can be used together seamlessly. The Mahout team designed the library so that it is not tied to a specific platform; instead, a pluggable backend can target different platforms.

Currently, development for Apache Spark has progressed furthest, but Apache Flink, another next-generation Big Data platform incubating at Apache, will also be considered, Schelter said.

A key aspect of the new architecture is that operations can be treated differently based, for example, on the sizes of the matrices involved, which opens up far-reaching optimization potential. The main design goal, according to Schelter, is to allow data scientists to write scalable code without having to worry too much about parallelization. A demo page gives a first impression of the resulting interface.
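As a rough illustration of this optimization idea (a hypothetical sketch, not Mahout's actual optimizer), a physical execution strategy for a matrix product might be chosen from the operands' dimensions along these lines:

```scala
// Hypothetical sketch of size-based operator selection; NOT Mahout's actual
// optimizer. The same logical matrix product can map to different physical
// plans depending on the operands' dimensions.
sealed trait MatMulStrategy
case object LocalInMemory    extends MatMulStrategy // both operands are small
case object BroadcastRight   extends MatMulStrategy // ship the small right operand to every worker
case object FullyDistributed extends MatMulStrategy // shuffle-based join of two large operands

def chooseStrategy(leftRows: Long, leftCols: Long, rightCols: Long,
                   localLimit: Long = 1000000L): MatMulStrategy = {
  val leftSize  = leftRows * leftCols
  val rightSize = leftCols * rightCols  // inner dimensions must match
  if (leftSize <= localLimit && rightSize <= localLimit) LocalInMemory
  else if (rightSize <= localLimit) BroadcastRight
  else FullyDistributed
}

println(chooseStrategy(1000L, 100L, 100L))        // LocalInMemory
println(chooseStrategy(10000000L, 100L, 10L))     // BroadcastRight
println(chooseStrategy(10000000L, 100L, 100000L)) // FullyDistributed
```

Because the user only writes the logical expression, such decisions can be made by the library at execution time, which is what lets data scientists ignore parallelization details.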

Apache Mahout originally started as a project to implement a number of machine learning algorithms on top of Hadoop. It covers algorithms for classification, clustering, recommendation, and learning document models. Until now, these algorithms have been based on Hadoop and the MapReduce computation model, rather than on more flexible models like the one provided by the Apache Spark project. Apache Spark has begun to develop its own machine learning library, MLlib, which currently covers fewer algorithms than Mahout but claims on its project homepage to be much faster, thanks to improvements such as keeping computations in memory and better support for iterative algorithms.

The move of Mahout away from relying on MapReduce alone comes at a time when there is a wide variety of alternative approaches to distributed computing.

Google itself began exploring alternative computation schemes some time ago, including Percolator, which allowed for more incremental updates of the Google search database, and Pregel, a system built for distributed execution of graph computations. Pregel has in turn led to open source projects like Apache Giraph and Stanford's GPS.

A further alternative is GraphLab, a toolbox developed at Carnegie Mellon University that provides a wide variety of distributed implementations of machine learning algorithms.
