Mahout 0.3: Open Source Machine Learning

Apache Mahout, the open source machine learning project, announced version 0.3 in March, with new functionality and improved stability and performance. InfoQ spoke with Grant Ingersoll, co-founder and committer, and Ted Dunning, committer, both from the Apache Mahout project.

The need for machine-learning techniques like clustering, collaborative filtering, and categorization has steadily increased over the last decade, along with the number of solutions that need algorithms to transform vast amounts of raw data into relevant information.

The Mahout project, as introduced by Grant Ingersoll, addresses:

Clustering documents together in a context-aware way lets you focus on specific clusters and stories without needing to wade through a lot of unrelated ones
Recommendations (also known as collaborative filtering) are often used to suggest consumer items such as books, music, and movies, but they are also used in other applications where multiple actors need to collaborate to narrow down data
Pattern matching (Naïve Bayes classifier and others) is used to label unseen documents. When a new document is classified, the words in the document are looked up in the model, probabilities are calculated, and the best result is output, usually along with a score indicating the confidence in the accuracy of the result

Mahout builds on Apache Hadoop for scalability.
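
The classification flow described above (look up the document's words in a model, calculate probabilities, output the best label with a score) can be illustrated with a toy multinomial Naïve Bayes classifier. This is a minimal single-machine sketch in plain Python, not Mahout's implementation; the function names `train` and `classify`, the whitespace tokenizer, and the add-one smoothing are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """Build a toy model from (label, text) pairs: per-label word counts."""
    word_counts = defaultdict(Counter)  # label -> word -> count
    doc_counts = Counter()              # label -> number of training documents
    vocab = set()
    for label, text in labeled_docs:
        words = text.lower().split()    # assumption: whitespace tokenization
        word_counts[label].update(words)
        doc_counts[label] += 1
        vocab.update(words)
    return word_counts, doc_counts, vocab


def classify(model, text):
    """Return (best_label, score); score is the normalized posterior."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    log_scores = {}
    for label in doc_counts:
        total_words = sum(word_counts[label].values())
        log_p = math.log(doc_counts[label] / total_docs)  # class prior
        for word in text.lower().split():
            if word not in vocab:
                continue  # skip words never seen during training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            count = word_counts[label][word] + 1
            log_p += math.log(count / (total_words + len(vocab)))
        log_scores[label] = log_p
    # Convert log scores to probabilities that sum to 1 across labels.
    top = max(log_scores.values())
    unnormalized = {l: math.exp(s - top) for l, s in log_scores.items()}
    norm = sum(unnormalized.values())
    best = max(unnormalized, key=unnormalized.get)
    return best, unnormalized[best] / norm
```

Calling `classify(train(pairs), "some new text")` returns the most probable label together with a value between 0 and 1 that plays the role of the confidence score mentioned above.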

Another important aspect of Mahout is its set of tools for creating vector representations of textual data. This is the first step in enabling Mahout's learning algorithms to process a data set.
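
To make the idea of a vector representation concrete, the sketch below turns text into sparse term-frequency vectors. It is an illustration only and does not use Mahout's tools; the helper names `build_dictionary` and `to_sparse_vector` and the whitespace tokenizer are hypothetical.

```python
from collections import Counter


def build_dictionary(docs):
    """Map each distinct token to a stable integer index."""
    dictionary = {}
    for doc in docs:
        for token in doc.lower().split():
            # Assign the next free index the first time a token is seen.
            dictionary.setdefault(token, len(dictionary))
    return dictionary


def to_sparse_vector(doc, dictionary):
    """Return {index: term frequency} for tokens found in the dictionary."""
    counts = Counter(t for t in doc.lower().split() if t in dictionary)
    return {dictionary[t]: c for t, c in counts.items()}
```

Each document becomes a mapping from dimension index to count, so documents of any length share one coordinate space that clustering or classification algorithms can work with.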

The Mahout project was started by several people involved in the Apache Lucene (the open source search project) community with an active interest in machine learning algorithms for clustering and categorization. The community was initially driven by Ng et al.'s paper "Map-Reduce for Machine Learning on Multicore" but has since evolved to cover much broader machine-learning approaches.

Highlights of the new Apache Mahout release:

New math and collections modules based on the high-performance Colt library
Faster Frequent Pattern Growth (FPGrowth) using FP-bonsai pruning
Parallel Dirichlet process clustering (model-based clustering algorithm)
Parallel co-occurrence-based recommender
Parallel text document to vector conversion using log-likelihood ratio (LLR) based n-gram generation
Parallel Lanczos SVD (Singular Value Decomposition) solver
Shell scripts for easier running of algorithms, utilities and examples
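
As an illustration of what a co-occurrence-based recommender does, the sketch below counts how often pairs of items appear together in users' histories and recommends unseen items with the highest summed counts. It runs on a single machine in plain Python and is not the parallel Mahout implementation; the names `cooccurrence_counts` and `recommend` and the simple summed-count scoring are assumptions.

```python
from collections import defaultdict
from itertools import permutations


def cooccurrence_counts(user_items):
    """Count how often each ordered pair of distinct items shares a user."""
    counts = defaultdict(lambda: defaultdict(int))
    for items in user_items.values():
        for a, b in permutations(set(items), 2):
            counts[a][b] += 1
    return counts


def recommend(user, user_items, counts, top_n=3):
    """Rank items the user has not seen by co-occurrence with seen items."""
    seen = set(user_items[user])
    scores = defaultdict(int)
    for item in seen:
        for other, c in counts[item].items():
            if other not in seen:
                scores[other] += c
    # Highest score first; ties broken by item name for a stable order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]
```

The counting step is independent per user, which is the property that makes this style of recommender a natural fit for a parallel, Hadoop-based implementation.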

When asked what the most exciting feature in this release is, Ingersoll replied:

The addition of distributed Singular Value Decomposition (SVD) is pretty exciting as well as many utilities to make it easier for people to get their content into Mahout… the most exciting feature is actually a non-tangible one… the demonstration of the Mahout community reaching a critical mass of contributors and users. In the life of any open source project, the early stages can be very tenuous with just one or two people doing most of the work and if any one of those people stops or even slows down, the project can wither on the vine. I believe that Mahout has passed that threshold and has many people now actively contributing to build something truly exciting.

Future plans for the Mahout project include:

The release of version 1.0 later this year
A stable set of APIs from the 1.0 release onwards
Online learning capabilities, such as a Stochastic Gradient Descent (SGD) algorithm implementation
A Support Vector Machine (SVM) algorithm implementation

The SGD and SVM implementations will be applicable to document mining and other applications that involve text or repeated categorical data. Of particular interest, the SGD system will introduce the ability to build interaction variables on the fly.
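
To show what online learning with interaction variables built on the fly might look like, here is a minimal sketch of logistic regression trained by stochastic gradient descent over categorical features. It is not the planned Mahout implementation; the class `OnlineLogisticRegression`, the helper `with_interactions`, the `a*b` naming of interaction features, and the learning rate are all assumptions.

```python
import math
from collections import defaultdict
from itertools import combinations


def with_interactions(features):
    """Expand active categorical features with all pairwise interactions."""
    names = sorted(features)
    expanded = dict.fromkeys(names, 1.0)
    for a, b in combinations(names, 2):
        # Interaction variables are created at call time, not precomputed.
        expanded[a + "*" + b] = 1.0
    return expanded


class OnlineLogisticRegression:
    def __init__(self, learning_rate=0.5):
        self.learning_rate = learning_rate
        self.weights = defaultdict(float)  # unseen features start at 0.0

    def _probability(self, expanded):
        z = sum(self.weights[n] * v for n, v in expanded.items())
        return 1.0 / (1.0 + math.exp(-z))

    def predict(self, features):
        """Return the estimated probability that the label is 1."""
        return self._probability(with_interactions(features))

    def update(self, features, label):
        """Take one SGD step on a single (features, label) example."""
        expanded = with_interactions(features)
        error = label - self._probability(expanded)
        for name, value in expanded.items():
            self.weights[name] += self.learning_rate * error * value
```

Because each example is processed once and then discarded, the model can learn from a stream of data; the interaction features let it capture cases where a label depends on a combination of values rather than on any single one.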
