Mahout 0.3: Open Source Machine Learning
Apache Mahout, the open source machine learning project, announced version 0.3 in March, adding functionality and improving stability and performance. InfoQ spoke with co-founder and committer Grant Ingersoll and committer Ted Dunning of the Apache Mahout project.
The need for machine-learning techniques such as clustering, collaborative filtering, and categorization has increased steadily over the last decade, along with the number of solutions that need algorithms to transform vast amounts of raw data into relevant information.
As introduced by Grant Ingersoll, the Mahout project addresses:
- Clustering: grouping documents in a context-aware way lets you focus on specific clusters and stories without wading through many unrelated ones
- Recommendations (also known as collaborative filtering) are often used to recommend consumer items such as books, music, and movies, but they are also used in other applications where multiple actors need to collaborate to narrow down data
- Pattern matching (Naïve Bayes classifier and others) is used to label unseen documents. When a new document is classified, the words in the document are looked up in the model, probabilities are calculated, and the best result is output, usually along with a score indicating the confidence in the accuracy of the result
- Scalability: Mahout relies on Apache Hadoop to scale its algorithms
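The classification step described in the list above (look up words in the model, calculate probabilities, output the best label with a score) can be sketched in a few lines. This is a minimal illustration in plain Python, not Mahout's implementation: the function names, the add-one smoothing, and the toy training data are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict

def train(labeled_docs):
    """Build a toy model: per-label word counts, per-label document counts, vocabulary."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    vocab = set()
    for label, text in labeled_docs:
        words = text.lower().split()
        word_counts[label].update(words)
        doc_counts[label] += 1
        vocab.update(words)
    return word_counts, doc_counts, vocab

def classify(model, text):
    """Return (best_label, log_score) for an unseen document."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in doc_counts:
        total_words = sum(word_counts[label].values())
        # Start from the log prior of the label.
        score = math.log(doc_counts[label] / total_docs)
        for word in text.lower().split():
            if word not in vocab:
                continue  # words never seen in training carry no evidence
            # Add-one (Laplace) smoothing, an assumption of this sketch.
            count = word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocab)))
        scores[label] = score
    best = max(scores, key=scores.get)
    return best, scores[best]

# Hypothetical toy data, for illustration only.
model = train([
    ("sports", "the team won the match"),
    ("tech", "the new phone has a fast chip"),
])
label, score = classify(model, "fast phone")
print(label, score)
```

The score here is a log probability, so it is negative; values closer to zero indicate a better fit between the document and the label.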
Another important aspect of the Mahout solution is its set of tools for creating vector representations of textual data. This is the first step in enabling Mahout's learning algorithms to process a data set.
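To make the idea of a vector representation concrete, here is a minimal sketch that turns text into term-frequency vectors against a shared dictionary. It does not use Mahout's tools; the helper names `build_dictionary` and `to_vector` are hypothetical, and whitespace tokenization is a simplifying assumption.

```python
from collections import Counter

def build_dictionary(docs):
    """Assign every distinct word in the corpus a fixed vector index."""
    words = sorted({w for doc in docs for w in doc.lower().split()})
    return {word: i for i, word in enumerate(words)}

def to_vector(doc, dictionary):
    """Count how often each dictionary word occurs in the document."""
    counts = Counter(doc.lower().split())
    vec = [0] * len(dictionary)
    for word, n in counts.items():
        if word in dictionary:  # words outside the dictionary are dropped
            vec[dictionary[word]] = n
    return vec

docs = ["apache mahout scales", "mahout learns from data"]
dictionary = build_dictionary(docs)
print(to_vector("mahout mahout data", dictionary))  # [0, 1, 0, 0, 2, 0]
```

Once every document is a fixed-length numeric vector, clustering and classification algorithms can compare documents with ordinary vector arithmetic.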
The Mahout project was started by several people involved in the Apache Lucene (the open source search project) community with an active interest in machine learning algorithms for clustering and categorization. The community was initially driven by Ng et al.'s paper "Map-Reduce for Machine Learning on Multicore" but has since evolved to cover much broader machine-learning approaches.
Highlights of the new Apache Mahout release:
- New: math and collections modules based on the high-performance Colt library
- Faster Frequent Pattern Growth (FPGrowth) using FP-bonsai pruning
- Parallel Dirichlet process clustering (model-based clustering algorithm)
- Parallel co-occurrence-based recommender
- Parallel text document to vector conversion using LLR-based n-gram generation
- Parallel Lanczos SVD (Singular Value Decomposition) solver
- Shell scripts for easier running of algorithms, utilities and examples
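The LLR-based n-gram generation mentioned in the highlights rests on a log-likelihood ratio test: a word pair is kept as an n-gram when the two words occur together more often than chance would predict. The article does not show how Mahout computes this, so the sketch below is only the standard G² statistic on a 2x2 contingency table; the function name `llr` and the count layout are assumptions for illustration.

```python
import math

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 table of co-occurrence counts.

    k11: both words together      k12: first word without the second
    k21: second without the first k22: neither word
    """
    n = k11 + k12 + k21 + k22
    row = (k11 + k12, k21 + k22)
    col = (k11 + k21, k12 + k22)
    cells = (
        (k11, row[0], col[0]),
        (k12, row[0], col[1]),
        (k21, row[1], col[0]),
        (k22, row[1], col[1]),
    )
    total = 0.0
    for k, r, c in cells:
        if k > 0:  # a zero cell contributes nothing to the sum
            total += k * math.log(k * n / (r * c))
    return 2.0 * total

print(llr(10, 10, 10, 10))  # independent words: score is 0.0
print(llr(10, 0, 0, 10))    # words that always appear together: high score
```

A score near zero means the pair behaves like two independent words; a large score suggests the pair is a meaningful phrase worth keeping as a single feature.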
When asked what the most exciting feature in this release is, Ingersoll replied:
The addition of distributed Singular Value Decomposition (SVD) is pretty exciting as well as many utilities to make it easier for people to get their content into Mahout… the most exciting feature is actually a non-tangible one… the demonstration of the Mahout community reaching a critical mass of contributors and users. In the life of any open source project, the early stages can be very tenuous with just one or two people doing most of the work and if any one of those people stops or even slows down, the project can wither on the vine. I believe that Mahout has passed that threshold and has many people now actively contributing to build something truly exciting.
Future plans for the Mahout project include:
- Release of version 1.0 later this year
- A stable set of APIs from the 1.0 release onward
- Online learning capabilities, such as an implementation of the Stochastic Gradient Descent (SGD) algorithm
- An implementation of the Support Vector Machine (SVM) algorithm
The SGD and SVM implementations will be applicable to document mining and other applications involving text or repeated categorical data. Of particular interest, the SGD system will introduce the ability to build interaction variables on the fly.
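An interaction variable is a feature formed by combining two or more other variables, which lets a linear model capture effects that neither variable explains alone. The sketch below shows the general idea with a small logistic regression trained by SGD, where pairwise interaction features are generated at the moment each record is read. It is not Mahout's planned implementation, which the article does not describe; the names `features`, `predict`, and `sgd_train`, the learning rate, and the toy data are all assumptions.

```python
import math
from itertools import combinations

def features(record):
    """Yield base categorical features plus pairwise interactions, built on the fly."""
    base = sorted(f"{key}={value}" for key, value in record.items())
    yield from base
    for a, b in combinations(base, 2):
        yield f"{a}&{b}"

def predict(weights, record):
    """Logistic model: probability that the record belongs to class 1."""
    z = sum(weights.get(f, 0.0) for f in features(record))
    return 1.0 / (1.0 + math.exp(-z))

def sgd_train(examples, epochs=200, rate=0.5):
    """Update the weights one example at a time (online learning)."""
    weights = {}
    for _ in range(epochs):
        for record, label in examples:
            error = label - predict(weights, record)
            for f in features(record):
                weights[f] = weights.get(f, 0.0) + rate * error
    return weights

# Hypothetical toy data: the label depends on the combination of a and b,
# not on either variable alone, so base features are not enough.
examples = [
    ({"a": 0, "b": 0}, 0),
    ({"a": 0, "b": 1}, 1),
    ({"a": 1, "b": 0}, 1),
    ({"a": 1, "b": 1}, 0),
]
weights = sgd_train(examples)
for record, label in examples:
    print(record, label, round(predict(weights, record), 3))
```

Because the interaction features are derived while each record streams through, nothing has to be precomputed or stored up front, which suits the online setting the roadmap describes.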