Apache Mahout: Highly Scalable Machine Learning Algorithms
The Apache Mahout project, a set of highly scalable machine-learning libraries, recently announced its first public release. InfoQ spoke with Grant Ingersoll, co-founder of Mahout and a member of the technical staff at Lucid Imagination, to learn more about this project and machine learning in general.
When asked to describe Mahout in more detail, Ingersoll said:
Mahout is a library aimed at delivering scalable machine learning tools under the Apache license. Our goal is to build a healthy, active community of users and contributors around practical, scalable, production-ready machine learning algorithms like, but not limited to, clustering, classification and collaborative filtering. We use Hadoop as a way of delivering on the scalability promise for many of the implementations, but we are not solely dependent on it. Many machine learning algorithms simply do not fit the Map Reduce model, so we will employ other means when appropriate.
Personally speaking, I hope Mahout does for machine learning what Apache Lucene and Solr have done for search. Namely, make it easy for anyone to build a production-quality, intelligent application that scales to fit their needs just as Lucene and Solr have made it possible for anyone to build a scalable search application. We have a ways to go in this regard, but the 0.1 release is a good first step in that direction.
To define machine learning, Ingersoll quoted Introduction to Machine Learning by Ethem Alpaydin: "Machine Learning is programming computers to optimize a performance criterion using example data or past experience."
The major features included in the initial release of Mahout are:
- Taste Collaborative Filtering - Based on the Taste project which was incorporated into Mahout, including examples and demo applications
- Distributed Clustering Implementations - Several clustering algorithms such as k-Means, Fuzzy k-Means, Dirichlet, Mean-Shift and Canopy are provided, along with examples of how to use each
- Naive Bayes Implementations - Implementations of both traditional Bayesian and Complementary Bayesian classification are included
- Distributed Watchmaker Implementation - A distributed fitness function implementation using the Watchmaker library, along with usage examples
- Apache Hadoop Integration - Most implementations are built on top of Hadoop for scalability
- Basic Matrix and Vector Tools - Sparse and dense implementations of both matrices and vectors are provided
A comprehensive feature list is also available.
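To give a feel for what one of the listed clustering algorithms does, here is a minimal single-machine sketch of k-Means in plain Python. It is an illustration only and does not use Mahout or Hadoop: the function names (`k_means`, `assign`, `centroid_of`) and the sample points are hypothetical, and seeding the centroids with the first k points is a simplification chosen to keep the run deterministic.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid_of(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def assign(points, centroids):
    """Group each point under the index of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)),
                      key=lambda i: squared_distance(p, centroids[i]))
        clusters[nearest].append(p)
    return clusters


def k_means(points, k, max_iterations=50):
    # Assumption: seed with the first k points so the result is repeatable.
    centroids = [tuple(p) for p in points[:k]]
    for _ in range(max_iterations):
        clusters = assign(points, centroids)
        # Move each centroid to the mean of its cluster; keep it if the cluster is empty.
        updated = [centroid_of(c) if c else centroids[i]
                   for i, c in enumerate(clusters)]
        if updated == centroids:  # no centroid moved, so the algorithm has converged
            break
        centroids = updated
    return centroids, assign(points, centroids)


# Hypothetical data: two visually obvious groups of 2-D points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = k_means(points, k=2)
print(clusters)
```

The assignment step and the averaging step are independent per point and per cluster, which is roughly why this algorithm maps well onto a map phase and a reduce phase when the data is too large for one machine.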
When asked to describe sample applications for some of these algorithms, Ingersoll said that Taste collaborative filtering recommends items a user is likely to enjoy based on their preferences, with movie recommendations being a typical case. Clustering groups arbitrary data into sets of similar items, such as grouping related news stories. Classification assigns items to predefined categories, with the most common example being the labeling of email as junk or not.
The use of Mahout on the Amazon Elastic MapReduce cloud was also touched upon, with Ingersoll mentioning that work to get Mahout running on the cloud is in progress and that Mahout is a natural fit for the cloud:
Many of the big players in search and social networking are already using Map Reduce (and other distributed approaches) and machine learning to drive their applications. Mahout, in the long run, should make the ability to build these types of applications even easier and cheaper by reducing the startup costs and licensing fees associated with obtaining machine learning capabilities and know-how. Furthermore, by working to build a community of users where anyone is welcome to contribute, we think we will be around for a long time.
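The movie-recommendation use case mentioned earlier can be sketched in a few lines of plain Python. This is a simplified user-based collaborative filter, not the Taste API: the `similarity` and `recommend` functions, the distance-based similarity measure, and the ratings data are all assumptions made for illustration.

```python
from math import sqrt


def similarity(a, b):
    """Similarity in (0, 1] based on Euclidean distance over co-rated items."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0  # nothing in common, so no evidence of similar taste
    distance = sqrt(sum((a[item] - b[item]) ** 2 for item in shared))
    return 1.0 / (1.0 + distance)


def recommend(ratings, user, top_n=3):
    """Rank items the user has not rated by similarity-weighted average score."""
    totals, weights = {}, {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = similarity(ratings[user], other_ratings)
        if sim <= 0:
            continue
        for item, score in other_ratings.items():
            if item in ratings[user]:
                continue  # only predict scores for unseen items
            totals[item] = totals.get(item, 0.0) + sim * score
            weights[item] = weights.get(item, 0.0) + sim
    predicted = {item: totals[item] / weights[item] for item in totals}
    return sorted(predicted.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


# Hypothetical ratings on a 1-5 scale.
ratings = {
    "alice": {"Alien": 5, "Blade Runner": 4},
    "bob": {"Alien": 5, "Blade Runner": 4, "Dune": 5, "Amelie": 1},
    "carol": {"Alien": 1, "Blade Runner": 2, "Dune": 1, "Amelie": 5},
}
recs = recommend(ratings, "alice")
print(recs)
```

Because alice's ratings match bob's closely and differ sharply from carol's, bob's opinions dominate the weighted average and "Dune" ranks first. A production recommender would also need to handle sparse data, rating bias, and scale, which is the gap a library like Mahout aims to fill.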
When asked about future plans for Mahout, Ingersoll said:
First and foremost is getting Mahout known so that people can try it out and give us feedback to improve it. Because it is open source, it is sometimes difficult to know exactly what is going to happen because so many great ideas come from seemingly out of the blue; however, I can tell you my personal wish list:
- More demos and documentation, especially info on how to run on EC2
- More algorithms. I'd particularly like to see a linear regression implementation and neural networks implementations, amongst others, because those are familiar to a lot of people.
- Solidify the APIs so that we can work towards a 1.0 release such that people can reliably upgrade to a new release without having to make major changes to their code.
- Obtain a variety of performance metrics so that people can know what they are likely to see in their implementation.