InfoQ

Apache Mahout: Highly Scalable Machine Learning Algorithms

Posted by Ryan Slobojan on Apr 23, 2009

The Apache Mahout project, a set of highly scalable machine-learning libraries, recently announced its first public release. InfoQ spoke with Grant Ingersoll, co-founder of Mahout and a member of the technical staff at Lucid Imagination, to learn more about this project and machine learning in general.

When asked to describe Mahout in more detail, Ingersoll said:

Mahout is a library aimed at delivering scalable machine learning tools under the Apache license. Our goal is to build a healthy, active community of users and contributors around practical, scalable, production-ready machine learning algorithms like, but not limited to, clustering, classification and collaborative filtering. We use Hadoop as a way of delivering on the scalability promise for many of the implementations, but we are not solely dependent on it. Many machine learning algorithms simply do not fit the Map Reduce model, so we will employ other means when appropriate.

Personally speaking, I hope Mahout does for machine learning what Apache Lucene and Solr have done for search. Namely, make it easy for anyone to build a production-quality, intelligent application that scales to fit their needs just as Lucene and Solr have made it possible for anyone to build a scalable search application. We have a ways to go in this regard, but the 0.1 release is a good first step in that direction.

To explain what machine learning is, Ingersoll quoted Introduction to Machine Learning by Ethem Alpaydin: "Machine Learning is programming computers to optimize a performance criterion using example data or past experience."

Major features included in the initial release of Mahout are:

  • Taste Collaborative Filtering - Based on the Taste project which was incorporated into Mahout, including examples and demo applications
  • Distributed Clustering Implementations - Several clustering algorithms such as k-Means, Fuzzy k-Means, Dirichlet, Mean-Shift and Canopy are provided, along with examples of how to use each
  • Naive Bayes Implementations - Implementations of both traditional Bayesian and Complementary Bayesian classification are included
  • Distributed Watchmaker Implementation - A distributed fitness function implementation using the Watchmaker library, along with usage examples
  • Apache Hadoop Integration - Most implementations are built on top of Hadoop for scalability
  • Basic Matrix and Vector Tools - Sparse and dense implementations of both matrices and vectors are provided

A comprehensive feature list is also available.
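
To give a feel for what a clustering algorithm such as k-Means does, here is a minimal single-machine sketch in plain Python. It is not Mahout's API or its Hadoop-based implementation; the function names `assign` and `kmeans` and the sample points are assumptions made for illustration. The algorithm alternates between an assignment step (each point goes to its nearest centroid) and an update step (each centroid moves to the mean of its points), and that two-step structure is what lends itself to a map/reduce formulation.

```python
import math


def assign(points, centroids):
    """Assignment step: put every point in the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)),
                      key=lambda i: math.dist(p, centroids[i]))
        clusters[nearest].append(p)
    return clusters


def kmeans(points, centroids, max_iterations=20):
    """Plain k-means on equal-length tuples; stops once centroids stop moving."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iterations):
        clusters = assign(points, centroids)
        # Update step: each centroid becomes the mean of its members.
        # An empty cluster keeps its previous centroid.
        updated = [
            tuple(sum(axis) / len(members) for axis in zip(*members))
            if members else old
            for members, old in zip(clusters, centroids)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(points, centroids)


# Hypothetical sample data: two obvious groups of 2-D points.
points = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (10.0, 9.0)]
centroids, clusters = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
print(centroids)  # [(1.0, 1.5), (9.5, 9.0)]
```

The initial centroids are passed in explicitly so the result is deterministic; a real system would also need a strategy for choosing them.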

When asked for sample applications of these algorithms, Ingersoll said that Taste's collaborative filtering recommends items a user is likely to enjoy based on their preferences, with movie recommendations as a typical example. Clustering groups arbitrary data into sets of similar items, such as grouping related news stories. Classification assigns items to predefined categories, the most common example being the labeling of email as junk mail or not. Ingersoll also touched on running Mahout on Amazon Elastic MapReduce, mentioning that work to get Mahout running on the cloud is in progress and that Mahout is a natural fit for the cloud:

Many of the big players in search and social networking are already using Map Reduce (and other distributed approaches) and machine learning to drive their applications. Mahout, in the long run, should make the ability to build these types of applications even easier and cheaper by reducing the startup costs and licensing fees associated with obtaining machine learning capabilities and know-how. Furthermore, by working to build a community of users where anyone is welcome to contribute, we think we will be around for a long time.
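
The movie-recommendation use case mentioned earlier can be made concrete with a small user-based collaborative filtering sketch in plain Python. This is an illustration under stated assumptions, not Taste's actual API: the function names `cosine_similarity` and `recommend`, the choice of cosine similarity, and the ratings data are all hypothetical. The idea is to score each item the user has not rated by averaging other users' ratings, weighted by how similar those users are to the target user.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two {item: rating} dicts (missing items count as 0)."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = math.sqrt(sum(r * r for r in a.values()))
    norm_b = math.sqrt(sum(r * r for r in b.values()))
    return dot / (norm_a * norm_b)


def recommend(ratings, user, top_n=3):
    """Rank items the user has not rated by similarity-weighted average rating."""
    scores, weights = {}, {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine_similarity(ratings[user], other_ratings)
        if sim <= 0:
            continue  # users with no overlap contribute nothing
        for item, rating in other_ratings.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
                weights[item] = weights.get(item, 0.0) + sim
    predicted = {item: scores[item] / weights[item] for item in scores}
    return sorted(predicted, key=predicted.get, reverse=True)[:top_n]


# Hypothetical ratings on a 1-5 scale.
ratings = {
    "alice": {"Alien": 5, "Blade Runner": 4},
    "bob": {"Alien": 5, "Blade Runner": 5, "Brazil": 4},
    "carol": {"Alien": 1, "Brazil": 1, "Amelie": 2},
}
print(recommend(ratings, "alice"))  # ['Brazil', 'Amelie']
```

Bob's tastes are much closer to Alice's than Carol's are, so his high rating for Brazil dominates the prediction for that item.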

When asked about future plans for Mahout, Ingersoll said:

First and foremost is getting Mahout known so that people can try it out and give us feedback to improve it. Because it is open source, it is sometimes difficult to know exactly what is going to happen because so many great ideas come from seemingly out of the blue; however, I can tell you my personal wish list:
  1. More demos and documentation, especially info on how to run on EC2
  2. More algorithms. I'd particularly like to see a linear regression implementation and neural networks implementations, amongst others, because those are familiar to a lot of people.
  3. Solidify the APIs so that we can work towards a 1.0 release such that people can reliably upgrade to a new release without having to make major changes to their code.
  4. Obtain a variety of performance metrics so that people can know what they are likely to see in their implementation.
