Mahout 0.3: Open Source Machine Learning

Posted by Gilad Manor on Apr 19, 2010


Apache Mahout, the open source machine learning project, announced version 0.3 in March, adding functionality and improving stability and performance. InfoQ spoke with co-founder and committer Grant Ingersoll and committer Ted Dunning of the Apache Mahout project.

The need for machine-learning techniques like clustering, collaborative filtering, and categorization has grown steadily over the last decade, along with the number of applications that need algorithms to turn vast amounts of raw data into relevant information.

The Mahout project, as introduced by Grant Ingersoll, addresses:

  • Clustering: grouping documents in a context-aware way lets you focus on specific clusters and stories without wading through a lot of unrelated ones
  • Recommendations (also known as collaborative filtering): often used to recommend consumer items such as books, music, and movies, but also used in other applications where multiple actors need to collaborate to narrow down data
  • Pattern matching (Naïve Bayes classifier and others): used to label unseen documents. When a new document is classified, the words in the document are looked up in the model, probabilities are calculated, and the best result is output, usually along with a score indicating the confidence in the accuracy of the result

Mahout builds on Apache Hadoop to scale these algorithms.
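
The classification flow described above (look up the document's words in the model, calculate probabilities, output the best label with a score) can be sketched roughly as follows. This is a minimal single-machine illustration in Python, not Mahout's implementation; the function names `train` and `classify`, the whitespace tokenization, and the add-one smoothing are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """Build a model of per-label word counts from (label, text) pairs."""
    word_counts = defaultdict(Counter)  # label -> word -> count
    doc_counts = Counter()              # label -> number of training documents
    vocabulary = set()
    for label, text in labeled_docs:
        words = text.lower().split()
        word_counts[label].update(words)
        doc_counts[label] += 1
        vocabulary.update(words)
    return word_counts, doc_counts, vocabulary


def classify(model, text):
    """Return (best_label, score); score is the normalized posterior estimate."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    log_scores = {}
    for label in doc_counts:
        # Log prior: fraction of training documents carrying this label.
        log_p = math.log(doc_counts[label] / total_docs)
        label_total = sum(word_counts[label].values())
        for word in text.lower().split():
            # Look the word up in the model; add-one smoothing (an assumption
            # of this sketch) keeps unseen words from zeroing the product.
            count = word_counts[label][word]
            log_p += math.log((count + 1) / (label_total + len(vocabulary)))
        log_scores[label] = log_p
    # Convert log scores into values that sum to 1 across labels.
    top = max(log_scores.values())
    exp_scores = {k: math.exp(v - top) for k, v in log_scores.items()}
    norm = sum(exp_scores.values())
    best = max(exp_scores, key=exp_scores.get)
    return best, exp_scores[best] / norm
```

Calling `classify(train(pairs), "some new text")` returns the winning label together with a value between 0 and 1 that plays the role of the confidence score mentioned above.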

Another important part of Mahout is its set of tools for creating vector representations of textual data. This is the first step in enabling Mahout's learning algorithms to process a body of text.
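
To make the idea of a vector representation concrete, here is a small Python sketch of one common scheme, term-frequency vectors over a fixed dictionary. It is an illustration only: Mahout's own tools are not shown here, and the helper names `build_dictionary` and `to_vector` are hypothetical.

```python
from collections import Counter


def build_dictionary(documents):
    """Assign each distinct word a fixed vector index, in sorted order."""
    words = sorted({w for doc in documents for w in doc.lower().split()})
    return {word: index for index, word in enumerate(words)}


def to_vector(document, dictionary):
    """Return a sparse term-frequency vector as {index: count}."""
    # Words missing from the dictionary are skipped rather than indexed.
    counts = Counter(w for w in document.lower().split() if w in dictionary)
    return {dictionary[word]: count for word, count in counts.items()}
```

Once every document is mapped to indices and counts like this, clustering or classification code can work on numbers alone and never needs to see the original text.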

The Mahout project was started by several people involved in the Apache Lucene (the open source search project) community with an active interest in machine learning algorithms for clustering and categorization. The community was initially driven by Ng et al.'s paper "Map-Reduce for Machine Learning on Multicore" but has since evolved to cover much broader machine-learning approaches.

Highlights of the new Apache Mahout release:

  • New math and collections modules based on the high-performance Colt library
  • Faster Frequent Pattern Growth (FPGrowth) using FP-bonsai pruning
  • Parallel Dirichlet process clustering (a model-based clustering algorithm)
  • Parallel co-occurrence-based recommender
  • Parallel conversion of text documents to vectors, using n-gram generation based on the log-likelihood ratio (LLR)
  • Parallel Lanczos SVD (Singular Value Decomposition) solver
  • Shell scripts that make it easier to run algorithms, utilities, and examples
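
For readers unfamiliar with the LLR score mentioned in the list above, the following Python sketch computes the standard log-likelihood ratio statistic for a 2x2 table of counts, the kind of score typically used to decide whether two words occur together more often than chance and so form a useful n-gram. It is a textbook formulation and is not taken from Mahout's code; the function name `llr` and the argument layout are assumptions.

```python
import math


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 contingency table.

    k11: word A and word B together    k12: A without B
    k21: B without A                   k22: neither A nor B
    """
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    cells = ((k11, k12), (k21, k22))
    score = 0.0
    for i in range(2):
        for j in range(2):
            observed = cells[i][j]
            if observed > 0:  # a zero cell contributes nothing to the sum
                expected = rows[i] * cols[j] / total
                score += observed * math.log(observed / expected)
    return 2.0 * score
```

A score near zero means the two words co-occur about as often as independence would predict. Larger scores indicate a stronger association, so candidate n-grams can be ranked by score and the weak ones discarded.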

When asked what the most exciting feature in this release is, Ingersoll replied:

The addition of distributed Singular Value Decomposition (SVD) is pretty exciting as well as many utilities to make it easier for people to get their content into Mahout… the most exciting feature is actually a non-tangible one… the demonstration of the Mahout community reaching a critical mass of contributors and users. In the life of any open source project, the early stages can be very tenuous with just one or two people doing most of the work and if any one of those people stops or even slows down, the project can wither on the vine. I believe that Mahout has passed that threshold and has many people now actively contributing to build something truly exciting.

Future plans for the Mahout project include:

  • Release of version 1.0 later this year
  • A stable set of APIs from the 1.0 release onwards
  • Online learning capabilities, such as an implementation of the Stochastic Gradient Descent (SGD) algorithm
  • An implementation of the Support Vector Machine (SVM) algorithm

The SGD and SVM implementations will be applicable to document mining and to other applications that involve text or repeated categorical data. Of particular interest, the SGD system will introduce the ability to build interaction variables on the fly.
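
As a rough illustration of what online learning with interaction variables can look like, the Python sketch below updates a logistic regression model one example at a time with SGD and derives product features from pairs of existing features as each example arrives. It does not describe the planned Mahout design; the names `with_interactions`, `predict`, and `sgd_step`, the choice of logistic regression, and the `"a*b"` naming scheme are all assumptions made for the example.

```python
import math


def with_interactions(features, pairs):
    """Copy features and add a product feature for each named pair."""
    expanded = dict(features)
    for a, b in pairs:
        if a in features and b in features:
            # The interaction variable is built only when both parts are present.
            expanded[a + "*" + b] = features[a] * features[b]
    return expanded


def predict(weights, features):
    """Logistic regression probability for a sparse feature dict."""
    z = sum(weights.get(name, 0.0) * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


def sgd_step(weights, features, target, learning_rate=0.1):
    """Update weights in place from a single example (online learning)."""
    error = target - predict(weights, features)
    for name, value in features.items():
        weights[name] = weights.get(name, 0.0) + learning_rate * error * value
```

In use, each incoming example would be passed through `with_interactions` and then to `sgd_step`, so the model never needs the full data set in memory and the interaction features never need to be materialized ahead of time.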

 

    We already used WEKA

    by Summer Daniel

    We already used WEKA, but still waiting for Mahout's new features!
