
Lucene 2.3: Large indexing performance improvements, new machine-learning project

by Ryan Slobojan on Jan 24, 2008

The Apache Lucene project, a high-performance full-featured text search engine library written entirely in Java, released version 2.3 today. InfoQ spoke with committer and Project Management Committee (PMC) member Grant Ingersoll to learn more about this release and the future plans for Lucene.

Ingersoll indicated that the largest change in this release is a new indexing algorithm, which uses new in-memory models to achieve large speed improvements. According to Ingersoll, simply swapping the existing Lucene 2.2 JAR for a Lucene 2.3 JAR produced indexing speed-ups of 500% in several tests. Other changes include:

  • Improved index management - The long pauses occasionally seen during indexing, caused by merging of internal index files, have been eliminated, and other approaches to managing the indexing process are now easy to implement
  • Object pooling - Document, Field and Token instances can now be reused during indexing and analysis, which both speeds up analysis and reduces the number of allocations during indexing
  • IndexReader reopening - Reopening an IndexReader to capture the latest changes in an index is now much faster with the new reopen() method, which loads only those index segments that have changed rather than reloading the entire index
  • Easier IndexWriter tuning - The setMaxBufferedDocs method has been supplanted by the more intuitive setRAMBufferSizeMB method
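
To make the segment-level reopening idea concrete, here is a minimal Java sketch of the general technique. It is a toy model under stated assumptions, not Lucene's implementation and not its API: the names SegmentReopenSketch, ToyIndex and ToyReader are hypothetical, and an "index" here is just a map from segment name to a version number.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model of segment-level reopening. Hypothetical names; NOT Lucene code. */
final class SegmentReopenSketch {

    /** Hypothetical stand-in for an on-disk index: segment name -> version. */
    static final class ToyIndex {
        final Map<String, Integer> segmentVersions = new LinkedHashMap<>();
    }

    /** Hypothetical reader snapshot that remembers which segment versions it holds. */
    static final class ToyReader {
        final Map<String, Integer> loaded;
        /** How many segments had to be loaded to build this snapshot. */
        final int segmentsLoadedFromDisk;

        ToyReader(Map<String, Integer> loaded, int segmentsLoadedFromDisk) {
            this.loaded = loaded;
            this.segmentsLoadedFromDisk = segmentsLoadedFromDisk;
        }
    }

    /** A full open loads every segment. */
    static ToyReader open(ToyIndex index) {
        return new ToyReader(new LinkedHashMap<>(index.segmentVersions),
                             index.segmentVersions.size());
    }

    /** A reopen loads only segments that are new or changed since the old snapshot. */
    static ToyReader reopen(ToyReader old, ToyIndex index) {
        int loadedNow = 0;
        for (Map.Entry<String, Integer> e : index.segmentVersions.entrySet()) {
            Integer seen = old.loaded.get(e.getKey());
            if (!e.getValue().equals(seen)) {
                loadedNow++; // new or modified segment: must be loaded
            }
        }
        boolean unchanged = loadedNow == 0
                && old.loaded.size() == index.segmentVersions.size();
        if (unchanged) {
            return old; // nothing to do, keep the existing snapshot
        }
        return new ToyReader(new LinkedHashMap<>(index.segmentVersions), loadedNow);
    }
}
```

In this sketch, opening a two-segment index loads two segments, while reopening after a third segment is added loads only that one. The real cost savings in Lucene depend on details this model leaves out, such as how segments are merged and how deletions are tracked.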

In addition, 2.3 is intended to be a drop-in replacement for 2.2, with no recompilation required. A comprehensive changelog is also available.

Ingersoll also discussed the future plans for Lucene, saying that the next release would be 2.9. The 2.9 release will be relatively minor, with items being marked as deprecated and other clean-up being performed in preparation for Lucene 3.0. The 3.0 version will be a major release which will involve moving the codebase to JDK 5 as the minimum supported version - the other major features of 3.0 are yet to be determined.

The Lucene community as a whole was also discussed, with Ingersoll indicating that Lucene and Solr are tightly integrated, and that Nutch, Tika and Hadoop also enjoy a fair amount of intercommunication. Ingersoll also described a new project named Mahout, which he is in the process of launching:

That will be a separate project, but may be beneficial to Lucene users. There are currently some patches in JIRA for Lucene that implement ML algorithms. The goal of this project is to provide commercial quality, large scale machine learning (ML) algorithms built on Hadoop under an Apache license. I have seen a fair amount of interest already, and hope to have this project underway in the coming month.

Ingersoll said that, by creating Mahout, he hoped to "further unlock the mysteries of Google and companies like it to provide these capabilities to the masses and spur on new innovation in the space" -- for those with an interest in this new project, there are both a project plan and an incubator proposal available.


Correction by Ryan Slobojan

After following up with Grant after the publishing of this item, I learned two things:
  • A 2.4 release is now planned before 2.9
  • Mahout was accepted as an Apache Lucene subproject this week (see lucene.apache.org/mahout/)

Ryan Slobojan

Lucene.Net by Mohan Kumar

Any releases on Lucene.Net?

