InfoQ

Lucene 2.2: Payloads, Function queries, and more speed

Posted by Ryan Slobojan on Jul 06, 2007 10:35 AM

Community: Java
Topics: Search, Open Source
Tags: Lucene

Lucene Java 2.2 is now available. Lucene is a high-performance, full-featured text search engine library written entirely in Java. There are several new features in this release, including:
  • Payloads - Allows you to associate arbitrary binary data with any term in the index
  • Function queries - Gives more control over how document scores are calculated (Incorporated from Solr)
  • "Point-in-time" searching over NFS - Brings snapshot-like functionality to NFS
  • New pre-analyzed field API - Lets you handle pre-analyzed Document fields without dummy analyzer code
  • Public Maven releases - The latest releases of all Lucene modules are now available through the Maven repository

InfoQ spoke with Grant Ingersoll, a committer and Project Management Committee (PMC) member for the Lucene project, to learn more about this release. During the discussion, he asked InfoQ to make it clear that his views and comments are his alone, and are not the official views of the Lucene PMC.

InfoQ learned that the 2.2 release of Lucene marks a shift towards a shorter, quarterly release cycle. Ingersoll believes that these more frequent releases will bring several benefits, including getting bug fixes and new features to the community more rapidly. The release process has also been streamlined and Maven support improved, so that future releases will reach Maven users more quickly.

InfoQ asked Ingersoll to describe the Payloads feature in more detail, and he said:

Payloads are a new feature to allow the storage of information in the index on a term by term basis. For instance, when indexing web pages, it may be useful to store extra information about a particular word, such as an associated URL or weighting factor based on some analysis of the text. In more advanced applications, it might be useful to store the part of speech of a word in order to score nouns as being more important than other parts of speech. My talk at ApacheCon Europe this year has a few slides on payloads [for those that] are interested.
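
To make the idea concrete, here is a minimal sketch in plain Java of what "information stored on a term by term basis" can look like. It is a toy in-memory index and does not use Lucene's payload classes; the names PayloadSketch, Posting, NOUN and OTHER, the one-byte part-of-speech encoding, and the "nouns count double" weighting are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy illustration only: this is NOT the Lucene payload API.
// Every name and the weighting scheme below are hypothetical.
final class PayloadSketch {
    // Assumed one-byte part-of-speech tags carried as the payload.
    static final byte NOUN = 1;
    static final byte OTHER = 0;

    // One occurrence of a term: document, position, and opaque payload bytes.
    static final class Posting {
        final int docId;
        final int position;
        final byte[] payload;

        Posting(int docId, int position, byte[] payload) {
            this.docId = docId;
            this.position = position;
            this.payload = payload;
        }
    }

    // term -> all occurrences of that term across documents
    private final Map<String, List<Posting>> postings = new HashMap<>();

    // Record one occurrence of a term together with its payload.
    void add(String term, int docId, int position, byte[] payload) {
        postings.computeIfAbsent(term, t -> new ArrayList<>())
                .add(new Posting(docId, position, payload));
    }

    // Sum a per-occurrence weight for the term in one document.
    // Occurrences tagged NOUN count 2.0, everything else 1.0 (assumed weights).
    float score(String term, int docId) {
        float total = 0f;
        List<Posting> list = postings.getOrDefault(term, Collections.<Posting>emptyList());
        for (Posting p : list) {
            if (p.docId != docId) {
                continue;
            }
            boolean noun = p.payload.length > 0 && p.payload[0] == NOUN;
            total += noun ? 2.0f : 1.0f;
        }
        return total;
    }
}
```

In this sketch the payload is read back only at scoring time, which mirrors the use case Ingersoll describes: the bytes travel with each term occurrence, and the application decides how they influence ranking.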

He also described the new function queries, which originated in Solr:

The new search function package (org.apache.lucene.search.function) allows developers to use the actual content of a field in scoring a document. For instance, if you stored latitude and longitude information in fields on a document, you could then use the information in these fields to affect the ranking of a document. That is, if you were doing a search for Starbucks, you could rank those locations nearer to the user (assuming you know their location) higher in the results than those farther away. Another example might be to use price or margin information to affect the ranking (i.e. score products higher that have bigger margins for your company. Not saying I agree with this ethically, but it can be done)
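
The Starbucks example can be sketched in plain Java as follows. This does not call anything in org.apache.lucene.search.function; it only illustrates the underlying idea of letting stored field values (here latitude and longitude) adjust a text relevance score. The class name DistanceBoostSketch, the haversine distance, and the textScore / (1 + km) combination are assumptions chosen for illustration, not Lucene's formula.

```java
// Toy illustration only: NOT the Lucene function query API.
// The class name and the scoring formula are hypothetical.
final class DistanceBoostSketch {
    static final double EARTH_RADIUS_KM = 6371.0;

    // Great-circle distance in kilometres between two points given in degrees
    // (haversine formula).
    static double haversineKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
    }

    // Combine a text relevance score with the document's stored coordinates:
    // the farther the document's location is from the user, the lower the score.
    static double boostedScore(double textScore,
                               double docLat, double docLon,
                               double userLat, double userLon) {
        double km = haversineKm(docLat, docLon, userLat, userLon);
        return textScore / (1.0 + km);
    }
}
```

With two locations that match the query text equally well, the one closer to the user ends up with the higher combined score. The same shape works for a price or margin field: replace the distance term with whatever field value should influence ranking.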

Ingersoll was then asked what users could expect from the next release of Lucene. He indicated that there will be significant improvements in indexing performance as a result of some new memory management techniques led by Michael McCandless. He also mentioned that the recent releases of Lucene have added a number of performance enhancements, and that users will want to try them out for themselves. Finally, Ingersoll noted that Java 5 support and more flexibility in the indexing process are potential future features of Lucene.

A full changelog is available, listing all of the bug fixes, features and optimizations in this release. As with previous releases, Lucene 2.2 can read and import indexes from earlier versions of Lucene; once converted, however, the index is no longer readable by earlier versions (e.g. 2.1).

2 comments

MG4J
Jul 7, 2007 7:47 AM by Vic C

Which is faster: Solr, MG4j, Lucene, SQL text src? .V

Re: MG4J
Jun 30, 2008 12:12 PM by berkay NiQuiL
