Lucene 2.2: Payloads, Function queries, and more speed

Lucene Java 2.2 is now available. Lucene is a high-performance, full-featured text search engine library written entirely in Java. There are several new features in this release, including:

Payloads - Allow you to associate arbitrary binary data with any term in the index
Function queries - Give more control over how document scores are calculated (incorporated from Solr)
"Point-in-time" searching over NFS - Brings snapshot-like functionality to NFS
New pre-analyzed field API - Lets you handle pre-analyzed Document fields without dummy analyzer code
Public Maven releases - The latest release of all Lucene modules is now available through the Maven repository

InfoQ spoke with Grant Ingersoll, a committer and Project Management Committee (PMC) member for the Lucene project, to learn more about this release. During the discussion, he asked InfoQ to make it clear that his views and comments are his alone, and are not the official views of the Lucene PMC.

InfoQ learned that the 2.2 release of Lucene marks a shift towards a shorter, quarterly release cycle. Ingersoll believes that these more frequent releases will bring several benefits, including making bug fixes and new features available to the community more rapidly. The release process has also been streamlined, and Maven support has been improved so that future releases will reach Maven users more quickly.

InfoQ asked Ingersoll to describe the Payloads feature in more detail, and he said:

Payloads are a new feature to allow the storage of information in the index on a term by term basis. For instance, when indexing web pages, it may be useful to store extra information about a particular word, such as an associated URL or weighting factor based on some analysis of the text. In more advanced applications, it might be useful to store the part of speech of a word in order to score nouns as being more important than other parts of speech. My talk at ApacheCon Europe this year has a few slides on payloads [for those that] are interested.
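To make the part-of-speech example concrete, here is a minimal sketch in plain Java of what per-term binary data could look like and how it might feed into scoring. It does not use the Lucene API. The class name PayloadSketch, the part-of-speech codes, the 5-byte layout, and the "nouns count double" rule are all assumptions made for illustration.

```java
import java.nio.ByteBuffer;

// Illustrative only: this is NOT the Lucene payload API.
// All names, codes, and the byte layout below are hypothetical.
final class PayloadSketch {
    static final byte OTHER = 0;
    static final byte NOUN = 1;
    static final byte VERB = 2;

    // Packs a part-of-speech code (1 byte) and a weight (4 bytes) into one payload.
    static byte[] encode(byte partOfSpeech, float weight) {
        return ByteBuffer.allocate(5).put(partOfSpeech).putFloat(weight).array();
    }

    // Reads the part-of-speech code back from the first byte.
    static byte partOfSpeech(byte[] payload) {
        return payload[0];
    }

    // Reads the weight back from bytes 1..4.
    static float weight(byte[] payload) {
        return ByteBuffer.wrap(payload, 1, 4).getFloat();
    }

    // Scales a base score by the stored weight, doubling it for nouns
    // (the factor 2 is an arbitrary choice for this sketch).
    static float score(float baseScore, byte[] payload) {
        float posBoost = partOfSpeech(payload) == NOUN ? 2.0f : 1.0f;
        return baseScore * posBoost * weight(payload);
    }

    public static void main(String[] args) {
        byte[] noun = encode(NOUN, 1.5f);
        byte[] verb = encode(VERB, 1.5f);
        System.out.println("noun score: " + score(2.0f, noun));
        System.out.println("verb score: " + score(2.0f, verb));
    }
}
```

In this sketch the same base score ends up higher for a term tagged as a noun than for one tagged as a verb, which is the kind of term-level weighting the quote describes.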

He also described the new Function queries, which originated in Solr:

The new search function package (org.apache.lucene.search.function) allows developers to use the actual content of a field in scoring a document. For instance, if you stored latitude and longitude information in fields on a document, you could then use the information in these fields to affect the ranking of a document. That is, if you were doing a search for Starbucks, you could rank those locations nearer to the user (assuming you know their location) higher in the results than those farther away. Another example might be to use price or margin information to affect the ranking (i.e., score products higher that have bigger margins for your company. Not saying I agree with this ethically, but it can be done).
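The location example can be sketched in plain Java as follows. This does not use org.apache.lucene.search.function; it only illustrates the idea of letting stored field values influence a score. The class name DistanceBoostSketch, the haversine distance calculation, and the 1 / (1 + distance) damping formula are assumptions chosen for the illustration, not anything the release prescribes.

```java
// Illustrative only: this is NOT the Lucene function query API.
// The class, the distance formula, and the damping rule are hypothetical choices.
final class DistanceBoostSketch {
    static final double EARTH_RADIUS_KM = 6371.0;

    // Great-circle distance in kilometres between two latitude/longitude points
    // (haversine formula, inputs in degrees).
    static double haversineKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
    }

    // Combines a text relevance score with a factor that shrinks as distance grows.
    static double score(double textScore, double distanceKm) {
        return textScore / (1.0 + distanceKm);
    }

    public static void main(String[] args) {
        double userLat = 40.0, userLon = -75.0;   // assumed user location
        double near = haversineKm(userLat, userLon, 40.01, -75.0);
        double far = haversineKm(userLat, userLon, 41.0, -75.0);
        System.out.println("near store score: " + score(1.0, near));
        System.out.println("far store score:  " + score(1.0, far));
    }
}
```

Two documents with the same text score then rank by proximity, with the nearer location scoring higher. A margin-based ranking would follow the same pattern with a different field value plugged into the score.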

Ingersoll was then asked what users could expect from the next release of Lucene. He indicated that there will be significant improvements in indexing performance as a result of new memory management techniques, an effort led by Michael McCandless. He also mentioned that the recent releases of Lucene have added a number of performance enhancements, and that users will want to try them out for themselves. Finally, Ingersoll noted that Java 5 support and more flexibility in the indexing process are potential future features of Lucene.

A full changelog is available, listing all of the bug fixes, features, and optimizations in this release. As with previous releases of Lucene, 2.2 can read and import indexes from previous versions of Lucene; however, once converted, the index is no longer readable by earlier versions of Lucene (e.g. 2.1).
