InfoQ

News

Lucene 2.2: Payloads, Function queries, and more speed

Posted by Ryan Slobojan on Jul 06, 2007 10:35 AM

Community
Java
Topics
Open Source ,
Search
Tags
Lucene
Lucene Java 2.2 is now available. Lucene is a high-performance, full-featured text search engine library written entirely in Java. There are several new features in this release, including:
  • Payloads - Allows you to associate arbitrary binary data with any term in the index
  • Function queries - Gives more control over how document scores are calculated (Incorporated from Solr)
  • "Point-in-time" searching over NFS - Brings snapshot-like functionality to NFS
  • New pre-analyzed field API - Lets you handle pre-analyzed Document fields without dummy analyzer code
  • Public Maven releases - The latest release of all Lucene modules are now available through the Maven repository

InfoQ spoke with Grant Ingersoll, a committer and Project Management Committee (PMC) member for the Lucene project, to learn more about this release. During the discussion, he asked InfoQ to make it clear that his views and comments are his alone, and are not the official views of the Lucene PMC.

InfoQ learned that the 2.2 release of Lucene marks a shift towards a shorter, quarterly release cycle. Ingersoll believes that these more frequent releases will introduce several benefits, including making bug fixes and new features available to the community more rapidly. The release process has also been streamlined with Maven support improved so that future releases will be available more quickly to Maven users.

InfoQ asked Ingersoll to describe the Payloads feature in more detail, and he said:

Payloads are a new feature to allow the storage of information in the index on a term by term basis. For instance, when indexing web pages, it may be useful to store extra information about a particular word, such as an associated URL or weighting factor based on some analysis of the text. In more advanced applications, it might be useful to store the part of speech of a word in order to score nouns as being more important than other parts of speech. My talk at ApacheCon Europe this year has a few slides on payloads [for those that] are interested.
He also described the new Function queries which originated in Solr as:
The new search function package (org.apache.lucene.search.function) allows developers to use the actual content of a field in scoring a document. For instance, if you stored latitude and longitude information in fields on a document, you could then use the information in these fields to affect the ranking of a document. That is, if you were doing a search for Starbucks, you could rank those locations nearer to the user (assuming you know their location) higher in the results than those farther away. Another example might be to use price or margin information to affect the ranking (i.e. score products higher that have bigger margins for your company. Not saying I agree with this ethically, but it can be done)

Ingersoll was then asked what users could expect from the next release of Lucene. He indicated that there will be significant improvements in indexing performance as a result of some new memory management techniques led by Michael McCandless. He also mentioned that the recent releases of Lucene have added a number of performance enhancements, and that users will want to try them out for themselves. Finally, Ingersoll noted that Java 5 support and more flexibility in the indexing process are potential future features of Lucene.

A full changelog is available, listing all of the bugfixes, features and optimizations which are in this release. As with previous releases of Lucene, 2.2 is able to read and import indexes from previous versions of Lucene, however once converted the index is no longer readable by earlier versions of Lucene (e.g. 2.1).

MG4J by Vic C Posted Jul 7, 2007 7:47 AM
  1. Back to top

    MG4J

    Jul 7, 2007 7:47 AM by Vic C

    Which is faster: Solr, MG4j, Lucene, SQL text src? .V

Educational Content

Bindings, Platforms, and Innovation

This presentation focuses on the Internet and separating myth from fact, history from the future, and the mundane from the imaginative. Bob Frankston presents a vision of what could and should be.

Orchestrating Long Running Activities with JBoss / JBPM

This article explores the use of JBoss and jBPM to implement design solutions that effectively address the issue of orchestrating long running activities.

Neo4j - The Benefits of Graph Databases

This presentation covers the use of graph databases as an optimal solution for data that is difficult to fit in static tables, rapidly evolving data or data that has a lot of optional attributes.

Realistic about Risk: Software development with Real Options

This session introduces Real Options and shows how it can help in running your project. Real Options is a decision-making process that can be used to manage risk.

Communication Flexibility Using Bindings

This article discusses the use of bindings on services and references (including the instance of non-configured bindings) as the means to implement SCA communications in a Web and SOA environment.

Writing DSLs in Groovy

After a short introduction to DSLs, Scott Davis plays with the keyboard showing how to approach the creation of a DSL by typing working snippets of Groovy code that get executed.

Scaling Agile with C/ALM (Collaborative Application Lifecycle Management)

IBM Rational and InfoQ present, Scaling Agile with C/ALM, an eBook showing organizations how to become “finely tuned software delivery machines” by enabling team integration and scaling.

Concurrent Programming with Microsoft F#

Amanda Laucher presents a real life enterprise application written in F#. She shows actual code snippets, explaining design decisions and suggesting how to use some of the F# constructs.