InfoQ

Lucene 2.2: Payloads, Function queries, and more speed

Posted by Ryan Slobojan on Jul 06, 2007

Lucene Java 2.2 is now available. Lucene is a high-performance, full-featured text search engine library written entirely in Java. There are several new features in this release, including:
  • Payloads - Allows you to associate arbitrary binary data with any term in the index
  • Function queries - Gives more control over how document scores are calculated (Incorporated from Solr)
  • "Point-in-time" searching over NFS - Brings snapshot-like functionality to NFS
  • New pre-analyzed field API - Lets you handle pre-analyzed Document fields without dummy analyzer code
  • Public Maven releases - The latest releases of all Lucene modules are now available through the Maven repository

InfoQ spoke with Grant Ingersoll, a committer and Project Management Committee (PMC) member for the Lucene project, to learn more about this release. During the discussion, he asked InfoQ to make it clear that his views and comments are his alone, and are not the official views of the Lucene PMC.

InfoQ learned that the 2.2 release of Lucene marks a shift towards a shorter, quarterly release cycle. Ingersoll believes that more frequent releases will bring several benefits, including getting bug fixes and new features to the community more rapidly. The release process has also been streamlined and Maven support improved, so that future releases will reach Maven users more quickly.

InfoQ asked Ingersoll to describe the Payloads feature in more detail, and he said:

"Payloads are a new feature to allow the storage of information in the index on a term by term basis. For instance, when indexing web pages, it may be useful to store extra information about a particular word, such as an associated URL or weighting factor based on some analysis of the text. In more advanced applications, it might be useful to store the part of speech of a word in order to score nouns as being more important than other parts of speech. My talk at ApacheCon Europe this year has a few slides on payloads [for those that] are interested."
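
The part-of-speech example can be illustrated with a minimal plain-Java sketch. It does not use the Lucene API and is not how Lucene stores payloads internally. The class `PayloadSketch`, its `Posting` type, the `NOUN` and `OTHER` codes, and the rule that a noun occurrence counts double are all hypothetical, chosen only to show how per-occurrence binary data could feed into scoring.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy in-memory index (NOT the Lucene API): every term occurrence carries a payload. */
class PayloadSketch {

    /** One occurrence of a term: document id, position, and opaque payload bytes. */
    static final class Posting {
        final int docId;
        final int position;
        final byte[] payload;

        Posting(int docId, int position, byte[] payload) {
            this.docId = docId;
            this.position = position;
            this.payload = payload;
        }
    }

    // Hypothetical part-of-speech codes stored in the single payload byte.
    static final byte OTHER = 0;
    static final byte NOUN = 1;

    private final Map<String, List<Posting>> postings = new HashMap<>();

    /** Records one occurrence of a term together with its part-of-speech payload. */
    void add(int docId, int position, String term, byte partOfSpeech) {
        postings.computeIfAbsent(term, t -> new ArrayList<>())
                .add(new Posting(docId, position, new byte[] {partOfSpeech}));
    }

    /**
     * Sums a weight for each occurrence of the term in one document.
     * Assumed weighting: a noun occurrence counts 2.0, anything else 1.0.
     */
    float score(String term, int docId) {
        float total = 0f;
        List<Posting> list = postings.getOrDefault(term, Collections.<Posting>emptyList());
        for (Posting p : list) {
            if (p.docId == docId) {
                total += (p.payload[0] == NOUN) ? 2.0f : 1.0f;
            }
        }
        return total;
    }
}
```

With one noun occurrence and one non-noun occurrence of a term in the same document, `score` returns 3.0 for that document under this assumed weighting, whereas a plain occurrence count would give 2.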
He also described the new Function queries, which originated in Solr:

"The new search function package (org.apache.lucene.search.function) allows developers to use the actual content of a field in scoring a document. For instance, if you stored latitude and longitude information in fields on a document, you could then use the information in these fields to affect the ranking of a document. That is, if you were doing a search for Starbucks, you could rank those locations nearer to the user (assuming you know their location) higher in the results than those farther away. Another example might be to use price or margin information to affect the ranking (i.e. score products higher that have bigger margins for your company. Not saying I agree with this ethically, but it can be done)."
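
The location example can be sketched in plain Java, independent of the org.apache.lucene.search.function classes. Everything here is an assumption made for illustration: the `FunctionScoreSketch` and `Store` names, the use of the haversine formula for distance, and the choice to divide the text score by one plus the distance in kilometres. It shows the general idea of letting stored field values influence ranking, not Lucene's actual scoring.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Toy ranking by text score and distance (NOT the Lucene function query API). */
class FunctionScoreSketch {

    /** A search hit with stored latitude/longitude fields and a text relevance score. */
    static final class Store {
        final String name;
        final double lat;
        final double lon;
        final float textScore;

        Store(String name, double lat, double lon, float textScore) {
            this.name = name;
            this.lat = lat;
            this.lon = lon;
            this.textScore = textScore;
        }
    }

    static final double EARTH_RADIUS_KM = 6371.0;

    /** Great-circle distance in kilometres between two points (haversine formula). */
    static double distanceKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
    }

    /** Assumed combination: the text score shrinks as the store gets farther from the user. */
    static double score(Store s, double userLat, double userLon) {
        return s.textScore / (1.0 + distanceKm(userLat, userLon, s.lat, s.lon));
    }

    /** Returns a copy of the hits sorted by combined score, best first. */
    static List<Store> rank(List<Store> hits, double userLat, double userLon) {
        List<Store> sorted = new ArrayList<>(hits);
        sorted.sort(Comparator.comparingDouble((Store s) -> -score(s, userLat, userLon)));
        return sorted;
    }
}
```

Given two hits with the same text score, `rank` places the one closer to the user's coordinates first. A margin-based ranking of the kind Ingersoll mentions would replace the distance term with a value read from a price or margin field.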

Ingersoll was then asked what users could expect from the next release of Lucene. He indicated that there will be significant improvements in indexing performance as a result of new memory management work led by Michael McCandless. He also mentioned that recent releases of Lucene have added a number of performance enhancements, and that users will want to try them out for themselves. Finally, Ingersoll noted that Java 5 support and more flexibility in the indexing process are potential future features of Lucene.

A full changelog is available, listing all of the bug fixes, features and optimizations in this release. As with previous releases, Lucene 2.2 can read and import indexes created by earlier versions of Lucene. Once an index has been converted, however, it can no longer be read by earlier versions (e.g. 2.1).


    MG4J

    by Vic C

    Which is faster: Solr, MG4j, Lucene, SQL text src?

    .V
