Apache Lucene 2.9 Released
Much of the emphasis in Lucene 2.9 is on performance, and many of the gains come from low-level changes to the way Lucene manages its index. Lucene's index database is composed of a number of separate "segments", each stored in an individual file. When you add documents to the index, new segments may be created; these are added incrementally and can later be merged. Lucene caches the field information used for sorting in its FieldCache, but loading the field cache in Lucene 2.4 and earlier is an expensive operation, particularly since 2.4 regularly reloads the entire field cache. While preparing release 2.9, the Lucene team noted that segments typically change infrequently: they change when you do a merge or a delete, for example, but older segments tend to stay static. The field cache has therefore been modified so that it reloads only the parts whose segment has changed.
Lucene has also historically been inefficient when loading FieldCaches over multiple segments. Version 2.9 bypasses the need to do so by managing the internal FieldCache per segment. The results of these changes are impressive. Lucid Imagination's Mark Miller ran a simple benchmark using 5,000,000 unique strings and saw a performance improvement of around 15 times against Lucene 2.4:
Lucene 2.4: 150.726s
Lucene 2.9: 9.695s
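The per-segment caching idea described above can be illustrated with a small toy model. This is only a sketch of the concept, not Lucene's FieldCache implementation: the class `PerSegmentCacheSketch`, its `Segment` type and the `open` and `loadCount` methods are hypothetical names invented for this example, and the "expensive load" is simulated by copying an array.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model of per-segment field caching. Not Lucene's actual FieldCache. */
final class PerSegmentCacheSketch {

    /** Hypothetical stand-in for an immutable index segment. */
    static final class Segment {
        final String name;
        final String[] fieldValues; // one sortable value per document

        Segment(String name, String... fieldValues) {
            this.name = name;
            this.fieldValues = fieldValues;
        }
    }

    // Cached field values, keyed by segment name.
    private final Map<String, String[]> cache = new HashMap<String, String[]>();
    private int loads = 0; // number of simulated expensive loads so far

    /** Returns field values for every segment, loading only segments not yet cached. */
    List<String[]> open(List<Segment> segments) {
        // Evict entries for segments that no longer exist (merged away or deleted).
        Set<String> live = new HashSet<String>();
        for (Segment s : segments) {
            live.add(s.name);
        }
        cache.keySet().retainAll(live);

        List<String[]> result = new ArrayList<String[]>();
        for (Segment s : segments) {
            String[] values = cache.get(s.name);
            if (values == null) {
                values = s.fieldValues.clone(); // simulated expensive load
                cache.put(s.name, values);
                loads++;
            }
            result.add(values);
        }
        return result;
    }

    int loadCount() {
        return loads;
    }

    public static void main(String[] args) {
        PerSegmentCacheSketch cache = new PerSegmentCacheSketch();
        Segment a = new Segment("_a", "apple", "pear");
        Segment b = new Segment("_b", "fig");
        cache.open(Arrays.asList(a, b));      // loads _a and _b

        Segment c = new Segment("_c", "kiwi");
        cache.open(Arrays.asList(a, b, c));   // loads only the new segment _c

        Segment ab = new Segment("_ab", "apple", "pear", "fig");
        cache.open(Arrays.asList(ab, c));     // _a and _b merged: loads only _ab

        System.out.println("segment loads: " + cache.loadCount()); // prints "segment loads: 4"
    }
}
```

A cache keyed on the whole index would have reloaded every segment on each of the three opens; keying on the segment means static segments are loaded once and only new or merged segments pay the cost.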
Reader re-opening times have also improved significantly. Lucene 2.9 introduces a new IndexWriter.getReader() method which returns a reader that searches the full index, including any uncommitted changes in the current IndexWriter session, providing near real-time searching. It is also possible to "warm" newly merged segments, so that they are ready to be used immediately, using IndexWriter.setMergedSegmentWarmer().
Another area which has seen a significant overhaul is numeric handling, particularly in the context of range queries such as "find me all CDs between 0.50 and 9.99 GBP". Because Lucene's search is largely text based, numeric handling prior to 2.9 relied on a series of string-based encodings which operated at full precision. This often produced a large number of unique terms which Lucene would then have to iterate over to build up a complete set of results. Many developers have written their own custom encoders to work around this, but with version 2.9 Lucene now supports native encoding for numbers. The new field and query classes (NumericField and NumericRangeQuery) use a precision step to determine how much precision to encode when indexing and searching. This can greatly reduce the number of terms that need to be searched and can have a significant impact on query response times.
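To make the precision step idea more concrete, here is a minimal sketch of indexing one number at several precisions. It is an illustration under stated assumptions, not Lucene's actual encoding: the class `PrecisionStepSketch`, the method `termsFor` and the readable "shift:prefix" term format are all invented for this example, and prices are assumed to be stored as whole pence.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy illustration of indexing a number at several precisions. Not Lucene's encoding. */
final class PrecisionStepSketch {

    /**
     * Returns one "shift:prefix" term per precision level for a non-negative int.
     * Each level drops precisionStep more low-order bits, so a term at a higher
     * shift stands for a wider block of neighbouring values.
     */
    static List<String> termsFor(int value, int precisionStep) {
        if (value < 0 || precisionStep < 1) {
            throw new IllegalArgumentException("value must be >= 0 and precisionStep >= 1");
        }
        List<String> terms = new ArrayList<String>();
        for (int shift = 0; shift < 32; shift += precisionStep) {
            terms.add(shift + ":" + Integer.toHexString(value >>> shift));
        }
        return terms;
    }

    public static void main(String[] args) {
        // 9.99 GBP and 9.50 GBP expressed in pence (an assumption of this example).
        System.out.println(termsFor(999, 8)); // prints [0:3e7, 8:3, 16:0, 24:0]
        System.out.println(termsFor(950, 8)); // prints [0:3b6, 8:3, 16:0, 24:0]
    }
}
```

Both prices share the coarser term `8:3`, which stands for every value from 768 to 1023. A range query can match such a block with one coarse term and fall back to full-precision terms only near the ends of the range, which is why fewer terms need to be visited. A smaller precision step produces more terms per value at index time in exchange for finer-grained blocks at search time.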
In addition, version 2.9 introduces new query types with more scalable multi-term queries (wildcard, prefix, etc.), and new analysers for Persian, Arabic and Chinese. There is also improved Unicode support, a new query parser framework, and support for geospatial searches, allowing filtering and sorting of documents based on distance (for example, find all dry cleaners within 5 miles of my house). There are many more changes and improvements; a complete list can be found in the release's CHANGES.txt.
While Lucene generally maintains full backwards compatibility between major versions, Lucene 2.9 has a variety of breaks that are spelled out in the 'Changes in backwards compatibility policy' section of CHANGES.txt. As such, migration to 2.9 is likely to require a re-compile, proper regression testing, and other appropriate due diligence. Re-compiling against 2.9 will also highlight any calls to deprecated methods, allowing developers to update their applications in preparation for version 3.0. This is advisable since Lucene 3.0 will drop support for Java 1.4 and remove all functionality marked as deprecated in the 2.9 version.
Ronny Kohavi Dec 12, 2013