Lucene 3.5 and Solr 3.5 - Substantial RAM Reduction, SearcherManager, Deep Paging Support
The Lucene PMC (Project Management Committee) has announced the availability of Apache Lucene 3.5.0 and Apache Solr 3.5.0. Lucene is a high-performance, full-featured text search library. Solr is a standalone search server that uses Lucene at its core for indexing and search.
The major changes for the Lucene 3.5.0 release are:
- Lower Memory Consumption. There is a substantial reduction (3-5X) of memory needed to hold the terms index. This has been achieved by creating a more memory efficient data structure for holding terms.
- Deep Paging Support. Added IndexSearcher.searchAfter which returns results after a specified ScoreDoc. You can pass the last document on the previous page to the searchAfter method to get to the next page of results.
- SearcherManager. The org.apache.lucene.search.SearcherManager class has been added to simplify the sharing and reopening of IndexSearcher across multiple search threads. Underlying IndexReader instances are safely closed once they are no longer referenced, using the IndexReader's reference counts. The acquire method retrieves an IndexSearcher, and the release method hands it back when the search is finished, so that the manager can close it once no thread is still using it.
- SearcherLifetimeManager. The org.apache.lucene.search.SearcherLifetimeManager class has been added to provide a consistent view of the index across multiple requests. It simplifies the usage of the same IndexSearcher instance between requests, which provides a better user experience when paging or drilling down/up on search results.
- IndexWriter.optimize() Deprecated. IndexWriter.optimize has been deprecated and renamed to forceMerge. This is to discourage the use of this method since it is a very costly operation and only justified if the index is static.
- IndexReader.reopen() Renamed. IndexReader.reopen has been replaced by openIfChanged. IndexReader.openIfChanged returns null if there are no changes in the index. This method is typically less costly than opening a new IndexReader.
- NGramPhraseQuery. org.apache.lucene.search.NGramPhraseQuery is a PhraseQuery which is optimized for n-gram phrase queries. This can speed up queries 30-50% when n-gram analysis is used.
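The deep paging idea can be illustrated with a minimal, self-contained sketch. It does not use Lucene: Hit and DeepPagingSketch are hypothetical stand-ins, and the tie-break on document id is an assumption made for the example. What it shows is that a page collected "after" a cursor hit only ever needs to retain n hits, however deep the page is.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical stand-in for a scored hit; this is not Lucene's ScoreDoc class.
final class Hit {
    final int doc;
    final float score;

    Hit(int doc, float score) {
        this.doc = doc;
        this.score = score;
    }
}

final class DeepPagingSketch {
    // Rank order: higher score first, ties broken by lower doc id (an assumption).
    static final Comparator<Hit> RANK = (a, b) -> {
        int byScore = Float.compare(b.score, a.score);
        return byScore != 0 ? byScore : Integer.compare(a.doc, b.doc);
    };

    // Returns up to n hits ranking strictly after `after`, or the top n if `after` is null.
    static List<Hit> searchAfter(Hit after, List<Hit> allHits, int n) {
        // The heap keeps the worst-ranked retained hit at its head, so it is evicted first.
        PriorityQueue<Hit> heap = new PriorityQueue<>(RANK.reversed());
        for (Hit h : allHits) {
            if (after != null && RANK.compare(h, after) <= 0) {
                continue; // at or before the cursor: already shown on an earlier page
            }
            heap.add(h);
            if (heap.size() > n) {
                heap.poll(); // never hold more than n hits
            }
        }
        List<Hit> page = new ArrayList<>(heap);
        page.sort(RANK);
        return page;
    }
}
```

Passing the last hit of one page as the cursor for the next walks the result set page by page, which mirrors how the article describes using IndexSearcher.searchAfter.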
To see the full list of changes in Lucene 3.5, please visit the Lucene 3.5 Release Notes.
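The acquire/release contract described for SearcherManager rests on reference counting. The sketch below models that pattern in plain Java under stated assumptions: RefCountedSearcher and SearcherManagerSketch are hypothetical classes, "closing" is reduced to setting a flag, and none of this is Lucene's actual implementation.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a searcher over one point-in-time view of the index.
final class RefCountedSearcher {
    final int generation;
    // Starts at 1: the manager itself holds one reference.
    private final AtomicInteger refCount = new AtomicInteger(1);
    private volatile boolean closed = false;

    RefCountedSearcher(int generation) {
        this.generation = generation;
    }

    // Takes a reference unless the searcher has already been closed.
    boolean tryIncRef() {
        while (true) {
            int count = refCount.get();
            if (count <= 0) {
                return false;
            }
            if (refCount.compareAndSet(count, count + 1)) {
                return true;
            }
        }
    }

    // Drops a reference; the last one out closes the searcher.
    void decRef() {
        if (refCount.decrementAndGet() == 0) {
            closed = true; // a real searcher would free its reader here
        }
    }

    boolean isClosed() {
        return closed;
    }
}

final class SearcherManagerSketch {
    private volatile RefCountedSearcher current = new RefCountedSearcher(0);

    // Hands out the current searcher with its reference count incremented.
    RefCountedSearcher acquire() {
        while (true) {
            RefCountedSearcher s = current;
            if (s.tryIncRef()) {
                return s;
            }
            // Lost a race with a swap; retry against the new current searcher.
        }
    }

    // Returns a searcher obtained from acquire(); it must not be used afterwards.
    void release(RefCountedSearcher s) {
        s.decRef();
    }

    // Publishes a fresh searcher, then drops the manager's reference to the old one.
    synchronized void swapToNewGeneration() {
        RefCountedSearcher old = current;
        current = new RefCountedSearcher(old.generation + 1);
        old.decRef();
    }
}
```

In this model a search that is still running against an old searcher keeps it open across a swap, and the old searcher closes only when that search releases it. Callers would normally pair acquire with release in a try/finally block so that a failed search cannot leak a reference.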
The major changes for the Solr 3.5.0 release are:
- Lucene 3.5.0. Fixes and enhancements from Lucene 3.5.0, most notably the substantial reduction of memory needed for holding the term index.
- Distributed Result Grouping. Support for distributed search result grouping, also called field collapsing. This feature limits the number of documents shown for each "group", defined as the unique values in a field, and now works with distributed search.
- Language Detection. The new contrib module "langid" adds the ability to detect the language of a document before indexing, so appropriate decisions can be made. It is implemented as an UpdateRequestProcessor using Apache Tika's LanguageIdentifier or Cybozu's language-detection library.
- Numeric sortMissingFirst and sortMissingLast Support. Numeric types including Trie field types and dates now support sortMissingFirst and sortMissingLast.
- HunspellStemFilterFactory. Added support for Lucene's HunspellStemFilter, which supports stemming for 99 languages. Hunspell was originally developed as an advanced spell checker, most famously used in the OpenOffice suite; Solr uses it for stemming.
- hl.q parameter. The optional hl.q parameter has been added, and if specified, overrides the q parameter in the Highlighter (HighlightComponent).
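To make the result grouping item concrete, here is a small sketch of the collapsing step on a single node. It is only an illustration: GroupingSketch, collapse, and limitPerGroup are hypothetical names, documents are reduced to an id and a group value, and the distributed merge across shards is not modeled.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class GroupingSketch {
    // rankedDocs: each entry is {docId, groupFieldValue}, already sorted best-first.
    // Returns the groups in order of their best document, each capped at limitPerGroup ids.
    static Map<String, List<String>> collapse(List<String[]> rankedDocs, int limitPerGroup) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (String[] doc : rankedDocs) {
            List<String> members = groups.computeIfAbsent(doc[1], key -> new ArrayList<>());
            if (members.size() < limitPerGroup) {
                members.add(doc[0]); // keep only the best-ranked documents of each group
            }
        }
        return groups;
    }
}
```

With a limit of 2, a ranked list containing three "books" documents and one "music" document yields two groups, and the third "books" document is dropped.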
To see the full list of changes in Solr 3.5, please visit the Solr 3.5 Release Notes.
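The effect of sortMissingFirst and sortMissingLast on numeric fields can be sketched with plain Java comparators. This is an assumption-laden model, not Solr code: SortMissingSketch is a hypothetical class, a missing field value is represented by null, and the placement of missing values is taken to be independent of the sort direction.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class SortMissingSketch {
    // Sorts numeric values, placing missing (null) values last or first
    // regardless of whether the sort itself is ascending or descending.
    static List<Integer> sort(List<Integer> values, boolean descending, boolean missingLast) {
        Comparator<Integer> direction = descending
                ? Comparator.<Integer>reverseOrder()
                : Comparator.<Integer>naturalOrder();
        Comparator<Integer> order = missingLast
                ? Comparator.nullsLast(direction)
                : Comparator.nullsFirst(direction);
        List<Integer> sorted = new ArrayList<>(values);
        sorted.sort(order);
        return sorted;
    }
}
```

For the values 5, missing, 1, an ascending sort with missing values last gives 1, 5, missing, while a descending sort with missing values last gives 5, 1, missing.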
InfoQ asked Yonik Seeley, creator of Apache Solr and Chief Open Source Architect and Co-Founder at Lucid Imagination, a few questions about the recent Lucene and Solr release.
What are the changes in this release that most Lucene or Solr users can take advantage of immediately?
Lucene/Solr users will benefit from much lower memory usage for the terms index and improved vector highlighting. Solr has added support for distributed grouping and the ability to sort numeric fields with missing values last or first. Lucene has also added optimizations for deep paging.
Do you recommend that developers use the new SearcherManager by default?
The new Lucene SearcherManager may be a good starting point for someone developing a new Lucene-based project, but there's no need to migrate working custom searcher manager code. For Solr users, searcher management is simply an internal implementation detail that has existed since inception.
What's next for Lucene and Solr 3.6?
It's always difficult to tell in open source, as there is no official roadmap. The majority of groundbreaking changes have been occurring in "trunk", which will be 4.0 when it's released. Lucene has completely revamped indexing with codec support, and Solr is morphing into a NoSQL data store with advanced distributed indexing capabilities.
When will we see a production ready release of Solr with NRT features?
LucidWorks, our commercial distribution of Apache Solr, is based on a stable version of trunk (4.0-dev) and does have NRT capabilities. It's unclear at this time if NRT functionality will be back-ported to the Solr 3.x line. Lucene/Solr 4.0 should be released at some point in 2012.
To get started, you can download Lucene 3.5 and Solr 3.5 from one of the Apache Mirror sites. Maven users can pull in Lucene with the groupId org.apache.lucene, the artifactId pattern lucene-*, and version 3.5.0. You can also subscribe to the Lucene and Solr mailing lists for up-to-date information.