Apache Solr: Lucene-Based Server Provides Highly Scalable Enterprise Search

Apache Solr is a Lucene-based enterprise search server that delivers out-of-the-box indexing and query capabilities in a portable WAR file. Users interact with Solr over an HTTP interface: content is submitted for indexing as XML documents, and queries are issued as HTTP GET parameters. Solr also provides a master-slave index replication mechanism so that query load can be distributed across servers in a large-scale environment.
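As a rough illustration of that HTTP interface, the Python sketch below builds an XML add message and a GET query URL without sending anything. The base URL, the field names, and the helper functions `build_add_xml` and `build_select_url` are assumptions made for this example, not part of Solr itself.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Assumed base URL of a local Solr instance; adjust for your deployment.
SOLR_BASE = "http://localhost:8983/solr"


def build_add_xml(doc):
    """Serialize a dict of field name -> value into an <add> message."""
    add = ET.Element("add")
    doc_el = ET.SubElement(add, "doc")
    for name, value in doc.items():
        field = ET.SubElement(doc_el, "field", name=name)
        field.text = str(value)
    return ET.tostring(add, encoding="unicode")


def build_select_url(query, rows=10):
    """Build a GET URL for the select handler with a URL-encoded query."""
    return SOLR_BASE + "/select?" + urlencode({"q": query, "rows": rows})


# Hypothetical document and query, for illustration only.
add_xml = build_add_xml({"id": "doc1", "title": "Hello Solr"})
select_url = build_select_url("title:hello")
print(add_xml)
print(select_url)
```

The XML string would be sent in the body of an HTTP POST to the update URL, and the second string could be fetched with any HTTP client.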

Solr was initially developed at CNET Networks and was donated to the Apache Software Foundation in 2006. It is currently used for search applications on several high-traffic public websites. Community feedback has been positive, with users reporting that indices holding several million documents perform well.

Solr's feature set is broken down into several subsystems:

Schema

Defines the field types and fields of documents
Dynamic fields enable on-the-fly addition of new fields
Explicit types eliminate the need to guess the types of fields
External file-based configuration of stopword lists, synonym lists, and protected word lists
Many additional text analysis components including word splitting, regex and sounds-like filters
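The idea behind dynamic fields can be sketched in a few lines of Python: a field name that is not declared explicitly is matched against wildcard patterns to find its type. This is a conceptual model only, not Solr's implementation. The field names, type names, patterns, and the first-match-wins rule are all assumptions chosen for the example.

```python
from fnmatch import fnmatchcase

# Hypothetical schema: explicitly declared fields plus dynamic-field patterns.
EXPLICIT_FIELDS = {"id": "string", "title": "text"}
DYNAMIC_FIELDS = [("*_i", "sint"), ("*_s", "string"), ("*_dt", "date")]


def resolve_field_type(name):
    """Return the type for a field name, or None if nothing matches."""
    # Explicit declarations take priority over wildcard patterns.
    if name in EXPLICIT_FIELDS:
        return EXPLICIT_FIELDS[name]
    # Assumption for this sketch: the first matching pattern wins.
    for pattern, field_type in DYNAMIC_FIELDS:
        if fnmatchcase(name, pattern):
            return field_type
    return None


for field_name in ["title", "price_i", "created_dt", "unknown"]:
    print(field_name, resolve_field_type(field_name))
```

Because every pattern carries an explicit type, a new field such as `price_i` can be indexed without a schema change and without any guessing about its type.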

Query

HTTP interface with configurable response formats (XML/XSLT, JSON, Python, Ruby)
Sort by any number of fields
Highlighted context snippets
Constant scoring range and prefix queries - no idf, coord, or lengthNorm factors, and no restriction on the number of terms the query matches.
Function Query - influence the score by a function of a field's numeric value or ordinal
Date Math - specify dates relative to "NOW" in queries and updates
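To make the query features above more concrete, the following Python sketch assembles GET parameters that combine a date math range with a JSON response format and highlighting. The field names and the helper `recent_docs_params` are hypothetical, and the parameter names (`wt`, `hl`, `hl.fl`) should be checked against the documentation for the Solr version in use.

```python
from urllib.parse import urlencode


def recent_docs_params(terms, date_field, days, highlight_field):
    """Build parameters for documents matching `terms` in the last `days` days."""
    return {
        # Date math: the range is expressed relative to NOW.
        "q": f"{terms} AND {date_field}:[NOW-{days}DAYS TO NOW]",
        "wt": "json",              # ask for a JSON response
        "hl": "true",              # turn on highlighted snippets
        "hl.fl": highlight_field,  # field to take snippets from
    }


# Hypothetical field names, for illustration only.
params = recent_docs_params("title:solr", "timestamp", 7, "title")
query_string = urlencode(params)
print(query_string)
```

Because the range is relative to NOW, the same query string can be reused unchanged over time instead of being rebuilt with fresh timestamps.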

Core

Pluggable query handlers and extensible XML data format
Document uniqueness enforcement based on unique key field
Batches updates and deletes for high performance
User configurable commands triggered on index changes
Correct handling of numeric types for both sorting and range queries

Caching

Pluggable Cache implementations
Autowarming of cache in background (the most recently accessed items in the caches of the current searcher are re-populated in the new searcher, enabling high cache hit rates across index/searcher changes)
Fast/small filter implementation
User level caching with autowarming support
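The autowarming idea can be modeled with a small LRU cache in Python: when a new searcher is opened, the most recently used keys of the old cache are recomputed against the new index and placed in the new cache. This is a conceptual sketch, not Solr's cache code. The class `LRUCache`, the function `autowarm`, and the `regenerate` callback are names invented for this example.

```python
from collections import OrderedDict


class LRUCache:
    """Tiny LRU cache; the least recently used entry is evicted first."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()  # ordered from least to most recently used

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        while len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict least recently used

    def most_recent_keys(self, count):
        """Return up to `count` keys, most recently used first."""
        if count <= 0:
            return []
        return list(self._items)[-count:][::-1]


def autowarm(old_cache, new_cache, count, regenerate):
    """Fill new_cache with the `count` hottest keys of old_cache.

    Values are recomputed with `regenerate` because entries cached for the
    old index may no longer be valid for the new one.
    """
    # Insert the coldest selected key first so recency order carries over.
    for key in reversed(old_cache.most_recent_keys(count)):
        new_cache.put(key, regenerate(key))


old = LRUCache(3)
for q in ["a", "b", "c"]:
    old.put(q, f"old:{q}")
old.get("a")  # "a" is now the most recently used entry

new = LRUCache(3)
autowarm(old, new, 2, lambda q: f"new:{q}")
print(new.most_recent_keys(3))
```

Warming only the most recently used keys bounds the work done before the new searcher goes live, which is the tradeoff an autowarm count setting would control.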

Replication

Efficient distribution of index parts that have changed via rsync transport
Pull strategy allows for easy addition of searchers
Configurable distribution interval allows tradeoff between timeliness and cache utilization

Admin Interface

Comprehensive statistics on cache utilization, updates, and queries
Text analysis debugger, showing result of every stage in an analyzer
Web query interface with debugging output

Version 1.2 was released last week, adding several new features:

This is the first release since Solr graduated from the Incubator, bringing many new features, including CSV/delimited-text data loading, time based autocommit, faster faceting, negative filters, a spell-check handler, sounds-like word filters, regex text filters, and more flexible plugins.

A two-part series of articles that walks through installing, configuring, using, and tuning Solr in more detail was also recently published on developerWorks.
