
Apache Solr: Lucene Based Server Provides Highly Scalable Enterprise Search

Apache Solr is a Lucene-based enterprise search server that delivers out-of-the-box indexing and query capabilities in a portable WAR file. Users interact with Solr over HTTP, submitting content for indexing as XML documents and issuing queries with HTTP GET parameters. Solr also provides a master-slave index replication mechanism so that query load can be distributed in a large-scale environment.
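
As a rough illustration of that HTTP interface, the following Python sketch builds an XML add message and a query URL without contacting a server. The base URL (`http://localhost:8983/solr`), the field names `id` and `title`, and the helper function names are assumptions made for this example; a real deployment defines its own fields in the schema.

```python
from urllib.parse import urlencode
from xml.etree import ElementTree as ET

# Assumed base URL of a local Solr instance; adjust for a real deployment.
SOLR_BASE = "http://localhost:8983/solr"


def build_add_xml(doc):
    """Build a Solr XML <add> message for one document given as a dict."""
    add = ET.Element("add")
    doc_el = ET.SubElement(add, "doc")
    for name, value in doc.items():
        field = ET.SubElement(doc_el, "field", name=name)
        field.text = str(value)
    return ET.tostring(add, encoding="unicode")


def build_select_url(query, **params):
    """Build a GET URL for the select handler with extra parameters."""
    all_params = {"q": query}
    all_params.update(params)
    return SOLR_BASE + "/select?" + urlencode(all_params)


# Hypothetical document and query, for illustration only.
add_xml = build_add_xml({"id": "doc1", "title": "Hello Solr"})
select_url = build_select_url("title:hello", rows=10, wt="json")
print(add_xml)
print(select_url)
```

In practice the XML message would be sent with an HTTP POST to the update handler and followed by a `<commit/>` message before the document becomes searchable; the sketch only constructs the payloads.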

Solr was initially developed at CNET Networks and was donated to the Apache Software Foundation in 2006. It is currently used for search applications on several high-traffic public websites. Community feedback has been positive, with users reporting that indices holding several million documents perform well.

Solr's feature set is broken down into several subsystems:

Schema
  • Defines the field types and fields of documents
  • Dynamic fields enable on-the-fly addition of new fields
  • Explicit types eliminate the need to guess the types of fields
  • External file-based configuration of stopword lists, synonym lists, and protected word lists
  • Many additional text analysis components including word splitting, regex and sounds-like filters
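
The interplay between explicit fields and dynamic fields can be sketched in a few lines of Python. This is a conceptual illustration only, not Solr's schema code: the field names, the type names, and the first-match rule used for the wildcard patterns are all assumptions for the example, and Solr's actual precedence rules may differ.

```python
from fnmatch import fnmatchcase

# Hypothetical explicit fields: name -> type.
EXPLICIT_FIELDS = {"id": "string", "title": "text"}

# Hypothetical dynamic field patterns, checked in declared order.
DYNAMIC_FIELDS = [("*_i", "int"), ("*_s", "string"), ("*_dt", "date")]


def resolve_field_type(name):
    """Return the type for a field name, preferring explicit definitions."""
    if name in EXPLICIT_FIELDS:
        return EXPLICIT_FIELDS[name]
    for pattern, field_type in DYNAMIC_FIELDS:
        if fnmatchcase(name, pattern):
            return field_type
    # No guessing: an unknown field is an error rather than an inferred type.
    raise KeyError(f"undefined field: {name}")


print(resolve_field_type("title"))
print(resolve_field_type("price_i"))
```

A document can therefore carry a field such as `price_i` that was never declared individually, while a field matching no definition is rejected instead of being assigned a guessed type.
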
Query
  • HTTP interface with configurable response formats (XML/XSLT, JSON, Python, Ruby)
  • Sort by any number of fields
  • Highlighted context snippets
  • Constant scoring range and prefix queries - no idf, coord, or lengthNorm factors, and no restriction on the number of terms the query matches.
  • Function Query - influence the score by a function of a field's numeric value or ordinal
  • Date Math - specify dates relative to "NOW" in queries and updates
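
Several of the query features above surface as plain request parameters. The Python sketch below assembles a parameter set that combines multi-field sorting, highlighting, and a date math filter. The field names `price`, `title`, and `timestamp` are hypothetical, and the exact parameters a given Solr version accepts should be checked against its documentation.

```python
from urllib.parse import urlencode


def build_query_params(text, sort_fields, highlight_field, max_age):
    """Return select-handler parameters as a dict.

    sort_fields: list of (field, direction) pairs, e.g. [("price", "desc")]
    max_age: a date math span such as "7DAYS"
    """
    return {
        "q": text,
        # Sort by any number of fields, in priority order.
        "sort": ",".join(f"{field} {direction}" for field, direction in sort_fields),
        # Ask for highlighted snippets from one field.
        "hl": "true",
        "hl.fl": highlight_field,
        # Date math: restrict to documents newer than NOW minus max_age.
        "fq": f"timestamp:[NOW-{max_age} TO NOW]",
        "wt": "json",
    }


params = build_query_params(
    "solr", [("price", "desc"), ("score", "desc")], "title", "7DAYS"
)
query_string = urlencode(params)
print(query_string)
```

The resulting string would be appended to the select handler URL; the server resolves `NOW` at request time, so the client never computes the date itself.
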
Core
  • Pluggable query handlers and extensible XML data format
  • Document uniqueness enforcement based on unique key field
  • Batches updates and deletes for high performance
  • User configurable commands triggered on index changes
  • Correct handling of numeric types for both sorting and range queries
Caching
  • Pluggable Cache implementations
  • Autowarming of cache in background (the most recently accessed items in the caches of the current searcher are re-populated in the new searcher, enabling high cache hit rates across index/searcher changes)
  • Fast/small filter implementation
  • User level caching with autowarming support
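
The autowarming idea can be illustrated with a small Python sketch. It is a conceptual model under stated assumptions, not Solr's implementation: the `LRUCache` class, the `autowarm` function, and the `regenerate` callback are names invented for this example.

```python
from collections import OrderedDict


class LRUCache:
    """Tiny LRU cache; the most recently used key is last in the OrderedDict."""

    def __init__(self, max_size):
        self.max_size = max_size
        self.items = OrderedDict()

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)  # mark as most recently used
        return self.items[key]

    def put(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.max_size:
            self.items.popitem(last=False)  # evict least recently used

    def most_recent_keys(self, count):
        """Return up to `count` keys, most recently used first."""
        return list(reversed(self.items))[:count]


def autowarm(old_cache, new_cache, regenerate, count):
    """Re-populate new_cache with the `count` hottest keys of old_cache.

    regenerate(key) recomputes the value against the new index, because
    values cached for the old index may be stale.
    """
    # Insert the oldest of the selected keys first so recency order is kept.
    for key in reversed(old_cache.most_recent_keys(count)):
        new_cache.put(key, regenerate(key))


old = LRUCache(max_size=3)
for q in ["a", "b", "c"]:
    old.put(q, f"old-results-for-{q}")
old.get("a")  # "a" becomes the most recently used key

new = LRUCache(max_size=3)
autowarm(old, new, regenerate=lambda q: f"new-results-for-{q}", count=2)
warmed = new.most_recent_keys(2)
print(warmed)
```

Only the keys are carried over; each value is recomputed for the new searcher, which is why warming a larger number of entries trades longer warm-up time for a higher initial hit rate.
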
Replication
  • Efficient distribution of index parts that have changed via rsync transport
  • Pull strategy allows for easy addition of searchers
  • Configurable distribution interval allows tradeoff between timeliness and cache utilization
Admin Interface
  • Comprehensive statistics on cache utilization, updates, and queries
  • Text analysis debugger, showing result of every stage in an analyzer
  • Web query interface with debugging output
Version 1.2 was released last week, adding several new features:
This is the first release since Solr graduated from the Incubator, bringing many new features, including CSV/delimited-text data loading, time based autocommit, faster faceting, negative filters, a spell-check handler, sounds-like word filters, regex text filters, and more flexible plugins.
A two-part series of articles recently published on developerWorks walks through installing, configuring, using, and tuning Solr in more detail.
