BT

Apache Solr: Lucene Based Server Provides Highly Scalable Enterprise Search

by James Kao on Jun 11, 2007 |
Apache Solr is a Lucene-based enterprise search server that delivers out-of-the-box indexing and query capabilities in a portable war file. Users interact with Solr via an HTTP interface, submitting content for indexing and making queries using XML documents and HTTP GET parameters. Solr also provides a master-slave index replication mechanism to allow query load to be distributed in a large-scale environment.

Solr was initially developed at CNET Networks and was donated to the Apache Software Foundation in 2006. It is currently used for search applications on several high-traffic public websites. Community reports have been good, with users reporting indices with several million documents performing quite well.

Solr's feature set is broken down into several subsystems:

Schema
  • Defines the field types and fields of documents
  • Dynamic Fields enables on-the-fly addition of new fields
  • Explicit types eliminates the need for guessing types of fields
  • External file-based configuration of stopword lists, synonym lists, and protected word lists
  • Many additional text analysis components including word splitting, regex and sounds-like filters
Query
  • HTTP interface with configurable response formats (XML/XSLT, JSON, Python, Ruby)
  • Sort by any number of fields
  • Highlighted context snippets
  • Constant scoring range and prefix queries - no idf, coord, or lengthNorm factors, and no restriction on the number of terms the query matches.
  • Function Query - influence the score by a function of a field's numeric value or ordinal
  • Date Math - specify dates relative to "NOW" in queries and updates
Core
  • Pluggable query handlers and extensible XML data format
  • Document uniqueness enforcement based on unique key field
  • Batches updates and deletes for high performance
  • User configurable commands triggered on index changes
  • Correct handling of numeric types for both sorting and range queries
Caching
  • Pluggable Cache implementations
  • Autowarming of cache in background (The most recently accessed items in the caches of the current searcher are re-populated in the new searcher, enabing high cache hit rates across index/searcher changes.)
  • Fast/small filter implementation
  • User level caching with autowarming support
Replication
  • Efficient distribution of index parts that have changed via rsync transport
  • Pull strategy allows for easy addition of searchers
  • Configurable distribution interval allows tradeoff between timeliness and cache utilization
Admin Interface
  • Comprehensive statistics on cache utilization, updates, and queries
  • Text analysis debugger, showing result of every stage in an analyzer
  • Web Query Interface w/ debugging output
Version 1.2 was released last week, adding several new features:
This is the first release since Solr graduated from the Incubator, bringing many new features, including CSV/delimited-text data loading, time based autocommit, faster faceting, negative filters, a spell-check handler, sounds-like word filters, regex text filters, and more flexible plugins.
A two part series of articles was also recently published on developerWorks that walk through the process of installing, configuring, using, and tuning Solr in more detail.

Hello stranger!

You need to Register an InfoQ account or to post comments. But there's so much more behind being registered.

Get the most out of the InfoQ experience.

Tell us what you think

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Email me replies to any of my messages in this thread

Clustered Indexes by Orion Letizi

There's been some talk recently about clustering Solr indexes with Terracotta. We've gotten Lucene clustered and we have a config module for it. It may also work with Solr. It's on our list of cool things to try.

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Email me replies to any of my messages in this thread

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Email me replies to any of my messages in this thread

1 Discuss

Educational Content

General Feedback
Bugs
Advertising
Editorial
InfoQ.com and all content copyright © 2006-2013 C4Media Inc. InfoQ.com hosted at Contegix, the best ISP we've ever worked with.
Privacy policy
BT