InfoQ

Apache Solr: Extensible, Clustered Search Server Built on Lucene

Posted by Ryan Slobojan on Nov 11, 2008

Sections: Development, Architecture & Design
Topics: Search, Java
Tags: Apache Solr, Lucene

The Apache Solr project, an open source enterprise search server based on Apache Lucene, recently released version 1.3. InfoQ spoke with Solr creator Yonik Seeley to learn more about this release, and also about what capabilities Solr offers to end users.

Seeley began by describing the target audience as "Pretty much anyone that needs a search box, faceted browsing (guided navigation) or a combination of the two", and identified the key features of Solr as:

  • Standards-based open interfaces - XML, JSON and HTTP are supported for querying the Solr search server and retrieving results
  • Easy administration - Solr servers can be administered via an HTML interface, server statistics are exposed via JMX, and Solr configuration is done via XML
  • Faceted search - query results are automatically broken into categories
  • Integrated hit highlighting - matching words are automatically highlighted in the search results
  • Scalability - fast incremental updates and snapshot distribution/replication to other servers
  • Extensible plugin architecture - new capabilities (such as custom request processors and query result formatting) can be easily added into a Solr server as a plugin
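
As a rough illustration of the HTTP interface, faceting and highlighting listed above, the Python sketch below builds a query URL that asks for JSON output with facet counts and highlighted snippets. It only constructs the URL and does not contact a server. The base URL is the one used by the Solr example setup, and the field names `cat` and `name` as well as both helper functions are hypothetical. The sample facet list is hand-written to show the flat, alternating value/count layout of Solr's JSON output; it was not captured from a real server.

```python
from urllib.parse import urlencode


def build_select_url(base_url, query, facet_field, highlight_field):
    """Build a Solr /select URL with faceting and highlighting switched on."""
    params = [
        ("q", query),                 # the user's search terms
        ("wt", "json"),               # response writer: JSON instead of XML
        ("facet", "true"),            # enable faceted search
        ("facet.field", facet_field), # field whose values become categories
        ("hl", "true"),               # enable hit highlighting
        ("hl.fl", highlight_field),   # field to generate snippets from
    ]
    return base_url + "/select?" + urlencode(params)


def pair_facet_counts(flat):
    """Turn ["a", 3, "b", 1] into [("a", 3), ("b", 1)]."""
    return list(zip(flat[0::2], flat[1::2]))


# Assumed values: a local example server and hypothetical field names.
url = build_select_url("http://localhost:8983/solr", "ipod", "cat", "name")
print(url)

# Hand-written sample of a facet_fields entry, for illustration only.
print(pair_facet_counts(["electronics", 3, "memory", 1]))
```

Because the whole request is a plain URL, any HTTP client or a browser can issue it, which is what makes the interface easy to call from languages other than Java.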

Seeley also indicated that the major new features in this release are:

  • Distributed search - indexes can now be transparently broken into multiple shards, a single Solr server can now support multiple indexes with their own configuration and schema, and major configuration changes can be made without bringing down the Solr server
  • Expanded query capabilities - this includes a new Java client (SolrJ) and several new features such as direct configuration of which documents appear first for specific queries, more-like-this, search timeouts, date faceting and spell checking
  • Enhanced data import tool - databases and other structured data sources can now be imported, and mapping and transformation can be done on the imported values
  • More custom extension points - there is a new update processor chain which allows modification or redirecting of documents during indexing, a search component chain which modifies or adds to query results, custom query parsers and pluggable functions
  • Performance enhancements - greatly increased indexing speed, a binary response format and a much faster delete-by-query have been incorporated
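
Two of the items above can be sketched from the client side. The Python snippet below is a minimal sketch, not Solr's own tooling: it builds a distributed search URL using the `shards` request parameter, and an XML delete-by-query message of the kind posted to the `/update` handler. The host names, the query strings and both helper function names are assumptions made for illustration.

```python
from urllib.parse import urlencode
from xml.sax.saxutils import escape


def build_distributed_url(base_url, query, shards):
    """Build a /select URL that fans the query out to several shards.

    `shards` is a list of "host:port/path" strings; Solr expects them
    comma-separated in a single `shards` parameter.
    """
    params = [("q", query), ("shards", ",".join(shards))]
    return base_url + "/select?" + urlencode(params)


def delete_by_query_message(query):
    """Build the XML body for a delete-by-query update request."""
    # escape() protects &, < and > so the message stays well-formed XML.
    return "<delete><query>" + escape(query) + "</query></delete>"


# Hypothetical two-shard deployment.
shard_list = ["host1:8983/solr", "host2:8983/solr"]
print(build_distributed_url("http://host1:8983/solr", "ipod", shard_list))

# Hypothetical field name; the body would be POSTed to /solr/update.
print(delete_by_query_message("category:discontinued"))
```

The server that receives the request queries each listed shard and merges the results, so the client sees a single response.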

A comprehensive changelog is also available.

Seeley spoke in more detail about the scaling, capacity and relevance features of Solr, saying:

"Solr is already deployed with collection sizes in the hundreds of millions of documents, and with the addition of distributed search, Solr should be able to handle billion document collections.

"Solr has excellent full text relevancy, building on Lucene and easily providing term proximity boosting, recent document boosting, editorial boosting, and even custom scoring based on arbitrary functions of numeric field values.

"AOL is using Solr to power its channels: Music, NFL Sports, AOL Recipes, Reference Center, Real Estate and Autos being several examples. Solr also powers the search features of Netflix, Zappos, Gamespot, and the Internet Archive. There are *many* other big users I'm aware of that haven't publicly stated their use."
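
The relevancy features Seeley mentions are typically driven by request parameters. The Python sketch below shows one plausible way to combine term proximity boosting and recent document boosting, assuming the dismax query parser with its `qf`, `pf` and `bf` parameters and the `recip` and `rord` functions. The field names `title`, `body` and `date`, the boost constants and the helper name are hypothetical, and the exact parameters accepted depend on how the server's request handler is configured.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_boosted_url(base_url, query):
    """Build a dismax query URL with phrase and recency boosts."""
    params = [
        ("q", query),
        ("defType", "dismax"),       # assumption: dismax parser is available
        ("qf", "title^2 body"),      # fields to search, title weighted higher
        ("pf", "title body"),        # boost documents where terms appear close together
        ("bf", "recip(rord(date),1,1000,1000)"),  # favor more recent documents
    ]
    return base_url + "/select?" + urlencode(params)


url = build_boosted_url("http://localhost:8983/solr", "solar panels")
print(url)

# Decode the query string again to confirm the parameters survive encoding.
decoded = parse_qs(urlsplit(url).query)
print(decoded["bf"][0])
```

Keeping the boosts in request parameters rather than in application code means they can be tuned without reindexing.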

When asked about future plans for Solr, Seeley indicated that greater scalability, easier configuration and management of large clusters, location-based and realtime search, and a refactoring to use Spring for plugin configuration were on the horizon. Seeley also pointed to a mailing list post in which he discussed the future plans for Solr in more detail, in particular around the 2.0 timeframe.
