“Elasticsearch in Action” – Book Review and Authors Interview

Businesses today are overwhelmed by the quantity of data flowing through and produced by their systems. Big Data technologies have focused on how to store and process large quantities of data. While this is very important for batch data processing, one often needs to make informed decisions in real time based on that data. This is where search engines can help. According to siteber.com, search is one of the most dynamic industries:

The Search Engines industry has grown so much in the past couple years that it has cemented itself as one of the most innovative industries in the United States.

As a result, the two leading open source search engines today, Elasticsearch and Apache Solr, are enjoying wider use and attracting more attention than before. Gheorghe, Hinman and Russo’s new “Elasticsearch in Action” is a great book, providing a concise, step-by-step introduction to search along with explanations of how this functionality is implemented in Elasticsearch and numerous code examples. The book also covers Elasticsearch administration and configuration in depth, emphasizing tradeoffs between indexing and search performance.

The book’s main content is divided into two parts, “Core functionality” and “Advanced functionality,” and is followed by several appendixes covering features that are important but less generic (used only in special classes of applications).

The first part of the book describes the core Elasticsearch building blocks and their functionality. It covers the main Elasticsearch features and the approaches to modeling and indexing data so that it can be searched and analyzed based on the application’s requirements.

  • Chapter 1 provides an introduction to common search engine features and the way they are implemented in Elasticsearch. The chapter ends with instructions on Elasticsearch installation, which is used throughout the book for running examples.
  • Chapter 2 covers data organization in an Elasticsearch server, including the logical data model (the way a search application interacts with Elasticsearch) and the physical layout (the way the server handles data internally). It then shows how these data models are used for typical operations: indexing documents, searching them, analyzing data via aggregations, and scaling out to multiple nodes.
  • Chapter 3 covers the details of getting data in and out of Elasticsearch and maintaining it: indexing, updating, and deleting documents. It describes the indexing process in detail by looking at a document’s fields: what they contain and what happens when you write them.
  • The details of full-text search are covered in chapter 4, which describes the important types of queries and filters, their inner workings, and the tradeoffs of various search approaches. It ends by presenting some of the most commonly used filters and queries and explaining their applicability to different use cases.
  • Chapter 5 explains how analysis breaks down the text from both documents and queries into the tokens that are used for searching. It introduces the different kinds of analyzers provided by Elasticsearch and explains how to build your own analyzer in order to fully utilize Elasticsearch’s full-text search potential. It also introduces ngrams, shingles and stemming, all of which play an important role in the flexibility of search.
  • Chapter 6 focuses on computing relevance, also known as the score, which defines how relevant a document is to the original query. It describes the factors affecting a document’s score and the ways to manipulate them: using different scoring algorithms or boosting a particular query or field. It also shows how to use Elasticsearch’s API to see how a score was computed.
  • Chapter 7 shows how to use aggregations to perform real-time data analytics. Aggregations in Elasticsearch work by loading the documents matching the search criteria and doing all sorts of computations on them, such as counting the terms of a string field or calculating the average of a numeric field. Aggregations are divided into two main categories: metric and bucket. Metric aggregations refer to the statistical analysis of a group of documents, resulting in metrics such as the minimum value, maximum value, standard deviation, and more. Bucket aggregations divide matching documents into one or more containers (buckets) and then give you the number of documents in each bucket.
  • Finally, chapter 8 describes how to support relationships in Elasticsearch. It discusses several common approaches provided by Elasticsearch to deal with relationships, including object types, nested documents, parent-child relationships and general denormalization. It explains how to use each approach and its pros and cons.
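
To make the distinction between metric and bucket aggregations from chapter 7 concrete, here is a minimal sketch in plain Python. It only illustrates the two concepts; it does not call Elasticsearch or reproduce its implementation, and the documents, field names, and function names are hypothetical.

```python
from collections import Counter
from statistics import mean

# Hypothetical documents standing in for the hits of a search.
docs = [
    {"tag": "search", "price": 10.0},
    {"tag": "search", "price": 30.0},
    {"tag": "logging", "price": 20.0},
]


def metric_stats(documents, field):
    """Metric aggregation: summarize a numeric field over all documents."""
    values = [d[field] for d in documents]
    return {"min": min(values), "max": max(values), "avg": mean(values)}


def bucket_terms(documents, field):
    """Bucket aggregation: group documents by term and count each bucket."""
    return dict(Counter(d[field] for d in documents))


print(metric_stats(docs, "price"))
print(bucket_terms(docs, "tag"))
```

The metric function returns one set of numbers for the whole group, while the bucket function returns one count per distinct term.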

The second part of the book describes how to bring Elasticsearch into production. It provides additional information on how each feature works, and its impact on performance and scalability:

  • Chapter 9 covers scaling out Elasticsearch to multiple nodes. It describes the mechanics of adding, removing and decommissioning nodes, as well as the processes of master election and shard relocation. It also dives into scaling strategies, including best practices for sharding and replicating indices, for example, oversharding or using time-based indices to ensure that today’s design can cope with next year’s data. It also explains how to use aliases and routing for improved cluster flexibility and scaling. Finally, it shows how to use Elasticsearch’s API to display the cluster’s state and health.
  • Chapter 10 describes more ways to improve the performance of a cluster. It starts by showing how to group multiple requests, such as index, update, delete, get, and search, into a single HTTP call. This grouping can speed up an application considerably by minimizing the number of network round trips. It then shows how Elasticsearch deals with Lucene (the underlying search library) segments: how refreshes, flushes, merge policies, and store settings work, and how they influence the tradeoff between indexing and search performance. Finally, it discusses caching, which is a big factor in Elasticsearch’s speed. It describes the details of the filter cache and how to use filters to make the best use of it. It also covers the shard query cache and how to leave enough room for the operating system to cache indices while still leaving enough heap for Elasticsearch.
  • Monitoring and administration of a production cluster are covered in chapter 11. It covers additional settings that simplify cluster administration and the important metrics that should be monitored in production. Finally, it discusses backing up and restoring a cluster’s data.
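
The request grouping described for chapter 10 can be sketched for the bulk case as follows. This is a minimal illustration that only builds the newline-delimited JSON body used by Elasticsearch's bulk endpoint; it does not send anything. The helper name `build_bulk_body` and the index, type, and document values are hypothetical, and the `_type` metadata assumes the 1.x/2.x-era API.

```python
import json


def build_bulk_body(actions):
    """Serialize (action, metadata, source) tuples as newline-delimited JSON.

    Each operation is an action line, followed by a source line for
    operations that carry a document. The body must end with a newline.
    """
    lines = []
    for action, metadata, source in actions:
        lines.append(json.dumps({action: metadata}))
        if source is not None:  # "delete" has no source line
            lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"


# Hypothetical index, type, and document values.
body = build_bulk_body([
    ("index", {"_index": "get-together", "_type": "group", "_id": "1"},
     {"name": "Elasticsearch Denver"}),
    ("delete", {"_index": "get-together", "_type": "group", "_id": "2"}, None),
])
print(body)
```

Two operations travel in one request body here, which is where the saving in network round trips comes from.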

Additional information about Elasticsearch is covered by six appendixes:

  • Appendix A is about geospatial search which makes an application location-aware. Elasticsearch supports storing locations – points and polygons – and common spatial operations such as distance between points, point containment by a shape and shapes overlapping. The appendix demonstrates how searches can leverage this functionality.
  • Appendix B shows how to manage Elasticsearch plug-ins, which are a powerful way to extend or enhance the functionality that Elasticsearch provides out of the box. The appendix introduces two types of plug-ins: site plug-ins and code plug-ins. A site plug-in is one that provides no additional functionality; it simply provides a web page served by Elasticsearch. A code plug-in is any plug-in that includes JVM code that Elasticsearch executes. This appendix explains how to install, use and remove plug-ins.
  • Highlighting indicates why a document matched a query by emphasizing the matching terms, giving the user an idea of what the document is about and showing its relationship to the query. Support for highlighting in Elasticsearch is presented in appendix C. It covers the highlighting options and their implementation.
  • The Elasticsearch community offers a wide array of monitoring plug-ins that make it easier to manage cluster state and indices, and to perform queries via an attractive user interface. These plug-ins are covered in appendix D.
  • Appendix E explains how to use Elasticsearch’s percolator, which is typically described as “search upside down”. The percolator lets you index queries instead of documents. This registers the query in memory so that it can be run quickly later. With the percolator, you send a document to Elasticsearch instead of a query. This is called percolating a document, which basically means indexing it into a small in-memory index. Registered queries are run against the small index so that Elasticsearch finds out which queries match.
  • Finally, appendix F explains how to use different suggesters in order to implement did-you-mean and autocomplete functionality. The basic functionality described here includes term and phrase suggestion, completion and context suggestion. The appendix describes the implementation of various suggesters and the Elasticsearch APIs exposing this functionality.
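
The “search upside down” idea behind the percolator in appendix E can be illustrated with a toy model: queries are stored first, and documents are then matched against them. This is only a sketch of the concept under simplifying assumptions (queries are sets of required terms, text is split on whitespace); the class and method names are hypothetical and are not Elasticsearch's API.

```python
class ToyPercolator:
    """Toy model: store queries, then match incoming documents against them."""

    def __init__(self):
        self.queries = {}  # query id -> set of required terms

    def register(self, query_id, terms):
        """'Index' a query: keep its required terms in memory."""
        self.queries[query_id] = {t.lower() for t in terms}

    def percolate(self, text):
        """Return the ids of registered queries whose terms all occur in text."""
        tokens = set(text.lower().split())
        return sorted(qid for qid, terms in self.queries.items() if terms <= tokens)


percolator = ToyPercolator()
percolator.register("es-alert", ["elasticsearch"])
percolator.register("solr-alert", ["solr"])
print(percolator.percolate("New Elasticsearch release announced"))
```

Only the query about Elasticsearch matches the percolated document, so only its id is returned.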

Manning provided InfoQ readers with an excerpt from chapter 8 of the book, “Relationships between documents.”

InfoQ interviewed the book’s authors to learn more about Elasticsearch’s implementation and usage.

InfoQ: In your book, when discussing typical Elasticsearch use cases, one of the options you propose is “Adding Elasticsearch to an existing system”. This option assumes adding Elasticsearch alongside the existing data storage. Because data becomes available in Elasticsearch with a delay, this can create race conditions between the data in the different stores. This is especially dangerous when data has been deleted but is still returned by an Elasticsearch query. Any suggestions on how to deal with a situation like this?

Roy Russo: The simplest answer for this scenario is to manage the process at the application level by wrapping the calls that delete from the separate datastores within one try-catch block, much like you would on write. So first comes the call to Elasticsearch to delete the record, followed by the call to the primary data store. If either fails, exception handling logic takes over. There are no transactions available, so exception handling will likely have to incorporate a manual “rollback”.

Since most of the Elasticsearch usage I’ve seen in the wild is based on time-series data, scenarios like this aren’t common. Except in rare circumstances, most software doesn’t regularly have to delete records from a time-series store. Alternatively, one can use Elasticsearch as the primary and sole data store, but that brings up all sorts of other healthy debates that I’m not prepared to go into.
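
The delete-then-compensate pattern described in this answer might look roughly like the following sketch. It is not taken from the book or from any client library: `InMemoryStore` is a hypothetical stand-in for both the Elasticsearch client and the primary datastore client, and `delete_everywhere` is an illustrative name.

```python
class InMemoryStore:
    """Hypothetical stand-in for a datastore client (search index or database)."""

    def __init__(self, records, fail_on_delete=False):
        self.records = dict(records)
        self.fail_on_delete = fail_on_delete

    def delete(self, key):
        if self.fail_on_delete:
            raise RuntimeError("delete failed")
        return self.records.pop(key)

    def put(self, key, value):
        self.records[key] = value


def delete_everywhere(search_index, primary, key):
    """Delete from the search index first, then from the primary store.

    There is no shared transaction, so a failure in the second delete is
    compensated by re-indexing the document that was already removed.
    """
    removed = search_index.delete(key)   # step 1: stop returning it in searches
    try:
        primary.delete(key)              # step 2: remove it from the primary store
    except Exception:
        search_index.put(key, removed)   # manual "rollback" of step 1
        raise
```

If the first delete fails, nothing has changed yet and the exception simply propagates; only a failure in the second step needs compensation.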

InfoQ: In his Elasticsearch vs. Solr article, Otis Gospodnetić claims that Solr is more open source than Elasticsearch, which is controlled by a single company. Do you agree with this assessment, and is this a problem for Elasticsearch users?

Roy Russo: First, let me say that I’m a big fan of Otis’ work at Sematext, but I don’t agree with his observation. Both projects are distributed under the Apache Software License, both accept commits from the community, and both benefit from transparency in issues, pull requests, and discussions. In any event, given the choice between professional open source and amateur open source, it is left up to the reader to choose which is the best fit for their next mission-critical deployment.

Elasticsearch users benefit from a well-funded company offering support services, plugins, and add-on products such as Watcher (an alerting mechanism), Shield (Auth/Authz support), and Hadoop integration. In a short amount of time, Elastic has turned this open source project into a product. There is a distinction here, as having a company drive the product adds quality assurance processes, product management to decide on feature additions and craft roadmaps, and a structured engineering team to execute. I know that in my current role deploying large clusters in mission-critical situations, knowing the product is supported by a well-financed and professional team of engineers provides confidence.

InfoQ: With every node in an Elasticsearch cluster providing an entry point for the REST APIs, does Elasticsearch provide any load balancing implementation ensuring that a given client will not overwhelm a specific node?

Roy Russo: Although one could get crafty configuring different node types (master, client, data) for a half-solution, if a truly fault-tolerant system is needed, I would advise users to “front” Elasticsearch with their own load balancer of choice. Nginx is a commonly preferred load balancer and reverse proxy. Because of that, there are plenty of tutorials and examples online demonstrating different load balancing schemes (round-robin, least-connected, etc.) and even using Nginx to restrict access (Auth/AuthZ) to certain API endpoints or HTTP methods.

InfoQ: In your book you show only the REST APIs for working with Elasticsearch. At the same time, there are several Java clients for Elasticsearch. Any recommendation on which one to use?

Roy Russo: For a full-featured implementation, the officially supported Java client is the best option. When using the official client, your software does not communicate via the REST API; instead, it acts as another node joining the cluster. The benefit of being attached to the cluster in that fashion is that it removes an extra hop for every request, as operations are automatically routed to the relevant node. For Spring users, Spring Data Elasticsearch may be preferred because it adds many of the convenience features available in ORMs. Spring Data Elasticsearch contains the official Elasticsearch Java client as a dependency, so expect it to behave much the same way with the added benefits of Spring Data.

InfoQ: The description of the parent-child relationship implementation is slightly confusing. First you write that parent documents are not subject to any special indexing, but child documents have a reference to their parents. But then you say that a child document has a reference to its parent. I would assume that this is an ID reference, but you write later that a parent can be added later. So how is the reference specified in this case? Also, you write later about a has-child field in the parent document. When is this field populated?

Lee Hinman: The reference that the child has to a parent document is specified via the special 'parent' parameter on the URL. If the parent document doesn't exist when a child document is indexed, Elasticsearch will still allow you to specify the 'parent' parameter of a parent that does not yet exist. Parent-child relationships are specified in a field of the child document (not the parent document, which is why it's okay for the parent not to exist when indexing a child), and the id mapping cache is loaded automatically as needed for has_child or has_parent queries.
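
As a rough illustration of the 'parent' URL parameter mentioned in this answer, the sketch below builds the URL for indexing a child document. It assumes the index/type/id URL layout of the 1.x/2.x-era REST API; the helper name `child_index_url` and the host, index, type, and ids are hypothetical, and no request is actually sent.

```python
from urllib.parse import quote, urlencode


def child_index_url(base, index, doc_type, doc_id, parent_id):
    """Build the URL for indexing a child document under a given parent id."""
    path = "/".join(quote(part, safe="") for part in (index, doc_type, doc_id))
    return f"{base}/{path}?{urlencode({'parent': parent_id})}"


# Hypothetical host, index, type, and ids.
print(child_index_url("http://localhost:9200", "get-together", "event", "1103", "2"))
```

Because the parent is referenced only by id in the child's request, nothing in this step requires the parent document to exist yet.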

InfoQ: When describing searching in a multi-node cluster, you say that “by default, primary and replica shards get hit by searches in round-robin.” Is locality a factor in this case? If, for example, one of the replicas is present on the node that accepted a request, will this replica be used?

Roy Russo: Yes, if a primary or replica exists on the current node, it will be used. However, it’s important to realize that a query to a cluster is broadcast to every shard in the index. Those shards execute the query locally and report back the matching documents.
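
The broadcast-and-merge flow described in this answer can be sketched as a toy scatter-gather in plain Python. This is an assumption-laden illustration, not Elasticsearch's implementation: shards are dictionaries, the "score" is a simple term count, and the function names are hypothetical.

```python
import heapq


def search_shard(shard_docs, term, size):
    """Run the query locally on one shard and return its top hits as (score, id)."""
    hits = [(text.lower().split().count(term), doc_id)
            for doc_id, text in shard_docs.items()]
    hits = [h for h in hits if h[0] > 0]
    return heapq.nlargest(size, hits)


def search_all_shards(shards, term, size=2):
    """Broadcast the query to every shard, then merge the per-shard results."""
    partial = [hit for shard in shards for hit in search_shard(shard, term, size)]
    return [doc_id for _, doc_id in heapq.nlargest(size, partial)]


# Hypothetical index split into two shards.
shards = [
    {"a": "elasticsearch in action", "b": "solr in action"},
    {"c": "elasticsearch elasticsearch everywhere", "d": "lucene internals"},
]
print(search_all_shards(shards, "elasticsearch"))
```

Each shard only reports its own best matches, and the node that received the request combines them into the final ranking.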

InfoQ: In your book you describe the implementation of concurrency control by using a version number for each document. Is Elasticsearch storing several versions of a given document or just the latest version with a version number?

Roy Russo: This is a topic that comes up often and sparks confusion. Elasticsearch does not provide versioning in the sense of retaining copies of the original document. It merely keeps a counter, “_version”, that increments when the document is updated, indexed, or deleted. This is handy for optimistic locking, as it helps guarantee that you are updating the same version of the document and not a different version that could have been edited by a concurrent update.
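
The version-counter behavior described in this answer can be modeled with a small in-memory store. This is a toy sketch of optimistic locking in general, not Elasticsearch code; `VersionedStore`, `VersionConflict`, and the `expected_version` argument are hypothetical names.

```python
class VersionConflict(Exception):
    """Raised when the caller's expected version is stale."""


class VersionedStore:
    """Toy store that keeps only the latest document plus a version counter."""

    def __init__(self):
        self.docs = {}  # doc id -> (version, source)

    def index(self, doc_id, source, expected_version=None):
        current_version, _ = self.docs.get(doc_id, (0, None))
        if expected_version is not None and expected_version != current_version:
            raise VersionConflict(
                f"expected version {expected_version}, found {current_version}")
        self.docs[doc_id] = (current_version + 1, source)  # old source is discarded
        return current_version + 1


store = VersionedStore()
v1 = store.index("1", {"title": "draft"})
v2 = store.index("1", {"title": "final"}, expected_version=v1)
print(v1, v2)
```

A second writer that still holds `v1` and tries `store.index("1", ..., expected_version=v1)` would now get a `VersionConflict`, which is the signal to re-read the document and retry.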

InfoQ: In your book you describe two mechanisms for node discovery: multicast and unicast (a list of hosts). Neither one will work well in AWS. Are there any other node discovery mechanisms, for example, one based on an AWS load balancing group?

Roy Russo: When deploying on AWS, I heavily recommend the Elasticsearch cloud-aws plugin. Using either an AWS IAM Role assigned to the EC2 instance or configuring Elasticsearch with AWS access keys will make nodes magically find each other. For those wanting to use an IAM Role configuration, note that you can only specify the IAM Role on the EC2 instance when it is created, and it can never be edited again. The cloud-aws plugin seamlessly allows nodes in the same named cluster to communicate, and is certainly the recommended and supported way of achieving discovery in an AWS environment.

Additional note: As of the 2.x release, this plugin is packaged separately from the S3 plugin. Those of us automating backups to S3 have to install that plugin separately now.

InfoQ: When Elasticsearch implements Lucene index merges, are they done only on the primary shards and then copied to all replicas?

Lee Hinman: No, merges in Elasticsearch run independently. Since Elasticsearch uses document-level replication instead of file-level, merges will occur in a non-deterministic manner on each of the individual shards of an index.

InfoQ: With AWS now providing Elasticsearch as a managed service, what impact is it going to have on Elasticsearch adoption? How is it simplifying cluster administration?

Roy Russo: Frankly, I hope it doesn’t hurt Elasticsearch adoption. I love AWS, but the Elasticsearch roll-out makes that offering look like amateur hour and could possibly leave a bad taste in users’ mouths with regard to Elasticsearch.

Being able to deploy a cluster using the AWS Elasticsearch service in a few clicks is convenient, but you’ll be trading flexibility and scalability for that convenience until they make changes to how they expose configuration information and functionality. Some of the biggest issues I’ve seen and experienced are: not being able to tune or customize Elasticsearch performance and logging, not being able to install plugins, no choice on switching from Elasticsearch v1.5.x, and their blocking CORS. Add to that the fact that you cannot run this service inside of a VPC and the additional cost associated with it, and I would steer clear until Amazon cleans up its act on this service. Who knows? Maybe they can read our book and make the changes needed for a successful Elasticsearch service.

About the Book Authors

Roy Russo is the Vice President of Engineering at Predikto, Inc, a predictive analytics company. Before joining Predikto, Roy was the Chief Architect at AltiSource Labs, a FinTech startup based in Atlanta, GA. Roy was the Co-Founder and VP of Product Management for Atlanta-based Marketing Automation vendor LoopFuse, recently acquired by Atlanta-based SalesFusion, Inc. Roy also helped co-found JBoss Portal, a JSR-168 compliant enterprise Java Portal, and represented JBoss on the Java Content Repository, JSR-170. He is currently the founder of ElasticHQ.org, the leading open-source monitoring and management application for Elasticsearch clusters, and co-author of Elasticsearch in Action.

Radu Gheorghe works for Sematext, where he provides clients with search consulting, production support and training for various Elasticsearch and Solr deployments. Passionate about logging tools such as rsyslog (yes, that can be a passion), he also gets to work on Sematext's log analytics service, Logsene.

Matthew Lee Hinman is a passionate software developer looking for challenging software development. He is an active contributor to the open source, Clojure and Elasticsearch communities. Lee enjoys working in teams on challenging and interesting problems. He cares a lot about code quality and releases the majority of his extracurricular code as open source.
