Interview and Book Review: The LogStash Book, Log Management Made Easy
James Turnbull makes a compelling case for using LogStash to centralize logging by explaining its implementation details within the context of a logging project. The LogStash Book targets both small companies and large enterprises through a two-sided case: the low barrier to entry and the scaling capabilities. James talked about the book on Hangops in February of this year: "It's designed for people who have never seen LogStash before: sysadmins, developers, devops, operations people. I expect you to know a little about Unix or Linux." He continued, "Additionally it assumes you have no prior knowledge of LogStash."
The Problem of Oversimplifying Log Management
James comes from a system administration and security background. He explains how computing environments have evolved log management practices in ways that do not scale.
He shares that log management generally falls apart through an evolutionary process, starting at the moment logs become most important to people, that is to say when trouble strikes. At that point new administrators start examining the logs with the classical tools: cat, tail, sed, awk, perl, and grep. This practice develops a good skill set around useful tools; however, it does not scale beyond a few hosts and log file types. Upon realizing the scalability issue, teams evolve toward centralized logging with tools such as rsyslog and syslog-ng.
While this starts to address scale, James shares that it does not really solve the problem of log management: there is now an overwhelming number of different log event types, different formats, different time zones, and a general lack of easily understandable context. Finally, a team may retrofit its computing environment with logging technology that can handle large amounts of storage, search, filtering, and the like. Unfortunately, this approach involves a lot of waste and comes at a relatively high cost. LogStash saves the day by offering a low barrier to entry, like the classical system administrator tools, while being fully architected to scale to large web-scale deployments.
LogStash Architecture Overview
LogStash provides an architecture for collecting, parsing, and storing logs. In addition, one of the main cross-cutting use cases for a LogStash implementation is viewing and searching the managed log events. James recommends the open source project Kibana for searching events. Earlier this month Jordan Sissel, the creator of LogStash, tweeted that the "latest daily build of logstash ships with kibana3: java -jar logstash.jar kibana". Both James and Jordan are writing about Kibana because it provides a user-friendly search interface that integrates with Elasticsearch, the storage engine for LogStash. The following screenshot of Kibana comes from chapter 3 of The LogStash Book; a demo is available online:
Beyond the viewing of logs, there is an architecture of components that manages the flow of logs from disparate servers through a broker and ultimately into storage. James takes readers through each component of the out-of-the-box LogStash configuration, which uses Redis, an open source key-value store, to queue logs in preparation for indexing, and Elasticsearch both to store logs and to back the viewing system. The following diagram from chapter 3 shows the distinct architectural component types: shipper, broker, indexer, and viewer:
In the book, James drills into the three primary functions of a LogStash instance: collecting input events, filtering event data, and outputting events. These three functions are driven by configuration stored in an easy-to-understand ".conf" file, which has a section for each of the three plugin types LogStash uses: input, filter, and output. Each LogStash instance is customized to meet the requirements of its role in the overall architecture. For example, this configuration for an indexer (containing one input and two outputs) comes from chapter 3:
input {
  redis {
    host => "10.0.0.1"
    type => "redis-input"
    data_type => "list"
    key => "logstash"
  }
}
output {
  stdout {
    debug => true
  }
  elasticsearch {
    cluster => "logstash"
  }
}
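For reference, a ".conf" file that uses all three plugin types takes the following general shape; the plugin choices here are illustrative, not taken from the book:

input {
  # where events come from
  stdin {}
}
filter {
  # filters are applied to each event in order
}
output {
  # where processed events go
  stdout { debug => true }
}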
Open Source LogStash Fits DevOps Cultures
LogStash is a freely usable open source tool built to work within an open source tool ecosystem, so it makes sense that James, a self-proclaimed "Open Source Geek", has written a book about it. All the tools in the LogStash ecosystem can be installed, configured, and managed from the command line, which is ideal for automation. Throughout the book James makes it clear that the ideal is to use an automated configuration management system to control the installation and configuration of each LogStash component. That subject was beyond the scope of the book, however, so he covers the next best thing: an understanding of how the software components are installed and configured. Combining that understanding with automation makes building out multiple environments seamless. Teams are then free to spin environments up and down for different project phases, QA, troubleshooting, and in support of Continuous Delivery.
Installing Java, LogStash, Redis, and Elasticsearch
James shows how easy the step-by-step process of installing LogStash and its dependent components really is. Fittingly, Jordan Sissel made this one of the LogStash project principles: "Community: If a newbie has a bad time, it's a bug." LogStash depends on Java, so installing at least OpenJDK is necessary to run it. Most Linux distributions make OpenJDK readily available through their package management systems. For example, on Debian/Ubuntu the command to install it is "sudo apt-get install openjdk-7-jdk", and on Red Hat/CentOS it is "sudo yum install java-1.7.0-openjdk". Jordan Sissel distributes LogStash as a single jar file, which can be retrieved from the LogStash.net homepage. The jar file and a configuration file specified on the command line are all that is needed to run LogStash on top of OpenJDK. Redis is likewise easily installed through package management systems: the command on Debian/Ubuntu is "sudo apt-get install redis-server" and on Red Hat/CentOS is "sudo yum install redis". Redis can begin processing events after a few configuration settings are modified in "/etc/redis/redis.conf" and it is started as a service. Elasticsearch, like LogStash, only requires that Java be pre-installed. It can be downloaded as a ".deb" for Debian/Ubuntu and as a ".rpm" for Red Hat/CentOS. It too needs only a minor amount of configuration, in "/etc/elasticsearch/elasticsearch.yml", before it is ready to start.
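On a Debian/Ubuntu host, the installation steps described above can be sketched as the following commands (the jar download URL is illustrative; see the LogStash.net homepage for the current link):

```shell
# Install OpenJDK 7, which LogStash requires
sudo apt-get install openjdk-7-jdk

# Install Redis, the broker in the out-of-the-box architecture
sudo apt-get install redis-server

# Download the single LogStash jar (URL is illustrative)
wget http://logstash.net/logstash.jar

# Run LogStash with a configuration file specified on the command line
java -jar logstash.jar agent -f logstash.conf
```

The Red Hat/CentOS equivalents swap apt-get for yum, with the package names given in the text.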
LogStash Components: Shipper, Broker, Indexer
The book covers the three LogStash plugin types in the context of their usage in shippers and indexers. James shows how to use the following input plugins: file, stdin, syslog, lumberjack, and redis. For environments where LogStash can't be installed, there are other options for sending events that integrate with LogStash: syslog, Lumberjack, Beaver, and Woodchuck. There is overlap between input and output plugins; for example, there are both input and output redis plugins. In addition to the two main outputs covered, redis and elasticsearch, James also includes outputs that integrate with other systems, including Nagios, email alerts, instant messages, and StatsD/Graphite. The filters covered in the book include grok, date, grep, and multiline. James shows how the filter plugins enable efficient processing of Postfix logs and Java application logs. In some cases logs can be structured before LogStash receives them; for example, Apache's custom log format capability allows logging in a JSON format that LogStash can process without an internal filter plugin. The broker, which here is Redis, manages event flow; LogStash also supports other queue technologies in this role: AMQP and ZeroMQ. The indexer instance of LogStash performs the routing to search/storage.
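As a sketch of what a filter section looks like, the following fragment combines the grok and date filters mentioned above; the field names and patterns are illustrative, not taken from the book:

filter {
  grok {
    # Parse a syslog-style message line into structured fields
    match => [ "message", "%{SYSLOGBASE} %{GREEDYDATA:syslog_message}" ]
  }
  date {
    # Use the parsed timestamp as the event's timestamp
    match => [ "timestamp", "MMM dd HH:mm:ss", "MMM  d HH:mm:ss" ]
  }
}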
Scaling LogStash accomplishes three main goals: resiliency, performance, and integrity. The following diagram is from chapter 7 of the book; it illustrates the scaling of Redis, LogStash, and Elasticsearch:
LogStash does not depend on Redis to manage failover itself. Instead, LogStash sends events to one of the two Redis instances it has configured; if the selected Redis instance becomes unavailable, LogStash begins sending events to the other configured instance. As an indexer, LogStash is easily scaled by creating multiple instances that continually pull from all available brokers and output to Elasticsearch. Within this design each event reaches only one broker, so no duplicates should pass through the LogStash indexer into Elasticsearch. Elasticsearch clusters itself when you install multiple instances with common settings; depending on configuration, it uses multicast, unicast, or an EC2 plugin for discovery. As long as the network allows the instances to communicate, they will form a cluster and begin dividing the data among the cluster nodes. The divisions in the data are made automatically to provide resiliency and performance.
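For example, a minimal pair of settings that each Elasticsearch node might share in "/etc/elasticsearch/elasticsearch.yml" to form a cluster could look like this (the values are illustrative):

# Nodes with the same cluster.name discover each other
# (via multicast by default)
cluster.name: logstash
# A human-readable name unique to each node
node.name: "es-node-1"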
InfoQ also interviewed James Turnbull about LogStash.
InfoQ: What benefits and costs do you think would be in a successful business case for a LogStash centralized logging system?
James: The benefits of LogStash are typical to most open source systems management tools: low software cost, open source and hence extensible, rapid development and bug fixes, awesome community developing solutions and helping each other out and the ability to help shape the roadmap to solve your problems. The costs are also similar to most open source tools: no commercial support, not as fully featured as some of the commercial alternatives, and may have a higher barrier to entry, both in terms of the skills needed and the documentation available (although now you have an awesome book!), for implementation.
InfoQ: Why would a DevOps oriented technology team choose the open source tools in the LogStash system over a commercial tool like Splunk?
James: The primary driver for that choice, for a DevOps team, would be cost. Whilst LogStash (and its dashboard Kibana) don't (yet) rival the capabilities of Splunk, they are free and are rapidly gaining capabilities and features. Splunk, meanwhile, comes with a considerable price tag that many smaller organizations can't necessarily afford. Of course, it's important to note that whilst the software is free there is still an implementation cost that may or may not be similar to comparative commercial tools.
InfoQ: What are the most easily understood and useful use cases for logging? If they would be different for different roles in an enterprise, could you please describe that difference?
James: The best use cases for logging are trouble-shooting and monitoring. The log data from your applications is often the best source of information when you have a problem in your infrastructure. They also represent an excellent source of data for monitoring the state and events in your infrastructure and for building metrics that demonstrate how your applications are performing.
This being said, different teams in enterprise organizations care about different aspects of those logging use cases. For example, operations teams focus on the trouble-shooting and performance data logs can provide. Application developers are keenly interested in using log output to help find and fix bugs. Security teams focus on identifying vulnerabilities and security incidents that log data might highlight.
About the Book Author:
James Turnbull is the author of six technical books about open source software and a long-time member of the open source community. James authored the first (and second!) books about Puppet and works for Puppet Labs running Operations and Professional Services. James speaks regularly at conferences including OSCON, Linux.conf.au, FOSDEM, OpenSourceBridge, DevOpsDays and a number of others. He is a past president of Linux Australia, a former committee member of Linux Victoria, was Treasurer for Linux.conf.au 2008, and serves on the program committee of Linux.conf.au and OSCON.