Book Review: "Nagios: Building Enterprise-Grade Monitoring Infrastructures for Systems & Networks"
David Josephsen recently published "Nagios: Building Enterprise-Grade Monitoring Infrastructures for Systems and Networks, Second Edition". The book contains best practices for building monitoring infrastructure, lessons in operational theory focused on the use of Nagios, and practical guidance for implementing Nagios. David wrote the book primarily for system engineers and enterprise architects, though it has information relevant to most roles in technology. He leads readers through systems thinking about the Nagios software ecosystem, providing system integration details that build on the basics found in the online documentation and covering advanced topics drawn from real-world use of Nagios.
David makes it clear that a deep understanding of the business and technology within the scope of a monitoring solution is extremely important. He calls it a "procedural approach", in which the implementation is well thought out rather than put together piecemeal. In his view piecemeal approaches are fraught with issues and often leave teams unable to reason about important, deep technical aspects of a monitoring solution. The book advises on which system-wide characteristics to focus on, including processing requirements, network locations, network dependencies, security, alarm abuse, and watching ports vs. watching applications.
Even though an implementation calls for a lot of architectural and business knowledge up front, Nagios itself refrains from making assumptions about the variety of systems that need monitoring. In fact, Nagios doesn't do any monitoring on its own; its purpose is to schedule monitoring checks and fire notifications based on the results of those checks. Nagios delegates the actual monitoring to plugins that return text indicating status. By doing so it avoids relying on monolithic agents and stays aligned with the Unix philosophy, which Doug McIlroy summarized as follows:
"Write programs that do one thing and do it well. Write programs to work together. Write programs to handle text streams, because that is a universal interface."
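To make the plugin contract concrete, here is a minimal sketch of a check written in Python. It is not taken from the book. It relies on the widely used Nagios plugin convention of exit codes 0 to 3 (OK, WARNING, CRITICAL, UNKNOWN) plus one line of status text, optionally followed by performance data after a pipe character. The check itself (disk usage), the function names, and the 80/90 percent thresholds are assumptions chosen for illustration.

```python
import shutil

# Exit codes used by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(used_pct, warn=80.0, crit=90.0):
    """Map a disk-usage percentage to (exit_code, status_line).

    The thresholds are illustrative defaults, not values from the book.
    """
    if used_pct >= crit:
        code = CRITICAL
    elif used_pct >= warn:
        code = WARNING
    else:
        code = OK
    # Human-readable text first, then performance data after the pipe.
    line = (
        f"DISK {LABELS[code]} - {used_pct:.1f}% used"
        f" | used_pct={used_pct:.1f}%;{warn};{crit}"
    )
    return code, line


def main(path="/"):
    """Run the check, print the status line, and return the exit code."""
    try:
        usage = shutil.disk_usage(path)
        code, line = evaluate(100.0 * usage.used / usage.total)
    except OSError as exc:
        # Anything the check cannot measure is reported as UNKNOWN.
        code, line = UNKNOWN, f"DISK UNKNOWN - {exc}"
    print(line)
    return code

# A real plugin script would end with: raise SystemExit(main())
```

Nagios only sees the exit code and the printed line, which is why the same scheduler can drive checks written in any language.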
Theory of Operations
Nagios uses a well-defined paradigm with two main logical objects, "hosts" and "services", which abstract the systems being monitored and their constituent components. Services belong to hosts, and a dependency construct accommodates relationships between hosts or between services. Nagios offloads the host and service checks to plugins that manage the status evaluations, with each type of check potentially having its own plugin. David takes the opportunity to explain the details of monitoring both Windows and Unix. Windows provides scriptable technologies including WScript, OLE, COM, WMI, and PowerShell. NSClient++ provides the interface layer between Nagios and the Windows scripting technologies through the NRPE protocol. David describes NRPE as the tool of choice for remote execution of plugins on Unix/Linux systems, where it interfaces with plugins written in bash, Python, Ruby, and Perl as well as command line system tools. He rounds out the discussion of plugin-based monitoring by examining what can be built to monitor other kinds of equipment, such as network gear and environmental sensors.
The book makes it abundantly clear that Nagios is mainly a scheduling and notification system. Nagios manages the scheduling of checks through advanced algorithms that account for long check executions, problem state retries, interleaving to prevent load on remote systems, and purposefully delaying checks to manage load on the Nagios server. That said, Nagios still runs checks in parallel and then uses reapers to collect their results from a message queue. Nagios fires notifications based on the transition between states and the associated state type (HARD or SOFT). These notification events can in turn be configured to send emails, interface with call systems like PagerDuty, and go through escalations that perform further alerting based on how long an issue lasts. The transition out of the OK state is governed by SOFT states, which let Nagios retry a failing check before notifying and so suppress notifications for transient, low-importance issues.
Nagios complements the highly focused purpose of its algorithmic core by providing flexible interfaces for I/O, thereby allowing itself to become part of a larger solution architecture for logging and monitoring. The main I/O interfaces include a web interface, reporting, an external command file through which Nagios accepts commands, performance data processing, and advanced low-level event broker integration. The performance data processing provides an integration point for advanced visualizations using round-robin database tools (e.g. RRDtool) and graphing systems (e.g. Graphite). The event broker provides advanced integration, including querying Nagios state through modules such as MK Livestatus, which can be used for tactical display integrations (e.g. NagVis).
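The interplay between SOFT and HARD states is easier to follow as a small state machine. The sketch below is a deliberately simplified model and not Nagios's actual implementation: it assumes a single service with a max_check_attempts setting and ignores renotification intervals, flap detection, host state, and escalations. The class and method names are hypothetical.

```python
from dataclasses import dataclass

OK, WARNING, CRITICAL = 0, 1, 2


@dataclass
class ServiceState:
    """Simplified SOFT/HARD tracking for one service (illustrative only)."""

    max_check_attempts: int = 3
    state: int = OK
    state_type: str = "HARD"
    attempt: int = 1

    def process(self, result):
        """Apply one check result; return True if a notification would fire."""
        previous_state, previous_type = self.state, self.state_type
        if result == OK:
            self.state, self.state_type, self.attempt = OK, "HARD", 1
            # A recovery notifies only if the problem had become HARD.
            return previous_state != OK and previous_type == "HARD"
        if previous_state == OK:
            # First failure after OK: start counting attempts.
            self.attempt = 1
        elif previous_type == "SOFT":
            # Still retrying: count another attempt.
            self.attempt += 1
        self.state = result
        if self.attempt >= self.max_check_attempts:
            self.state_type = "HARD"
            # Notify when the problem turns HARD or a HARD problem changes state.
            return previous_type == "SOFT" or previous_state != result
        self.state_type = "SOFT"
        return False
```

With the default of three attempts, three consecutive CRITICAL results produce one notification on the third, while a single CRITICAL followed by OK produces none, which is the transient-issue suppression the review describes.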
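As an illustration of the performance data integration point, the following sketch extracts metrics from a plugin's output line so they could be handed to a graphing tool. It assumes the common plugin convention in which performance data follows a pipe character as space-separated label=value[unit];warn;crit;min;max tokens, and it assumes labels contain no spaces. The function name is hypothetical, and forwarding to RRDtool or Graphite is left out.

```python
import re

# A numeric value with an optional unit of measure, e.g. "0.12s", "85%", "512B".
_VALUE = re.compile(r"^(-?\d+(?:\.\d+)?)([a-zA-Z%]*)$")


def parse_perfdata(plugin_output):
    """Return {label: (value, unit)} from the text after the first '|'."""
    _, sep, perf = plugin_output.partition("|")
    metrics = {}
    if not sep:
        return metrics  # the plugin emitted no performance data
    for token in perf.split():
        label, eq, rest = token.partition("=")
        if not eq:
            continue  # not a label=value token
        # Only the first field is the value; thresholds and ranges follow.
        match = _VALUE.match(rest.split(";")[0])
        if match:
            metrics[label.strip("'")] = (float(match.group(1)), match.group(2))
    return metrics
```

For example, a line such as "HTTP OK - 0.012s response | time=0.012s;1;2;0 size=5120B;;;0" yields one entry for time and one for size.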
Nagios XI is the commercial product that extends Nagios Core and provides ease of use on top of the Nagios Core configuration, often considered the hardest part of working with Nagios. Additionally the commercial product comes with professional support services and has a list of integrated capabilities that would otherwise require the user to integrate those systems themselves.
Nagios is open source and has been purposely designed with integration in mind. David covers the "Event Broker Interface", and while he doesn't intend the book to be a deeply technical programming text, the example C code goes deep enough to help readers understand how to build components that extend Nagios's core functionality.
David made the book into an advanced, holistic view of the installation, configuration, integration, and operation of Nagios. He provides enough detail to qualify the book as reference documentation. Even more than that, he helps build familiarity with the Nagios software ecosystem, so that readers will know where Nagios ends and which technologies to investigate further for specific uses.
"The Nagios daemon was designed for and on Linux, but it is capable of being run by any UNIX-ish operating system"
The Nagios daemon (or server) is easily installable on most Linux distributions and Unix variants. The biggest difference between installations is the location of the files, although most align with the Filesystem Hierarchy Standard:
- Configuration Files - /etc/nagios
- HTML - /usr/share/nagios
- CGIs - /usr/share/nagios or /usr/lib/nagios
- Program daemon and other executables - /usr/bin/nagios
- LockFiles and FIFOs - /var/lib/nagios or /var/log/nagios
- Logs - /var/log/nagios
- Plug-ins - /usr/libexec/nagios or /usr/lib/nagios
Nagios is written in C and has very few dependencies. The dependencies vary based on the features and plugins in use. The web front-end requires a web server with CGI support (e.g. Apache). The plugins require more dependencies because they are the components that actually monitor the systems; examples include ping, OpenSSL, BIND tools, Perl, and Python.
Nagios can be installed from Linux/Unix distribution-specific packages or from source. Both use standard mechanisms. The hard part comes when it is time to configure the installation. David does not mean the book to be a lengthy reference for settings files, but he does devote chapter 4 to explaining the configuration of Timeperiods, Commands, Contacts, Contactgroups, Hosts, Services, Hostgroups, Servicegroups, Escalations, and Dependencies.
Plugins are installed on both the server and the remote systems in the Nagios architecture. Nagios interacts with the plugins on remote systems through the NRPE protocol. Users must install NRPE on the remote systems so that it can be accessed via TCP/IP. It runs as a service that can be configured to accept only known connections and run only known commands.
David wrote an entire chapter on scaling Nagios past its basic single-daemon installation. Techniques include distributed passive checks, secondary Nagios daemons, DNX, Merlin, and Mod-Gearman.
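Passive checks ultimately reach the central Nagios daemon as lines written to the external command file mentioned earlier. The sketch below formats such a line using the PROCESS_SERVICE_CHECK_RESULT external command. The helper names and the host and service names are hypothetical, and the command file location varies by installation, so the path is left as a parameter rather than assumed.

```python
import time


def passive_result_command(host, service, return_code, output, now=None):
    """Format one passive service check result as an external command line."""
    timestamp = int(time.time() if now is None else now)
    return (
        f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;"
        f"{host};{service};{return_code};{output}\n"
    )


def submit(command_line, command_file):
    """Write the command to the Nagios command file (a named pipe).

    command_file is installation specific; check the command_file
    setting in the main Nagios configuration.
    """
    with open(command_file, "w") as fifo:
        fifo.write(command_line)
```

In a distributed setup the remote side usually does not write to this file directly; a transport such as NSCA or NRDP carries the result to the central server, which then submits it.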
InfoQ did an interview with David Josephsen about his book.
InfoQ: What are the key Nagios implementation tasks that improve collaboration between developers and operations staff?
David: DevOps is maybe the most drastic change our craft has seen in the last 20 years, and this is a good question because, in my experience, systems monitoring seems to be an area in which DevOps tends to devolve back into Dev and Ops, and I don't think that's a good thing. The problem with Nagios in this context has always been configuration. Opsy-types love flexibility, stable back-ends (files not databases), and rich, complex configuration parameters with lots of "hooks". Devsy-types like things that "just work", clean interfaces, and common data models that scale. Nagios Core and the solar system of tools orbiting it tend to accentuate the orthogonality of these personality types. The Dev rebellion against the classic Nagios-style tools is well documented in movements like #monitoringsucks et al. I'd say the key implementation tasks that bridge the gap between the two center around interfaces. If you are an Ops guy at a shop with a healthy, active Dev or DevOps contingent, you need to ensure that they have interfaces that look familiar to them (JSON/SQL/web services, etc.), and that (if possible) enable them to change things within the bounds of your change management framework. Otherwise they'll work around you. The easiest way to do this is probably to purchase XI for your organization, but there are certainly many ways to make Dev part of the conversation that won't cost money. NRDP, with its HTTPS/XML interface to passive check results, is a good example of an interface that might appeal to the Devsy in our midst. The first step is understanding what they want (and they probably don't know what they want yet, so the first step might be helping them understand what they want), and the next step is working WITH them to create a clean, 0-configuration (for them) interface to exactly that. Start small, and build up.
InfoQ: What software products, services, or open source projects are there that make good ticketing/help desk systems that integrate with Nagios?
David: Well, if you don't already have a help-desk system and Nagios integration is an overwhelming factor in your decision-making process, then the obvious choice is Nagios Incident Manager, which is a new-ish ticketing system made by Nagios Enterprises. XI and Incident Manager were literally made for each other, so you won't have any integration trouble there. Personally I've had experience marrying Nagios to RT, Jira, and VersionOne, and I can tell you that the problems are rarely technical. It's pretty trivial to glue Nagios to any system with a REST API or something similar. There are even plugins out there to build on if you don't want to start from scratch. What's never trivial is the interplay between the humans involved. You really need to nail down and eliminate false positives, or any other sort of non-event, if you're thinking about help-desk integration. Take some time before you even start planning the interfaces to consider what subset of the Nagios alerts that you currently send might make good fodder for help-desk tickets. Ideally, you'd have a finite list of possible alerts and their meaning/escalation parameters to hand to the help desk well before the systems were tied together (protip: it should be a small list).
InfoQ: Have you experienced Nagios used in a test-driven infrastructure environment? By this I mean that the check is written before the system it monitors is implemented, so that when all the checks are OK, you can consider implementation of that system complete. If you have, can you describe it? If not, what do you think of the approach?
David: I've never personally used Nagios this way, and while I think it's a good idea to test prospective systems for correct and expected state before they hit production, there are probably better tools to accomplish this sort of thing. There are, for example, pretty amazing continuous integration systems out there like Hudson and Travis which might be shoe-horned into this sort of testing more easily than Nagios. The formally correct answer to this question, however, is probably Chef, or some other configuration management engine. Those guys are really knocking test-driven infrastructure out of the park. With Chef you can design systems that are verified and tested to be correct before the OS even hits the hard drive. While I think Nagios would work, I'm suggesting something else for two reasons, and depending on your environment, they might not apply to you, so I should tell you what they are. The first is that Nagios really is a notification engine, and in this context you're going to have to work against it to prevent notification by either configuring it to notify nothing, or by configuring it to notify and then turning off notifications. Either of these things is a waste of time, and just sort of triggers an OCD exception in my mind. The second is that your tests are going to be either difficult to design, or too simplistic. By that I mean that you can create a test suite for a web server that consists of check_http, check_ping, etc., but those tests are very specific, and while they do give you an idea that, yes, the webserver probably works on this new box, that's only sort of approaching the definition of a test-driven infrastructure. There's a lot more I want to know about a server before I put it into production. Ideally, I want to put it into production in a verifiably known state, which is the kind of thing I'd rely on Chef for.
About the Book Author
David Josephsen is the Director of Systems Engineering at DBG, Inc., where he maintains a collection of geographically dispersed server farms. He has more than a decade of hands-on experience with Unix systems, routers, firewalls, and load balancers in support of complex, high-volume networks. He authored the book "Building a Monitoring Infrastructure with Nagios" (Addison-Wesley), wrote three chapters in "Monitoring with Ganglia" (O'Reilly Media), and currently writes "iVoyeur", the systems monitoring column for ;login: magazine.
Re: great tool
Today, it seems like the shops that aren't using Nagios are doing one of three things:
1. Something crappy that isn't Nagios (MOM, OpenView, Patrol, et al.).
2. Going to the cloud (usually Circonus) like you guys are because easier/cheaper.
3. Rolling their own thing which usually has something to do with how hard it is to scale a polling-based system like Nagios.
But yeah, "security and stability" are certainly reasons to justify the use of Nagios rather than the use of something else. If you guys are using a hosted system there are (and always will be) huge questions around security and stability that you'll never really be able to answer in a practical sense, and your management should be acknowledging and accepting that rather than sort of waving their hands in the general direction of the cloud and proclaiming it secure and stable.