Discussion on Nagios Fitness for Purpose
Andy acknowledges that Nagios has a simple plug-in model, conceptual simplicity and reliability. In his view, though, the drawbacks outweigh these strengths. According to Andy, Nagios is difficult to scale, as it does not support any kind of clustering. It is also difficult to configure, with lots of duplication between the Nagios server and the Nagios clients. Another sore point is the lack of an API, which would make system integration and the creation of custom dashboards easier. In the age of elasticity and the cloud, the need for the master to be told about each new client is also pointed out as a significant disadvantage.
Andy makes some proposals to deal with Nagios' failings. He suggests that Sensu is a good fit for monitoring, Graphite for graphing and Flapjack for alerting. For anomaly detection and the user interface, Andy is not comfortable with any of the current offerings.
Laurie reports that Etsy has "10,000 checks in our primary datacenter, all active, usually on 2-3 minute check intervals with a bunch on 30 seconds". Running at that scale required some tuning. The team enabled the use_large_installation_tweaks flag to bring check latency down. They also disabled dynamic CPU scaling on their HP and Dell servers, as Nagios does not seem to play well with the power management algorithms used by those boxes. When Etsy started to use two data centers, they chose to run a Nagios instance in each of them and used Nagdash to aggregate status and reports in a single place.
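To give a rough sense of that scale, the quoted figures can be turned into a back-of-the-envelope check rate. The sketch below assumes all 10,000 checks run at either the 3-minute or the 2-minute interval and ignores the 30-second checks, so it is only an approximate range, not a figure reported by Etsy.

```python
def checks_per_second(num_checks, interval_seconds):
    """Average number of checks the scheduler must start each second."""
    return num_checks / interval_seconds

# Assumption: every check uses the same interval (3 minutes or 2 minutes).
low = checks_per_second(10_000, 180)
high = checks_per_second(10_000, 120)

print(f"{low:.0f} to {high:.0f} checks per second")
```

The checks on 30-second intervals would push the real rate above this range.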
On the configuration front, Laurie asserts that:
If you spend your day picking through Nagios config files, then you probably either love it anyway, you're doing a huge rewrite of your old config, or you're probably doing it wrong. You can easily automate this.
Etsy has also been using nagios-api, a third-party project that presents a REST-like JSON interface to Nagios, to automate it.
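One common way to automate configuration along the lines Laurie describes is to generate Nagios object definitions from an inventory rather than editing them by hand. The following is a minimal sketch, not Etsy's actual tooling: the inventory list, the host names and addresses, and the use of a "generic-host" template are all assumptions made for illustration.

```python
# Hypothetical inventory; in practice this might come from a CMDB or a cloud API.
INVENTORY = [
    {"name": "web01", "address": "10.0.0.11"},
    {"name": "web02", "address": "10.0.0.12"},
]


def render_host(host, template="generic-host"):
    """Render a single Nagios host object definition as text."""
    return (
        "define host {\n"
        f"    use        {template}\n"
        f"    host_name  {host['name']}\n"
        f"    address    {host['address']}\n"
        "}\n"
    )


def render_hosts(inventory):
    """Render one definition per inventory entry, separated by blank lines."""
    return "\n".join(render_host(h) for h in inventory)


if __name__ == "__main__":
    # The output could be written to a .cfg file that Nagios is set up to load.
    print(render_hosts(INVENTORY))
```

Because the definitions are derived from a single source of data, adding a host means adding one inventory entry instead of repeating the same settings across several files.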
Laurie also makes a broader point about the failings Andy sees in Nagios. Laurie thinks that the Unix philosophy applies when working with Nagios: "Many small parts, applications that do a small specific thing, which you tie together using a pipe". The rich ecosystem that has grown around Nagios is a major advantage in his view.
Commenting on Laurie's writings, Theo Schlossnagle leans on the "Nagios is not enough" line of thinking:
Reading telemetry off systems and providing deep insight into their behavior is what we need from the operations side. That is a broad task that requires analytics on the data collected. Nagios and the myriad products designed like Nagios simply do not allow for this approach.