Making On-Call Less Painful for Developers by Using High-Quality Alerts

On-call is increasingly a reality for developers. Reducing noise with better alerts, automating alert configuration, and removing warnings can help make on-call work more humane.

At OOP 2022, Mario Fernández, staff engineer at Wayfair, shared his experiences of being on-call as a developer.

Companies decide to put developers on-call because they have no other option, as Fernández explained:

Users expect full availability nowadays. If I want to use a website to order something and it doesn’t work, I won’t come back. It’s too bad for the developers who already logged off for the day, but that’s the reality we live in.

Fernández suggested reducing noise by creating high-quality alerts. One approach that works well is burn rate alerts based on service level objectives (SLOs):

The idea is to define a metric that you want your system to comply with (the SLO). Then, you measure how fast it’s being consumed with a burn rate, and you only trigger the alert when your error budget is at risk.
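
The calculation behind a burn rate alert can be sketched in a few lines of Python. This is a minimal illustration rather than Fernández's implementation: the function names are hypothetical, and the threshold of 14.4 is only an example value (it corresponds to spending 2% of a 30-day error budget within one hour).

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Return how many times faster than sustainable the error budget is spent.

    A burn rate of 1.0 uses up exactly the whole budget over the SLO period.
    """
    error_budget = 1.0 - slo_target  # e.g. about 0.001 for a 99.9% SLO
    if error_budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / error_budget


def should_alert(errors: int, total: int, slo_target: float, threshold: float) -> bool:
    """Fire only when the observed window burns budget at or above the threshold."""
    if total == 0:
        return False  # no traffic, nothing to judge
    return burn_rate(errors / total, slo_target) >= threshold


# 20 failures out of 1,000 requests against a 99.9% SLO: roughly a 20x burn rate.
print(should_alert(errors=20, total=1000, slo_target=0.999, threshold=14.4))
# 1 failure out of 1,000 is roughly a 1x burn rate, which stays within budget.
print(should_alert(errors=1, total=1000, slo_target=0.999, threshold=14.4))
```

A single failed request does not page anyone here; only a sustained error ratio that threatens the budget does.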

At a certain level of complexity, doing things by hand just doesn’t scale, Fernández mentioned. For instance, burn-rate-based alerts have a lot of moving parts. Setting them up by hand takes a lot of work and is error-prone:

Automation reduces the work of keeping things up-to-date. If alerts are slightly misconfigured, but you have hundreds of them, you probably won’t change them. If you have a way of propagating changes automatically, you’ll be more predisposed to do it.
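
One way to picture this kind of propagation is to generate alert rules from a short list of SLO definitions plus a shared policy. The sketch below is an assumption for illustration only: the `Slo` class, the `POLICY` table, and the shape of the rule dictionaries are hypothetical and do not correspond to any specific monitoring tool.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class Slo:
    service: str
    target: float  # e.g. 0.999 for 99.9% availability


# Shared policy: (window, burn rate threshold). Example values only.
POLICY: Sequence[Tuple[str, float]] = (("1h", 14.4), ("6h", 6.0))


def alert_rules(slos: Sequence[Slo], policy: Sequence[Tuple[str, float]] = POLICY) -> List[Dict]:
    """Expand every SLO into one rule per policy entry."""
    rules = []
    for slo in slos:
        budget = 1.0 - slo.target
        for window, factor in policy:
            rules.append({
                "name": f"{slo.service}-burn-rate-{window}",
                "service": slo.service,
                "window": window,
                # Error ratio above which this rule should fire.
                "max_error_ratio": round(factor * budget, 6),
            })
    return rules


for rule in alert_rules([Slo("checkout", 0.999), Slo("search", 0.99)]):
    print(rule)
```

Because every rule is derived from `POLICY`, adjusting a threshold there updates the alerts for all services at once instead of requiring hundreds of manual edits.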

Fernández stated that warnings are evil. By warnings, he means signals that aren’t enough of a problem to trigger a "real" alert, for example a hard drive that is slowly filling up but still has plenty of space left.

Warnings are overused out of fear of missing something, Fernández said. This overuse creates a lot of noise and blurs the line between issues that need action and those that can wait:

You can remove most warnings and not lose any signal. Inspecting dashboards or going over logs regularly fulfils a similar purpose without the downsides.

InfoQ interviewed Mario Fernández about how to use alerts and automation and what can be done to fine-tune monitoring.

InfoQ: How do burn rate alerts reduce noise?

Mario Fernández: I’ve experimented quite a bit with burn rate alerts over the past year. They work really well because they strike a good balance between false positives and false negatives. You can build alerts that are very responsive when they need to be, but won’t ping developers constantly.

SLOs are a way to make a more systematic commitment to the level of support you want to provide. Business and technical sides are often not aligned on this at all, with the people on the on-call rotation paying the price for that. Google has written a lot about this in Alerting on SLOs.
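
(Editorial illustration, not part of the interview.) A common way to get both responsiveness and few false positives is to require that a long and a short window exceed the same burn rate threshold. The sketch below assumes that approach; the class and function names are hypothetical, and the numbers are example values.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class WindowPair:
    long_error_ratio: float   # error ratio over the long window, e.g. 1h
    short_error_ratio: float  # error ratio over the short window, e.g. 5m
    threshold: float          # burn rate both windows must reach


def should_page(slo_target: float, windows: Iterable[WindowPair]) -> bool:
    """Page when any pair has both windows burning at or above its threshold."""
    budget = 1.0 - slo_target
    for w in windows:
        long_burn = w.long_error_ratio / budget
        short_burn = w.short_error_ratio / budget
        # The long window shows the problem is significant,
        # the short window shows it is still happening right now.
        if long_burn >= w.threshold and short_burn >= w.threshold:
            return True
    return False


ongoing = [WindowPair(long_error_ratio=0.02, short_error_ratio=0.03, threshold=14.4)]
recovered = [WindowPair(long_error_ratio=0.02, short_error_ratio=0.0005, threshold=14.4)]
print(should_page(0.999, ongoing))
print(should_page(0.999, recovered))
```

In the second case the long window still looks bad, but the short window shows that the system has recovered, so nobody is paged for an incident that is already over.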

InfoQ: How can automation help to make on-call work more humane?

Fernández: A driving force behind automation is Infrastructure as Code. When you reflect changes in code, there’s less maintenance. Compared with documentation, code is easier to keep in sync with the actual reality on the ground.

Over time you can abstract that code so that it fits other use cases, which helps propagate best practices. It’s very frustrating to spend time fixing an issue, just to see another team fall into the same trap some time later.

InfoQ: What can be done to fine-tune monitoring constantly? What benefits can this bring?

Fernández: Systems aren’t static. They change over time, and so should the alerts that monitor a system. Otherwise things become outdated. An extreme case of this is when you don’t remove alerts for systems that are deprecated, and yet you still get alerts for them.

Constant tuning encourages incremental development. Instead of building everything in one big release, you do things bit by bit as needed. It’s less wasteful, as you build what you actually need. That prevents over-engineering. But that only works if you commit to constantly tuning things.
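
(Editorial illustration, not part of the interview.) The extreme case mentioned above, alerts that outlive the systems they monitor, is one part of tuning that could be checked automatically. The sketch below assumes alert definitions are available as a mapping from alert name to owning service; the function name and the data are hypothetical.

```python
from typing import Dict, List, Set


def stale_alerts(alerts: Dict[str, str], active_services: Set[str]) -> List[str]:
    """Return names of alerts whose service is no longer active."""
    return sorted(
        name for name, service in alerts.items()
        if service not in active_services
    )


alerts = {
    "checkout-latency": "checkout",
    "legacy-cart-errors": "legacy-cart",  # service has been deprecated
}
print(stale_alerts(alerts, active_services={"checkout", "search"}))
```

Running a check like this regularly, for example in a build pipeline, is one possible way to keep alert definitions from drifting away from the systems that actually exist.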
