
Why the Most Resilient Companies Want More Incidents

Companies want more incidents because companies want more learnings! According to John Egan, the incident management process is meant to be a cycle that covers not just the response, but also an account of the root cause and updates to internal processes and practices across the industry. He advises lowering the barrier to reporting incidents, holding effective incident review meetings using blameless postmortems, and giving everyone access to postmortems.

Egan, former co-founder and product lead at Workplace by Facebook, spoke about how tech organizations are doing incident management at QCon Plus May 2021.

Major tech companies want more incidents because aiming for fewer incidents can be counterproductive, as Egan explained:

There’s an old way of thinking still holding on at many companies where we look at our incident counts and think, "How do we get fewer of these?" and that encourages all kinds of antipatterns that culminate in an underreporting of incidents across the board. By filing fewer incidents, companies miss countless learning opportunities because they’re not kicking off a process that supports that learning.

The major tech companies that are most resilient have become that way by absorbing the practices they've seen in other resilient industries, such as airlines, aerospace, and emergency response, Egan said. They work hard to lower the barrier to the incident management process as much as possible by encouraging and celebrating filing more incidents sooner.

When it comes to postmortems, you really want to focus on lowering the barrier to each step of the process: writing the postmortem, distributing it effectively to others, and discussing its outcome collaboratively with your peers, as Egan explained:

For writing the postmortem, lowering the barrier means having a series of templates for various incident types that keeps the complexity down when possible. You probably don't need a 10-page postmortem for a small incident, and if that's your only template, odds are the responders won't fill out anything at all, which is infinitely worse than even a short sentence explaining the learnings.
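As a minimal sketch of what "a series of templates for various incident types" could look like in practice, the Python snippet below maps an incident severity to a template of matching length. The severity names ("minor", "major"), the section titles, and the helper functions are all hypothetical and chosen only for illustration; they do not come from Egan's talk or from any particular tool.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class PostmortemTemplate:
    """A named list of sections that responders are asked to fill in."""
    name: str
    sections: Tuple[str, ...]


# Hypothetical severity tiers and section names, for illustration only.
TEMPLATES: Dict[str, PostmortemTemplate] = {
    "minor": PostmortemTemplate("minor", ("Summary", "Learnings")),
    "major": PostmortemTemplate(
        "major",
        ("Summary", "Timeline", "Root cause", "Learnings", "Follow-up tasks"),
    ),
}


def template_for(severity: str) -> PostmortemTemplate:
    """Pick a template by severity.

    Unknown severities fall back to the shortest template, so responders
    always have a low-effort option instead of writing nothing.
    """
    return TEMPLATES.get(severity, TEMPLATES["minor"])


def render(template: PostmortemTemplate, title: str) -> str:
    """Render an empty postmortem document with one heading per section."""
    lines = [f"Postmortem: {title}", ""]
    for section in template.sections:
        lines.append(f"{section}:")
        lines.append("")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render(template_for("minor"), "Cache outage"))
```

The design choice here is the fallback: when in doubt, the sketch hands out the two-section template, which mirrors the point that a short note on the learnings is better than no postmortem at all.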

Egan suggested two steps companies can take for distribution and discussion of incident learnings:

1) Make sure everyone in the company has access to the postmortems as they're written, rather than putting them in a password-protected folder or distributing them to a small email list.

2) Make sure you have incident review meetings where the postmortem can be reviewed quickly in a blame-free setting.

Egan mentioned that learning comes in two flavors: expected learning and serendipitous learning. Expected learning is often captured in follow-up tasks and comes from responders reflecting on the incident by participating in the postmortem process. Serendipitous learning is the learning that others within the organization may gain from a postmortem in ways its writer didn't expect or anticipate.

According to Egan, serendipitous learning is something that not only companies but the whole software industry can benefit from:

An engineer in another department of a large company may change their own process for the better after reading about an incident written by someone in an unrelated department that didn’t directly impact them at the time. This is where distribution comes into play. At the extreme end of this, which I’m hoping we’re trending towards as an industry overall, is making postmortems public to get the maximum downstream learning impacts across the entire industry and not just within a single company.

InfoQ interviewed John Egan about learning from the airline industry, improving incident reporting, and simplifying postmortems.

InfoQ: You spoke about how the airline industry deals with incidents. What can we learn from this?

John Egan: The airline industry really is the crucible of resilience engineering and has been since the 1950s. This industry has been working to build best practices in an environment where you can't fudge the numbers and where failure is more often than not life-threatening. This means the airline industry has been able to focus on the long-term impacts of incident management in a very diligent manner, and the outcomes have been overwhelmingly positive.

Most recently, in the 1990s, the airline industry realized that while it had done a fantastic job of creating a learning cycle and process around incident response, the process wasn't being triggered often enough to have its maximum effect. For various reasons, pilots and technicians were shying away from filing incidents when they could avoid it. In the wake of a series of accidents, a concerted effort was spun up that made the count of incidents filed a major measured metric, and this has helped make the industry as a whole one of the safest in the world.

I think we can learn from both parts of history here when building resilient companies: the need for an impactful process and the need to maximize its use.

InfoQ: What can be done to lower the barrier to incident reporting?

Egan: Lowering the barrier is all about easy-to-access tooling and cultural change. Companies need their incident management toolset to be accessible to and usable by the whole company so it feels like any other day-to-day tool instead of something they only see when there’s a catastrophic situation.

In my experience, normalizing the use of the tool so that everyone feels comfortable with it will get you most of the way there as a company. Once in place, it’s important to incentivize incident creation and celebrate those who file, respond to, and distribute learnings for incidents the same way you would celebrate project progress and completion.

Another learning from the airline industry is how important it is that everyone believes the incident management process exists to make their company better, not to gather evidence for punitive measures or penalties. Managers all the way up to executives can take deliberate and visible steps to show that the information gathered via incident reporting is used productively: they can reference specific incidents for their positive impact and make sure incidents are not associated with negative outcomes for their employees' careers.

InfoQ: What suggestions do you have for simplifying the way that postmortems are done?

Egan: If you’re just getting started, set the bar as low as possible to encourage postmortem creation and set the distribution as high as possible. No one wants to put time into writing something that isn’t going to be read.

It’s easier to iterate forward from incomplete postmortems than it is to iterate forward from nonexistent ones. The easiest way to do this is to adopt a tool like Kintaba across the company so you’re not having to manually run the incident process by stitching a bunch of other tools together yourself.
