
Moving Past Simple Incident Metrics: Courtney Nash on the VOID


Key Takeaways

  • Popular incident metrics such as mean time to recovery (MTTR) can be classified as "gray data" as they are high in variability but low in fidelity.
  • Simplified incident metrics should be replaced, or at least augmented, with socio-technical incident data, SLOs, customer feedback, and incident reviews.
  • Transitioning to new forms of reporting can be hard, but Nash recommends "putting some spinach in your fruit smoothie". Start presenting those data along with the expected/traditional data, and use this as a way to start the conversation about the benefits of collecting and learning from them.
  • The latest VOID report found no correlation between the length of an incident and the severity of that incident.
  • Near-miss investigations can be a valuable source of information for improving how an organization handles incidents.

The Verica Open Incident Database (VOID) is assembling publicly available software-related incident reports. The goal is to make these freely available in a single place, mirroring what other industries, such as aviation, have done in the past. In the past year, the number of incident reports in The VOID has grown by 400%, which has allowed the team to better confirm the findings and patterns first reported in its 2021 report.

The 2022 edition of The VOID report leveraged this influx of reports to "investigate whether there is a relationship between the reported length of the incident and the impact (or severity) of the incident."

The 2022 report confirmed conclusions drawn in the 2021 report and presented some new findings. In particular, it continued to highlight the risks of simplified metrics such as incident length or MTTR. John Allspaw, principal at Adaptive Capacity Labs, LLC, calls data that underrepresents the uniqueness of incidents "shallow data".

The report labels duration as a type of "gray data" in that it tends to be high in variability but low in fidelity. In addition, how organizations define and collect this measure can differ greatly.

The additional data in the report reinforced the notion that incident duration data is positively skewed, with most incidents resolving in under two hours.

This positive skew in the data means measures like mean time to recovery (MTTR) are not reliable. As Courtney Nash, Senior Research Analyst at Verica, describes:

This is the secret problem with MTTR. If you don't have a normal distribution of your data then central tendency measures, like mean, median, and yes the mode, don't represent your data accurately. So when you think you're saying something about the reliability of your systems, you are using an unreliable metric.
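
To see why a skewed distribution undermines central-tendency measures, here is a minimal sketch; the lognormal shape and its parameters are assumptions for illustration, not VOID data.

```python
import random
import statistics

random.seed(42)

# Simulate positively skewed incident durations (minutes): most incidents
# resolve quickly, a few drag on for many hours. The lognormal shape and
# parameters are illustrative assumptions, not VOID data.
durations = [random.lognormvariate(mu=3.5, sigma=1.2) for _ in range(500)]

mean = statistics.mean(durations)
median = statistics.median(durations)
p95 = statistics.quantiles(durations, n=20)[-1]  # 95th percentile

print(f"mean (MTTR-style) : {mean:7.1f} min")
print(f"median            : {median:7.1f} min")
print(f"95th percentile   : {p95:7.1f} min")
# The mean lands well above the median, so it describes neither the typical
# incident nor the long tail that actually hurts.
```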

The report makes a number of recommendations on what to replace MTTR with but it also cautions against trying to replace one simplified metric with another.

We never should have used a single averaged number to try to measure or represent the reliability of complex sociotechnical systems. No matter what your (unreliable) MTTR might seem to indicate, you’d still need to investigate your incidents to understand what is truly happening with your systems.

Recommendations include SLOs and customer feedback, sociotechnical incident data, post-incident review data, and studying near misses. Sociotechnical incident data covers the entire system involved in an incident, including the code, the infrastructure, and the humans who build and maintain them. The report highlights that common incident analysis focuses only on technical data and ignores the human aspects of incidents.
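
As an illustration of the SLO-style signals the report points to, here is a minimal sketch of an error-budget check; the 99.9% target and the request counts are hypothetical.

```python
# Hypothetical error-budget check: measure reliability against an SLO
# rather than averaging incident durations. All numbers are made up.
SLO_TARGET = 0.999            # 99.9% of requests should succeed
total_requests = 12_500_000   # requests served this period
failed_requests = 9_200       # user-visible failures this period

error_budget = (1 - SLO_TARGET) * total_requests  # failures we can "afford"
budget_consumed = failed_requests / error_budget

print(f"Error budget    : {error_budget:,.0f} failed requests")
print(f"Budget consumed : {budget_consumed:.1%}")
# A burn rate like this says more about user-visible reliability than
# any mean of incident durations would.
```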

Within the report, the work of Dr. Laura Maguire, Head of Research at Jeli.io, is recommended as a source of sociotechnical data, particularly her concept of Costs of Coordination. Maguire recommends tracking data such as how many people and teams were involved, at what level of the organization they sat, how many chat channels were opened, and which tools were used to communicate. This data can help surface the "hidden costs of coordination" that add to the cognitive demands placed on people responding to incidents.

The report calls out that "until you start collecting this kind of information, you won’t know how your organization actually responds to incidents (as opposed to how you may believe it does)."
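
What collecting that kind of information can look like in practice is sketched below: a minimal, hypothetical incident record that captures coordination costs alongside the usual technical fields. The field names are illustrative, not a schema from the report or from Jeli.io.

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class IncidentRecord:
    """Illustrative sociotechnical incident record (hypothetical schema)."""
    incident_id: str
    started_at: datetime
    resolved_at: datetime
    # Technical context
    services_affected: list[str] = field(default_factory=list)
    # Coordination costs, along the lines Maguire recommends tracking
    responders: list[str] = field(default_factory=list)      # who was hands-on
    teams_involved: list[str] = field(default_factory=list)  # unique teams
    chat_channels_opened: int = 0
    communication_tools: list[str] = field(default_factory=list)
    concurrent_incidents: int = 0
    escalated_to_leadership: bool = False

record = IncidentRecord(
    incident_id="2022-11-03-payments",
    started_at=datetime(2022, 11, 3, 14, 5),
    resolved_at=datetime(2022, 11, 3, 16, 40),
    services_affected=["payments-api"],
    responders=["alice", "bob", "carol"],
    teams_involved=["payments", "platform", "support"],
    chat_channels_opened=2,
    communication_tools=["Slack", "Zoom"],
    concurrent_incidents=1,
)
print(len(record.responders), "responders across", len(record.teams_involved), "teams")
```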

However, Vanessa Huerta Granda, Solutions Engineer at Jeli.io, feels that we shouldn't abandon shallow data metrics just yet. Huerta Granda notes that these metrics, such as MTTR or incident count, can be effective starting points for understanding complex systems. The risk is that the analysis stops at those metrics and never moves on to a deeper investigation.

One of the new findings in the report is that no correlation was detected between incident duration and incident severity. The report highlights that severity suffers from the same issues as duration and falls into the category of gray data. Severity can be highly subjective, as many organizations use it as a way to draw more attention to an incident. The report concludes that

[C]ompanies can have long or short incidents that are very minor, existentially critical, and nearly every combination in between. Not only can duration not tell a team how reliable or effective they are, but it also doesn’t convey anything useful about the event's impact or the effort required to deal with the incident.
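
A minimal sketch of the kind of check behind such a finding appears below, assuming per-incident durations and an ordinal severity level; the data here is made up, and the VOID's actual analysis is more involved.

```python
from scipy.stats import spearmanr

# Made-up incidents: (duration in minutes, severity level, 1 = most severe)
incidents = [
    (22, 3), (340, 1), (15, 1), (95, 2), (480, 4),
    (60, 2), (12, 4), (200, 3), (35, 1), (720, 2),
]

durations = [d for d, _ in incidents]
severities = [s for _, s in incidents]

# Spearman rank correlation: robust to skewed durations and ordinal severities.
rho, p_value = spearmanr(durations, severities)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.2f}")
# A rho near zero (as the VOID found across real reports) means duration
# tells you essentially nothing about how severe an incident was.
```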

InfoQ sat down with Nash to discuss the findings and learnings from the 2022 VOID Report in more detail.

InfoQ: The report highlights the challenges with simplified metrics like MTTR. However, metrics like MTTR continue to remain popular and widely used. What will it take for organizations to move past "shallow data" toward more qualitative incident analysis?

Courtney Nash: Moving away from MTTR isn’t just swapping one metric for another; it’s a mindset shift. Much the way the early DevOps movement was as much about changing culture as technology, organizations that embrace data-driven decisions and empower people to enact change when and where necessary will be able to reckon with a metric that isn’t useful and adapt. The key is to think about the humans in your systems and capitalize on what your organization can learn from them about how those systems really function.

While MTTR isn’t useful as a metric, no one wants their incidents to go on any longer than they have to. In order to respond better, companies need to first study how they’ve responded in the past with more in-depth analysis, which will teach them about a host of previously unforeseen factors, both technical and organizational. They can collect things like the number of people involved hands-on in an incident; how many unique teams were involved; which tools people used; how many chat channels; if there were concurrent incidents.

As an organization gets better at conducting incident reviews and learning from them, they’ll start to see traction in things like the number of people attending post-incident review meetings, increased reading and sharing of post-incident reports, and using those reports in things like code reviews, code comments and commits, along with employee training and onboarding.

InfoQ: Those are all excellent measures for companies to be moving towards. However, I can see some organizations struggling with understanding how tracking, for example, the number of people involved in an incident will lead to fewer, shorter incidents. Do you have any suggestions for how teams can begin to sell these metrics internally?

Nash: It is indeed difficult to sell something new without an example of its benefits. However, if a team is already working on analyzing their incidents with an emphasis on learning from them (versus just assigning a root cause and some action items), they should be either already collecting these data along the way or able to tweak their process to start collecting them.

One option then is akin to putting some spinach in your fruit smoothie: start presenting those data along with the expected/traditional data, and use them as a way to start the conversation about the benefits of collecting and learning from them. Vanessa Huerta Granda has an excellent post with some further suggestions on how to share these kinds of "new" data alongside the traditionally expected metrics.

Another approach, albeit a less ideal one, is to use a large and/or painful incident as an impetus for change and to open the discussion about different approaches.

In a talk at the DevOps Enterprise Summit in 2022, David Leigh spoke about how his team had been conducting post-incident learning reviews on the sly, and then used a major outage as the impetus to pitch their new approach to upper management. It turned out to be a resounding success, and now the IBM office of the CIO conducts monthly Learning From Incident meetings that draw nearly 100 attendees and hundreds of views of the meeting recordings and incident reports.

InfoQ: Both this year's and last year's reports highlight the value of near misses as a source of learning. However, the report calls out the challenges in understanding and identifying near misses. How do you recommend organizations start growing the practice of near-miss analysis?

Nash: Most of the examples of learning from near misses come from other industries, such as aviation and the military. Given the large differences in organizational goals, processes, and outcomes, I’d caution against trying to glean too much from their implementations. However, I do like this perspective from Chuck Pettinger, a process change expert focused on industrial safety: "If we are to obtain quality near misses and begin to forecast where our next incident might occur, we need to make it easy to report." Report may be a loaded word in our industry, so instead focus on making it easy for operators and people at the sharp end of your systems to note near misses. This could be in your ticketing system, a GitHub Gist, or some other lightweight mechanism like a Slack channel.

It’s also worth emphasizing that noting a near miss doesn't require it be investigated, lest that reduce people’s inclination to note them in the first place. As you collect these, ideally patterns will emerge that help the team(s) figure out when to dig deeper into investigating one or more of them.

Lastly, I strongly suggest being wary of formal reporting/classification systems for near misses. These risk falling into the same patterns of shallow data collection that plague more formal incident reporting (e.g. MTTR, count of severity types, total incident count, etc).
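
As one concrete form of the lightweight mechanism Nash describes, a near-miss note can be as small as a message posted to a chat channel. The sketch below assumes a Slack incoming webhook (the URL is a placeholder) and the standard webhook payload.

```python
import requests

# Hypothetical Slack incoming webhook for a #near-misses channel.
WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"

def note_near_miss(summary: str, noticed_by: str) -> None:
    """Post a lightweight near-miss note; no investigation required."""
    payload = {"text": f":warning: Near miss noted by {noticed_by}: {summary}"}
    response = requests.post(WEBHOOK_URL, json=payload, timeout=5)
    response.raise_for_status()

note_near_miss(
    summary="Deploy nearly shipped with the feature flag defaulted on; "
            "caught during canary review.",
    noticed_by="alice",
)
```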

InfoQ: This year's report sees a drastic decrease in the number of companies using Root Cause Analysis or reporting a singular root cause. The report also notes that Microsoft Azure has moved to a more positive contributing factors analysis. Are you seeing this trend in other companies? What are they replacing root cause analysis with?

Nash: Before getting to what could work, I did want to clarify that what primarily drove the decrease in RCA occurrences was a dramatic increase in the number of incident reports in the VOID, most of which didn’t use RCA or assign formal root causes, so the drop was largely a matter of volume.

That said, it’s currently challenging to spot a trend among other companies, given that only a small percentage of reports in the VOID use RCA. We’re tracking RCA because we know from research that it tends to focus on the humans closest to the event and rarely identifies systemic factors or improvements to them. What was noteworthy about the shift the Azure team made was how they both broadened and deepened their investigations and acknowledged that each incident resulted from a unique combination of contributing factors that were typically not possible to predict.

This shift from root cause analysis to identifying contributing factors creates an environment that is less likely to place blame on individual humans involved, and encourages the team to consider things like gaps in knowledge, misaligned incentives or goal conflicts, and business or production pressures as incident contributors along with technical factors like bad config changes, bugs in code, or unexpected traffic surges.

InfoQ: What trends are you expecting to see in the upcoming year? Can you make any predictions on what the major finding of next year's report will be?

Nash: Those of us in tech have long relied on Twitter to help detect/disseminate trends, and ironically Twitter itself has recently brought the notion of sociotechnical systems to the cultural mainstream.

While many (myself included) predicted early on that Twitter the product might simply fall over and stop working, what we’re seeing instead is the broader system degrading in strange and often unexpected ways. The loss of product and organizational knowledge combined with the incessant unpredictability of new product features and policy edicts have caused advertisers and users to flee the service. Now, people watching and reporting on this are focusing as much on the people as the technical systems.

This is a trend that I hope continues into 2023 and beyond - not the potential implosion of Twitter, but rather the collective realization that people with hard-earned expertise are required to keep systems up and running.

As for next year’s report, I’ll say that predicting outcomes of complex systems is notoriously difficult. However, we do plan to dig into some new areas that I expect to be very fruitful. We’ve repeatedly said that language matters - the way we talk about failures shapes how we view them, investigate them, and learn from them (or fail to). We suspect there are patterns in the data, both technical and linguistic, that might help us cluster or group different approaches to incident analysis.

We’re also curious if we can start to detect incidences of past incident fixes that led to future incidents.
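
Purely as a speculative illustration of the linguistic pattern-finding Nash describes, clustering report text is one possible starting point; the sketch below uses TF-IDF and k-means, and the sample snippets and cluster count are placeholders.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder stand-ins for incident report text.
reports = [
    "root cause was a misconfigured load balancer; action items assigned",
    "contributing factors included an expired certificate and alert fatigue",
    "single root cause identified: bad config change rolled back",
    "multiple contributing factors: goal conflicts and a traffic surge",
]

# Represent each report by its word usage, then group similar language.
vectors = TfidfVectorizer(stop_words="english").fit_transform(reports)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

for label, text in zip(labels, reports):
    print(label, text[:60])
# Reports that lean on "root cause" language may cluster apart from those
# framed around contributing factors.
```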

Readers interested in contributing to The VOID are encouraged to share their incidents. The first step is to analyze and write up the incident. If your organization has a wealth of incident reports to submit, feel free to contact The VOID team directly for assistance. 
