In a recent blog post, Sidu Ponnappa argued that Mean Time To Recovery (MTTR) should be a key business metric for measuring engineering efficiency, noting that tracking uptime alone provides no goals to target for improvement. However, in a talk at SREcon22, Courtney Nash, senior research analyst at Verica, countered that MTTR can misrepresent what actually happens during incidents and can be an unreliable metric.
As Ponnappa explains, MTTR can help bridge the communication gap between business and engineering. When each team reports MTTR for the services it owns, the metric acts as a proxy for quality and ownership. According to Ponnappa, "by owning and reporting MTTR, teams have no choice but to be accountable for the reliability of the code they write."
He notes that to improve MTTR, teams must take ownership of proper knowledge practices, improve their quality control, and emphasize communication both within and outside the team. However, as Nash explains, MTTR may not be a reliable metric to draw conclusions from.
The Verica Open Incident Database (VOID) collects public incident reports and analyzes the data against a number of key metrics, including time to resolve. Nash notes that in all cases, time-to-recovery data is not normally distributed and is instead positively skewed: clustered toward the left side of the distribution, with a long tail of slower recoveries to the right. Nash explains:
This is the secret problem with MTTR. If you don't have a normal distribution of your data then central tendency measures, like mean, median, and yes the mode, don't represent your data accurately. So when you think you're saying something about the reliability of your systems, you are using an unreliable metric.
A count of incidents by duration for Google displaying a positively skewed distribution (credit: Verica)
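Nash's point about skew can be made concrete with a small simulation. The sketch below is purely illustrative: it draws hypothetical incident durations from a lognormal distribution (a common stand-in for positively skewed duration data) and compares the mean against the median and the 95th percentile.

```python
import random
import statistics

random.seed(42)

# Hypothetical incident durations in minutes, drawn from a lognormal
# distribution as a stand-in for positively skewed time-to-recovery data.
durations = [random.lognormvariate(mu=3.0, sigma=1.0) for _ in range(500)]

mean = statistics.mean(durations)
median = statistics.median(durations)
p95 = statistics.quantiles(durations, n=20)[-1]  # 95th percentile

print(f"mean:   {mean:6.1f} min")
print(f"median: {median:6.1f} min")
print(f"p95:    {p95:6.1f} min")

# With a positive skew the mean sits well above the median, so most
# incidents resolve faster than the "mean time to recovery" suggests.
faster_than_mean = sum(d < mean for d in durations) / len(durations)
print(f"{faster_than_mean:.0%} of incidents resolved faster than the mean")
```

In a sample like this, roughly two-thirds of incidents finish below the mean, which is Nash's point: a central-tendency figure describes almost none of the individual incidents.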
John Allspaw, principal at Adaptive Capacity Labs, LLC, agrees with Nash that MTTR does not capture how unique individual incidents can be. He notes that two incidents of the same duration can involve very different challenges and levels of uncertainty. Allspaw believes metrics like MTTR and incident counts fall into a category of "shallow data":
Using this data as if they were bellwether indicators of how individuals and teams perform under real-world uncertainty and ambiguity of these incidents 1) ignores the experience of the people involved, and 2) demeans the real substance of how these events unfold.
Alex Hidalgo, principal reliability advocate at Nobl9, agrees that MTTX data can be misleading when applied to incidents. He does note that time-to-failure measures for hardware components can be useful, as they lack the human element that is part of incident response. According to Hidalgo, there are three main challenges with "shallow data":
- Incidents are unique.
- Using averages for your math is fallible (see the sketch after this list).
- As with counting incidents or defining severity levels, defining what any of this even means in the first place can be very subjective.
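A minimal, entirely hypothetical illustration of the second point: two teams can report identical MTTR values while their incident experience looks nothing alike.

```python
import statistics

# Hypothetical incident durations in minutes for two teams over a quarter.
# Team A: many short, routine incidents. Team B: two quick blips and one
# long, painful outage.
team_a = [10, 12, 15, 18, 20, 22, 25, 30, 33, 35]
team_b = [5, 7, 54]

for name, durations in (("Team A", team_a), ("Team B", team_b)):
    mttr = statistics.mean(durations)
    print(f"{name}: {len(durations)} incidents, MTTR = {mttr:.0f} min, "
          f"longest = {max(durations)} min")

# Both teams report an MTTR of 22 minutes, yet incident count, worst-case
# duration, and presumably the underlying causes are entirely different.
```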
However, Vanessa Huerta Granda, solutions engineer at Jeli.io, feels that MTTR and incident count can be effective metrics when used as a starting point for understanding complex systems. Huerta Granda shares a process that encourages further learning from incidents: capturing data by analyzing incidents with a focus on learning from them; reviewing "leadership-friendly" metrics (e.g. MTTR, incident count) for further context; and presenting the findings in an easy-to-consume format.
MTTR is one of the four key metrics of software delivery performance that Dr. Nicole Forsgren, Jez Humble, and Gene Kim describe in their book Accelerate. They note that MTTR, together with deployment frequency, lead time for change, and change fail percentage, is a good classifier of software delivery performance. In addition, MTTR is highly correlated with the use of version control and monitoring within an organization.
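As a rough illustration of how these four measures can be derived, the sketch below computes them from a handful of hypothetical deployment and incident records. The record format and numbers are invented for this example and are not taken from Accelerate or the DORA research.

```python
from datetime import datetime, timedelta

# Hypothetical deployment records: when each change was committed, when it
# was deployed, and whether it caused a failure in production.
deployments = [
    {"committed": datetime(2022, 9, 1, 9),  "deployed": datetime(2022, 9, 1, 15), "failed": False},
    {"committed": datetime(2022, 9, 2, 10), "deployed": datetime(2022, 9, 3, 11), "failed": True},
    {"committed": datetime(2022, 9, 5, 8),  "deployed": datetime(2022, 9, 5, 12), "failed": False},
    {"committed": datetime(2022, 9, 6, 9),  "deployed": datetime(2022, 9, 6, 10), "failed": False},
]

# Hypothetical incidents: when service was impaired and when it was restored.
incidents = [
    {"start": datetime(2022, 9, 3, 11, 30), "restored": datetime(2022, 9, 3, 13, 0)},
]

period_days = 30

# Deployment frequency: deployments per day over the period.
deployment_frequency = len(deployments) / period_days

# Lead time for change: average time from commit to deployment.
lead_times = [d["deployed"] - d["committed"] for d in deployments]
lead_time = sum(lead_times, timedelta()) / len(lead_times)

# Change fail percentage: share of deployments that caused a failure.
change_fail_percentage = sum(d["failed"] for d in deployments) / len(deployments)

# Time to restore service (MTTR): average time from impairment to restoration.
restore_times = [i["restored"] - i["start"] for i in incidents]
mttr = sum(restore_times, timedelta()) / len(restore_times)

print(f"deployment frequency:   {deployment_frequency:.2f}/day")
print(f"lead time for change:   {lead_time}")
print(f"change fail percentage: {change_fail_percentage:.0%}")
print(f"time to restore (MTTR): {mttr}")
```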
In a Twitter conversation with Allspaw, Forsgren argued that tracking MTTR and incident counts and performing qualitative incident analysis are not mutually exclusive:
[W]e should be doing BOTH tracking MTTR/incident #s, AND doing good qual work & retros. [L]ike, we won't know if we're improving our hard (or easy!) incidents if we don't have the data to track it.
Nash agrees that MTTR can serve as a starting point from which to present the deeper insights, themes, and outcomes of detailed qualitative incident analysis. This includes moving towards SLOs and cost-of-coordination data. Hidalgo also recommends using SLOs and their associated error budgets to better indicate how much unreliability users are experiencing.
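A minimal sketch of the error-budget idea Hidalgo refers to: given an availability SLO and counts of good and total requests, the remaining error budget shows how much unreliability users have actually absorbed. The function name and numbers below are hypothetical, chosen only to illustrate the calculation.

```python
def error_budget_status(slo_target: float, good_events: int, total_events: int) -> dict:
    """Report how much of an SLO's error budget has been consumed.

    slo_target: fraction of events that must be good, e.g. 0.999 for "three nines".
    """
    allowed_bad = (1 - slo_target) * total_events   # error budget, in events
    actual_bad = total_events - good_events
    consumed = actual_bad / allowed_bad if allowed_bad else float("inf")
    return {
        "allowed_bad_events": allowed_bad,
        "actual_bad_events": actual_bad,
        "budget_consumed": consumed,  # 1.0 means the budget is exhausted
    }


# Hypothetical month: 10 million requests against a 99.9% availability SLO.
status = error_budget_status(slo_target=0.999, good_events=9_993_500, total_events=10_000_000)
print(status)
# 10,000 bad requests were allowed; 6,500 occurred, so 65% of the budget is
# spent -- a user-centred view of unreliability that a single MTTR cannot give.
```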