Key Takeaways
- If you don’t have good monitoring coverage and/or automatic alerting in place, MTTD is a good metric to start with.
- Never use MTTR without clarifying what the "R" stands for (there are at least 6 ways to define it).
- Depending on which area you want to focus your attention on, a different version of MTTR may be needed. If the wrong metric is chosen, the signal you’re trying to optimize may get lost in the noise of a multivariable equation.
- While MTTResolve focuses on the productivity cost, MTTRemedy focuses on customer-facing risk.
- Although the "M" in MTT* stands for "mean" (average), a good way to find anomalies and focus the optimization is to use percentiles.
An incident is an unplanned disruption or degradation of service that negatively impacts customers. Several important time metrics are used by Site Reliability Engineers (SREs) and DevOps teams to optimize incident response. This article introduces those metrics, when to use them, and some of their pitfalls, as well as further reading resources.
Dissecting the incident lifecycle
Let’s see what happens during an incident:
Symptoms show up
Something breaks and causes a customer-facing issue (e.g. the site goes down, but there are many other possibilities). Later we’ll discuss how to group incidents based on their customer impact, but for now, assume that some customers are affected by the incident.
Ideally there is a monitoring tool in place which observes the symptoms and triggers an alert for the on-call personnel to mitigate the issue. However, that is not always the case, for example when the product is new and its SLOs are not well defined. Another pitfall occurs when there is no operations team or experience with observability, and the team resorts to graph watching (polling) instead of an automatic alert system. The symptoms may go unnoticed until the customers complain and/or reach out to support.
With automatic monitoring and alerting tools in place, there is usually a grace period before the alerts trigger. Without a grace period, every tiny symptom would wake up the person on-call even if it would resolve itself in a few minutes (for example when a spike in load temporarily increases the error rate).
The grace period can be anything from a few minutes to hours. Too long and you’ll have unhappy customers, too short and you’ll have unhappy on-call personnel. It is important to find a balance between the business risk appetite and team morale.
It is best practice to set alerts on symptoms, not causes. For example, it might be tempting and easier to set an alert for CPU utilization above 90%, but the system may well handle a load spike without any issue. Alerting on the error rate, however, captures a symptom that matters to the customers. By focusing on symptoms, we make sure that if the on-call person is paged, they have a customer-tangible incident in hand instead of some alert which may or may not cause any customer-facing issue.
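To make the grace period concrete, here is a minimal Python sketch (the `should_alert` function, sample format, threshold and window length are all hypothetical, not any particular tool’s API) that fires only if a symptom, the error rate in this case, stays above a threshold for the entire grace window:

```python
from datetime import datetime, timedelta

# Hypothetical input: (timestamp, error_rate) samples from the monitoring system.
Sample = tuple[datetime, float]

def should_alert(samples: list[Sample], threshold: float,
                 grace: timedelta, now: datetime) -> bool:
    """Fire only if the error rate stayed above the threshold for the
    entire grace window ending at `now`."""
    window = [rate for ts, rate in samples if now - grace <= ts <= now]
    # An empty window means we cannot confirm a sustained symptom.
    return bool(window) and all(rate > threshold for rate in window)

# A 2-minute error spike that resolves itself does not page anyone
# when the grace period is 10 minutes.
now = datetime(2022, 1, 1, 3, 0)
samples = [(now - timedelta(minutes=m), 0.20 if m <= 2 else 0.01) for m in range(15)]
print(should_alert(samples, threshold=0.05, grace=timedelta(minutes=10), now=now))  # False
```

In practice this check usually lives inside the monitoring or alerting tool itself (for example as a duration clause on an alerting rule); the point here is that the condition is a symptom (error rate) rather than a cause (CPU utilization).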
Alert triggers
Now we know that something is wrong: either from customer support or from the monitoring system. An alert triggers to inform the person on-call to investigate. In an ideal world, incident mitigation starts immediately, but in reality, the alert trigger is just the first step in getting the person on-call to their computer to investigate. For example, if it’s night time, they need to get out of bed, find their phone, reevaluate their career decisions, and understand what’s going on.
If the first-line on-call responder does not acknowledge the alert within a specified time, the alert escalates to the second line.
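As a rough sketch of such an escalation rule (the names, the five-minute timeout and the logic below are illustrative, not the behavior of any particular paging tool):

```python
from datetime import datetime, timedelta
from typing import Optional

# Illustrative escalation rule: page the second line if the first line
# has not acknowledged within the acknowledgement timeout.
ACK_TIMEOUT = timedelta(minutes=5)

def who_to_page(triggered_at: datetime, acked_at: Optional[datetime],
                now: datetime) -> str:
    if acked_at is not None:
        return "nobody"     # already acknowledged, no further paging
    if now - triggered_at > ACK_TIMEOUT:
        return "second line"
    return "first line"

# 7 minutes after the trigger with no acknowledgement, the alert escalates.
print(who_to_page(datetime(2022, 1, 1, 3, 0), None, datetime(2022, 1, 1, 3, 7)))
```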
Alert acknowledged
The person on-call acknowledges that they have received the alert and starts the triage:
- Validates that the symptoms indeed indicate an incident that should be dealt with immediately. This step eliminates false alarms.
- Assigns an impact level (AKA severity level) to the incident.
- Notifies other relevant on-call persons if needed.
When starting an incident, beware of these anti-patterns:
- Getting too many people on the call and forcing them to stay
- Status-update fatigue: demanding constant updates because silence is assumed to mean no progress
- Wasting time deciding the severity level (when unsure, go with the highest) or hesitating to escalate to another responder
- Sidetracking the incident handling with policy discussions (postpone them to the post-mortem)
- Skipping the incident handling policy, post-mortem or action points
Mitigation
Now we have a prioritized incident and have someone in the room. The main chunk of the mitigation begins:
- The symptoms are analyzed to find the root cause and a solution to stop the symptoms
- Config or code may be changed, tested and deployed to production
As the team learns more about the root cause and the incident’s blast radius, the impact level may change from the original assessment made when the alert was acknowledged: if the incident has a higher impact, it may get escalated, but if it has a lower impact, it may get de-escalated.
Moreover, the incident may be delegated to another team that holds the key to solving the customer-facing impact.
Repairing the incident may involve temporary workarounds to stop the business from bleeding, at the cost of creating tech debt.
Remediation
Some time later we need to hold a post-mortem and learn from the incident. One of the key outcomes of an effective post-mortem is to assess the risk of the incident and eliminate or reduce it. The post-mortem may follow up with further actions:
- Clean up: Fix any tech debt accumulated during the incident mitigation, for example the config drift introduced by setting a value manually instead of using Infrastructure as Code (IaC).
- Remediate: Implement measures to ensure that this incident will not happen again. This may be anything from blocking private ports to a total system refactoring or re-architecture.
In some cases the resolution might be as simple as assessing that, without any further change, the incident is unlikely to happen again. Or the threat might be so unlikely that it doesn’t justify the return on investment (ROI) of a remedy. In these cases, the risk is accepted and no action is required.
Metrics
Before we dig into the metrics, keep two periods in mind:
- The period when the customers are suffering from the incident (actualized risk)
- The period when the business is susceptible to that incident (potential risk)
It is likely that a particular incident is a one-off event but it is also possible that it may happen again if it is not remediated. Here’s an illustration showing the two periods:
MTTA
MTTA stands for Mean Time To Acknowledge. It is a measurement from the moment the alert was triggered until the on-call person acknowledged it.
Measuring and optimizing MTTA ensures that there is good tooling in place and the person on-call is quick to respond.
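As an illustration, here is a minimal sketch of how MTTA could be computed from incident records, assuming each record stores the alert-trigger and acknowledgement timestamps (the `Incident` type and the data below are made up):

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean

# Hypothetical incident record holding the two timestamps MTTA needs.
@dataclass
class Incident:
    alert_triggered_at: datetime
    acknowledged_at: datetime

def mtta_minutes(incidents: list[Incident]) -> float:
    """Mean time, in minutes, from alert trigger to acknowledgement."""
    return mean(
        (i.acknowledged_at - i.alert_triggered_at).total_seconds() / 60
        for i in incidents
    )

incidents = [
    Incident(datetime(2022, 1, 1, 3, 0), datetime(2022, 1, 1, 3, 4)),
    Incident(datetime(2022, 1, 5, 14, 0), datetime(2022, 1, 5, 14, 10)),
]
print(f"MTTA: {mtta_minutes(incidents):.1f} minutes")  # 7.0
```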
When optimizing this metric, it is important to keep two things in mind:
- Correctness: ensure that the monitoring system filters out the false alarms so that when it notifies the person on-call, they take it seriously.
- Morale: alerts are toil, especially when they wake up the person on-call or interrupt their work/life balance. Alert fatigue and the high expectation to react quickly can hurt team morale and lead to employee churn, which ultimately hurts the business.
MTTD/MTTI
MTTD stands for Mean Time To Detect or Mean Time To Discover. MTTI stands for Mean Time To Identify and is just another name for MTTD.
It measures the time between when the symptoms first show up and when the on-call person acknowledges the alert. In other words, it measures from when the problem started impacting customers until it was identified by your business.
If you don’t have good monitoring coverage and/or automatic alerting in place, MTTD is a good metric to start with. The goal is to proactively monitor the system behavior (metrics and logs) to identify anomalous patterns and symptoms as opposed to relying on the customers getting in touch with the support.
Focusing on MTTD may reveal multiple problems on top of what MTTA identifies:
- Observability coverage: do you have reliable logs, metrics and traces to give a good picture about the system status and identify all types of customer impacting incidents?
- Metric quality: do the available metrics accurately represent the system status, or are they gathered from the wrong place or subject to bias? For example, as a result of poor sampling, the data may say that the system is healthy (or faulty) while the customers experience it differently.
- Alert grace period: do you wait long enough before triggering the alert? To reduce noise, the system may be observed for a period before the alert is triggered. Too short and false alarms may lead to alert fatigue; too long and the system may burn the error budget and the business may lose money for breaching the service level agreement (SLA).
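A small sketch can make this decomposition visible. Assuming each incident records when the symptoms started, when the alert triggered, and when it was acknowledged (the data below is made up), MTTD can be split into its grace and pending components to see which one dominates:

```python
from datetime import datetime
from statistics import mean

# Made-up per-incident timestamps: symptoms start, alert triggers, on-call acknowledges.
incidents = [
    (datetime(2022, 1, 2, 9, 0),  datetime(2022, 1, 2, 9, 12),  datetime(2022, 1, 2, 9, 15)),
    (datetime(2022, 1, 9, 21, 0), datetime(2022, 1, 9, 21, 25), datetime(2022, 1, 9, 21, 31)),
]

def minutes(a: datetime, b: datetime) -> float:
    return (b - a).total_seconds() / 60

print("MTTD    (symptoms -> ack):    ", mean(minutes(s, a) for s, _, a in incidents))
print("grace   (symptoms -> trigger):", mean(minutes(s, t) for s, t, _ in incidents))
print("pending (trigger -> ack):     ", mean(minutes(t, a) for _, t, a in incidents))
```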
MTTR*
Now we’re in the realm of “MTTR” metrics. You might have an idea what MTTR stands for, but most people don’t know all 6 answers. Although less than 4% of English words start with the letter “R”, several important SRE words fall into this category: recover, repair, resolve, respond, restore and remediate. Let’s unpack them and point out their differences as well as when to use each.
MTTRepair
Mean Time To Repair measures the time from the moment the alert is acknowledged until the customers can use the system again. This metric is a good complement to MTTA and MTTD if they are already in an acceptable range.
Optimizing MTTRepair narrows your focus to the main incident mitigation activities (usually manual work) and can lead to improvements in:
- Prioritization: stop the customer facing symptoms
- Preparation: improve runbooks and automate the mechanical parts when possible. Runbooks can easily drift and turn into toil but automation may require a higher initial investment that may not be worth the effort.
- Practice: use the process routinely to become second nature, for example by using fire drills (going through incident scenarios with the team)
- Mandate: ensure that the on-call person has the right access to be able to evaluate the system and mitigate the incident (e.g. to the observability data, runbooks, cloud provider, etc.)
- Approach: ensure that the on-call person has a systematic approach to the problem solving, considers all alternatives and more importantly verifies their hypothesis and reevaluates them as they learn more about the incident.
- Forensics: preserve any evidence for root cause analysis (to shorten MTTResolve & MTTRemedy).
If you’re interested, there is more information about incident management in the Google SRE book.
MTTRespond
MTTRespond measures the time from when the alert is triggered until the customer-facing symptoms are eliminated.
MTTRespond sums up the scope of MTTAck and MTTRepair. Obviously, by aggregating two independent variables (alert pending and mitigation) there is a risk of spreading the focus. On the other hand, it is one level broader than either of the two and can ensure that the tradeoffs made to optimize one do not hurt the other. For example, to reduce MTTAck, a company with offices across different time zones may aim for a “follow the sun” on-call model, where the goal is to page the on-call persons during their working hours, when they are more responsive than when woken up at night. But due to communication issues, the MTTRepair in Asia or Europe may differ from that of the headquarters in the US.
Please note that the popular incident response platform PagerDuty uses different terminology: it calls this “resolved” instead of “responded”, which may be confusing.
If you choose to optimize an aggregated metric, you may want to focus on MTTRecover, which maps better to the customer-facing symptoms.
MTTRecover/MTTRestore
MTTRecover measures the time from when the symptoms start to show up until they are eliminated for the customers. MTTRecover is sometimes called MTTRestore.
This is one of the most important metrics because it represents the time period during which the customers were experiencing the incident. However, it aggregates 3 independent variables (alert grace, alert pending and mitigation), which may spread the optimization focus. On the other hand, any effort to reduce this metric will directly have a customer-facing impact.
MTTResolve
MTTResolve is similar to MTTRecover, but it also accounts for the time it takes to remediate the system and make sure that a similar incident won’t happen again.
For example, your product may encounter a spike in load because of a sudden change in traffic due to an important event like an ad at the Super Bowl. Mitigating the incident is of course a top priority to maximize the ROI of the ad but making sure that the system won’t encounter the same issue next time it goes viral is equally important.
It is only when the risk is neutralized that the incident is truly resolved (hence the name MTTResolve). This is not the first metric to optimize for and is typically more suitable for products which have already optimized the other MTTR* metrics and are now down to optimizing the proper remedy.
MTTResolve can be tricky to calculate because the time spent on remediation is often hard to track and requires manually keeping tabs. MTTRemedy is easier to calculate.
Not all incidents require remediation. It is possible that the root cause is not likely to happen again, or that it is so unlikely or expensive to remediate that the business accepts the risk of it happening again and mitigating it when it does.
MTTRemedy
MTTRemedy measures the time from the moment a risk materializes into symptoms until the risk is eliminated from the system.
For example, if your system was hacked due to the recent log4j zero-day vulnerability, the incident may be mitigated, but the system remains potentially vulnerable until the code is refactored, tested, and deployed to production.
Unlike MTTResolve, which is more concerned with measuring the actual work spent on an incident, MTTRemedy is concerned with the time the system was vulnerable. Where MTTResolve ignores the slack time wasted between the incident mitigation and the start of remediation, optimizing for MTTRemedy encourages starting the remediation as soon as possible.
As a side benefit, MTTRemedy is easier to calculate because it is easier to pin down when the risk was neutralized than to track the time spent on remediation.
Other concerns
Mean vs median
MTT stands for “Mean Time To”. The mean, also known as the average, works well when the data has a bell-curve distribution. However, the median can give a better picture and focus the optimization efforts more accurately. As SREs we often use percentiles to spot the outliers; the median is simply the 50th percentile. Depending on the incident data and the optimization maturity, you may go for “p50 time to acknowledge” instead of “mean time to acknowledge”, for example. This will allow you to focus on the outliers. For example, here is the MTTAck for a hypothetical service for the last 20 incidents:
As you can see, sometimes it took a very long time for the on-call person to acknowledge the alert. It looks like the optimization effort should start by focusing on those outliers. But the mean MTTAck for this dataset is less than 10 minutes, which might be acceptable. Focusing on the percentile tells a different story:
To calculate the percentile, the data is sorted in ascending order, putting the outliers at the end. Now it’s a matter of choosing a threshold to direct our focus. In the dataset above, if we put that threshold at the 90th percentile (P90), the outliers clearly stand out.
Working with percentiles may be a bit complicated and you may have to do some educating to establish the concept, but when trying to spot anomalies, it’ll come in handy.
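Here is a minimal sketch using Python’s standard library (the acknowledgement times below are made up and unrelated to this article’s figures) showing how the mean can hide the tail that P90 exposes:

```python
from statistics import mean, quantiles

# Made-up acknowledgement times (minutes) for 20 incidents:
# mostly quick, with a couple of painful outliers.
ack_minutes = [2, 3, 3, 4, 4, 5, 5, 5, 6, 6, 7, 7, 8, 8, 9, 10, 12, 15, 35, 40]

p = quantiles(ack_minutes, n=100, method="inclusive")  # p[i-1] is the i-th percentile
print(f"mean: {mean(ack_minutes):.1f} min")  # pulled up by the outliers, yet looks OK
print(f"p50:  {p[49]:.1f} min")              # the typical incident
print(f"p90:  {p[89]:.1f} min")              # the tail worth investigating
```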
Look back window
MTTR* is calculated over a look-back window, for example a week, month, or quarter. When optimizing MTTR*, one can compare it to a previous look-back window to see the difference. For example, here is the MTTRecover for a hypothetical service which was recently optimized:
In this example, we can see that although the alert grace and the pending-for-acknowledgement periods haven’t changed significantly, the effort put into the repair phase reduced the MTTRecover significantly compared to last month and last quarter.
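One possible way to compute such a comparison, sketched with made-up incident data (unrelated to this article’s figures), is to group incidents by the look-back window in which their symptoms started and average the recovery time per window:

```python
from datetime import datetime
from statistics import mean

# Made-up records: (symptoms_started, symptoms_eliminated) per incident.
incidents = [
    (datetime(2022, 2, 3, 10, 0),  datetime(2022, 2, 3, 12, 0)),
    (datetime(2022, 2, 20, 8, 0),  datetime(2022, 2, 20, 9, 30)),
    (datetime(2022, 3, 7, 22, 0),  datetime(2022, 3, 7, 22, 45)),
    (datetime(2022, 3, 25, 1, 0),  datetime(2022, 3, 25, 1, 40)),
]

def mtt_recover_hours(start: datetime, end: datetime) -> float:
    """MTTRecover (hours) for incidents whose symptoms started in [start, end)."""
    durations = [
        (fixed - began).total_seconds() / 3600
        for began, fixed in incidents
        if start <= began < end
    ]
    return mean(durations) if durations else float("nan")

# Compare the current look-back window with the previous one.
print("February:", mtt_recover_hours(datetime(2022, 2, 1), datetime(2022, 3, 1)))  # 1.75
print("March:   ", mtt_recover_hours(datetime(2022, 3, 1), datetime(2022, 4, 1)))  # ~0.71
```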
Impact levels
Not all incidents are created equal. Depending on the impact, they may carry a different sense of urgency. For example:
- Level 1: impacts all the customers (e.g. the website is down)
- Level 2: impacts some of the customers (e.g. the iOS app doesn’t work but the web and Android apps do)
- Level 3: impacts a small group of customers (e.g. an older iOS app doesn’t work, or the website is broken on the deprecated IE6)
- Level 4: impacts occasional customers (e.g. the error rate is high in one service but most clients have a retry mechanism in place)
Depending on the impact level, the attention and energy put into the incident may vary. While level 1 and 2 incidents may demand waking the on-call person up in the middle of the night, level 4 incidents may just lead to creating a ticket to deal with during working hours.
When measuring and optimizing a metric it is important to distinguish between different impact levels to focus the energy on the most impactful type of incident. Trying to reduce MTTResolve regardless of the impact level is a good way to waste resources.
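A sketch of such a per-severity breakdown, again with made-up numbers (unrelated to the figures in this article):

```python
from collections import defaultdict
from statistics import mean

# Made-up (severity, minutes_to_recover) pairs for one quarter's incidents.
records = [
    ("SEV1", 45), ("SEV1", 80),
    ("SEV2", 120), ("SEV2", 95), ("SEV2", 60),
    ("SEV3", 300),
    ("SEV4", 1440),
]

by_severity: dict[str, list[int]] = defaultdict(list)
for sev, minutes in records:
    by_severity[sev].append(minutes)

# Reporting MTTRecover per impact level keeps the SEV1 signal from being
# drowned out by low-urgency SEV4 tickets.
for sev in sorted(by_severity):
    print(f"{sev}: MTTRecover = {mean(by_severity[sev]):.0f} min")
```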
Below is the MTTRecover for a hypothetical system over the past quarter using SEV terminology:
Conclusion
Let’s put it all together. Here’s a cheat sheet for all the metrics we’ve discussed:
Depending on which metric you choose, the optimization focuses on one or more areas. If the wrong metric is chosen, the signal you’re trying to optimize may get lost in the noise of a multivariable equation. Here is a brief summary of what periods each MTT* metric measures:
| Metric | Grace | Pending | Mitigation | Slack | Remediation |
| --- | --- | --- | --- | --- | --- |
| MTTAck | - | Yes | - | - | - |
| MTTDetect | Yes | Yes | - | - | - |
| MTTIdentify | Yes | Yes | - | - | - |
| MTTRepair | - | - | Yes | - | - |
| MTTRespond | - | Yes | Yes | - | - |
| MTTRecover | Yes | Yes | Yes | - | - |
| MTTRestore | Yes | Yes | Yes | - | - |
| MTTResolve | Yes | Yes | Yes | - | Yes |
| MTTRemedy | Yes | Yes | Yes | Yes | Yes |
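The table can also be read as a recipe: given the timestamps that bound each period, every metric above is the span across one or more consecutive periods. Here is a sketch for a single incident (the `IncidentTimeline` type and the timestamps are hypothetical; MTTResolve is omitted because it excludes the slack period and requires tracking the actual remediation work, which these timestamps alone don’t capture):

```python
from dataclasses import dataclass
from datetime import datetime

# Hypothetical timestamps bounding the five periods in the table above.
@dataclass
class IncidentTimeline:
    symptoms_started: datetime     # grace period begins
    alert_triggered: datetime      # pending period begins
    acknowledged: datetime         # mitigation begins
    symptoms_eliminated: datetime  # mitigation ends, slack begins
    risk_neutralized: datetime     # remediation finished

def hours(a: datetime, b: datetime) -> float:
    return (b - a).total_seconds() / 3600

def time_to(t: IncidentTimeline) -> dict[str, float]:
    """Per-incident spans in hours; average them over a look-back window to get the means."""
    return {
        "acknowledge": hours(t.alert_triggered, t.acknowledged),
        "detect":      hours(t.symptoms_started, t.acknowledged),
        "repair":      hours(t.acknowledged, t.symptoms_eliminated),
        "respond":     hours(t.alert_triggered, t.symptoms_eliminated),
        "recover":     hours(t.symptoms_started, t.symptoms_eliminated),
        "remedy":      hours(t.symptoms_started, t.risk_neutralized),
    }

timeline = IncidentTimeline(
    symptoms_started=datetime(2022, 1, 1, 3, 0),
    alert_triggered=datetime(2022, 1, 1, 3, 10),
    acknowledged=datetime(2022, 1, 1, 3, 15),
    symptoms_eliminated=datetime(2022, 1, 1, 4, 0),
    risk_neutralized=datetime(2022, 1, 10, 17, 0),
)
print(time_to(timeline))
```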
Different time spans may be measured differently (some even manually) but, as V. F. Ridgway puts it: “Not everything that matters can be measured. Not everything that we can measure matters.” Regardless of which metric you pick to optimize, remember that there’s a risk that it will become a goal (Goodhart’s law). Revisit your optimization strategy as the system architecture and the customer demands change.
Acknowledgement
Thanks to Franchesco Romero, Anto Cvitić and Minjia Chen for proofreading the early drafts of this article. If you liked what you read, follow me on LinkedIn or Medium. I write about technical leadership and web architecture.
References
- All the images of this article in a printable format
- MTBF, MTTR, MTTA, and MTTF (Atlassian)
- Benign on-call (Google)
- Incident response (Google)
- Google SRE books (PDF and free online HTML)
- Mean time to detect (MTTD) (TechTarget)
- Mean time to resolve (MTTR) (BMC)
- Mean time to remediation (MTTR) (Optiv)
- Creating an Alerting Strategy (Splunk)
- Remediation vs mitigation (Cyberpion)
- Why You Should Embrace Incidents and Ditch MTTR (DevOps)