Post-Mortems Trends and Behaviors

by Manuel Pais on Nov 29, 2017. Estimated reading time: 2 minutes

Eric Sigler, head of DevOps at PagerDuty, presented his findings from analyzing 1000 post-mortems run by 125 different organizations over a six-month period at the Velocity conference in London last month. Main trends include the prevalence of blameless post-mortems; the fact that only 1 in 100 post-mortems refers to "human error"; and that analyzing the lifecycle of incidents can provide useful insights into weaknesses in the incident response process.

The information was collected (and kept) anonymously from clients using PagerDuty's post-mortem builder feature. Sigler mined the data for people's common names and found no occurrences in half of the post-mortems. The fact that the other half named individuals does not necessarily mean there's a blame culture in place, Sigler highlighted, and the data could be skewed in other ways; for example, when a server named "Bob" gets mentioned in the post-mortem.
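The name search described above can be approximated with a simple word match. The following is only a minimal sketch of the idea, not Sigler's actual method: the `COMMON_NAMES` list and the `names_mentioned` function are hypothetical, and the example shows how a server called "Bob" produces the kind of false positive mentioned in the talk.

```python
import re

# Hypothetical sample list; a real analysis would need a far larger one.
COMMON_NAMES = {"alice", "bob", "carol", "dave"}


def names_mentioned(text, names=COMMON_NAMES):
    """Return the common names that appear as whole words in the text."""
    words = re.findall(r"[A-Za-z]+", text)
    return {word.lower() for word in words} & set(names)


# A host named "Bob" is indistinguishable from a person named Bob here,
# so a match alone says nothing about blame.
server_report = 'Server "Bob" ran out of disk space.'
clean_report = "The database failed over to the replica."
```

A match only flags a post-mortem for closer reading; it cannot tell whether a person was blamed, merely mentioned, or not a person at all.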

As for explicitly mentioning "human error" as a possible cause of the incident being reviewed, Sigler found nearly no evidence (only 1% of post-mortems). Sigler illustrated this point with the AWS S3 outage last March: the corresponding post-mortem did not claim human error as a cause, yet media coverage extensively blamed the individuals involved.

The data collected also suggests that many organizations spend considerable effort detailing the timeline of the incident (and many post-mortems don't include any other textual information). Sigler warns that, although a timeline is useful for understanding the incident being reviewed, tracking common incident state transitions (started, detected, raised, resolved) can provide better insights into where to improve the overall response process. For instance, a recurring long time between the started and detected states raises questions about the correctness of our monitoring and instrumentation. A recurring long time between raised and resolved might indicate bottlenecks in terms of sharing knowledge and responsibility in the organization, or simply that too much technical debt has accrued in the failing system.
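To make the state-transition idea concrete, the sketch below computes the median time spent between consecutive states across several incidents. It assumes each incident is a dictionary mapping the four state names from the talk to timestamps; the function names and the data layout are illustrative assumptions, not part of PagerDuty's tooling.

```python
from datetime import datetime
from statistics import median

# The four incident states named in the talk, in lifecycle order.
STATES = ("started", "detected", "raised", "resolved")


def phase_minutes(incident):
    """Minutes spent between each pair of consecutive states."""
    return {
        f"{a}->{b}": (incident[b] - incident[a]).total_seconds() / 60
        for a, b in zip(STATES, STATES[1:])
    }


def median_phase_minutes(incidents):
    """Median duration of each phase across a list of incidents."""
    per_phase = {}
    for incident in incidents:
        for phase, minutes in phase_minutes(incident).items():
            per_phase.setdefault(phase, []).append(minutes)
    return {phase: median(values) for phase, values in per_phase.items()}
```

A consistently large `started->detected` median would point at monitoring gaps, while a large `raised->resolved` median would point at knowledge sharing or technical debt, as described above.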

Other findings by Sigler included the fact that, on average, most organizations performed fewer than one post-mortem per month. A third of the organizations start the post-mortem within 24h of the incident, another third within seven days, and the remaining third more than a week later (making it difficult to resist the selective memory effect).
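An organization can check where it falls among these three groups with a small helper. This is a minimal sketch under the assumption that the delay is measured from the end of the incident to the start of the post-mortem; the article does not say which reference point was used, and `delay_bucket` is a hypothetical name.

```python
from datetime import datetime, timedelta


def delay_bucket(incident_end: datetime, postmortem_start: datetime) -> str:
    """Classify how soon after an incident the post-mortem began."""
    delay = postmortem_start - incident_end
    if delay <= timedelta(hours=24):
        return "within 24h"
    if delay <= timedelta(days=7):
        return "within seven days"
    return "more than a week"
```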

Sigler stressed the fact that this is a small data set and that results are skewed towards organizations that already have a post-mortem process in place, thus likely having more operational maturity.

Finally, Sigler left the audience with a couple of recommendations. First, post-mortems can be useful to check if process improvements are helping eliminate classes of errors from our systems, or if we're encountering similar issues on a recurring basis. Second, post-mortems can uncover organizational issues, so post-mortem outcomes should not be limited to technological changes.

For more information on setting up a post-mortem process, see PagerDuty's post-mortem process and template or Etsy's practical post-mortems. Etsy also open sourced their data collection and post-mortem tracker tool.
