Blameless post-mortems of production incidents are increasingly seen as an essential fixture of any organisation's procedures. Mathias Meyer, from Travis CI, shared how blameless post-mortems had a profound effect on him. InfoQ took this opportunity to have a look at post-mortems as practiced by organisations such as Etsy, Chef or GitHub.
A blameless post-mortem is a post-mortem with a focus on learning from the incident. As John Allspaw wrote:
[At Etsy,] we instead want to view mistakes, errors, slips, lapses, etc. with a perspective of learning. Having blameless Post-Mortems on outages and accidents are part of that.
Mathias Meyer describes a blameless post-mortem as:
(...) a meeting where all stakeholders can and should be present, and where people should bring together their view of the situation and the facts that were found during and after the incident.
The main goal is to find what happened, how it happened and why. The post-mortem must produce actionable items that will prevent the same thing from happening in the future.
Blameless post-mortems assume that humans have good intentions in the general case. If this assumption does not hold, the organisation will try to find someone to blame. In that case, the engineers involved will withhold information for fear of being punished, which makes it far more likely that the failure will happen again. As John Allspaw puts it, there is a need to "balance safety and accountability":
We believe that this detail is paramount to improving safety at Etsy.
Mathias Meyer thinks that the notion of human error should be disregarded:
It's not helpful to find out what's broken and what you can fix. It assumes that what's broken and what needs to be fixed are humans in an organization. (...) The humans acting in these [complex] systems are triggers for behaviours that no one has foreseen, that no one can possibly foresee.
As one among several examples, a DNS outage that happened earlier in the year at GitHub shows how easily failures in complex systems can cascade. GitHub published a post-mortem report showing how an erroneous DNS change led to fileserver failures, which in turn led to routing layer failures. The post-mortem exposed several weaknesses in the infrastructure and listed six remediation actions, all of them unrelated to the specific action that led to the outage.
Making sure that the actionable items are executed is crucial, or else the whole process loses its purpose. At Etsy, there is a policy where these items "trump any other work that the engineer is currently working on, including shipping product."
As reported by InfoQ, Etsy built and open-sourced Morgue, an application to log post-mortems. A Morgue report includes all the information related to an incident, including answers to the what, how and when, as well as the identified remediation actions. The information is gathered from a variety of sources, including IRC logs, forum posts and monitoring graphs.
A Morgue post-mortem report, as exemplified on the project's homepage
Mathias Meyer finds that blameless post-mortems have had a profound impact both on him and on the teams he works with. Do you do (blameless) post-mortems? Have they had any impact on you and/or your organisation?
Tom Gilb & Kai Gilb Jan 26, 2015