
Psychological Safety in Post-Mortems

Emotions often come to the fore when there is an incident; psychological safety in blameless post-mortems is essential for the learning process to happen. The post-mortem session must be fairly moderated, preferably by an outsider, giving everyone a turn to speak without criticism. Don’t start the analysis of the incident before there is a clear and common understanding of what actually happened.

Matt Saunders, head of DevOps at Adaptavist, spoke about psychological safety in blameless post-mortems at the Atlassian Summit Europe 2018. InfoQ is covering this event with Q&As, summaries, and articles.

InfoQ spoke with Saunders about when to do blameless post-mortems, how they differ from agile retrospectives, dealing with emotions, what can be done to make everyone feel safe in the post-mortem, and how to conduct effective blameless post-mortems.

InfoQ: When do you suggest doing blameless post-mortems?

Matt Saunders: The top answer to this is that any time there is an incident that causes disruption for customers, there should be a post-mortem. In addition, you should always go to lengths to ensure that it is blameless. It’s easy for an outage analysis to become a witch-hunt, but this rarely gets to the root of a problem. If someone made a mistake, then it is short-sighted not to also analyse why that person was put into a position where a mistake was possible. So there should be a blameless post-mortem absolutely every time there’s an incident, or even when something unexpected happens operationally.

InfoQ: What are the differences and similarities between agile retrospectives and blameless post-mortems?

Saunders: Some of the techniques here are very similar. A key tenet of agile retrospectives is to analyse what happened from the perspective of the team, and the same is true with a post-mortem. However, a post-mortem is generally conducted in difficult circumstances - perhaps your company has lost customers due to an outage, people are mad and looking for answers. There can of course sometimes be similar pressures in agile retrospectives, but the likelihood of a post-mortem being conducted in a stressful and potentially aggressive manner is much higher.

InfoQ: In your talk you will dive into the emotional impact of dealing with an incident and how it affects engineers. Can you elaborate?

Saunders: Engineers always want to do the right thing. It’s not just a matter of professional pride; emotions often come to the fore, especially when there is an incident, as people can struggle to stay calm. Everyone wants to fix the outage as soon as possible, but this can manifest itself in heightened emotions and raised voices. Decisions made long ago may be revisited in an emotional fashion, and this often isn’t helpful. Dr Richard Cook explains how computer systems can be highly complicated in his frequently cited paper "How Complex Systems Fail." Hindsight often biases post-incident analysis, and this can lead people to feel stupid, defensive, or even that their job is under threat. It is essential to enter the post-mortem with issues such as these in mind.

InfoQ: What are some of the main things that can go wrong in blameless post-mortems?

Saunders: Prejudging the outcome is a frequent problem. The aforementioned hindsight can lead post-mortems to settle on obvious problems, when the reality of how these problems came to occur can be highly complicated. Emotions running over and people getting personal is another frequent problem, and the influence of senior people must also be carefully judged. Perhaps an employee’s manager is in the room, and the employee acted on the manager’s advice, which turned out to be badly judged. This puts the employee in a dilemma where they may not feel able to speak freely.

In addition, the organisational constraints put on the team may lead to mistakes. Perhaps a deployment went wrong because it was performed by someone working in a central team who didn’t understand some key differences to other systems. This probably isn’t something under the control of the team but is still a contributory factor to the incident that needs to be accounted for.

InfoQ: What can be done to make everyone involved feel safe throughout the process?

Saunders: The key point is that we’re doing a post-mortem on the incident, not on the person who made a mistake (if indeed there was a single mistake that caused the incident). The session should thus be run with this front and centre. It’s key to clarify right at the start that this is a learning process for the team or organisation, and not a blame game.

It’s well acknowledged today that blaming individuals is not a good outcome, as this is likely to lead to more fear in the future, people being scared to operate on systems, and a general slowdown in operational fluidity. Instead, the post-mortem should be based around learning how to make the team’s processes better, so that the system helps its operators avoid mistakes. That should be the key takeaway.

If you can set the scene in this way, and also convince senior stakeholders that this is what the outcome will look like, then people will feel safe, willing to contribute, and help design better systems for the future.

InfoQ: What suggestions do you have for conducting effective blameless post-mortems?

Saunders: Ensure that the session is fairly moderated, preferably by an outsider; that everyone is given a turn to speak without criticism; and that the analysis of the incident is only started once there is a clear and common understanding of what actually happened. A good formula for conducting blameless post-mortems is to separate the session into three sections: agreeing the timeline, agreeing what went wrong, and, most crucially, agreeing what work needs to take place to prevent the problem occurring again.

Community comments

  • Who could possibly disagree? (Getting there is more difficult than it might appear.)

    by Richard Cook

    tl;dr: Bringing these ideas into the computing domain from medicine, aviation, and other domains is great. Practically, there are obstacles that frustrate and even obstruct progress here.

    Experience (especially in medicine) demonstrates that:

    1) The bigger the consequences, the less blameless the aftermath will be. BTW, big events tend to strip away all the nice protections for speaking openly, avoiding blame, etc., and events in your industry are getting more consequential all the time.

    2) It's easy to confuse blameless with sanctionless -- blame occurs when someone imputes cause for a bad outcome to a source, often (but not always) a person or group; sanction (in this context) is a punishment, for example, shaming, ridicule, etc. Many (most? virtually all?) conflate the two.

    3) The contribution of human performance to accidents remains problematic, even when sanctions are reduced or hidden from view. All valuable systems are managed, operated, recovered, and redirected by _people_ and so their contributions to failures and success will _always_ be controversial. The whole subject of 'error' inevitably intrudes into these discussions. [bit.ly/BeyondHumanError2ndEd]

    4) Simply reducing blame does not produce clarity about the sources of success and failure. The kind of inquiry needed to understand the many contributors to failure in complex systems ironically begins with examining success -- how things usually go well but sometimes turn out badly [examples at bit.ly/TaleOfTwoStories beginning on p.12].

    Everything Ben recommends is absolutely GREAT! DO IT ALL! You'll get something from even modest effort and even more from greater commitments and more informed approaches. Most efforts to go "beyond blame" have -- despite the good intentions -- produced little. There are lots of problems, conflicts, and paradoxes right beneath the surface. Try to remember that uncovering these is what success in post-mortems is about -- after all, the term refers to examining the dead!

    Finally I must quote the great Han Solo: youtu.be/tOuCbDkKIs4

    Richard Cook
