
Reckoning with the Harm We Do: in Search of Restorative Just Culture in Software and Web Operations


Summary

Jessica DeVita discusses the difference between blame and accountability and building a Restorative Just Culture.

Bio

Jessica DeVita has 20+ years of experience in IT operations in a variety of roles and industries including healthcare, entertainment, and cloud computing. DeVita most recently served as SRE Manager for the AKS team at Microsoft. Previously she was at Netflix, Chef Software, UberGeekGirl, Inc., and St. Jude Medical.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.

Transcript

DeVita: I'm Jessica DeVita. We're talking a lot about Just Culture. Instead of sharing with you more about the theory of it, or what different people have said about it, I want to share with you the results of a study I did on the lived experiences of the people on the ground who are responding to incidents and outages. I'm inspired by the Just Culture manifesto. In particular, the first two commitments of the Just Culture manifesto are that people should feel free to work and speak up and report harmful situations or incidents that they experience or are involved in without fear of unjust or unreasonable blame or punishment, and that people should be supported: organizations need to support people who are involved in these incidents and outages, and supporting people is, in fact, the first priority after an unwanted event. The other commitments are important as well, but I wanted to focus on the first two.

Research Methods

I went to talk with folks to find out what's happening for them. What are they experiencing? As a brief overview of the study, I asked participants whether they'd felt harmed or traumatized by their experiences, about the likelihood of being blamed or of blaming themselves, and whether they were considering leaving their jobs as a result. I invited them to describe what blameless and accountability meant to them. Before we go into the results, I'll just speak briefly about how I did the research. As a safety science researcher, I primarily focus on phenomenology, which emphasizes the importance of the lived experiences of people. For each interview that I do, I record it and I transcribe it. Then I print the transcripts out and I grab my highlighter and pen, and I'm looking for those significant statements that really speak to the research question that I have. This is often referred to as coding. We're reviewing the transcripts, looking for larger themes, and grouping the codes into larger meaning units and themes. Some people will then take their handwritten notes into a qualitative data analysis tool, where they will group these findings into those larger meaning units. This is really what forms a codebook. The codebook is where we gain confidence that we've really captured what it's like for people. The goal of phenomenology is that we would leave with a feeling that we better understand what it's like for someone to experience that. In this study, I used a mixed methods approach, where I surveyed 31 people, held two focus groups, and had several individual conversations with folks.
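To make that coding step a little more concrete, here is a minimal, hypothetical sketch (not the actual tool, codes, or data from the study) of how coded statements might be grouped into larger themes to form a codebook:

```python
from collections import defaultdict

# Hypothetical coded statements: (participant statement, code assigned by the researcher).
# These are illustrative stand-ins paraphrasing the kinds of statements in the talk.
coded_statements = [
    ("Every time I see an alert I get an anxiety attack", "anxiety"),
    ("The CTO told me to take the blame or he'd fire me", "blame from leadership"),
    ("I wouldn't even wake up to the pager", "sleep loss"),
    ("Scheduling of shifts ignores family events", "family disruption"),
]

# A simple code-to-theme mapping, the kind of grouping a codebook captures.
code_to_theme = {
    "anxiety": "impact on mental and physical health",
    "blame from leadership": "punitive culture",
    "sleep loss": "impact on mental and physical health",
    "family disruption": "disruption to family and personal time",
}

# Group statements under their larger themes (meaning units).
codebook = defaultdict(list)
for statement, code in coded_statements:
    codebook[code_to_theme[code]].append((code, statement))

for theme, entries in codebook.items():
    print(theme)
    for code, statement in entries:
        print(f"  [{code}] {statement}")
```

In a real qualitative analysis, the codes, themes, and groupings come from the researcher's reading of the transcripts rather than from a fixed mapping like this; the sketch only illustrates the structure of a codebook.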

Next, I'm going to give you a brief overview of the questions that we asked. Have you felt harmed or traumatized from your involvement in incidents and outages? What exactly is trauma? Trauma results from events that the individual experiences as physically or emotionally harmful. Trauma has lasting effects on people's well-being. I was surprised, I must say, that very few people said that they had not experienced trauma. Only 7 out of 31 responders indicated that they have not felt harmed or traumatized by their experiences. Only one person left a qualitative comment. They said outages are considered an opportunity to learn as well as share what you learned. If an outage is called, people will volunteer to be the incident commander or chime in to indicate that they're available to help. Even with upper management, there's very little negativity or blame. Again, I was surprised; I was really hoping that with all the work that we've done, more people would report that they'd had better experiences. The vast majority of participants described experiencing harm or trauma. They may have used other words, and I invited them to describe their experiences. These folks are what we might call second victims. Second victim may be a term you haven't heard before. Second victims suffer significant emotional harm regardless of whether their actions actually contributed to the incident. Second victims may experience harm related to stress from adversarial root cause investigations as well. The issues that they experienced were not limited to them alone; they also affected their families.

Themes That People Shared

Next, let's explore what people shared. Some of the larger themes were a punitive culture, impacts to their mental and physical health, loss of sleep, disruptions to family and personal time, and no time to fix underlying issues. We'll go into some of the significant statements for each of those larger themes. As I mentioned, people are still experiencing this punitive culture. One participant shared, "A CTO told me to take the blame for an incident, or he'd fire me." "I remember getting yelled at by the CEO that I wasn't using the right tool." "A lack of trust in the team is crippling, to be honest. People have memories like elephants. No one is ever satisfied." "The process we have to deal with incidents, 5 whys based, inevitably leads to the determination of a root cause that more often than not ends with human error." Postmortems were called recrimination meetings. One responder described how their COO flew into town after a particularly bad incident. They said, in the aftermath of that incident, the COO asked me, "Why did you release the software?" I told him we'd done all these tests, and we thought we were in good shape, but we missed it. He said again, "Why did you release the software?" I said, "I made a mistake, an error in judgment." This was an hour-long meeting. The third time he asked again, "Why did you do this?" I said, "I don't know what to tell you. I screwed up. I don't know what more you want from me." It wasn't like I got fired or anything, but I definitely felt blamed.

The impact on their mental and physical health was the next theme that emerged. One respondent said, "I have been traumatized from being involved in incidents for the past five years, to the point that every time I see an alert today, I get an anxiety attack." "Sometimes I can't help but not hide it. It will explode in emotional bursts or manifest in bright red hives all over my body, or I'll completely shut down." "I remember my lower back hurting because of the amount of adrenaline I'd been running on for the past 36 hours." "It was very stressful and it gave me a lot of anxiety, which led to a loss of sleep, and I would say a more profound sense of disengagement from the workplace." Sleep loss was a very common theme among the responders. I asked people, what's your relationship with sleep? One participant shared, "Sleep damage makes it very difficult for me to be on-call. When a page wakes me up, I generally do not sleep afterwards." I asked if sleep was ever discussed by management. They said, "Formal sleep discussion doesn't happen. It's a very dangerous discussion. Not being able to go to sleep at a reasonable hour takes its toll on your mental abilities." Another participant shared that they sleep with one eye open: "It's like you're asleep with one eye open. Don't dolphins do that? It's like half my brain is at work, ready and poised to respond to the pager, while the other half of my brain is trying to relax." "Sleep is very holy to me. I'm not on-call right now. I have a hard time separating work and non-work stuff, and so being on-call right now, I couldn't cope with it in a healthy way."

Some participants also described that, working in Europe, there were a lot of questions and avoidance around something called the European Working Time Directive, and that it was awkward to meet the requirements of this directive in a modern 24/7 on-call team. The European Working Time Directive, as I learned, doesn't cover sleep very much, but it does cover rest. It's quite complicated, but you're supposed to have at least 11 hours of rest in every 24 hours, and you're not supposed to work more than 48 hours in any given week. Another participant shared, "After the incident, when I tried to get back to sleep, my adrenaline was still going, so it took me a while to get back to sleep. It wiped me out for the whole day." "There were some periods of time where I just would not respond. I wouldn't even wake up to the pager. My partner would end up waking me up, 'Your phone is going off, go get it.'"
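To make those two constraints concrete, here is a minimal, illustrative sketch that checks a week of on-call work against the 11-hours-of-rest and 48-hours-per-week rules as described above. The shift numbers are invented, and the check is deliberately simplified; the directive itself has reference periods and exceptions that this ignores:

```python
# Hypothetical hours worked per 24-hour day for one responder, including time
# spent actively responding while on call. Illustrative numbers only.
daily_work_hours = {
    "Mon": 9, "Tue": 8, "Wed": 14, "Thu": 8, "Fri": 10, "Sat": 2, "Sun": 0,
}

MIN_DAILY_REST_HOURS = 11   # at least 11 hours of rest in every 24
MAX_WEEKLY_WORK_HOURS = 48  # no more than 48 hours of work in any given week

def check_week(work_hours: dict) -> list:
    """Return a list of (simplified) rule violations for the week."""
    violations = []
    for day, worked in work_hours.items():
        rest = 24 - worked
        if rest < MIN_DAILY_REST_HOURS:
            violations.append(f"{day}: only {rest:.0f}h rest (< {MIN_DAILY_REST_HOURS}h)")
    total = sum(work_hours.values())
    if total > MAX_WEEKLY_WORK_HOURS:
        violations.append(f"Week: {total:.0f}h worked (> {MAX_WEEKLY_WORK_HOURS}h)")
    return violations

violations = check_week(daily_work_hours)
for v in violations:
    print(v)
if not violations:
    print("No violations under this simplified check")
```

This is only a rough reading of the rules as DeVita summarizes them in the talk, not guidance on the directive itself.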

This disruption to family life was a significant theme for many participants. One person described feeling "exhausted, spent, drained, and guilty for working and being away from my children." "They were unable to count on me for being able to participate through the duration of some event, whether that be a meal or anything else." That interruption became a highly negative, emotionally charged topic for this participant. "Scheduling of shifts is never done with any consideration for family events." How are people coping? One participant shared that they are not coping well. "Having a good support network and therapy helps, but this industry can be absolute shit for them." Some participants described disengagement as a coping mechanism: they would dissociate or leave the job. Another participant described using it as fuel. Another person shared that they mostly get angry and try to change our industry as a way to cope: "Using it as fuel so I feel better by at least knowing I have helped others to not deal with it alone. The resentment I felt from those experiences turned into fuel."

Restorative Just Culture Checklist

Next, I want to talk a little bit about a resource that Sidney Dekker has shared on Just Culture. He's provided this Restorative Just Culture checklist. This is another set of questions that I asked some of the participants. First of all, it says that a Restorative Just Culture aims to repair trust and relationships damaged after an incident. It allows all parties to discuss how they've been affected and to collaboratively decide what should be done to repair the harm. Who has been hurt here? The second victims are the engineers who were involved in, or were just responding to, the incident. I also want to mention that there is a third victim here: the incident analysts, the people who are tasked with supporting engineers after an incident and investigating what really happened. They are what's known as a third victim. Have we acknowledged how we've harmed people? Do we acknowledge them? Second victims can experience harm related to adversarial root cause analysis investigations. They can suffer significant emotional harm, again, whether or not their actions contributed to the incident or whether it was preventable at all. The impact on them can be severe, as we learned from some of the responders. It can take the form of signs and symptoms associated with acute stress syndrome or post-traumatic stress disorder.

Some of this comes out of research on patient safety professionals as the third victims of adverse events. These third victims, according to the research that Holden and Card did, are those who experience psychosocial harm as a result of indirect exposure to an incident, such as leading incident investigations. Their study found that these third victims have an almost complete lack of emotional support, and a sense that the harm they experience goes unacknowledged. The harm is clearly real. Respondents experienced anxiety, lost sleep, emotional exhaustion, and a sense of being blamed by everyone for events they weren't involved in. This led some of them to consider leaving the profession.

We often talk about the organization as a victim. While organizations can certainly suffer reputational harm after these events, Holden and Card found that corporate victimhood is qualitatively different from the harm that individuals experience. Organizations do not experience acute stress syndrome, although their employees might. Organizations do not burn out and leave the profession, although their employees might. Have we overfocused on the organizational harms and the reputational harms and ignored the needs of our responders and the incident analysts? The Just Culture checklist asks us, what do they need? What do these folks need? I'll share with you next what some of our study participants shared in response to this question of, what do they need? How could we reduce the harm to them? Some of the larger themes that emerged are that we really need to educate management. They suggested we need to listen to engineers. We need to trust people. We need to take care of people. We need to slow down and alleviate the pressure to ship. They spoke about needing more training, more staffing, and more sustainable rotations. They asked us to focus on learning instead of blaming, and to use inclusive language. No more human error. No more root cause. No more fat-fingering a change.

What People Needed

Let's take a look at some of the significant statements that people shared about what they needed. They wanted us to invest in maintenance and plan for failure: spend more time on maintenance. "I think what does harm at my company is the frequency of incidents and false alarms. Sometimes our SRE team gets paged for things that resolve themselves before the SRE even gets online to check things out." "Make investing in resilience a priority for product teams. Make it clear to those teams as well. Usually, they want to improve things but they feel this external pressure to ship. Stop with the 'Get it right the first time' mentality." One person shared that they wanted regular fire drills or empowerment training, so everyone is prepared and it doesn't feel like a total shock and scare: being unfamiliar when you're a frontend engineer and suddenly having to parse through logs and APM like you're a DevOps person at 1 a.m.

Educate management. "Just give us a break." "Education for senior management around learning from incidents, accident models, and helpful and unhelpful behavior, including language." "Better training for managers on how to manage their staff. Several managers simply shouldn't be managing people." Another respondent said that we would really need serious culture change from the top, but that there was no appetite for that. The next theme is to really focus on learning instead of blaming. "Allowing for human nature to thrive instead of sanctioning individuals due to the complexity of systems." "Adopting investigation approaches that are capable of uncovering systemic challenges far away from the frontline. If the tools you use only uncover causes close to the last people who were involved, then that's all we'll learn and focus on. Also focus on multiple perspectives." One person said that if you make people mad and you chew them out, you're not going to get good work out of them, so don't do that. Another theme was the need to talk about what happened. "Personal outreach and giving people an opportunity to talk through things is very helpful, but also quite rare," according to one participant. "Provide a safe space to talk about what happened." "Build a space for them to share their concerns about the work they do at the sharp end." The last theme was sustainable rotations. One person shared, "It really has to be six people at a minimum for a somewhat humane rotation. I think managers get very upset when I say that."

What Is Blameless?

I invited participants to describe, what is blameless? What does it mean to them? What does it mean to you? Way back in 2012, John Allspaw was talking about blameless postmortems and a Just Culture at Etsy. He said that having a Just Culture means that you are making an effort to balance safety and accountability. He also described a blameless postmortem process as one in which engineers whose actions have contributed to an accident can give a detailed account without fear of punishment or retribution. What did our participants have to say? They characterized blameless as a behavior, as just something people say, as a doorway for inquiry, and in a number of other ways, but mostly as just not pointing fingers.

Let's go into some of the statements that people shared. Blameless is a behavior. "When faced with a surprise, often negative, the primary goal is not to blame individuals, although blameless does not mean sanction-less, but to learn and understand what can be improved with the goal of keeping our systems sustainable and adaptable in the long run." "Blameless means not pointing fingers at people for causing an incident." "Blameless means that we don't accept so and so messed up as the root cause of an incident. We look at the system's tools and procedures that failed to prevent the mistake or mitigate the consequences." Blameless was just something people say. "It's just something people say nowadays, like we do Agile or psychological safety. It would be weird not to do it. A lot of folks don't understand what it takes. I've seen some very blameful conclusions come out of blameless postmortems. You can still see the blame in the language and the actions." "Blameless is a squishy marketing term used by part of the safety community to try to make blame attributed to frontline workers go down. It's hard to tell how well it succeeded." "Blameless has been Agile-ified to mean whatever the person in charge wants it to mean."

Blameless was a doorway for inquiry. "Blameless means the emptiness of blame, not the dissolution of it. It means accepting that blame will happen as a natural result of a fleeting human emotional reaction, and that we should see it as a doorway for inquiry." "We look beyond human error to discover the cause of an incident." "Approaching incidents from a perspective that all involved were doing the best that they knew how, and that incidents occur because of systems factors and systemic pressures, not individual mistakes." Blameless recognizes that software is hard. Another participant shared that we need to recognize how blame occurs and what it can tell us about the way our brain is recognizing patterns, and to transform it into something more useful. Finally, they shared the importance of developing an empathy, an understanding of how a situation could unfold and why the decisions made were sound and reasonable from every operator's perspective.

I asked participants how likely they are to be blamed for incidents at their workplace and how likely they were to blame themselves. I was a bit surprised to see that a majority of participants said that they were very unlikely to be blamed. Just as surprising was that a lot of participants said that they would blame themselves. It's interesting how the notion of blamelessness manifests and how people describe it. Even in these blameless organizations that claim to hold these values, self-blame is a significant emotion for people. I asked people if they were likely to change their jobs as a result, to leave their jobs. Another surprising result is that many people were not looking to leave their jobs, though a few of them had left their jobs or were actively looking. Another person shared that they had lost two people from their team due to the heavy on-call burden. It's a significant risk to lose these people.

What Is Accountability?

I next invited people to talk about, what is accountability? What does it mean to them? Accountability was described in a few themes. It was characterized as culpability, being on the hook; as a capability; as a plurality, it takes a team; and as prevention. Let's look at some of the significant statements that people shared within these larger themes. Accountability as a capability: "The capability to accept your own role in an event, and your ability to go through restorative steps with other people involved." "That people would have control over their work, but also responsibility for it. Without both of those concepts, things fall apart." "The power of agency supported by the reciprocity of others." You can be doing the best you can and still fall short of your goals, but that's where we help each other out and don't beat each other up when we miss. Accountability, it takes a team: "We all pitch in to get the service restored as quickly as possible, and to figure out how to improve the situation in the future."

"There's a group of individuals that are stewards of a given system and service and are committed to its improvement and long-term sustainability." Accountability as prevention: "Accepting that existing tools or procedures failed to prevent the incident, and spending time fixing those things as a team before we move on to more exciting work." "Taking action so that the system cannot allow the mistake again." Accountability as account giving: one person shared that it was about forthrightly being able to recount decisions and actions that were taken. Another participant described their unpopular opinion that accountability can only come from yourself: you can hold someone accountable, but that's typically punitive in nature. Accountability being self-imposed meant, "I'm going to take steps to educate myself and hopefully others around me to the best of my abilities such that future events are mitigated based on what I've learned."

Is There Conflict Between Blameless and Accountability?

I asked them if there was conflict between these concepts, between blameless and accountability. For one participant the answer was yes. They said, "Some events aren't blameless. When someone intentionally violates a policy or is malicious, then there should be accountability. When policies and cultures don't support staff in being successful, I can see where blameless may be ok." For the majority of the participants, there wasn't conflict between the two concepts. This person said that it's possible to say that we as a team need to spend more time making our system more stable, without finger-pointing at any particular person or team. Despite the fact that blame serves a social function, blameless and accountable are not necessarily at odds, especially if organizations want to learn from incidents and explore why locally rational decisions, which may have been successful until there was an incident, had surprising effects. Accountability is the thing revealed to us when blame is but a passing phase instead of a concrete resting point. We discover that accountability is a plurality. It takes a team to be accountable in complex systems. How can we treat blame as anything but a film that we allow ourselves to recognize, politely remove, and move on?

Study Conclusions

I want to share now some of the conclusions from the study. Accountability means different things to different people. There's no one definition. Even in cultures that practice blamelessness or that claim to have a Just Culture, people are still being harmed. Sleep is holy, we need to talk about sleep. Why are we still waking people up in the middle of the night?

Closing Thoughts on Accountability and Just Culture

Some closing thoughts on accountability and Just Culture. The word accountability refers to the condition of being able to render an accounting of something to someone, according to Dubnick, whereas the idea of accountability is less amenable to easy definitions. Dubnick explains, crucially, "To perceive oneself as accountable is to accept the fact that there is an external reference point, a relevant other that must be taken into consideration. Being accountable is thus a social relationship." What about accountability looking forward, instead of backwards? Sidney Dekker has written about Just Culture extensively. He described how, in backward-looking accountability, holding someone accountable is directed at events that have already happened. Accountability can also be forward-looking; here he references Virginia Sharpe.

Restorative justice, according to Dekker, achieves accountability by listening to multiple accounts, and by looking ahead at what must be done to repair the trust and the relationships that were harmed. Perhaps operators involved in mishaps could be held accountable by inviting them to tell their story, their account, and then systematizing and distributing the lessons in it, using this to sponsor vicarious learning for all. Dekker says that perhaps such notions of accountability would be better able to move us in the direction of an as-yet elusive blame-free culture.

Dr. Richard Cook says, "There is no such thing as a Just Culture. There's just culture. Where complex systems failure has occurred, culture plays out predictably. It's more about the power dynamics; reserving the decision about what is acceptable, and calling that just, is a species of nonsense. Just Culture is almost entirely a fig leaf for the usual management blame assignment. Justice is in the eye of the beholder." Culture is what you do, not what you say, as one participant shared. I hope this has offered you some insights into the lived experiences of people. We should still pursue the restorative approach to learning from incidents.

Questions and Answers

Brush: As a manager that supports folks doing incident analysis as well as responders, I'm curious if you can give me any tips and tricks on day-to-day things that managers can do to better support the emotional load.

DeVita: I think as managers we have that special opportunity, but every word we say carries extra meaning. I think probably modeling some of the blameless language, but also, day-to-day, what I'd love to see more managers doing is just trying to check in with people after they've been on a long incident. It can take some time. Asking how folks are doing can be really impactful. It sounds maybe simple, but if you ask it and really try to check in with people afterwards, I just think that goes a long way. Also, monitor how many times people are being woken up. What are you doing about it? Stop what you're doing and go fix that. It's not worth it. The cost to humans is just incalculable.
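As a small, hypothetical sketch of the kind of monitoring DeVita suggests here, the snippet below counts likely overnight pages per responder from a page log. The data, the 22:00 to 07:00 window, and the field names are all invented for illustration, not taken from any particular paging tool:

```python
from collections import Counter
from datetime import datetime

# Hypothetical page log: (responder, timestamp of the page). Invented data.
pages = [
    ("alice", datetime(2023, 5, 1, 2, 14)),
    ("alice", datetime(2023, 5, 2, 3, 40)),
    ("bob",   datetime(2023, 5, 2, 15, 5)),
    ("alice", datetime(2023, 5, 4, 1, 55)),
]

def is_overnight(ts: datetime) -> bool:
    # Treat pages between 22:00 and 07:00 local time as likely sleep interruptions.
    return ts.hour >= 22 or ts.hour < 7

overnight_counts = Counter(who for who, ts in pages if is_overnight(ts))
for responder, count in overnight_counts.most_common():
    print(f"{responder}: woken up {count} times this period")
```

If one person dominates a count like this week after week, that is the signal DeVita describes: stop and fix the underlying alerting or staffing before moving on.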

Brush: If you've got a company that is offering 24/7 services, and they do need tight turnaround times from responders, what do you suggest to avoid waking people up at night?

DeVita: I'm not saying it's going to be possible to avoid it entirely. I think when we discover that the entire thing is resting on these two people, that's a burden that's really pretty great for them to bear. I would just say, first of all, again, notice that that has happened, and show that you're investing in it, because there's fulfillment in doing your service really well. When you do get woken up for a problem, you definitely, I think, have a more connected experience with how it's breaking. The breaks don't respect the clock. They don't care about our schedules. I would just really try to think about, if this person was in Fiji enjoying the sunshine, would our other responders be able to do something to mitigate, at least until the next business hours? Like, what's possible here? That's a question I would ask people afterwards. When you find that your experts have been woken up, definitely reach out and try to talk with them. Maybe ask, if you have that good relationship with them, "If you hadn't been able to pick up for some reason, do you think your team could have handled it for you? Would you have been able to enjoy your vacation," hypothetically? That question can come across very differently from a manager than from an independent incident analyst, if you have folks that are neutral, or strive to be neutral. Those are some questions I would ask, but knowing your power influences all of that. Dr. Cook reminds us at the end that the power dynamic may not be verbally spoken, but it is always there.

Brush: I hadn't thought of that. That's an interesting point about how the manager asking the question might get a less direct, or not entirely honest, more packaged answer about how this happened and what can happen instead.

The other thing that I was curious about is the opposite end of it, maybe this relates to it. What are some things that folks on the team will sometimes do inadvertently that make the whole experience somewhat worse? They're well-meaning, but accidentally they make things worse.

DeVita: First of all, that is an observation, and it depends on who's observing. When we have that observation, it's probably in hindsight. It's, ok, it looks like this made things worse, but we only know that in hindsight, usually. We sometimes have well-meaning people who may join an incident call, and it's not helping. I would just invite us to remember the incident manager or commander's role. If you're running that incident, please explicitly identify yourself as running it, so that anybody joining always knows who's running the show. That person has a duty, I think, to say, "Unless you're taking over as incident commander, this needs to be a voice-only bridge for the engineers, please take it to text only."

Brush: No, I think it's great, like the roles and responsibilities, inserting them.

DeVita: Or if you have a relationship with that person, maybe you do that. It really depends on the situation. Incidents are never really scripted. They're a bunch of people coordinating together who may not have experience working together. That's why some of the incident management practices can be helpful, because even if you've never worked together, if you identify your roles, maybe people know what to expect. Unless they're going to relieve you, it might be just, "Whoever is speaking right now, can I ask you to go to text only?" Or you message that, or maybe your other helpers are messaging. We're all moderators; we all have to help each other in these incidents.

Brush: I think this question is around strategic incompetence, or I've heard it called malicious incompetence. How do you get folks to widen their knowledge so they can be more effective overall as a team, in terms of redundancy?

DeVita: Remember the viewpoint that you're taking. You are seeing that as strategic incompetence. I'm not challenging that you have that experience; it's a judgment, essentially, that they are avoiding learning. I would invite you to be really curious about what's going on there. I'm not a gambling woman, but I would say to you that there's something very fearful going on there. If they are afraid to touch that system, you should probably slow down and fix that, because if the system is so fragile that they won't touch it at all, I would not enjoy having that continue. You could just add the manager power dynamic and push them: "No, you have to go learn." But it seems like there are some safety issues there: either they're afraid of the deployment system, or they know it's held together with bubblegum and toothpicks. Adding a manager push in there is absolutely not going to help. I would see if you can find a way to bring people together after an incident, and if you can start to build your trust with people, that it's ok to be curious about other parts of the system that are opaque to them. Until they see a blameless experience, they won't believe you.

 


 

Recorded at:

May 12, 2023
