
Incidents, PRRs, and Psychological Safety



Nora Jones discusses the context around PRRs and provides takeaways on how one can improve production reliability.


Nora Jones is the founder and CEO of Jeli. She is a dedicated and driven technology leader and software engineer with a passion for the intersection between how people and software work in practice in distributed systems. She founded the Learning from Incidents movement.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.


Jones: This talk is about creating a Production Readiness Review, about how it relates to incidents and psychological safety. I'm going to focus on the care you put behind a Production Readiness Review process, and what should and should not go into that process. My name is Nora Jones. I've seen orgs from the frontlines as a software engineer, as an SRE, as a manager, and now as I'm running my own organization. In 2017, I keynoted at AWS re:Invent to an audience of around 50,000 people about the benefits of chaos engineering in making you more production ready, drawing on my work at what is now Walmart, and at Netflix. Most recently, I started my own company called Jeli, based on the importance and value a post-incident review brings to your whole business. I also saw the barrier to getting folks to work on that. I started Jeli because of this. I also started an online community called Learning from Incidents in Software.


I'm going to spend a lot of time talking about context. A lot of what we learn in incident reviews and incident processes can actually be related to Production Readiness Reviews, and checklists, and processes themselves as well. I'm going to spend the first half building a case on why you need to spend time cultivating and sharing the context behind a PRR process in your organization with everyone there. Then, in the second half, I'm going to give you tangible ways to do that.

Creating a Repeatable PRR Process

Few services begin their lifecycle enjoying SRE support. We learned this from chapter 34 of the Google SRE book. You're likely spending time creating this process, or iterating and changing the existing process you have, because you had an incident or an untoward outcome. Whether you try to or not, you might end up being quite reactive about this. It's ok. PRRs are usually fueled by the fact that there's been an incident, or that there have been surprises recently, that says: ok, we as an organization need to get a little more aligned on what it means to release a piece of software. We have to spend time figuring out how those surprises happened, first by doing post-incident reviews, by looking at past incidents that may have contributed towards the untoward outcome, and even past experiences.

What Are Your Business Goals, and Where Is Your Organization At?

Before you start creating this process, there's a very important question you need to answer for yourself, if you're involved in creating it. Where is your organization at? What are your current business goals? The process outlined in the Google SRE book is not going to be relevant for a startup. The Production Readiness process that a startup uses isn't going to be recommended for a larger organization either. I'm going to take this to a super high level and assume there are two different types of orgs: your post-IPO org, and your pre-IPO org. In your pre-IPO org, you might care a lot about getting stuff released quickly, whether or not people are using the product. What it means to be production ready there is going to be quite different from when you have an actual stable service that needs to stay up all the time. Keep that in mind before you actually get started on this.

Don't Take Production Readiness Reviews from Other Orgs as a Blueprint

I really don't want folks to take Production Readiness Reviews from other organizations as a blueprint; it should be unique to you. I'll explain the impetus behind this. It can in fact be problematic for your organization to just copy this. We as a tech industry received the idea of Production Readiness Reviews from other industries. It started with aviation, and it worked really well for them. Then it went to healthcare. Healthcare decided: checklists are working very well in aviation, we should have checklists in healthcare. I want to share a quote from a late doctor who studied system safety in hospitals, because I believe it's relevant to us as a software industry as well. We're later to the game here than these highly regulated industries that have these Production Readiness Review processes.

This is from Robert L. Wears' book, "Still Not Safe," highly recommend. He says, "Many patient safety programs today are naive transfers of an intervention of something successful elsewhere, like a checklist. It's lacking the important contextual understanding to reproduce the mechanisms that led to its success in the first place." I state this because we as a software industry are in danger of getting into this as well. This SRE checklist may have worked exactly at Google, but context matters. You need to make sure the checklists you're developing have context. You need to know the why behind each item. You need to make sure your coworkers understand and inform that as well.
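One way to picture "knowing the why behind each item" is to make the context a first-class part of the checklist itself. This is a hypothetical sketch, not any tool's real format; every field name here is an assumption made for illustration.

```python
from dataclasses import dataclass

# Hypothetical sketch: a PRR checklist item that carries its own context,
# so the "why" travels with the item instead of living in someone's head.
@dataclass
class ChecklistItem:
    text: str                    # the check itself
    why: str                     # the reasoning behind the item
    origin: str = ""             # incident or release that motivated it
    owner: str = ""              # who can answer questions about it
    still_relevant: bool = True  # revisit periodically; context can expire

items = [
    ChecklistItem(
        text="Dashboards exist for the new service's key metrics",
        why="During INC-42 (a made-up example) responders had no visibility into queue depth",
        origin="INC-42",
        owner="sre-team",
    ),
]

# Anyone reading the PRR can ask "why is this here?" and get an answer.
for item in items:
    print(f"- {item.text}\n  why: {item.why} (from {item.origin}, ask {item.owner})")
```

The point of the sketch is the shape of the data: a new hire who questions a line item should find the answer attached to it, not have to reconstruct it from tribal knowledge.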

Five Monkeys Experiment

This leads me to the most important question of my talk: do you know the history of why your current Production Readiness Review process was introduced, and do new hires? I hesitated a bit putting this story in because there are problematic connotations and problematic practices behind it. I don't want anyone to think that I'm relating organizations to monkeys. I opted to share this story because I think it illustrates the point I'm trying to make about psychological safety. There are hallmarks of this in this particular case. In 1966, researchers at the University of Wisconsin-Madison studied monkeys. They put five monkeys in an environment as part of an experiment, and in the middle of the cage was a ladder with bananas at the top. Every time a monkey tried to climb the ladder, it would get sprayed with icy water. When the other monkeys saw a monkey try to climb the ladder, they would stop it. Eventually, the researchers started replacing the monkeys that had been sprayed with icy water with monkeys that hadn't been part of the experiment before, until there was a group of monkeys in the cage that had never been sprayed with icy water, yet still stopped any monkey that tried to climb.

I bring up the story because it captures a pervasive theme in many organizational cultures that don't have psychological safety: doing things the way they've always been done, without questioning or revisiting the reasoning behind it, even long after that reason ceases to exist. It's actually a really good measurement of psychological safety in an organization. Are we blindly following lists and procedures even when we feel we need to question the context behind them? Have you been a new hire at an organization where you think, this doesn't quite make much sense, but you don't feel comfortable saying the thing yet? That's an important measurement for leaders in organizations as well. Are folks asking you questions? Are folks asking each other questions about why certain things are happening? Because again, it might not be relevant anymore.

Giving, and Getting Context

My hope at this point is that I've convinced you that you need context. You need to understand context. You need to be constantly giving context. How do I give or get it? We spoke earlier about the fact that the Production Readiness Review process might be coming into play because the organization has had a lot of recent incidents. Maybe they're having a lot of them, maybe they've just had a string of bad ones recently. I know Laura Nolan was talking about firefighting, and what happens when you feel like you're firefighting a lot of incidents in an organization. As you're trying to introduce this process, people might be very busy in the middle of it, too. It's important to take the time and space to actually put some care into this process.

Psychological Safety

Let's go back to increasing psychological safety for a second. Amy Edmondson is a professor at Harvard Business School, and she has a lot of amazing work on psychological safety in organizations. Do we have a shared belief in our organizations that I can say what I'm actually thinking about this release? What I'm actually thinking about this Production Readiness Review? If you can, that's indicative of high psychological safety, which will make your team more effective. It will make your Production Readiness Review way more effective as well. I'm curious also how many different answers you get on what's important to a release. Different people are holding different contexts, from marketing, to engineering, to PR. Does it feel safe to speak up about this? Do we know why the PRR is happening? Do we feel comfortable asking about it? Do we know who's driving it and why?

Hearing Different Perspectives When Creating PRRs is Important

I want to share a quick screenshot from Jeli. We're an incident analysis platform, but some of this can be applied towards Production Readiness Reviews as well. This is a list of different people that may have participated in an incident. Hearing different perspectives when creating the Production Readiness Review process is really important. We want to make sure it's not just the person pushing the button that says, this is released now, this is ready to go. We want to make sure that we're capturing all the perspectives. One of the things we like to do internally at my organization is put our releases in an individual Slack channel, so that we can analyze afterwards who was involved, from who was observing the channel to who was participating in it. That way, we can capture the different perspectives about what this release means to them, so that we ensure we're capturing every party. This is a significant amount of work.
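The observer-versus-participant split described above can be approximated mechanically. Here is a minimal sketch, assuming you already have the channel's member list and its message history in the shape of a Slack JSON export (each message carrying a "user" field); the names and data are made up for illustration, and this is not Jeli's actual implementation.

```python
# Split a release channel's members into active participants
# (posted at least one message) and silent observers (lurkers).
def participants_and_observers(members, messages):
    spoke = {m["user"] for m in messages if "user" in m}
    participants = sorted(u for u in members if u in spoke)
    observers = sorted(u for u in members if u not in spoke)
    return participants, observers

# Hypothetical example data shaped like a Slack channel export.
members = ["alice", "bob", "carol", "dmitri"]
messages = [
    {"user": "alice", "text": "release v2.3 going out"},
    {"user": "carol", "text": "marketing copy is ready"},
]

active, lurking = participants_and_observers(members, messages)
print("participated:", active)
print("observed only:", lurking)
```

The observers list is the interesting output: these are people who cared enough to watch the release but never spoke, which is exactly the group the talk suggests you go interview.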

Law of Requisite Variety

I want to talk about the law of requisite variety. What I'm basically trying to say here is that the care you put into the Production Readiness Review process should match the importance of that particular release. This law basically says, informally and practically, that in order to deal properly with the diversity of problems the world throws at you, you need to have a repertoire of responses which is at least as nuanced as the problems you face. I'll share a quick image from a consultancy that does some of this work. It's a mapping between problems and responses. I want you to imagine the left-hand column, problems, equaling the release itself that we're talking about in that moment, and the right-hand column, responses, equaling the PRR process. When we look at responses here, in relation to creating a PRR process, this is what we mean. Any control system that you have must be at least as complex as the system it governs. The governed system here is the release, and the governing system is the PRR process. This is how they should match.

I want to also share an equation related to some of that as well. We know context is important, but disseminating that context is also really important. We spend a lot of time in organizations being reactive and focusing on driving errors down. I really like this equation from Gary Klein, a cognitive psychologist who studies organizations: in order to improve performance, in order to improve the value of these releases and how good they are, you need a combination of both error reduction and insight generation. The insight generation piece is the context. It's how much we're talking to everyone. It's how much they're contributing to the situation. Yet we as a software industry focus a lot on the error reduction portion. We're missing the mark on improving our performance, and on actually creating high performing teams internally.

How to Get Ready to Make a Production Readiness Review

Before writing the PRR, you should get introspective: look at incidents, and talk to the necessary parties and teams impacted by the release. How do we get ready to make a Production Readiness Review? When creating or iterating your PRR process, consistency might be important; you might not be starting anew every time, but you should at least retrospect it a little bit. Before writing the PRR process, I actually recommend that you have held retrospectives, with the necessary parties, for incidents or previous releases that surprised you or even went well in various ways, rather than doing the PRR yourself as an SRE in a vacuum.

Again, I want you to write down the context on what inspired this PRR process. Keep those receipts. Were there specific incidents or releases that didn't go as planned? The person in charge of writing this PRR should also acknowledge their role in this. That also indicates psychological safety. Even if you're not a manager or leading the organization in some way, if you're in charge of the PRR, there's a power dynamic there. Acknowledging that, as the creator, will help the parties around you that are contributing to it feel safer. It's important to get context from teams outside of SRE on what it means to be production ready. What does that mean for marketing? What does that mean for product? What does that mean for software? What does that mean for leadership? What does that mean for ICs? Then, do you have areas to capture feedback about this particular release?

Components of a Strong Retrospective Process

A lot of my colleagues at Jeli and I have spent our careers studying introspection and incident reviews. We've spent a lot of time defining what it means to have a strong retrospective process. I want to relate this to PRRs. There are a lot of components we can lift and shift from a strong retrospective process to a PRR process. We want to first identify some of the data. We want to analyze that data. Where's our data coming from? We want to interview the necessary parties. We want to hold cognitive interviews with the folks participating in the release in some way. We want to then calibrate and share what we've found with them, involve them in the process, co-design it together. We want to meet with each other and share our different nuances of the event and different takes. Then we want to report on it and disseminate it out.
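The steps named above (identify, analyze, interview, calibrate, report) can be sketched as an explicit, ordered pipeline. This is only an illustrative outline, not Jeli's process; the stage descriptions and function names are assumptions made for the example.

```python
# The retrospective stages as an ordered list, so none get skipped
# and each release leaves a record of what was actually done.
STEPS = [
    ("identify", "gather Slack channels, PRs, dashboards tied to the release"),
    ("analyze", "annotate the data; note surprises and themes"),
    ("interview", "cognitive interviews with the people involved"),
    ("calibrate", "share findings back; co-design the takeaways"),
    ("report", "write it up and disseminate beyond the immediate team"),
]

def run_retrospective(release_id):
    """Walk the stages in order, recording what was done for this release."""
    log = []
    for name, description in STEPS:
        log.append(f"{release_id}: {name} -> {description}")
    return log

for line in run_retrospective("release-2024-07"):
    print(line)
```

Making the stages an explicit list also gives you something to trim: as the talk notes next, not every release warrants every step, and a lighter release can run a shorter pipeline deliberately rather than by omission.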

Every organization has different needs, strengths, limitations, and goals for their Production Readiness Review process. Take what steps make sense to you. People that are doing the PRR process also probably have constraints on how much time they can spend on it. They probably have other responsibilities, especially if it's a smaller organization. Different releases are going to have differing levels in terms of the time and care spent on that Production Readiness Review. Fundamentally, you can learn from all of these, but realistically we know that orgs are balancing tradeoffs between time spent in development and operations versus time spent in learning and sharing context with each other.


You may want to look at multiple things that are informing the PRR, like incidents. This might be a quick example of three incidents that had various releases associated with them that may not have gone as planned. If you can look at those together and annotate as a group the different ways you were talking about these incidents, and the different ways they impacted releases for various parties, you're going to get a lot more value out of it. You can see here that we're seeing the nuances of this particular release and how it impacted the organization. People are tagging when stuff is happening, and people are commenting questions that they have about the particular release and how it might have gone awry. This can then inform the Production Readiness Review process in the future, which might be a pretty awesome action item of an incident.

If you're creating or revising your PRR, you can identify previous incidents to find the opportunities and people to inform that PRR. You want to interview colleagues and have a cognitive style of interviewing. You can take things from the SDD briefing guide to look at how we interview people and how we talk to our colleagues. We want to find out what their understanding of the PRR is. What's important to them? What feels confusing, or ambiguous, or unclear? What do they believe they know about the PRR that their colleagues don't? Talking to them individually will help reveal this too. What scares them about this release? A premortem of sorts. These things can help make your readiness review ready to catch something before it has a failure or unintended consequences.

Earlier, we talked about checklists potentially becoming almost dangerous, because they become a box-checking exercise without context. You're going into these events trying to understand what was happening beyond your knowledge. That's why we incorporate the different perspectives of our colleagues. I'm sharing this again, a particular Slack channel related to a release, finding all of the necessary parties. If all the groups impacted aren't talking, that's also something to dig into first. We can see a few people were just observing the situation. Did you talk to everyone that was impacted by this in some way, even if they were just lurking or observing? Make it explicitly clear to the people you're talking to what the expectation is, what their participation or role is in informing your Production Readiness Review.

Incidents are Catalysts

I related this to incidents before, and I want to tie that back. Incidents are a catalyst to understanding how your org is structured in theory versus how it's structured in practice. It's one of the only times that rules go out the window, because everyone's doing what they can to stop the bleeding as quickly as possible. Surfacing that delta between what you thought was going to happen and what actually happened should and can inform your PRR process in a really beneficial way. I would strongly recommend tying incidents to it. This incident, or past incidents, are catalysts to showing you what your organization is good at, and what needs improvement, in your PRR processes. Where were the deltas in people's mental models in the situation, especially between teams? Use your colleagues. Leverage them. Collaborate, talk. In this distributed world that we have with the pandemic, where everyone is remote, this is even more important, as I'm sure folks have realized. Use previous incidents and releases to inform your PRR process, and then rinse, wash, repeat each release.

How to Know There's Improvement

How do you know if it's working? How do we know if we're improving? We're not going to get to zero incidents, obviously, but maybe our releases feel smoother. Maybe we're feeling more alignment with our colleagues. Maybe we're feeling comfortable speaking up. More folks might be participating in the PRR process. More folks feel comfortable questioning the why behind different pieces of the review. If you're not seeing that in the beginning, that's usually indicative that there's more to dig into. People know where to get context on the PRR. Folks feel more confident in releases. Teams are collaborating more in this process, beyond just SRE doing it in a vacuum. There's a better shared understanding, a shared definition of production ready.

Questions and Answers

Sombra: There have been a few themes there. Is there anything that you would like to mirror or echo?

Jones: I think one of the points I wanted to stress is there is a huge psychological safety component to being able to ask questions behind certain rules and procedures that just feel normal in your organization. I understand that not every organization has that. I think it starts with your colleagues, like trying to cultivate that relationship with your colleagues. Because, ultimately, there are a lot more individual contributors than there are managers. I think at that point, the managers can't ignore some of those questions. It doesn't need to be combative. It can just be inquisitive. Say, before each release, you need to get three code reviewers. You've never really questioned it before, but you're like, why three? Why from this particular team? Can you give me some history on that? You as the engineer, having history on that particular nuanced rule will help you in the future and will actually help that release. Because it's helping you collaborate with your colleagues and actually understand your organizational context a bit more.

Sombra: We were talking about Production Readiness Reviews, and Laura's opinion was that the teams that own the software, the subject matter experts, should own the process. What is your opinion when multiple teams own a specific piece of software?

Jones: I feel like that's always the case. It's very rare that one team has owned the entire release and it's affected no one else. This is actually a mindset shift that we have to make as SREs. I think a lot of the time it's like, my service was impacted the most, so I should own the release process for this. What does that impact mean? If that incident hit the news, have you talked to your PR team about the release process? Have you talked to your customer support people? I don't quite know what it means to be the team that's most directly impacted. I get that a lot of us are writing the software that contributes greatly to these. What I'm advocating for is that we co-design this process with the colleagues that might also be impacted in this situation. One of the ways you can do that is check who's lurking in your Slack channels during an incident. They might not be saying anything, but they're there. Check who's there that you wouldn't expect, and chat with them. Because I think it ultimately benefits you as an SRE to know what to prepare for when you're writing the software, when you're releasing it, when you're designing it.

Sombra: Have you found, in terms of dissemination of information to humans, what is a good way to bring the findings of an incident analysis back to the team, if they're not in charge of doing that analysis?

Jones: It's really hard, but it's so necessary. I feel like so often in organizations we write incident reviews, or release documents, as a checkbox item, rather than as a means of helping everyone develop an understanding together. I think it starts with shifting the impetus of writing these documents. It's not to emotionally feel good, close a box, and feel like we're doing a thing. It's to collaborate. I think every organization is a little bit different. Some organizations really leverage Zooms. They might feel like they have a lot of Zoom meetings, especially in this remote world. Then some organizations might shy away from Zoom meetings, and dive more into the documentation world.

I've honestly been a part of organizations that have been either/or, where we'll have to have a Zoom meeting to get people to even read the document. Then, in other organizations, if someone's requesting a Zoom meeting, you're like, why don't you just send me a document? It's one way or the other. I would encourage you to think about where your organization leans, and do whatever you can to get people to talk to you about how they feel about some of the line items and how they impact them. I think too often we just write these documents or these reviews in a vacuum, and we're assuming people are reading them, but are we actually tracking that? Are we actually tracking their engagement with it? Are we actually going and talking to them afterwards, or are they just blindly following the process, and we're happy because that means it's good?

Sombra: You said that multiple people should inform the PRR, but then you get multiple perspectives and different levels of experience. For example, a junior engineer will respond differently than a senior engineer would, and the team sees that the junior engineer has a different response. How do you coalesce the voices of multiple contributors at different levels of experience?

Jones: At some point, there is going to need to be a directly responsible individual. A lot of the time it's the SRE, or the team that owns most of the software involved. They're going to have to put a line in the sand. Not everyone's going to be happy. I think focusing less on people's opinions, and more on people understanding each other, is really helpful. Like, here are the different perspectives involved in the situation. I really love this question about the junior engineer versus senior engineer, because I would encourage the senior engineer in that situation to ask more questions of the junior engineer on why they're feeling a certain way, and help give them the context, because that's going to be data for you. This junior engineer is going to be pushing code, and if they don't quite understand some of the release processes and the context behind them, it might indicate that you need to partner with them a little more. It should be revealing, actually, in a number of ways. I would leverage that as data. It would help level up the junior engineer and the team itself. I think involving them in the process, without necessarily having them dictate the process, is going to be really helpful for their growth and also the organization's growth in general.

Sombra: That assumes the junior engineer would have a substandard opinion. There are some cases in which your junior engineer will tell you that something is wack in your system, and that's definitely a voice that you want.

Jones: All of our mental models are incorrect, whether I'm a senior engineer, whether I'm the CEO, whether I'm head of marketing. We all have different mental models, and they're all partial and incomplete. I think your job as the person putting together that PRR document, is to surface the data points between all these mental models so that we can derive a thing that all of us can understand and use as a way to collaborate.

I'm curious what would happen if you gave your current PRR process to people on other teams and asked them why they thought we were doing each thing as an organization. I'm curious what answers you would get, because I think that would be data for you as the author. Not asking if they understand it, but asking them why they think each line item exists. If they give you a lot of different answers, I think that's homework for you, to put more context in there.

Sombra: Are PRRs service specific? Also, how do you leverage one when handling incidents?

Jones: I don't think a PRR needs to be created for every service. I think there should be an overarching release process that's good for the organization as a whole. Individual services might want to own their own release processes too, like before we integrate this with other systems, or before we deploy it to prod, as opposed to the organization releasing it to customers. We might want to have our own release process too. I think that's up to each individual team. In the service deployment situation, if it's just about your service and your team deploying to prod, I would make sure you include all your team members in that, especially the junior engineers. If you do have a psychologically safe organization, I would actually have them share their context first, so that it's not colored by some of the senior engineers' context. It really requires you to have a safe space for them to do that.

I would also look at some of your downstream services, too, if you have the time to incorporate some of those thoughts on your releases, and how your releases in the past have impacted them. Any internal incidents that you may have been a part of, too. Lots of different nuances for lots of different situations. Primarily, in my talk, I was talking about the overall Production Readiness Review of like, we're deploying a particular release right now.

Sombra: In the time continuum, you would have this as a precursor to going out. Then, if there's an incident, you do a different process to analyze that incident and feed it back into the PRR.

Jones: Exactly.

Sombra: I was asking Laura if there is any tooling that she prefers or gravitates towards when implementing a PRR. Is there anything you have found helpful in order to start laying out the process, from the mechanics?

Jones: I think the most important piece is that you choose a tool that is inherently collaborative. By that I mean, you can see the names of the people writing things in there. You can surface things from your colleagues. I would strongly advocate for your PRR not to be in GitHub or GitLab, or something that is inherently a readme file that might not get updated much. Maybe yours do. I would encourage it to be in some format where you can keep the document living, and encourage people to ask questions after certain releases. Question whether or not the PRR process worked for us.

I've certainly been in organizations where we've been a bit too reactive to incidents. With every incident that went awry, we would add a line item to our PRR process. Definitely don't do that, because then you end up with a PRR process that's just monstrous, and you're not actually understanding the underlying mechanisms that led to that individual thing you're adding being there to begin with. I think a lot of this is about asking your colleagues questions, and having those questions be public in this living document. It can be Google Docs. It can be Notion. I am biased because I started a tool that does some of these things. We have a lot of collaboration components in Jeli as well, more focused towards incidents. There are a lot of tools. The most important thing is that it's collaborative and that it's living.

Sombra: Since we're also talking about the environmental and human aspects of PRRs, what organizations or what stakeholders are we normally forgetting that could extract value from this document? Do you have an intuition or an opinion?

Jones: Are you asking about the consumers of this document, like who it's going to be most useful for?

Sombra: We tend to see things from an engineering lens, but are there any other stakeholders? For example, product would come to mind. If you never ship anything because the PRR is 200 lines or 200 pages long, that has an implication, or at least is a signal, for the business.

Jones: Totally. I would be very scared if it was 200 pages long.

Sombra: Obviously, it's an exaggeration.

Jones: I think there's a more nuanced version that needs to exist for engineers and for your team. I think it should be based off of the higher level version. It should all feed into a thing that everyone can grok, that product can understand, that marketing can understand, so that you have that high level mechanism, and everything trickles down to a more nuanced view from there. Because if you're having separate things, that's when silos start to exist. Especially in the remote world we're in now, I'm seeing so many more organizations exist even more in silos than they normally have. Over-sharing and over-communicating and over-asking, like, Inés, what did you actually think of this particular line item, because you've been involved in an incident like this in the past? Not a perfunctory, yes, this sounds good, but actually getting into the specifics of it, and making Inés feel safe to answer that question from me.

Sombra: This is actually reinforcing your previous point: if you put it in GitHub, other areas of your organization may not have the access or permissions to be able to see it. The access and discoverability of the document itself gets cut.

Jones: What do you do if you have a question on a readme in GitHub? You can't comment on it. You go and ping your best friend in Slack and you're like, do you know what this thing means? It's not in a public channel, and so you're in your silos again. They might have an incomplete view of what it means. They'll probably have an opinion like, yes, I don't get that either, but we just do it anyway. It's important to encourage an organization where people are asking these questions in public. I know a lot of the folks in this audience are senior engineers. I think what you need to do is demonstrate that vulnerability, so that other people in your organization follow.

Sombra: I think this is very insightful. We tend to think about systems, but we forget the people, and yet systems are made of people.

Jones: The people are part of these systems. We tend to think of them as just software, but we're missing a whole piece of the system if we don't take the time to understand the people component too. We're not trained a lot in that as software engineers. It might seem like this is the easy part, but it's actually just as complex. Understanding both of these worlds and combining them together is going to be so important for the organization's success, onboarding new hires, keeping employees there. All of this feeds into those things.




Recorded at:

Jul 22, 2022