
How Do We Talk to Each Other? How Surfacing Communication Patterns in Organizations Can Help You Understand and Improve Your Resilience


Summary

Nora Jones discusses how communication patterns in organizations can reveal how systems actually work in practice, vs how we think they work in theory.

Bio

Nora Jones is the founder and CEO of Jeli. She is a dedicated and driven technology leader and software engineer with a passion for the intersection of how people and software work in practice in distributed systems. She founded the www.learningfromincidents.io movement to develop and open-source cross-organization learnings and analysis.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Jones: A few years ago, I was working at an organization that had built their own feature flagging system. They used this system all the time. There were a number of problems with this homegrown system they had built. Not to mention, this company wasn't a company selling developer tools. They weren't even in this industry. They built this to help them move faster. The team that was running this was really stressed out about it. The whole engineering team at the company was using this piece of technology, but there were a number of problems with it. It didn't have type safety. It wasn't being maintained very well. They had a number of other things they were maintaining. They sent out this email, and it looked something like this: "Engineering," they gave everyone a big heads-up, "Please don't use this old technology on this date. There's a lot of issues with this technology, and it's mainly the lack of type safety and maintenance." Then a code freeze happened right after this period of time. What happens during code freezes? Developers are working on a lot of things that they want to push out, and they haven't been using their tools regularly. They had a month-long code freeze, and then developers were back, pushing their changes. Someone pushed out a change, and when they were pushing this change, they made a mistake. The fastest way to get rid of this deploy would be to change the feature flag to something different. What developers would do at this company was change it to an incredibly large number that would never be a feature flag. What ended up happening is that this number was larger than the max integer that Java could parse, and it brought everything down. There was a really catastrophic issue; it was relatively quick, but catastrophic. It was interesting, in that when I went and talked to the developer that pushed this change, they had used this old piece of technology. They said, I actually had no idea there was a new system. Then I went and talked to the team. They said, I actually didn't know our old tool was even still running. In this situation, it's a very minuscule incident. It was catastrophic, but it was small from a time perspective. No one really wanted to go into it; the person was embarrassed, the team was embarrassed. I think a lot of folks could reduce this to, this person should have read their emails, or the team should have communicated better. When I dug into it, I noticed that most of the engineers at the company who were really long tenured had no idea the new technology existed, while all of the newer engineers to this organization did. This old technology was actually running on a single node. Folks did not know that this was happening. I bring this all up, because all of these communication patterns are very interesting, and they're not good. They're not bad. They're just data, as my friend Rian says.
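
To make that failure mode concrete, here is a minimal sketch. The real system was Java, where parsing a value above 2^31 - 1 as an int fails with a NumberFormatException; this Python version imitates that limit explicitly, and the function and constant names are hypothetical, not taken from the talk.

```python
# Imitating a Java-style 32-bit signed int parse: Java's Integer.parseInt
# raises NumberFormatException for anything above 2**31 - 1.
JAVA_MAX_INT = 2**31 - 1


def parse_flag_value(raw: str) -> int:
    """Parse a feature flag value the way a 32-bit integer parser would."""
    value = int(raw)
    if value > JAVA_MAX_INT:
        raise ValueError(f"{value} does not fit in a 32-bit integer")
    return value


# The informal "disable this" convention: set the flag to a number so large
# it could never be a real flag value.
try:
    parse_flag_value("99999999999")
except ValueError as exc:
    print(exc)  # in the incident, this parse failure took everything down
```

The parse failure itself is the least interesting part of the story, though; the conditions around it, the informal "set it to a huge number" convention, the unmaintained old system, and the email nobody absorbed, are what the rest of this talk is about.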

Background

I'm Nora Jones. I am the founder and CEO of an incident analysis company called jeli.io. I've worked at Slack, Netflix, jet.com, and several other organizations. I've co-authored a couple of books on chaos engineering, which is how you can improve your system's resilience by purposefully injecting failure conditions. I founded the Learning from Incidents software community. A few years ago, I founded my own incident management company, Jeli, based on things that I had seen and experienced as an SRE at various organizations.

Reacting to Incidents

I bring up incidents in regard to communication patterns, and in regard to resilience, because they can really teach us about how our organization works. We react to incidents, but we don't always learn from them. We don't always spend the time understanding the conditions that needed to be true in order for the incident to take place. In that story I brought up, so many conditions needed to be true in order for that incident to have taken place. Our reactions to incidents are quite expensive when we don't take the time. We subtly make a lot of changes to our organization, and it might not be right after an incident that these things happen. We might create new roadmap items. We might reprioritize roadmap items. We might decide, we need to spin up a new team, we need to get rid of this technology. We might, over time, fire people or reorg people because of incidents, and we might not attribute it exactly to an incident, but it does happen over time. All these things have really strong impacts on the bottom line of our organization. These are things that our businesses don't always see or attribute to an incident, and they are expensive.

Communication During Incidents

Understanding how you communicate, especially during urgent situations like incidents, can lead to more resilient systems in our organizations. I want to talk about how we can do that. I have to bring up GenAI. You can't automate learning and teaching in your organization with generative AI, but you can collaborate with it. You don't want generative AI to eliminate the cognitive work, because that's not how we get better. The cognitive work is the real work. It is what makes our systems resilient. It's how we decide which pieces of technology to use, and how to design them. Like, what does resilience actually mean in 2023? The industry has changed a lot. We've gone through a global pandemic. There have been changing economic conditions, and there's a lot you can still learn about your organization and influence the way it operates through studying communication, even if you have different types of resources than you're used to. Communication during urgent situations like incidents can tell you how your organization actually works, versus how you think it works. Here's the thing: incidents end up being mirrors. All rules and procedures go out the window because everyone's doing what they can to get us back to normal. Companies are in transition phases all over the industry. They might be struggling financially, and we all have to keep this in mind with our roles, and we have to adapt. We're still having incidents. Now is the moment, when you get time after incidents, to show what you learned from them and how it's going to be applied to what your businesses are experiencing. I keep saying learning from incidents. When I say this, it really is not, "Why did this thing happen?" but, "Why did it make sense for us to do it this way?" Think back to my story earlier: everything that everyone was doing in that story made sense to them. The things other people were doing did not make sense to each of those individuals. That's all data. I am going to use this talk to ask the industry, why did things make sense the way they were? How did it make sense for us to do it this way, so that we can be better now and beyond?

Part 1 - Developing Expertise

Part 1, I'm going to talk about how expertise gets built in organizations today. Part 2, I'm going to take us through two different incident investigations. We're going to get technical but also socio-technical in this investigation. Then part 3, we'll talk about what's next. How does expertise get developed? I'm going to take us through a fictional company. This company is a startup, pre-product market fit. There's only four people at the company. I very creatively named them Person A, B, C, and D. Person A says, "I'd like to use new technology to make this project go faster. It's really going to set us up for success in the long run." Person B says, "Ok, if it gets us there faster and more reliably, let's do it." How many of you have been in this situation before? We're trying to move through a project, we have deadlines, we need to use technology to figure something out, and so we introduce a new piece of technology. Person C, voice of reason, says, "Is this easy to use? I'm not really familiar with it." Person A says, "Don't worry, I'll teach you all. It's really straightforward." We have a new expert in this technology, now Person A is actually going to own this technology for the remainder of their tenure at this organization. Has anyone been in this situation before? Has anyone been Person A? Person A is now highlighted, because they're now the resident expert of this technology at this company. They only knew a certain amount of what they needed to actually implement this and how it was going to help them. Person B and Person C decide to get a little bit of expertise, but they don't have the expertise that Person A does.

Months pass, and the feature that the new technology was used for was a huge hit for their organization. Keep in mind, this is a startup so they're growing fast. They had a lot of customers requesting this, and it generated a lot of demand for the product, which means we hire. We hire a bunch of people. Person A is still the expert on this new technology, but they're also an expert on a lot of stuff. They've made a lot of architectural decisions. People are asking them a lot of questions. What's actually happening is as our system and our organization have expanded and grown, the expertise actually dwindles a little bit. This is where things get tricky because the technology is being used in different ways now. Again, the people are growing, the system is growing. Where we get into trouble is if Person A doesn't know that their expertise has dwindled, and also if everyone around them doesn't know. It's usually everyone around them that doesn't realize their expertise has dwindled. They're busy. I want to share David Woods' Theorem from the SNAFU Catchers STELLA report. He says, "As the complexity of a system increases, the accuracy of any single agent's own model of that system decreases rapidly." We have all these people in the company and we have people with little bits of expertise in the system. An incident occurs, so Person C is first on the scene in the incident and their expertise has virtually gone to zero at this point. There's been a lot going on, and they bring in the only person that they know that knows how to work this new technology, Person A. All these other people join, and they're able to resolve the issue, but there are a million other things going on. They do an incident review, but it only includes a small group of people, and it's done hastily, and Person A actually can't go, they have a lot going on. Management gets this high-level report with high-level metrics about the incident, and things go on. The thing about incidents is that parties engaged in incident response have different goals and priorities. Surfacing these after an incident, or surfacing them during an incident, can tell us a lot about what can make our systems resilient.

Over time, we actually end up with a smaller group of people that have some of this expertise, and we're not realizing it. Where I see orgs get into a lot of trouble is they put the onus on Person A to explain why they are such an expert in this technology, to teach people about this technology. What ends up happening with Person A is that they're inherently bad at explaining some of their stuff. They have a lot going on, and it's hard for them to remember what it was like to be a beginner. It's actually more beneficial to extract that out of Person A, to ask them cognitive questions. To ask them what they were looking at during incidents. All these things can help transfer over expertise and create more experts. Where I see orgs get into trouble is when they put this on Person A's management track, and like, in order to get promoted, you need to teach people more of this stuff. They're going to end up not in a great situation. The better situation is to abstract this from them. Over time, Person A is involved in almost every single incident that involves the new technology. They don't actually have time to teach people. They are running around and so they end up just doing it every time. I'm sure a lot of us here can relate to that, when we know a particular thing and we get pulled into it. We can see all of this stuff happening all the time. It's easier to just do it rather than show people. You can't meaningfully design a resilient system without looking at the cognitive work, without looking at what Person A was looking at, who they brought in, who brought in Person A, which is itself work. If you're doing incident reviews hastily, not wanting to look at communication after the fact, not studying this stuff and only looking at the technical facts, you're limiting yourself. It needs to be known that this part is work as well. It's also really awkward. It's awkward to say, I don't actually know how this works. It's awkward to call someone out, in a beneficial way, for not knowing how this works. It's awkward to be a boss loitering around. It's awkward to have your boss loitering around. All this stuff is awkward, and yet there are ways we can work through some of these things that can help extract more of this information from us and help our organizations grow a little bit more.

Part 2 - A Tale of Two Incident Reviews

With that in mind, I want to take us through a tale of two incident reviews. This really helped me underscore the value of cognitive work, because in a previous life, I was actually tasked with doing an incident review of this incident. This incident is fictionalized, but a lot of it is true. We have two incident reviews, and it's the exact same case. In the first one, it was treated as a seemingly innocuous incident that didn't deserve a long incident review. Only a few people were involved in the incident. The review was completed by the person that was most involved in the incident; it was completed by the Person A of this story. Then the second review, I'll take you through what it looked like under a cognitive lens. Again, seemingly innocuous, it didn't deserve or have time for a thorough post-mortem. There wasn't a huge amount of customer impact. Have people had situations like this before? It was like scary for a second, and then it was like, let's move on. It was a templated approach for the post-mortem, mostly written to be filed. It wasn't really written to be read. It was completed by the members of the team most involved in the incident. The summary looked something like this, which is a pretty standard template in the industry. We have a summary, an impact, detection, resolution, detailed summary, contributing factors, what went well, what went wrong, how we got lucky, and action items. This is very standard. Does it always work? We don't know. How much time was spent on this? Who read this? Who consumed it? Who was it for? Who attended the meeting? Was there a meeting? What was the purpose of this? What was difficult or easy about handling the incident in the moment? What we know about this incident so far is that coordination was pretty smooth, and yet I see nothing in this template that helps us understand why it was so smooth.

We're going to actually go through a real case here. A change was made to search infrastructure tooling at this company that updates which search index to look at each day. They were able to recreate the issue, and soon after it had been reported, engineering was able to track down the problem and fix the bad configuration. Due to the nature of the bug, no pre-processing work had to be done for search to be functional again. Then they have a little bit more of a detailed summary explaining how search works. This is a fictional e-commerce company, so we're using SKUs, which are stock keeping units. A search index is split into collections: a larger collection for older SKUs, and two to three collections for recent SKUs, one per day for newly created SKUs. Each collection can be individually submitted for reading or writing. Since search queries typically read from all collections from a given vendor, they employ a feature called a collection read alias, which allows them to query all of those collections using a single name. New collections are created and added to the read alias two days in advance. There was a bug in the search tooling introduced the week before, so newly created collections were not being added to the read alias. Since the collections are created two days beforehand, the first collection to be affected by the bug was the one created three days ago. Then they had a timeline. The bug is introduced in the Python tooling; the December 6 collections are created. They have a few potential causes that were ruled out. They figured out it was a misconfiguration. The configuration is fixed, the fix is confirmed, and the fix is applied to the index file. I want to pause here. If you were a new engineer on the search team at this organization, or an engineer at the company, what value would this give you, just reading through this? Take some time to think about that as we go through this.
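
To make the read alias mechanics easier to picture, here is a minimal Python sketch of the structure described above. It is an illustration built on my own assumptions; the names (collections, read_alias, create_daily_collection) are hypothetical, and the real system is a search cluster, not an in-memory dict.

```python
from datetime import date

# One bucket per collection of SKUs, and a read alias that fans a single
# query name out to many collections.
collections: dict[str, list[str]] = {}
read_alias: dict[str, list[str]] = {"sku_search": []}


def create_daily_collection(for_day: date, add_to_alias: bool = True) -> str:
    """Create the collection for a given day (done two days in advance)."""
    name = f"skus_{for_day.isoformat()}"
    collections[name] = []
    if add_to_alias:
        # The week-old tooling bug effectively skipped this step.
        read_alias["sku_search"].append(name)
    return name


def search(alias: str, term: str) -> list[str]:
    """Queries read from every collection behind the alias, and only those."""
    return [sku for coll in read_alias[alias] for sku in collections[coll] if term in sku]


# With the bug, the new collection exists and receives writes, but the alias
# never learns about it, so newly added SKUs are invisible to search.
dec6 = create_daily_collection(date(2023, 12, 6), add_to_alias=False)
collections[dec6].append("PROMO-SKU-123")
assert search("sku_search", "PROMO") == []
```

This is also why the impact was partial rather than an obvious outage: the older collections were still behind the alias, so only SKUs added after the bug took effect disappeared from results.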

They had an impact section as well. All e-commerce users had a degraded search experience for the 20 minutes over which the incident took place. Is 20 minutes good, or is 20 minutes bad? If I were a new engineer at this company, or even an engineer that had been there for a while, we might have very different interpretations of what 20 minutes means at this organization. Sure, I know that they were happy about how they experienced this incident, but there are some things left to be said here. Contributing factors: lack of a type system or static analysis in the code; an alert that would have detected this is also broken. "Lack of," that's pretty normative. "An alert that would have detected this is also broken," a little bit of counterfactual reasoning here. I also just want to pause on this slide because I think you could put these contributing factors in almost any incident that has happened in the tech industry. What went well? Fairly quick fix. Again, we know that they were happy about fairly quick, but what does fairly quick mean here? What went wrong? The alert that should have detected the misconfiguration was also broken. Having the words "should have" in your incident reviews or in documentation or in your meetings is going to stop you from digging further. If it should have, it would have, but it didn't. Let's go through how it didn't. How we got lucky: it happened during a low traffic time and not many users were impacted. What's low traffic? What's not many users? Let's talk about some of these things. Action items: fix the bug, fix the alert, check all of our alerts. These could also be in any incident in the tech industry. Take some time to think about what questions you have about incident 1. I've seen so many incident reviews like this. When they're tasked to Person A after an incident like this, they're going to look like this. They're going to feel really normal to them. They're just going to be doing it because it's a checklist and they want to move on with their day. They're not really getting any value out of these things. Even though it was a seemingly innocuous incident, there is so much value to get out of it.

This brings me to investigation number 2. It's always a little bit more complex than it appears. This investigation was conducted by an engineer outside of the main team, so it was not Person A. It included evidence from previous deploys. It included teams impacted but not previously involved in the timeline. It informed new team members about system dynamics. If done well, your incident reports can be your most valuable onboarding tool. They teach the new employees things that they would otherwise never see. They teach things about how you communicate in orgs, what people look at, what dashboards are important, where they check during an incident. All these things can ramp people up very fast. The purpose was to engage the audience and to be read. It was not to report. It was not to file and forget. It used a guide kind of like this. This is from the Howie guide, which is a free resource on Jeli. It included a background of the document, who responded, who investigated, and an executive summary. It not only included customer impact, but it also included employee impact, key takeaways, the triggering event, contributors and enablers, mitigators, what was difficult or easy during handling, and some follow-up items. I want to go back to this, because when the purpose of your document is either to file it or to produce action items, you're going to end up with a much shallower report. If you do things in a way that your purpose is to engage and learn, really good action items actually just end up falling out of that. It's pretty amazing how it happens. Try it with an innocuous incident sometime; I think you'll be surprised.

Let's jump into the timeline for this one. In order to do that, we're going to actually jump into how people were speaking during an incident. A lot of timelines I see in the tech industry don't include what people were saying during an incident; they include someone's interpretation of what they were doing. Sometimes they even take out names. I see a lot of notions in the tech industry like, to be blameless means to not name names. That's actually inaccurate. This is something J. Paul Reed says: in a truly blame-aware organization, it's safe to name names. It's safe to name who did the thing. It's safe to speak up and say, I did that thing. It's safe to have these awkward conversations. I'm going to jump into the key updates channel. I think if you remember our previous timeline, it started well after November 23rd. Justin Bacon is a search engineer on the search team, and he was tasked with a project to update some of the search keys. He was tasked with this by the business. There were a lot of SKUs that needed updating. They were bringing new SKUs in. He was writing that they were receiving requests related to the key change beta release. This is just a project channel that he's sitting in. Between November 23rd and November 30th, the search team received several pager-driven alerts on search not working, and invalid and revoked keys were a source of confusion for the search team. Imagine you're on the search team and you're getting paged in the middle of the night for this project you had to rush on, with alerts saying "invalid key." The alert isn't actually correct, because those keys were supposed to be removed. They were doing these key updates and they didn't have time to update their alerts yet. They're tired. They are getting woken up every night with these key change releases. They introduced PR 22 on December 3rd, to avoid the noisy alerts that they're getting as a result of the key changes. They create a new thing to not alert when the key is invalid. Then the partition is created to hold the indices for December 6th. These are always created two days in advance. Then there's a second, unrelated PR, PR 23, introduced to not update the read aliases for live collections during a migration. This was a result of a team member seeing a migration take a minute longer than necessary. It was just some unnecessary cleanup work that they added to this PR while they were waiting for it to get reviewed. Between December 4th and December 6th, errors started getting detected, but there were no pages due to PR 22. Then this is where user search queries began not reaching the December 6th partition. Both of these PRs had to have happened in order for this incident to take place.
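
Both PRs make more sense when you see how a change to alerting can quietly widen in scope. Here is a small, hypothetical sketch of a PR-22-style filter; the talk does not describe the real alerting pipeline, so the names and the suppression rule here are my own illustration of the general hazard, not the company's code.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    source: str    # which system raised it, e.g. "search-indexer"
    message: str


# Hypothetical PR-22-style change: during the noisy key-change rollout, stop
# paging on a whole class of search-indexer errors instead of only the known
# "invalid key" noise.
SUPPRESSED_SOURCES = {"search-indexer"}


def should_page(alert: Alert) -> bool:
    """Return True if this alert should wake someone up."""
    return alert.source not in SUPPRESSED_SOURCES


# The known noise is silenced...
print(should_page(Alert("search-indexer", "invalid key for vendor 42")))           # False
# ...and so is the brand-new, real failure from the alias bug.
print(should_page(Alert("search-indexer", "skus_2023_12_06 missing from alias")))  # False
```

That is the "changing code with alerts has risk" takeaway from the second review: the scope of what you silence is rarely exactly the scope of the noise you were trying to get rid of.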

Let's jump to a spam detection channel where fictionalized customer service agent, Conor Jacobs, is basically fighting bots from their competitors that are trying to spam their site. He's like, I'm working on automation that's going to stop these bots once and for all. He's posting here for posterity. He's deploying now. He's going to run through tests when it gets to staging. In this org, they do some quick manual smoke tests. Even though he wasn't touching either of these systems, he was doing manual smoke tests for them. He says, Add to Cart looks good. I'm moving on to search next. That's weird, search is not returning a SKU that I just added in test. I'm assuming that's normal behavior. My deployment wasn't related to search. He works in customer service. He's just doing this to help with the process. He's moving to the second deploy. He's going to run through tests on this one, too. Now let's jump to the customer experience channel. Natalie is receiving some reports that customers can't access search results for the items in the promo email sent out yesterday. We've learned a few new things with this incident, even though it's the exact same as incident 1. We've learned that marketing was involved. We've learned how customer service was involved. We've also learned about the project that the search team had to work on right beforehand. She's looking into it and managing them. They appear to be receiving search results for all other SKUs added before last week, so it seems like it's not an obvious down. Now Conor actually is like, "I saw this too. I think it's the same thing. I'm going to page the search team. Something's not right. I can't figure out auto-paging, I'm just going to manually page them, which I think pages all of them." Then the entire search team hops online at the exact same time. Justin pops in, Natalie pops in, Jen pops in. I am curious about why Conor was able to get all these people in the room so quickly. How did everyone just hop online, and I should note, it's also the middle of the night for the search team. Conor has been at the organization for six years, he was actually the eighth employee. Everyone knows who Conor is. Conor knows who everyone is. Imagine getting paged in the middle of the night and you see Conor's name, you're probably going to hop online pretty quickly.

Now we're finally in the incident channel. Justin hops on, and he says, I don't see any messages past UTC midnight in the results. Yes, they're not in the alias. Going to update it. Melissa is there too. We got three people online so quickly. I think we're starting to see why this incident was resolved so fast. It's not as innocuous as it appears. Justin goes, it's related to PR 23, a variable named update_aliases overrode the method named update_aliases, ugh, Python. I love seeing someone say ugh, or, no, or putting like a scream in an incident chat, or anything like that. That is data. Someone saying, ugh, Python, that is data. You've got to follow that. When you follow that, you actually learn that only this team in the entire organization is using Python. There are very few Python experts in the org, and they're also expected to maintain expertise on other languages in this organization too. Surfacing all this stuff is creating a conversation. It's creating interest. We've enrolled people from three different parts of the company, not to mention product, and we're teaching them about how each other works. We're teaching them about each other's goals and different perspectives. It will help them work better together in the future. You want to make these things engaging. When you see just a plain timeline, and it's linear and so clean, incidents don't actually work like that. As humans, we want to rationally explain something bad that happened. We want to see cause and effect, but it doesn't always work like that. Sometimes we have a few red herrings. Sometimes we think we fixed those red herrings. Sometimes someone comes in and is like, no, that's a red herring, don't fix that. You figure out the actual thing, and then you keep going. All of that stuff is data. It's all important to include, and it's also important to make it engaging.
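
For anyone who has not hit this particular Python footgun, here is a tiny, hypothetical reconstruction of what "a variable overrode the method" can look like at module level. The name echoes the incident discussion, but the code is illustrative only, not the company's actual tooling.

```python
# In Python, a later assignment to the same name silently replaces the
# function bound to it; there is no compile-time error.

def update_aliases(collection: str) -> None:
    """Add a newly created collection to the read alias."""
    print(f"adding {collection} to the read alias")


# A later change introduces a flag and reuses the name. From here on,
# `update_aliases` is a bool, not the function above.
update_aliases = False

try:
    update_aliases("skus_2023_12_06")
except TypeError as exc:
    # 'bool' object is not callable -- or, if the call is guarded by the same
    # flag, it simply never runs and new collections never reach the alias.
    print(exc)
```

A linter or type checker would flag a rebinding like this, which is presumably what the "lack of type system or static analysis" contributing factor in the first review was gesturing at.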

Some key takeaways from this incident. There's multiple notions of what it meant for search to be working. When I talked to Conor Jacobs, the tests he ran were very different than the tests that the search team would have run, were very different than the tests that the pricing team would have run. That was apparently by design in that organization, because they thought it would encompass the most tests. Incident handling excellence was influenced by team cohesiveness and collaboration and also familiarity. The search team had actually been working together for five years. They knew each other very well. They knew Conor very well. They knew him well enough to respond to a text in the very middle of the night. Because all of them were online, they all knew about all the PRs that could have created this situation. Changing code with alerts has risk. There was a really quick awareness of this issue. There's always multiple roles and experiences to study and multiple people to invite to the meeting. These incident reviews can actually be used to gain insight on each other's sharp ends, which inherently improves the resilience of our systems.

I want to call out some of the differences here. Incident 1: really fast, they probably put it together in about 45 minutes. They did. It was a 20-minute incident; there were two contributing factors. There were two action items that got completed. There were two people involved in the whole incident. There was no difficulties-during-handling section, and only customer impact. There were 10 people that read the document. I want to point out that these types of investigations, like incident 2, take time. A lot of this time is spent on data collection, which can be helped if you're using some kind of data collection tool. With incident 2, the timeline dates back three weeks before the event. It had eight contributing factors. It had more action items that got completed. It had more people that were involved. It had a dedicated section for difficulties during handling, and it had both customer and employee impact. It had 140 readers of the document. Yes, this route takes more time. You don't have time to do this for all the little blips that come up during your day. How do you choose which incidents to apply this to? You want to look at what's anomalous for your organization, not just customer impact, not just SEV-1s. I know you and your org take time for the SEV-1s, but what about when it's not a SEV-1? I'm curious, how many people are involved in most of the incidents in your organization? If it's normally 10, and all of a sudden you have 300, but it was still not a SEV-1, is that worth investigating further? Have these people ever worked together in an incident before? Are most of the folks tenured or new employees? There are a lot of different routes you can take for deciding these things. Most importantly, you want to look at what is obscure for you.
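
One way to operationalize "look at what's anomalous for your organization" is a simple heuristic over incident metadata. This is a sketch of one possible approach, not something prescribed in the talk; the field names, thresholds, and sample data are made up and would need tuning against your own baseline.

```python
from statistics import median

# Hypothetical incident metadata: (incident id, severity, number of responders,
# how many responders had never worked an incident together before).
incidents = [
    ("INC-101", "SEV-3", 4, 0),
    ("INC-102", "SEV-3", 6, 1),
    ("INC-103", "SEV-2", 9, 2),
    ("INC-104", "SEV-3", 31, 18),  # not a SEV-1, but wildly unusual for this org
]

typical_responders = median(n for _, _, n, _ in incidents)


def worth_deeper_review(severity: str, responders: int, first_timers: int) -> bool:
    """Flag incidents that look anomalous for this org, not just high-severity ones."""
    unusual_size = responders > 3 * typical_responders
    unusual_mix = first_timers > responders / 2
    return severity == "SEV-1" or unusual_size or unusual_mix


for inc_id, sev, responders, first_timers in incidents:
    if worth_deeper_review(sev, responders, first_timers):
        print(f"{inc_id}: candidate for a deeper, cognitive-style review")
```

The point is not the specific thresholds, but that responder count, how often these people have worked together, and tenure mix can all flag an incident as worth a deeper look even when severity and customer impact say otherwise.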

Part 3 - What's Next?

What's next? Working to understand how experts think and talk and behave, and engaging people in that, creates more experts. It creates a more cohesive organization, which ends up making your system as a whole more resilient. When I say system, I don't just mean your technical system. This is how you scale these people. It can help them understand each other. It helps them work better together in the future. It helps them know who to reach out to. Like I mentioned with the second incident, having the review done by an engineer that's not on the team is going to give that person a lot of context on how search works. Now if they're in a search incident in the future, they're going to know all these people. They're going to know what to do. They're going to know what not to do. If we look at our chart from earlier: if we actually do incident reviews like this, and we look at some of the cognitive work, we end up going from this to this. We end up creating a lot more experts in our organizations.

Some tips for generating patterns from communication. Use more than customer impact to warrant an incident review. Make it engaging. Create a record. Schedule and make time for learning. Don't break your promises to learn. It can be easy to schedule over these things if they are checklist items. If there's time spent to make it engaging, people will want to come to these meetings, people will want to read these documents. Putting the responsibility of running the incident review outside of Person A is also incredibly valuable for all the reasons we've talked about before. It's something leadership needs to support. Making it engaging, making it visual, making it in whatever way folks like to learn, but that tabular timeline is not going to do it for you. It's not going to be something people read through, especially if it's just the opinion of the writer. Doing this can give the writer of the document more expertise. Like I mentioned, it can be your best onboarding document. It can inform roadmaps. It can serve as professional development and refresher training for folks. It can help enable better meta-analysis across incidents. I know for a fact that most organizations are surfacing some pattern across their incidents. If they're not actually going deep on them, the patterns are almost meaningless. If you're making decisions based on some of those patterns, you're going to end up in a bad spot. It can also help identify where headcount is needed or not needed. It can inform build versus buy decisions. Like maybe we should stop investing in this technology if there's two people propping it up. It can also create more experts.

I'm going to drop a hot take for a second. If you aren't doing this stuff, your metrics are going to be wrong, and they are lying to you. Your MTT*, your number of SEV-1s, all those metrics without context, are not helping you make better decisions, they're not helping you be more resilient. Using communication patterns can actually add a ton of context to these metrics, and help you walk backwards from a story. Stories are what people respond to. Stories are what people remember. My challenge for you, in the next couple weeks, look through a previous communication at your organization during a time-sensitive situation. Don't pick the most critical SEV-1 you've had, but pick something innocuous or seemingly innocuous, and dig into it. Dig into how people talk to each other. See if you can find the word ugh, or a scream face emoji. Those things are data. See if you can just take someone out to lunch, have a Zoom coffee with them, and say, what did you mean by this? Can you tell me about some of that reaction? Has this happened before? You're going to learn something.

Questions and Answers

Participant 1: In organizations that treat incident response as a formality, how do you make the case for doing an investigation on a low priority incident, something that's not that big a bang?

Jones: I think keeping in mind your organization and how they react to these sorts of things is really important. There are a few different approaches you could take. You could do it under the radar. Timebox how much time you spend on it. When I started doing these in a previous organization that was a little bit unsure of it, I would just do it in some additional time. After I got some of my work done that day, I would study that incident a little bit further. Most of the time, people that are involved in incidents would love to talk to you about the incident that they were involved in. They're very interested in sharing their story, especially if you approach it from a way of like, I'm just trying to understand. I'd love to understand what your experience was here. Like, this part must have been frustrating, was that frustrating for you? They'll usually go off on a spiel. You'll get the buy-in of the folks that participated, which can be helpful. Again, it all depends on your organization. We have some materials on the Jeli site that can help you approach managers or give some feedback there. Do it under the radar for a couple of them and then show the results to someone. Once the results start speaking for themselves, you'll probably get more buy-in.

Participant 1: How do you address folks that say, what's the root cause?

Jones: I don't want to shame anyone when they're in the middle of a review. If they're asking about the root cause, it's not the time, I think, to get hung up on the language they're using and stuff. I think it's also helpful to keep educating people. It's also important, as you're doing this stuff, to pick which hills you want to defend a little bit more. If you're trying to break people away from root cause, or break people away from the 5 whys, I don't know where your organization is at on this journey, but I would just take it a little bit at a time. I saw an org that I think worked for a year to get root cause out and contributing factors in. It takes a while sometimes, and so you slowly chip away at it. I think if you can get people asking different questions and participating, that's the real value. Even just attend an incident review in the future and count how many people talk versus how many people attend; that can tell you some interesting things and some areas to start on. If your organization is further along in this, and they're open to receiving feedback on root cause or something like that: I totally get why you think that way. I totally get why you think it's a root cause, like we want our minds to work that way, but look at all these things that needed to be true. By saying root cause, we limit looking at all these other things.

Participant 2: We definitely have a template similar to the one that you pulled up, with a million different things, and it just becomes a chore. On balance, do you think a comprehensive template like that is a net harm compared to one that is very simple?

Jones: I think, after an incident, people want to do a good job at whatever their org is expecting them to do. I think they're really busy. I think a lot of the time, people forget how to do an incident review because they're not doing one every day. I think templates and guides, and this applies to my thoughts on runbooks and stuff too, can be useful if you're not more of an expert than the document. I think if people are only doing incident reviews every so often, giving them a guide or giving them a set of questions to ask about the incident can be very useful for them. Otherwise, if people come to a blank doc, I see them struggle with what to come up with, and they get really stressed out and they usually just end up not doing it. I think it's a nice balance, but I think it's important to be careful in terms of, is this document limiting me? Is this template limiting me? I really like the approach of giving folks a set of questions to look at, rather than categories like, fill this out, how much impact, things like that. I also think there's a big balance in that a lot of us have to fill out incident reports for auditing purposes. The auditing purposes can sometimes get in the way of the learning purposes. I think it's important to allow both things to happen at the same time. Otherwise, if you're ripping people away from their template, but they still need to do auditing, they're going to get frustrated and not spend much time on the learning. You're going to end up with more reviews like incident number 1. I would maybe even start with just adding some questions about some of the cognitive work to your document. I think that would be a good step.

Participant 3: I'm wondering if you have some perspective on appropriate timing for incident reviews. My background is in mostly mission critical and safety critical software, and often emotions can be running really high, because people could have been harmed as a result of this, but you don't want to [inaudible 00:45:12].

Jones: The emotions can be running high in all different types of situations. I would recommend starting your interviews and data collection process quickly. I think if emotions are really running high, I would actually do one-on-one interviews with people that were involved in the incident. I've been in those situations before, and like, sometimes people will break down crying, but it's a good place for them to do that. You can still get the insight you need on the situation. What you can do in that situation is then collect the results from the people that you interviewed one-on-one, aggregate them into a document, and share it with them. They're enrolled in it, but they're not so much put on the spot in the meeting. Then it's like you're telling a collection of stories that you heard, rather than having people that are in a really high emotional state put on the spot. Because that, as I'm sure you've seen, doesn't always lead to a super productive outcome.

Participant 4: One thing I've found is it's not everywhere good. One challenge I've had with some folks is, a lot of things that make sense at the time for an operator [inaudible 00:46:56]. Is there any way of seeing the obvious layer to other people. What do you think about that?

Jones: In the first scenario I brought up, for everything that someone did, another department was like, what are you doing? Why are you still using that old tool? Why are you using things in that way? If you can just say, it made sense. No one comes into work to do a bad job, for the most part. I'm sure there are a few folks in the world that do. Most people want to try their best and do a good job. Figuring out and approaching things from a "why did it make sense to this person at the time," rather than thinking it was obvious, is going to allow them to explain a little bit further. If you come in like, "This should have been obvious, we sent this email a couple weeks ago. Do you not read your email?" you're going to get a really shallow answer in response. They're going to be a little bit on the defensive. If you can remove that part of your brain, you're going to get a lot more data, like, "our org used to communicate this way," and you're going to learn a lot more about your org. Because if it made sense to them, and if it wasn't obvious to them, there is definitely more than one person for whom that is true who is not going to speak up if you approach it from the "it was obvious" standpoint. I think for folks that have that inclination, explaining to them that we're going to get better results if we don't take that approach can usually help them understand why that's important.

 


Recorded at:

Apr 05, 2024
