Nora Jones on Resilience Engineering, Mental Models, and Learning from Incidents

Jul 03, 2020

Podcast with

Nora Jones

Daniel Bryant

In this podcast, Nora Jones, Co-Founder and CEO at Jeli and co-author of O’Reilly’s “Chaos Engineering: System Resiliency in Practice”, sat down with InfoQ podcast co-host Daniel Bryant. Topics discussed included: chaos engineering and resilience engineering, planning and running effective chaos experiments, and learning from incidents.

Key Takeaways

The chaos engineering and resilience engineering fields, although inextricably linked, are often incorrectly conflated. Resilience engineering is focused on “identifying and then enhancing the positive capabilities of people in organizations that allow them to adapt effectively and safely under varying circumstances.”
The UX of internal or engineering-focused tooling, such as chaos experimentation tooling, is extremely important. However, engineers that create these tools often overlook the value of UX, or don’t have the relevant skills in user design research to undertake this.
We all work in socio-technical systems. It is important to take the time to understand both aspects. Developing empathy and working alongside teams that you are trying to influence is essential. It is extremely important to continually work to build correct “mental models” of a system.
The before and after of running a chaos experiment is as important as running the experiment itself. However, the aspects of planning, creating effective hypotheses, and analysing and disseminating the results are often under-resourced.
Incident analysis can be a catalyst to help you understand more about your system. The Learning from Incidents website, alongside books such as Sidney Dekker’s The Field Guide to Understanding Human Error and Scott Snook’s Friendly Fire, can provide excellent background information to these topics.

Transcript

00:21 Bryant: Hello and welcome to the InfoQ podcast. I'm Daniel Bryant, news manager here at InfoQ, and product architect at Datawire, and I recently had the pleasure of sitting down with Nora Jones, co-founder and CEO at Jeli. Nora has been a leader within the chaos and resilience engineering communities for quite some time now, having originally worked in the space at award-winning organizations like Jet.com, Netflix, and Slack. I've learned a lot from Nora over the years, whether this has been from her books, her QCon SF talks, her AWS re:Invent keynote in 2017, or from her recent InfoQ podcast recorded with my co-host Wes.

00:51 Bryant: In this podcast, I was keen to understand the interesting recent trends within the chaos and resilience space. Topics of particular interest were the user experience of chaos tooling, and also exploring what is involved around the before and after stages of running chaos experiments. I was also looking to understand the human factors, and socio-technical aspects of building and operating resilient systems. And lastly, I wanted to explore how to go about building a community of practice to share these ideas. Nora and several of her friends and colleagues are doing this around https://www.learningfromincidents.io/, a new website that has sprung up. Hi, Nora. Welcome back to the InfoQ Podcast, thanks for joining us today.

01:24 Jones: Thank you for having me.

01:25 What were the key takeaways from the QCon London track that you recently hosted, “Chaos and Resilience, Architecting for Success”?

01:25 Bryant: So you and I last met, I think at QCon London. You ran a fantastic track there, thanks again. Chaos and Resilience, Architecting for Success. Great lineup of speakers. What was your key kind of vision, or learnings, or takeaways that you wanted the audience to walk away from that track with?

01:42 Jones: Yeah, it's a great question. Thank you all so much for having me. That track was super fun to put together. I've been working in the chaos engineering field, I guess since 2015 now. Which is around the time when it started getting formalized too, and the principles of chaos came out. But I've noticed over time there's been some conflation between resilience engineering, and chaos engineering, and incident analysis. And of course those worlds are inextricably linked, but it's super helpful to understand the differences of them so that you can get the most out of developing those kinds of programs in your organization. So my goal with that was to help understand the relationship between those two worlds, and also to provide real world examples.

02:23 Jones: And so in that track, we had a number of fascinating talks. We had Laura Maguire talking about the cost of coordination and incident response, which was from years of her research in critical digital services and incident response. And she learned a lot about how the tech industry thinks of the incident commander role and how it might not be providing the value that we think it does as an industry. So I definitely recommend checking that out.

02:47 Jones: And then we also had Randall Koutnik give a talk on resilience engineering with UI. And so, one thing that I've realized over my years of working in internal tools teams and SRE type teams, is that we are frequently teams of four backend engineers, but yet we're making tools for the whole company. And so we're just assuming that, say it's four of us, the four of us can learn UI, the four of us can learn front-end, the four of us can learn design, the four of us can do product management. And so I think these roles in organizations end up wearing a lot of different hats. And it's good sometimes, but I think we reach a boundary of those limits where we kind of need those skill sets.

03:27 Jones: And so Randall has actually been a UI engineer, mostly for internal tools teams. So it's a really unique role, so I wanted him to share his experiences there. I worked in an organization one time where we were looking at a bunch of incidents and things that were triggering them. And this one internal tool system was always mentioned. And those poor engineers, they were doing a great job, but they were also handling so much for the system when most of their background was in operations. But yet they were developing the design for it, they were interviewing their users internally for it.

03:57 Jones: They were doing a lot of stuff, and it wasn't fully acknowledging that they maybe needed more capabilities on that team as well. And then I also spoke in the track on my journey with chaos engineering and how my thinking has evolved over time, how it's changed, how I've been trying to help organizations get more ROI out of it, and realize more potential from it, how incident analysis comes into play with it.

04:19 Jones: And then lastly, we had Crystal Hirschorn on the track as well, and she was talking about chaos engineering at Condé Nast. And then we had Samuel Parkinson from Financial Times that gave a really interesting talk on running game days, but with regards to incident response in their organization. And so that's also a dichotomy that I wanted to bring up, is practicing things under pressure, under pressure moments, is really key for helping you prepare on your team. Obviously there's nothing like doing the real thing and actually being put in on call and understanding that pressure, but he talked about how they had helped prepare folks for that, that weren't maybe practicing it as much.

04:58 Jones: I think that's very different than chaos engineering too, and that's helping build a form of resilience in the team. And so what I really wanted to provide overall with that vision for the track was to understand a bunch of these worlds so that we're not conflating them and giving them their individualized attention so that we can get the most out of them.

05:17 How important is it for engineers to consider the UX for internal tooling, such as that used to run chaos experiments?

05:17 Bryant: Yeah, that makes a lot of sense Nora. I think half the battle with sort of approaching a new topic is understanding the component parts, and the other half is how do they all join together, I think isn't it? And particularly when you're starting out the journey, it's not even realizing some of the most important bands. You mentioned there about Randall's talk, I thought it was fantastic. The whole UX of tooling is something that I've made a bunch of mistakes in my career. I built tools on bash scripts, right?

05:38 Jones: Yeah, same here.

05:40 Bryant: So I think in fact the UX is really important in this space, yeah?

05:43 Jones: Oh yeah. It's super important. And when I was at Netflix, I was on the chaos engineering team and we were a team of four backend developers. And I worked with absolutely brilliant people there that had immense expertise in systems and Java. But we had put together this UI for the application where people could run chaos experiments, and I remember we essentially would ask them to do this giant form in the UI, was what it was. And admittedly, we didn't put as much thought and expertise into the design of that form and that front end as we had into the back end of the system, which was where most of our expertise lay, right?

06:23 Jones: And so I noticed after a while, most of the people running the experiments and using this tooling were the folks on our team. So I did a bit of user research internally to try to figure out a little bit more of what that was, and what was triggering that, and how we could change that. Because ultimately, yes, the four of us could be running experiments on the Netflix ecosystem and we had this unique expertise of the Netflix ecosystem, but we didn't have individualized expertise on the services and the components. And so the mental models we were refining when we ran these experiments were our own. Which was ... We were missing helping teams refine their mental models. So it was good to go look into that.

07:03 Jones: But I remember when I had interviewed someone on a team, we're pulling up the application together, doing some basic user research, and I just have them use it. And there's this label on the form that's, "How much SPS do you want to impact?" And SPS is Netflix's heartbeat metric, it's the key metric. If SPS is too high or too low, people get paged, it's a big thing in the organization. And so apparently seeing that on a form is very scary to people. We had given them a form with how much SPS to impact, and we had defaulted it to a certain number, but we had allowed them to change it. And they frankly didn't know what they were supposed to change it to, they did not know how we were calculating that number. It was just a lot for them to think about as they were also thinking about the hypothesis of the experiment that they were going to run.

07:54 Jones: So that was just one example, there were many examples of how we could make that tooling more usable for folks, which I can get into later in this podcast. But that was one example of the design element and the UX element that it was, "Oh, that makes a lot of sense." We, the four of us had a lot of context on what that meant, right? And why we put that number there, but we weren't necessarily exposing it to the user and teaching them at the same time. So there's a lot to learn on our ends about that UI/UX. And I think it reaches a certain point where if you're developing a tool that's impacting your entire system internally and everyone's using it, should you have front end engineers on that team? Should you have specific, nuanced roles that are helpful with design, with research? So it was really good to reflect on that.

08:43 How important are the people and relationship aspects of chaos and resilience engineering?

08:43 Bryant: I think one of the key things there, it reminds me of the conversations I've been having with some very interesting folks for the last few weeks on the podcast around platform teams, or SRE teams, yeah? Now how you actually operate them is very interesting. Is it more of a consulting role, or ... And I guess the challenge of this, you don't want to go and seem like, "I know everything about chaos, and I'm here to teach you all this stuff." So that sort of human relationship is really important, yeah?

09:05 Jones: It's so important. And I would argue that it's a massive part of it. I mean, as SREs and as internal tools people, like I mentioned before we have to wear all these different hats, and we have to be pretty good at all of them, but we also have to recognize the limits of our expertise. But I would say building relationships is table stakes for SRE and internal tools type roles, which is where chaos engineering falls under too. You really have to make an effort to understand your users, you have to make an effort to understand the teams.

09:33 Jones: In tooling, you can build the best tooling in the world, but that doesn't give you that kind of insight and change into helping you understand what they're actually thinking, what they're actually going through, what kind of incidents they're having, how they think about reliability, how they think about their OKRs, their quarterly planning.

09:53 Jones: And those are things that you need to understand when you're developing internal tools to help with the reliability of the system, you have to make an effort to understand those things. And so I think in a lot of those teams, there's weeks you need to dedicate throughout the quarter where you're heads-down coding, but there's also weeks you need to dedicate throughout the quarter where you're planning, where you're meeting with people, where you're watching them use the tool and not influencing them in a certain way.

10:17 Bryant: Developing the empathy.

10:18 Jones: Yeah. I think it would be so beneficial for SRE and internal tools type folks if their company has a user research department where their product has actually been researched, go sit in on one of those sessions. I think there's a lot to be learned from that. I've had the opportunity to sit in on them before when I was back at Jet.com, and they were fascinating, I definitely was taking notes all over the place. Because you're doing the same thing, you have a mini product inside your company that is impacting your business and your bottom line at the end of the day. So it's good to take in some of these practices.

10:49 How do engineers develop understanding and empathy for the users of the tools they create?

10:49 Bryant: Yeah, I think the double danger is, and I've made this mistake myself, is we are not the customers. We are developers, sure, you're a developer, yeah? But you're not necessarily the customer as you've kind of alluded to there. And that's that empathy it's ... I like your tip there, is in developing user research skills, understanding how to understand your customer.

11:06 Jones: Yeah, your customer is your fellow developers. And so I think a lot of time we almost assume that it's okay to not design in a certain way. I've certainly designed tools that look like a command prompt, and I'm, "But it does what it needs to do." Right? But it reaches us, even though we are making them for developers, they're not in our heads as we're making them. And so I think the design component is super useful in influencing the user to do the right thing.

11:31 Jones: Netflix had this mentality of not putting gates on anything. So every developer in the company had access to almost everything. And there was just a mentality with the tools to not limit the user on doing, I guess, what would be a silly action. And so we had to get super creative in helping them have enough context to not make that silly action, but ultimately not preventing them from doing it.

11:55 Jones: And I think that's a useful mentality I carried towards a lot of my tools from when I left Netflix too, is how do I help teach the user as they're using my tool? How do I help them get better at this? How do I help them think about their system differently so that they're not just inputting and outputting, but they're actually thinking critically about what they're doing.

12:16 Jones: And I think that is on the tool owner to help them do, because the person that's using your tool, they're using it once every week, once every two weeks. It all depends. You're thinking about your tool all the time, they're not. And so you need to help them get in that mindset and understand what that mindset is going to be with all these other external priorities that are happening on their team.

12:38 Could you explain the core similarities and differences between chaos engineering and resilience engineering?

12:38 Bryant: Very nice, Nora. Yeah, so keen to switch gears a little bit now, I'm definitely keen to dive into resilience engineering, incident management, incident analysis, and so forth. But first off I do think people, myself included, we often struggle around the differences of say, chaos engineering and resilience engineering. And I've had a couple of interesting chats with John Allspaw on Twitter, he's rightly corrected me on my articles I've written, but could you perhaps help us understand what are the core similarities and differences between this notion of chaos engineering and resilience engineering?

13:06 Jones: Absolutely. I've also ended up conflating them myself when I was doing this earlier on, and the more I learned, the more I was doing internally, I realized how important it was to understand the differences between them, but also how they're inextricably linked. So I recently published a book on chaos engineering with Casey Rosenthal and we had a number of contributing authors all throughout the software industry, and even outside of the software industry that were practicing chaos engineering. And I wrote a chapter in that book, chapter nine on creating foresight, and it kind of dives into these differences a little bit.

13:38 Jones: I like to share Sidney Dekker's definition of resilience engineering, where he says, "Resilience engineering is about identifying and then enhancing the positive capabilities of people in organizations that allow them to adapt effectively and safely under varying circumstances. Resilience is not about reducing negatives or errors." Right? And so it's not about lessening your error count, it's about enhancing the capabilities of people. And in order to do that, you have to understand those capabilities of people.

14:07 Jones: The resilience engineering framework and methodology certainly influenced chaos engineering and the principles behind them. There was a lot of research done, I know, by the Netflix team, before I even joined about resilience engineering and how to make chaos engineering work. But ultimately chaos engineering, a lot of it does focus on tooling right now. I think there needs to be a lot more focus on people and understanding the mental models of the folks, and ultimately help develop them, understand the differences between them, but what I've seen throughout my time doing chaos engineering and seeing it throughout the industry, I think there's so much focus on creating tools that inject failure.

14:47 Jones: And I see it like internally within companies, I see it like externally. And there's so much focus on that, and it's important to have tools that do that. You ultimately need something to do a chaos experiment, but it doesn't stop there. And so I don't see a lot of focus on what happens before you run a chaos experiment, and what happens after you run a chaos experiment. And what I find is a lot of that exercise is left up to the readers. There isn't a lot of material on how to do these portions.

15:19 Jones: And so companies ultimately end up focusing on what they know, which is they have this team of deep backend developers that build tools. They're awesome in that way, but there's not a lot of focus on the planning portion, and how to decide where to experiment, and how to decide what to experiment on, and how to decide which people to bring into the room, and how to decide when to run that experiment? Those are all scientific things that deserve a lot of forethought, that deserve a lot of time and material, but not a lot of time is focused there.

15:49 Jones: And then you have the during piece, which is you're injecting the failure, you're observing where it's being injected. And then you have the after piece, which is, okay the experiment finished, now what? How do we disseminate that information? How did we diff what we thought about the system versus what we learned? And so the chapter I have in my book kind of outlines the differences between these three areas. And again, they're all super linked. But I think as an industry, we have not placed very much focus on the before and the after.

16:22 Jones: And so I've seen chaos experiments get run with one person, just hand on the button, and then they share the results afterwards, which you miss out on all the benefits of chaos engineering, which is refining the mental model of all the people in the room. And the pushback I get when I suggest that sometimes is, "Well, it's really expensive to have all these people in the room, Nora. If we have a tool that does that, you can just tell people about it afterwards." But it's, what are you learning in that case? Right? Like what are you actually changing if only one person is seeing it?

16:53 Jones: And so I actually recommend including everyone that's using that piece of the system, I recommend including sister teams. And obviously not every chaos experiment needs to be that way, but it's helpful to have all those people in the room, and to have everyone kind of sharing what they think about the system prior to even doing an experiment. And if there is psychological safety in the organization, I honestly recommend having people that are newer to the team share their thoughts about the system first. Because that tells you how you're training people and how well you're giving knowledge to new folks.

17:27 Jones: When I was at Slack, we had a lot of systems that were being handed off between teams. They were reaching a critical point where they had a lot of employees joining, and systems that were previously unowned or owned by one person now had a team forming around them, but a lot of that team had only been at the company a year or so. And so I saw a couple incidents where you would hand all the knowledge to that team, but you wouldn't know where to start, you wouldn't know where to stop, and it's really hard to not have a framework around that. And something that you can use chaos engineering for is to hand off systems, is to share what you think they know about them, what they think they know about them, where they think they should go when an alert is happening, who they should notify when something's happening. And so it's really helpful in testing if you've done enough with that handoff too.

18:13 Jones: And again, it's really hard for the person that's doing the handing off to know if they've explained enough, because when you're an expert in something it's so hard to understand what you're an expert in, unless someone is specifically asking you those questions. Which is where the importance of a chaos engineering facilitator role comes in. This third party that is technical, but is not super nuanced in the system and can ask the silly questions, right? You need someone asking the silly questions, and you need them asking it to the right people.

18:44 Jones: And so I think that lack of focus in the industry, I've seen a lot of chaos engineering programs not last in organizations after a year or so, because ultimately it's a lot of work on the tools, and it's hard to understand the ROI that you're getting out of it. So one recommendation I have had that I've actually seen work is linking it to incident analysis. You can use learning from incidents and doing incident analysis across your organization to understand themes across your incidents, to understand things that are triggering them, to understand which teams are typically involved in certain incidents, to understand which people are getting pulled in, even when they're not on call. To understand which dashboards are frequently being looked at, to understand what systems in the organization generate a lot of interest when there's an incident.

19:32 Jones: Spending time understanding those socio-technical things beyond just alerts, or beyond just the technical parts of systems, actually helps you get more return on your investment with your chaos engineering too, but also it helps with a number of other things. It helps you with your quarterly planning. It helps you understand where to direct attention. It helps you understand if you need to add more people to a team. And so incident analysis is something that I've learned is not really about the incident anymore, the incident is a catalyst to helping you understand more about your system.

20:05 Jones: And you can use that knowledge to direct your chaos experiments, to understand like what's important to experiment on. And also when you do a chaos experiment like I've seen people find results from them, put them in a Jira backlog, and it stays there for seven months. And so it's like if we wrote down our learning, we disseminated them, but we focus way too much on fixing what we found, but we don't actually end up fixing it. And so I think focusing on incident analysis too as kind of a driver helps you understand how much you should prioritize those fixes.

20:40 Jones: I saw a team in an organization that I worked in, and they had run a chaos experiment and I helped them with it, where they injected some time in between calls to see if something would time out and to make sure that we would still be resilient to adding that additional time between these RPC calls. And they weren't resilient to it, but also they weren't sure how often something like that would happen. That was not data that we were giving them, right? And so it was a hunch they had, they were working off of a hunch, and it's hard to know how to prioritize fixing that without that hunch. And so I saw that team have that ticket that stayed on that Jira backlog for so long. At that point, I'm, "How much do I push on this in my role? How much do I push on getting them to fix it? Are they still getting value out of the chaos experiment?"

21:25 Jones: And after reflecting on it, I didn't think it was my job to push on them to fix it. I thought it was my job to help them have enough context to understand whether or not to fix it. And ultimately what I saw happen was that they brought a new hire onto the team and they had that new hire fixing some of the bugs that they found from their chaos experiments. And I think at first glance, that's a pretty intuitive thing to do. It's okay, that's a great way to understand the system is by fixing bugs, right? And so I understand how people could get there, but that's actually one of the worst things that you can do because they don't have an understanding of the system yet.

22:02 Bryant: Oh, interesting. No context, basically?

22:05 Jones: Yeah. They don't have context, pairing them with someone that's fixing those and that's working through how they got that way to begin with is an amazing idea. But giving that to a new hire, ultimately they're not going to do what you intend them to do because they don't have any expertise built up in that system yet, and they can't fully understand how it got to the way it was. And so I see a lot of teams doing that, starting off a new hire with bugs, but it's much better to pair them with them, it's much better to have them observing incidents rather than doing the work of fixing a bug that they don't have a full comprehension of how it got that way to begin with.
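Editor's sketch: the experiment Jones describes above, adding time between RPC calls to see whether a caller times out, could look roughly like the following in Python. This is a minimal illustration under stated assumptions, not the tooling her team used. The names `downstream_call`, `call_with_timeout`, and `run_experiment`, and all of the delay and timeout values, are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def downstream_call(injected_delay_s: float) -> str:
    """Hypothetical stand-in for an RPC; the sleep is the injected latency."""
    time.sleep(injected_delay_s)
    return "ok"


def call_with_timeout(injected_delay_s: float, timeout_s: float) -> str:
    """Report whether the call finished within the caller's timeout."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(downstream_call, injected_delay_s)
        try:
            return future.result(timeout=timeout_s)
        except FutureTimeout:
            return "timed out"


def run_experiment(delays_s, timeout_s):
    """Try each injected delay and record the outcome next to it."""
    return {delay: call_with_timeout(delay, timeout_s) for delay in delays_s}


if __name__ == "__main__":
    # Illustrative hypothesis: the caller tolerates a little added latency.
    for delay, outcome in run_experiment([0.0, 0.05, 0.3], timeout_s=0.2).items():
        print(f"injected {delay:.2f}s -> {outcome}")
```

A sketch like this only shows whether a given delay trips the timeout. It says nothing about how often such delays occur in production, which is the missing data Jones says made the fix hard to prioritize.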

22:42 How important is it to develop a mental model of a software system, both the people and technical aspects?

22:42 Bryant: Something I've heard you say a few times, Nora, which I'm learning as my career progresses, this notion of a mental model is really, really important, yeah? And there's definitely a mental model for your customers, but there's a mental model, I think for the system, and maybe when I say system, it's not just the technology, it's the people too, right?

22:56 Jones: Yeah. It's knowing who to page, you may page someone named Karen that was on the Consul team five years ago, and now has nothing to do with Consul, but you remember working with her on it. And that's the only reason you need to page her. That kind of stuff should get put in your incident report. That should get written down because you have this specific mental model of who could fix a thing. And what happens when she leaves the company? What happens when you leave the company? Do other people know to page her in those situations? Are we burning her out? Is she getting paged too much in those situations?

23:32 Jones: So taking time to, yeah, to understand the mental models of the system and the people involved is so important. I think we're getting a lot better at it, but I don't see us focusing as much on the people parts of system as we do on the technical parts of the system. And it's so important to focus on both. I'm not advocating only focusing on the people, or only focusing on the tech, but understanding how they work together and not leaving the people out of that equation, not leaving them out of the reporting on what triggered the incident.

24:03 Jones: And again, it involves developing psychological safety in your incidents, and in your organization, where it's okay to mention someone by name and know that they're not going to lose their job, know that they're not getting blamed for it. They're just getting mentioned by name because they were involved.

24:18 Could you explain the motivations for creating the learningfromincidents.io website?

24:18 Bryant: Yeah. And I think this is a nice segue, Nora. I was busily reading the last few weeks, learningfromincidents.io. I know you're heavily involved in that one, which I thought was fantastic. I think it's published, what? Back in 2019, late 2019?

24:30 Jones: Yeah, in December, 2019. Yeah.

24:33 Bryant: Yeah, so thoroughly enjoyed it, some great blog posts. All the names you kind of recognize, Ryan Kitchens I saw there, and a bunch of other folks I sort of chatted to on the podcast. I think one of the key things folks often ask with these things is why build a community around it? Because it looks like you're sort of forming a community of very interesting like-minded people, Jessica DeVita, I know John Allspaw was there, a bunch of folks I see you regularly talking and presenting with. What's the grand plan, I guess, with building the community?

24:55 Jones: Yeah. I'm happy to chat about it. So when I was at Netflix, I was on the chaos engineering team for a while. And as I mentioned, we were really focused on the during piece, right? We weren't focusing as much on the setup, and the hypothesis, and knowing where to experiment, knowing who to pull into the experiment, and knowing what context to give them, and also what to do with the results. And so my colleague at the time, Lorin Hochstein, who was on the chaos engineering team with me started researching incidents. And I was, "Oh, that's a really good catalyst for helping us prioritize our chaos experiments." And so I started getting really into it too.

25:30 Jones: And before you knew it, him and I were both on this other team that we had formed around themes around incident analysis, and we wanted to use it to give people context around the organization to help make their decisions. And it worked so well, we would output these reports that a ton of people would read. And we would comment on the triggers, we would comment on contributing factors, we would comment on the folks that were involved, we would comment on themes behind the incidents. And the more we did that, the more we could do themes across the incidents.

25:58 Jones: And so I remember I had outputted this report that did themes across some of the major and minor incidents in Netflix. And it really helped influence OKRs, and quarterly goals, and stuff like that. And him and I were getting really into it and we were working with a couple other people too. And I was, "This is so fun and so valuable in a number of ways," that that's when I started realizing it wasn't just about the incident. The incident was a catalyst to the org understanding how they think they work versus how they actually work, and understanding that diff.

26:29 Jones: And I thought, "Surely there's other people in other organizations doing this that we can talk to about this. It would be so fascinating to understand how other organizations are thinking about their incidents. I would love to see postmortems from others and have the ability to share them with each other in an environment that was safe, and it was just about learning from each other." And so I just kind of on a whim tweeted, "Would anyone be interested in a community around this?" And I got 200 DMs in a night. And I wasn't ... It was a lot. And so I was, "Okay, there's definitely a lot of interest here." But I wanted it to be a trusting environment where we could build up relationships with each other, and share postmortems with each other, and share different learnings, and cultural maybe issues in our organization that were preventing some of the learning from happening.

27:18 Jones: Because when you're mentioning people and when you're talking about the socio-technical systems, naturally political pressures are going to come up in organizations, and so it's super important to know how to navigate those. And so having a support community to talk about how that's happening is really important. And so that's sort of what it started evolving into. And so I had that Slack for about a year, and while I didn't want it to be a huge public community, we still invite folks and we build up relationships together. But the most important part is that it's a trusting environment and that we can share some of these things with each other.

27:49 Jones: But we were learning a lot, and so I wanted to open source a lot of those learnings so that we could open that up to the world. And that's when I deployed the LearningFromIncidents.io website that shared some of the learnings, that shared how folks were doing in their organization. And what I found a lot of with resilience engineering and incident analysis, sometimes it's so focused on papers and academia, and I really wanted LearningFromIncidents.io to be focused on actual stories where people were chopping the wood and carrying the water in their organizations, and talking about them. And so you'll see on that website, it's almost always real stories of how people are doing this in their orgs, rather than just talking about the academic papers. We referenced them certainly, but it was trying to be more industry focused and trying to be more focused on software itself as well.

28:39 Is there such a thing as a “root cause” with a production incident?

28:39 Bryant: Yeah, very nice, Nora. As I was looking through the blog post that popped out, one was the Classics and Notion of the Human Error. I don't know if you've got any opinions you want to share about that? I think we in the media, when we see these issues at Google, or Netflix, or pick a big company ... Actually Google and Netflix are bad examples, but some big companies and they go, "Oh, it was human error, I've got the root cause." And I was kind of curious on your take on what do you think around human error and root cause?

29:02 Jones: Let me start out by saying there's a lot of thoughts I have on these fields and how to make it better. And then there's also this topic of change management, within organizations, within the industry, where I really don't ever want to tell someone, "You're doing it wrong." Right? I want to meet them where they are today and help them kind of expand that thinking. And if I was starting out in an organization and I was helping them put together their incident analysis reports, helping them do postmortems, would I put the word root cause in there? Probably not.

29:35 Jones: If I was joining an organization that was using it, I don't think that is the battle I would fight first, right? I think there's a lot more to it than that, and I think you need to get people there on their own. Because if you just take it out of those reports, if you just tell them that there's not human error anymore, you're probably going to get a lot of pushback and you're probably not going to be super liked in your organization, which is really important when you're trying to introduce new means of doing things.

30:01 Jones: And so I think a lot of the problems that arise around using the terminology root cause and human error, I think much more of them are around human error, because it doesn't actually help make things psychologically safe in an organization. So I'll give you an example. I was in an incident once ... I was actually tapped to analyze the incident, but I wasn't a part of it. It was a major incident in the company, and I came in the next morning, it was super early in the morning. And one of the engineering leaders pulled me aside and was, "Nora, this was all human error." And I, "Oh, what do you mean by that?" And they were, "Well, Frank didn't know what he was doing, he got handed this new system and he didn't need to address that at 4:00 in the morning. And had someone a little bit more senior received that alert, they would have just waited until they were in the office the next day, they wouldn't have gone to sleep after addressing it."

30:55 Jones: And so that was the context I had, and that was like the narrative kind of being spread throughout the organization. You really can't control the narrative. Especially after a big incident, people are going to be talking about it and they're going to be forming their own stories. And so what I did was I went and talked to the Frank in that situation. I asked him to tell me about what happened. And naturally, Frank knows that everyone throughout the organization is really nice to him, but also subtly blaming him. And you imagine being a more junior engineer in this situation, it's kind of scary.

31:26 Jones: And so my role at that point was to help make him feel comfortable. It was just a one on one conversation, we grabbed a coffee, tried to keep it real casual. I was just trying to understand his mental model at the time, because if he did this, surely someone else was going to do it in the future, right? And so I think that's some of the problem with human error is that when we just label it like that, we don't actually expand on what was going through their head at the time, because it made sense to them. They weren't being careless, what they did made sense to them at the time.

31:56 Jones: And so what I found out through that interview was this engineer had been up all night dealing with another system's issue, for six hours. That issue had just finished around 3:00 in the morning, that person went to bed for 15 minutes. They got this alert for this completely separate system. And that alert essentially said, "A box was down." This person did what we all do and kicked the box. But, this particular box could not be kicked. And this was the first time this engineer had ever received an alert for this system, his team had been given this system six months ago, and it was the first time he received an alert around it.

32:34 Jones: And he had context around how to handle a system that was similar to this in which he would kick the box. But this particular thing was so nuanced that you weren't able to kick the box for this. And the person that handed off to him knew that. But naturally when he's not dealing with it, he's also super tired, he just didn't think about it and followed the instructions he knew to follow. Went to bed, woke up, everything was on fire.

32:58 Jones: Poor thing, right? But when you hear that side of the story, you're, "That makes total sense." And so I personally wouldn't label that as human error. I would think it's an issue with the system, and by system I mean the organization, how we hand off things. If he's on call for six hours dealing with something, why is he still on call for something else? Reroute those alerts to the secondary at that point.

33:20 Jones: So there were a lot of learnings that we had from that incident to help support people like Frank better, that I think if we had just labeled it as human error, the organization would not have evolved from that. It would have pinned a lot of the blame on Frank. It would have made people that made that move in the future kind of more afraid to share what they were working on. And so again, I don't necessarily like to break orgs from using the terminology that they're super comfortable with, but I like to help give them context on different things that they can do instead. And so I don't think the problem is necessarily with using the word root cause or human error, the problem is that doing something like that makes it stop there. When really the root cause was not human error in that situation, it was a lot of stuff, and all that stuff was important to mention in the report.

34:06 Could you recommend any good books for understanding resilience engineering?

34:06 Bryant: Yeah, good stuff Nora. I'll be honest, I could probably chat to you all day about this kind of stuff, because I think it's fascinating. I think it's a young industry as well, so I think ... Because I'm a classic enterprise developer, I'm building my mental model of the whole space, and I think many of the listeners of InfoQ are, but we've got a limited timeframe, so today I'm going to have to wrap it up. If folks want to learn more, can you give any references? And I can always put them in the show notes if you don't want to say them now, but is there any good jumping off material for folks to get more involved?

34:30 Jones: Yeah, absolutely. So LearningFromIncidents.io, we're trying to put a lot of material there, trying to open source a lot of learnings. And you'll see people from all throughout the industry, posting things on there. There's the chaos engineering book we just put out. I would also highly recommend reading The Field Guide to Understanding Human Error by Sidney Dekker. Yeah, it's a great book. Sidney Dekker was a pilot, and so you're going to see a lot of aviation examples, but you'll notice how relevant they are to the tech industry. So that's a good orienting point as well.

34:58 Jones: I also really liked the book Friendly Fire by Scott Snook. That's a great book on understanding how something came to be, something that kind of seemed like human error, all the nuances that went into it. And it's a really great book, it's a true story. Yeah, there's a lot of folks on Twitter that are talking about this stuff that are learning about it. And I want this space to really be approachable because I think there's a lot to learn, and there's a lot that we can implement in the software industry to make our chaos engineering better, to make our OKRs better, to make our teams more resilient. So yeah, I think that's what I would start with and I can also DM you some stuff as well.

35:35 Bryant: Awesome. Well thanks for your time today, Nora.

35:36 Jones: Yeah. Thank you, Daniel.

More about our podcasts

You can keep up-to-date with the podcasts via our RSS Feed, and they are available via SoundCloud, Apple Podcasts, Spotify, Overcast and YouTube. From this page you also have access to our recorded show notes. They all have clickable links that will take you directly to that part of the audio.
