Rethinking How the Industry Approaches Chaos Engineering

Summary

Nora Jones focuses on the Before and After phases of developing Chaos Engineering experiments (whether they be gamedays or driven by software) and develops important questions to ask in each of these phases. She digs into some of the Ironies of Automation present with Chaos Engineering today.

Bio

Nora Jones is a dedicated and driven technology leader and software engineer with a passion for people and reliable software, as well as the intersection between those two worlds. She believes that safety is pivotal with software development nowadays. She co-wrote two O’Reilly books on Chaos Engineering, and how a product’s availability can be improved through intentional failure experimentation.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Jones: Let's set some definitions first, on chaos engineering. Who has practiced chaos engineering before at their organization? Not a lot of hands. Has anyone run an informal game day type thing? On principlesofchaos.org, chaos engineering is defined as the discipline of experimenting on a distributed system in order to see how it withstands turbulent conditions in production. The idea is that the turbulent conditions are already there, and that chaos engineering is finding out more about them. It's been a bit misconstrued over the years. As with everything that gets popular in software, it eventually becomes a marketing term.

Different Phases of Chaos

What I want to talk about is these different phases of chaos. We have, what happens before we do some chaos engineering? What happens while we're doing chaos engineering? What happens after? Laura McGuire actually hinted at some of this idea called creating foresight in her talk. I'm going to take those approaches and apply that towards chaos engineering as well, because really doing chaos engineering is a way that we can create foresight and hone expertise within our organizations.

I want to point out that there is too much focus as an industry on the "during" part. It makes sense. We're engineers. We like tools. It's really cool to break things in production. It sounds fun, especially when we can do it safely. However, there's so much more to it than that, that we're not taking advantage of. Therefore, we're losing a lot of what we could be gaining from spending so much time on this tooling part. I think we need to just stop focusing all this attention on how we can make things fail and break them. Because we miss out on the value of the setup of the experiment and the dissemination of learning outcomes, which can benefit our organizations and help us level up overall.

This is a little bit of my background. I've been talking about chaos engineering for many years now. My mental model of chaos engineering has been refined and has grown since I first started doing it. I co-wrote a couple of books on it. I have a book coming out in April that I co-wrote with Casey Rosenthal, along with a number of contributing authors, where I spend one chapter talking a lot about what I'm going to talk about today. I also recently founded a community called Learning from Incidents in Software. I spent some time at Netflix, Slack, and jet.com developing and honing my chaos engineering experience.

I want to set some common ground. The goal with chaos engineering isn't to build tooling to stop issues or to find vulnerabilities for us. It's to push towards a culture of resilience to the unexpected. There's a number of ways to do this, chaos engineering is not the end-all be-all. Crystal touched on that a little bit in her talk. Tools can help with this. Sure. It doesn't stop there, just because we find things with our tools, and just because our tools work in surfacing vulnerabilities, what does that actually mean for our organizations? That's what we're going to talk about.

I want to share this quote from Sidney Dekker, David Woods, and Richard Cook. They say resilience is not about reducing errors. It's about enhancing the positive capabilities of people and organizations that allow them to adapt effectively and safely under pressure. Given so much focus on the tooling in the last couple of years, I think we've lost sight of this a little bit. We almost celebrate finding vulnerabilities. What I've seen happen is we don't actually always fix them. We don't actually do anything about that. We clap and then we move forward. It gets added on to a JIRA backlog, and it sits there for seven months causing us a lot of anxiety because we know it's in the background, but we don't actually know how to prioritize it.

I'm going to talk about some different phases of chaos engineering that I don't believe get enough attention today. Let's start with the before phase. Sharing this quick quote from George Box, he says, "All mental models are wrong, but some are useful." This track has talked a lot about mental models. It's talked a lot about mental model refinement. It's talked about how my mental model of the system is completely different than my teammate's mental model of the system, even if we're the only two working on this system.

Story Time - Chaos Automation Platform

I want to share a story. When I joined Netflix back in 2017, we were building some pretty cool stuff. We had written some white papers about it. It had been featured in conference talks. We had people rolling similar tooling. I want to give a quick overview on what we built so I can set the context for this. We had built this tool called ChAP. It was a chaos automation platform. The way it worked is it took advantage of the fact that Netflix services use a common set of Java libraries. We essentially annotated incoming request metadata to say that a particular call should be failed or have latency added to it, with RPC calls and what have you. Here is an example where we were failing a Netflix bookmark service. The hypothesis was, if bookmarks is down, if your ability to save a movie for later is down, that should not impact your ability to actually watch movies. That was always our key heartbeat metric. We would take a very small percentage of traffic, just enough to get a signal. We would split it in half. We would put half of it into a baseline cluster where we wouldn't do anything with it. Then we'd send the other half to a canary cluster, where we would either fail or add latency to the call. We were so confident in this that we did it in production. We knew that if it did impact this key metric, this heartbeat around play, we would automatically route these people back into the main cluster. That would happen within a matter of seconds. All a user would see, if anything, would be a blip.
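To make the traffic split concrete, here is a minimal sketch of sampling a small percentage of requests and dividing that sample evenly between a baseline and a canary cluster. It is not how ChAP was implemented; the hashing approach, the function name `assign_cluster`, and the cluster labels are assumptions made purely for illustration.

```python
import hashlib


def assign_cluster(request_id: str, sample_percent: float) -> str:
    """Return 'main', 'baseline', or 'canary' for a request (hypothetical helper).

    sample_percent is the share of all traffic pulled into the experiment,
    for example 1.0 for 1%. The sample is split evenly between the baseline
    cluster (untouched) and the canary cluster (failure or latency injected).
    """
    # Hash the request ID so the same request always lands in the same place.
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999, steps of 0.01%

    sampled_buckets = sample_percent * 100  # percent -> number of buckets
    if bucket >= sampled_buckets:
        return "main"  # not part of the experiment at all

    # Split the sampled traffic in half by bucket parity.
    return "baseline" if bucket % 2 == 0 else "canary"
```

Hashing keeps the assignment deterministic, so a given request ID stays in one cluster for the life of the experiment, which makes the baseline and canary metrics comparable.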

When I got to Netflix, a lot of this was already built, with developers spending probably 95% of their time heads down building this really great tooling, this really novel tooling. As Randall mentioned in his talk, please pop your heads up and talk to users. When you're on an internal tools team, you have users. Your users are your co-workers. Your users are other teams. I had this question, I said, who is running these chaos experiments? We built this great system. It could do all this stuff. We were talking about it at conferences. We were talking about it in books. We were able to safely fail in production. When I looked at the runs page, I saw that it was really just the four of us running these experiments, which is pretty wild, because we had expertise in this tooling and how ChAP worked. However, that was our expertise. We didn't have expertise in the nuances of the Netflix ecosystem. I didn't know the ins and outs of the bookmark service. How could I possibly come up with the best experiment there and disseminate what I learned to the bookmarks team? That is why the findings sit on a JIRA backlog for a while: there are two distinct sets of context. We had this runs page where you could see who was running our experiments. We were so proud of what we had built. We should have been. It was really fun work. What good is it if the only people that are really using it were us? We spent a little bit of time trying to get people to run experiments. We'd go sit with a team. We would talk through the tooling, and how to use it. I would watch people's faces get really scared, because we were telling a feature development team to fail their stuff in production using this tool we had created. We were four back-end developers, so the UI was essentially a giant form. One of the questions was how much production traffic do you want to fail? Apparently, people are scared of that question when that's just an open box.

What did we do? This was clearly a problem. We were the only ones running. We were mostly the only ones using this tool. It was very hard to get people to come up with experiments and use it. We attempted to solve this problem in a very common software engineer way. Does anyone have a guess on what we did?

Participant 1: Stickers.

Jones: Yes, but no. Any guesses?

Participant 2: Documentation?

Jones: We tried to automate it. We said, it's hard to run experiments. Takes a long time. We also just want to code. Surely, we can just spend a couple quarters automating the creation and the prioritization of all the experiments for them. Then they won't have to do it. We were pumped. This was some really fun work. I spent a long time doing this.

A Story on Automating Experiments

I started by collecting information about services from a ton of different sources in order to do this. These different sources were various systems across Netflix. They were also via one-on-one interviews with people. This was all in the service of putting this in this algorithm that I was working on. Some of the sources were getting the timeouts for services, getting the amount of retries, getting the fallbacks, getting the percentage of traffic it served. This was all in service of saying, is this actually safe to fail to begin with? If so, add it to this queue of experiments that need to be created, and create it in a certain way. Then prioritize it in a certain way. It turned out expertise on various phases of creating these experiments was all throughout the company. I had to talk to a ton of people, more than I had during my tenure there just by doing this project.
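As a rough illustration of how inputs like timeouts, retries, fallbacks, and traffic share could feed a "safe to fail" decision and a prioritized queue, here is a minimal sketch. The rule used here (a fallback exists and the worst-case retry time fits inside the caller's timeout) and every name in it are assumptions for illustration, not the algorithm that was actually built at Netflix.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Dependency:
    """Hypothetical record of what was collected about one dependency."""
    name: str
    has_fallback: bool
    timeout_ms: int          # timeout for a single call to the dependency
    retries: int             # retries attempted after the first call
    caller_timeout_ms: int   # how long the calling service is willing to wait
    traffic_share: float     # fraction of traffic that touches it, 0.0 to 1.0


def worst_case_ms(dep: Dependency) -> int:
    """Time spent if the first call and every retry all run to the timeout."""
    return dep.timeout_ms * (1 + dep.retries)


def is_safe_to_fail(dep: Dependency) -> bool:
    """Assumed rule: a fallback exists and retries fit in the caller's budget."""
    return dep.has_fallback and worst_case_ms(dep) <= dep.caller_timeout_ms


def experiment_queue(deps: List[Dependency]) -> List[Dependency]:
    """Keep only safe-to-fail dependencies, highest traffic share first."""
    safe = [d for d in deps if is_safe_to_fail(d)]
    return sorted(safe, key=lambda d: d.traffic_share, reverse=True)
```

Even in a toy version like this, the interesting output is often the list of dependencies that fail the check, since those are the ones a team expected to be safe.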

After doing all these interviews, and aggregating all this data, and putting it in an algorithm, I realized I now had this very specific expertise. I did not want it. I did not want that expertise. I didn't think it was actually going to be useful just in my head. I also had a question, if I am getting all this data in service of checking if something is safe to fail, and also how important it is to run an experiment on, surely, it would be useful for people to just see that information in one place. I was building an algorithm. I also just threw it up on a dashboard. This was not the goal of the tool. The goal of the tool was not to create this dashboard, but I figured it would actually be useful for people to see how we were ranking their experiments. We called this Monocle, we got cutesy. It was ChAP, and he had a Monocle, and see optics on experiments. I was very proud of that name. People could log in and see their application and their cluster, and all their services associated with their cluster.

All these columns that you see here are things that could potentially be added to a chaos experiment. The safe-to-fail X and check looks simplistic, but when I showed people their dashboards for the first time, almost every team immediately opened their laptops and started diving into their code: "Wait, that's not supposed to be unsafe to fail." That was the surprise. This UI was a direct result of going through the preparation of automating these chaos experiments. I hadn't even started doing it yet. One of the funniest parts was we turned this automation on. We only turned it on for things that were already safe to fail. We turned it off after two weeks, because what actually provided most of the value that we wanted was this dashboard. We also could not keep up with the experiment results. It turns out we were probably doing about four experiments a day before. When you do an experiment, you have to go look at the results, which takes a while. You have to synthesize them. You have to talk to the team about them. When we put it on automation, we were running about 200 a day, which meant we were looking through 200 experiments a day. When you have 200 experiments running, you find a lot of vulnerabilities. It turns out when you give a team 500 vulnerabilities, they usually fix zero. If you give them two, they might be more likely to fix them. I haven't been on this team in a couple of years. I don't know if they turned it back on. It was off the last time I checked.

I want to really drive this home because the most insightful part of doing this journey was not actually coming up with this automation, or coming up with these experiments, it was actually designing it. It was working with the teams. It was getting them to talk about their mental models. It was asking questions in valuable ways, and surfacing this information in a common form that everyone could look at. Everyone had different pieces of understanding on that dashboard, but they didn't have this developed view of how it all worked together. What I really want to drive home with this talk is we don't spend enough time here. We did write a white paper about Monocle. We wrote a paper about automating the experiments. We talked a lot about the algorithm. You can find it. It got accepted to ICSE. It was cool work. When I step back and look at the business value it was providing, it was really in that dashboard. It was really in surfacing and disseminating that expertise.

How Do We Design a Meaningful Experiment?

Not everyone is building or buying a tool like this internally, nor should you. Sometimes we're doing game days. How do we design a meaningful experiment if we're doing a game day? The advice I would give here is to focus on, the who, the what, the why, and the how, and have this be a collaborative process. All too often I see, with multiple companies, that they involve just the expert in the game day prep, in the game day room. They don't invite the new team members. They don't invite the intern that just joined. They don't invite all the people that are going to be on call because, why? It's expensive. They're missing out on a lot of the value.

One piece of advice I would have for this would be to pick a facilitator. In the incident analysis world, we talk a lot about having an incident review or a postmortem, done by a third party that has enough expertise, but wasn't quite in the action. I would recommend the same thing with chaos experiments. Pick a facilitator that has a technical background, but is not on the team. The reason being, is they're a layer removed and they can ask the silly questions. They can ask the questions that the team would not ask themselves.

What Do We Want To Experiment On?

A big question is, what do we want to experiment on? We know we want to do some of this chaos engineering. What do we want to run an experiment on? Involve all the people participating in the experiment in the preparation itself. A lot of times, I'll see just one-on-one for the prep. Then all of a sudden, 14 people will show up to the actual experiment. Because that part's cool. That's the part we nerd out on. That's the part where we're like, we're failing production. It can make a fun blog post. The participation part and the planning part is actually some of the most important steps.

When I was at Slack, I was leading up chaos engineering and human factors incident analysis type work there. This team came to me, and they said, "We want to come up with a chaos experiment on this new system that's being rolled out to prod. Can you help us come up with it?" I worked with them. I essentially facilitated for them. We spent about two weeks on the prep. The person that had told me, "We're ready for prod. This is just a gut check." Came up to me after two weeks, and said, "Nora, we're not rolling this out to production anymore. We're not even going to run the chaos experiment on this. Just going through this exercise, we found out that we are not ready." We did not even need to run the experiment. We got the value out of it.

There are some common questions I could ask as a facilitator to get them thinking about things in a certain way. These aren't applicable towards every type of scenario or every type of experiment. Some questions you might want to ask, if it's a services type chaos day, you could ask, are there downstream services that you're worried about in your system? Do you know what happens to upstream services if something goes wrong? Do you know what happens when something goes wrong with your services now? Do you have any fallbacks in place for things that are supposed to be non-critical? What do those fallbacks do? Who knows what those fallbacks do? Try to find that boundary for them. When you hit that boundary, figure out where that expertise lies, and bring them into the experiment too. Ask them how those fallbacks could actually impact the user experience. Ask them about recent changes in their system. It's funny, these seem like basic questions, and I've only really asked five of them. By this point, you usually hit a team's wall, and they're having to look up stuff. They're learning stuff as they're prepping for this game day that they're going to run. Some other questions you could ask are how confident are you in the configuration settings for your particular service? How confident are you being on call? How confident are you in any of your team members being on call? This has to be asked in judicious ways. There's also a psychological safety component. That's really important. Sometimes you may want to ask these questions one-on-one. I always also ask, what in so many words scares you about your service? I keep it purposely open-ended. Because the confidence or lack thereof that people possess is a sign of uncertainty about their own mental model, or perhaps a lack of confidence in the operating procedures.

The Apollo 1 Launch Rehearsal Test

I want to bring up this story about the Apollo 1 launch rehearsal test. I've heard the Apollo missions referenced so much throughout this conference so far, which is great. There are a lot of awesome case studies in them. Apollo 1 was meant to be the first crewed mission of the United States Apollo program, the project to land the first man on the moon. It was supposed to be a low-earth orbital test, launching on February 21, 1967. The mission never flew. A cabin fire took place during a launch rehearsal test, which killed all three crew members that were supposed to fly. The name Apollo 1, chosen by the crew, was made official by NASA in their honor afterwards. One of the things that was interesting about this test was that not all the necessary parties were involved. The thinking was, it's just a test. A lot of stuff went wrong. They didn't have rescue, medical assistance, or crew escape teams involved in the planning of this test. None of those people were notified because it was just a test. It went off the rails. The hatch got locked, and everyone inside passed away. Unfortunately, there was no one to call.

Novices vs. Experts

Flipping back to the design of chaos experiments from the story, you also want to include new employees. This is a good way to get them ramped up. It's also a good way to test how you are talking to new employees about your system and how you are disseminating that expertise and ramping them up. Samuel Parkinson gave a great talk about how they do this at the "Financial Times" with ramping people up to go on call. This is another good opportunity for that. I want to share this quote from Gary Klein, who's a cognitive psychologist. He wrote this book called, "Seeing the Invisible." He says the new employee can see the times when the system broke down but not the times when it worked. How many of you have had a new employee join your organization? They've just been aghast at a particular part of your system. You're like, "You're so cute. It was so much worse before." It is honestly adorable, really. They're coming up in arms like, how could it possibly be this way? As opposed to novices, experts can see what's not there. Including both in your experiments, while keeping psychological safety in mind and making sure it's ok for people to say what they know and what they don't know, will really help increase the ROI of these chaos experiments.

Start Looking At Your Incident Data and Make Some Observations

Another recommendation I would have, and I see a lot of organizations not do this, is to start looking at your incident data and making some observations around that, because that will reveal some themes around what you might want to experiment on. Laura McGuire gave a great talk about cost of coordination and software incidents. There's a lot of coordination costs involved in incidents that could indicate that expertise might need to be more disseminated throughout the organization. Or if you're pulling in people that aren't actually on-call for a thing, but you need them. You called someone Tim earlier, Tim always shows up. Do we actually record when Tim shows up, when he's not on-call? Because that's an indication that Tim is an island of knowledge. What happens when Tim leaves? Do we know how to do those incidents? I bring this up, because looking at your incident data will actually show you where you have gaps in your organization, which should prompt you to do chaos experiments there. Then you'll know how to prioritize those findings a little bit better, because you understand how important they are. If your incident data resembles the following, how many of y'all have worked at a place that has a spreadsheet with these columns? This is not actually going to reveal those gaps for you. You need to get a little deeper. You need to ask different questions about your incidents, in order to be ready to understand where to do chaos experiments on, in order to understand how to prioritize the findings, in order to understand who to get involved in the room.

You Can Derive More Meaningful Experiments with This Data

Here are some different questions that you can ask about your incident data that can point you towards opportunities for chaos engineering. Maybe we ask, which systems haven't failed in a while? Which systems had failures that took us by surprise? We always have incidents that we're like, how in the world did that happen? We also have incidents that we're like, that sounds about right. Which incidents involved un-owned systems or the ones that needed a Tim to step in and resolve? Are we recording that in that spreadsheet? Which incidents involved people that had never worked together before? Simon talked about how they resolved this wild issue. He brought up that part of the reason that they were able to work through it is that the 10 people involved knew each other really well. That's something that's not to be discounted when you're recording this data later. What happens when we have people that don't know each other really well? That adds to those coordination costs. That could point to a place where a chaos experiment might be a good fit to understand this a little bit better. What did near misses look like? Where did we go, "That was pretty close. I'm glad TechCrunch didn't catch that. I am glad that customer calls did not go off the rails"? Which ones involved difficulties in even figuring out what was going on? I have been in incidents where we're sitting for 14 hours trying to figure out what in the world is even happening. Then when we get to that point, it's easy to fix. Are we measuring that point? Because that also tells us something.
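A couple of these questions can be asked of incident records directly, provided the records capture more than the usual spreadsheet columns. The sketch below assumes a hypothetical `Incident` record with on-call and responder names plus time to diagnose and time to fix; none of these field names come from a real incident tool.

```python
from collections import Counter
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Incident:
    """Hypothetical incident record; field names are illustrative only."""
    system: str
    on_call: FrozenSet[str]      # people who were on call for the system
    responders: FrozenSet[str]   # everyone who actually helped
    hours_to_diagnose: float     # time until the team understood the problem
    hours_to_fix: float          # time from understanding to resolution


def off_call_responders(incidents: List[Incident]) -> Counter:
    """Count how often each person stepped in while not on call (the 'Tims')."""
    counts: Counter = Counter()
    for incident in incidents:
        counts.update(incident.responders - incident.on_call)
    return counts


def hard_to_diagnose(incidents: List[Incident], ratio: float = 2.0) -> List[Incident]:
    """Incidents where figuring out what was going on dominated the fix itself."""
    return [
        incident
        for incident in incidents
        if incident.hours_to_diagnose >= ratio * incident.hours_to_fix
    ]
```

People who show up repeatedly in `off_call_responders` hint at islands of knowledge, and systems that show up in `hard_to_diagnose` hint at where an experiment on observability might pay off. The `ratio` threshold is an arbitrary assumption.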

Resilient organizations don't take past successes as a reason for confidence. They instead use them as an opportunity to dig deeper and to understand how they work. I want to come back to this because how teams decide what to experiment on, how they use all this data, and all this prep, and all this setup work is just as revealing, if not more revealing than the experiment itself. I mentioned that story at Slack, where they said, we actually don't need to run this experiment. We're not going to release this when we thought we were. That's revealing. It's ok. It's a good thing. You're getting ROI. You're giving ROI back to the business.

Discuss Scope

When you get to the point where you do define a chaos experiment, and you decide, I'm actually ready to run this. We feel good about this. You want to discuss scope. You want to say, how do we define a normal or good operation? Where are we actually injecting the failure? We want to say, what do we expect to happen? What do we expect from individual components? What do we expect as a whole? How do we know if it's in a bad state? I see a lot of people do chaos experiments and just use their normal dashboards. Crystal talked about this in her talk, where she saw a dashboard once that had 50 tabs on it. She was like, I didn't even know where to look. I see that a lot, this dashboard hygiene problem. That is something you should actually reveal in your chaos experiments. If it's hard to figure out what is going on, that's a signal that maybe something should improve. With all this in mind, how are we actually observing the system? How are we limiting the blast radius if something goes off the rails? What is the perceived business value of the experiment? Let's tie this back to the users, whether it be our co-workers, whether it be our end users. What is the business value of this experiment? How does the rest of the organization perceive it? Is that different than how I perceive it? Is it different than how the team that's creating it perceives it? Then up next, hypothesize. Actually state what you think is going to happen. If we fail X part of the system, then Y will happen, and the impact will be Z.
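One way to keep those scope answers in one place is to write them down as a small structured record before the experiment runs. This is only a sketch; the class name `ExperimentPlan` and its fields are hypothetical and simply mirror the questions above (where the failure is injected, what is expected, what normal looks like, and how the blast radius is limited).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentPlan:
    """Hypothetical record of the scope discussion for one experiment."""
    failure_injected: str      # X: the part of the system we fail
    expected_behavior: str     # Y: what we think will happen
    expected_impact: str       # Z: the impact we expect to see
    steady_state_metric: str   # the metric that defines normal operation
    steady_state_low: float    # lower bound of normal for that metric
    steady_state_high: float   # upper bound of normal for that metric
    max_traffic_percent: float # blast radius limit

    def hypothesis(self) -> str:
        """State the hypothesis in the 'if X, then Y, impact Z' form."""
        return (
            f"If we fail {self.failure_injected}, then {self.expected_behavior}, "
            f"and the impact will be {self.expected_impact}."
        )

    def in_steady_state(self, value: float) -> bool:
        """True when an observed metric value is inside the agreed normal range."""
        return self.steady_state_low <= value <= self.steady_state_high


plan = ExperimentPlan(
    failure_injected="the bookmarks service",
    expected_behavior="playback falls back to starting from the beginning",
    expected_impact="no change in successful stream starts",
    steady_state_metric="stream starts per second",
    steady_state_low=950.0,
    steady_state_high=1050.0,
    max_traffic_percent=1.0,
)
print(plan.hypothesis())
```

The numbers in the example are made up. The useful part is that filling in each field forces the room to agree on an answer before anything is failed.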

The During Phase

We've set up our plan. We've involved all the parties. We move on to the "during" phase. When I was first getting started with chaos engineering, and I tell this story a lot, I was at a company called jet.com. Our marketing team was crushing it. We had a lot of traffic to the site. We were also going down a lot. We were spending all these marketing dollars, and yet, we were going down the first time people accessed the site. We created a Chaos Monkey. It was pretty simple. I asked folks if it was ok to turn it on in the organization. Everyone was like, yes, sure. It should be fine.

We started in QA; we had enough foresight to do that. I didn't get a lot of input from folks. We turned it on assuming that everything was going to be good, and actually took QA down for an entire week, which significantly impacted the business. Luckily, the chaos program survived past this. We improved from this. We learned how important it was to actually involve people in these experiments and ask certain questions. It was way more about the culture, and the people, and the setup, and the dissemination than it was about this tool. Although the tool was fun to create.

At Netflix we had graphs like this. I mentioned that baseline and canary cluster before. We would actually add canned graphs on to every experiment where a user could track the experiment, or the canary cluster versus the control, the baseline cluster. If those deviated too far on certain business metrics, the experiment would automatically shut down. We had built that sophistication into the system. It was important to look at this afterwards like, this thing clearly affected downloads, or this thing clearly affected license requests.
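The automatic shutdown idea can be sketched as a comparison between the baseline and canary series for one business metric. The 5% threshold, the requirement of three consecutive breaches, and the function names are assumptions chosen for illustration, not the values or logic used at Netflix.

```python
from typing import Sequence


def relative_deviation(baseline: float, canary: float) -> float:
    """Absolute difference between canary and baseline, relative to baseline."""
    if baseline == 0:
        return 0.0 if canary == 0 else float("inf")
    return abs(canary - baseline) / baseline


def should_abort(
    baseline_series: Sequence[float],
    canary_series: Sequence[float],
    max_deviation: float = 0.05,   # assumed tolerance: 5%
    consecutive: int = 3,          # assumed: abort after 3 breaches in a row
) -> bool:
    """Return True once the canary deviates too far for several samples in a row."""
    breaches = 0
    for baseline, canary in zip(baseline_series, canary_series):
        if relative_deviation(baseline, canary) > max_deviation:
            breaches += 1
            if breaches >= consecutive:
                return True
        else:
            breaches = 0  # a healthy sample resets the streak
    return False
```

Requiring several consecutive breaches is one way to avoid shutting an experiment down on a single noisy sample, at the cost of reacting a few samples later.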

Roles

If you are doing more of a game day type experiment, rather than a tool type experiment, it's good to have roles too. It's good to have specific roles for people. These can change a little bit and be a little bit fluid. Talking about this beforehand will help you set it up and be organized about it. Have a designer, the person leading the discussion. A facilitator, the person keeping everyone on track and asking those questions. The person executing the commands, actually, on the command prompt or whatever tool you have. A scribe taking notes on what's occurring in the room, and keeping everything on track there so that you can disseminate it afterwards. An observer that is looking at and sharing relevant graphs with the rest of the room. Like, "I see this thing or this thing." Then a correspondent whose pure job is to just talk to the folks that are on-call and the rest of the business to say, "We're running a chaos experiment right now." I've seen incidents come up while these are happening, and those two parties not talk. A lot of confusion has happened as a result. Having that be a particular person's role is also very important.

Steady State

The principlesofchaos.org also talks about steady state. It's so important to have a steady state in our experiments like, what is our normal operations? What's our good operation? What's standard? Not, what is our North Star? What's normal? Answering this question is actually hard. It's very interesting to ask it to different people in the room, because you might get different answers, which might be an indication that you should not run that experiment. Don't be lenient about this definition. Like, "We're all on the same page now. We can just do it." Because you've set up the tools at this point, and you're just excited to run it. You really should not be lenient about this definition. You should figure out how and why it is that these definitions deviated so much from person to person, or team to team.

I want to go back to that Apollo 1 story. The morning of the test, the crew suited up and detected a foul odor in the breathing oxygen which took about an hour to fix. Things were a little bit off. Then the communication system acted up. Shouting through the noise, Grissom vented, "How are we supposed to get to the moon if we can't talk between two or three buildings?" This was really sad. It was really devastating. There was a lot that went wrong that morning. They had invested a lot of time in this test. They had a production deadline on when they were going to get to the moon. I'm sure we've all been in those places before, with a lot less risk.

The After Phase

Let's go to the after, because this is one of my favorite phases too. This is where things relate to incident analysis. It's disseminating what we learned. Often, we find this vulnerability or we don't find anything at all. Then that knowledge just stays within the team that was in the room. It doesn't really go past that. We don't have systems that help us disseminate that knowledge, that help us tell these stories. One of my favorite quotes, John Allspaw says, "Resilience is the story of the outage that never happened." If resilience is the story of the outage that never happened, then what story are you actually telling? Who is there to hear it? Is this stuff you're keeping track of? It's really important. Who's reading this? Who's grokking it? How are you disseminating some of this? What good are chaos experiments if the only knowledge gained is actually from those in the room running the game day? If you're measuring who reads and hears and acts on these themes, maybe even correlates them to incidents. That's how you know you're successful and you're growing. It's a very tricky part of chaos engineering. The tools in the market today do not focus on this part. Because you also have to make it a story that people want to listen to. It takes time to develop good storytelling skills as an engineer. Build an internal brand around it.

We had one at Netflix that we called "How We Got Here." We used to do it with incidents. We would share a "how we got here" story. It would take the reader through a narrative that read like a ghost story, where you're watching the protagonist go into the house, and you're like, why are you doing this? Obviously, this is a horrible idea, but that keeps the reader interested. That's how we disseminated this expertise. We got everyone reading these. We had people on non-engineering teams reading these because they were interesting. We spent a lot of time on them.
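One way to keep track of who is reading these write-ups can be sketched in a few lines. This is only an illustration under stated assumptions, not a tool described in the talk: the `ReadershipLog` class, its method names, and the report, reader, and team names are all hypothetical. It records reads per report and summarizes how far a write-up traveled beyond the people who were in the room.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class ReadershipLog:
    """Hypothetical log of who read which experiment write-up."""

    # report id -> team name -> set of reader names
    reads: dict = field(
        default_factory=lambda: defaultdict(lambda: defaultdict(set))
    )

    def record_read(self, report_id: str, reader: str, team: str) -> None:
        """Note that a reader from a given team opened a report."""
        self.reads[report_id][team].add(reader)

    def reach(self, report_id: str, participants: set) -> dict:
        """Summarize readers who were NOT in the room for the experiment."""
        per_team = self.reads.get(report_id, {})
        outside = {
            team: sorted(readers - participants)
            for team, readers in per_team.items()
        }
        # Drop teams whose only readers were experiment participants.
        outside = {team: names for team, names in outside.items() if names}
        return {
            "teams_reached": sorted(outside),
            "outside_readers": sum(len(names) for names in outside.values()),
        }


# Illustrative data only: names and teams are made up.
log = ReadershipLog()
log.record_read("gameday-01", "ana", "payments")
log.record_read("gameday-01", "raj", "search")
log.record_read("gameday-01", "lee", "support")
summary = log.reach("gameday-01", participants={"ana"})
print(summary)
```

A count like this only says the story traveled. Whether people acted on it, or how it connects to later incidents, still has to come from talking to them.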

Forget Templates, Ask Questions Instead

Asking better questions after these things happen will lead to better stories too. Forget templates on this, you don't need a game day template. There are some things that you can ask like, what did you experiment on? Why'd you experiment on it? What were the reactions? It's a qualitative metric. Were people scared about this? Or some people are like, "That's fine." When I did that experiment at jet.com, there was a lack of concern. Then it led to this massive thing. Talking about that gap is really important. Because why did that gap exist to begin with? What was surprising? What was new? What mental models got recalibrated? Getting people in the room to share what new things they learned is really important. You can capture these things in reports. People get really excited about sharing like, "I didn't know it worked that way." Or, "I didn't know Joe knew this thing. How did he find that graph?" Those are really great things to mention. What were the necessary adaptations? Did things degrade or collapse at any point? When was the last time people worked together in an incident? Have these teams worked together before? Were you doing this experiment because they hadn't worked much together before? Or, maybe there were previous tensions between these two teams? Who was present? How did we assess ourselves? How did we decide that "this went well," or, "no, this went off the rails"? How long did it take to find the right graphs? This is the big one. Who knew?

I have been at a few organizations where, in an incident, someone would share something like Tim's magical graph. Everyone's like, "Tim, thank you so much." No one questions how he created this magical graph. It's just Tim and he's magical. We can't actually learn from that. Figuring out how Tim knew to look there, what he drew on from before, is really important. Looking at what actually went right in a chaos experiment will help us understand what went wrong. Don't just talk about things that went off the rails. Talk about what went really well. What was exciting? What surprised us in a very positive way, in both technical and human terms? The pursuit of success, what did we learn that we were actually really good at, is something Sidney Dekker talks about. This is where we can bridge chaos engineering and resilience engineering. They are very distinct things. We should focus more on what we learned we're really good at, not just finding these vulnerabilities. We can leverage chaos engineering to measure the sources of resilience and create feedback loops that enhance an organization's ability to monitor and revise its mental models, target safety experiments, and build expertise. Build and disseminate expertise.

In order to sustain a chaos program, I do think there needs to be an upfront investment in incident analysis. Otherwise, you're almost just over-indexing on the most recent bad thing that happened, the thing that sounds right today. If you invest deeply in incident analysis and looking at your themes, looking at how things are going, you will have a better understanding of where your gaps are.

These reports can actually be used as a cultural artifact. One of the things I was really surprised by at Slack was that these reports were read a lot by new hires, which was something I didn't expect. It helped them ramp up. It helped them get to know their teammates. It helped them get to know surprises. It helped them ramp up quicker and ask different questions than they would have otherwise.

Chaos Engineering In the Incident Space

I see a lot of chaos engineering, and the incident space in general, get measured by the count of incidents, which is a really shallow data point. It makes intuitive sense why our heads go there. It's a top-level metric. It's what the business pays attention to. There's so much more to it than that. It's not actually learning. Looking at these shallow metrics, we won't actually be able to improve. There's no possible way to attribute the reduction in incident count to chaos engineering experiments. There's so much that goes into it. Just stop counting incidents in general. I know that might be a hard sell for your business. There are a lot more ways that we can learn about how we're doing. Darrell Huff wrote this book in the 1950s. It is still super relevant today. It's called "How to Lie with Statistics." It outlines errors in the interpretation of statistics, and how these errors create incorrect conclusions, or these safety blankets for us, that lead us to believe we're performing much differently than we're actually performing.

Key Takeaways: Rethinking Approaches

I want to go back to this creating foresight. We've talked about a bunch of different phases of chaos today. You can't shortcut resiliency by buying or building a chaos engineering tool. You just can't. Safety should not fall on just ops or reliability-focused teams. You shouldn't give new employees vulnerabilities to go through and fix. I also see a lot of organizations where, when a vulnerability sits on their backlog for a while and they don't know when to fix it, they think it's a good idea to give it to a new hire when they join as a way to ramp them up. That's not a good idea either. Use these write-ups as learning opportunities. Disseminate them throughout the organization. Keep track of who's reading them, because that's really interesting. Connect all phases of chaos and include incident data as well. Purchasing a chaos engineering tool before you've invested in learning from incidents will actually hurt your chaos program. The chaos tools in the market today don't address these concerns.

Resources

If you want to chat more about some of this, you can reach a few of us at hello@jeli.io or on the Learning from Incidents website, where we've open sourced a lot of these learnings that a few of us have been going through for several years now.

Questions and Answers

Participant 3: You emphasize the importance of having an intern or junior engineer take part in this, and that the preparation for the experiment is one of the most valuable parts. Would you say then, if that intern or junior engineer says, I don't actually have a mental model. The docs, it's not clear where they are. Would it be useful to essentially do a chaos engineering thought experiment, write out the incident report after the fact with all the characters and everything?

Jones: As if it actually happened?

Participant 3: As if it actually happened as the way of onboarding before you have all these documents that somebody can come in, if we're introducing?

Jones: I focus a lot on inviting new employees or interns into the room as well, to get their mental models. What if they don't have a mental model yet? They might not. I'm guessing it's not their first or second day. In that case, it would probably be more of an observational role. For a newer employee that's been there a few weeks or a month, I would actually recommend asking them first, what are your thoughts on this system? What have you learned so far? And not having anyone else talk or chime in. That's when you're actually going to learn things about how they're grokking the system immediately upon joining. What are they afraid of? What is maybe not getting disseminated to them? How is that different, which it will be, from the more tenured folks in the room?

Did you have something else you wanted to add to that?

Participant 3: What's a good way to make that psychologically safe for them?

Jones: Exactly. That's a tough part. Psychological safety, during this, I could make a whole separate talk on that. It's a big topic and it's really important. At some organizations it's not psychologically safe to do that. I have been at organizations where it is. I understand that's a rarity. I've also been at organizations where it's not. In the organizations where it's not, it's really important to have a facilitator that is just trusted throughout the organization. Just someone that's not going to judge. Someone that's well regarded. Someone that people can come to, and have that person have a one-on-one with some of these newer engineers before they enter the room. Know that what they say is not going to be used against them. This is something that we do in incident analysis as well.

Recorded at:

Sep 30, 2020

Nora Jones
