Key Takeaways
- Complexity is inherent in all IT systems today. Many organizations try to fight or reduce it; however, it is more effective to embrace the complexity.
- With Chaos Engineering, you can better understand the sociotechnical boundary between humans and machines: you learn about the technical issues and complexities of your systems while also exposing knowledge gaps. Together, these help people understand the properties of their systems and respond better when future challenges arise.
- Organizations like Slack, Google, Microsoft, LinkedIn, and Capital One are navigating complexity with applied Chaos Engineering, improving the resilience and availability of their products and services.
- Chaos Engineering isn’t about breaking things; rather, it is about learning things. This is encapsulated in the Principles of Chaos, a set of practices centered on developing well-scoped, testable hypotheses that help identify weaknesses before they manifest in system-wide, aberrant behaviors.
- ROI and business justifications for Chaos Engineering range from simply increasing team learning and understanding, to building a full-fledged objective measure of the impact on business outcomes (such as KPIs or system performance metrics).
Advances in large-scale, distributed software systems are changing the game for software engineering. As an industry, we are quick to adopt practices that increase flexibility and feature velocity. An urgent question follows on the heels of these benefits: if we can move quickly, can we do so without breaking things?
Even when all of the individual services in a distributed system are functioning properly, the interactions between those services can cause unpredictable outcomes. Unpredictable outcomes, compounded by rare but disruptive real-world events that affect production environments, make these distributed systems inherently chaotic.
The book Chaos Engineering by Casey Rosenthal and Nora Jones explores how Chaos Engineering practices can be used to navigate complexity and build more reliable systems. It presents frameworks for thinking about complexity, proposes key practices for embracing it via Chaos Engineering, and shares case studies from companies that have applied Chaos Engineering to business-critical systems.
According to Rosenthal and Jones, businesses need to identify weaknesses before they manifest in system-wide, aberrant behaviors. Systemic weaknesses could take the form of improper fallback settings when a service is unavailable; retry storms from improperly tuned timeouts; outages when a downstream dependency receives too much traffic; cascading failures when a single point of failure crashes, etc. There is a business imperative and a customer expectation now to address these proactively, before they affect our customers in production. We need a way to manage the chaos inherent in these systems, take advantage of increasing flexibility and velocity, and have confidence in our production deployments despite the complexity that they represent.
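To make one of these failure modes concrete, here is a minimal, hypothetical Python sketch of how a retry storm develops and how capped exponential backoff with jitter avoids it; the function and parameter names are illustrative, not from the book.

```python
import random
import time


def call_dependency() -> str:
    """Stand-in for a downstream call; always times out to simulate an outage."""
    raise TimeoutError("dependency timed out")


def naive_retry(attempts: int = 10) -> str:
    # Retrying immediately on every failure: during an outage, every caller
    # does this at once, multiplying traffic into a retry storm.
    for _ in range(attempts):
        try:
            return call_dependency()
        except TimeoutError:
            continue
    raise RuntimeError("dependency unavailable")


def backoff_retry(attempts: int = 4, base: float = 0.1, cap: float = 2.0) -> str:
    # Capped exponential backoff with jitter spreads retries out over time,
    # so a struggling dependency is not buried under simultaneous retries.
    for attempt in range(attempts):
        try:
            return call_dependency()
        except TimeoutError:
            if attempt == attempts - 1:
                raise
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```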
An empirical, systems-based approach addresses the chaos in distributed systems at scale and builds confidence in the ability of those systems to withstand realistic conditions. We learn about the behavior of a distributed system by observing it during a controlled experiment. We call that Chaos Engineering.
InfoQ readers can download a free copy of the book Chaos Engineering (registration is required).
InfoQ interviewed Casey Rosenthal about the journey toward Chaos Engineering at Netflix, embracing complexity, what Chaos Engineering is and how it differs from testing, how companies apply Chaos Engineering, and building a business case for Chaos Engineering.
InfoQ: How did the journey toward Chaos Engineering start at Netflix?
Casey Rosenthal: Nora and I talk about this in the book. It was the culture of Netflix that allowed for Chaos Engineering to take root. To quote from the book, “Management [at Netflix] didn’t tell individual contributors (ICs) what to do; instead, they made sure that ICs understood the problems that needed to be solved. ICs then told management how they planned to solve those problems, and then they worked to solve them.” [from the chapter Introduction: Birth of Chaos]
Netflix’s move to the public cloud back in 2008 was rather bumpy. As the first large customer to wrestle with Auto Scaling on AWS, Netflix was finding all of the interesting bugs and edge cases. Netflix needed a mechanism to get ICs to build robust software that could handle vanishing instances in the cloud. This is where Chaos Monkey comes in. It would pick one instance at random from an autoscaling group and then, without warning, terminate that instance, and it did this during normal business hours, every single day. This put a clearly defined problem in front of the engineers, and they rose to the challenge.
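As a rough illustration of the idea (not Netflix’s actual Chaos Monkey code), a minimal Python sketch using boto3 might look like the following; it assumes AWS credentials are configured and that asg_name names a real Auto Scaling group, and a real tool would add scheduling windows, opt-outs, and other safeguards.

```python
import random

import boto3  # assumes AWS credentials and region are configured in the environment


def terminate_random_instance(asg_name: str) -> str:
    """Pick one instance at random from an Auto Scaling group and terminate it."""
    autoscaling = boto3.client("autoscaling")
    ec2 = boto3.client("ec2")

    groups = autoscaling.describe_auto_scaling_groups(
        AutoScalingGroupNames=[asg_name]
    )["AutoScalingGroups"]
    if not groups or not groups[0]["Instances"]:
        raise ValueError(f"no instances found in Auto Scaling group {asg_name!r}")

    victim = random.choice(groups[0]["Instances"])["InstanceId"]
    ec2.terminate_instances(InstanceIds=[victim])  # no warning, just like Chaos Monkey
    return victim
```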
That was the start of Chaos Engineering. Chaos Monkey itself quickly became an industry-wide phenomenon, garnering a lot of buzz in the early days of cloud and DevOps. Then around 2015, Netflix decided to get more formal with the practice. We created a dedicated Chaos Engineering team and sat down to come up with an actual definition and guidance on how to do it and do it well. This led to the published definition, Principles of Chaos Engineering, and the practice really took off from there.
InfoQ: In the book Chaos Engineering you suggest that we should embrace complexity. Why is that, and how can we do it?
Rosenthal: Complex systems are inevitable. That’s the short answer, but we can expand on it a bit. As humans, we deal with complexity every day, but the way we deal with it is to build mental models or abstractions of that complexity. In everyday life we navigate other complex systems such as automobile traffic, interactions with other people and animals, and even society itself. Decades of IT work have focused on making system models simple (e.g., the three-tier web app), and that works great when it is possible. For better or worse, the situations where that is possible are diminishing. We are entering a world where most software systems, and eventually nearly all of them, will be complex.
What do we mean by “complex”? In this case we mean that a system is complex if it is too large, and has too many moving parts, for any single human to mentally model it with predictive accuracy. Twenty years ago, I could write a content management system and basically understand all of its working parts. I could tell you, roughly, what effect a change to the performance of a query would have on the overall performance of the rest of the application, without having to actually try it. That is no longer the case.
But we shouldn’t bemoan the loss of those simpler systems. The ceiling for success is tied to the complexity of a system. Customers pay for features and solutions, and those both demand an increase in complexity. Nowadays, if you limit complexity, you are putting an artificial constraint on how successful your system can be.
So the solution is to embrace the complexity, and learn to navigate the properties and attributes of a complex system so that we can solve the business problems at hand without capping our ceiling for success. Chaos Engineering is one practice that focuses on building tools to enable this navigation.
InfoQ: How would you define Chaos Engineering? What purpose does it serve?
Rosenthal: Having a working definition was really important to me. As I was building the Chaos Engineering Team at Netflix, I sat down with the early members and wrote the Principles of Chaos Engineering where we share this definition in dozens of languages. The formal definition is:
Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production.
In the years since, my favorite definition has become:
Chaos Engineering is the facilitation of experiments to uncover systemic vulnerabilities.
InfoQ: How does Chaos Engineering differ from testing?
Rosenthal: Colloquially we might say we are “testing” a hypothesis in an experiment, but we go to great lengths to distinguish between testing and experimentation in the book. In our world, testing is an assertion of the valence of a known property of a system. In a test you want to know if X is true/false about your system. This is a great thing to know, and by all means we encourage testing.
For complex systems, that is insufficient though. Strictly speaking, tests do not generate new knowledge. We brought in a Popperian-inspired notion of experimentation to Chaos Engineering, where we say: we think X is how the world works. Then we set various things into motion to disprove that hypothesis. The harder it is to disprove, the more confidence we can have that X is in fact how the world works. But if we disprove it, then we have fundamentally learned a new thing about our system.
What we do with that knowledge is context-dependent. We might dig deeper, we might ignore it, we might even build a test around it to see if it happens again. The important thing is that we learned something new about our system that we didn’t previously expect.
Why is this distinction important in Chaos Engineering? Because testing, as defined above, places the burden on engineers to come up with assertions. Engineers can only write tests about properties that they know about. In complex systems, we already acknowledged that engineers can’t know how everything works. So this is a problem, because we put a responsibility on engineers that we’ve already assumed they can’t possibly meet.
Experimentation, on the other hand, matches complex systems perfectly, because it aligns a hypothesis with the purpose of the system along with our expectations and confidence. When those fall out of alignment (a disproved hypothesis), it generates a result that teaches us new things about the system, bringing them back into alignment.
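To make the distinction concrete, here is a minimal Python sketch (with stand-in functions of my own, not code from the book): a test asserts a known property, while an experiment states a steady-state hypothesis and then tries to disprove it under an injected fault.

```python
import contextlib
import random


def steady_state_metric() -> float:
    """Stand-in for a real observation, e.g. successful requests per second."""
    return 1000.0 + random.uniform(-10, 10)


@contextlib.contextmanager
def inject_fault():
    """Stand-in for real fault injection, e.g. adding latency to one dependency."""
    yield


def test_known_property() -> None:
    # A test asserts the valence of a property we already know to check for.
    assert steady_state_metric() > 0


def run_experiment(tolerance: float = 0.05) -> bool:
    # Hypothesis: under the injected fault, the steady-state metric stays
    # within `tolerance` of its baseline. We now try to disprove it.
    baseline = steady_state_metric()
    with inject_fault():
        during = steady_state_metric()
    deviation = abs(during - baseline) / baseline
    # A deviation beyond tolerance disproves the hypothesis -- and that is
    # when we have learned something new about the system.
    return deviation <= tolerance


if __name__ == "__main__":
    test_known_property()
    print("hypothesis held" if run_experiment() else "hypothesis disproved")
```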
InfoQ: How are companies applying Chaos Engineering?
Rosenthal: The book contains great stories from industry experts at Slack, Google, Microsoft, LinkedIn, and Capital One. It’s great to see how this is expanding, how quickly the practice is being adopted, as people are doing more and more with Chaos Engineering. Here are a few examples:
In 2018, Slack introduced what it called “Disasterpiece Theater,” a series of ongoing exercises to identify vulnerabilities to availability, correctness, controllability, observability, and security. Along with following best practices around planning, limiting blast radius, and creating thoughtful experiments with explicitly testable hypotheses, the team at Slack made the process fun and engaging, so other people at the company were intrigued, curious about being involved, and willing to dedicate some of their time to the exercises. They’ve run over 20 experiments, in some cases identifying serious vulnerabilities they were able to fix before they impacted customers.
Google’s DiRT (Disaster Recovery Testing) program was founded by site reliability engineers (SREs) in 2006 to intentionally instigate failures in critical technology systems and business processes in order to expose unaccounted-for risks. The engineers who championed the DiRT program made the key observation that analyzing emergencies in production becomes a whole lot easier when it is not actually an emergency. The experiments conducted by DiRT go well beyond technical outages into the realm of “people outages”; for example, what happens when a critical internal system starts going haywire but its senior developers are unreachable at a conference?
There is no way to identify all possible incidents in advance, much less cover them all with chaos experiments. There are just too many variables at play within any reasonably sized software system. To maximize the efficiency of their effort to reduce system failures, Microsoft prioritizes different classes of incidents, identifying the scenarios that matter most for their products and ensuring those are properly covered. They ask questions like “How often does this happen?”, “How likely is this to happen?”, and “How likely are you to deal with the event gracefully?” Once they’ve enumerated and prioritized their events, they consider the degree of variation to introduce into each experiment. Constraining variation to a single component or a single configuration setting can provide clarity about the setup and outcomes. On the other hand, introducing larger-scope changes in the system can be beneficial for learning more complex system behaviors.
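The book does not prescribe a specific scoring formula, but as a purely hypothetical Python sketch of how a team might rank incident classes using those three questions (all names and numbers below are invented for illustration, with higher scores meaning an event is more frequent, more likely, and handled less gracefully today):

```python
from dataclasses import dataclass


@dataclass
class IncidentClass:
    name: str
    frequency: int   # 1 = rare .. 5 = happens all the time
    likelihood: int  # 1 = unlikely to recur .. 5 = very likely to recur
    ungraceful: int  # 1 = handled gracefully today .. 5 = handled badly today


def priority(incident: IncidentClass) -> int:
    # Higher score = more valuable to cover with a chaos experiment first.
    return incident.frequency * incident.likelihood * incident.ungraceful


candidates = [
    IncidentClass("dependency latency spike", frequency=4, likelihood=4, ungraceful=3),
    IncidentClass("regional failover", frequency=1, likelihood=2, ungraceful=5),
    IncidentClass("bad configuration push", frequency=3, likelihood=4, ungraceful=4),
]

for incident in sorted(candidates, key=priority, reverse=True):
    print(f"{priority(incident):3d}  {incident.name}")
```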
LinkedIn takes the Chaos Engineering principle of minimizing blast radius very seriously, noting that, “Without our loyal users, we wouldn’t have systems to maintain, so we must put them first while carefully planning our experiments.” In particular, they suggest starting small when just getting started with Chaos Engineering and being as granular as possible, targeting the smallest possible unit in your system. Their request-level failure injection framework, LinkedOut, part of their Waterbear project, has led to significant changes in how engineers at LinkedIn write and experiment on their software; they are only comfortable using it because they know they can easily avoid wide impact to users.
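LinkedOut itself is LinkedIn-internal, so here is a purely hypothetical Python sketch of what request-level failure injection with a deliberately tiny blast radius can look like; the handler, field names, and rates are all illustrative assumptions, not LinkedOut’s API.

```python
import random
from typing import Callable, FrozenSet

Handler = Callable[[dict], dict]


def with_failure_injection(
    handler: Handler,
    fault_rate: float = 0.001,                    # sample only 0.1% of requests
    target_users: FrozenSet[str] = frozenset(),   # or only opted-in test accounts
) -> Handler:
    """Wrap a request handler so a narrowly scoped subset of requests fails."""
    def wrapped(request: dict) -> dict:
        targeted = request.get("user_id") in target_users
        sampled = random.random() < fault_rate
        if targeted or sampled:
            # Simulate a downstream dependency failing for this one request,
            # exercising the caller's fallback and degradation paths.
            return {"status": 503, "body": "injected failure"}
        return handler(request)
    return wrapped


def recommendations_handler(request: dict) -> dict:
    return {"status": 200, "body": f"recommendations for {request['user_id']}"}


handler = with_failure_injection(
    recommendations_handler, target_users=frozenset({"test-user-1"})
)
print(handler({"user_id": "test-user-1"}))  # always gets the injected failure
print(handler({"user_id": "member-42"}))    # almost always takes the normal path
```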
You might not think of a large national bank as an organization that would adopt Chaos Engineering practices, but Capital One is leading the charge for financial institutions in this domain. The motivation for implementing a Chaos Engineering program started with a need to fully understand all the factors (internal, external, application, hardware, integration points) that could affect service uptime. They called it “Blind Resiliency Testing.” In 2013, they started running exercises at regular intervals, once a month or after every release, with the intent of making sure the exercises were done and observed objectively. They started with a list of around 25 experiments, all done in a nonproduction environment.
Over the next few years, the number of experiments would increase, and the group started seeing tangible results: a smaller number of incident-inducing changes in production and the ability to react to external and hardware failures quickly.
Fast-forward to 2018. The same core API team is running its platform exclusively on the cloud, and the experiments, now called chaos experiments, run in the production environment. They use a homegrown chaos experiment tool that helps them schedule recurring experiments and trigger them automatically at the scheduled time and interval. They run every experiment in their nonproduction environment first, making sure all the failure points are discovered and fixed before moving everything into production.
InfoQ: How can we build a business case for Chaos Engineering?
Rosenthal: One of the more difficult aspects of running a successful Chaos Engineering practice is proving that the results have business value. Aside from the challenge of attributing improvements in say, uptime, to Chaos Engineering instead of some other factor, the benefits of Chaos Engineering tend to open the gates to other business pressures. If Chaos Engineering improves availability, chances are good that the business will respond by releasing features faster. That in turn will raise the bar for how difficult it is for those teams to navigate complexity, which in turn can make maintaining that level of availability even more difficult. Sometimes the value can be directly tied to a business outcome; sometimes it can’t. This often depends on how much effort can be put toward measuring this value, as well as the method for going about capturing the value.
In the book, we look in detail at the Kirkpatrick Model for demonstrating the ROI of Chaos Engineering, which is commonly used in educational settings like academia and corporate training programs to evaluate the effectiveness of a teaching or training program. It comprises four levels: Reaction, Learning, Transfer, and ultimately Results. If you have a very lightweight program and it does not cost the organization too much, then Level 1 (a positive reaction from the team involved) might suffice for assessing the ROI. On the opposite end of the scale, if you have a team of many engineers dedicated completely to the program, then you may need to pursue Level 4. You might need a way to classify incidents and show that certain classes of incidents are decreasing over time in frequency, duration, or depth. At Netflix, we got to Level 2, in that we were able to demonstrate learning about the impact of potential vulnerabilities (were they to actually show up in production) on streams per second (SPS) metrics.
Demonstrating the ROI of Chaos Engineering isn’t easy. In most cases, people will feel the value of experiments almost immediately, before they can articulate the value. If that feeling is sufficient to justify the program, there’s no point in pursuing a more objective measure. If one is required, I recommend people dig into the Kirkpatrick Model described in the book.
InfoQ: What have you learned from doing Chaos Engineering?
Rosenthal: First and foremost, I’ve learned that navigating complexity is hard. Seemingly easy or intuitive solutions often don’t suffice. In particular, I’ve learned that a few commonly held beliefs about increasing resilience are counterproductive.
Intuitively, it makes sense that adding redundancy to a system makes it safer. Unfortunately, experience shows us that this intuition is incorrect. Redundancy alone does not make a system safer, and in many cases it makes a system more likely to fail. Consider the redundant O-rings on the solid rocket booster of the Space Shuttle Challenger. Because of the secondary O-ring, engineers working on the solid rocket booster normalized over time the failure of the primary O-ring, allowing the Challenger to operate outside of specification, which ultimately contributed to the catastrophic failure in 1986.
Intuitively, it makes sense that removing complexity from a system would make it safer. Unfortunately, experience shows us that this intuition is incorrect. As we build a system, we can optimize for all sorts of things. One property we can optimize for is safety. In order to do that, we have to build things. If you remove complexity from a stable system, you risk removing the functionality that makes the system safe.
Intuitively, it makes sense that operating a system efficiently makes it safer. Unfortunately, experience shows us that this intuition is incorrect. Efficient systems are brittle. Allowance for inefficiency is a good thing. Inefficiencies allow a system to absorb shock and allow people to make decisions that could remediate failures that no one planned for.
In order to successfully improve your resilience, you need to understand the interplay between the human elements who authorize, fund, observe, build, operate, maintain, and make demands of the system and the technical components that constitute the technical system. With Chaos Engineering you can better understand the sociotechnical boundary between the humans and the machines; you can discover the safety margin between your current position and catastrophic failure; you can improve the reversibility of your architecture. With Chaos Engineering, you can fundamentally improve the qualities of your sociotechnical system that support the value of your organization or business.
Tools don’t create resilience. People do. But tools can help. And that’s in part what led me to found Verica, to help other people use these tools in their organizations. I wanted to help other people besides myself and Netflix, and make this work available to others.
InfoQ: What made you decide to write the book Chaos Engineering?
Rosenthal: In 2017 we published a Chaos Engineering report with O’Reilly. It was one of their most popular reports. As I gave presentations around the world, both at public conferences and within private organizations, it became clear that there was a desire for a more comprehensive, canonical source on the subject. In particular, we wanted to add case studies from people outside of Netflix who have experience running Chaos Engineering programs at scale. We dedicated Part 2 of the book to those contributors.
InfoQ: For whom is this book intended?
Rosenthal: This book is intended for practitioners who are trying to build more robust systems. When people first hear of Chaos Engineering they typically have two reactions: first, that it sounds amazing and makes sense as a methodology for improving system reliability; second, doubt that they can get it to work at their own company.
That second part often depends on how they are introduced to the topic. If they hear about “breaking things in production” at Silicon Valley tech companies, then yes, they are probably right. That won’t work for most people. This book helps reframe that introduction and the conversations around the practice.
Chaos Engineering works for companies of all sizes across countless industries. For people who build, maintain, and operate complex IT and software systems, we want them to know there is a proven, proactive method to improve availability and security. Drawing on the decades of experience of practitioners at Google, LinkedIn, Capital One, Slack, and Microsoft, this is a book written by doers for doers.
About the Book Author
Casey Rosenthal is CEO and co-founder of Verica; formerly he was engineering manager of the Chaos Engineering Team at Netflix. He has an extensive background in high-availability systems and experience with distributed systems, artificial intelligence, translating novel algorithms and academia into working models, and selling a vision of the possible to clients and colleagues alike. His superpower is transforming misaligned teams into high-performance teams, and his personal mission is to help people see that something different, something better, is possible. For fun, he models human behavior using personality profiles in Ruby, Erlang, Elixir, and Prolog.