Netflix's Chaos Engineering to Advance Failure Injection

"Chaos Engineering", a term recently coined by Netflix, is an umbrella covering all of Netflix's controlled failure injection activities. Bruce Wong, Engineering Manager of Chaos Engineering at Netflix, wrote about what Chaos Engineering is, what it aims to achieve, and the roadmap for getting there. InfoQ reached out to Bruce to learn more.

Though the term "Chaos Engineering" is new, Bruce emphasises that it's not "a new discipline, but rather a bigger investment in our established discipline and philosophy [injecting failure into production to ensure fault-tolerant systems - Ed.]". In the end, Netflix wants to make the discipline of "chaos testing for regression in distributed systems at scale" as well understood as regular regression testing. This bigger investment reflects Netflix's need to handle several different challenges.

Netflix keeps growing along with the Internet, and so does the scale at which it must operate. As Bruce puts it:

As the Internet itself continues to expand, web-scale also expands. Building and running distributed systems at scale is still relatively in its infancy. Building for scale is never just done especially when our user base and complexity is still expanding. While there's a lot of research and understanding of how distributed systems fail, we're still growing in our understanding and evolving technology, which creates new fascinating failure modes to account for. Chaos testing highlights the inability to accurately test resilience at scale, and inability to simulate "the Internet".

Netflix feels that Amazon Web Services (AWS), the cloud ecosystem on which it runs, is getting more reliable over time:

AWS being more reliable actually creates the need to proactively inject failure. If the environment and underlying infrastructure weren't as stable as it is, that would serve as a forcing function for our software to be resilient. But because AWS is very stable, it doesn't keep fault tolerance top of mind.

AWS is not so stable as to avoid "once in a blue moon" failures, though. Bruce points to a notorious example, the Christmas Eve 2012 regional ELB outage, to illustrate this kind of failure. This specific failure prompted Netflix to invest in a multi-region Active-Active infrastructure. Each of these events "reinforces [Netflix's] commitment to chaos [failure injection - Ed.]" and represents a scenario that Chaos Engineering aims to proactively find and mitigate.

Netflix devised a three-pronged approach for Chaos Engineering's high-level plan: establish virtuous chaos cycles; increase use of reliability design patterns; anticipate future failure modes.

Virtuous Chaos Cycles

Netflix finds that it needs more tooling to support these cycles. Bruce sees scope to enhance the Simian Army, which has helped Netflix in the resiliency space:

There are a few key areas where we want to invest more. Our investment should give us finer grain control to build confidence between say chaos monkey [randomly shuts down production instances - Ed.] and chaos gorilla [simulates an outage of an entire Amazon availability zone - Ed.]. As we build that confidence we're able to increase the frequency of chaos that we inject that reduces the chance of drift between simulations. By reducing that chance of drift, this also allows us to increase the speed and rate of innovation safely without trading off availability.

As our infrastructure grows in complexity and scale, we constantly come up with new ideas for monkeys to help out. Hack days for example are in a lot of ways a monkey incubator for future monkeys.

Netflix also practices blameless post-mortems with follow-up action items to prevent recurrence of previous failure modes.

Reliability Design Patterns

Given the microservices architecture espoused by Netflix, consistent handling of failure and graceful degradation of service are a must. According to Bruce, this is one of the areas where Netflix is looking for breakthroughs:

New forms of Chaos and Reliability Design Patterns are two ways we are researching at Chaos Engineering. As we get deeper into our research we will continue to post our findings.

For instance, Circuit Breakers, as implemented in Hystrix, were a "game changer when [Netflix] rolled this out", says Bruce.
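
To make the circuit breaker pattern concrete, here is a minimal sketch in Python. It is not Hystrix's API or implementation (Hystrix is a Java library with a much richer model); the class name, the `failure_threshold` and `reset_timeout` parameters, and the injectable `clock` are all assumptions made for illustration. The idea it shows is the general one: after repeated failures the breaker "opens" and callers get a fallback immediately instead of waiting on a failing dependency, and after a cool-down a single trial call decides whether to close the circuit again.

```python
import time


class CircuitBreaker:
    """Illustrative circuit breaker; names and defaults are hypothetical."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to stay open before a trial call
        self.clock = clock                          # injectable so the timeout can be tested
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def call(self, func, fallback):
        """Run func, or return fallback() when the circuit is open or func fails."""
        was_open = self.opened_at is not None
        if was_open and self.clock() - self.opened_at < self.reset_timeout:
            # Open: fail fast without touching the dependency.
            return fallback()
        # Closed, or the timeout has elapsed and one trial call is allowed (half-open).
        try:
            result = func()
        except Exception:
            self.failures += 1
            if was_open or self.failures >= self.failure_threshold:
                # Open the circuit, or re-open it after a failed trial call.
                self.opened_at = self.clock()
            return fallback()
        # Success: close the circuit and reset the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each remote call, for example `breaker.call(fetch_recommendations, lambda: [])`, where `fetch_recommendations` is a hypothetical function and the empty list stands in for a degraded but still usable response.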

Anticipate Future Failure Modes

Web-scale is still very much uncharted territory. In an ideal world, distributed systems would be so fault-tolerant that they would never fail. That world doesn't exist, and AWS's increasing reliability can instill complacency. Netflix counters this side effect by proactively searching for new failure modes. Bruce explains how Netflix finds them:

There's a few ways that we identify these. The first and easiest is experience. When we experience a "once in a blue moon" situation we always explore how to make sure we can withstand it. We don't rely on hope. The AWS ELB Xmas Eve 2012 outage, for example, was a blue moon we experienced.

Our strategy around complex failures is to make it possible and safe to simulate and test in a micro smaller scale chaos first. This helps us build confidence in running our larger macro scale chaos events. For example we started with Chaos Monkey before attempting Chaos Gorilla, and likewise for Chaos Kong [simulates the shutdown of entire AWS regions - Ed.].
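
The escalation Bruce describes, from a single instance to a zone to a region, can be sketched as a target-selection function whose blast radius grows with a `scope` argument. This is only an illustration and not how the Simian Army is implemented: the `Instance` type, the `pick_targets` function, and the scope names are hypothetical, and the sketch only selects targets from an in-memory list without calling any AWS API.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    """Hypothetical description of one running instance."""
    instance_id: str
    zone: str
    region: str


def pick_targets(fleet, scope, rng):
    """Return the instances a chaos run at the given scope would take down.

    "instance" mimics Chaos Monkey (one random instance), "zone" mimics
    Chaos Gorilla (a whole availability zone), and "region" mimics
    Chaos Kong (a whole region).
    """
    if not fleet:
        return []
    # Pick one instance at random and widen the selection around it.
    # Note: this makes larger zones and regions more likely to be chosen.
    victim = rng.choice(fleet)
    if scope == "instance":
        return [victim]
    if scope == "zone":
        return [i for i in fleet if i.zone == victim.zone]
    if scope == "region":
        return [i for i in fleet if i.region == victim.region]
    raise ValueError(f"unknown scope: {scope}")
```

Passing a seeded `random.Random` as `rng` makes a run reproducible, which is useful when a small-scale experiment needs to be repeated before moving on to a larger scope.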
