Designing Services for Resilience: Nora Jones Discusses Netflix Chaos Engineering at QCon SF

At QCon San Francisco, Nora Jones presented "Designing Services for Resilience Experiments: Lessons from Netflix". Key takeaways from the talk included: engineers should not lose sight of the company's customers and the experience they are having; designing for resiliency testability is a shared responsibility; configuration changes can cause outages; and engineers should have explicit monitoring in place to detect antipatterns in configuration changes.

Jones, a senior chaos engineer at Netflix, began the talk by exploring how teams can design services for resilience or "chaos" testing. Concepts discussed included building services that support Failure Injection Testing, ensuring service-to-service communication is conducted via an RPC framework, implementing RPC call fallback paths and ways to discover them, implementing proper monitoring -- including key business metrics -- and enabling proper timeouts and ways to discover them.

There are well-accepted software development methodologies for increasing confidence in system resilience, such as unit and integration testing, but the nascent technique of chaos experimentation is also highly valuable -- particularly when building complex distributed systems such as a microservices-based application.

Chaos Engineering: Netflix ChAP

Over the previous two years the Netflix Failure Injection Testing framework has evolved into ChAP: Chaos Automation Platform. This platform enables chaos engineers at Netflix to automate resilience experimentation by splitting ingress traffic of the service under test between the existing service API, a control service API, and a chaos experiment API. The amounts of traffic sent to the control and experiment APIs are deliberately kept small and equal, as this enables direct comparison of monitoring outputs and key business metrics between the two (such as the number of Netflix customer "streams per second"). If a large divergence is detected between the control and experiment, then the experiment can be "shorted" and stopped, which reduces the risk of customer-facing impact.
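The traffic split and the "shorting" rule described above can be sketched in a few lines of Python. This is only an illustration and not ChAP's implementation: the function names, the 1% traffic fraction, and the 5% divergence threshold are all assumptions, since the talk report gives no concrete values.

```python
import hashlib

# Hypothetical values: the article states no percentages or thresholds.
EXPERIMENT_FRACTION = 0.01  # share of traffic sent to each of control and experiment
SHORT_THRESHOLD = 0.05      # relative drop in the business metric that stops the run


def assign_cluster(request_id: str, fraction: float = EXPERIMENT_FRACTION) -> str:
    """Deterministically map a request to 'control', 'experiment' or 'baseline'."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Turn the first 4 bytes of the hash into a number in [0, 1).
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    if bucket < fraction:
        return "control"
    if bucket < 2 * fraction:
        return "experiment"
    return "baseline"


def should_short(control_sps: float, experiment_sps: float,
                 threshold: float = SHORT_THRESHOLD) -> bool:
    """Return True when the experiment's metric falls too far below the control's."""
    if control_sps <= 0:
        return False  # nothing to compare against yet
    relative_drop = (control_sps - experiment_sps) / control_sps
    return relative_drop > threshold
```

Because the control and experiment groups receive the same fraction of traffic, a metric such as streams per second can be compared between them directly, without normalising for group size.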

Jones introduced a sample skeleton failure injection library written in F#, and guided the audience through the implementation. Two types of failure injections were presented for engineers looking to get started with chaos experimentation: fail with an exception, and the introduction of latency. Engineers can create a hypothesis, design and run an experiment, and monitor the metrics required to prove (or not) the hypothesis.
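The two injection types mentioned above can be illustrated with a small decorator. The library shown in the talk was written in F#; the sketch below is in Python and is not a port of it. The names `inject_failure` and `InjectedFailure`, and the `probability` and `delay_seconds` parameters, are hypothetical.

```python
import functools
import random
import time


class InjectedFailure(Exception):
    """Raised in place of the real call when an exception failure is injected."""


def inject_failure(mode, probability=1.0, delay_seconds=0.0, rng=None):
    """Wrap a function so that some calls fail or are delayed.

    mode is "exception" (raise instead of calling) or "latency"
    (sleep for delay_seconds, then call as normal).
    """
    if mode not in ("exception", "latency"):
        raise ValueError(f"unknown failure mode: {mode!r}")
    rng = rng or random.Random()

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # rng.random() is in [0, 1), so probability=1.0 always injects.
            if rng.random() < probability:
                if mode == "exception":
                    raise InjectedFailure(f"injected failure in {func.__name__}")
                time.sleep(delay_seconds)
            return func(*args, **kwargs)
        return wrapper

    return decorator
```

An experiment might wrap one RPC client method with `@inject_failure("latency", delay_seconds=0.5)` and then watch whether the chosen business metric in the experiment group stays in line with the control group.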

Good monitoring is an essential part of ensuring resilience: it should cover not only the observability of system status, but also configuration changes. A hypothesis was presented that configuration changes can be more dangerous than code changes. System configuration such as circuit breaker fallbacks, timeouts, and retries must be visible and monitored from a single place.

The Netflix team use Hystrix for RPC circuit-breaking within their system, and the fallback strategies that are available for non-critical services include: static content, cached (potentially stale) data, or a fallback service. Jones cautioned that developers should be aware of global and local timeout strategies and configuration, and that immediately retrying a failed RPC call is usually not a good idea. Understanding the interaction between the timeouts and retry configuration is also important.
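As a rough illustration of these ideas, the following Python sketch shows an ordered fallback chain, retries that wait with exponential backoff and jitter rather than retrying immediately, and a simple calculation of how timeouts and retries combine. It does not use or imitate the Hystrix API (Hystrix is a Java library), and all function names and default values are assumptions.

```python
import random
import time


def call_with_fallbacks(primary, fallbacks):
    """Try the primary call, then each fallback in order (for example cached
    data, then static content). Re-raise the last error if every option fails."""
    last_error = None
    for attempt in [primary, *fallbacks]:
        try:
            return attempt()
        except Exception as error:  # broad on purpose: this is only a sketch
            last_error = error
    raise last_error


def backoff_delays(retries, base_seconds=0.1, cap_seconds=2.0, rng=None):
    """One randomised delay per retry, with an exponentially growing upper bound."""
    rng = rng or random.Random()
    return [rng.uniform(0.0, min(cap_seconds, base_seconds * 2 ** n))
            for n in range(retries)]


def call_with_retries(func, retries=2, sleep=time.sleep, **backoff):
    """Call func, waiting a jittered delay before each retry."""
    for delay in backoff_delays(retries, **backoff):
        try:
            return func()
        except Exception:
            sleep(delay)
    return func()  # final attempt; any error propagates to the caller


def worst_case_seconds(timeout_seconds, retries, base_seconds=0.1, cap_seconds=2.0):
    """Upper bound on elapsed time if every attempt times out."""
    max_delays = sum(min(cap_seconds, base_seconds * 2 ** n) for n in range(retries))
    return (retries + 1) * timeout_seconds + max_delays
```

With a 1 second timeout and 2 retries, `worst_case_seconds(1.0, 2)` comes to about 3.3 seconds (three attempts plus delays of up to 0.1 and 0.2 seconds). If the calling service's own timeout is shorter than that, the later retries can never help, which is one way a timeout and a retry setting can conflict.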

The ChAP platform has a "Monocle" dashboard component that shows core information on fallbacks, timeouts and retries, and when this system was first implemented, the global view of this information across the Netflix stack allowed inappropriate (or conflicting) resilience configurations to be easily identified. A "criticality score" was also defined, which allowed the chaos engineering team to calculate and prioritise fixes for services with a high number of requests per second, retries and RPC calls with no fallback.
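The article does not give the formula behind the criticality score or describe how Monocle detects conflicting settings, so the Python sketch below is purely hypothetical. It assumes a score that grows with request rate and retry count and is doubled when a call has no fallback, and it flags any call whose combined attempts could outlast the caller's own timeout. The `RpcConfig` fields and the weighting are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RpcConfig:
    """Resilience settings for one outbound RPC call (field names are illustrative)."""
    name: str
    requests_per_second: float
    timeout_seconds: float
    retries: int
    has_fallback: bool


def find_conflicts(caller_timeout_seconds, calls):
    """Names of calls whose attempts, taken together, could outlast the caller's timeout."""
    return [c.name for c in calls
            if c.timeout_seconds * (c.retries + 1) > caller_timeout_seconds]


def criticality_score(call):
    """Hypothetical score: traffic scaled by attempts, doubled when no fallback exists."""
    fallback_weight = 1.0 if call.has_fallback else 2.0
    return call.requests_per_second * (1 + call.retries) * fallback_weight


def prioritise(calls):
    """Order calls from most to least critical."""
    return sorted(calls, key=criticality_score, reverse=True)
```

Collecting every call's settings into one structure like this is what makes a global check possible: a timeout that looks reasonable in a single service's configuration may only reveal itself as a conflict when compared with its caller's.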

Don't lose sight of your customer

Jones concluded the talk by sharing several success stories of the chaos engineering team's efforts and automation from other Netflix internal teams, stating that production incidents were avoided, and other undesired side-effects were identified and fixed before deploying the service in production. A key message was reiterated several times during the talk: don't lose sight of your company's customers. Resilience testing is one part of Netflix's overall approach to ensuring a consistently excellent customer experience.

The slides for Nora Jones' talk "Designing Services for Resilience Experiments: Lessons from Netflix" (PDF, 3MB) can be found on the QCon website, and the video will be made available on InfoQ over the coming months.
