Choose Your Own Adventure: Chaos Engineering at QCon New York 2017

Nora Jones, senior chaos engineer at Netflix, talked about chaos engineering at QCon New York 2017. She presented different stages of chaos engineering adoption and shared stories from her previous experiences at Jet and Netflix.

Jones starts by explaining the rationale behind chaos engineering. Chaos engineering is about embracing the fact that failure is inevitable and will happen sooner or later. As Jones puts it: "Computers are complicated and they will break."

The established way of ensuring availability is through different testing methods: unit testing, regression testing and integration testing. Chaos engineering adds another level. Some consider that only one of the two groups is required, while Jones states that both are required to achieve the highest levels of availability.

Chaos engineering is about making the system stronger through experiments. One early tool embracing the philosophy is the widely known Chaos Monkey, an automated tool designed to test resiliency by shutting down servers. While Chaos Monkey is several years old, chaos engineering as a discipline has only recently emerged.

Jones presents five phases for introducing chaos inside an organization. Each phase refines the previous one, covering more specific scenarios and introducing new tooling.

The first phase is the initial introduction of chaos. Jones points out that introducing chaos in an organization where there is already chaos is difficult. Chaos induced from experimentation becomes indistinguishable from chaos coming from actual issues and outages. Thus a steady state is required as a starting point.

The question then becomes: how to start with chaos? Jones recommends starting by recreating situations that have already happened. Graceful degradations and restarts are a convenient way to start small. For example, restarting a redundant web server or database server reproduces a failure that most systems are expected to handle. Starting out in a QA environment is also recommended.

Jones also emphasizes adoption throughout the different phases. Shutting down systems in production on purpose can be seen as a radical idea, making adoption a delicate aspect to handle.

The second phase is about causing cascading failures. A cascading failure is a chain of failures, starting with a fault in one system that triggers a failure in another, and so on. Cascading failures often lie dormant for a long time until they are triggered by an unusual set of circumstances.

Jones describes an experiment she did at Jet. Her team determined what kind of failure they wanted to cause and proceeded in QA. A different failure than the one expected occurred, resulting in QA being down for a week. The experiment proved successful in that potential production failures were revealed, but it also stressed the importance of showing the benefits of the approach. While failures always cause some inconvenience, developers and stakeholders must be made aware that these failures would have been much more costly if they had happened unsupervised in production.

The third phase is building a failure injection library. The library enables injecting chaos directly through code, allowing more control over chaos experiments. Jones presents a sample F# library available on GitHub. The following snippet defines a function acting as an entry point to inject chaos:

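let chaos (name:string) (shouldChaos:unit -> bool) (chaos:Async<unit>) : AsyncFilter<_,_,_,_> =
    fun (service:AsyncArrow<_,_>) req -> async {
        // Inject the failure only when the predicate says so
        if shouldChaos() then
            printfn "%s" name
            do! chaos
        // Always hand the request on to the wrapped service
        return! service req
    }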
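To make the entry point above more concrete, here is a minimal usage sketch. It is not taken from the talk or from the library: the two type aliases are an assumption about what AsyncArrow and AsyncFilter look like, and echoService, injectLatency and everyNth are hypothetical names introduced only for illustration.

```fsharp
// Assumed shapes for the two type aliases used by the snippet above;
// the real library may define them differently.
type AsyncArrow<'a, 'b> = 'a -> Async<'b>
type AsyncFilter<'a, 'b, 'c, 'd> = AsyncArrow<'a, 'b> -> AsyncArrow<'c, 'd>

// The entry point from the article, repeated so the sketch is self-contained.
let chaos (name: string) (shouldChaos: unit -> bool) (chaos: Async<unit>) : AsyncFilter<_, _, _, _> =
    fun (service: AsyncArrow<_, _>) req -> async {
        if shouldChaos () then
            printfn "%s" name
            do! chaos
        return! service req
    }

// Hypothetical service: returns the request in upper case.
let echoService : AsyncArrow<string, string> =
    fun (req: string) -> async { return req.ToUpperInvariant() }

// Hypothetical chaos action: add 50 ms of latency before the call.
let injectLatency : Async<unit> = Async.Sleep 50

// Hypothetical trigger: fires on every n-th call, which keeps the sketch deterministic.
let everyNth (n: int) : unit -> bool =
    let calls = ref 0
    fun () ->
        calls.Value <- calls.Value + 1
        calls.Value % n = 0

// Wrap the service with the chaos filter; callers use it like the original service.
let chaoticEcho = echoService |> chaos "latency on echo" (everyNth 2) injectLatency

let replies =
    [ "a"; "b"; "c" ]
    |> List.map (fun req -> chaoticEcho req |> Async.RunSynchronously)
```

In this sketch the filter only adds latency, so the service still answers every request. A real experiment would more likely use a randomized or configuration-driven trigger in place of everyNth.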

The fourth phase is continuous chaos with the Chaos Automation Platform (ChAP). Both ChAP and FIT (Failure Injection Testing) were created by Netflix, and ChAP is designed to overcome the shortcomings of FIT. The goal of ChAP is to run chaos experiments on everything, continuously. It focuses on minimizing the blast radius, concentrates failures on dedicated instances, and adds orchestration on top of FIT.
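One way to picture the blast-radius idea is to let only a small, stable fraction of requests take part in an experiment. The sketch below is not how ChAP is implemented. It is an illustration under assumptions: bucketOf, inExperiment and the request keys are hypothetical, and the hash is a simple stand-in chosen only because it is deterministic.

```fsharp
// Hypothetical helper: map a request key to a stable bucket in [0, 100).
// The same key always lands in the same bucket.
let bucketOf (key: string) : int =
    let hash = key |> Seq.fold (fun acc c -> (acc * 31 + int c) % 1000003) 7
    hash % 100

// A request takes part in the experiment only if its bucket is below the limit,
// so `percent` caps the share of traffic that can be affected.
let inExperiment (percent: int) (key: string) : bool =
    bucketOf key < percent

// Hypothetical request keys, used only to exercise the predicate.
let keys = [ for i in 1 .. 1000 -> sprintf "user-%d" i ]
let selected = keys |> List.filter (inExperiment 1)
```

A predicate like this could play the role of the shouldChaos argument in the F# snippet shown earlier, so that widening or shrinking an experiment comes down to changing a single number.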

The fifth and last phase is injecting chaos at targeted areas in a system. Jones draws on her experience at Jet, where the team was dealing with geo-replication. The system was highly dependent on Kafka, and the team had run into some problems with it. The team established a list of Kafka-specific scenarios they wanted to chaos test. The lesson learned was that, as stated earlier, failures induced by chaos experiments are hard to differentiate from regular failures. A steady state is required before running chaos experiments.

As a final word, Jones suggests defining a strategy for chaos engineering adoption. To be most effective, that strategy must be tailored to a company's culture. For example, does it work better to force adoption, or do people and teams need to be convinced instead?
