Netflix Unleashes Chaos Monkey as its Latest Open Source Tool
Netflix has just open-sourced its much-talked-about “Chaos Monkey” software, which intentionally takes servers offline to test the resiliency of a cloud environment. It is the latest in a long line of internally developed tools that Netflix has chosen to share freely with the technical community.
The Netflix team first unveiled the Chaos Monkey in December of 2010 in a blog post explaining the lessons learned from hosting its massively popular video streaming service on the AWS cloud. One particular lesson, titled “the best way to avoid failure is to fail constantly”, described how Netflix proactively sabotages its own environment in order to discover points of failure.
One of the first systems our engineers built in AWS is called the Chaos Monkey. The Chaos Monkey’s job is to randomly kill instances and services within our architecture. If we aren’t constantly testing our ability to succeed despite failure, then it isn’t likely to work when it matters most – in the event of an unexpected outage.
In a blog post on July 30th of 2012, the Netflix tech team announced the release of the Chaos Monkey as an open source project. This post explained the purpose of the Chaos Monkey and considerations for operating it. Netflix claims that this software can run successfully on clouds other than AWS and is intended to be used by those who want to detect failure conditions in their own environment. Netflix built in multiple configuration settings that should calm the fears of those who can’t imagine setting the Chaos Monkey loose in their own data center. First, the Chaos Monkey can be set to run at times when support staff is standing by to resolve issues.
The service has a configurable schedule that, by default, runs on non-holiday weekdays between 9am and 3pm. In most cases, we have designed our applications to continue working when an instance goes offline, but in those special cases that they don't, we want to make sure there are people around to resolve and learn from any problems. With this in mind, Chaos Monkey only runs within a limited set of hours with the intent that engineers will be alert and able to respond.
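The scheduling rule quoted above can be illustrated with a short sketch. This is not Chaos Monkey's actual code: the function name `within_monkey_hours`, the `HOLIDAYS` set, and the choice to treat 3pm as an exclusive bound are all assumptions made for illustration.

```python
from datetime import date, datetime, time

# Hypothetical holiday calendar; the article does not say how the
# real tool decides which days count as holidays.
HOLIDAYS = {date(2012, 12, 25)}


def within_monkey_hours(now, start=time(9, 0), end=time(15, 0), holidays=HOLIDAYS):
    """Return True if `now` is a non-holiday weekday between start and end."""
    if now.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        return False
    if now.date() in holidays:
        return False
    # Assumption: the window is inclusive at 9am and exclusive at 3pm.
    return start <= now.time() < end


if __name__ == "__main__":
    print(within_monkey_hours(datetime(2012, 7, 30, 10, 0)))  # a Monday morning
```

A scheduler built this way would simply skip any termination run for which the check returns False, so that failures only happen while engineers are on hand.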
Second, users can decide how aggressive the Chaos Monkey should be with new applications. While Netflix chooses to run the Chaos Monkey against all of its applications (unless the application owner explicitly opts out), others can take a tamer approach and allow the Chaos Monkey to run only against specific applications, or with a low probability of instance termination.
Not every application can trivially handle an instance going offline. Sometimes it takes a human to manually recover instances, perhaps exercising backups to bring them back. Ideally, engineers work towards making that process easier and faster and eventually automatic. For those applications, there is the ability to Opt-Out of Chaos Monkey. There is also a tunable "probability" that Chaos Monkey uses to control the chance of a termination.
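The opt-out and "probability" settings described above might combine roughly as follows. This is a minimal sketch under stated assumptions, not the tool's real logic or configuration format: `should_terminate`, its parameters, and the application names are hypothetical.

```python
import random


def should_terminate(app, opted_out, probability, rng=random.random):
    """Decide whether to terminate an instance of `app` on this run.

    `opted_out` is a set of application names excluded from termination;
    `probability` is the chance (0.0 to 1.0) of a termination.
    `rng` is injectable so the decision can be made deterministic in tests.
    """
    if app in opted_out:
        return False  # opted-out applications are never touched
    return rng() < probability


if __name__ == "__main__":
    opted_out = {"legacy-billing"}  # hypothetical application name
    for app in ("legacy-billing", "api-gateway"):
        print(app, should_terminate(app, opted_out, probability=0.2))
```

Starting a new application with a low probability and raising it as confidence grows is one way to take the tamer approach the article mentions.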
The public availability of the Chaos Monkey came shortly after another major open source release: project Asgard. According to another blog post by the Netflix team, Asgard is a web interface for deploying and managing AWS environments. While the AWS management console offers a full set of features for provisioning servers, the Netflix team found it missing the concept of an “application.” Netflix filled that gap by introducing Asgard.
When there are large numbers of those cloud objects in a service-oriented architecture (like Netflix has), it’s important for a user to be able to find all the relevant objects for their particular application. Asgard uses an application registry in SimpleDB and naming conventions to associate multiple cloud objects with a single application.
In addition, Asgard aggregates AWS Auto Scaling Groups into objects called “clusters” that can then easily be rolled back in the case of failed deployments. Applications and Clusters, along with support for automating deployment tasks, are designed to make AWS administration simpler and less prone to human error.
Both the Chaos Monkey and Asgard are part of a larger initiative by Netflix to aggressively open-source software that helps the broader community. All of the Netflix open source projects are available on the company's GitHub page.