How Netflix Handled the Reboot of 218 Cassandra Nodes

At the end of September, Amazon performed a major maintenance update to patch a security vulnerability in the Xen hypervisor that affected about 10% of their global fleet of cloud servers. The update required rebooting those servers, with consequences for AWS users and the services they provide, including one of their largest clients, Netflix.

Christos Kalantzis, Cloud Database Engineering Manager at Netflix, expressed their initial shock on hearing about the upcoming maintenance operation:

When we got the news about the emergency EC2 reboots, our jaws dropped. When we got the list of how many Cassandra nodes would be affected, I felt ill. Then I remembered all the Chaos Monkey exercises we’ve gone through. My reaction was, “Bring it on!”.

Netflix has a long history of using their Simian Army (Chaos Monkey, Chaos Gorilla and Chaos Kong) to force reboots of their servers in order to see how the overall system reacts and what can be done to improve resilience. The problem this time was that the operation would affect some of their database servers, specifically 218 Cassandra nodes. It is one thing to perform a live restart of a server streaming a video; it is much harder to do the same to a stateful database.

“Luckily”, Netflix started unleashing Chaos Monkey on Cassandra nodes a year ago, developing the automation tools needed to migrate such a node to another machine when necessary. During the latest AWS maintenance, the end result was the following:

Out of our 2700+ production Cassandra nodes, 218 were rebooted. 22 Cassandra nodes were on hardware that did not reboot successfully. This led to those Cassandra nodes not coming back online. Our automation detected the failed nodes and replaced them all, with minimal human intervention. Netflix experienced 0 downtime that weekend.
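The detect-and-replace behaviour described in that quote can be pictured as a simple reconciliation loop. The sketch below is purely illustrative and is not Netflix's tooling: the node names, the `is_alive` health check and the `replace` callback are all hypothetical stand-ins, and only the counts (218 rebooted, 22 failed) come from the quote.

```python
def reconcile(rebooted, is_alive, replace):
    """Replace every rebooted node that did not come back online.

    `is_alive` and `replace` are hypothetical callbacks standing in for a
    real health check and a real provisioning step.
    Returns a mapping of failed node -> replacement node.
    """
    replaced = {}
    for node in rebooted:
        if not is_alive(node):
            replaced[node] = replace(node)
    return replaced


# Counts taken from the quote: 218 nodes rebooted, 22 never came back.
rebooted = [f"cass-{i:03d}" for i in range(218)]  # made-up node names
dead = set(rebooted[:22])                         # pretend these 22 failed

result = reconcile(
    rebooted,
    is_alive=lambda node: node not in dead,
    replace=lambda node: node + "-replacement",
)
```

In this toy run `result` holds one replacement per failed node, while the 196 nodes that rebooted cleanly are left untouched.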

InfoQ wanted to know more about this incident and how Netflix dealt with it, so we talked to Bruce Wong, Chaos Engineering Manager at Netflix:

InfoQ: The latest major AWS maintenance involved rebooting about 10% of their machines. Wasn't it possible to move at least some of the Cassandra instances to the other 90% ahead of time, representing non-Xen machines that weren't to be rebooted?

BW: We explored this with AWS and determined that it wasn’t possible to target the other 90% of instances ahead of time. While we could have moved some of the Cassandra instances, there was no guarantee that the new instance wouldn’t be affected. We did not have any advance information around the root cause or details about the reboot. All we had was a snapshot of currently impacted instances that were going to be rebooted. We did take the opportunity to reboot a Cassandra node ahead of time to help anticipate impact.

InfoQ: Have you ever tried before to take down 10% of your machines with Chaos Monkey to see if your system can handle it?

BW: We’ve taken down 33% of machines with Chaos Gorilla before to see how our systems can handle a zone outage. We haven’t tried taking down 10% one by one spread across a few days, which would simulate a different type of failure.

InfoQ: How does the AWS maintenance event compare to your Chaos Gorilla or Chaos Kong induced failures?

BW: The AWS maintenance event is like comparing apples and oranges with Chaos Gorilla or Chaos Kong. For Gorilla we move traffic away from the entire zone; for the maintenance event we never stopped sending traffic to the zone under maintenance. Similar for Kong, we move traffic away from the entire region.

InfoQ: Why was Cassandra harder to integrate into the Chaos Engineering process?

BW: Part of it was a tooling issue. We manage our Cassandra Clusters much differently than we do stateless applications. So we couldn’t just piggyback on the work that had already been done. The biggest challenge was psychological. Once we got over the mental hurdle of introducing Chaos into the Database layer, it was just a matter of writing the right automation and tooling. Once complete, Cassandra became a full citizen of the Chaos Engineering process.

InfoQ: How do you deal with the state of a database that goes down?

BW: We have Cassandra configured to keep 3 copies of all data. At any one time we can lose up to ⅔ of the cluster and still be able to function. When a node is lost and needs to be replaced, the new node knows what data it should have. It will ask for that data from its peers. Once it is done receiving that data, it rejoins the cluster, and the Cluster is back at full strength.
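To make the "3 copies" reasoning concrete, here is a minimal sketch of replica placement on a ring. It rests on assumptions that are not stated in the interview: a 12-node ring, each token range stored on its owner plus the next two nodes clockwise, node `i` living in zone `i % 3`, and reads that succeed as long as one replica is alive (stricter settings such as quorum reads would need two of the three). The function names are illustrative only.

```python
from itertools import combinations


def replicas(range_id, num_nodes, rf=3):
    """Nodes holding a token range: its owner plus the next rf-1 nodes clockwise."""
    return {(range_id + k) % num_nodes for k in range(rf)}


def readable(down, num_nodes, rf=3):
    """True if every token range still has at least one live replica."""
    return all(replicas(r, num_nodes, rf) - down for r in range(num_nodes))


NUM_NODES = 12  # assumed ring size, chosen to divide evenly into 3 zones

# Any two simultaneous node failures leave every range with a live copy.
any_two_ok = all(
    readable(set(pair), NUM_NODES)
    for pair in combinations(range(NUM_NODES), 2)
)

# Assumption: node i sits in zone i % 3, so the three replicas of each range
# land in three different zones. Losing two whole zones is 8 of 12 nodes.
two_zones_down = {n for n in range(NUM_NODES) if n % 3 in (0, 1)}
two_zones_ok = readable(two_zones_down, NUM_NODES)

# Three adjacent nodes hold every copy of one range, so that range is lost.
three_adjacent_ok = readable({0, 1, 2}, NUM_NODES)
```

Under these assumptions `any_two_ok` and `two_zones_ok` are both true while `three_adjacent_ok` is false: surviving the loss of two thirds of the nodes depends on which nodes are lost, which is why spreading replicas across zones matters.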
