Facilitating the spread of knowledge and innovation in professional software development

Contribute

### Topics

InfoQ Homepage Articles How to Avoid Cascading Failures in Distributed Systems

# How to Avoid Cascading Failures in Distributed Systems

Leia em Português

### Key Takeaways

• Cascading failures are failures that involve some kind of feedback mechanism. In distributed software systems they generally involve a feedback loop where some event causes either a reduction in capacity, an increase in latency, or a spike of errors; then the response of the other components of the system makes the original problem worse.
• It’s often very difficult to scale out of a cascading failure by adding more capacity to your service: new healthy instances get hit with  excess load instantly and become saturated, so you can’t get to a point where you have enough serving capacity to handle the load.
• Sometimes, the only fix is to take your entire service down in order to recover, and then reintroduce load.
• The potential for cascading failures is inherent in many, if not most, distributed systems. If you haven’t seen one in your system yet, it doesn’t mean you’re immune; you may just be operating comfortably within your system’s limits. There’s no guarantee that will be true tomorrow, or next week.

## What is a Cascading Failure?

In this article I’ll discuss public accounts of real production incidents. I have the greatest respect for all engineering organisations concerned. Everyone who runs distributed software systems experiences outages, but not all organisations are open and honest enough to write clear public accounts of their production incidents. Those who do deserve only appreciation and esteem (and sometimes, our sympathies!).

Cascading failures are failures that involve some kind of feedback mechanism—in  other words, vicious cycles in action.

The outage that took down Amazon DynamoDB in US-East-1 on September 20, 2015 for over four hours is a classic example of a cascading failure. There were two subsystems involved: storage servers and a metadata service. Storage servers request their data partition assignments from the metadata service, which is replicated across data centers. At the time the incident occurred, the average time to retrieve partition assignments had risen significantly, because of the introduction of a new index type (Global Secondary Indexes or GSIs), but the capacity of the metadata service hadn’t been increased,  nor had the configured deadlines for the data-partition assignment request operation. Any request that didn’t succeed within that deadline was considered to have failed, and the client would retry.

Figure 1: Services involved in September 2015 DynamoDB outage

The incident was triggered by a transient network problem, which caused some of the storage servers to not receive their partition assignments.

Those storage servers removed themselves from service, and they also continued to retry their requests for partition assignments. The metadata servers became overwhelmed with load from these requests and were therefore  slower to respond, which caused more requests sent to them to timeout and be retried. These retries increased the load on the service further. The metadata service was so badly overloaded that operators had to firewall the it off from the storage servers in order to add extra capacity.  This meant effectively taking the entire DynamoDB service offline in US-East-1.

The second problem is they’re an exceptionally hard type of failure from which to recover. They normally start with some small perturbation — like a transient network issue, a small spike in load, or the failure of a few instances. Instead of recovering to a normal state over time, the system gets into a worse state. A system in cascading failure won’t self-heal; it’ll only be restored through human intervention.

The third problem is that if the right conditions exist in your system, cascading failures can strike with no warning. Unfortunately, the basic preconditions for cascading failures are difficult to avoid: it’s simply failover. If failure of a component can cause retries, or cause load to shift to other parts of your system, then the basic conditions for cascading failure are there. But all is not lost: there are patterns we can apply that help us defend our systems against cascading failures.

## Feedback Cycles: How Cascading Failures Take Down Our Systems

Cascading failures in distributed software systems generally involve a feedback loop where some event causes either a reduction in capacity, an increase in latency, or a spike of errors; then the response of the other components of the system makes the original problem worse.

The Causal Loop Diagram (CLD) is a good tool to understand these incidents. Below is a CLD for the DynamoDB incident from earlier.

Figure 2: Causal Loop Diagram for September 2015 DynamoDB outage

CLDs are a tool from System Dynamics, an approach to modelling complex systems invented by Jay Forrester at MIT. Each arrow shows how two quantities in the system interact. A ‘+’ beside the arrow means that an increase in the first quantity will tend to increase the second quantity, and a ‘-’ means there is an inverse relationship. So, it follows that increasing the capacity of the service, i.e. the number of instances serving it, will reduce the load per instance. Adding a new type of index, or retries from failing requests will tend to increase it.

Where we have a cycle in the diagram, as we do here, we can look at the signs and see if the cycle is balanced, a mixture of ‘+’ and ‘-’. Here, we have all ‘+’ signs in the cycle, meaning that it’s not balanced. In System Dynamics, this is called a "reinforcing cycle" (hence the ‘R’ in the centre with the arrow around it).

Having a reinforcing cycle in your system doesn’t mean  it’ll constantly be in overload. If capacity is sufficient to meet demand, it will work fine. However, it does mean that in the right circumstances — a reduction in capacity, a spike in load, or anything else that that pushes latency or timeouts above a critical threshold — a cascading failure might occur, such as happened to DynamoDB.

A key realisation — a very similar cycle exists for most replicated services with clients that retry on failure. This is a very, very common pattern. Later in this article we will examine some patterns that help prevent this cycle turning into a cascading failure scenario.

Let’s look at another example of a cascading failure: Parsely’s Kafkapocalyspe. The systems involved here are different, but the pattern is similar. Due to a launch, Parsely had increased the load on their systems, including their Kafka cluster. Unbeknownst to them, they were close to the network limits on the EC2 nodes on which they were running their Kafka brokers. At some point, one broker hit its network limit, and became unavailable. Load increased on other brokers, as clients failed over, and very quickly all the brokers were down.

As with the earlier AWS scenario, we see from the Parsely outage how quickly a system can go from being stable and predictable to a very nonlinear and dysfunctional state once a limit is breached, and how recovery doesn’t happen until operators intervene.

It’s often very difficult to scale out of a cascading failure by adding more capacity to your service: new healthy instances get hit with  excess load instantly and become saturated, so you can’t get to a point where you have enough serving capacity to handle the load.

Many load-balancing systems use a health check to send requests only to healthy instances,  though you might need to turn that behavior off during an incident to avoid focusing all the load on brand-new instances as they are brought up. The same is true of any kind of orchestration or management service that kills instances of your servers that fail health checks (such as kubernetes liveness probes); they will remove overloaded instances, contributing to the capacity problem.

Sometimes, the only fix is to take your entire service down in order to recover, and then reintroduce load. We saw this in the DynamoDB outage. Spotify had an outage in 2013 where they also had to take the impacted service offline to recover. This is especially likely where the overloaded service doesn’t impose any limit on the number of queued or current requests.

### Antipattern 1: Accepting unbounded numbers of incoming requests

Anyone who’s done much benchmarking has probably noticed that individual instances of a service generally hit a peak in throughput; then, if load increases further, you see a drop in throughput and an increase in latency. This change happens because some of the work in any service is not parallelizable (there’s a good explanation of the maths in Baron Schwartz's talk 'Approaching the Unacceptable Workload Boundary'). In a state of cascading failure, individual service instances can end up with so many queued requests, or so many concurrent threads trying to execute, that the service can become totally unresponsive and may not recover without intervention (generally, a restart). Duo experienced conditions like this during an outage in 2018: "We determined that limiting was ineffective because of the way our application queues requests while waiting for a database connection. In this case, these queued requests had built up in such a way that the database could not recover as it tried to process this large backlog of requests, even after traffic subsided and the limits were in place."

This is why setting a limit on the load on each instance of your service is so important. Loadshedding at a load-balancer works, but you should set limits in your service as well, for defense in depth. The mechanisms to implement a limit on concurrent requests vary, depending on the programming language and server framework you’re using, but might be as simple as a semaphore. Netflix’s concurrency-limits tool is a Java-based example.

Failing requests early when a server is heavily loaded is actually also good for clients. It’s better to get a fast failure and retry to a different instance of the service, or serve an error or a degraded experience, than wait until the request deadline is up (or indefinitely, if there’s no request deadline set). Allowing this to happen can lead to slowness that spreads  through an entire microservice architecture, and it can sometimes be tricky to find the service that is the underlying cause, when every service has ground to a halt.

### Antipattern 2: Dangerous client retry behaviour

We don’t always have control over client behaviour, but if you do control your clients, moderating client request patterns can be a very useful tool. At the most basic level, clients should limit the number of times they retry a failed request within a short period of time. In a system where clients retry too many times in a tight loop, any minor spike of errors can cause a flood of retried requests, effectively DOSing the service. Square experienced this in March 2017 when their Redis instance became unavailable because of a code path that would retry a transaction up to 500 times. Here is sample Golang code for that simple retry loop:

const MAX_RETRIES = 500
for i := 0; i < MAX_RETRIES; i++ {
_, err := doServerRequest()
if err == nil {
break
}
}

When Square’s engineers rolled out a fix to reduce the number of retries, the feedback loop immediately ended and their service began serving normally.

Clients should use an exponentially increasing backoff between retry attempts. It’s also good practice to add a little random noise, or jitter, to the backoff time. This ‘smears’ a wave of retries out over time, so a service that’s temporarily glitching for a few milliseconds doesn’t get hit with twice its normal load when all clients simultaneously retry. The number of retries and how long to wait is application specific. User-facing requests should fail fast or return a degraded result of some kind, whereas batch or asynchronous processing can wait much longer.

Here is  sample Golang code for a retry loop with exponential backoff and jitter:

const MAX_RETRIES = 5
const JITTER_RANGE_MSEC = 200
steps_msec := []int{100, 500, 1000, 5000, 15000}
rand.Seed(time.Now().UTC().UnixNano())

for i := 0; i < MAX_RETRIES; i++ {
_, err := doServerRequest()
if err == nil {
break
}
time.Sleep(time.Duration(steps_msec[i] + rand.Intn(JITTER_RANGE_MSEC)) *
time.Millisecond)
}

Modern best practice goes a step beyond exponential backoff and jitter. The Circuit Breaker application design pattern wraps calls to an external service and tracks success and failure of those calls over time. A sequence of failed calls will ‘trip’ the circuit breaker, meaning that no more calls will be made to the failing external service, and clients attempting to make such calls will immediately get an error. Periodically, the circuit breaker will probe the external service by allowing one call through. If the probe request succeeds, the circuit breaker will reset and again start making calls to the external service.

Circuit breakers are powerful because they can share state across all requests from a client to the same backend, whereas exponential backoff is specific to a single request. Circuit breakers reduce the load on a struggling backend service more than any other approach. Here’s a circuit breaker implementation for Golang. Netflix’s Hystrix includes a Java circuit breaker.

### Antipattern 3: Crashing on bad input — the ‘Query of Death’

This ‘query of death’ is any request to your system that can cause it to crash. A client may send a query of death, crash one instance of your service, and keep retrying, bringing further instances down. The reduction in capacity can then potentially bring your entire service down as the remaining instances get overloaded from the normal workload.

This kind of scenario can be the result of an attack on your service, but it may not be malicious, just bad luck. This is why it’s a best practice to never exit or crash on unexpected inputs; a program should exit unexpectedly only if internal state seems to be incorrect and it would be unsafe to continue serving.

Fuzz testing is an automated testing practice that can help detect programs that crash on malformed inputs. Fuzz testing is especially important for any service that is exposed to untrusted inputs, which means anything outside your organisation.

### Antipattern 4: Proximity-based failover and the domino effect

What do your systems do if an entire data center or availability zone goes down? If the answer is ‘fail over to the next nearest one’ then your systems have the potential for a cascading failure.

Figure 3: Map of data center locations

If you lost one of your US East Coast data centers in a topology, like the one shown above, then the other data center in that region would get roughly twice the load as soon as users failed over. If the remaining US East Coast data center couldn’t manage the load and also failed, then the load would likely go primarily to US West Coast data centers (it’s cheaper than sending traffic to Europe, usually). If those failed, then your remaining locations would likely go down next: like dominos. Your failover plan, which is intended to improve your system’s reliability, has brought your entire service down.

Geographically balanced systems like this need to do one of two things: either make sure that load fails over in a way that doesn’t overload the remaining data centers, or else maintain a lot of capacity everywhere.

Systems that are based on IP Anycast (like most DNS services and many CDNs) generally overprovision, specifically because anycast, which serves a single IP from many points on the Internet, gives you no way to control inbound traffic.

This level of overprovisioning for failure can be very expensive. For many systems, using a way to direct load to data centers that have capacity available makes more sense. This is often done using DNS load balancing (for example NS1’s intelligent traffic distribution).

### Antipattern 5: Work prompted by failure

Sometimes, our services do work when a failure occurs. Consider a hypothetical distributed data store system that splits our data into blocks. We want a minimum number of replicas of each block, and we regularly check that we have the right number of copies. If we don’t, then we start making new copies. Here’s a pseudo-code snippet:

replicaChecker()
while true {
for each block in filesystem.GetAllBlocks() {
if block.replicasHeartbeatedOK() < minReplicas {
block.StartCopyNewReplica()
}
}
}
}

Figure 4: Replication of data blocks after a failure.

This approach will probably work fine if we lose one block, or one server of many. But what if we lose a substantial proportion of the servers? An entire rack? The serving capacity of the system will be reduced, and the remaining servers are going to be busy re-replicating data. We haven’t put any limits on how much replication we’re going to do at a time. Here’s a Causal Loop Diagram showing the feedback loop.

Figure 5: Causal loop diagram showing the feedback loop in the system

The usual way around this is to delay replication (because failure is often transient), and limit the number of in-flight replication processes with something like the token bucket algorithm. The Causal Loop Diagram below shows how this changes the system: we still have a feedback loop, but there’s now an inner balanced loop that prevents the feedback cycle from running away.

Figure 6: Causal loop diagram showing rate limit on replication

### Antipattern 6: Long startup times

The potential for cascading failures is inherent in many, if not most, distributed systems. If you haven’t seen one in your system yet, it doesn’t mean you’re immune; you may just be operating comfortably within your system’s limits. There’s no guarantee that will be true tomorrow, or next week.

We’ve listed  a number of antipatterns to avoid if you want to reduce the risk of experiencing a cascading failure. No service can withstand an arbitrary spike of load. Nobody wants their service to serve errors, but sometimes it’s the lesser evil, when the alternative is to see your entire service grind to a standstill trying to deal with every incoming request.

• 'Stability Patterns' chapter in Release It! by Michael T Nygard.
• 'Handling Overload' chapter by Alejandro Forero Cuervo in Site Reliability Engineering: How Google runs Production Systems

Laura Nolan is a Senior Staff Engineer at Slack Technologies in Dublin. Her background is in Site Reliability Engineering, software engineering, distributed systems, and computer science. She wrote the 'Managing Critical State' chapter in the O'Reilly 'Site Reliability Engineering' book, as well as contributing to the more recent 'Seeking SRE'. She is a member of the USENIX SREcon steering committee.

Style

## Hello stranger!

You need to Register an InfoQ account or or login to post comments. But there's so much more behind being registered.

Get the most out of the InfoQ experience.

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p