
Managing the Risk of Cascading Failure


Summary

Laura Nolan discusses some of the mechanisms that cause cascading failures, what can be done to reduce the risk, and what to do if there is a cascading failure situation.

Bio

Laura Nolan is a Site Reliability Engineer at Slack. Her background is in Site Reliability Engineering, software engineering, distributed systems, and computer science. She wrote the 'Managing Critical State' chapter in O'Reilly's 'Site Reliability Engineering' book, as well as contributing to the more recent 'Seeking SRE'. She is a member of the USENIX SREcon steering committee.

About the conference

InfoQ Live is a virtual event designed for you, the modern software practitioner. Take part in facilitated sessions with world-class practitioners. Hear from software leaders at our optional InfoQ Roundtables.

Transcript

Nolan: Welcome to managing the risk of cascading failure. I'm Laura Nolan. I'm a software engineer and/or SRE; there isn't exactly a clean division between the two things. I've handled a few cascading failure scenarios in real life, and hopefully I've prevented a few more from happening. I'm currently at Slack. I was formerly at Google, and at Gilt Groupe before that, and before that was prehistory. I'm a contributor to some books that you may have seen: "Site Reliability Engineering: How Google Runs Production Systems," the newer "97 Things Every SRE Should Know," and "Seeking SRE." I write regularly for ;login: magazine. Because I know that technology and software are terrible and fail all the time, I support the Campaign to Stop Killer Robots. I don't think that we should allow computers to make the decision to kill.

What are Cascading Failures?

First off, let's define cascading failures. Cascading failures are failures that involve a vicious cycle or a feedback loop. In a distributed software system, something causes a reduction in capacity, an increase in latency, or a spike in errors, or all three in combination. What happens next is that the response of other components of your software system causes widespread failure: load increases, and your backends get flattened. Because some of them may be unhealthy or capacity may be reduced, you can end up in a vicious cycle where all of your instances keep getting flattened by load. They may get marked unhealthy. They may get recycled by your orchestration system. Retries may amplify the existing load. Either way, one of the notable things about cascading failures is that you don't get any automatic recovery. Things don't just get perturbed and settle down again; you don't see a small spike of errors and then things going back to normal. Because cascading failures are vicious cycles, once your system gets into one, it cannot get out without an intervention. This is why they're scary. We don't want them.

We're talking about this in the context of software systems, but it is worth mentioning that the phenomenon is not exclusive to software, and it wasn't first seen or talked about in software. In "real engineering," we see cascading failures in the electricity grid, and they also happen in computer networks, so everywhere. In an electricity grid, the phenomenon is similar: some piece of the grid trips or fails, load goes elsewhere, and more parts of the system trip out until you end up with a fairly widespread failure. It's a similar enough phenomenon to be worth noting, and there are a lot of papers about it if you look on the internet.

Design to Failover

The interesting thing that I think we need to start off by talking about is the fact that in any system where we design to fail over, any mechanism at all that redistributes load from a failed component to still-working components, we create the potential for a cascading failure. These are mechanisms like load balancers that route away from unhealthy backends towards healthy ones, client retries, and even orchestration systems that replace unhealthy tasks with new ones. They can all contribute to this phenomenon, because these things inherently increase traffic, at least temporarily, to the remaining healthy parts of the system. I think of it this way: it's like the light side of the force and the dark side of the force. As Yoda would say, have one without the other you cannot. What we can do is make good choices to reduce the risks associated with the dark side.

We're going to start looking at this from the perspective of the client, because most of us probably write more client code than we do server code for big distributed backends. This section is more or less taken from the handling overload chapter of the Google SRE book. We distinguish between two possible situations: an overload of a small subset of the backends of your service, or a global overload. In the subset case, we may just have a couple of instances that are garbage collecting, or that have had a couple of very heavy requests hit them, or maybe a brief network blip. It could be any sort of thing, but you have a very localized, very small pocket of overload. The other case is where the whole service is struggling, all or most of your tasks or instances are overloaded, and the whole service is basically on fire.

Unfortunately, from the client's point of view, the ideal response is different in those two cases. If you are the client of a service that has just one or two hotspotted backends, you want to retry immediately and hope that your request goes to a healthy backend, so you can get a reply and make your user happy. If you have a global overload situation, what you want is for that service to experience a reduction in load so it can recover, so things can go back to normal, and so you can better serve your end users. If you're a batch job, the best thing to do is to back off. If it's work that has to be done at some point but doesn't have to be done now, you can back off for a really long period of time, many seconds. Or you can fail. If it's something user facing, there is no point in hanging around indefinitely; you want to fail fairly fast in that situation.

The problem is, of course, that the client doesn't know which case it's in. This is why a lot of systems just retry n times, and three is a very common value for n. Many of us work in systems that have been built up layer by layer, so if you have three retries in the client, three retries in the frontend, and three retries at your load balancer, that's exponential growth. You end up in situations where one request gets retried a whole bunch of times, 27 attempts, and how many errors have you got? This is one of the major mechanisms that leads to cascading failures.
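To make that arithmetic concrete, here is a minimal sketch of the worst case; the three-layer topology and per-layer attempt counts are hypothetical, purely to illustrate the multiplication:

```python
def worst_case_backend_attempts(attempts_per_layer):
    """Each layer multiplies the attempts it receives by its own attempt count."""
    total = 1
    for attempts in attempts_per_layer:
        total *= attempts
    return total

# Client, frontend, and load balancer each try up to 3 times:
# a single failing user request can turn into 27 attempts at the backend.
print(worst_case_backend_attempts([3, 3, 3]))  # 27
```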

AWS DynamoDB Outage in 2015

We're going to take a look at a real-world example, just to start us off. This was DynamoDB, an Amazon Web Services service. It has a replicated metadata service, which is replicated across availability zones, and a fleet of storage servers. The storage servers request their partition assignments, which parts of the data set they should be serving, from that metadata service. The metadata service is the control plane of this system. In 2015, they had just added a new type of index, and that new index increased the load on the metadata service quite significantly, so the average latency for these requests had increased. The storage servers imposed an RPC deadline, to pick an arbitrary number, let's say 500 milliseconds. A storage server expects these RPCs to always take less than 500 milliseconds, and if one hangs for longer than that, it fails the call, because you don't want RPCs sitting around. If they're never going to succeed, you don't want them tying up resources: file descriptors, memory, all these things. You want to fail and retry, or do something else.

It's very good to have RPC deadlines. In this case, though, the RPC deadline the storage servers were using was very close to the average latency, so the system was on a cliff edge, and the engineers running the service were not aware of this fact. Any slight increase in latency from the metadata service could tip it over, because then a lot of requests would fail and be retried. This is pretty much what happened. There was a transient network failure, so some of the storage servers were temporarily unable to contact the metadata service. After the small network outage, there was a wave of load on the replicated metadata service, which pushed the average latency for those RPCs past the storage servers' deadline. The storage servers saw that RPCs were taking too long and canceled them, after the replicated metadata service had already taken on the work, and then retried. The replicated metadata service was now under extra load from all of these storage servers constantly retrying to get their partition assignments; it was being hammered by load from the storage servers.

They wanted to add more capacity to the replicated metadata service, but doing so requires the new instances to consume resources to synchronize data with the existing replicas. The replicated metadata service was under so much load that they were unable to add new instances without first blocking the storage servers from accessing the replicated metadata service at all. This basically meant taking DynamoDB offline in order to repair it, which is a very common thing we see when trying to recover from cascading failures. They managed to get new capacity online, and everything was good to go.

The Causal Loop Diagram

This is a really great example where we can see that vicious feedback loop. I've drawn what's called a causal loop diagram (CLD) of this incident. I think the CLD is a very good tool for understanding complex systems behavior, particularly feedback loops, because that's exactly what it's designed for. It comes from system dynamics, a modeling approach invented by a group of people at MIT in the 1950s, led by Jay Forrester. Each arrow shows an interaction between two things. If there's a plus beside an arrow, say, for example, the plus between retries to the metadata service and load on the metadata service, just at the top of the loop there, that means an increase in the first thing leads to an increase in the second thing: an increase in the number of retries leads to an increase in the amount of load. You can do a little bit of math on these diagrams. We have this loop between metadata service load, latency, timeouts, and retries, and all of those arrows have a plus.

In system dynamics, if you look at the signs around a loop in your causal loop diagram, a balancing loop will have a mixture of pluses and minuses, but in a reinforcing loop they will all be the same. In this case, we have all pluses in our loop, so it is a reinforcing loop. That means it has the potential for a vicious cycle, a cascading failure effect. Having a reinforcing loop in your system does not mean that it's going to constantly be in overload. If your capacity is sufficient to meet demand, it's going to work fine; that's the steady state of the metadata service system we just saw. But if you hit a certain set of circumstances, like a small reduction in capacity or a spike in load, and you push above a critical threshold, you can get into that vicious cycle behavior. This is what happened to DynamoDB with that transient network failure.
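The sign rule can be checked mechanically: in the standard system dynamics convention, you multiply the signs on the arrows around the loop, and a positive product means a reinforcing loop while a negative product means a balancing one. Here is a tiny sketch, with a hypothetical encoding of the metadata-service loop above:

```python
# Classify a causal loop from the signs on its arrows.
# The list below is a hypothetical encoding of the loop described above:
# load -> latency (+), latency -> timeouts (+),
# timeouts -> retries (+), retries -> load (+).

def classify_loop(signs):
    """signs: +1 for a '+' arrow, -1 for a '-' arrow."""
    product = 1
    for sign in signs:
        product *= sign
    return "reinforcing" if product > 0 else "balancing"

print(classify_loop([+1, +1, +1, +1]))  # reinforcing: potential vicious cycle
```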

It's a key realization that a very similar cycle exists for pretty much any replicated service with clients that retry on failure. This goes back to what I was saying about the light side and the dark side of the force: the cycle is inherent. What we can do is impose delays, or other things in the cycle, that tend to dampen the effect and protect us from the cascading failure loop.

What's Bad About Cascading Failures?

What's so bad about cascading failures? They take down your whole service, most of the time, because their effects spread, unless perhaps you have some sharding approach. They don't self-heal: once you're in the cycle, it stays that way until you intervene. And you don't really get warning of them. You can think you're fine, everything looks healthy, then you're on that cliff edge, and you just step over it.

Recovering from Cascading Failures

How do we recover? You normally can't just scale out. The reason you can't scale your way out of a cascading failure is that new healthy instances tend to get hit with the excess load and immediately become saturated. Very often, your instances run out of memory, hang, or get slow, and the new instances fail their health checks. Anyone who's done much load testing will have observed that as you increase load on a service, throughput increases up to a point; past that peak, adding more load makes the system slower. In a cascading failure situation, there's typically so much load that instances get pushed past that peak throughput, get slow, and start to hang. Then they may start failing health checks, which just perpetuates the whole cycle. To recover, you typically need to reduce load until you can get sufficient healthy serving capacity up. How you do this depends on your system. You might be able to turn down some clients, if you control them. You might be able to reduce retry rates. The other approach is typically to use some mechanism to block clients' ability to connect: you could take your backends out of DNS, if that's what you use, or block requests at your load balancer, something like that.

One of the things that can help you get enough healthy serving capacity up is to disable health checks, because this can stop the effect where load ends up lasering in on a small subset of healthy instances. As soon as something gets a little bit overloaded, the load starts lasering onto the next most healthy instances. Disabling health checks can also reduce the churn that an orchestration system can create. This is one of those things where your colleagues may look askance at you if you suggest turning off the health checks, but it really can help. Health checks, liveness checks, readiness checks, whatever you call them: this applies both to your load balancers and your orchestration systems.

Antipattern 1: Accepting Unbounded Numbers of Incoming Requests

Now let's talk about ways to avoid cascading failures, rather than ways to get out of them: ways to avoid them happening to you in the first place. There are certain antipatterns we can avoid in order to make our services more resilient against cascading failures. First off, when you're building a new backend service, avoid accepting an unbounded number of incoming requests. Conceptually, there's generally some queue of incoming requests. Depending on the framework you're using, it may not quite look like a queue: you may have a pool of threads or workers that accept incoming load, or you may have an explicit queue. In most systems, there is a way to limit the amount of in-flight and queued work. Figure out what that is, and use it. This queuing effect, trying to juggle too many threads, or coroutines, or workers, or too long a queue, is the mechanism that makes your service get slower and hang once you pass your peak throughput. You want to avoid this.

You want to set up your service so that it can reach that peak of load but won't push past it to the point where the service just gets gummed up. It's better to shed load quickly and say, "No, I can't deal with this request," than it is to queue that request, hang on to it for a long time, and just churn and do nothing. It's better for clients, and it's better for the service. This is an antipattern that can extend outages very significantly, because if you allow your services to get into this overloaded, jammed-up state, you'll typically have to restart them before they become healthy again.
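As a concrete illustration, here is a minimal sketch of bounding in-flight work with a semaphore and shedding the excess immediately; the limit, the handler, and the response shape are all hypothetical, and in practice you would use whatever concurrency or queue limits your serving framework already provides:

```python
import threading

# Hypothetical cap on concurrent in-flight requests; in practice, tune it to
# sit just below the service's measured peak-throughput point.
MAX_IN_FLIGHT = 100
_in_flight = threading.BoundedSemaphore(MAX_IN_FLIGHT)

def handle_request(request):
    # Shed load immediately instead of queueing work past the throughput peak.
    if not _in_flight.acquire(blocking=False):
        return {"status": 503, "body": "overloaded, please back off"}
    try:
        return do_work(request)
    finally:
        _in_flight.release()

def do_work(request):
    # Stand-in for the real application logic.
    return {"status": 200, "body": "ok"}
```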

Antipattern 2: Dangerous Client Retry Behavior

The second antipattern, and one of the biggest, is dangerous client retry behavior. Client retries can easily go exponential; you're trying to avoid that. Be intentional about which layers in your system do retries, and which do not. That can avoid the exponential effect. Bound your retries. I've seen some amazing outages caused purely by one operation having a retry limit of 500 retries, which is beyond the bounds of reason. For user-facing requests, users don't wait for more than a couple of seconds; that's a really good reason to limit the number of retries. Batch jobs are going to have to keep retrying, but let those intervals get exponentially longer. Several hundred seconds can be a reasonable maximum interval for a batch job. For all clients, no matter whether it's a client of your own system or you're calling someone else's system, be gentle with your backends. Use exponential backoff: 100 milliseconds, 200 milliseconds, 400 milliseconds, and so on.

Jitter your retries. This means adding a little bit of random noise. Don't retry at exactly 100 milliseconds, 200 milliseconds; add plus or minus 10% or 15%. This prevents an effect where waves of load get amplified. Think about a very transient network problem that blocks requests, even just for a few milliseconds. If everything retries instantly after that, you're going to have the normal load you would have at that instant, plus all the retries. Then you may end up overloading the backends, getting another wave of retries, then an even higher synchronized wave of retries in the next retry period, and so on. You can get cascading, amplified waves of retries if you don't add jitter, and it's very much to be avoided. Jitter smears the excess load of retries over time.
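Putting the last two pieces of advice together, here is a minimal sketch of bounded retries with capped exponential backoff and jitter; it assumes a synchronous do_rpc callable, and the attempt count and delay values are illustrative rather than recommendations:

```python
import random
import time

def call_with_retries(do_rpc, max_attempts=3, base_delay=0.1, max_delay=2.0):
    """Call do_rpc, retrying with capped exponential backoff and +/-15% jitter."""
    for attempt in range(max_attempts):
        try:
            return do_rpc()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # bounded: give up after the last attempt
            delay = min(base_delay * (2 ** attempt), max_delay)
            delay *= random.uniform(0.85, 1.15)  # jitter smears retry waves over time
            time.sleep(delay)
```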

A really useful pattern is called the circuit breaker. You can look it up on Martin Fowler's website, and there's an excellent book called "Release It" by Michael T. Nygard that talks about it extensively as well. A circuit breaker is where a framework or a load balancer uses its knowledge of many outgoing requests to a backend service. If it notices multiple failures in a short period of time, the circuit breaker trips open, which means there will be no more outgoing requests to that service: they fail fast. Then, after a period of time, the circuit breaker half-opens, tries one request, and if that succeeds, it closes again and lets traffic flow normally. The circuit breaker is very good because it's very protective of backend services in overload, while still allowing fast retries in the common case where just one or two backends are overloaded. I think it's something that every software engineer should be fully aware of.
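Here is a minimal, single-threaded sketch of that state machine, using the conventional closed/open/half-open naming; the failure threshold and reset timeout are hypothetical, and production implementations (such as those described in "Release It") add thread safety, failure-rate windows, and per-backend tracking:

```python
import time

class CircuitBreaker:
    """Minimal sketch of the circuit breaker pattern."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold  # consecutive failures before tripping
        self.reset_timeout = reset_timeout          # seconds to wait before a trial request
        self.failures = 0
        self.state = "closed"                       # closed = requests flow normally
        self.opened_at = 0.0

    def call(self, do_request):
        if self.state == "open":
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.state = "half_open"                # allow one trial request through
        try:
            result = do_request()
        except Exception:
            self.failures += 1
            if self.state == "half_open" or self.failures >= self.failure_threshold:
                self.state = "open"                 # trip: fail fast from now on
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        self.state = "closed"                       # success closes the breaker again
        return result
```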

Antipattern 3: Crashing On Bad Input

The third antipattern is crashing on bad input. We call this a poison pill, or sometimes a query of death. It's another thing that can cause cascading failures, because if you allow a request to your system to cause a crash, that means a request can reduce the capacity of your service. Combine this with retries, and you can picture your backends being knocked down like dominoes by a malformed request: the first one goes down, the request gets retried against a second one, that goes down, and so on. This can be the result of a malicious attack, done deliberately, but it's often just bad luck. It's a best practice never to crash or exit on unexpected input. A serving program should only ever crash if you believe its internal state is wrong or corrupted in some way, such that it would be unsafe or wrong to continue serving. There's a practice called fuzz testing, which can be really helpful for detecting unintentional or intentional crashes on particular inputs. There are fuzz testing frameworks available now; it's worth looking into what exists for your programming language. It's especially useful if you're getting inputs from outside of your organization, untrusted inputs, where you don't control the clients.
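As one example of what such a test can look like, here is a small property-based sketch using the Hypothesis library; parse_request and RequestError are hypothetical stand-ins for your own parser and its documented rejection error:

```python
from hypothesis import given, strategies as st

class RequestError(Exception):
    """Expected, well-defined rejection of bad input."""

def parse_request(raw: bytes) -> dict:
    # Hypothetical parser: reject anything that isn't UTF-8, but never crash.
    try:
        return {"body": raw.decode("utf-8")}
    except UnicodeDecodeError as exc:
        raise RequestError("not valid UTF-8") from exc

@given(st.binary())
def test_parser_never_crashes_on_arbitrary_bytes(raw):
    try:
        parse_request(raw)
    except RequestError:
        pass  # cleanly rejecting bad input is fine; any other exception fails the test
```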

Antipattern 4: Proximity-based Failover

Another thing to think about: if you have a geographically distributed service and a setup that lets you fail over from a failed region to another region, that is not an antipattern in itself. But if you have this pattern, you want to think about imposing maximum capacities, or really overprovisioning, to deal with the possible implications of a failover. Think about this scenario. We have a couple of data centers on the U.S. West Coast, a couple on the U.S. East Coast, a couple in Europe, and then a bunch in APAC. If we lost one of our U.S. East Coast data centers in this topology, we can say almost for sure that most of that load would go to the second data center in the same region; it would be quite unlikely that much of it would go anywhere else. Ideally you have a system that can cap the load and redirect some of it elsewhere. For example, some DNS systems let you impose a maximum request limit if you're using DNS load balancing; NS1 has this capability, and it's pretty good.

If you don't have something like that, you could end up overwhelming the service in your second East Coast data center. Then, soon enough, if that fails due to the excess load, you could spill load over to your U.S. West Coast data centers and overwhelm those as well, taking down each of your data centers one by one. Geographically balanced systems like this need to either make sure that load fails over in a controlled way that respects the maximum capacity of each data center, or else maintain a lot of spare capacity everywhere. IP anycast systems, like DNS and a lot of CDNs, don't really have that control, because of the nature of IP anycast: traffic just goes to the nearest instance in network-distance terms. You don't have any control, so you tend to have to overprovision those services and run a lot of smaller nodes, which can be quite expensive.
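To make the capacity-respecting version concrete, here is a minimal sketch that spreads a failed datacenter's load across the survivors in proportion to their spare capacity, and refuses (so you can shed load instead) when there is not enough global headroom; the datacenter names and numbers are entirely hypothetical:

```python
def redistribute(load, capacity, failed):
    """load/capacity: dicts of datacenter -> requests per second."""
    new_load = {dc: rps for dc, rps in load.items() if dc != failed}
    spare = {dc: capacity[dc] - new_load[dc] for dc in new_load}
    to_move = load[failed]
    total_spare = sum(spare.values())
    if to_move > total_spare:
        raise RuntimeError("not enough global headroom: shed load instead")
    for dc, headroom in spare.items():
        # Fill each surviving datacenter in proportion to its spare capacity.
        new_load[dc] += to_move * headroom / total_spare if total_spare else 0.0
    return new_load

# Losing us-east-1 fills the survivors exactly to (but not past) their capacity.
print(redistribute(
    load={"us-east-1": 800, "us-east-2": 700, "us-west-1": 500},
    capacity={"us-east-1": 1000, "us-east-2": 1000, "us-west-1": 1000},
    failed="us-east-1",
))
```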

Antipattern 5: Work Prompted by Failure

A fifth antipattern is anywhere that work gets done in response to a failure. There's actually a fairly famous example of this that happened inside Google, during one of the first big leap second outages. One in seven of the Network Time Protocol daemons failed and gave a different result from the other six. What that meant was that a bunch of the storage nodes were deemed to be unreliable or broken, which led to a wave of copying. The problem is, if we have a system that rebalances under-replicated data blocks when one node fails, like the one with the lightning bolt here, we have work that gets done on failure, and that failure increases load on other parts of the system. That increase in load could cause those instances to fail, or appear to have failed, which causes more load, and so on.

That works fine if you just lose one or two blocks, or one server out of many. If you start losing a big chunk of your service, like a whole rack, or if you lose one in seven of your servers because one of your seven Network Time Protocol daemons has failed, then the serving capacity of the system gets reduced at the same time that it needs to do a lot of extra work. You get a lot of re-replication, and you end up with a feedback loop. The way to weaken this feedback loop is to delay replication, because sometimes the failure is transient, or to use a token bucket algorithm that limits the amount of in-flight replication, to put a brake on the reinforcing loop and prevent the feedback cycle from running away. This is a very common pattern for preventing cascading failures.
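Here is a minimal token bucket sketch to cap the rate of re-replication work; the rate and burst numbers are illustrative, and start_replication and requeue_for_later are hypothetical stand-ins for the real replication machinery:

```python
import time

class TokenBucket:
    """Minimal token bucket used to throttle failure-triggered work."""

    def __init__(self, rate_per_sec=5.0, burst=10.0):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = burst
        self.last = time.monotonic()

    def try_acquire(self) -> bool:
        now = time.monotonic()
        # Refill tokens for the elapsed time, up to the burst capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

def start_replication(block):
    print(f"replicating {block}")

def requeue_for_later(block):
    print(f"deferring {block}")

def maybe_replicate(block, bucket):
    # Throttling puts a brake on the reinforcing loop; a transient failure may
    # resolve before the deferred block ever needs to be copied.
    if bucket.try_acquire():
        start_replication(block)
    else:
        requeue_for_later(block)
```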

Antipattern 6: Long Startup Times

The final antipattern is long startup times. Avoid services that do a whole lot of work before they can serve. One reason is that it makes autoscaling difficult; another is that it can prolong an outage, because if you have any kind of coordinated failure, you are stuck waiting out that long startup period before your services come back up. That is not ideal. This, in combination with a poison pill, a query of death, or an overload, can cause a lot of problems.

Summary

The TL;DR here is: where there's failover, there's risk of cascading failure. Know what the antipatterns are, and think about how they can apply to your systems. Know what to do if you end up in one of these situations: possibly block your load, possibly turn off health checks, get your backends into a healthy state, and then ramp the load back up again.

Resources

If you want to know more, there is an article I've linked here, www.infoq.com/articles/anatomy-cascading-failure, written by me, which has even more detail, including lots more references to different cascading failures and public write-ups from different companies.

Questions and Answers

Betts: I wanted to start by asking about the different ideas you had for how to avoid these cascading failures in the future. How much of that is on individual development teams to handle, and how much is a platform, SRE, and whole-system problem to solve?

Nolan: I think that's very much an organization-dependent question, because different organizations have different models. Some shops are very much "you build it, you run it," and some shops split those responsibilities up a little bit more, so it depends. In all cases, though, an SRE organization is well placed to be thinking about these concerns. In many cases, you would do this in conjunction with developers. Things like avoiding crashing on bad input, that's a code-level thing; that's definitely something developers should be paying attention to. It's also something you could potentially cover in a CI pipeline, if you had the fuzz testing I was talking about.

Betts: It might be that the SRE team helps identify some of the common antipatterns, and then gives advice to say, "The developers need to go implement these details. We can't solve them for you, you have to take care of them."

Nolan: Yes, in some cases. Then other things like the geographical failover, I think is more obviously infrastructure related.

Betts: You mentioned, or alluded to, an organizational solution. Do you have any concerns with getting business buy-in for what can sound like a very technical problem? Like, "Oh, the servers went down," that's a tech problem, not a business problem, even though your business is now down.

Nolan: Yes. This is the endless struggle of how you prioritize reliability, things that may never happen, against immediate needs, things that you need right now. Unfortunately, that is not something that I have an immediate, easy solution to. That is, I think, one of the quintessential problems in software reliability. This is why we have started to see this trend towards SLOs and error budgets; it's an attempt to put numbers on your reliability and your reliability risks. One of the places where error budgets and SLOs fall down as a warning mechanism is in giving us insight into these kinds of systemic risks. I think that is one of their weaknesses. I don't necessarily have a good answer for this, apart from: an organization that wants really high reliability needs to figure out a way to prioritize this work, or you won't get it.

Betts: You referenced that causal loop diagram a couple of times. I had one question on that: there's an arrow in the middle of it, a counterclockwise, anticlockwise arrow. What does that mean?

Nolan: That means it's a reinforcing loop. I should say that these diagrams are a little bit complex. I did a course with the Open University in the UK called "Systems Thinking in Practice," where we covered not only causal loop diagrams but also a bunch of other tools from the systems thinking disciplines. What a reinforcing loop means is that you have that potential for the vicious cycle. It's not a balancing loop, which has a mixture of pluses and minuses.

Betts: Is that something you're able to surface in your dashboards or any observability tools to watch one of those loops as it's happening, seeing it unfold and getting out of hand?

Nolan: That would be the observability dream, wouldn't it? Sadly, no, I do not know of a way to do that. Potentially in the future, when we have a world with more service-mesh-type things in our paths, we can maybe glean a little bit more. When we can, as an industry, do a better job of automatically gleaning the flow of traffic and the influence of retries, it could get easier. I find that for the most part, and in most places, the information is tied up in a piece of configuration for this application here, or something hardcoded in the code over there, and it's very hard to observe and generalize. I've never seen a monitoring system that was capable of constructing anything like a causal loop diagram. I would love that. That's a Holy Grail for me, on my project roadmap, actually.

Betts: Someone asked for you to repeat the name of that Open University course.

Nolan: It was called, "Systems Thinking in Practice."

Betts: I've seen people say, "We've seen this problem," or they have one type of failure and they think, just slap a service mesh on it, or insert technology solution here, and that'll solve our problem. I feel like that's a naive approach. Do you ever encounter that? How do you tell them it's not that simple?

Nolan: No technology will solve all of your problems. Sometimes some technologies can solve some of your problems. A service mesh can be a useful thing: it can bring consistency of observability to your system, so you get a standard set of metrics for your services and a standard configuration. That actually can be helpful. The downside is a whole bunch of extra complication to manage. Nothing is free; everything comes at a cost. I will say that these technologies can work really well. For the years I spent at Google, I think Google would have told you that it didn't have a service mesh, per se. It had a very standardized RPC library and framework, and a set of load balancing control planes, and if you squint at it, that's basically a service mesh. It worked brilliantly, because it was so baked into the infrastructure that you didn't have to think about using it, and there was a whole group of people who were very good at running the system, who pretty much ran it and kept it out of your way.

I think that's a very different situation from a small startup, or even a medium-sized company, asking, should we use a control plane, or should we use a service mesh? You will have a whole bunch of work to do to get to that point, because it is not baked into your infrastructure already. A lot of the benefit of these tools comes when they are fully baked into all of your infrastructure. You have a migration to get there, and you have extra services to run. Whether or not it's worth it very much has to be figured out on a case-by-case basis, based on what the rest of your roadmap looks like, what your priorities are, and how many resources you have to throw at the problem. It's all connected. It's all holistic.

Betts: You have to have someone looking at the whole system, and saying, the problems we have in the system are different than the problems we have in one little part of the system. How do you prevent that from cascading over? I think you referenced Martin Fowler and Michael Nygard's "Release It," and circuit breakers. There's also bulkheads. Were there any other resources or terms that people should be aware of for, follow these types of patterns to avoid these?

Nolan: Not off the top of my head. "Release It" is an excellent book. I strongly recommend anybody who cares about distributed systems to read "Release It." It is just essential. I'm surprised it isn't more popular than it is. I feel like it should be one of the books. It's certainly a very popular book, but it's essential in my view. Read it and do those things.

Betts: One bad practice is to assume that resource limits are managed by our runtimes. I think there's a bigger concern of, "I don't have to worry about that. Someone else has taken care of that problem. I won't get overloaded." Is that a common misassumption, that I don't have to worry about it?

Nolan: Yes. Again, this very much depends on the specific organization, and how those responsibilities get carved up. I think for most of us, yes, we need to worry about hitting our resource limits. When you start thinking about it, there are resource limits everywhere: any sort of quota for any external system that you need to talk to, all your physical compute limits, your RAM, disk, and network, and then logical ones too, things like mutex contention. These limits are everywhere. The particularly insidious thing about resource saturation issues is that you can run into them very suddenly and without warning, so it bears thinking about. It's one of those things I'm always looking for whenever there's a design change happening to any of the systems I run. Is this adding a new resource dependency? What is it? How do I monitor it? How do we get warning if we start approaching those limits? How can we fail more gracefully, and not slam into a wall? These are hard questions, but I think it's useful to look out for this in all your designs, all your changes, all your existing systems.

Betts: I think there are two types of things you have to worry about: scanning for the cliff edge, actually being aware that you might be coming up on it; and then, once you've gone over the edge, how do you deploy a parachute, how do you fix it? Do you agree that those are two different problems you have to solve? You have to be doing the first one, but you also have to be ready for after the cascading failure?

Nolan: I do think so. It is very useful to know, how can I turn off the load to the system and turn it back up gradually, because that can save you at times. How you do that very much depends on your deployment setup. It's useful to know where those knobs are. Exactly as you would know how to turn off the water in your apartment in case you get a flood. You also want to do your preventative maintenance to make sure you do not get that flood.

Betts: I think a lot of the advice you gave can seem counterintuitive, things like turning off health checks. You'd expect to hear, "Don't do that, you need those on. That's how you know the service is back up," but you realize the health checks are compounding the problem. Is that something you run into in this domain, that you have to give counterintuitive advice?

Nolan: Yes. I've done this in real life as well. I definitely got a bit of side-eye when I said, "The health checks have to go off now, just temporarily." We have to realize that there are two modes of operation: your regular mode, where you're thinking about spot failures, single one-off problems, and how you respond to them; your health checks are great there, and your retries are perfect. Then you have this other mode, where the world is on fire. Oftentimes, the things that help in the first world, the everything-is-mostly-fine world, will really hurt you in that other world. You have to be able to toggle between them in some way. Again, know how to turn your health checks off, if you have to.

 


 

Recorded at: Jul 11, 2021
