
Continuous Resilience



Adrian Cockcroft talks about how to build robust systems by being more systematic about hazard analysis, and including the operator experience in the hazard model.


Adrian Cockcroft joined Amazon as their VP of Cloud Architecture Strategy in 2016, and leads their open source community engagement team. He was previously a Technology Fellow at Battery Ventures. Before joining Battery, he helped lead Netflix’s migration to a large scale, highly available public-cloud architecture and the open sourcing of the cloud-native NetflixOSS platform.

About the conference

InfoQ Live is a virtual event designed for you, the modern software practitioner. Take part in facilitated sessions with world-class practitioners. Connect, see, and speak with like-minded people. Join us to accelerate your learning, be better informed, and drive innovation.


Cockcroft: My name is Adrian Cockcroft from AWS. I'm going to talk to you about continuous resilience. In the past, many of us have had disaster recovery plans. Nowadays, it's a bit more fashionable and there's a lot more work going on in chaos engineering to generate systems that really can recover and are well-tested from a recovery point of view. In the future, though, what I think people really want is continuous resilience. I'll explain what I mean by that, and some ideas about how to get there.

This matters right now, because what we're seeing is data center to cloud migrations that are happening for most business and safety critical workloads. We're seeing this across AWS partners, and all kinds of customers in lots of different types of industries. A lot of these industries say, "We really don't want chaos in production." It just doesn't sound good. If we change the name from chaos engineering to continuous resilience, will you let us do it all the time in production? Some of them say, "Continuous resilience, definitely want that. That sounds like a good thing." Just don't tell them that it's the same thing and we just changed the name.

We tend to have this fairy tale that once upon a time, in theory, if everything works perfectly, we have a plan to survive the disasters we thought of in advance. It didn't really work out that well. It's not a good bedtime story. What happens if you forget to renew your domain name? A SaaS vendor did this. Their product was hard down. They had no email. Every service that they'd signed up to, used the same domain, so they just had no domain. They were offline for a day or so. The CEO got good at apologizing on Twitter. Security certificates expire, they take systems down, lots of systems. If this hasn't happened to you, then you just haven't been running for very long. Another friend of mine discovered that computers don't work underwater. It's not a good idea to put machines in the basement. It's also not a good idea to put the generators in the basement. It turns out, it's not a good idea to put the fuel tanks for the generators in the basement either. All these things were learned when there was a lot of flooding. Hopefully, bad things don't happen to you tomorrow. How can we think about how to build more resilient systems?

Learning from System Failure

This is a really critical book, because nobody sets out saying, "Today, I'm going to build a system that's going to keel over every time somebody looks at it strangely." No, we all set out with the best of intentions. We do our best work, and the systems still fall over. This is actually the normal state. "Drift into Failure" is a book that explains why this happens. It really highlights that you need to capture and learn from near misses in particular, and then test and measure your safety margins so you can see how much space you have before things go wrong. There's another book that really talks about ways to get there, "Engineering a Safer World" by Nancy Leveson. This brings lots of interesting new acronyms. This is a systems theory approach to failure modes. There are a few different things here. The overall model is called STAMP, Systems Theoretic Accident Model and Processes. STPA is the technique that's used for actually doing hazard analysis.

Observability - STPA (Systems Theoretic Process Analysis) Model Control Structure

I'm going to explain what STPA is. To start with, you draw a control structure. These are the control flows in the system. You can see there are three things here. At the bottom, there's the control process, which has some inputs and outputs being disturbed a bit. Then there's an automated controller that's watching over it with sensors and using actuators to manage it to control against these disturbances, and to control against the inputs because the input or the disturbance could both affect the behavior. Then watching over the automated controller is a human controller who has displays. They have controls. They sometimes go and hit the actuators directly. Sometimes they look at the sensors and the control process directly, but mostly they interact by watching over the control system, the automated controller.

They have written procedures, and they have training, and there are environmental inputs. Have they got a hangover, or are they distracted by something else? They're generating control actions based off a model of the automation and how it behaves, and a model of the control process. There are a few different things here. The automated controller also has a model of the control process, but, obviously, it's not the same one, because what you automate and what's in a human's head will never be the same. This is a very generic diagram, but it's incredibly useful. It actually applies to all kinds of things. For example, think about the Boeing 737 MAX 8. The pilots knew how to fly the plane, and that's the control process. Their model of the control process was correct. If they're just given the stick, they can fly it. The automated controller that was doing anti-stall had been updated, so the model of the automation that was in the heads of the human controllers, the pilots, was wrong. When the anti-stall system kicked in and started controlling the plane strangely, they were just completely confused. They didn't know what to turn off, or how to manage it. That is a recent catastrophic example of what happens if the models get out of sync.

This is what observability is all about. We've heard more people talking about observability recently. Observability is a control theory term; it was defined in 1960 as part of the terminology of control theory. What we're looking at here is what you can see about the system that you're managing. Do you have the right information to manage it? That is what observability is actually about in its purest form. The model here is to understand hazards. Let's take the abstract model, where I've overwritten some of the words to give you something that's more relevant to what you might see. We have a web service. Customers are sending requests to it. They're getting completed actions out. That's the data plane. There's a control plane, like an AWS autoscaler, a fraud control algorithm, or other kinds of algorithms. Anything that is not the business logic itself, all the code you wrote that sits outside that business logic, is the control plane. Then the human controllers are looking at throughput to make sure that the customer completed actions are moving fast enough. If something weird happens, they'll go look and see what's going wrong.

STPA Hazards

Here are some of the different hazards. There are three sets of checklists, which are very simple and easy to run through. For each of the arrows here, in each of the different parts of the system, you ask what could go wrong. To start with, look at model problems. Model mismatch, as I just mentioned with the Boeing 737 MAX 8. Missing inputs, where the model just doesn't know about something it needs to know. Missing updates, where things are just not coming in. The updates might be coming in too fast. How many times have you looked at a log file zooming by on the screen and wondered what was actually going on? Updates may be too infrequent. If you're getting an update once an hour for something that's changing every few seconds, that's obviously not going to work. Updates might be delayed. It's very common for the observability display system to be several minutes behind the real world. Typically, if you've got 1 minute interval updates, it takes several minutes for those to arrive. Then there are coordination problems, where maybe you have more than one model and they don't agree. Then there's degradation over time, where the model you've got is actually getting out of sync with the system that it's modeling, just from the normal rate of change.
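A few of the model-update items on this checklist can be checked mechanically. The following is a minimal sketch, not anything shown in the talk: the `Update` record, the `model_update_hazards` function, and the thresholds are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Update:
    event_time: float    # seconds: when the measurement was taken
    arrival_time: float  # seconds: when the model or display received it


def model_update_hazards(
    updates: List[Update],
    now: float,
    change_period: float,  # how often the real system changes, in seconds
    max_delay: float,      # acceptable lag between event and arrival
) -> List[str]:
    """Return the checklist items that this update stream trips."""
    if not updates:
        return ["missing updates"]

    hazards = []
    ordered = sorted(updates, key=lambda u: u.event_time)

    # Updates too infrequent: the average gap is longer than the rate of change.
    if len(ordered) > 1:
        span = ordered[-1].event_time - ordered[0].event_time
        mean_gap = span / (len(ordered) - 1)
        if mean_gap > change_period:
            hazards.append("updates too infrequent")

    # Updates delayed: the display lags the real world.
    worst_delay = max(u.arrival_time - u.event_time for u in ordered)
    if worst_delay > max_delay:
        hazards.append("updates delayed")

    # Missing updates: nothing has arrived for a long time (assumed cutoff).
    if now - ordered[-1].arrival_time > 2 * max(change_period, max_delay):
        hazards.append("missing updates")

    return hazards


# Hourly updates, each arriving 3 minutes late, for something that changes
# every 5 seconds.
hourly = [Update(0, 180), Update(3600, 3780)]
print(model_update_hazards(hourly, now=3800, change_period=5, max_delay=60))
```

The hourly example trips both the "too infrequent" and the "delayed" checks, which matches the two situations described above.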

STPA Hazards - Human Control Actions

Here's another typical thing. All of the arrows that point down here are controls that manage the layer below them. In this case, the human control action is what I'm looking at, but you can apply this checklist to all of the control flows. Did you just not provide the control you were supposed to, or did you do something unsafe? Did you do it too early or too late? Did you do it in the wrong order? Maybe you started but then got confused and stopped too soon, because you didn't see the results coming from the system. Maybe you spent too long applying it, again, because there's a lag in the sensors. There are conflicts, where maybe there's more than one human controller and they aren't coordinating over what they should be doing, and two people are independently changing the system. Then there's degradation over time: the runbooks and documentation that tell you what to do are probably out of date. Those are the control side.

STPA Hazards - Sensor Metrics

If we look at the sensor side, that's basically all the arrows pointing up. You run through the same list. Are updates arriving at all? Are you just getting zero all the time? Did you get a numeric overflow and suddenly have a very large or a very negative number for some reason? Did the data get corrupted or truncated? Is the data arriving in the right order? Again, the flow of information may be too rapid, too infrequent, or delayed. Again, there are coordination problems and degradation over time. Your sensor data works fine, then there's some software change, and you didn't notice that gradually it's just not reporting the right thing. Those are the generic things that could go wrong.
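Some of these sensor questions translate directly into checks on a metric stream. This is a rough sketch under stated assumptions: samples are `(timestamp, value)` pairs, and the `sensor_hazards` name and the `max_plausible` bound are hypothetical, not part of any real monitoring API.

```python
import math
from typing import List, Tuple

Sample = Tuple[float, float]  # (timestamp_seconds, value)


def sensor_hazards(samples: List[Sample], max_plausible: float) -> List[str]:
    """Run a few of the sensor checklist items over one metric stream."""
    if not samples:
        return ["no updates"]

    hazards = []
    times = [t for t, _ in samples]
    values = [v for _, v in samples]

    # Are you just getting zero all the time?
    if all(v == 0 for v in values):
        hazards.append("zero all the time")

    # Numeric overflow or corruption: a value far outside the plausible range.
    if any(math.isnan(v) or abs(v) > max_plausible for v in values):
        hazards.append("overflow or corrupted value")

    # Is the data arriving in the right order?
    if any(later < earlier for earlier, later in zip(times, times[1:])):
        hazards.append("out of order")

    return hazards


stuck = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
wrapped = [(1.0, 10.0), (3.0, float(-2**63)), (2.0, 12.0)]
print(sensor_hazards(stuck, max_plausible=1e6))
print(sensor_hazards(wrapped, max_plausible=1e6))
```

Slow degradation over time is harder to catch with a point check like this, which is one reason to keep re-running such checks continuously rather than once.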

What Happens If There's a Big Enough Disturbance to Break The Web Service?

What does it mean for the system to have a disturbance that's big enough to cause a failure? The definition of a large scale failure is that the control plane is unable to compensate for the disturbance. If it can control it without human intervention, then it's still in control. Maybe the automation itself failed. The network could have partitioned, so there's no route from the application to customers, and the control plane can't fix that. Or the application crashed or got corrupted and isn't easily restartable. To mitigate that, we're going to add some more redundancy to the system and replicate the web service in an independent environment, say, another availability zone. We're going to add a new routing service and a routing control plane, to look over the top of this and send the data in and out. I've connected the dots for the different arrows. Those are all the same arrows. Then I had to add something for the routing service to send the data to and from the web services. There's a lot more to go wrong here, and much more complexity for the human controller to try and model. The worry we have is that we're trying to make the system more reliable, but if the least reliable part is the model the human controller has of how it works, quite often you're actually making it worse.

The Cloud Provides Consistent Automation

Cloud provides consistent automation that implements modern control planes that have useful symmetries. This is very difficult to do in data centers, but it's much easier to do in cloud. These architectural symmetries simplify the human controller's mental models, and you then use tooling to enforce that symmetry. This is the essence of the highly available architecture that I worked on at Netflix about a decade ago. It was a very symmetric model. There's a high level of automation. The configuration as code is consistent. Everything is configured the same way with the same code so all the systems are the same. There are no handcrafted things in there. It's all cattle, no pets, if you know that analogy. Things like instance types, services, and versions are consistent across zones and regions, if you're working with the same cloud provider. There may be a slight version mismatch for a very short period of time when something's being rolled out. That's the extent of what you need to watch out for. The principle is, if something is the same, make it look the same from the user's model point of view. If it's different, make that clearly visible. You don't want to take things that are different, paper over them, and try to make them look the same when they're going to fail and behave differently. The other thing is, if you say it's the same, test your assumptions. Really prove it's the same continuously.
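The "test your assumptions" principle can be sketched as a symmetry check that runs continuously and reports any setting that differs between zones. The zone names, setting keys, and values below are illustrative assumptions; in practice they would come from whatever inventory or configuration-as-code tooling you use.

```python
from typing import Dict, List

ZoneConfig = Dict[str, str]


def symmetry_violations(zones: Dict[str, ZoneConfig]) -> List[str]:
    """List every setting that is not identical in all zones."""
    violations = []
    all_keys = sorted({key for cfg in zones.values() for key in cfg})
    for key in all_keys:
        # A setting missing from one zone counts as a difference too.
        seen = {zone: cfg.get(key, "<missing>") for zone, cfg in sorted(zones.items())}
        if len(set(seen.values())) > 1:
            detail = ", ".join(f"{zone}={value}" for zone, value in seen.items())
            violations.append(f"{key}: {detail}")
    return violations


deployed = {
    "zone-a": {"instance_type": "m5.large", "service_version": "1.4.2"},
    "zone-b": {"instance_type": "m5.large", "service_version": "1.4.2"},
    "zone-c": {"instance_type": "m5.large", "service_version": "1.4.1"},
}
for violation in symmetry_violations(deployed):
    print(violation)
```

A version mismatch like the one in `zone-c` is expected briefly during a rollout; the point of the check is to make the difference clearly visible rather than hidden.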

Scenario: AWS Availability Zones

That previous diagram was getting a bit complicated, and I wanted to do a triple replication, so I just made this simpler version of it. It's the same thing, just without all the lines on top, and with the three zones A, B, C at the bottom. This lets me apply this to the AWS availability zone model. We have some assertions here. One of the assertions is that the services and the data stored are consistent across three zones. Any data that's written, you will write to all three places. The failure modes are independent. We get to that by having availability zones that are between 10 and 100 kilometers apart, with separate data links and separate power supplies. They're not in the same flood zone. Those kinds of things. You're also asserting that the system will work with a zone offline. It doesn't matter which zone. I can remove any one of these zones, and the system will continue to run on the remaining two zones. Those are the assertions.

If we do have a zone failure, then the router control plane has to detect that, stop routing traffic to the failed zone, and retry the in-flight requests that failed on the AZs that are still online. This is an automated response, so it should automatically do this. Two-thirds of the traffic should be unaffected. The third that was going to the zone that failed should hit some retries and then keep going. That's what we'd like to see.
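The intended behavior (detect the failure, stop routing to the failed zone, retry on the online zones) and the assertion that any single zone can be removed can both be sketched as a toy simulation. This is an illustrative model only, not the AWS routing implementation; `Router`, `make_zone`, and `survives_loss_of` are hypothetical names.

```python
from itertools import cycle
from typing import Callable, Dict, List


class ZoneDown(Exception):
    """Raised by a simulated zone that is offline."""


def make_zone(name: str, online: Dict[str, bool]) -> Callable[[str], str]:
    def handle(request: str) -> str:
        if not online[name]:
            raise ZoneDown(name)
        return f"{request} served by {name}"
    return handle


class Router:
    def __init__(self, zones: Dict[str, Callable[[str], str]]):
        self.zones = zones
        self.healthy = set(zones)
        self._order = cycle(sorted(zones))  # simple round-robin
        self.retries = 0

    def route(self, request: str) -> str:
        attempts = 0
        while self.healthy and attempts < 2 * len(self.zones):
            zone = next(self._order)
            if zone not in self.healthy:
                continue  # stop routing traffic to a zone known to be down
            attempts += 1
            try:
                return self.zones[zone](request)
            except ZoneDown:
                # Detect the failure, then retry on the next online zone.
                self.healthy.discard(zone)
                self.retries += 1
        raise RuntimeError("no healthy zones left")


def survives_loss_of(failed: str, zone_names: List[str], requests: int = 9) -> bool:
    """Check the assertion that the system works with `failed` offline."""
    online = {name: name != failed for name in zone_names}
    router = Router({name: make_zone(name, online) for name in zone_names})
    responses = [router.route(f"req-{i}") for i in range(requests)]
    return all(not r.endswith(f"by {failed}") for r in responses)


names = ["a", "b", "c"]
print(all(survives_loss_of(zone, names) for zone in names))
```

Looping over every zone in turn is the important part: the assertion is that it doesn't matter which zone is removed, so the test should not pick a favorite.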

Using STPA to Look at What Might Go Wrong

What could go wrong? Let's go through those model problems again. The confused human controllers disagree amongst themselves about whether they need to do something or not. They're seeing floods of errors. The displays are lagging reality by several minutes. They have out of date runbooks. That doesn't sound good, because their models are not aligned. Then if we look at the actions, they should do nothing. The system is actually taking care of itself. They should call up and ask, when is this zone going to come back online? They shouldn't touch the system. If they're confused and working separately, they try to fix different problems. Some of the tools don't get used often, so they're broken or misconfigured to do the wrong thing. Looking down the list at the left, you've got all this brokenness happening. I crossed out a few that don't apply, because in this automated environment, the humans are not supposed to provide an input when a hazard occurs. They're supposed to just watch and pay attention, and make sure that everything is happening ok.

Finally, if we look at the sensor side of things, again, the routing control plane doesn't clearly inform humans everything's taken care of. Maybe it's delayed too long. The broken zone causes a huge flood of errors. Everything that was trying to talk to it at that time times out, so the system is just throwing errors in all directions. That flood of errors could actually just break the entire logging system. You may get outages, or overloads caused by those errors. That's just trying to get a little bit more detail into how you can use STPA to look at what might go wrong. You can use this on your own systems.

Availability Zone Failover Testing Is the Biggest Win

I think availability zone failover testing is one of the biggest and easiest wins, because cross AZ data replication is synchronous and consistent. Failover should be automatic. You should only have a few seconds impact if you tune your system up. You'll only get a few seconds impact if you regularly test it and you make sure, because the first few times you try this, I'm sure your entire system will collapse, even if you think you've got it right. Once you have it working reliably in test, obviously, then you get to try it in production really carefully. Ultimately, you should be able to run this in production without anyone noticing. We got to that state at Netflix probably 8 or 9 years ago. Then we got really solid at doing that. Then a year or two later, we actually went to multi-region with a similar set of principles, and had a good attempt at getting that as well.

Multi-Region Failover Is Getting Easier To Do, But?

I'm working on a multi-region hazard analysis. Obviously, it's far more complicated. Service support for multi-region from AWS is increasing rapidly. We have lots of services now that have various types of replication, and we're doing more to make this easier. Multi-region failover is getting easier to do, but cross region data replication is asynchronous and eventually consistent. Failover is usually manually initiated. Primary-secondary failover has downtime for apps like marketplaces that need consistency and so run in one region at a time. Active-active failover minimizes downtime for consumer services like the Netflix example. It's much harder to regularly test recovery from region failure in production, it's significantly more expensive, and it's still hard to get right.
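To see why asynchronous replication makes region failover harder than zone failover, here is a toy model (an assumption for illustration, not any AWS service's API): writes are acknowledged by the primary region immediately and copied to the secondary later, so anything still queued at failover time never reaches the secondary.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple


class AsyncReplicatedStore:
    """Toy primary/secondary store with a replication backlog."""

    def __init__(self) -> None:
        self.primary: Dict[str, str] = {}
        self.secondary: Dict[str, str] = {}
        self._pending: Deque[Tuple[str, str]] = deque()

    def write(self, key: str, value: str) -> None:
        # Acknowledged as soon as the primary has it.
        self.primary[key] = value
        self._pending.append((key, value))

    def replicate(self, max_items: int) -> None:
        # Background replication, which may lag behind the writes.
        for _ in range(min(max_items, len(self._pending))):
            key, value = self._pending.popleft()
            self.secondary[key] = value

    def fail_over(self) -> List[str]:
        """Primary region is lost; return keys that never reached the secondary."""
        lost = [key for key, _ in self._pending]
        self._pending.clear()
        return lost


store = AsyncReplicatedStore()
for i in (1, 2, 3):
    store.write(f"order-{i}", "placed")
store.replicate(max_items=2)   # replication is one write behind
print(store.fail_over())       # the write that was still in flight
```

For something like a marketplace, that in-flight write is an acknowledged order the secondary has never seen, which is the consistency problem that pushes such apps toward running in one region at a time.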

As data centers migrate to cloud, these fragile and manual disaster recovery processes should be standardized and automated. Testing this failure mitigation is going to move from this scary annual experience to automated continuous resilience.


There's some papers, if you want to get into the details of how to configure networks, the "Building Mission Critical Financial Service Applications on AWS" is a really good paper. Pawan Agnihotri is a solutions architect, manager, and a very senior network architect. I've written a bunch of blog posts. "Failure Modes and Continuous Resilience" is an earlier version of this work. More recently, I've written a couple of blog posts on, "Why Are Services Slow Sometimes" and something about timeouts and retries, if at first you don't get an answer. You'll find source code for a lot of my slides and my talks on GitHub. If you want more book recommendations, there's a link there.




Recorded at:

Dec 27, 2020
