Managing Failure Modes in Microservice Architectures



Adrian Cockcroft explores how to apply some industry standard techniques (including Failure Modes and Effects Analysis) to cloud native microservices architectures. He looks at how chaos engineering techniques are driving the industry from annual datacenter disaster recovery testing of monolithic applications to continuous resilience assurance for cloud native microservices.


Adrian Cockcroft joined Amazon as their VP of Cloud Architecture Strategy in 2016 and leads their open source community engagement team. He played a crucial role in developing the cloud ecosystem as Cloud Architect at Netflix and later as a Technology Fellow at Battery Ventures. Prior to this, he held positions as Distinguished Engineer at eBay and Sun Microsystems.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.


Cockcroft: This is new content. It's pretty much the first time I've presented it. I was working on my slides yesterday. This is really the first installment of a series. I'm setting out the groundwork for some work I'm expecting to do a lot more of in 2020. First, I'm going to start with, why do I care about this? This is what's really happening. I'm starting to see more AWS customers not just move the front-end or the back-end for a mobile app or something like that to the cloud; they're closing all their data centers and moving everything to the cloud. This includes airlines - I'm a power user of airlines, I want all the airlines to keep flying - finance, healthcare, manufacturing.

We have substantial businesses that are moving their critical infrastructure, and they're moving that to the cloud. As they do that, we've got to develop the patterns that make that work reliably. We got to understand the failure modes. We've got to have a much more sophisticated discussion about what could go wrong and what to do. That's why I'm poking at this, and if anyone wants to talk about that if you're in a safety-critical industry, I'm very happy to continue the conversation. In the past, we had disaster recovery. Back in the '70s, in this study, you had a mainframe over here, and you could have another mainframe over there, and you'd copy stuff back and forth. If something went wrong, you could switch to the other one. That was the traditional disaster recovery model. That then became more institutionalized and that's why I have a backup data center. I'll talk a bit more about that later.

Then, between Amazon and Netflix, and a few other people, we came up with chaos engineering, a decade or so ago. We were starting to induce failure. The difference in chaos engineering is it's API-driven. Everything was highly automated. We could start using automation to do some of the disaster recovery and do it a little bit more proactively. What I think we're heading towards is continuous resilience. I had this question - nobody seems to want chaos in production - do you want chaos in production? If I call it continuous resilience, will you let me do it in production? It's really just changing the name. We're doing exactly the same thing. When we're doing it in production, we'll call it continuous resilience because I think that's easier to sell to management. Everybody wants continuous resilience, but they're not sure about anything with the word chaos in it. It's a bit of a branding thing. What I'm really talking about here is trying to productize and automate the things that the unicorn companies have done with chaos engineering, and turn that into a common pattern that we'll be able to use across the industry as building blocks. We're in that transition right now.

Availability, Safety and Security

Here's the real problem: you can only be as strong as your weakest link. Whether it's security or availability, that one chink in your armor is the thing that will break. The only way you can find that chink is by having somebody go there and test every link to find where the weaknesses are. This doesn't just apply to availability; it applies to safety and security too. They all have similar characteristics. They have hard-to-measure near misses, because near misses are invisible, and those are the ones you need to watch out for. They have complex dependencies that are hard to model, and they have catastrophic failure modes. When it goes wrong, it really goes wrong. We get stories almost every day: something went wrong, someone ran out of capacity at launch, or there's a security failure, or an outage, or something like that.

They also have similar mitigations, and the way you protect against all these things is to have some layered defense in depth. If one thing breaks, there's another thing behind it to save you, and behind that maybe there's another thing. Depending on how critical it is, there's layers of defense. That applies to security, availability, safety. They're all the same.

There's bulkheading. If something goes wrong, you want to contain it. You want to stop it from spreading beyond that. That's one of the reasons that microservices work. If a microservice crashes, it doesn't crash everything. If a monolith crashes, the whole thing's gone. That bulkhead is part of it. Also, you really want to minimize dependencies and privilege. If you think about a monolith that touches credit cards, the whole monolith has to be PCI compliant and is subject to all these controls. If you have a set of microservices, the ones that touch credit cards are probably in a different account and managed very differently from the ones doing personalization, or the ones choosing what color the screen should be, or whatever, the user interface. You can iterate and go much faster and have much lighter-weight controls on the parts of your system that have minimum privileges and dependencies, and you can really concentrate on the pieces that need it. That's another reason for spreading things apart and not bundling everything into one big monolith.

They also break each other. How many times has a security failure taken down your system? In fact, if there's a breach going on, the first thing they do is shut everything down, so you're not available. That's one of the things. Then the other thing is, if a system that's managing safety crashes, you don't have the safety. If your self-driving car controller crashes, then it might drive you off a cliff or something. You don't want these things to happen. These things are all linked. We can think about all the different linkages. What I'm going to concentrate on today is the availability piece. We'll extend this, and some of these techniques can be used for security and safety as well. I'm particularly interested in safety right now, in safety-critical industries.

What Should Your System Do When Something Fails?

What should your system do when something fails? There are a few choices. One of the things you can do is stop: if you're not sure what you should do, if it might not be safe to continue, you should stop. The other thing is maybe try and carry on with reduced functionality. That sounds nice. What actually usually happens is, it collapses horribly, because the least well-tested code in your system is the error handling code, and the least well-tested processes you have are the disaster recovery processes. It's just a fact of life. These things are the ones that no one wants to touch because if anything goes wrong, then it makes it even worse. We've got lots of examples of a small failure triggering a bigger failure and blowing up to take out an entire system, an entire company. We've seen airports and airlines go down because of a single router or a single power supply failure, or a small error in one place. Nuclear power station failures, those kinds of things. Think about this.

Think about a permissions lookup. It's a fairly minor thing. You're running through your code and you say, "Should this customer be allowed to do this thing?" and you get an exception back, so you can't tell, because the subscriber service is down. Should you stop or continue? It depends. What are you about to do? Are you about to move a billion dollars from one bank account to another, or are you about to show somebody a movie? If you're gonna show someone a movie, just keep going. We eventually trained most of the Netflix engineers to do permissive failure, because the cost of showing them that movie if you weren't sure if they were a subscriber in good standing was negligible. It's probably not even a cent. If you're in banking, and you're not quite sure what state the world is in, you stop. That's the way they operate.
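The stop-or-continue choice described above can be sketched in a few lines. This is a minimal illustration, not how Netflix or any bank actually implements it; the names `FailurePolicy`, `SubscriberServiceDown`, `is_allowed`, and `broken_lookup` are all hypothetical.

```python
from enum import Enum


class FailurePolicy(Enum):
    FAIL_OPEN = "fail_open"      # permissive: continue when the check is unavailable
    FAIL_CLOSED = "fail_closed"  # strict: stop when the check is unavailable


class SubscriberServiceDown(Exception):
    """Raised by a (hypothetical) subscriber service client that cannot answer."""


def is_allowed(customer_id, lookup, policy):
    """Return True if the action may proceed.

    `lookup` is any callable that returns True/False for a customer, or raises
    SubscriberServiceDown when the permission service cannot be reached.
    """
    try:
        return lookup(customer_id)
    except SubscriberServiceDown:
        # We cannot tell whether the customer is in good standing.
        # The decision depends on the cost of being wrong, not on the error.
        return policy is FailurePolicy.FAIL_OPEN


def broken_lookup(customer_id):
    """Stand-in for a permission service that is currently down."""
    raise SubscriberServiceDown("subscriber service unavailable")


# Showing a movie: being wrong costs almost nothing, so keep going.
can_play_movie = is_allowed("customer-1", broken_lookup, FailurePolicy.FAIL_OPEN)
# Moving money: being wrong is expensive, so stop.
can_transfer_funds = is_allowed("customer-1", broken_lookup, FailurePolicy.FAIL_CLOSED)
```

The point of passing the policy in explicitly is that the caller, who knows what is about to happen, makes the decision, rather than the error handler guessing.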

There's a really good paper on this. You saw Pat [Helland] yesterday morning. He's got lots of great papers. "Memories, Guesses, and Apologies" is a high-level view of what databases are. Databases are things that try to remember what you told them. Then they try to guess what you told them before. If they can't figure out what to do next, they try to apologize for it. The apologies part is actually the bit that's interesting here. It talks about, sometimes rather than writing a whole lot of code to try and capture every condition, you say "Call this number." Punting to customer service, punting to a human, it's actually one of the best strategies for handling failure modes. You can overdo it, but it's actually better than running through a whole lot of code which is very poorly tested and probably going to make it worse.

Quite often the thing to do is handle the easy stuff, and when it gets complicated, punt to a human. Humans have a very adaptable way of being able to manage failure modes. There's a whole section on this. There's a book called "The Safety Anarchist" by Sidney Dekker, which is really worth reading. If you look at the way processes are written and the way the processes are actually operated in safety-critical industries, they're different. If you automate the way it was written down, you're not actually doing the right automation. The human element is there to make it safe.

Who here has a backup datacenter? A few of you. How often do you fail over apps to it? If you work in banking, the answer should be at least once a year because the auditors come by and make you do it. If you don't work in banking, the answers are usually never or very occasionally. I had one person who gave the best answer I've ever had, which is every weekend. "If the week number is even, we're on this datacenter. If the week number is odd, we're running on that datacenter. We flip it every weekend. If anything goes wrong on a Wednesday, we push the button, we flip, and everyone carries on and it's perfect for us." That is the best operational discipline I've ever heard of, and I've heard it once.
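The even/odd week rule in that answer is easy to express in code. The sketch below is only an illustration; the function name `active_datacenter` and the site labels `dc-even` and `dc-odd` are made up, and it assumes ISO week numbering, which the speaker did not specify.

```python
from datetime import date


def active_datacenter(day, sites=("dc-even", "dc-odd")):
    """Pick the active site from the parity of the ISO week number."""
    week = day.isocalendar()[1]  # ISO week number, 1..53
    return sites[week % 2]


first_week = active_datacenter(date(2020, 1, 1))   # ISO week 1 -> "dc-odd"
second_week = active_datacenter(date(2020, 1, 8))  # ISO week 2 -> "dc-even"
```

One caveat with this naive rule: an ISO year with 53 weeks is followed by week 1, so two odd weeks occur back to back and the flip is skipped that weekend (for example, the last week of 2020 and the first week of 2021).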

At Netflix, the last I heard, they do a region evacuation test every couple of weeks. They're testing stuff reasonably often. One of the big banks we were working with did a one-time region evacuation test. It was a lot of work and it went well, but they don't do it every few weeks. It's something that people do as a test. That's apps. What about the whole datacenter at once? Does that ever happen? Typically, the auditor makes you test your critical apps, but they don't pull the plug on the datacenter. This is very rare. I call this "Availability Theater." You've spent a lot of money building a big datacenter, and if you ever had to use it, you know that you'd be in terrible shape and it's horrible and all kinds of things would go wrong. The only time you actually test datacenter failure is when it actually fails. That's when you discover how little of it actually works.

A Fairy Tale

We had this nice fairy tale we're operating on: once upon a time, in theory, if everything works perfectly, we've got a plan to survive some of the disasters we thought of in advance. It's not a great story. It's not a good bedtime story; it's one that keeps you up at night. Here's a few things. Think about your company's domain name. Think about what would happen if that domain name no longer existed or was taken down because somebody forgot to renew it. How much of your systems would work?

It turns out there was a SaaS provider who had this happen a couple of years ago. All of their internal email was down, so that means all of the accounts they had to do anything were on that email address. All of their product was on the same domain. What was left? One Twitter handle. CEO apologizing on Twitter for about two days, solid. I've already got this, "I'm sorry, we're trying to fix it." That was the only communication channel they had left that was branded to that vendor. Don't be that company. Everyone hopefully is being frightened now. DNS is the fastest way of taking down just about any service. We'll look a bit at that later.

This has happened to pretty much everybody in the room that's been running anything. I can just about guarantee that you have had a certificate expire. It took down Netflix a bunch of times. It's happened to AWS. We have so much process around trying to find these stupid certificates and make sure they don't all expire at the same time, and have tracking systems for them. The operational effort going into just this one thing is huge. We're pretty good at it. Every now and again, some new service comes up and sneaks another one in, but the mature ones are pretty good at this. That happens all the time.

It turns out computers don't work underwater. It's very sad, really. Then when the water goes away, they're full of bits of sand, and seaweed, and dead fish, and things. They still don't work. A friend of mine had a lovely time rebuilding a datacenter that was no longer in a basement in Jersey City. Another tip: don't put the generators in the basement. That's not a good plan. Putting the fuel supply for the generators in the basement is also not a good plan. Try to keep stuff above flood level, if possible. Also, hopefully, it's not due tomorrow. Watch out for that.

There's a good book here, "Drift into Failure." The point of this book is that at every stage you're making the right decision, you're doing the right thing. It actually isn't good to test your backup systems because you know they're not gonna work and it's gonna be expensive, and things might never go wrong anyway. The locally right thing to do is to just concentrate on making your product better and hope that you don't get the failure. That is actually the rational thing to do. These rational decisions get put into a chain until every possible mitigation has gone away, and then you do get this massive failure. I tell people don't read the book on a plane because it's mostly full of plane crashes and people dying in hospital, and stuff like that. While I was flying to Chicago for GOTO Chicago a year or so ago, I read an entire section of this book about a plane crash while I was on a plane, made myself do it, and then gave this talk about plane crashes. I had to learn about that failure.

What you need to do is capture and learn from near misses. The airline industry is very good at this. Every time you get stuck at the gate because one tiny thing was wrong with your plane, that tiny thing is being logged centrally and shared with every airline, shared with every manufacturer. They all share the engines. They share the airframes. All of that information goes into a central database that is managed for the safety of the airline industry. That is one reason that putting ourselves in metal tubes and shooting ourselves across continents, which is one of the scariest things we could probably ever think about doing, is actually very safe. It's because it's so dangerous that they got very good at making it safe. When things aren't that dangerous, you don't spend so much effort on it.

Chaos Architecture

This is the architecture I think about. You've got infrastructure at the bottom. You're going to have redundancy there. There's some switching between things. If you've got two ways to do something, you have to have a way to decide one of them isn't working and switch to the other. That switch has to be really robust and reliable. That's the problem with disaster recovery. Your switching processes, and code, and practices are not well-tested. If you've got an unreliable switch between your primary and backup datacenter, you might as well just not have the backup datacenter.

Above that, you got the application. Make sure your code doesn't fall over when it gets annoying error messages and things. Then at the top, you've got people because given a perfectly good system, people will mess it up. If you're not trained, you don't know what it's doing. You'll reboot the wrong machine, you're not practiced. You need game days and you need to practice the exercises that will let you know what to do when there's a failure.

This is defense in depth for availability. You need experienced staff that have been on calls. They know how to handle the main failure modes. You've got robust applications that you've done a bit of fuzz testing, and you've tried to break a few times. You have a dependable switching fabric, and there's a redundant service foundation.

This part of the talk, I've given a few times. It's an intro to why this is interesting. The next part is what I'm presenting for the first time. I'm just bridging into it. One of the things about failures is that, as you get good at handling them, they start to become more and more strange. The easy ones, you've sorted. Then, every error you get is something you've never seen before in some weird combination of things. The best defense is to get fast at figuring it out.

I did a talk about five years ago at Monitorama, where I was complaining about the fact that a lot of monitoring tools gather data every minute. Then they actually sit on the data for a few minutes, and about six or seven minutes after something actually happened in the real world, you finally have enough points on a graph to say, "That doesn't look good." Then you try to fix it, and you have to wait another six or seven minutes before the graph gets better again. A few minutes is maybe fine for some things, but at the rate we're running online web services, that could be too long. It's very hard. The feedback loop is too slow.

If you have one-second updates for some of your critical measures, then within 30 seconds, you can tell something's gone wrong. One of the effects here is you've deployed some code, and you go, "I can't even tell whether this is good or not for five minutes, I'm going to go get a cup of coffee." You're off by the coffee machine, and everyone starts running around because the site's down. "What's going on?" Eventually, you wander back and discover you broke the site while you were off chatting to people over a cup of coffee. If you know you're going to be able to see it in 30 seconds, you'll wait to make sure it's OK before you go and get the cup of coffee. 10 seconds is great for responsiveness and 30 seconds is the maximum; after that, you're going to be off doing something else. Human attention span is an important characteristic of these safety systems.


What we want to do then is make it possible to see what's going on. Observability is a very important characteristic. The word observability was really defined in a paper in 1961 by Kalman, who did a lot of work in control theory. If you can tell how a system is going to behave by looking at its inputs and outputs, then it's observable. That's actually a strong property of a control system. What that means is that if it's doing something inside, you have to have a way to look inside it and do logging or expose some of the internal behaviors so that you can predict what its outputs are going to be. That's really what we're doing. Adding logging to a system is just making it more observable. Sticking printfs in your code, that's observability.

All these things are models though. What you have outside the system is a model of how you think it's going to behave. All models are wrong, some models are useful. The key here is to have one that's simple enough that it's tractable to work with, but complicated enough that it captures the behavior of the system.

This book gets my vote as the most interesting book right now. They run workshops. There's a workshop next March, and I've actually applied to go to it. I want to go hang out with Nancy Leveson and the team at MIT. I'm really into systems thinking in general, I've done talks on that. Using systems thinking to look at safety, there's a lot of interesting work here. The systems that are described in this book are a couple of fairly insignificant systems, like the air traffic control system and the nuclear launch system. This is the team that worked on making those planes not crash into each other, and on not accidentally launching nuclear missiles all the time. The software in those systems needs to be done right. There's a lot of examples in there.

I'm going to dig into this a little bit and try and explain how this works. This is a screenshot from the book. It's got three levels in it. If you look at this, the bottom's the controlled process. That's the thing that does the work. Above it, you've got the Automated Controller. A lot of the code we write is actually controlling things. It's controlling the process. Think of auto-scalers, or a lot of the security code that does permission checking, "should you do this thing?" All of that is control. Above that is a human, and the human is trying to understand, is the automatic control working right and is the thing itself working right?

I'm going to change the labels on this to maybe make it make more sense. Here's a financial services application at the bottom, and customers are making requests to it. At some point, it says, "Whatever you've requested has some completed action." It's just some part of the flow of some financial services application. Above that, there's a control plane, which is maybe auto-scaling the number of machines that are running it, or doing fraud analysis to decide whether the system should be allowed to do things, or something like that. You've got all these controls that you've put around your code. Above it, there's a human watching this. If the human sees throughput go to zero, they're supposed to pay attention and go, "What happened?" Maybe an alert gets sent to them, then they say, "Is the control plane telling me the wrong thing? Is the metric telling me the wrong thing or is it really going to zero?" You don't know. Your model of automation is your model of how that control plane is working, and the model of the controlled process is your understanding of how the actual thing works. If you imagine getting somebody that's never seen the system before and putting them in charge of it, and they get the alert, they don't have the model. You're saying, "What should we do?" "I don't know. Call someone. Phone a friend." The written procedures and training are about trying to get people to understand the model of the behavior of the control system, and the model of the thing being controlled. There are two separate models, and a lot of the problems we have are in the usability of that interface.

This is the way STPA works. Instead of looking at the boxes and saying, "What could go wrong with this box?" we look at the wires. We look at the connections between them. We look at the information flows. In this case, let's say the human controller sees throughput go to zero. When the throughput goes to zero, the human is supposed to generate a control action, but there are all these different things that can go wrong. The human was not paying attention, so the action was not provided. They didn't do the action, so maybe the system is out for a while. Or they do an unsafe action, rebooting everything in sight maybe.

The system was actually running fine, but they got freaked out by something and did something that wasn't necessary. Maybe the reason that the throughput went down was the Super Bowl just started on TV, which is something that causes about half of Netflix's traffic to go away. In fact, we had a big outage once in Brazil and Mexico at the same time because Brazil and Mexico were playing each other at soccer, and everyone stopped all the TVs with big DoS attack on all the people trying to watch Netflix on TV.

There are external reasons why throughput might drop, and you might get freaked out, and do something that isn't needed. That's an unsafe action, or safe but too early. Then you might say, "The system's down," and you provide the right action, but too late. You could do things in the wrong order. Maybe you see throughput go up, and the system getting unhappy, and you want to increase the limit on the auto-scaler but you don't increase it enough. That's a sort of stopped too soon. You didn't provide enough of a fix, or you took too long. Then there's conflicts. Maybe there isn't one human controller, there's a room full of them and they can't agree what to do. It's that kind of conflict, coordination problems between people. Maybe two people independently fix the system without talking to each other and the combination of the two breaks things. Or, it degrades over time, meaning maybe the written training procedure about how to fix this particular problem was written years ago, and the system's changed enough that the procedure no longer reflects it.

This is a good checklist for hazards. If you're looking at how to control your environment, and you're looking at that human interface, this is the standard list of hazards. Go through them one by one. Think about, how would this apply to my system? You can go through and you can make a list, and think, "What would we do here? How are we going to make sure only one person at a time is making a change, or get everybody on a conference call and have that coordination happen on the call?" You can't make a change unless you're dialed into the call. Those kinds of rules are there to give that coordination. We've got to follow them.
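Going through the list one by one can be made mechanical. The sketch below simply crosses each control action with each hazard category to produce review questions; the category wording is paraphrased from the talk rather than taken from the STPA book, and the control action names are hypothetical.

```python
from itertools import product

# Hazard categories paraphrased from the talk's description of control actions.
CONTROL_ACTION_HAZARDS = [
    "not provided",
    "unsafe action provided",
    "safe but too early",
    "safe but too late",
    "wrong sequence",
    "stopped too soon",
    "applied too long",
    "conflicts between controllers",
    "coordination problems",
    "degrades over time",
]


def hazard_checklist(control_actions):
    """Return one review question per (control action, hazard) pair."""
    return [
        f"What happens if '{action}' is: {hazard}?"
        for action, hazard in product(control_actions, CONTROL_ACTION_HAZARDS)
    ]


# Hypothetical control actions for the throughput example.
questions = hazard_checklist(["raise auto-scaler limit", "restart service"])
```

The output is only a prompt for discussion. The value comes from the team answering each question for their own system.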

Here's another area. What if the system, just for some reason stops reporting the right numbers? It's working fine, it's completing its actions, but it stopped reporting throughput. What happens? Maybe it gets frozen at the last number, so you just stop getting updates. Maybe you start getting zeros. Maybe the number overflows, and you're getting effectively random numbers or negative numbers for throughput or it's corrupted. There's some bit error somewhere along the way and some random number comes through, that might trigger an alarm. Particularly, if you've got a control algorithm, that it's automatically applying things, if you feed a random number into it, you're going to get weird stuff. You got to be careful what those control algorithms are doing.

Out of order - sometimes things get delivered in the wrong order. Something gets delayed, something else gets past it, and you can get issues with that. Maybe the system just starts sending things too quickly and your sensor, your reporting system, your monitoring system collapses under the load. That can happen a lot in a fast-growing business. Or it just gets delayed. Like I was pointing out before, if it takes eight minutes to notice something has gone wrong, that may be too long. Again, coordination problems, and it may get bad over time. Say you've got a memory leak in whatever is recording your sensors, or garbage collection, and after a while the GC pauses start getting too long and the system starts getting unreliable. It's fine when you first deploy it, and it gets bad over time. Run through this list, try to figure it out. Go read the book, there's a lot more detail on how to do this.
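Several of these sensor hazards can be checked before a reading is fed into a control algorithm. The following is a minimal sketch under assumed thresholds; `Sample`, `check_sample`, `max_age`, and `max_value` are illustrative names and values, not part of any monitoring product.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    timestamp: float  # seconds since epoch, as reported by the sensor
    value: float      # e.g. requests per second


def check_sample(prev, current, now, max_age=30.0, max_value=1e9):
    """Return a list of sensor hazards detected for `current`.

    `prev` is the previously accepted sample (or None). The thresholds are
    illustrative assumptions, not recommended values.
    """
    problems = []
    if current.value < 0 or current.value > max_value:
        problems.append("out of range (overflow or corruption?)")
    if now - current.timestamp > max_age:
        problems.append("delayed")
    if prev is not None:
        if current.timestamp < prev.timestamp:
            problems.append("out of order")
        elif current.timestamp == prev.timestamp:
            problems.append("frozen (no update)")
    return problems


previous = Sample(timestamp=100.0, value=50.0)
healthy = check_sample(previous, Sample(101.0, 52.0), now=102.0)
reordered = check_sample(previous, Sample(99.0, 52.0), now=102.0)
```

A reading that fails these checks is better withheld from an automated controller and surfaced to a human than acted on. A genuine zero is deliberately not flagged here, since throughput really can drop to zero; telling a real zero from a broken sensor needs a second, independent signal.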

I'm just trying to give you a flavor for this mechanism because the thing about this is, I'm not saying, "Did the app fail?" I'm saying, "Am I providing the right information to the other parts of the system? What could go wrong with that information?" It's looking at the wires rather than the boxes in the diagram.

Another problem: model problems. My model of automation: I wasn't trained in how the system worked, so I have the wrong model. I think the system behaves in a certain way, and it actually behaves in a different way. Or, I don't have the right inputs, like the system isn't actually out of CPU, it's out of memory, but I have no data on the memory usage. Or I'm not getting updates, or similar issues. You can have a problem with the model in the human or the model in the control plane as well. Think about what could go wrong there. Here's a particular example of that: the Boeing 737 Max 8, a well-known problem right now. The model of automation that was in the pilot's head had not been updated from the previous 737. This is one part of what went wrong. When the plane started nose-diving, they didn't know why. They couldn't understand that the anti-stall system, which was a new system, had a model of a controlled process and it was automating something. It was trying to manage the system, but their model of the automation had not been updated. There's obviously other reasons, but one of the big contributing reasons was that they wanted to make the plane as similar as possible to the last one. They didn't want to retrain everybody, to make it cheaper to introduce, but that was a critical piece of training for a system that turned out to crash the planes.

That's STPA. I think it's a really interesting way of looking at what goes wrong. It's very good at the operator interaction and at thinking about the overall control stability, so it's a top-down method. You're looking at the system from above, and you're working down. I'm going to be developing more models on that in future versions of this talk. We're now going to do more of a bottom-up version, looking at risk.


If we look at risk, typically financial risk, it's usually severity and probability. What's the probability of something going wrong and how bad would it be? Multiply those two together, and you've got some idea of risk. Economic and financial risks are usually thought about in this way. However, in engineering risk, detectability also matters. This is because if a plane crashes, you know the plane crashed; that's a big thing. If you lose money, you can count the money. If there's some weird thing happening deep in something and you can't see it, you're building up failures and you're building up risk that you can't see. One way to mitigate risk is to add observability to expose silent failures. It actually reduces the risk because you can tell when failures aren't happening, you can tell when they're starting to happen, and you can get in there and mitigate something before it actually takes you out. This is at a component level, and there's a bunch of prioritization you can do here to basically figure out what's the highest risk you should work on.
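In FMEA practice this product of three rankings is commonly called a risk priority number (RPN). The sketch below shows the arithmetic on a 1 to 10 scale; the failure modes and their scores are invented for illustration and are not taken from the speaker's spreadsheets.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    name: str
    severity: int       # 1 (negligible) .. 10 (hazardous without warning)
    probability: int    # 1 (very unlikely) .. 10 (almost certain)
    detectability: int  # 1 (always detected) .. 10 (silent, cannot be detected)

    @property
    def rpn(self):
        """Risk priority number: severity x probability x detectability."""
        return self.severity * self.probability * self.detectability


def prioritize(modes):
    """Sort failure modes so the highest-risk item comes first."""
    return sorted(modes, key=lambda m: m.rpn, reverse=True)


# Illustrative scores only.
modes = [
    FailureMode("certificate expiry", severity=7, probability=5, detectability=3),
    FailureMode("silent data corruption", severity=8, probability=2, detectability=9),
    FailureMode("datacenter flood", severity=10, probability=1, detectability=1),
]
ranked = prioritize(modes)  # RPNs: 144, 105, 10

# Adding observability lowers the detectability score, and with it the RPN.
with_observability = FailureMode("silent data corruption", 8, 2, 2)  # RPN 32
```

In this made-up example, the silent failure outranks the flood even though the flood is more severe, and better detection alone takes it from the top of the list to well below the certificate problem.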

FMEA is a technique invented in the 1960s, so it's an old engineering technique. Anyone who did mechanical engineering or electrical engineering was taught it in college. They don't seem to teach it to software developers, but I think it's a useful technique because it gives you a way to prioritize, and list what you're doing, and discuss things. I'm going to go through some FMEAs now.

The way I'm layering this is four different FMEAs. There are four different groups of people that should get together to think about what could go wrong. First, the people that write the unique code for whatever business you're in, the unique business logic. Product managers need to be in the conversation because a lot of what goes wrong touches customers and the behavior of the system, along with the developers that are writing that code. That unique business logic is a layer that you want to think about just in isolation.

Underneath that is what I'm calling the software platform team. This is all the code you got from somewhere else that you didn't write. What could go wrong with that? Some of that code is libraries, some of it is running systems, commercial off-the-shelf software that you've installed. It's cloud web services. All of that stuff that you depend on, what could go wrong with that? That includes things like cloud control planes, those sorts of services.

Underneath that is infrastructure: hardware, networks, buildings, machines, power, and those kinds of failures. Finally, there's resilience engineering. Things can go wrong with the teams that are trying to manage the processes of failure: not knowing which phone number you're supposed to call into when there's an incident, not knowing how to log into the dashboards, or having a monitoring system that fails along with the system. There are failures of observability which can actually leave you blind, and that's a whole different set of failure modes. Whoever's managing that should go and figure out what could go wrong.

I've got sample versions of these spreadsheets. I put stuff on my GitHub account. I put slides on GitHub, I put PowerPoint in GitHub, in the faint hope that one day Microsoft might let me do a pull request against an individual slide in a PowerPoint, since they own this thing, but no. Anyway, it's a theory. I keep mentioning that and maybe one day, they'll integrate that. This deck isn't up yet, I'm still working on it.

If we look at FMEA, there are severity levels and we do a ranking of 1 to 10. I modified this a bit for infrastructure. The effects on the left are the standard names. Hazardous without warning. What does that mean? It means an earthquake or a meteorite destroyed your whole building - no warning, people injured. No warning. What could take you out with no warnings? For earthquakes and meteorites, you don't get much warning.

The next level down, you get a couple of days' warning, because they knew that Hurricane Sandy was coming for several days. You have time to do something to the system, but it still could destroy everything. The next one is floods. The building is sort of ok, but the stuff in it has been busted. Then if you have a fire, you get partial failures. The fire suppression system will take it out. You probably still have some of your data, but you lost some.

The ones I put in red are where you've got some permanent loss of information. You've lost some capacity or you lost some data, and it's not coming back. Those machines are dead, those disks are dead. The yellow ones are temporary. It's down, but if you put power or cooling back, it's going to come back. There are various levels of failure mode there. There's a bit on the bubble here, where quite often if you power cycle a set of machines, they don't all come back. Not every disk starts back up, not every machine starts back up. Sometimes that disruption of power cycling can trigger latent faults in the machine. It's a way of thinking about it. People say, "Datacenter went down." Do you mean it's never going to come back, or is it something different? If it's a smoking hole in the ground, that's a different outage than if it just lost power for a few hours.

Then we have some standard probability rankings. This is just a standard set, an exponential scale for how likely this thing is to happen. You guess something at this level. The idea here is to be relative. You're not trying to be very accurate. You're just trying to get a relative probability. Then detectability, this is how well you can detect it. If you have no way of measuring this thing, you have to give it a high ranking for unlikely to detect. If you can probably see it, you're in the middle, and if you've got really good monitoring, observability, and alerts set up, and people pay attention to the alerts, you'll be down at the bottom there. Don't get too hung up on picking the number. It's just trying to get a number.

FMEA Example

Let's look at an example. What could go wrong with this? You could say, Lookup, and then no response. Ok, DNS is down. That's one thing that could go wrong with the system. I got my mobile app and whatever. This is that "I forgot to renew my domain" or something.

Let's look at another one. Let's assume we've managed to look it up. We're going to send in a request. There's no route to that, this is a fast fail. If there's no route to host, it comes back immediately. It says, "No. You've given me this IP address, there's no one at home." Maybe the service is down or whatever. There's something going wrong here, but there isn't a route, so the network routing is down. This is a fast fail. Fast fails are nice because you can track them quickly.

Connect to host - hang around, wait a bit, it takes a little while. Undeliverable, but it timed out getting there. You have to wait tens of seconds or something before you get a response back. This is another failure mode. For every microservice that calls every other microservice, you can get these kinds of failures in them. This is looking at that low-level thing, looking at the protocol as a series of failure modes.
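One way to make these protocol-level failure modes visible is to classify the exception a connection attempt raises, so that each mode can be logged and counted separately. The following is a minimal sketch using Python's standard socket module; the function names and the label strings are illustrative choices, not something from the talk.

```python
import socket


def classify_failure(exc: BaseException) -> str:
    """Map a connection exception to one of the failure modes described above."""
    # Order matters: gaierror, timeout, and ConnectionRefusedError
    # are all subclasses of OSError.
    if isinstance(exc, socket.gaierror):
        return "service unknown (DNS lookup failed)"
    if isinstance(exc, (socket.timeout, TimeoutError)):
        return "request undeliverable (timed out, slow fail)"
    if isinstance(exc, ConnectionRefusedError):
        return "service unreachable (connection refused, fast fail)"
    if isinstance(exc, OSError):
        return "service unreachable (network error, fast fail)"
    return "unclassified"


def probe(host: str, port: int, timeout_s: float = 3.0) -> str:
    """Try a TCP connection and report which failure mode, if any, occurred."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return "connected"
    except OSError as exc:
        return classify_failure(exc)
```

Keeping the classification separate from the network call means the mapping can be checked without touching a real network, and the labels can feed directly into logs or metrics.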

Here's another one: connect. Let's try again. It worked in the end, but you want to be logging something about the timeout, maybe. If you can measure the timeout, maybe log that. This is what that looks like in an FMEA. What I've got here is I'm trying to call an API endpoint, and these are those four situations I just ran through but set out in a spreadsheet, where it says service unknown, service unreachable, request undeliverable, delivered but no response. You've got all these different things here. I've made up some numbers for what I think they should be. Severity is the first column, probability is the next one, and detectability is the third one. You multiply them together, so 5 times 1 times 10 is 50, because I have no way to know. If DNS is down, nothing can get to your site. In fact, your site will just be fine.

DNS is down for some proportion of your customers, and it's an invisible failure. The only way you find out is when customers start tweeting or phoning your support line. I've seen that failure a few times. It's very hard to detect. How do you work around it? Maybe you use an endpoint monitor, like those DNS endpoint monitors. There's a number of ways you could try and catch that. I think people, in general, underestimate the number of ways DNS can take you out, and it's an area people should put more effort into. That's quite a high score of 50 for that. The other ones are pretty unlikely and not going to be that much of a problem, because you can retry things.
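A DNS endpoint monitor can be as simple as resolving the name on a schedule and alerting when it stops resolving. The sketch below uses the standard library's `socket.getaddrinfo`; the function names are hypothetical, and the `resolver` parameter exists only so the check can be exercised without a network. A single probe only sees DNS from where it runs, so catching a failure that affects only some customers would presumably need probes in several locations.

```python
import socket
from typing import Callable, List


def resolve_addresses(hostname: str,
                      resolver: Callable = socket.getaddrinfo) -> List[str]:
    """Return the sorted unique IP addresses for hostname, or [] if lookup fails."""
    try:
        results = resolver(hostname, None)
    except socket.gaierror:
        return []
    # Each result is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address string.
    return sorted({info[4][0] for info in results})


def dns_is_healthy(hostname: str,
                   resolver: Callable = socket.getaddrinfo) -> bool:
    """True if the name resolves to at least one address."""
    return len(resolve_addresses(hostname, resolver)) > 0
```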

The previous one was TCP level, "Can I get the packet that...?" Let's pop up a level and look at some authentication. "I'm this user." "No, I don't care. I'm not going to talk to you." We're going to log that it took a certain amount of time to say this user failed to authenticate for some reason. Here's how you do that as an FMEA. There are a couple of cases: either it fails completely, or it's slow and unreliable. The scores are reasonably high, so you should probably make sure you log and alert and have some way of tracking these things. You can work through, up one level. This is now authenticating. Once I get through this, I have a connection that's authenticated and I can actually send a request on.

Finally, I can say, "Get my homepage." All kinds of things could come back here. Think about the things that could come back. Think about, you should log them. You should log how long they take, what you were doing, what went wrong. Quite often the logging doesn't have all the information you'd want.

Here are a few examples of things that might go wrong: time bombs, things that die over time, like counter wrap-around and memory leaks. Everyone here should be subscribed to the End of Unix Time. If you're on Facebook, there's an event you can subscribe to. It's 2038. It's when the number of seconds since 1970 goes negative as a signed 32-bit number, and that's Unix time. You can just subscribe to it and then we'll have a little party or something. Another 19 years. QCon 2038, we'll get together and get on our Zimmer frames and whatever. Date bombs, leap years, leap seconds, epochs wrapping around, Y2K - are you testing for those? How would those break your system? This is the stuff that could cause you to get a different response back.
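The 2038 wrap-around is easy to reproduce. This short sketch reinterprets a second count as a signed 32-bit integer and converts it to a UTC date; the helper names are chosen for illustration.

```python
import struct
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
INT32_MAX = 2**31 - 1


def as_int32(value: int) -> int:
    """Reinterpret the low 32 bits of value as a signed 32-bit integer."""
    return struct.unpack("<i", struct.pack("<I", value & 0xFFFFFFFF))[0]


def to_utc(seconds: int) -> datetime:
    """Convert seconds since the Unix epoch to a UTC datetime."""
    return EPOCH + timedelta(seconds=seconds)


# Last second a signed 32-bit counter can hold: 2038-01-19 03:14:07 UTC.
last_good = to_utc(INT32_MAX)

# One second later the counter wraps negative: 1901-12-13 20:45:52 UTC.
wrapped = to_utc(as_int32(INT32_MAX + 1))
```

A test for this kind of date bomb would feed timestamps on both sides of the boundary through any code path that stores or compares time as a 32-bit value.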

Content bombs are incoming data that crashes the app. I have a few examples of that, but I'm just going to skip them. You want to fuzz the input. You want to generate random things to see if you can crash things. Config errors, versioning errors, retry storms - I have a whole talk on retry storms I used to give. This is where chaos testing at the application level works, where you get your copy of Gremlin and you try to break one microservice at a time, or you get in there and just mess with it.
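Fuzzing the input can start very small. The sketch below throws seeded random byte strings at a function and records any exception other than a clean rejection. Everything here is illustrative: `fuzz`, `random_bytes`, and the deliberately buggy `parse_length_prefixed` are hypothetical, and a real setup would more likely use a dedicated fuzzing tool.

```python
import random
from typing import Callable, List, Tuple


def random_bytes(rng: random.Random, max_len: int = 64) -> bytes:
    """Generate a random byte string of random length (0..max_len)."""
    length = rng.randrange(max_len + 1)
    return bytes(rng.randrange(256) for _ in range(length))


def fuzz(target: Callable[[bytes], object], runs: int = 1000, seed: int = 0,
         expected: Tuple[type, ...] = (ValueError,)) -> List[Tuple[bytes, Exception]]:
    """Call target with random inputs; collect inputs that raise unexpected errors."""
    rng = random.Random(seed)  # seeded so that any crash is reproducible
    crashes = []
    for _ in range(runs):
        data = random_bytes(rng)
        try:
            target(data)
        except expected:
            pass  # a clean rejection of bad input is the desired behavior
        except Exception as exc:  # anything else is a potential content bomb
            crashes.append((data, exc))
    return crashes


def parse_length_prefixed(data: bytes) -> bytes:
    """Hypothetical parser with a bug: it reads data[0] without checking for empty input."""
    length = data[0]
    if len(data) - 1 < length:
        raise ValueError("truncated payload")
    return data[1:1 + length]
```

Running `fuzz(parse_length_prefixed, runs=2000)` should surface the empty-input case as an `IndexError`, which is the kind of crash a content bomb would trigger in production.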

The nice thing about a microservice is that it does one thing, and hopefully, you did it this way. It has one verb or one noun it implements. You ask it to do the thing and it either does it or doesn't do it, and you should be able to see how fast it does it and poke at it. That isolation of functionality makes microservices much more tractable for knowing how they behave under failure. When you have a monolith that does 100 different things, there's so much internal state and complexity. It's very hard to understand and to get good test coverage on it.

What I'm focusing on here is the interactions between the microservices, and what can go wrong there, and how those would propagate. Popping down a level, let's look at, for example, the cloud service control plane. What could go wrong here? You're making a call to EC2 saying, "Give me an instance." It says, "No." That's bad. "Why?" "We don't have any more of those left." That's one answer. We try not to do that too often. Or maybe the control plane's down or something's wrong. Something's unhappy with the system today.

You try calling to say, "Can I have an increased limit?", or you try a different instance type, or switch to a different zone or a different region. That's the provisioning part of EC2. There's a bunch of other things that could happen. You could get one, but it doesn't start, or it could just take a long time to start. You can think about what you would do in your situation. Similarly on the networking. One of the availability patterns that works well is to pre-allocate. If you're doing multi-region, pre-allocate all the networking so that in the middle of some outage handling, your networking is already there. The cost of the networking structures is very low. You should make sure that they are always set up everywhere you want to be, even if you're not currently using them. It's a good practice. That's the "pre-allocate network structures in all regions" item.

Or there's something like you can't increase the table size. There are all these kinds of things. You want to basically try to pre-allocate state so that your control plane isn't needed as much. If your systems are there and you can just send traffic to a backup system when you're doing a failover, that's nice. As long as you don't have to provision new stuff, that will be easier. There's a number of failures here that you can mitigate by pre-allocating. That's the point really.
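The provisioning fallback described above (try a different instance type, zone, or region when a request is refused) can be sketched as a loop over an ordered list of options. Everything in this example is hypothetical: `CapacityError` and the `launch` callable stand in for whatever a real provisioning API raises and exposes, and the instance types and zone names in the usage note are placeholders.

```python
from typing import Callable, Iterable, List, Optional, Tuple


class CapacityError(Exception):
    """Hypothetical error meaning the request was refused for lack of capacity."""


def provision_with_fallback(
    launch: Callable[[str, str], str],
    options: Iterable[Tuple[str, str]],
) -> Tuple[Optional[str], List[Tuple[str, str]]]:
    """Try each (instance_type, zone) in order.

    Returns the first successful result (or None) plus the list of
    options that were refused, so the refusals can be logged.
    """
    refused = []
    for instance_type, zone in options:
        try:
            return launch(instance_type, zone), refused
        except CapacityError:
            refused.append((instance_type, zone))
    return None, refused
```

Calling it with options such as `[("type-a", "zone-1"), ("type-a", "zone-2"), ("type-b", "zone-1")]` walks the list in preference order. Pre-allocating capacity avoids depending on this path at all during an outage, which is the point made above.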

If we look at the infrastructural level, I talked about this a minute ago, the availability zone durability, permanent destruction of a zone, fire or flood. You really want to make sure the system can run on two out of three zones in a region. The zones AWS has are at least 10 kilometers apart. They're not in the same floodplain. They're not in the same power supply. They're not on the same network interfaces. They have as little as possible in common. Then our regions are typically states apart cross country or in different countries, so we get a lot more separation.
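Running on two out of three zones has a capacity consequence that is worth working through. If the workload needs a given number of instances and must survive the loss of one zone, each surviving zone has to carry half of it, so the total provisioned is 150% of what is needed. A small sketch of that arithmetic, with function names chosen for illustration:

```python
import math


def instances_per_zone(required: int, zones: int = 3, zones_lost: int = 1) -> int:
    """Instances each zone must run so the survivors can carry the full load."""
    surviving = zones - zones_lost
    if surviving < 1:
        raise ValueError("at least one zone must survive")
    return math.ceil(required / surviving)


def overprovision_ratio(zones: int = 3, zones_lost: int = 1) -> float:
    """Total provisioned capacity divided by required capacity."""
    return zones / (zones - zones_lost)


per_zone = instances_per_zone(100)   # 50 instances in each of three zones
total = per_zone * 3                 # 150 instances provisioned for a 100-instance load
```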

Then, if you're trying to connect across regions, there's a bunch of different things. It's the same setup here in terms of the request types. I'm running through the protocol, "Can I look it up? Can I get to it?" It's the same set of things, but now we're thinking about it in terms of region connectivity, instead of thinking about it at application level. What can you do if you can't deliver packets to a region? After a while you go, "I'm just going to decide that this region is not behaving and I'm the networking trap. I cannot get to it. The network to a region is down or the network to a zone is down. I can't find ways of routing around." You've got to be able to block the traffic and show that your system knows how to reroute traffic to another zone or another region. That's what we're talking about here.

These are just examples. You can look at the spreadsheets and fill them out, put your own things in them, extend them. At the operations level, it's better, I think, to use STPA to look at the monitoring and operations failure modes, those hazards. Look at the hazards between the user, and what they see, and what the monitoring system shows you, what your dashboards show you, what the system's got. There are some low-level things, like just authentication. Maybe your monitoring agent cert timed out, so it can't authenticate, or the user can't log in, or whatever. There's a bunch of failure modes here just to do with authentication.

I tend to look at it as layers: TCP-level connectivity, then authentication, and then what happens at the application level in the traffic between systems. That's a pattern to follow, to think, "Have I covered everything that might go wrong?"

STPA has this top-down view of control hazards, and FMEA is bottom-up and lets you prioritize failure modes. You get better coverage with STPA, in general, for the things that go wrong, particularly for human stuff. You really want to use both in conjunction. There are good places to use both. Some people get religious about one or the other. I think there's lots of tools in the toolbox. These are interesting ones that can be useful.

Good Cloud Resilience Practices

One of the reasons chaos engineering happened when it did is that we had cloud and we were able to automate a lot of stuff. I'm going to go through some resilience practices just to wrap up here. One of them is this rule of 3. If there are three ways to succeed, then you're in good shape. If you want to replicate data, use three zones in a region, that's what the AWS services do. If you want to fail over from a primary region to a secondary region, it's actually good to have two different secondary regions, because if one failover fails, you want to have another option to fail over to, if you're really paranoid. Then there's workloads across three regions, active-active-active. That's the Netflix pattern that we set up, back when I was there a long time ago.

Another thing, if you're failing over, it's good to fail up. If you're worried about having a large workload and you're doing a multi-region primary-secondary failover, it's good to fail over from a small region to a big one. The regions aren't all the same size. There are some workloads which run fine in one region, but if you fail over to one of the brand new regions that just came out in some little country around the world, you may find we don't actually have enough capacity there to run it. We'd like to try and make them all look infinite, but they aren't all infinite, and you want to do some capacity planning and estimating.

The other thing is, during a failover there's lots of extra work to do. If you suddenly discover your application is latency-sensitive and stops working when it's further away, that's a really bad time to discover that. The best practice, say if you're a Wall Street bank, would be to run everything in Oregon, and then fail over to Virginia if everything goes wrong. You're more likely to succeed that way. If the problem is trying to survive a disaster, that's a useful attribute. The cost is that you've got a bit more latency, maybe, for your daily work.

Another thing is to build your resilience environment first. The first app that we got running really at Netflix was the Chaos Monkey, and everyone else had to put up with the Chaos Monkey because it got there first. We were enforcing autoscaling. We wanted to be able to scale down. If you think of an autoscaler scaling up and scaling down, to scale down, it has to be able to kill instances. The Chaos Monkey was there to enforce the ability to scale down horizontally scaled workloads. That was actually what it was for.

It was to make sure you didn't put stateful machines, stateful workloads, in autoscalers. Then you can have this badge of honor. "My app survived all of this chaos testing, and it's running in this super high availability environment, and your app didn't. Do you mean your app's not important?" You can gamify it a bit.

If you think about continuous delivery, which hopefully, many of you are doing now, you need test-driven development, you need canaries. If you put a check-in and it flows all the way through to production, you've got that automation. Think about continuous resilience as an automated testing environment which will stress test that something really does have all the disaster recovery failover mechanisms in there. You're not just testing the functionality, you're testing its ability to withstand all of these failover operations. If you get that in there and do it right, you'll have really high confidence that whatever you've got in production is going to be really resilient, and it makes those failure mitigations a very well-tested code path. Like I say, I don't care if you call it resilience engineering or chaos engineering.

As these datacenters migrate to the cloud, these fragile manual disaster recovery processes, I think they can be standardized, automated. We're starting to move to continuous resilience and build something that is more of a productizable repeatable thing.

There's a paper which has a lot more information about how to set up networks and things, written by Pawan Agnihotri, who's one of the SA managers in the Wall Street region. I contributed an early version of this material to that paper. There's a section in there on FMEA. A more expanded version, I posted yesterday on Medium, so you should be able to find this long written-out version, and there are links to the long versions of the FMEAs.




Recorded at:

Jan 07, 2020