
Chaos Architecture



Adrian Cockcroft takes a look at best practices and challenges in getting to a chaos architecture mindset.


Adrian Cockcroft has recently joined Amazon as their VP of Cloud Architecture Strategy. He was previously a Technology Fellow at Battery Ventures where he advised the firm and its portfolio companies about technology issues. He was a founding member of eBay Research Labs, developing advanced mobile applications and even building his own homebrew phone, years before iPhone and Android launched.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.


A Cloud Native Availability Model

So this is actually an interesting anniversary event. Seven years ago, in 2010, the very first presentation of the Netflix architecture was at QCon San Francisco, right here in this building. Randy Shoup was doing the Architectures That You Have Always Wanted to Know About track, and he had been to QCon the year before. It was my first time at QCon; I came and explained what we were doing and it baffled everyone in the room, I think. Everybody thought we were pretty crazy at that point. But this is a great conference, I really like the way that QCon is run, and it has been one of my favorite events to go to.

So I will run through a whole bunch of stuff leading into chaos architecture. First I will talk a bit about what is going on in the evolution of business logic, as we have gone from monoliths to microservices, then talk about the cloud-native principles, and I will do an intro, and then I will talk about how we can think about applying chaos engineering to chaos architectures.

Evolution of Business Logic

And so business logic started off pretty monolithic, and then we came up with these nice microservices, and now we are transitioning to functions. And now, why did that happen?

Splitting Monoliths Ten Years Ago

And so the thing is, if you go back about 10 years, monoliths were the thing, and we built them for good reasons: the networks we had were slow, and the interprocess communication we were using was mostly XML, SOAP, and those nasty things. People are probably flinching thinking about it; it was heavy and slow. If you divided your application into a lot of pieces, after two it was stuck and it took too much time to respond. It was too inefficient to break things into chunks. We had service-oriented architecture, but it was big, monolithic, bundled services that did too much. If we tried to break it into individual, single-function components, it would be too slow. There were two attempts to do true SOA and they died because they could not keep up.

Splitting Monoliths Five Years Ago

And then five years ago, we sped things up a bit. We had JSON and REST, and we were able to break our application into chunks which each do one thing. Your bounded context now is a microservice that does one thing, with a few upstream consumers and downstream dependencies. You want a single thing it tries to do; that is more testable and easier to reason about. When it goes wrong, it is easier to figure out what went wrong: this one service that does one thing is behaving strangely. That's the migration here.

And then we get to this, a simplified model of the kind of thing we were building by the time we got to Netflix. This is just the home page; there's an API call at the end or something. There's a whole bunch of stuff going on: the things on the right-hand side are the storage-based services, and the stuff in the middle are the services in between. We started building these microservice systems, and for a lot of the services we were building, one corner of the monolithic application had been a library that did some queuing thing, and now we had built a microservice that just does queuing. And we don't need to build that service; we can just run it as a thing.

Microservices to Functions

So we start migrating to standardized services, and a lot of the off-the-shelf pieces turn into standard services. We turn the queue into SQS, or we turn it into Kinesis, because I don't want to have to scale it; I can just call Kinesis and these different things. So a fairly large chunk of the application that was embedded in the monolith gets broken away, and what is left is the business logic that you were trying to build in the first place. There is less work and less stuff to write, but you have more dependencies around the edge.

Microservices to Ephemeral Functions

And this business logic, we break into individual functions, and this is where we get to the serverless model. Instead of having the same functions built into each microservice, we implement each one once and just pass the events back and forth between them. The way I have drawn this, they are grayed out because they are not really running. The interesting thing is that when something comes in, it wakes up these functions; you only wake up the things you need, and then they go away again.

So this is pretty high level. I know most of you know how serverless works, but I give this talk to managers of big companies to help them understand what serverless is. But the real point here is that when the system is idle, it shuts down and costs nothing to run, and that's why we tend to see huge cost savings with serverless architectures. There is also less code to write, and it is faster to write, because you are gluing together building blocks that are scalable and reliable, SQS or something. So you are writing less, it scales more easily, and it costs less to run. This is a double win, faster to write and cheaper to run, and that is why serverless is interesting. And there are different cases; if you hit a serverless function often enough, you kind of want to run it permanently rather than invoking it every time.

Cloud Native Architecture

So we will look at cloud native architectures, and I will talk about the principles here. So what do we mean by cloud native? Let's think about data center native. You have a data center and every now and again, you send another rack of systems into it, and the systems sit there for years. And what you wanted to do is to shift it out of there and put it up in the cloud. So there we go. I have a company that -- this is the best thing about my job, I have a company that makes slides for me. I was like, make a fork lift truck go into the cloud, they came up with this, it is brilliant. My graphic art skills are inadequate, as you can see by looking at my older decks. So, yeah, this is cool.

Pay up Front and Depreciate over Three Years

So the point here is that, instead of paying up front and depreciating over three years, you are paying a month later for the seconds you use. A month or so ago, we stopped charging by the hour, and now you pay by the second after the first minute. So we are fine-grained for instances, and Lambda functions are billed per 100 milliseconds, so you are consuming what you need and starting to get good at turning stuff off. So this is the principle: pay afterwards, for what you actually used.
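A rough sketch of the billing granularities mentioned above may help make the difference concrete. Only the rounding rules (whole hours, per-second with a one-minute minimum, per 100 milliseconds) come from the talk; the function names are hypothetical and no real prices are implied.

```python
import math

def hourly_billed_seconds(run_seconds: float) -> int:
    """Old model: every started hour is billed in full."""
    return math.ceil(run_seconds / 3600) * 3600

def per_second_billed_seconds(run_seconds: float) -> int:
    """Per-second billing with a one-minute minimum."""
    return max(60, math.ceil(run_seconds))

def per_100ms_billed_ms(run_ms: float) -> int:
    """Function-style billing, rounded up to the next 100 ms."""
    return math.ceil(run_ms / 100) * 100
```

Under these rules a 90-second job is billed as a full hour in the old model but as 90 seconds in the per-second model, which is why turning things off promptly starts to pay.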

File Tickets and Wait for Every Step

So in the old world, you file a ticket and you wait. Sometimes somebody is having a coffee, or lunch, and nothing happens to the ticket. What you really want is to make it self-service and on demand: just make a call, and you can keep calling the service and everything happens.

And so what we used to have here is deploying and doing things by filing a ticket. You don't want to do that. You want to make an API call; self-service just gets stuff done. That's the other benefit of cloud native architectures: they are demand-driven, self-service, and automated.

And instead of having a lot of tickets, you want to move to a tracking ticket that records that you did something; you are not asking for permission, and the ticket is not asking other people to do things. So the ticketing system turns into one ticket per deploy that records the stages the deploy went through as it happened, so you have the audit trail. Maybe if you are deploying a new finance system, some manager has to sign it off to say, yeah, I approve this update. But generally, you can get down to a single approval for a deploy.

And that is sort of, I guess, the current state of the art. If you are trying to get there and this sounds like an alien from another planet, because you have a waterfall that takes a year, what I recommend is that you measure the number of tickets and meetings per deploy, write that down, and make a graph of it. Then you can tell people that this is a bad thing and make the numbers go down. So that's a strategy for exposing the fact that there are so many meetings and tickets; you want to have fewer, and you can drive it down over time. And so, that's another principle.

So here is another one. We have all of this infrastructure scattered all the way around the world; if you want to deploy something in Latin America, or Japan, or Germany, it is just a different drop down on the menu, and that is cloud native. If you want to build a data center in Brazil, well, anybody who has shipped hardware to Brazil knows how hard it is to do that. You have to hire people and find a building, and it takes a long time. In the cloud it is a drop down on a menu; it is no harder to deploy in another country than it is to deploy in your own country.

And so this is another principle that you get to build instant, globally-distributed applications by default. If you want to do some long latency testing, you know, set up a machine in Oregon and Dublin and run the test, it is no harder than setting up two machines in the same region. This is another new principle.

Regions and Zones

All right. So let's look at a couple of regions, and the sort of typical data center architecture, and I'm going to get topical here. So let's have a hurricane. Hurricane Sandy comes and floods New York; there have been a lot of hurricanes this year. So you fail over to Chicago or something. The way that we build it in the cloud is we have regions with zones in them, and the zones are between 10 and 100 kilometers apart, so they are not in the same flood plain or in the same fire area. They are far enough apart for that, but close enough together for synchronous replication. That's why the zones are no more than 100 kilometers apart.

So when you write stuff, you want it to go to all three zones, and you don't have to wait long for that. But across regions, it is asynchronous. If you do a data center migration to cloud and you have a MySQL primary and secondary in one zone, you don't have resiliency; that is a forklift, not cloud native. If you are moving from the data center, you want to replicate copies of your application across three zones, if it is horizontally scalable, and then you have a more resilient system: if a hurricane comes in, it is less likely to take out all the zones in a region. It is more inherently resilient. So distribute by default.
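A back-of-the-envelope sketch of why distance decides between synchronous and asynchronous replication. It assumes light in optical fiber covers roughly 200 km per millisecond (about two thirds of the speed of light in a vacuum) and ignores switching and queuing delays; the 8,000 km cross-region distance in the usage note is a made-up illustration, not a figure from the talk.

```python
# Assumption: signal speed in fiber is about 200 km per millisecond.
FIBER_KM_PER_MS = 200.0

def round_trip_ms(distance_km: float) -> float:
    """Minimum network round trip for a write that waits for a remote acknowledgement."""
    return 2 * distance_km / FIBER_KM_PER_MS
```

With these assumptions, `round_trip_ms(100)` is 1 ms, which a synchronous write can absorb, while `round_trip_ms(8000)` is 80 ms per write, which is why cross-region replication is usually done asynchronously.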


We will talk about elasticity. In the data center it is hard to get beyond 10 percent utilization; somebody else was saying that theirs is 8 percent. That's the average data center utilization, and in the cloud, you can argue, 40 percent is plausible. You target 50 percent, and overall maybe you will average 40 percent. But it is many times higher.

And you don't run out of capacity, because you are just scaling up more. So the effect here would be, if you had 1,000 machines in the cloud, on average, then you would need 4,000 machines in the data center to do the same workload. That's the 4 to 1: you are running these machines 40 percent busy and those 10 percent busy, and that's the same capacity. And in the cloud, if you need 10,000 machines, you can get them. So you want to auto scale predictable heavy workloads, but lighter workloads can go all the way to zero. If I want to deploy across zones by default, I need at least three machines of every type. And at Netflix, we wanted at least two machines in each zone, a minimum deployment of six machines. You can get fairly small machines, but there are still six; if you have something that doesn't need six machines, go to serverless and have Lambda create the machines when you need them, and it is cheaper. And so here is another principle: turn it off when idle, for many times higher utilization, huge cost savings, and avoiding overloads.
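The 4-to-1 figure and the six-machine minimum are simple arithmetic, sketched here with the numbers from the talk. The function names are hypothetical, and utilization is passed as whole percentages to keep the arithmetic exact.

```python
def datacenter_machines_needed(cloud_machines: int,
                               cloud_util_pct: int,
                               dc_util_pct: int) -> int:
    """Machines needed at the lower utilization to do the same amount of work."""
    busy_machine_equivalents = cloud_machines * cloud_util_pct
    return -(-busy_machine_equivalents // dc_util_pct)  # ceiling division

def minimum_deployment(zones: int = 3, machines_per_zone: int = 2) -> int:
    """Smallest footprint when every zone must hold a fixed number of copies."""
    return zones * machines_per_zone
```

`datacenter_machines_needed(1000, 40, 10)` gives 4,000, and `minimum_deployment()` gives the six machines mentioned above, which is the floor that makes serverless attractive for small workloads.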

Versioned Delivery Pipeline

So the way delivery is done in the cloud: the developer builds something, and you want your time to value to be really short, maybe a day. I'd say the state of the art right now is that you write a change, hit save, and check it into the build system; it should build it, deploy it, and put it in a canary test to see if it looks good. When you come in the next morning it will be deployed in one region, and over the next period of time it deploys globally. That is the state of the art for time to value: I wrote a line of code, and how soon does the customer get value from that code? A day? For some people, it is years. I think a day is good; you can go faster, but that is a reasonable thing to do.

The point here is that you are keeping old versions of things, and you don't replace the previous version of the system. In the old way of doing it, it is so hard to get a machine, you know, to get somebody to push buttons on VMware, that once you get it, you keep it: you use Chef or Puppet to keep updating it, and you replace in place. Here, I can get a new machine in a minute, so I put the new version alongside the old and run them side by side. You can run multiple versions at once and you just need to route the traffic between them. So here is the principle: immutable code, automated builds, ephemeral instances, blue-green deployments, and versioned services.
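One way to picture routing traffic between side-by-side versions is a percentage split keyed on a stable request or customer identifier. This is a minimal sketch, not how any particular router works; the version labels and function names are hypothetical.

```python
import hashlib

def bucket(request_id: str) -> int:
    """Map an identifier to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100

def choose_version(request_id: str, new_percent: int,
                   old: str = "v1", new: str = "v2") -> str:
    """Send roughly new_percent of identifiers to the new version."""
    return new if bucket(request_id) < new_percent else old
```

Setting `new_percent` to 0 keeps everything on the old version, and 100 completes the cutover. Because the old version is still running, rolling back is just a routing change rather than a redeploy.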

Full Set of Cloud Native Principles

So that's what I mean by cloud native. This is the full set of principles, just to review them: pay as you go, self service, globally distributed, cross zones and regions, high utilization, and immutable code deployments. If you are doing these, you have a cloud native application, and if not, you are forklifting.

Chaos Architecture

I will talk about chaos architecture, and I would like to describe this as four layers, two teams, and an attitude. I will go through and explain what the layers are, what the teams do, and what the attitude is.

Infrastructure and Services

First of all, infrastructure and services. And so basically, you want to set it up with a no single point of failure, you have zones around the world in regions, it is easy to do. And you get this region, and you deploy multiple things. And if you zoom in on that, you see that everything is interconnected, you have multiple versions of it. And you zoom into one of those and there are more versions, you have more regions, and so you are using lots and lots of replications at multiple layers as you go in.

No Single Point of Failure

And the key point here is to have no single point of failure. What I really mean is: no single point means it has to be distributed, and no single failure means it has to be replicated. So we are building distributed, replicated systems, and we want to automate them, and cloud is the way that you build that automation. So what we're talking about here is distributed, replicated, automated cloud systems, and that is how you build something that gives you the infrastructure reliability you need. So that's infrastructure. Beyond this is more interesting.

Switching and Interconnecting

And switching to the interconnect level. If you have data in more than one place, you have to figure out how you get the data there and keep it in sync, so there is some data replication going on. And there is routing: wherever you get customers or traffic from, it has to be routable to more than one place, because if a place goes away, you have to route around it. So we will show a little diagram here. This thing went away, and the customers were routed somewhere else. That is traffic routing, avoiding an issue. And, as I was telling someone in the previous session, years and years ago at eBay we had a data center outage; it took a day or so to get it back up, and a month to fail back afterwards, because everything after the failover was such a mess that we could not figure out how to get it back safely. It was not a simple switch there and back. So once you fix it and put the customers back, you need anti-entropy recovery: the data is out of sync and you have to re-synchronize it.

Who Has a Backup Datacenter?

Who here has a backup data center? What is the best description of it: you have never failed over to it? If you are trying to sell chaos engineering to the management team, try to get the CEO to ask the CIO if they have ever failed over to the backup. That's an embarrassing question that will cause them to fund a chaos engineering team; that's the theory, anyway.

Or infrequent partial testing: there are some compliance rules and industry regulators, and if you fail over individual applications to the backup data center, that's the minimum to pass audit if you are a bank or something like that.

Or regular testing in a maintenance window: on a weekend when you are not doing anything, you do an entire failover to prove that you can do it, as an exercise. Some people do that, and it is kind of a good practice. Or, you can do frequent failovers in production to prove that nobody can tell that you are doing it, right? We are hearing from Dave Hahn that every week or two, Netflix shuts down a region and moves the traffic somewhere else; it takes them less than 10 minutes and nobody notices it is happening. If it fails, they do it more often. This is that continuous delivery principle that Jez Humble came up with: if it hurts, do it more often. If you do this frequently enough, it will stop hurting, and if there's an issue in the system, you will switch away from it and everything is fine.

So here is the problem: you want to route updates and customer requests to specific regions and services, you want to replicate data and re-route requests during incidents, and the switching needs to be more reliable than the things you are switching between. This is one of the principles of high availability. The failure rate of the switch has to be better than the failure rate of the things it switches between; otherwise you are making it worse, because the combined reliability of the whole thing is dominated by the switch. And what we find in many cases is that you get some small failure in your system, the system starts to respond to it, you say, let's fail over, and then it turns into a massive outage because everything collapses: in all of your switching software, you find every error code path that has not been tested.
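The point about the switch can be put in numbers with a deliberately simple model: two sites that fail independently, with a switch in series in front of them. The independence assumption, the series model, and the example availabilities are all illustrative simplifications, not figures from the talk.

```python
def failover_availability(site: float, switch: float) -> float:
    """Availability of two redundant sites behind one switch.

    Assumes the sites fail independently and that the service is
    down whenever the switch itself is down.
    """
    both_sites_down = (1 - site) ** 2
    return switch * (1 - both_sites_down)
```

With 99 percent sites and a perfect switch, the pair is better than a single site. With a 98 percent switch, `failover_availability(0.99, 0.98)` is about 0.9799, which is worse than one site on its own. The combined figure can never exceed the availability of the switch.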

That's the problem. The only way you can exercise the error code paths is by running these tests. I keep going on about Netflix, but they are state of the art; I don't know that many people who are doing that. I know banks that do this: on even weeks, this data center was the primary, on odd weeks that one was, back and forth, and if there was a problem midweek, they did the flip and kept going. That's a good state of the art way of doing this if you are in that kind of an environment. Okay. And so that's the switching layer.

Application Failures

Let's move up a layer: applications. What happens if the application gets an error it wasn't expecting, or it gets a slow response, or the network connection drops? Does it crash, does it fall over, does it write the wrong data to the wrong place? You don't know; you see all kinds of nasty behaviors. And so, that's a bit of a problem.

Microservices Limit “Blast Radius” for Software Incidents

We would like to be able to test these things better, and with microservices, it actually gets a little bit easier. A monolith is hard to test; it does so many things under so many conditions that it is difficult to test across all of its error cases. But a microservice does one thing: it takes one input, it gives you one output, and it has a bunch of dependencies. You can reason much more about what it is going to do. You can put circuit breakers in there to limit the damage, and bulkheads to prevent it spreading.
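For readers who have not met the pattern, here is a minimal circuit breaker sketch. The class, its parameters, and its thresholds are hypothetical illustrations; it is not the implementation of any particular library. After a run of failures it stops calling the dependency for a cool-down period, so a failing service is not hammered and callers fail fast.

```python
import time

class CircuitOpenError(Exception):
    """Raised when calls are suspended because the dependency looks unhealthy."""

class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # cool-down in seconds
        self.clock = clock              # injectable so tests can control time
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("dependency calls are suspended")
            # Cool-down elapsed: allow one trial call (half-open state).
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # trip the breaker
            raise
        self.failures = 0  # any success closes the circuit fully
        return result
```

A caller would wrap each dependency call, for example `breaker.call(fetch_profile, user_id)` where `fetch_profile` is whatever function talks to the downstream service, and treat `CircuitOpenError` as a signal to serve a fallback.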

There was a talk, and I think it was at QCon London, from Starling Bank. Yes, it was Starling Bank that did a nice talk. I reviewed the whole thing, and he had a very long acronym for his architecture; part of it was DITTO, Do Idempotent Things To Others. So you don't need exactly-once delivery: if things are delivered more than once, it doesn't matter, all of your transactions are nicely behaved events, and that way you can scale. It is resilient and reliable, and you can reason about the system. If you have something that has to be delivered exactly once, you will have a system that has trouble being available, because you have to freeze the system to see what is going on; that's the consistency problem of once-only delivery. If you set it up so you can deliver multiple times and not have a problem, then you can deal with it much better.
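A small sketch of what an idempotent operation looks like in practice: each event carries an identifier, and applying the same event twice has no further effect. The `Account` class and its fields are hypothetical, and a real system would persist the set of seen identifiers rather than keep it in memory.

```python
class Account:
    def __init__(self):
        self.balance = 0
        self.seen = set()  # identifiers of events already applied

    def apply(self, event_id: str, amount: int) -> bool:
        """Apply a credit or debit once; redelivered events are ignored."""
        if event_id in self.seen:
            return False
        self.seen.add(event_id)
        self.balance += amount
        return True
```

Because a redelivered event is harmless, the sender can simply retry on any timeout instead of stopping to work out whether the first attempt arrived.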

And so really, what we are trying to avoid is update and delete; those change the state. Append and write are fine: you are appending to a write log. So you build these logs of what is going on in the system, and if I can see the logs, I can figure out the current state of the system. This is double entry bookkeeping, a 4,000 year old accounting algorithm invented by Babylonians with clay tablets, and they just had to append. The reason we had update and delete is because computers were too small, so we optimized by updating in place. If you are doing anything that matters, just keep logging, and maybe purge it every now and again. But update is a terrible thing, because if you are caching or pointing to something and it changed, you didn't know it changed. That's the problem we are trying to avoid. You delete by writing tombstones over things; that's the Cassandra method of doing deletes.
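The append-only idea can be sketched in a few lines: writes and deletes are both appended to a log, a delete is recorded as a tombstone marker, and the current state is derived by replaying the log. This illustrates the concept only; it is not how Cassandra implements tombstones, and the class name is hypothetical.

```python
TOMBSTONE = object()  # marker meaning "this key was deleted"

class AppendOnlyStore:
    def __init__(self):
        self.log = []  # entries are only ever appended, never changed

    def put(self, key, value):
        self.log.append((key, value))

    def delete(self, key):
        self.log.append((key, TOMBSTONE))

    def current_state(self) -> dict:
        """Replay the log from the start to compute the current view."""
        state = {}
        for key, value in self.log:
            if value is TOMBSTONE:
                state.pop(key, None)
            else:
                state[key] = value
        return state
```

The full history stays available in `log`, so a reader can always see that a value changed and when, which is exactly what an in-place update would have hidden.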


So the system is acting strangely. Why did it do that? You reboot it; if there's a button, you push the button; you turn the knob the wrong way. People will screw up a perfectly good system that is highly available, and there are lots of examples of this in the real world. And so, what are we going to do about it?

We have to train the people. And I have a nice analogy here. Who has never had a fire drill? Everyone who has worked in an office will have had a fire drill, where you stand in the parking lot and they count and make sure that everybody got out of the building, nobody tried to use the elevators, etc. And now and again, there's a fire and it saves lives. Shortly before the first time I gave these slides, there was a huge earthquake in Mexico. Two hours before the earthquake, there was a public earthquake preparedness drill in Mexico City. Everybody was trained two hours before the real one happened. You know it is a real one because the earth is moving a few feet side to side under you; it is not just, hey, we are setting off a siren.

Who Runs the Fire Drill for IT?

But here is the question, who runs the fire drill for IT? If your infrastructure and applications are on fire, what are you supposed to do? You didn't train everybody in how to respond to that. And so this is, this is where I think -- this is one of the prime functions of the chaos engineering team, and I have a link to the Netflix, or the book on this subject.

You've got people, applications, switching, and infrastructure, and the chaos engineering team's job is to build tools that exercise all of those layers. So here are some of the tools. For people, there are game days: what to do during an outage, how to find the information you need, how to be on the call where we are discussing the outage. There's a set of behaviors that work, and if you don't know what you are doing, you can disrupt the call. An efficiently managed incident call is a really powerful way of keeping the outage short. If you don't manage it well, you can make it worse.

And there's the Simian Army, the open source Chaos Monkey and things like that. You are building failure injection tools that put failures in the right place. And the Chaos Automation Platform, ChAP, there's a blog post on it: they are automating failure injection, the process of deciding where to inject failures, and the operation of those failures. If you start noticing that your failure test is causing a customer-visible problem, you back out of it. And then Gremlin, which is a productized version of this. So you can go and find it, and I don't think that you have a competitor out there. So you are a category creator; it is a brave place to be in a market.

And I think that we need more tools here. Russ Miles is doing some of this; there's a GitHub account that is doing bits and pieces of this. So it is an interesting time. If you want to get involved in building stuff, this is a great opportunity to contribute open source tooling, concepts, and ideas. It is really an emerging area. And there is another team that does something similar: some of you have security teams to protect your systems, and the more advanced companies have a red team whose job is to break into the systems to prove the security team is doing a good job. You heard about this from Shannon Lietz of Intuit; the red team goes and breaks everything, they find every buffer overflow that was introduced, and they get into the system. They build tools for this, too.

And so one of the neat ones is at the human level, right? That is like game days. SafeStack AVA is an open source package that does phishing attacks on your own company. It generates emails with links that say, you should click on this link for a special offer, and then if you click it, it will say, bad person, you are not supposed to click on the link. At Netflix, we would get these emails that they would send, or they would leave USB keys around the cafeteria, and if you plugged one in: bad person, you are not supposed to plug in random keys. So these are a bunch of tools, and there are other companies doing the other stuff. And these are the four layers and the two teams, and the attitude is that you are trying to break it to make it better. And so let's think about risk tolerance.

Risk Tolerance

And so, who is at risk for what, and what kind of risks are you trying to guard against? You can choose a little bit here whether downtime is a bigger risk than getting the wrong answer. Do you want consistency and security, which means that you have to stop if you cannot guarantee them? This is the CAP theorem: if you cannot guarantee the state of something, you just have to stop the system. And that's downtime. It might be the right thing to do if it would cost trillions for your state to go to the wrong place: just stop, and make sure your system goes into a state where you can do the next thing. But in other cases, the biggest risk is downtime, and you can provide a degraded service which is actually fine, so you want to be more permissive. Going back to my time at Netflix, one of the hardest things was to teach people about permissive failure. Conceptually, if it is our fault that we cannot tell whether you should be able to do something or not, then we should let you do it.

Right? If we are not sure if you are a customer in good standing, trying to watch a movie because of the failure in the system, we should let you watch it. Most of the time, you are. The probability that my downtime occurs at the same point as somebody trying to get into the system is small. You want to be permissive on failures, so if you cannot renew a ticket or there's a security system that is in there, look at the actual cost of giving away the thing that you are currently trying to protect. That's the concept here. It means you can build much more available systems; it is a really powerful technique for coping with partial outages in various parts of the system.
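Permissive failure, often called failing open, can be sketched as follows. The function and exception names are hypothetical. The key design choice is that only an outage of the checking system is treated as "allow"; an explicit "no" from a working check is still honored.

```python
class EntitlementUnavailable(Exception):
    """The entitlement system could not be reached or did not answer."""

def may_watch(customer_id: str, check_entitlement) -> bool:
    """Return the real answer when available; fail open during an outage."""
    try:
        return check_entitlement(customer_id)
    except EntitlementUnavailable:
        # Our failure, not the customer's: let them through.
        return True
```

Whether failing open is acceptable depends on the cost of giving away what is being protected, as the talk notes. For a high-value transaction, the same handler would return `False` or stop instead.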

Incident Lifecycle

And another thing is this incident life cycle. Something has gone wrong; what are you going to do? You try to mitigate it so it does not spread, you restore the system to a working state, and you adapt the system so it doesn't happen again. So that's basically the cycle.

And that's kind of the antifragile feedback loop. If you read the book Antifragile by Nassim Nicholas Taleb, he has a lot of examples about working out. If you work out too hard, you will break something: you end up in the hospital, or you cannot walk the next day. That is too hard. If you never go to the gym and you have to run to catch a plane and have a heart attack, that is a bad thing too. So there's a level of working out that is good. Most people have too little exercise, and running for planes is probably my main exercise at this point in my life. You want enough exercise that you are slightly stronger each time you do it; that's the right level of exercise for an antifragile system, and that's what we are doing in chaos engineering: enough to get stronger, but not so much that we break the system.

Break it to Make it Safer

We are trying to break it to make it safer, and there is research on this. As I said, break it to make it better; but really, we are trying to make it safer. There is a whole lot of stuff going on in safety, and there's this whole area called the new view of safety, which is about industrial safety. This is about people operating nuclear power plants and machinery and industrial processes. What do you do when people die, or get hurt, or something goes wrong? Todd Conklin is a leader in this space, and he has a podcast, the PreAccident podcast. If you want to learn more, add it to your podcast list; it is really interesting. He used to be the safety guy at Los Alamos lab in New Mexico, so he knows a lot about big nuclear issues.

And so that's interesting. I actually went on that podcast about two years ago and did a whole talk about the Chaos Monkey and that whole idea, and he was quite interested in it. It is not just industrial; every once in a while he will have software people on it. John Allspaw was on there, and he introduced him to me; he has left Etsy and is doing a lot of work in this space. There's the STELLA report, the url is, that's exactly the url. And it is a whole discussion about how to build safer systems and how to deal with failure modes and reporting and things in this space.

And really, the core ideas I got from this came from reading Drift Into Failure by Sidney Dekker. It is a good book to read, unless you are on a plane or have a loved one in a hospital. I had to put it down once on a plane; it is full of plane crashes and people dying in hospitals. Anyway, it is one of the inspirational books, and Sidney Dekker is a figure behind this area.

Failures Are a System Problem - Lack of Safety Margin

This is a new emerging area, and this is the core concept; a failure doesn't have a root cause in a component failure, or a human error. If a component or a human error causes a system failure, it is a failure in the system, it is not a failure of that human. Like, humans do bad things all the time and systems don't fail. Just because this one time you did that bad thing and you actually tipped it over the edge, you should not fire that person or replace that, you know, replace that component.

It is because there wasn't enough margin. You want to be building systems with a big enough margin that they can absorb human error and failure. So the question then is, how much margin do you have? This is an example. If I was blindfolded on the edge of a cliff, I would say, I think the edge is here. That's where the edge is, I can back away from that. So I have just established a safety margin of four feet in that dimension. I can go and feel my way over there and discover, okay, I have a fairly reasonable amount of margin here. That's what we're talking about, and I can stumble and take a step this way, and I know that I can take a step this way, because I have established that I have the margin to do that. So that's the concept that we are trying to get to here. I was trying to come up with an analogy for it. Hopefully that works.

Hypothesis Testing

So this is the hypothesis that we're testing. We think we have a safety margin of a certain dimension: we can take 50 percent more traffic and we would be fine, or this service could fail, or, you know, this thing could go wrong, and we would still be up. That's the safety margin. So we think that, and so we carefully test it by, like, pumping -- moving traffic from somewhere else, or overloading something, or pushing something up closer to that limit. And if we notice things are actually falling over, that we have found the edge of the cliff earlier than we were expecting, we stop quickly and back out. That's what we are trying to do; we don't want to cause an issue.
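As a rough illustration of that loop, the sketch below ramps load toward a hypothesized margin in small steps and backs out as soon as an error-rate check trips. It is only a sketch under stated assumptions: `probe_safety_margin`, `FakeSystem`, the step count, and the 1 percent error threshold are all hypothetical names and values chosen for the example, not anything from the talk.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class MarginResult:
    hypothesis_held: bool     # True if the system stayed healthy all the way to the target
    highest_safe_load: float  # last load multiplier at which the system was still healthy
    loads_applied: List[float]


def probe_safety_margin(
    apply_load: Callable[[float], None],
    error_rate: Callable[[], float],
    target: float = 1.5,          # hypothesis: "we can take 50 percent more traffic"
    steps: int = 5,               # assumed ramp granularity
    max_error_rate: float = 0.01, # assumed abort threshold
) -> MarginResult:
    """Ramp load from 1.0x toward `target` in equal steps, backing out on trouble."""
    baseline = 1.0
    safe = baseline
    applied: List[float] = []
    for i in range(1, steps + 1):
        load = round(baseline + (target - baseline) * i / steps, 4)
        apply_load(load)
        applied.append(load)
        if error_rate() > max_error_rate:
            # Found the edge of the cliff early: stop and return to baseline.
            apply_load(baseline)
            return MarginResult(False, safe, applied)
        safe = load
    apply_load(baseline)
    return MarginResult(True, safe, applied)


class FakeSystem:
    """Stand-in for a real service: errors appear once load exceeds capacity."""

    def __init__(self, capacity: float) -> None:
        self.capacity = capacity
        self.load = 1.0

    def apply_load(self, multiplier: float) -> None:
        self.load = multiplier

    def error_rate(self) -> float:
        return 0.0 if self.load <= self.capacity else 0.05
```

In a real experiment, `apply_load` would shift or generate traffic and `error_rate` would read production monitoring; the point of the structure is that the abort check runs after every step, so the experiment never goes far past the real edge.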

How to Select a Test?

And so then, this came up in the panel session earlier. How do you select a test? There are so many places you can inject failures; what are the dimensions and where should you be trying them? Lineage-driven fault injection was a talk one year ago at QCon here, by Peter Alvaro, a fascinating idea. Effectively, you take the business transactions that matter; in the Netflix case, I push play, or I'm signing up for the service, the sign-up flow, and you follow the path. This is an important business transaction; what is the lineage of the transaction and its dependency chain? You work your way through that chain, injecting faults into it, so you know that that business transaction is going to work. So you are not testing everything at random. You can call it fuzzing when you throw random things at a system, and it is good to do some random injection, but with lineage you are picking a key piece of business traffic and injecting faults along its path to make sure it doesn't go wrong.
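A much-simplified sketch of the idea follows. It walks the dependency chain of one business transaction and injects a single failure at each dependency in turn, reporting which ones break the transaction. This is not Peter Alvaro's actual algorithm, which reasons about redundancy to choose combinations of faults; the service names, the `graph` layout, and the `transaction_ok` callback are all hypothetical.

```python
from typing import Callable, Dict, FrozenSet, List


def lineage(graph: Dict[str, List[str]], root: str) -> List[str]:
    """Return every service reachable from the transaction's entry point."""
    seen, order, stack = set(), [], [root]
    while stack:
        service = stack.pop()
        if service in seen:
            continue
        seen.add(service)
        order.append(service)
        # Reverse so that dependencies are visited in their listed order.
        stack.extend(reversed(graph.get(service, [])))
    return order


def find_weak_points(
    graph: Dict[str, List[str]],
    root: str,
    transaction_ok: Callable[[FrozenSet[str]], bool],
) -> List[str]:
    """Fail each dependency alone; collect those that break the transaction."""
    weak = []
    for service in lineage(graph, root):
        if service == root:
            continue
        if not transaction_ok(frozenset({service})):
            weak.append(service)
    return weak


# Hypothetical dependency chain for a "push play" transaction.
PLAY_GRAPH = {
    "play": ["auth", "catalog", "stream"],
    "catalog": ["recommendations"],
    "stream": ["cdn"],
}


def simulated_play(failed: FrozenSet[str]) -> bool:
    """Stand-in for running the real transaction with `failed` services down.

    Assumes fallbacks exist for everything except the streaming path.
    """
    return not (failed & {"stream", "cdn"})
```

Calling `find_weak_points(PLAY_GRAPH, "play", simulated_play)` checks only the five services on this transaction's path rather than the whole fleet, which is the contrast with random injection drawn above.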

So to summarize, what we really want to have, then, is really experienced staff. Maybe you don't have real outages very often; planes don't crash every day, but pilots are trained in what to do if there's an incident. So that training is really important, and you want your staff to be experienced in the processes involved in dealing with an incident. And your applications, you want them to be robust. You put them in test environments and simulate failures of various kinds, and you want to do that in production, too. You want a dependable switching fabric; you exercise all of its failure modes and code paths, so you know what it does in these different circumstances. It is hard to go wrong if you test these, but if you don't test at all you will be screwed. And you want a redundant service foundation; all of your services and capabilities, you want to make sure that you have those distributed out. That's what we would like to have, and that's my cloud native availability model. I have a few minutes for Q&A.

Any questions?

Is there a chance that you are giving rise to the same people who caused you to go to the permissive model by breaking your system? So if there's a hack attack on your site --

So if there's an attack on the system --

-- that made your, you cannot authorize the customers.

Yeah, authorizing it.

So now you allow people in this authorization.


So can it cause this never-ending loop?

So if you can't tell whether this is a valid customer or not, and they want to do something that is relatively cheap for you to give them, say, watch a movie, and the system you need to make that decision is basically down, then just let them watch the movie; most likely they are a good customer. All right? So I think that it is okay, but I mean, you don't want people to go in there and move a million dollars to some other account. So there are cases where you want to put the hard control in there and you want to just stop the system. But I think, in most cases, people tend to err on the side of blocking if they can't be sure, whereas they should actually think through, could I be more permissive in this failure? You want to log that you did it, and maybe you will do some fraud analysis later to see if you were exposed. But there's a good paper by Pat Helland, Memories, Guesses, and Apologies. This is the apology phase, and it is like, I could not figure out what to do, I will punt you to customer service, or just say sorry, or just try and deal with it. We have another question?
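The permissive-failure decision described in that answer can be sketched as follows. This is a minimal illustration, not Netflix code: `AuthUnavailable`, `Risk`, `authorize`, and the list-based audit log are hypothetical names invented for the example.

```python
from enum import Enum
from typing import Callable, List


class AuthUnavailable(Exception):
    """Raised when the (hypothetical) authorization backend cannot answer."""


class Risk(Enum):
    LOW = "low"    # cheap to give away, e.g. playing a movie
    HIGH = "high"  # costly if wrong, e.g. moving money between accounts


def authorize(
    customer: str,
    action: str,
    risk: Risk,
    check_auth: Callable[[str, str], bool],
    audit_log: List[str],
) -> bool:
    """Use the real decision when available; otherwise fail open only for low risk."""
    try:
        return check_auth(customer, action)
    except AuthUnavailable:
        if risk is Risk.LOW:
            # Record the permissive decision so fraud analysis can review it later.
            audit_log.append(f"permissive-allow customer={customer} action={action}")
            return True
        # Hard control: when we cannot be sure, high-risk actions are blocked.
        return False
```

The design choice is that the failure mode is decided per action rather than globally, so an authorization outage degrades cheap actions gracefully while expensive ones still stop.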

Thank you for the talk. I wonder, can you clarify what you mean by turn off the service for saving utility bills? Is that what you meant?


-- for hardware, like, servers.

Yeah, well I can -- what I mean by turn off, well, you can save power in a data center by powering machines down. But what I meant is on a cloud, it is cloud native, right? So what I meant was just stop using them, de-allocate the resource. You can -- you know, you can auto scale down. I have 100 machines, and when the traffic load halves, I run it on 50 machines. So I keep all the remaining machines busy by shutting the others down. If you have a QA environment -- there are 168 hours in a week, and hopefully you are not working more than about 40 or 50 of them. So you should be able to shut down your entire QA environment for 2/3 to 3/4 of the week because there should not be anyone using it, so that gives you back a higher utilization over time, right?
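The arithmetic behind those two figures is easy to check. The helper names below are made up for the example; the numbers (100 machines, halved traffic, 40 to 50 working hours in a 168-hour week) come from the answer above.

```python
import math

HOURS_PER_WEEK = 7 * 24  # 168


def machines_needed(current_machines: int, traffic_ratio: float) -> int:
    """Machines required to stay equally busy when traffic changes by `traffic_ratio`."""
    return math.ceil(current_machines * traffic_ratio)


def idle_fraction(hours_in_use: float) -> float:
    """Fraction of the week an environment could be shut down."""
    return 1 - hours_in_use / HOURS_PER_WEEK


print(machines_needed(100, 0.5))     # 50 machines when traffic halves
print(round(idle_fraction(40), 2))   # 0.76, roughly 3/4 of the week
print(round(idle_fraction(50), 2))   # 0.7, a little over 2/3 of the week
```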

How safe is your yield for -- (speaker far from mic).

Well, it saves you a bit. If you are on a cloud, you are paying by the second. So you just stop paying for things you are not using. That was the point. Another question?

So in your graphs, you had a lot of connections between all of the nodes. How do you inject in between those nodes, do you have to have middleware in between the different services so that you can throttle?

Yeah, that's a good question. I think there are a number of different ways of doing that. The Netflix way of doing it was to build a set of libraries. Most of the Java-based services at Netflix were built from a template. You started with a base server that you copied and added your own business logic to, and it had a standard HTTP request handler with additional code to inject failures into it, and then on the outbound side that made requests, there were hooks for injecting failures as well. So you are injecting in the middle of this framework; it was a web server or Tomcat server. That's one way of doing it. If you use the Spring Cloud framework in Java, you can pull that in. Ribbon is a library that Netflix has.

And you can use a service mesh. Istio uses Envoy, and that's the actual service mesh point; it comes from Lyft. And so what that mesh gives you is a point where you can get in there and configure it to inject failures and things like that. So I think this is a property of the service mesh. The service mesh can be an additional process, which is running alongside your server, where you are talking to localhost to get to it, or it can be embedded in an application if you have a library that supports it. With Java bytecode, you can do injection of code to do things. So with AppDynamics, you can inject the monitoring into an application that was not instrumented, or you can just hard instrument stuff. So there's a number of ways of doing it, but I think those are the common ways I have seen. You can do it at the network layer as well, you know, at the packet level. It is a bit harder to coordinate that.
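To make the library-hook approach concrete, here is a small sketch of a wrapper that injects latency and failures in front of a request handler. It is written in Python for brevity and is not the Netflix framework or any service mesh API; `with_fault_injection`, `InjectedFault`, and the parameters are hypothetical.

```python
import random
import time
from functools import wraps
from typing import Callable


class InjectedFault(Exception):
    """Raised in place of a real response when a failure is injected."""


def with_fault_injection(
    failure_rate: float = 0.0,   # probability in [0, 1] of failing a call
    delay_seconds: float = 0.0,  # extra latency added to every call
    rng: Callable[[], float] = random.random,
    sleep: Callable[[float], None] = time.sleep,
):
    """Wrap a handler so that calls can be delayed or failed on purpose."""

    def decorator(handler):
        @wraps(handler)
        def wrapped(*args, **kwargs):
            if delay_seconds > 0:
                sleep(delay_seconds)
            if rng() < failure_rate:
                raise InjectedFault(f"injected failure in {handler.__name__}")
            return handler(*args, **kwargs)

        return wrapped

    return decorator


@with_fault_injection(failure_rate=0.0)
def get_title(title_id: str) -> str:
    """Example handler; with a zero failure rate it behaves normally."""
    return f"title:{title_id}"
```

Because the hook sits in shared framework code, every service built from the same template gets the same injection point, which matches the template approach described above. A sidecar proxy provides a comparable hook outside the process.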

Time for one more? Anyone else? This is the precarious run down the side.

So AWS itself, is it doing any (indiscernible) testing?

Well AWS has a chaos engineering team -- Amazon has a chaos engineering team, and AWS has a lot of testing, and each team manages itself. We do a lot of testing.

So my question is, what it appears to me is that, most of the time, doing effective chaos engineering is a statement on the complexity of the architecture, like how well you can understand and manage complexity, which is -- I don't think it is confined to the application layer, it is up and down the chain.

Yeah. I can get into a much more philosophical question, but the point is that most systems, right now, are so fragile that if you poke them, they fall over. That's the case. So if you just get incrementally better by training the people in how to deal with it, that's the starting point. If you exercise people so they know what to do during an outage, start there and then get to individual applications. If you get to microservices, you have a bounded context that does one thing, you can reason about it, and you can say what happens if that one thing does something wrong. You can inject failures into that one service and work through the system. There are systemic problems that occur. If you have your timeouts set wrong, you can get congestion collapse, thundering herds, and a lot of other larger scale problems, and you need to test for those as well. I did a talk at Craft Conference a couple of years ago where I talked about some of the issues like timeouts and stuff like that. So, if you want my slides on microservices, they are on my GitHub account, Adrian Cockcroft on GitHub, and there's a layer there. So from when I joined Amazon, the whole deck is there.

Can chaos engineering turn into a DDoS attack against the system? How do you make the decision that you should not go further?

If you are doing it right, you should be monitoring the system so that if it starts to look bad you back out, and hopefully it stops, but you can tip a system over. There are classic cases where systems are not restartable. If you take it down and bring it back up again, it does not come up again. That's a common property of systems that have grown over time. The system only works if the caches are warm. So if you kill your caching layer, you cannot re-warm it and you are down. You have to be careful about some of these properties. But it is good to know that, because then you can do it during the day when everyone is there and watching, rather than at 3:00 AM on a Sunday.

So one more Netflix thing: the most annoying thing about Netflix is it would break at 7:00 on a Sunday evening because that's when everyone is watching TV with their kids. That was the peak traffic every week. And when you have a week-on-week growth rate, and every week is an all-time record, having the peak be at an off time is a pain. You want to create enough chaos during the week, on a Wednesday when everyone is at work and fresh; that's when you want to exercise the system. You don't want to do stuff out of hours and have a bunch of people who are sleep deprived trying to make important decisions. That's a bad recipe.

A principle is containing the blast radius. So with DDoS, you start small: what is the smallest thing that can teach you something? And as your trust in the system grows, you build it up. So eventually you are running at production scale, but you are not starting there.
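One common way to contain the blast radius, shown here only as an assumed illustration and not as something described in the talk, is to apply an experiment to a small, stable percentage of requests and raise that percentage as trust grows. The function name and the hashing scheme below are hypothetical choices.

```python
import hashlib


def in_blast_radius(request_id: str, percent: float) -> bool:
    """Return True if this request falls inside the experiment's blast radius.

    `percent` is the share of traffic (0 to 100) exposed to the experiment.
    Hashing the id keeps the decision stable for a given request or user,
    and anyone included at a small percentage stays included at a larger one.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 10_000  # 0..9999
    return bucket < percent * 100


# Start with the smallest exposure that can teach you something, then widen.
RAMP_PERCENTAGES = [1, 5, 25, 100]  # assumed schedule, for illustration only
```

Each stage of the ramp would be held until monitoring shows the system is healthy before moving to the next one.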

If you poke one back end database, and it is set wrong it can ripple through and take out everything. It is going to happen one day. You should do it live. Do it when you are ready for it. I think we are out of time. Thank you, everybody.

Live captioning by Lindsay @stoker_lindsay at White Coat Captioning @whitecoatcapx

Recorded at:

Mar 30, 2018