





What Breaks Our Systems: A Taxonomy of Black Swans


Summary

Laura Nolan talks about black swan events - unforeseen, unanticipated, and catastrophic incidents - that may happen in production and can take the system down. She examines some of the strategies, such as canarying, that can be employed to discover such incidents early, and how to address them.

Bio

Laura Nolan is a Site Reliability Engineer at Slack. Her background is in Site Reliability Engineering, software engineering, distributed systems, and computer science. She wrote the 'Managing Critical State' chapter in the O'Reilly 'Site Reliability Engineering' book, as well as contributing to the more recent 'Seeking SRE'. She is a member of the USENIX SREcon steering committee.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Nolan: One of our earlier speakers said that an outage is an unplanned investment. One of the big things that we need to do with outages is to reap that investment by having good learning reviews, or postmortems as most of us call them. I think that the best way to get value out of outages is to get value out of other people's outages first; that might save you having an outage. It might not, but it might. I should say as well that the black swan event I'm currently most worried about is this stage. It is as rickety as anything, so if I fall through the stage, that is an event that I'm actually kind of predicting, so let's hope this doesn't happen.

A black swan is an outlier event: something that's hard, or most likely impossible, to predict, and something that's severe in impact. What I've done with this talk is gone through the last several years' worth of public internet postmortems that I could find, picked out the ones that were really impactful outages, multiple hours of downtime, really scary stuff, like those really bad days on call that you sometimes get, and distilled them into this talk.

Every black swan is unique and every incident is unique, but there are patterns, and this talk is about those patterns. By the way, apologies to anybody who is from Australia. The black swan metaphor is from Nassim Taleb. He wrote a book about 10 years ago called "The Black Swan", and it's all about events that we think are unlikely, but when we look back we go, "Oh well, there's all this correlated stuff that's going on," and in retrospect they look inevitable. Apparently, in fact, Australia is full of black swans, so apologies to any Australian who doesn't get the title of this talk. In Europe, they're super rare.

One of the reasons for this talk is I think that with black swans, these unpredictable, unknowable, severe events, sometimes we can build processes, or tooling, or just resilience or robustness in our systems, that can turn them into routine non-incidents. Briefly, canarying is when I've made a change in production, say I've built a new binary or changed a config, and I start by rolling it out to a small percentage of my production fleet first, maybe 1%, then 5%, then 10%. I monitor that small percentage of my fleet for anomalies: is it throwing a lot of errors, are instances crash-looping, am I starting to see drops in my key business metrics? If so, that's a bad push, a bad config or a bad binary, and we're going to automatically roll it back or revert it. This is not bleeding edge; a lot of organizations have been doing this for years and I expect many of you are doing it. It's really important because it's a defense against an entire class of problems.
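As a rough illustration of the canarying loop described above (this sketch is not from the talk; the deploy, rollback, and get_error_rate helpers and the thresholds are placeholders for whatever your release tooling provides):

```python
import time

CANARY_STAGES = [0.01, 0.05, 0.10]   # fraction of the fleet at each stage
MAX_ERROR_RATE_DELTA = 0.005         # canary may exceed baseline error rate by 0.5%

def canary_is_healthy(get_error_rate) -> bool:
    """Compare the canary population's error rate against the baseline fleet."""
    return (get_error_rate("canary") - get_error_rate("baseline")) <= MAX_ERROR_RATE_DELTA

def progressive_rollout(deploy, rollback, get_error_rate, soak_seconds=600):
    """Roll out to 1%, 5%, 10%, watching metrics and reverting on anomalies."""
    for fraction in CANARY_STAGES:
        deploy(fraction)                       # push the new binary/config to this slice
        time.sleep(soak_seconds)               # let the monitoring accumulate signal
        if not canary_is_healthy(get_error_rate):
            rollback()                         # bad push: revert automatically
            return False
    deploy(1.0)                                # every stage looked healthy: finish the rollout
    return True
```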

Sue Lueder at Google studied causes of outages for years. It turns out that, according to her, the biggest cause of outages and production problems is change. I think this is fairly intuitive to most of us who have run systems: normally we change something, something unpredictable happens, and that's why we have a problem. Canarying is a very valuable technique for countering that. It should be said that so is continuous delivery, because it's much easier to spot a problem and roll a change back quickly if the person who made the change is still there, if somebody's watching it; if the change is understood, you can roll it back a lot faster and with a lot less impact. That's another way of mitigating this very common class of problems.

Before I get into the rest of this talk which is mostly looking at postmortems, I should say, I am talking about other people's public postmortems here and I just want to say that the organizations who've shared these postmortems are good engineering organizations. I'm not throwing any shade at any organization here about their incidents. In fact, I'm going to say the opposite and say that they're doing us all a service by helping us, by sharing that hard-won knowledge that they have, so kudos to all these organizations.

I should say as well, with very few exceptions, pretty much all of these incidents are very similar to something that I have seen myself with my own eyes, but for which a public postmortem does not exist. You can think about a lot of these as stand-ins for stuff that I've seen myself.

There are basically six categories of incidents that I want to discuss here. One is hitting limits that we didn't know were there. Spreading slowness is when our system starts to lock up because something has gone slow somewhere. Thundering herds are big spikes of coordinated demand. Automation interactions are becoming a very hot topic as people automate more. Also, there are cyberattacks and dependency loops.

A quick word about me. I've been fascinated by failure since my childhood. My dad was in software, and bedtime stories were sometimes tales of failure: "They deployed this system and it had this terrible bug and everything was locked up and nobody could find anything in the warehouse," because he was a logistics expert. Then I got into scuba diving very heavily for maybe 15 years as an adult. I used to do some very deep diving, technical diving, lots of tanks, spare torches, many reels, and that also gave me a very deep and visceral appreciation of failure, because I liked to not die.

I think these life experiences are probably what have funneled me into the SRE pathway, because I was a plain vanilla software engineer for quite a long time before I moved into the SRE field. If anyone's seen the O'Reilly Google SRE book, I wrote a chapter in that and also, the "Seeking SRE" book. I was at Google for quite a long time. I'm a very shiny new production engineer at Slack at the moment. Fun fact about me, I campaign against lethal autonomous weapon systems or killer robots. This is not Skynet; we don't think that they're going to develop smarts and come get you. We think that they're just going to be dangerous and incompetent and have bad effects.

Hitting Limits

This is our first class of failure of the six. The Instapaper outage was just over two years ago now, and I remember seeing this on Hacker News or wherever and going, "These people are having a bad time." They had their production DB on Amazon, backed by RDS. It turns out that behind all the layers there was an ext3 file system, which has a 2-terabyte limit. They didn't know about this and they ran slap bang into it, which meant that they could no longer do any writes on their production database. As you can imagine, this is bad for their service. What they had to do to recover was dump all their data out and reimport it into a new database on an ext4 file system. They were down for over a day. That's a really long outage, and they were running in a limited state for five days because they hadn't reimported all their data yet; they did the most recent data first.

I've got an interesting counterfactual here, because it turns out I've spoken to some people who were working at Amazon at the time and they said, "Actually, they could have converted their ext3 file system to ext4 and they'd have had a quicker resolution." But a counterfactual is something that we say when we look at something in hindsight and go, "Actually, they could have done this." We don't know that they could have done that; they were human beings under stress with imperfect information. Prior to this outage they weren't even aware this file system was behind their stuff, and clearly the Amazon people didn't think of it either, so that's a really interesting example of how we can look back at postmortems and say, "That could have gone very differently." That's our first instance of hitting limits that we didn't know were there.

This is a very different one. Sentry, an error-collection-as-a-service company, was down for most of the U.S. working day. What happened was they maxed out their Postgres transaction IDs. I could talk for ages about Postgres and its multi-version concurrency model because it's really cool, but the short version is that when you make an update or an insert in Postgres, it stores a transaction ID as an invisible field in that new row that you've added. What that means is that only transactions that have an ID greater than that number are able to view that row. This is a 32-bit value, it gets increased every time that you make a write, and eventually it will wrap around.

What you have to do is run a process that reclaims these transaction IDs. This is called vacuuming. The problem is that if vacuuming doesn't keep up with your usage of transaction IDs, you can run out. What happens when you run out? Postgres stops accepting writes. Again, there's a little bit of a commonality here with the Instapaper thing, but for a different reason. Here's another way that your database can completely stop accepting writes. They didn't have any quick or easy way out of this; they were sitting there waiting for this lengthy vacuum process to complete, and they ended up truncating some ephemeral database tables to get it to run in a reasonable period of time. This is not an isolated kind of failure; I've seen very similar postmortems involving Postgres transaction IDs from Joyent and also from MailChimp.
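For Postgres specifically, a periodic check on transaction ID age is a cheap early warning that vacuuming is falling behind. A minimal sketch, assuming psycopg2 and a placeholder DSN; the thresholds are illustrative, not Sentry's:

```python
import psycopg2

WRAPAROUND_LIMIT = 2**31           # ~2.1 billion transaction IDs before wraparound
WARN_THRESHOLD = 1_000_000_000     # complain long before Postgres stops accepting writes

def check_txid_age(dsn="dbname=production"):
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            # age(datfrozenxid) is how far each database is from its last freeze
            cur.execute("SELECT datname, age(datfrozenxid) FROM pg_database ORDER BY 2 DESC")
            for datname, xid_age in cur.fetchall():
                if xid_age > WARN_THRESHOLD:
                    print(f"WARNING: {datname} txid age {xid_age} "
                          f"({xid_age / WRAPAROUND_LIMIT:.0%} of the wraparound limit); "
                          f"vacuuming is not keeping up")
```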

Here's another completely different way again. SparkPost is a bulk mail provider, and in May 2017 they had a very bad outage where they weren't able to send mail for several hours. They have a very high DNS workload, because when you're sending email you need to look up MX records. They had recently expanded their cluster so they'd have more capacity, which you would think should be a safe, good, and easy thing. It turns out no, not quite, because unbeknownst to them Amazon Web Services was doing connection tracking in the backend, so all these short DNS connections were each using up an entry in a connection tracking table which has a limit. It was not documented and they weren't aware of it, but they hit that limit. What happened was their requests started failing, they weren't able to do their DNS lookups, so they weren't able to send their email.

Foursquare had a complete site outage for 11 hours, and when you're on call with your site down, 11 hours is a very long time. They were using MongoDB, as one does, and basically what happened was that one of the shards got bigger than the amount of RAM in the machine. That meant that yes, you can still do inserts and updates, but it's going to be much slower, because now you can't essentially cache your entire data set in memory. They ended up with a big query backlog and they just weren't able to keep up with the read and write demand. This of course was back in the day when MongoDB had that amazing global lock, so it could get quite slow at times. Of course, they were trying to reshard while under load, and that is not a great place to be.

Platform.sh is a fairly small web services company. One of their regions was down for four hours. They were using orchestration software that would query all of their ZooKeeper nodes via a third-party library, and what that library was doing was sending all these queries through one pipe with a 64k buffer. They had no idea that this was happening, but eventually they ended up with so many nodes in their ZooKeeper installation that the responses filled up this pipe. The code got an exception and failed, and because the pipe was full, it wouldn't read from it. What they had to do while their site was down was go in, debug what was basically a concurrency problem, and add a [inaudible 00:13:00] around this pipe. This is not work that you want to be doing while your site is down.

Those were half a dozen different ways that we can hit limits problems. We can hit limits on system resources like RAM, as we saw with the MongoDB outage at Foursquare. We can hit limits on logical resources, like the buffer size on a pipe or Postgres transaction IDs. We can see limits imposed by providers, and limits imposed by the type of file system. These kinds of limits exist in all sorts of places, and the scary thing is that very often they're hard limits, they're very hard to change quickly, and they're limits that we can run into completely without warning. So what can we do?

Part of this is fixable by load and capacity testing. I spent something like 15 months of my life once doing very little apart from load testing and then tracking down and resolving the issues that I found by doing it. That was quite a fun year, actually. Especially if you're imposing an unusual workload on a cloud provider, in the way that, say, SparkPost was, it's even worth load testing that your cloud provider is going to be able to handle your current load plus a margin. It's very important to include write loads in your load tests. A load test that doesn't include writes is very unrepresentative, because writes involve taking a lock on something, and even a small percentage of writes can really slow things down. Read-only load tests are not very useful.

Capacity testing: grow your data stores past their current size. That will tell you how much headroom you have, and whether or not you're going to run into an issue like, say, Instapaper did. Obviously, do this on a replica of prod. Don't forget your ancillary data stores, like your ZooKeeper. The important thing is, don't just test normal, happy operation with these increased-size data stores. Also test things like start up, shut down, and resharding, because only that gives you a good idea of what could happen in the worst case.

The best documentation of a limit that you know about is a monitoring alert. Literally, create a ticket. You probably don't want to page somebody for this; "Wake up, because you're going to run out of space in six months when you hit this limit" is not something anybody wants to be woken up by, but think about a ticket. Put lines on your graphs. One of the really interesting things about the Instapaper postmortem, if you read it in more detail, is that this was an instance where a team had taken over the system from a different team who had built it. That original team may well have known about the 2-terabyte limit, but when a service is new, you've just built it, you're way away from that limit, it's tiny.

If they had thought to add some alerting for "You're at 1.5 terabytes; do you know that you're going to run out when you hit 2?", that could have been something that stopped them from having that problem. If you do set up an alert like this, include a nice link to a runbook that explains the nature of the limit, so that people will understand why they're being alerted, because this could happen years in the future, when you could be gone. Try to embed the knowledge in your systems and not in your people as much as possible. You need your people to understand your systems, but people are pretty lossy, especially with things like this that don't come up often, so document them. The more complicated the response is going to be, "We need to switch to a different type of data store, we need to do a big migration," the more warning you should give for those kinds of things.
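As a sketch of "document the limit as an alert", here is a non-paging check that files a ticket with a runbook link well before a known hard limit; the helper functions, the runbook URL, and the thresholds are all hypothetical:

```python
TWO_TIB = 2 * 1024**4          # the hard limit we know about (e.g. ext3's 2TB limit)
WARN_AT = 1.5 * 1024**4        # warn with months of headroom, not days

def check_database_size(get_database_size_bytes, file_ticket):
    size = get_database_size_bytes()
    if size >= WARN_AT:
        file_ticket(
            title="Production DB approaching 2TiB file system limit",
            body=(f"Current size: {size / 1024**4:.2f} TiB of a {TWO_TIB / 1024**4:.0f} TiB hard limit.\n"
                  "Runbook (explains the limit and the migration options): "
                  "https://wiki.example.internal/runbooks/db-size-limit\n"
                  "Migrating to a larger file system takes weeks; start planning now."),
            priority="ticket",   # nobody should be paged at 3am for a six-month runway
        )
```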

Spreading Slowness

Here's our second class of failure, spreading slowness. I think people who have worked a lot with microservices will probably have seen this. Here is an example, and it doesn't always come from microservices; my first example isn't really a microservices one. In February last year, HostedGraphite went down because of Amazon Web Services problems, but they weren't on Amazon Web Services. What happened? How could this happen? It turned out that a lot of their customers were in AWS, and the customers connecting from AWS to HostedGraphite had slow connections and were hogging all of the connections that were allowed on their load balancers, which I think were HAProxies. So they got saturated. I'm going to talk a little about utilization, saturation, and errors in an upcoming slide, but this is a saturation problem, and when systems get saturated they're going to either start locking up and getting slow, or start rejecting load; in this case they were just locking up and getting slow.

Anybody who's familiar with DDoS attacks, or DoS attacks, will recognize this as basically being an accidental slowloris. This is actually an attack that you can do deliberately. The Iranians used this quite heavily about 10 years ago. There was a very contentious election, a political thing going on, and they wanted to attack their government's infrastructure without taking down the whole of the internet in Iran. They used the slowloris attack, where basically all you do is open connections to a web service and go very slowly, sending byte by byte just fast enough that it's not going to close the TCP connection.

The second instance: Spotify, April 2013. They have some microservices; one of them is a playlist service that manages your playlists. It got overloaded because another service spun up and started using it, and that wasn't planned for capacity-wise. They rolled that back, but the problem didn't go away after the rollback, because now the systems were in this bad, clogged-up state, where all their inbound request queues were full and the systems were just grinding away managing those queues. An earlier speaker talked about the perfect storm concept, where there isn't just one root cause of an incident; you have several factors that come together to break your system. This is a great example of that, because they had this accident where they had misjudged their capacity, but they had also set up verbose logging, which was making their systems' lives much harder, because they were spending all this time managing the queue of inbound requests and also doing all this logging.

One of the things you find with a microservices architecture is that when your services get into a state where they've been overloaded, because buffers fill up and queues fill up, they just get very sad, and they tend not to recover until you've taken things down and brought them back up in a very controlled way, making sure they don't get overloaded again. This is what Spotify had to do. They had to restart their systems behind a firewall in order to get back into a healthy state.

Square: their auth system, their login management system, slowed to a crawl. Their Redis had gotten overloaded, and the reason it was overloaded was that clients were retrying Redis 500 times with no back-off. This again is one of those perfect storm things, because this retry loop had been in their code for a while, as I understand it. It wasn't a new thing; it just ended up getting triggered. They clearly had some little wave of errors for some reason, and then it just spirals and spirals, because the things that receive errors are now sitting in a very tight retry loop, hammering the service. Their auth system was a very core part of their architecture and everything else was calling it, so because the auth system was slow, everything else got slow and locked up as well.

The big defense against this is failing fast. Letting your system sit there with a huge request queue and filled-up buffers is really dangerous. You end up with these locked-up systems that are very hard to unravel safely and get back into an operational state. You end up with systems where one big spike of errors or spike of demand can cause them to break until a human comes and fixes them. As one of the humans who tends to be on call for systems, I'd rather that didn't happen. These can be really nasty outages, out of all proportion to the event that causes them.

Enforce deadlines for your requests. If you're making a request from service A to service B, don't just let that request sit there forever; put a timer on it. How long do you expect this request to take? How long is it going to be useful for you to wait? If you're doing some work for a user on the internet, they're not going to wait around forever; there's no point really in having timeouts that are longer than 100 milliseconds, 200 maybe. Your users are going to get bored and go away, so you're better off saying, "Ok, I didn't get an answer, I'm going to do the best I can without it." Retrying is good, but don't retry infinitely. Retry maybe three times, not 500, because especially if you retry very quickly, you're hammering a service that is already returning errors, so it's already potentially unhealthy in some way.

Exponential back-offs: if you must retry into a service, retry quickly once, then wait a longer period of time, then a longer period of time again, and add a bit of variation, jitter, to the wait, so that a wave of errors doesn't cause a repeat echo of the wave exactly a second later or exactly two seconds later; spread it out. The circuit-breaker pattern is essentially this, except shared across multiple outbound requests. If you have a service that's talking to multiple instances of a backend and it notices that a lot of them are returning errors, it can share that state across multiple requests and say, "Ok, the service is unhealthy, I'm just not going to even make requests for the next few seconds. I'm going to let it recover."
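A minimal sketch of these ideas, bounded retries with exponential backoff and jitter, plus a very simple circuit breaker; call_backend stands in for whatever RPC or HTTP call your service makes, and the numbers are illustrative:

```python
import random
import time

def call_with_retries(call_backend, max_attempts=3, base_delay=0.05, per_call_timeout=0.1):
    for attempt in range(max_attempts):
        try:
            return call_backend(timeout=per_call_timeout)   # enforce a deadline on every request
        except Exception:
            if attempt == max_attempts - 1:
                raise                                       # three tries, not 500: then do without
            delay = base_delay * (2 ** attempt)             # exponential backoff
            time.sleep(delay * random.uniform(0.5, 1.5))    # jitter spreads out the retry wave

class CircuitBreaker:
    """Share failure state across requests: stop calling a backend that looks unhealthy."""
    def __init__(self, failure_threshold=5, cooldown_seconds=5.0):
        self.failures = 0
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.open_until = 0.0

    def call(self, fn, *args, **kwargs):
        now = time.monotonic()
        if now < self.open_until:
            raise RuntimeError("circuit open: letting the backend recover")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.open_until = now + self.cooldown_seconds  # back off for a few seconds
                self.failures = 0
            raise
        self.failures = 0
        return result
```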

Dashboards: I think a couple of the earlier speakers said that no matter what you do, incidents are going to happen. I think that's true. Even with good hygiene around load shedding, timeouts, and retries, it's possible to get into bad situations, of course. Have dashboards that help you to understand your systems and what's happening in them, so that you can fix things fast. Latency and errors are fundamental. They're golden signals for the health of a service. If you've got a bunch of microservices, you can quickly scan through and say, "This is the one that has high latency, this is the one that has high errors." Of course, in a situation where things are locked up, a lot of them are going to have high latency and errors, because the root service will be sitting there with its problem, whatever it is, but all the ones that call it are going to lock up as well if you're not doing proper deadlines and proper load shedding.

Saturation metrics are really interesting. Utilization is what percentage of the time my resource is busy, or what percentage of it is filled up. Saturation is: my thing is full. I'm using every thread in this thread pool and I have things queued up that would be using it if there were more. That's saturation; saturation is excess demand.

Most of these distributed systems problems, if you look into them deeply enough, you'll find stem from either an error, something that is misconfigured or bad or wrong, or from saturation, something somewhere that has gotten saturated and has excess demand it can't serve. I'm a big proponent of good utilization, saturation, and error metrics for your services. Think about physical resources, your RAM and your CPU time, but you also have to think about non-physical resources: connections in my connection pools, what's the utilization and saturation there, threads, locks, file descriptors, anything that your program can run out of. You're not always going to know these things ahead of time, but there is at least a set of common things that you should think about. By the way, utilization, saturation, and errors come from Brendan Gregg. He has a very good blog [inaudible 00:26:20] and also a really good book called "Systems Performance," where he talks about this stuff extensively. It's well worth reading for anyone who's interested in this stuff.
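As a sketch of applying the USE idea to a non-physical resource, here is how you might report utilization and saturation for a pool; record_gauge is a placeholder for whatever metrics client you use:

```python
def report_pool_use_metrics(record_gauge, pool_name, in_use, pool_size, queued):
    utilization = in_use / pool_size   # fraction of the pool that is busy right now
    saturation = queued                # excess demand: work waiting for a free slot
    record_gauge(f"{pool_name}.utilization", utilization)
    record_gauge(f"{pool_name}.saturation", saturation)

# e.g. report_pool_use_metrics(record_gauge, "db_connection_pool",
#                              in_use=48, pool_size=50, queued=12)
```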

Thundering Herds

Our third class of problem [inaudible 00:26:34]. This is actually fairly closely related; it's almost a sub-species of our spreading slowness. A thundering herd is correlated demand. If you start thinking about complex systems - Jason Hand was talking quite a bit about these - you think about things being correlated, about how things are interconnected and affect each other. We tend to think of events as being independent, but actually they're not. There's a lot of correlation in the world and there's a lot of correlation in our computing systems, and that's why we see black swans more often than we really should. You can't just say, "We see this event X one in a thousand times, and therefore seeing it three times in a week should happen only one in ten million weeks," because if there's some correlation that can cause multiple of these things together, that math just isn't true, and that happens in our systems too.

Coordinated demand, big spikes of demand - where do they come from? One source is users: organic demand from actual human beings clicking something. Typically, when you get that, it's because there's some kind of special event happening: New Year's Eve, if you work for a photo-sharing service; Black Friday, a great example for anyone who works in e-commerce; and flash sales of any kind. There's a very popular mobile game called Puzzle and Dragons, and it has an in-game currency called magic stones. They used to have an event where the magic stones would go on sale for less money at a certain time of the month, and people looking at the payment service would go, "Where is this demand coming from every month?", because suddenly everything would go off the charts, and it turned out to be this magic stones thing.

Sometimes you know about your causes of correlated demand from users because you're running a flash sale or it's Black Friday, and sometimes they will surprise you, but very often they're not from users; they're from systems. If you have a mobile client, be super careful about any piece of code that says, "I'm going to asynchronously update my data at midnight," because this may be fine when you have a very small number of users, but when that grows you'll be very sad, because you're going to have a huge spike of correlated demand at the exact same time. So again, as with retries, you need to jitter that over time.
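A tiny sketch of that jittering, smearing a "sync at midnight" job over a window so a large client fleet doesn't phone home in the same second (the window size is an arbitrary example):

```python
import random

SYNC_WINDOW_SECONDS = 4 * 3600     # spread the nightly sync over a four-hour window

def seconds_until_next_sync(seconds_until_midnight: float) -> float:
    # each client adds its own random offset, so demand arrives smeared out, not as one spike
    return seconds_until_midnight + random.uniform(0, SYNC_WINDOW_SECONDS)
```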

People always love to have cron jobs running at midnight, or just generally on the hour. If you look at a lot of systems, anything that's a backend to a cron job, you'll very often see the system pootle along, then, "It's the hour - look, we've got a little spike," and then it goes down. Hopefully your cron jobs are not heavy enough to overwhelm their backend systems.

Big batch jobs starting are an amazing source of this. The classic example at Google was always the intern MapReduce, although that was really unfair on interns, because most of the time it wasn't an intern who started it; that's the grown-up devs blaming the interns. Coordinated demand can even come from your own systems doing self-repair tasks. For example, you're running a Voldemort cluster and you have it set up to keep each piece of data on n of your m shards. One goes down; the cluster is going to start re-replicating that data so that you maintain your replication factor for each piece of data, and that's going to cause coordinated demand on your system. When you're building a system like that, you have to be very careful about how you smear that work over time, and how much of your system's resources you're going to take up doing that kind of recovery.

Here are some more examples: CircleCI, July 2015. This is a continuous integration company. They receive build hooks from GitHub all the time. GitHub was down, so they weren't receiving these webhooks. It came back and they essentially had a whole bunch of demand queued up. Everyone goes, "GitHub's back." You commit your thing and wait for the magic to happen. They had a big surge in traffic that wasn't normal for them. They had at the time some very complicated scheduling logic, so scheduling a build was quite a heavy operation, and that resulted in very big contention on their database.

One of the patterns that you see very commonly in robust distributed systems is where you're ingesting expensive things from the outside world, and you don't want to drop them. Instead of immediately doing the big, expensive thing, put them in a very cheap queue and then have an asynchronous process that's going to read out of that queue and do the actual expensive operations. Therefore, you can use the queue as a buffering mechanism so that you don't have to drop your incoming requests, and you also don't have to overload your backends. Most monitoring systems that I'm aware of do something like that with incoming metrics as well.
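A minimal sketch of that pattern, a cheap in-memory queue in front of the expensive work; real systems would usually use a durable queue, and handle_webhook/schedule_build are hypothetical names:

```python
import queue

work_queue = queue.Queue(maxsize=100_000)   # cheap buffer for incoming build hooks

def handle_webhook(event) -> int:
    try:
        work_queue.put_nowait(event)        # the cheap part: enqueue and acknowledge immediately
        return 202                          # accepted; will be processed asynchronously
    except queue.Full:
        return 429                          # shed load explicitly rather than fall over

def worker(schedule_build):
    while True:
        event = work_queue.get()            # drain at a rate the backend can sustain
        schedule_build(event)               # the expensive scheduling work happens here
        work_queue.task_done()

# run one or more workers, e.g.
# threading.Thread(target=worker, args=(schedule_build,), daemon=True).start()
```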

Mixpanel, in January 2016, was intermittently down for five hours. What happened here, again, is one of those great perfect storm examples. They were running in two data centers or availability zones. One of them was down for maintenance. Then they had some kind of unexpected spike in load - they don't really explain where it came from, maybe they don't know - and that caused saturation in their disk I/O. They had things backing up, and then that was exacerbated, another factor in this perfect storm, by their Android clients retrying with no back-off. So again we see retries causing problems, which is so common in microservice architectures or, in fact, any sort of architecture where you've got clients.

Discord had two two-hour incidents on one day. During one of them they were down completely, and during the other they were up, but direct messages were broken. They have a sessions microservice that depends on a presence service. They had a cluster of presence service instances, and one of them disconnected from the cluster. They explained that this was because of a CPU soft lockup, which happens when you have one process sitting on the CPU at high priority waiting for a spinlock, so the supervising process can't interrupt it, but that's not really the root cause at all. What is the root cause? When that one instance of the presence service disconnected, everything that was connected to it simultaneously tried to reconnect to something else, causing a thundering herd that they couldn't deal with.

A contributing factor here seems to be that they were running a fairly small number of instances of the service; the more instances you have to spread this increased demand over when you get these kinds of disconnect and reconnect events, the less likely you are to have this kind of problem. Another factor is well-behaved retries, and well-behaved behavior generally in shedding load when your system can't take it.

I think the big defense here is actually to realize that basically any service - any internet-facing service, I've said here, but potentially even an internal-facing service - can face this kind of thundering herd. We think this is an unusual and rare occurrence. We say, "Our normal traffic is 10k QPS. How can we possibly ever see 500k QPS?" It can happen. It can happen under the kinds of circumstances that I've been talking about here. It's not that unusual. It's not this outlier day; it's just a day when a thundering herd happens. You've got to plan for it, and I'm going to call back here to the talk that immediately preceded mine, Lorne's [Kligerman]. You have to plan for having a degraded mode. If I have to drop traffic, what traffic do I drop? For example, you might have a free tier and a paying tier of users; you probably want to drop your free tier first.

Think about queuing things that you don't have to process right away; not everything has to be processed synchronously. And test it; test what happens to your system when you overload it. Does it gracefully drop load and carry on serving what it can serve, or does it do that spreading slowness thing and just lock up, get saturated, and contaminate the rest of your system?
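A hedged sketch of a degraded-mode policy, shedding the lowest-value traffic first when the service is over capacity; the tier names and the shedding formula are assumptions for illustration:

```python
SHED_ORDER = ["free", "trial", "paying"]    # drop free-tier traffic first, paying tier last

def should_shed(request_tier: str, current_load: float, capacity: float) -> bool:
    overload = max(0.0, (current_load - capacity) / capacity)
    if overload == 0.0:
        return False                        # not saturated: serve everything
    # the more overloaded we are, the further up the tier list we shed
    tiers_to_shed = min(len(SHED_ORDER), int(overload * 10) + 1)
    return request_tier in SHED_ORDER[:tiers_to_shed]
```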

Automation Interaction

The fourth failure mode is automation interactions. I was actually at Google for this one; this is one of the ones I've seen with my own eyes. An engineer tried to send a rack of machines to disk erase, just one rack, and then, because of a regex accident, they accidentally sent the entire Google CDN - basically everything that caches video for YouTube, everything that caches the APKs you download from the Play Store - to disk erase instead. You're going to say, "How is it that you can physically erase all these disks at the stroke of a key?" You can't, but you can logically erase them by throwing away the encryption key, which is good enough. That's very fast.

The upshot of this was that YouTube and everything else actually stayed up; Google stayed up. Things were a bit slower and the networks were a bit unhappy for a couple of days, until basically everything got rebuilt and the caches got warmed up again, but they were lucky to survive this. They survived, by the way, because the CDN was a latency cache, not a capacity cache. A latency cache is a cache that you have just to make things faster, but you can serve without it. A capacity cache is one that is basically papering over the fact that you cannot serve everything that you need to serve from the source of truth, which is usually some disks somewhere. So it's very interesting to think about, when you have a cache of any kind, including your CDN: what kind of a cache is that?

Reddit were performing a ZooKeeper migration, and because they knew that their ZooKeeper wouldn't have accurate data for a while, they turned off their autoscaler, because the autoscaler relied on ZooKeeper having accurate information in it. However, they had some automation that was intended to make sure the autoscaler was running. The automation turned the autoscaler back on, and the autoscaler went, "I don't know what's going on here. This looks like we shouldn't have a site up," so it took the site down. This is one of the lessons of automation: automation is great, it's very efficient, it gets things done really well, and one of the things it can get done really well and really fast is taking down all of your stuff.

Complex systems are systems where you have multiple interacting parts, non-linear behavior, and memory in the system. This is the technical definition of a complex system; it isn't just complicated, it's multiple moving parts, interactions between them, and memory in the system. They are prone to unpredictable behavior, and all the systems that we deal with tend to be complex systems. It's the nature of the distributed-systems beast. Automation systems are a particularly fun instance of this.

What can we do about the fact that our systems, our automation systems in particular, are complex, unpredictable systems? You can create a constraint service, which is a service where your automation basically says, "Constraint service, is it ok if I do this thing? Is it ok if I take down this network link, or take down this instance, or anything else?" You can set lower bounds for your remaining resources. For example, if the Reddit people had had a constraint service and said, "I always want to have 2,000 web server instances up," that is the kind of thing that could potentially have prevented this kind of outage. I'm never going to say that you can 100% prevent them, because one of these days your constraint service will go bananas and then you're hosed anyway, but it will prevent a lot of it.

You can also set up things in your constraint service to say, "My automation wants to reduce capacity in some way, but I know that this service has recently received an alert, so I'm going to say no." That's an example of some constraint service logic that I've seen. It's very important that your constraint service should not limit what human operators are allowed to do, because human operators are smarter than your machines, and your human operators need to be able to jump in when the robots have destroyed everything and bring it back. If you write a constraint service like this and it applies to human operations at all, it should be only in an advisory mode, rather than preventing people from doing the work that they need to do, because humans are smarter than the machines.
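A sketch of what such a constraint service check might look like; the floor of 2,000 instances echoes the example above, and everything else (action names, the advisory-only rule for humans) is illustrative:

```python
MIN_WEB_SERVERS = 2000                       # lower bound on remaining capacity

def may_proceed(action: str, actor: str, current_instances: int, recent_alerts: bool) -> bool:
    if actor == "human":
        return True                          # advisory only: never block a human operator
    if action == "scale_down":
        if current_instances <= MIN_WEB_SERVERS:
            return False                     # automation would take us below the floor
        if recent_alerts:
            return False                     # the service just alerted; don't reduce capacity
    return True
```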

If you have any kind of automation service like the ones we've been talking about, you should make it easy to turn off. This is the big red button model, and if you build one of these constraint services that puts limits on what your automation can do, you can also use it as a place to put a big red button. I do literally mean a web page or a command that people can run that says, "Automation, just stop doing stuff, you're making it worse." It's important, if you have such a thing, to test it periodically; actually go every quarter, or every month, hit that big red button, make sure that the automation does stop, and then turn it back on again. This is a great exercise for people who are new to on-call in your team. Ask them to do these sorts of drills. It gives them practice, it gives them muscle memory, and it makes sure things work the way you intend.

It's very important, if you're building automation that changes your production environment, to have it log to one searchable place: an IRC channel or Slack channel, Logstash, just somewhere where you can easily go and see a log of everything that the automation has done to your systems. Very often you'll find yourself asking, "Why don't I have any web servers? I wonder what happened?", and you'll go in and find the automation interaction that caused it. I've seen robots fighting with each other. We had a particularly fun thing where we had n places a particular resource could be served from, and a piece of automation that could say, "This one's overloaded, so stop advertising these resources." When that happened, more load would go to the other places this stuff could be served from, and then the automation would go, "Ok, that's overloaded. Stop advertising some resources," and it would slosh back and forth until everything was fully drained.

Nobody envisaged that when they wrote the system, but it's a complex system and you can't predict it. The real message here is, make it controllable by humans; give humans all the power to fix things in an emergency, limit what your automation can do to something sensible, and log it so that what it's doing is visible to humans.

Cyberattacks

I've only got one incident in this category, and I'm really not a security expert. I know a little bit about the domain, but I'm not an expert. The reason I added this here is because this is the most impactful major outage in this entire slide deck. It doesn't come from a public postmortem released by Maersk; everything else in this slide deck comes from a public postmortem released by the company concerned, but Maersk was the subject of a really good article that Wired did, after somebody leaked all these details to a Wired reporter. Maersk is a big, global shipping company. Probably half the stuff that you are wearing right now, or the laptops that you are holding, were shipped by Maersk, if not more. They got infected by a worm called NotPetya, which was probably a Russian cyber-warfare worm. They got infected because it hit one of their accounting systems running in their corporate office, and from there it was able to leapfrog into their production systems. To stop the propagation they had to literally turn off everything, their whole systems everywhere. They couldn't unload ships or anything. A 20% hit to global shipping worldwide - everyone, not just Maersk. It cost billions; this was the most impactful outage.

Why do I talk about this when this is not a security setting? Some of us here are working on setting up production systems. It's really important to think about the extent of the damage that can be done to your production systems. One of the big problems at Maersk was that they hadn't, in any way, firewalled off their prod systems from their corp systems. One of the ways that IT is moving at the moment is towards really segmenting your production systems down as much as possible. Don't just have one big production environment that everything in your corporate network can connect to. Block it off. There's a big movement towards having proxies, where everyone has to jump through this proxy to access production just as [inaudible 00:45:05]. Validate and control what's running in your production. Minimize the worst possible blast radius for your incidents. This should be really sobering to anybody who has a setup where corp machines can just talk directly to prod.

Dependency Loops

The last category is dependency loops. Can you reboot all your services from scratch? If you work on an early-stage system, something that was set up over the last couple of years, new services, you can probably confidently say yes. Those of us who are working with older systems, where things have grown up over the course of several years, I think most of us cannot comfortably say yes to that. This is bad, because simultaneous reboots can happen, and that is a very bad time to notice that you have a circular dependency. Maybe your storage depends on your monitoring to figure out where to put its shards or something, and your monitoring depends on your storage being up because that's where it stores its data, and now you have a hairball and you can't easily start your systems when they're down.

GitHub, not quite as bad as what I just described, but they had a two-hour outage. They had this Redis cluster, which was supposed to be a soft dependency, just a cache that they could do without. They had a power disruption and a large chunk of their data center rebooted. Some of their machines didn't come back, and they had to rebuild their Redis cluster from scratch, pretty much, while their site was down. They had never intended to depend on Redis; they had thought that it would be a soft dependency that would just provide better latency. It's very hard to make sure that these soft dependencies don't unintentionally become hard dependencies that you cannot do without. You have to be constantly testing it. If you're in a situation like this, you have to turn off your Redis for a few hours every month to make sure that it's still a soft dependency, or one day it won't be true.

Trello had an outage when S3 went down. They keep their front-end web app assets in S3, and they have an API for their mobile clients. The API should have been fine, but it was doing a completely spurious check that the S3-hosted web app was there, and when it noticed the web app wasn't up, the API refused to serve, and this made their outage worse. They were going to have an outage on their front-end regardless, but they didn't need to have an outage of their API as well.

The defense here is layering. Take a look at your production systems and say: this is layer zero, the basics; this stuff needs to depend on nothing else. Then layer one can depend on layer zero, layer two on layer one, and so on. You've got to build it up in layers; that's the only way to avoid loops. Test the process of starting your infrastructure up; create a new staging environment from scratch periodically. That can be very educational. Figure out how long it takes with a full data set. Maybe you can restart your infrastructure, but maybe it's going to take hours. Is that ok? Maybe it is.
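One way to keep yourself honest about layering is to declare the dependency graph and check it for loops; a toy sketch (the services and edges here are made up):

```python
# Declared dependencies: each service lists the services it needs at startup.
deps = {
    "storage":    [],                       # layer zero: depends on nothing
    "monitoring": ["storage"],              # layer one
    "frontend":   ["storage", "monitoring"],
}

def find_cycle(graph):
    """Return a service involved in a dependency loop, or None if the graph is loop-free."""
    visiting, done = set(), set()

    def visit(node):
        if node in done:
            return None
        if node in visiting:
            return node                     # we came back around: a circular dependency
        visiting.add(node)
        for dep in graph.get(node, []):
            hit = visit(dep)
            if hit:
                return hit
        visiting.discard(node)
        done.add(node)
        return None

    for service in graph:
        hit = visit(service)
        if hit:
            return hit
    return None

assert find_cycle(deps) is None             # startable from scratch, at least on paper
```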

That wasn't an exhaustive list. There are other ways of having major outages, like your network can go down, you can have all sorts of correlated issues to do with time, leap seconds. These are problems that do have some fairly generic solutions and mitigations.

Disaster testing, which one of my predecessor speakers spoke about, is very useful. Fuzz testing is a great way of testing correctness, or at least whether or not you can crash services with bad data. Chaos engineering - there's a whole talk about that, so I won't go into it again. Incident management processes - again, one of the speakers who spoke earlier discussed this. It's important to have these. If your organization cares about responding to incidents well and resolving them quickly, use something like FEMA's incident management process. It's become standard in the industry. Practice using it. Lorne [Kligerman] talked about the problems that you have when these things are not well exercised. Use them for more minor incidents, use them for disaster testing drills. Just sometimes use them for practice.

Also, any on-caller should be able to get help. There should be a page or alias where I can literally say, "Help, get more technical people on the case," like a major incident response team of some kind, executives. Communication - figure out a way to communicate in an outage that doesn't require your own production infrastructure to be up. A phone bridge or IRC - figure out how people are going to know about it when your prod infrastructure isn't up. Use a laminated wallet card maybe; very simple but it works.

You need to be able to prioritize this work in your time budget. Think about that. If you think that there are scary areas in your systems, the kinds of things I've just talked about, if you've had near misses, if you have that feeling about these problems, you should be able to prioritize time for this. Every quarter, sit down and say, "We're going to spend x percent of our time on this stuff." If you can't do that, is your team overloaded?

These slides are going to be up on the QCon website. There are a whole bunch of links to some great further reading. I really recommend this book; there's a lot of wisdom in it. There are links to all the postmortems on each slide where they are mentioned.

 


 

Recorded at:

Oct 10, 2019
