Transcript
Yu: My name is Denise. This is the talk, why are distributed systems so hard? I actually want to start by launching into a meta discussion, which is, why should we even bother learning about distributed computing in the year 2020? After all, we've all been using the cloud for a couple of years. Many of us are using open source orchestration tools like Kubernetes, for example, to manage our services for us. The question remains, why worry about distributed systems theory and history in 2020, when we arguably have tools that might do our jobs better than we can? After all, we've known for a very long time how to run services. Isn't it the case that microservices are smaller versions of services? Why pursue this type of architecture? Why pursue smaller decomposed, decoupled services? I generally think that a microservice architecture is motivated by a set of sociotechnical goals. Among those goals are, perhaps you want to make your engineering team boundaries reflect the boundaries of your business domain. For example, if you work for an e-commerce company, rather than having all your front-end people over there, and all your database administrators over here, you might instead build cross-functional, integrated teams. Maybe some of them work on payments. Maybe some of them work on search, and so on.
Why do you want to do this? If you can decouple your engineering teams to look like your business boundaries, maybe that means that you enable these teams to release and deploy more frequently and more independently. The payments team might not be blocked on a feature that the search team has to merge to trunk before you ship. Continuous delivery, higher frequency, more rapid deploys, all of these are good things. Why care about that? I think the pursuit of these goals supports the overarching goal that we ultimately want to improve the resiliency of critical systems that we need for our businesses to run. We want to protect these systems against catastrophic and showstopping events. We've had microservices for a couple of years now; after all, we know that if we're doing microservices, we can make a change to one microservice without having to change any others.
Microservices and Large Distributed Systems
I'm not going to stand here and preach to the converted. I'm not going to spend the rest of this talk rattling off a list of why microservices might be good. They might be useful in the right context. I'm going to presume that you're all here because you've at least heard of microservices before, maybe you've used them. Maybe you're here because you have a healthy skepticism towards them. That's all fine. The thing is that everybody, every team who runs microservices is now at the helm of a large distributed system. How large, of course, is going to depend on the specifics of your business. It's almost a law. In talks about microservices, you need to have one slide that shows somebody else's architecture. This is my slide for that. Monzo is a bank here, great company, fantastic people. I love them to death. Monzo hit 1500 microservices at the end of last year. This diagram shows all of their networking rules that govern how one service talks to all the other services in the system. Even if your architecture doesn't look like this yet (and maybe it someday will), the truth is that the specific technologies that we use to manage and deploy and orchestrate these systems will come and go. Technology trends, honestly, are really fickle, just like my cat's dietary habits. The fundamental principles for designing and operating services on top of distributed systems haven't changed that much in the last few decades. Understanding the fundamentals, taking some time to learn this stuff and to internalize it, is the best way to future proof the systems that we're building today. Beyond that, I honestly think that just learning about distributed systems is really rewarding. It's really fun.
My name is Denise. That's my Twitter. That's the wrong hashtag, it's actually QCon London. For accessibility reasons, I actually like to upload my slides 100% online. You can access them right now if you go to deniseyu.io/qcon. I'm a senior engineer at GitHub. I work on the community and safety team. Broadly, my team builds tools to make GitHub a more productive and safer place to build communities.
Broadly, we're going to cover a few big topic areas. We're going to talk about, why distributed systems? Why is distributed computing even a thing? We're going to do a recap. If you've studied distributed systems theory before, you might know that already; if not, don't worry, you'll know by the end. We're going to talk a lot about networks and partial failures. Then we're going to close out by talking about the human side of things: sociotechnical mitigations, and how the human part of the system is the most adaptive but possibly the most complex part of it.
How did this all happen? How did we get to where we are today? A long time ago, in a data center not too far away, and maybe some of you know from your own experiences where this data center is, all your business applications tended to be structured like this. Client server architecture, you had multiple applications. They all read and write from the same database. That one database was probably hosted on some machine that was in a server room with no windows in the basement of your building. This worked for quite a long time. It worked because IT was a necessary evil. It was a cost center that we needed to support so that other people could go and do the real money making operations. This stopped being true at some point in the '90s, when computers stopped being this cost center for a lot of companies. There's that famous Marc Andreessen quote about how software is eating the world. At some point in the '90s, computers actually started being business differentiators and competitive advantages, which meant that we had to start taking our IT functions a lot more seriously.
Customer Data
Of course, the core value for a lot of businesses then, and for a lot of businesses today, is customer data. We as an industry have always needed better ways to store and retrieve our customers' data. As IT became a differentiator, the way that we stored and retrieved our data, of course, evolved. Today, I don't want to say it doesn't work for anybody, but for most companies, it's not sufficient to just have your one massive database on one server in your basement anymore. Why though? One driver here was that data-driven business analysis is increasingly important. Business analysts and product managers want to run expensive SQL queries so they can make informed decisions about the business. Also, machine learning, artificial intelligence, and natural language processing created a new set of requirements for how we interact with our data. The truth is that we have more data than we've ever had before in human history. That statement will always be true, which is why I like saying it. You have intermediate layers, like key-value caches, and things like that, that help us to speed up data retrieval in a lot of different circumstances. This means our data is more distributed than it used to be.
Scaling Vertically
The first thing we did was we scaled vertically, which means to just bolt more compute power onto the machines that you already have. This worked for a while. It worked to a degree. At some point, it no longer made financial sense for a lot of companies to add that last 1%. This is unit economics: the marginal cost was greater than the marginal value it granted to companies. Even if you had really deep pockets, and you're like, I want all the CPU, just give me all of it, at some point you actually hit the current limits of hardware engineering. What do I mean by that? The physical limits of hardware engineering: Moore's law is the most common way to think of it. Moore's law states that roughly every two years, the number of transistors on an integrated circuit chip doubles, which basically means that processing power doubles, just like the size of my kitten in the first two years of his life. I know that Moore's Law is becoming less and less true these days. Historically speaking, this has been a pretty good way to track the increase in processing power.
Cloud Computing
Lucky for us, in the early 2000s, along came cloud computing solutions. Of course, we know and love Amazon Web Services, Google Cloud, Microsoft Azure. If you need to store your data on-premises, there are solutions that you can roll out over your existing hardware, like VMware vSphere, which I think is one of the most well-known options, though not the only one. You can take your own hardware and make it behave like a cloud. Cloud computing gave us an easy way to provision new machines in an on-demand fashion. You get it just in time, just when you need to scale up, which means that we are no longer constrained by the limits of vertical scaling. Because now we can do this thing called horizontal scaling, which means you can take a workload and distribute it over multiple machines.
Why might teams want to leverage this ability to horizontally distribute? The first reason is scalability, which means that sometimes you have a machine that can't handle the volume of data you're looking to store. Or maybe the request sizes are too big. One solution is to split that data into multiple chunks by some index, like an encyclopedia. You don't have an encyclopedia that's one book that's 4 feet long. Encyclopedias are real-life examples of sharding, where you break it into multiple volumes by first letter. Another reason is availability. If you're operating with multiple machines, it means that you gain the ability to have multiple copies of your data stored in different locations. By having your data served up by more than one machine, we build redundancy into our systems. The final reason is latency. If you can store your data physically closer to where your end users are going to be requesting it from, it means that pulses have to travel over less cable, so request times are going to be faster.
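To make the encyclopedia analogy concrete, here is a minimal Go sketch of how a client might pick a shard for a key, hashing instead of going by first letter since a hash spreads load more evenly. The shardFor helper and the shard count are invented for illustration; real sharding schemes also have to deal with rebalancing, replication, and hot keys.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardFor picks which of n shards a key belongs to.
// Hypothetical helper for illustration only.
func shardFor(key string, n uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte(key))
	return h.Sum32() % n
}

func main() {
	for _, customer := range []string{"Ada", "Grace", "Linus"} {
		fmt.Printf("%s -> shard %d\n", customer, shardFor(customer, 4))
	}
}
```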
Modern Distributed Systems
I want to spend a little bit of time talking about modern distributed systems. You may have come across the term shared nothing architecture before. It's ok if you haven't. This is the most popular form of network computing. It basically means that different processes on the same machine don't share access to physical resources. They don't share memory. They don't share CPU. They don't share devices. Honestly, this is a really reasonable and sensible design, even if it makes life a little harder for lazy programmers like myself, who are like, "Just give me the pointer," all the time. No, don't. That's actually bad. In fact, it's so sensible that the idea of process-based memory isolation is baked into some programming languages by default. In the Go programming language, for example, Rob Pike, who's one of the biggest contributors to the language, said, "Don't communicate by sharing memory; share memory by communicating," which basically means, have your processes send messages to each other. Don't let them arbitrarily read and write to the same locations in memory that are currently being accessed by other processes.
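Here is a tiny Go sketch of what "share memory by communicating" looks like in practice: instead of two goroutines reading and writing the same variable under a lock, one computes a result and hands it over a channel. This is just an illustration of the idiom, not code from any particular system.

```go
package main

import "fmt"

func main() {
	results := make(chan int)

	// The worker never touches the caller's memory; it only sends a message.
	go func(xs []int) {
		sum := 0
		for _, x := range xs {
			sum += x
		}
		results <- sum
	}([]int{1, 2, 3, 4})

	// The caller receives the value over the channel instead of reading
	// shared state guarded by locks.
	fmt.Println("sum:", <-results)
}
```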
Let's zoom out for a second. What does it actually mean to run a distributed system today? I think it's pretty clear to most people, either through some intuition or things that you've run or read about in the past, that building and operating a distributed system is a fundamentally different ballgame than building a system where everything is function calls on the same host, in the same process space. This intuition and this insight wasn't always super obvious. One of the earliest discussions about how distributed computing works, about the fundamental difference here, is the classic 1994 paper called, "A note on distributed computing," by Jim Waldo, Geoff Wyant, Ann Wollrath, and Sam Kendall, who all worked together at Sun Microsystems. This paper is really neat. It's a blast from the past. Because in 1994, nobody was building distributed systems at the scale that we see and take for granted today. A lot of people at that point were still hand waving and theorizing about what these systems would look like. People were asking, maybe hardware engineering will just solve our problems for us by the time we need to build those systems. It didn't, and it probably never will.
This paper is really worth reading in its entirety if you're interested in this stuff. I'll try to summarize it briefly. In this paper, they identify three reasons that distributed computing is a fundamentally different beast than local computing. The first is latency, which means the difference between processor speed and network speed. The second is memory access, this idea of accessing pointers and locations. The third is partial failures. I would say that of these three, memory access has been the one that turned out not to be so much of a showstopper, because, think back to our earlier discussion about shared nothing architecture. Of course, latency has been addressed a little bit by being able to replicate data. You can reduce that delta a little bit, but of course, it's not a totally solved problem. Partial failures is the thing we're going to dig into the most throughout this talk. According to Martin Kleppmann, a good summary of these challenges is that modern distributed systems look like this: you have a lot of different machines, they're running different processes, they can only communicate by message passing over unreliable networks with variable delays, and the system may suffer from a host of partial failures, unreliable clocks, and process pauses.
Distributed computing is really hard to reason about. We've known this since the early '90s. We've probably always known this. This was on the radar of people building computer-based systems in the '90s: a different group of people at Sun Microsystems came up with the eight fallacies of distributed computing. Originally, there were seven, and number eight was added later on by the Java guy, James Gosling. We know today that networks are not reliable. We know today that latency is not zero. That bandwidth is not infinite, of course. That the network cannot be assumed to be secure. That topology doesn't stay the same; we should assume that it will change. We know that there's usually not one administrator. Sometimes there's no administrator. Sometimes we're all the administrator, if you don't secure your systems. We know that transport costs are not zero. These days, with more and more different devices connecting to the internet, we know that networks are not homogeneous.
When I was learning about distributed computing, I felt like every new paper I read fenced off a whole category of things that I wasn't supposed to believe in, or wasn't supposed to assume to be true. I felt multiple times, there's so much unreliability. There are so many things that are only conditionally true, if very specific other things are true. How can we know what's even true about the state of the world when we're just starting out learning and building distributed systems? This actually brought me back. I studied philosophy in undergrad, and I never thought that I would be talking about my philosophy degree at a tech conference, but here we are. I thought, this actually sounds like an epistemology problem. Epistemology is the philosophical study of knowledge. It's the branch of philosophy that asks: we think we know some things, but how do I really know that I'm in London, speaking at QCon, standing on a stage? How do I know that this is not just an illusion? Broadly, within epistemic philosophy, you have two main schools. We have foundationalism, which says that there are fundamental truths about the universe, like how there are first principles in mathematics, and everything else is built on them. Or, you have coherentism, the second school, which says that nothing is absolutely true on its own, but when we have enough interlocking and reinforcing truths, they support each other. Matchsticks can't stand on their own, but lean enough of them against each other and they hold each other up.
In distributed systems reasoning, regardless of which school of epistemology you subscribe to, it's pretty hard to start to define the basic building blocks of truth. Of course, what if we're all just brains in vats and nothing is real? You won't find this in philosophy textbooks, but skeptics, actually, were the world's first internet trolls. Let's go back to message passing. Unreliable message delivery is totally a thing, another thing in the category of unreliable uncertainties. The classic case that people talk about is the Byzantine Generals Problem. The thought experiment goes, imagine two generals are trying to coordinate a war, but they can't communicate directly. They rely on this little unreliable messenger guy to ferry messages between the two. The whole time, they can't know for certain whether the message actually came from the other general, whether it was manipulated in flight, or whether it was delivered to some wrong place altogether. This is silly, but this principle actually comes up all the time when you're building distributed systems. We have, of course, some tools, some technical mitigations. We have tools to verify the validity of a sender, for example. We always have to be thinking about things like spoofing, and messages being dropped, and messages that get corrupted in flight. Of course, we can mitigate a lot of these things by doing a good job of monitoring and observing and tracing our systems. There are a lot of things that we're just never going to be able to know. We can be certain of one thing: shit's going to fail.
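One of those technical mitigations, verifying that a message really came from who it claims and wasn't altered in flight, can be as simple as attaching a message authentication code. Below is a minimal sketch in Go, assuming a made-up shared key; note that it does nothing about messages that are dropped or delayed, which you still have to design for.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"fmt"
)

// sign and verify are hypothetical helpers: an HMAC lets the receiving
// general check that a message came from someone holding the shared key
// and wasn't altered in flight. It cannot detect dropped messages.
func sign(key, message []byte) []byte {
	mac := hmac.New(sha256.New, key)
	mac.Write(message)
	return mac.Sum(nil)
}

func verify(key, message, tag []byte) bool {
	return hmac.Equal(sign(key, message), tag)
}

func main() {
	key := []byte("shared-secret") // made-up key for illustration
	msg := []byte("attack at dawn")
	tag := sign(key, msg)

	fmt.Println("genuine:", verify(key, msg, tag))                        // true
	fmt.Println("tampered:", verify(key, []byte("retreat at dawn"), tag)) // false
}
```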
The CAP Theorem
The CAP theorem debuted in the year 2000 when Dr. Eric Brewer from Google gave a talk at the Principles of Distributed Computing conference called, "Towards robust distributed systems." In the time since, a lot of people on the internet like to write about the CAP Theorem like this. They're like, here are three things. You can have two of them. You can choose any two, and you can throw the third one away. That doesn't really make sense, actually. That's wrong. It's not possible to design a distributed system that way. These days, there are a lot of different alternative frameworks to CAP. CAP is not the only way to think about the trade-offs for distributed systems design. You should at least think about it like this, with partition tolerance as the constant, and the trade-off happening on either side of it. The reason why is you can't sacrifice partition tolerance. That statement doesn't even make sense. Even if you're running a distributed system within one data center, even in one physical host, you can't 100% immunize yourself against network partitions or partition events. Literally, the only way you could 100% prevent the possibility of a partition is to only have one node, at which point you're definitionally not running a distributed system. Brewer himself has talked about this in his 2012 update, "CAP Twelve Years Later." He acknowledges that, as some researchers point out, exactly what it means to forfeit P, partition tolerance, is unclear.
I realized I haven't actually gone through each letter yet. Let's do that. Let's get a little bit deeper into each part of CAP. C is for, anyone know? C is for linearizability. What does this mean? Linearizability is a super narrow form of consistency. The word you'll encounter in the literature is register. A register is a single unit of storage, which you can think of as one row in a database. It can only have one value at any given point in time. Imagine you have a register update at time 0, which is before time 1, where the cat's state changes from hungry to full. Under linearizability, this means that if any single client saw that the cat is full, from then on, all nodes in the cluster have to return that the cat is full. You're not allowed to show any client that the cat is hungry anymore. This is really hard. This is ridiculously hard, probably impossible, actually. This is a super strong sense of the word consistency because it basically demands instant and universal replication. Replication lag can't really be zero, we know that. There's always going to be replication lag. At the very least you're bounded by the speed of light: the time it takes for a pulse to travel along some fiber optic cable. What can we do about this? People who work on databases, of course, spend a lot of time and a lot of energy trying to reduce replication lag. Of course, there are other trade-offs that they have to make along the way. Eventual consistency doesn't count here. Background asynchronous replication is not what the CAP formulation of consistency is talking about.
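Here is a toy sketch of why replication lag breaks linearizability, reusing the hungry/full cat register. The register type and the two copies are invented purely for illustration; the point is that a read returning the old value after another client has already observed the new one is exactly what linearizability forbids.

```go
package main

import "fmt"

// register is a toy single-value store. leader and replica are two copies;
// replication is deliberately not instantaneous, which is exactly the gap
// that linearizability says clients must never be able to observe.
type register struct{ value string }

func main() {
	leader := &register{value: "hungry"}
	replica := &register{value: "hungry"}

	// Client A writes to the leader at t0 and sees "full".
	leader.value = "full"

	// Client B reads from the replica at t1 (after t0), before replication
	// catches up, and still sees the old value: not linearizable.
	fmt.Println("replica read:", replica.value) // "hungry"

	// Only after background replication completes do the copies agree.
	replica.value = leader.value
	fmt.Println("replica read:", replica.value) // "full"
}
```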
There are a lot of different ways to define consistency. If you haven't seen this blog post before by Kyle Kingsbury on the jepsen.io blog, I really recommend you check it out. He maps out all the different definitions of consistency that we casually use and what they each logically imply. It's really mind boggling how many different ways there are to define consistency, a word that we think we understand. The takeaway here that I want you to walk away with is that consistency is not a binary state. I don't really want to say it's a spectrum because that implies continuous data, but it's a matter of degrees. There are many different degrees of consistency. We have to be really deliberate and careful about which one we actually need to design for.
A is for availability, which refers to the ability of clients to update data when they're connected to any node. We tend to think of availability as being a binary state. Reality is a lot messier because of this wonderful thing called latency. If you try to issue an update, for example, and you don't get a response for a long time, maybe that's just latency. But what if it's because the node is down and your request can't be processed at that point, because there's some outage? Latency wasn't part of the original CAP formulation, but it has some really important impacts on detecting and responding to network partitions. A real-world example: everyone's got a chronically late friend. You're like, "How long should we wait for our friend before we give up our dinner reservation?" One way to deal with this in real life, and when you're building systems, is to set a timeout. You can say, I'm only going to wait 10 minutes for my friend to turn up. That's a long timeout for building a computer system. Determining what constitutes a reasonable timeout is really tricky. The first time that you set up a system, you have no historical data. You might as well roll some dice. You're like, "12 seconds sounds good." Of course, monitoring over time is a good way to learn what's normal. You might be lucky enough to pick software that can learn on its own, over time, what a reasonable timeout is.
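Here is what that dice-roll timeout might look like as a minimal Go sketch, assuming an imaginary health endpoint. The important part isn't the number; it's that when the deadline fires, you still can't tell a dead node from a slow one.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

func main() {
	// 12 seconds "sounds good": in a fresh system the number is a guess,
	// and it should be revisited once you have real latency data.
	ctx, cancel := context.WithTimeout(context.Background(), 12*time.Second)
	defer cancel()

	// example.invalid is a placeholder host; swap in a real dependency.
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, "https://example.invalid/health", nil)
	if err != nil {
		fmt.Println("building request:", err)
		return
	}

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		// If the deadline fired, the error is ambiguous: the node may be
		// down, or it may just be slow. You can't tell from here.
		fmt.Println("no answer in time:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```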
The final letter P is for partition tolerance. A partition in this case refers to a network partition. I know sometimes this word gets used to refer to the individual encyclopedia volumes when you're talking about database sharding. That's not the sense that I mean partition today. I mean a network fault, or a net split, a partition event, lots of different terms for this. It basically means when you have an event that interrupts connectivity between two nodes of your system that are running either in the same data center, different data centers, whatever. During a partition event, your nodes might as well be on different sides of a wormhole. There's no way to know what's going on, on the other side. You don't know if the other side is responding to health checks. You don't know if it's continuing to read and write and process client requests. C is for consistency with an asterisk, availability, and partition tolerance.
The proof of the CAP theorem is actually pretty straightforward, at least as an intuitive sketch; it's not a rigorous mathematical proof or anything. Intuitively, we can reason about what happens when a partition event occurs. Imagine this is your server setup. You have two nodes in a cluster, which actually is not a good way to design a cluster. Imagine you have two nodes, and you have three different clients connected, one on the right side, two on the left side. A partition happens. You only have two logical ways to respond now. Either you let the clients continue to read and write on both sides of the split; if you do this, you necessarily sacrifice linearizability, because if any updates happen on the green side of the split, the pink client will never be able to see them. Or you pause writes on one side of the cluster until the partition event ends, which necessarily sacrifices availability, because the side that's paused won't be able to receive updates.
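The same two-way fork, written out as a toy Go sketch. The node type and its flags are invented purely to show the shape of the decision: during a partition, a node either keeps accepting writes that its peer can't see, or it refuses writes until connectivity comes back.

```go
package main

import (
	"errors"
	"fmt"
)

// node is a toy replica; partitioned simulates losing contact with its peer,
// and preferAvailability picks which of the two CAP responses this node makes.
type node struct {
	value              string
	partitioned        bool
	preferAvailability bool
}

var errUnavailable = errors.New("refusing writes until the partition heals")

func (n *node) write(v string) error {
	if n.partitioned && !n.preferAvailability {
		return errUnavailable // sacrifice availability, keep consistency
	}
	n.value = v // sacrifice linearizability: the other side won't see this
	return nil
}

func main() {
	cp := &node{partitioned: true}
	ap := &node{partitioned: true, preferAvailability: true}

	fmt.Println("CP node:", cp.write("new value"))
	fmt.Println("AP node:", ap.write("new value"), "value =", ap.value)
}
```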
Hardware Failure
We'll finish by zooming in a little bit more on partition tolerance. What is this idea? Why can't I just will it away? The reason why is because network partitions, partition events, are inevitable. How inevitable? We're going to pick on a small startup that probably knows a little bit about this space, so Google. In the first year of a Google cluster's life, it will experience five rack failures, three router failures, eight network maintenances, and a host of other hardware-related problems that Jeff Dean has written about extensively. Why though? Why can't hugely successful companies with lots of smart people just will away hardware failures? Hardware is just going to fail. You can't 100% design unfailable hardware. For example, the hardware holding your routers together mysteriously fails. Network cables will eventually just give out. In fact, apparently, sometimes sharks mistake undersea cables for fish because of the little electrical pulses, and they try to eat them. It's ok, because some journalist at Ars Technica wants you to know that as of 2015, sharks are no longer a threat to sub-sea internet cables, because Google and Facebook lay them and they wrap them in Kevlar before they go underwater.
As anyone who has written software will probably know, sometimes the software we write does things that we don't expect it to do. Some of those unexpected things will result in events that look like or feel a lot like network outages. In multi-tenant servers, which is almost always the case for public clouds, we don't have perfect resource isolation, which means that some users can burst a little bit in CPU or memory as they need to. That's actually a good thing overall. It can result in this weird case where, if you have a process that's part of your system running somewhere else on that same host, and everyone else is bursting, your process can slow down. It can look like there's a partition event. It can look like there's an outage over there. Some languages have stop-the-world garbage collection: if you're running low on memory, everything gets paused so the runtime can reclaim some resources and carry on. This causes everything to suspend, which also makes it look like a node is down.
Network glitches randomly happen. This isn't really an illustration of the principle, it's just the character Glitch from Wreck-It Ralph. Also, sometimes people will glitch. Sometimes people will do bad things. In April 2009, someone took a giant axe, crawled into a manhole, and chopped through a bunch of fiber optic cables in San Jose. A lot of people and a lot of data centers in Southern California were on the bad end of a network partition for a little while.
I hope you're convinced, either by this talk or by your own experience (it's more likely by your own experiences), that running, building, and operating distributed systems is really hard. Peter Alvaro, who's a professor out in California, asked his students to think about, what's the one hard thing about distributed systems, if you had to pick one word? One student goes, "Uncertainty." Peter says, "Very good. Uncertainty." Then another student raises their hand and says, "Docker." The first student is like, "Actually, that's better, take mine off."
Why does any of this matter? You might be sitting there like, "That was a cool story. Why are you telling me all this?" The practical reality is that we can't guarantee that every node of a system will always be alive and reachable. It means that some part of every single distributed system will always be at some risk of failure. Think about how hard it is to coordinate plans with your friend who's having bad luck. It's that, over and over again at a huge scale, with a lot of machines that you can't see and can't text. Maybe you can text machines now.
Mitigating Uncertainty
This whole discussion points to the Fischer, Lynch, and Paterson (FLP) impossibility result, which is the output of a very famous landmark paper from 1985. It basically states that in an asynchronous system, distributed consensus is impossible when at least one process might fail. We've just seen that in almost every case of running a distributed system today, there's at least one element out of your control which represents the possibility of failure. What can we do about this? We can't will away failure. We can't deny that it's going to happen. What we can do is manage it. We can build for it. To manage uncertainty, we have a set of technical mitigation strategies. One category, one genre of strategy, is to limit the amount of chaos in the world by limiting who can write at any point in time.
A very common pattern here is the leader-follower pattern. The leader is the only node who can write new data. Of course, followers can receive requests, but they have to forward those write requests to the leader, who then figures out what order to write them all in. The challenging part is that the leader is also a node that can become disconnected. We need a contingency plan to keep things writable in case that node goes offline. We use a process called leader election to elect a new leader from the remaining nodes who are responsive, and then a failover. Someone, or something, needs to say, that node can be the new leader, that's fine, and trigger a failover to promote the new leader. Another genre of ways to mitigate uncertainty is to make some rules for how many yes votes from the whole cluster are enough to proceed. This is a bit of an oversimplification, but these rule sets can generally be thought of as consensus algorithms. Raft is one of many consensus strategies that try to keep a simple majority of nodes in agreement about what's the latest data, what's the latest thing going into the register, or the ledger, or whatever. Of course, some of them use leader-follower inside of them for extra mitigation. Also, did you know that Raft doesn't stand for anything? It's called Raft because it's built out of a bunch of logs. Also, I've heard that people use Raft to escape from Paxos Island.
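To make "how many yes votes are enough" concrete, the counting rule at the heart of majority-based consensus is just a strict majority. This is a deliberately tiny sketch; everything that makes Raft actually work, like terms, log replication, heartbeats, and randomized election timeouts, is left out.

```go
package main

import "fmt"

// hasQuorum reports whether the yes votes form a strict majority of the
// cluster. This is only the counting rule; real consensus protocols wrap
// a lot more machinery around it.
func hasQuorum(yesVotes, clusterSize int) bool {
	return yesVotes > clusterSize/2
}

func main() {
	fmt.Println(hasQuorum(3, 5)) // true: 3 of 5 is a majority
	fmt.Println(hasQuorum(2, 5)) // false
	fmt.Println(hasQuorum(1, 2)) // false: one reason two-node clusters get stuck
}
```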
Complexity in Systems
What's even harder than getting machines to agree? We have consensus algorithms. We're coming up with better consensus algorithms every single day. The thing that we haven't quite cracked is getting humans to agree and getting humans to work together. I'm going to spend the final part of this talk talking not about computer systems but about human systems. The more fault tolerant we try to make our systems, the more complex they inherently become. When we start adding things to accommodate distributed failures like messages being dropped (message queues perhaps, or maybe load balancers, replication of different members of the system, or maybe a sprinkle of our favorite container-based workflow orchestration engine, Kubernetes), we introduce complexity into our systems. This complexity is not a bad thing. I'm not saying never use Kubernetes. Use Kubernetes sometimes. It's not a bad thing. Inherent complexity is a cost you have to accept to build systems with the degree of resiliency and fault tolerance that you need. Of course, uncertainty is also introduced by the humans that are operating and building your systems. Charity Majors talks extensively about this growing complexity. Systems today are getting more complicated. Mental models are getting harder to build. We are all distributed systems engineers now. It's getting harder for us to reason about what's going on underneath.
Managing Complexity
How do we manage this growing complexity? The first step is we need to spend some time understanding where the cognitive complexity comes from. As we build complex systems out of microservices, or out of whatever it is that you want to use, Woods' Theorem becomes increasingly important. Dr. David Woods wrote in 1988 that as the complexity of a system increases, the accuracy of any single agent's own model of that system decreases rapidly. As humans operating big and invisible systems, we all maintain mental models of what we think is going on. Imagine when you first join your team, or you first start to learn a new system. I think we like to believe that we build mental models like this: we take a new piece of knowledge, and we stack it neatly on top of a piece of knowledge that we had from before. In reality, we don't always have the time or resources to build those neat mental models. A lot of the time, the context that we're building on top of shifts right from under our feet. More often than not, we're just doing our best to hang on to the few relevant bits. Just like computers, humans are limited by the size of our level 1 cache. I know I am. The question we're trying to ask when we're building complex systems, as teams of humans, is how can we achieve consensus in our understandings of the world when a lot of the time we don't even have very good language to compare and contrast what's in our heads?
In this example, you tell three software engineers, we're going to have fish for dinner. They might have different conceptions of what that actually means. Or, maybe they have different mental models about what our systems look like under the hood. The systems that we're building and running are really complex. They're more complex, possibly, than the conscious bit that we can hold in our brains, which makes having conversations about them really hard. The next best thing we can do is we can try to tease out this information by looking for situations that are information rich proxies.
Incident Analysis
Incident analysis is really a ripe place for mental model examination. What is an incident analysis? Usually, teams conduct an incident review, sometimes called a postmortem. Although, personally, I don't like the term postmortem, because it implies that something terrible happened and something died. Maybe your soul dies a little bit. Nobody died, depending on your systems. Incidents are actually opportunities for fresh conversations and fresh learnings. If you want to know more about this, I really recommend, and I've linked some resources, that you seek out the work that John Allspaw and his team have been conducting for the last few years. You can also come to my AMA with Dr. Laura McGuire at 10:00 today. She knows a lot more about this than I do. How does incident analysis tease out mental models? Where is this connection? During an incident review, there are a couple of typical activities that teams might do. You might make a timeline of everything that happened. You might create architectural diagrams, or try to figure out the decision trees. It doesn't really matter, in the end, whatever format you choose. The goal of this meeting is to create an environment where the team can learn from incidents.
Blameless Postmortems
The one non-negotiable component of incident review is that it has to be blameless. Blameless discussions focus on learning rather than on justice. By having these conversations, you'll almost certainly discover that every person in the room had a different idea of what they thought was going on. Just to close on this point about blameless postmortems, I think one really clear litmus test of whether your incident reviews are blameless or not is to listen for counterfactuals. Counterfactuals are statements like, "Had Denise not put the coffee cup on the table when the cat was out in the room, then the incident would not have occurred." Counterfactuals are difficult because they're hypotheticals. They're like, had X happened, then Y would or would not have happened. Even if you establish that a different course of events could have happened, those events didn't happen. You can't observe them. You can't learn from them. They're not actionable.
Human decisions never occur in a vacuum. Instead, we ask, what seemed reasonable at the time? Because all human decisions were formed by some belief that a course of action was reasonable given the circumstances. If you want to learn more about this, I've linked to Courtney and Lex's presentation from SRECon last year, where they do a whole workshop on this. In addition to running your incident reviews in a blameless manner, please keep this in mind: don't accept human error. Don't say, Denise screwed up and that's why the incident happened. Don't accept human error as the root cause. Dig deeper, ask harder questions that will actually be more useful. Ask, for example, was the user misled by some design that was really unintuitive or difficult to navigate? Or, maybe somebody had too many alerts and they were frazzled; they were experiencing alert fatigue. Or, maybe the way that some software was designed didn't take into account the assumptions that a user would bring with them into the web portal, into the control room, whatever it is. In fact, if you start reading into the latest thinking around resilience engineering, there are a lot of folks suggesting that maybe there are no root causes. One of the best talks on this is by Ryan Kitchens from Netflix, also linked in the notes.
We all have different frames of reference, and we don't have great vocabulary to figure out the differences. Of course, humans can make mistakes. I think that's just part of being human. The great thing about humans is that we're also a lot more adaptable than machines. The more we depend on technology, and push it to its limits, the more we're going to need highly skilled, well-trained, and well-practiced people to make our systems resilient, acting as the last line of defense against the failures that will inevitably occur. This is a paper titled, "The ironies of automation," still going strong at 30, also really a good read.
Humans Can Learn and Adapt
In pursuit of the sociotechnical goals for building microservices, for building resilient systems, for building large, sprawling, distributed systems, we're always taking out a loan against inherent complexity. We as humans can learn and adapt. That's our superpower. With this superpower, I challenge you to challenge yourselves to empathize with each and every user, not just the customers, not just the paying ones, but also the ones who are operating your systems, the ones who are going to get paged at 2 a.m. When we think about where to draw the lines, for example, for bounded contexts, those are design decisions. Those are not "technical" decisions. Those are design choices. We should always be challenging ourselves to make those choices in a way that optimizes for the humans that operate and build these systems. That means that we should choose tools and processes that promote the things that help humans do their jobs better. Make sure your tools and processes are promoting learning at a sustainable pace. Because after all, we owe it to our end users and to our teams to understand and design for the whole system, including the fleshy human parts.