Managing Systems in an Age of Dynamic Complexity



Laura Nolan looks at the common architectural shapes of dynamic control planes, and some examples of how they fail spectacularly - many major cloud outages are caused by dynamic control plane issues. Why are dynamic control planes so hard to run, and what can be done about it?


Laura Nolan is a Site Reliability Engineer at Slack. Her background is in Site Reliability Engineering, software engineering, distributed systems, and computer science. She wrote the 'Managing Critical State' chapter in the O'Reilly 'Site Reliability Engineering' book, as well as contributing to the more recent 'Seeking SRE'. She is a member of the USENIX SREcon steering committee.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.


Nolan: What is this talk about? I come from an SRE background, software operations for the last several years. I've noticed in the last decade, we've gone from running our systems ourselves by hand, to building and running the systems that run our systems. We've added an extra layer of abstraction here. It's really good in many ways. It lets us do things we couldn't otherwise do. There's a bunch of risks and considerations that come with it. That's what this talk is about. My previous year's talk at QCon in New York was a tales of fail talk. This too has a tales of fail element. If you like software operations architecture and failure, you're in the right place.

I'm a Senior Staff Software Engineer at Slack, Dublin. Before that, I was at Google for a number of years. I was a contributor to "Site Reliability Engineering: How Google Runs Production Systems", also known as the SRE book. I write for various places, including InfoQ. I also write a column for USENIX's ;login: magazine. I campaign for a ban treaty against lethal autonomous weapons. You should too, because we all know that software is terrible. We should not put it in charge of killing people. On Twitter, I'm @lauralifts.

Reliability in the Cloud

Cloud, lots of really smart people building really resilient systems. All this redundancy. All this failover. These things are designed from the bottom up to be resilient to any hardware failure, from an entire data center on up, and yet. I'm not going to pick just on Amazon, and yet. I'm also not going to pick just on Google. That's the top three major cloud providers. They all have outages. We know this. We have to build our own systems if we're using the cloud. We have to build our own systems to be resilient to this as best we can. Consider a humble server. It's actually not unusual to see multi-year uptimes. It's pretty unusual to have your public cloud go for an entire year without some major outage. This thing has no redundancy, just a single box running in one rack. No network redundancy to speak of. No power redundancy. Why are we failing to make our clouds more reliable than a single machine?

In the old days, we had human beings. The human was in the loop. We were configuring our servers by hand, or we were semi by hand. We were using scripts. We would have Puppet or we would have Capistrano back in the day. The decision to provision a new machine, and set it up, and verify it, and put in your load balancer pool, that was typically done by a human being. We weren't autoscaling. We sized things for peak plus 10%, or whatever your magic number was. We left things running. We weren't scaling up and scaling down. We weren't dynamic in that way. We didn't have job orchestration. Kubernetes and all these things were only a glimmer in somebody's eye. We lived in a mostly static world, mostly run by human beings.

Times have changed. This may not be the case everywhere. The future is here. It's just not evenly distributed. Certainly, the very large software houses, all of the public clouds, and any smaller software company that's running its own private cloud, or running on Amazon Web Services or any of the other cloud providers' clouds, tend to be using a lot of this dynamic, automated provisioning. The mantra in operations for the last decade or more has been to automate everything. Automate job orchestration, so no more provisioning by hand. No more scaling up and down to meet load by hand. All manner of routing, failover, load balancing, all this should be automated. This is not wrong. A lot of this work is repetitive and there are downsides to doing it by hand. This is the change that we have wrought. In the parts of the future that are distributed, we have largely succeeded in automating away a lot of this work that we were doing in 2010. That is the truth.

Why? Money. In a lot of cases, money is the overriding pressure. We are a business. We run our systems to support businesses. There are direct financial savings from some of these elements. By using job orchestration systems, having multi-tenant computing, and by scaling our jobs up and down to meet demand, if we're in a public cloud, either way, we can save money. We also save money by not having to employ as many people to run our fleets. The mantra is, automate yourself out of your job, so that we don't need to have 10 people to replace you next year. This makes a lot of sense. It's just important to remember that money is the overriding reason that we do this.

There are a bunch of other pressures. One major other pressure is better performance and latency. If you have a job that's being automatically scaled up and down to meet demand, it should have better latency because you have a lower chance of hitting an overloaded instance. A lot of these systems that we're going to look at are related to, how do you balance load? How do you distribute a load in a way that's globally optimized? How do I route your traffic to the nearest place where I can serve you without being overloaded? This is very good for your systems performance. This is very good for end-user perceived latency, which is good for money. We also want to reduce the repetitive toil of managing a large fleet.

When you run a system with millions of jobs and machines, you're dealing with failure constantly. It's real. We see failures constantly. You cannot page a human being to deal with that every time you have any little failure. It's costly. It's repetitive work. Nobody wants to do this. We also want to react faster to routine hardware failures. We don't want a box to break and have a bad hard drive or whatever, and sit there throwing errors until a human being comes along and takes that out of your load balancer pool. There is just no point of living that way. It's unnecessary pain for our customers. It's unnecessary stress and pain for us in operations.

We want more consistency in production. This is one of the big changes that we see in operations in the last few years. This concept of infrastructure as code. We don't want to be building machines by hand anymore. We want to be able to press a button, or better yet, have a dynamic control plane actuate an RPC and provision our instances of our services in a way that is completely consistent and repeatable. We do not want to be building pet systems anymore. That makes a huge amount of sense. We want to reduce compliance risks related to engineers touching production. Especially if you're in finance or dealing with any payments, or anything to do with the federal government in the U.S., or other governments, they get very picky about who is logging into your servers, for debugging or for operational stuff. You want to minimize that as much as you can. A lot of companies now are saying, can we automate all of our control and have humans never SSH in, just remove that access. As an aside, I think that's very difficult to do. You should certainly be removing that from routine operations. There's a bunch of pressures towards this pattern where our systems are being controlled dynamically by some control plane system.

As an aside, I'd like to call your attention to this wonderful diagram. This is an Apollo 14 spacesuit. It's got a sunglasses pocket on the right arm. I just want to call attention to that.

The Dynamic Control Plane Architecture Pattern

We've been talking about control planes and the dynamic control plane pattern that we're seeing. It arises in order to control other systems, and very often, to solve global optimization problems. This is our simplest form. We have some controller process. Our controller process is in charge. It's got some business rules. It's probably got some configuration. It might be accepting some control input from human beings, occasionally. It's doing operations on our service instances. These could be switches or they could be jobs, like server instances of some binary that you're running, or Kubernetes pods. Probably, these service instances are serving RPCs, or web pages, or something to someone else outside of the system. Probably, they have some load balancer pointed at them.

This is outside the control plane diagram that we're showing here. I didn't put it in because this could be pipelines that are processing data. A signal aggregator, this is the way that the controller senses what's happening in the service pool so it can react. Typically, these controllers are trying to keep the system in some status quo state, or a state that conforms to some configuration that they have. Things change in the service instances, they get more load or less load, or they change states, they go healthy or unhealthy. The signal aggregator will scrape that state from each of the service instances and make it available to the controller. This signal aggregator, it could be Amazon CloudWatch. It could be Prometheus, Graphite, or any of these tools. Sometimes, the signal aggregator and the controller are collapsed into one entity.
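As a rough illustration of the three roles described above (service instances, signal aggregator, controller), here is a minimal Python sketch. The class names `Instance`, `SignalAggregator`, and `Controller`, and the "keep N healthy instances" rule, are assumptions made for this example and are not taken from any system mentioned in the talk.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    healthy: bool = True


class SignalAggregator:
    """Scrapes state from every instance and exposes a summary to the controller."""

    def __init__(self, instances):
        self.instances = instances

    def healthy_count(self):
        return sum(1 for inst in self.instances if inst.healthy)


class Controller:
    """Drives the pool toward a configured number of healthy instances."""

    def __init__(self, aggregator, desired_healthy):
        self.aggregator = aggregator
        self.desired_healthy = desired_healthy
        self._next_id = 0

    def reconcile(self):
        pool = self.aggregator.instances
        # Act on the pool: drop unhealthy instances in place...
        pool[:] = [inst for inst in pool if inst.healthy]
        # ...then add replacements until observed state matches the config.
        added = 0
        while self.aggregator.healthy_count() < self.desired_healthy:
            pool.append(Instance(name=f"replacement-{self._next_id}"))
            self._next_id += 1
            added += 1
        return added
```

Calling `reconcile()` repeatedly is the "keep the system in its configured state" loop: a second call on an already-healthy pool does nothing.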

Autoscaling Groups

Here's a concrete example. Autoscaling groups, as provided by all of the major cloud services. The idea here is, I have a bunch of service instances, so running instances of my containers, or my binaries, or whatever it is. I want to scale them up and down based on how busy they are. You're going to collect metrics in some way. This could be some monitoring tool as provided by that cloud provider. Or, you could be collecting things through some of your own monitoring tools and making them available to the autoscaling logic. This could be something you built yourself. I've seen hand-rolled versions of this thing. You want to collect metrics, which could be CPU load, could be some count of the number of requests served per unit time. Based on that, your autoscaler is going to scale that pool of instances up and down. Busy, put more in. Not busy, take some out. Saving you money on your cloud bill.
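The scaling decision itself can be sketched in a few lines. This is a simplified proportional rule, assuming a single load metric and hypothetical parameter names; real cloud autoscalers add cooldowns, smoothing, and other safeguards that are left out here.

```python
import math


def desired_instances(current, observed_load, target_load, min_size, max_size):
    """Proportional scaling: size the pool so per-instance load nears the target.

    observed_load and target_load are per-instance values in the same unit
    (for example CPU percent or requests per second).
    """
    if current <= 0 or target_load <= 0:
        raise ValueError("current and target_load must be positive")
    wanted = math.ceil(current * observed_load / target_load)
    # Clamp so the autoscaler can never scale to zero or grow without bound.
    return max(min_size, min(max_size, wanted))
```

With 4 instances at 90% CPU and a 60% target, the rule asks for 6 instances. The `min_size` and `max_size` clamps are a small example of limiting what the automation is allowed to do.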

This is the same architecture as we saw. This is a control plane. One of the really interesting things that people forget about autoscaling is, the system is bigger than just this. You're making an assumption here that your system is not going to be scaled for the peak of your traffic. Therefore, in order to serve the peak of your traffic, you are dependent on this autoscaling system working correctly. You're also dependent on whatever process you have to provision a new instance, and to get it registered with whatever load balancing rules, or DNS, or whatever it needs to be registered with. That varies by system. From experience, very often, a lot of that stuff is made from Janky Python or whatever. You don't want this. If you have this pattern, whatever is needed to provision your instances is part of your core production system now, because it's required in order to serve your peak load. It's surprising how often things like this become fragile.

Here's a Kubernetes cluster. Basically, it's a system for orchestrating and managing your jobs. You say, Kubernetes, here's a bunch of nodes. You manage these. You run VMs on them, and you distribute jobs, instances of the tasks I want to run through this cluster. The Kubelet is a process that runs on each of the instances in your Kubernetes cluster. It monitors the load and the health of the pods, which are your jobs, your working production systems that are running. You got a control plane, which is the Kubernetes master. It's the same system. This is an instance of our control plane system that doesn't have a specific signal aggregator because the control plane just talks to the Kubelet directly. It should be noted, Kubernetes doesn't let you scale that large. It lets you scale up to about 5000 nodes. Partly, that will be because of the control plane limits in terms of how many signals I can get from how many nodes. I have seen job orchestration systems that scale much larger, but they had that level of indirection with the separate signal aggregator.

Global Controller

Getting a bit more complicated. Those first three instances we saw are like a regional, or a local, or a zonal controller. You might run multiples of these. You might run multiples of these in one data center. You might run one per region. They're separate. That will solve some of your problems. It will solve your problem of, I want to scale this job in this location to meet this demand. Or, I want to run this Kubernetes cluster in this location. Sometimes you have bigger problems. Sometimes you have problems where you need to think about global demand.

Very often this arises in the realms of networking, and load balancing, and routing, because if you've got a global backbone, or you're trying to steer user demand to your systems in a global way, you need to think about the problem in a global way. The thing about global problems is you cannot optimize for them locally. You can make them work by doing things locally, but you're going to have waste and potentially reduced performance in parts of your system, where the local parts of your system are not able to optimize the global view. Remember, this is one of the pressures that we see towards these global control systems.

Global DNS Load Balancer

Let's look at an example. DNS load balancing. If you want to route user traffic to the nearest instance of your service, if you're running in multiple places around the world, you need some DNS load balancing to do it, or an Anycast network. This would be something like NS1, which has a geo-load balancing product. You say, NS1, my service is here and here, please route my users to the nearest instance that has capacity to serve that load. You don't know where your demand is going to spike up at any particular time or in any particular geography. Let's say I have a data center in Singapore, a data center in Hong Kong, a data center in Japan. A bunch of load pops up in Asia because I do a launch or whatever. If your load balancer system knows what the capacity of each of those sites is, and how much load each of those sites currently has, it can steer the demand to the sites that are optimized for nearness and for not being too busy. This is good.

What you need here is you have your service pool A and B, and they're pushing up into their signal aggregator how busy they are. You push that up to your global controller, which is going to program all of the DNS servers with, basically, what percentage of load they should send to each of the sites based on how much capacity they have. This is good. This works really well. This works a lot better than just random, where you could be sending users to a place with an arbitrarily high latency. You don't want to send users in Japan to the East Coast of the U.S. if there's capacity nearer. You also don't want to send them to an overloaded site, or a site that might be down for maintenance. This is why we use systems like this. It's worth noting here, our DNS servers are not actually acting on the service instances, so it's missing one element of the control plane pattern. We're still seeing the signal aggregation and the global controller elements of it.
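The "what percentage of load goes to each site" step can be sketched as follows. This covers only the capacity half of the problem for one group of users and ignores nearness; the dictionary keys `capacity`, `load`, and `in_service` are hypothetical names for this example, not any DNS provider's API.

```python
def dns_weights(sites):
    """Return {site: percent of new traffic}, proportional to spare capacity.

    `sites` maps a site name to a dict with hypothetical keys
    "capacity" and "load" (same units) and "in_service" (bool).
    Returns None when no site has headroom, so the caller can keep
    the previously programmed weights instead of pushing all zeros.
    """
    spare = {
        name: max(0, s["capacity"] - s["load"]) if s["in_service"] else 0
        for name, s in sites.items()
    }
    total = sum(spare.values())
    if total == 0:
        return None
    return {name: 100 * value / total for name, value in spare.items()}
```

A site that is full or out of service for maintenance gets a weight of zero, which matches the two cases called out above.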

SDN WAN Controller

This is a simplified version of an SDN WAN controller. Specifically, it's a simplified version of Google's B4 SDN WAN controller. B4 is Google's software defined networking backbone. It's global. It's a network that spans the world. What you want to do with that network is you want to look at where load is coming from and going to. You want to figure out, how do I use the paths I have optimally to serve that load at the lowest latency? It's a global optimization problem. Load pops up in different places, and comes and goes. One of the reasons for that is that B4 is used for batch type work. If you need to copy some terabytes of data from here to there, you would use the network priority that gets it onto the B4 WAN, as opposed to the other one that they have.

You need to be constantly looking at what the demand in this network is, and optimizing for it. You got a bunch of switches all around the world, in the different data centers where this network is. They're sending metrics up to the collection system, which is sending those up to this global optimizer service, which looks at configured bandwidth limits for different users. It looks at the flows and it figures out how to optimize that. It pushes that configuration out constantly to these things called TE, for traffic engineering servers, which program paths into these switches to create this global network of switched paths, which are efficient in network terms. This is your complex, global dynamic control plane.

The dynamic control plane is not just any automation. This is automation that tends to control very important parts of your production systems like, are my jobs running? Is my network working? Is my DNS working? These are mission critical things. This is not something that's doing ancillary things like checking, do my backups work? Which is also very important, but your systems won't fall over today if they're not. It's an architectural smell. This automation means something. This automation matters. With these systems, we're mission control. We don't run our systems. The robots that we build, the systems we build, run our systems. It's harder. This is a more challenging engineering task. It can go wrong in a bunch of ways.

Mission Control - Apollo 13

This is the Apollo 13 lunar module. What happened with Apollo 13 is they had an accident. When the accident happened, everyone back in mission control was super confused. They were getting these really confusing signals from the telemetry that they were seeing. Here they are with their amused faces. They were just getting a bunch of signals about the electrical voltage fluctuating up and down, and pressure fluctuating up and down in cylinders, because their electrical system required the oxygen cylinders to provide oxygen to the hydrogen-oxygen fuel cells. These guys were confused. They were super confused because they were looking at these very abstracted signals that were being sent from space. How did they figure the problem out? The astronauts figured the problem out, because they were there on site. They heard a bang. On their gauges, they could see how the pressure was fluctuating up and down. They could see gas venting out into space. They could see what was going on because they were there. They had the mechanical sympathy and the direct observation of their systems. That is one of the big things that we lose.

This is the service module. It's all exploded and stuff. That's why Apollo 13 was that. Then there's a whole great story. There's a really good part in Charles Perrow's book, "Normal Accidents" that deals with this story, and emphasizes this particular part about understanding what happens and the mechanical sympathy of being able to spot, which is really worth reading. They got home. They managed to go around the moon. The whole thing where they cobbled together oxygen things from the wrong thing. They got back. Then NASA had an IPO. No, they didn't. They're happy because their astronauts are back. Because we're mission control, now it's harder for us to understand our systems in production.

December 24, 2012: AWS Elastic Load Balancers

We're going to talk about dynamic control plane incidents. I'm going to talk about some cloud providers. This one's pretty old. It was the night before Christmas, and API calls relating to managing new or existing elastic load balancers started to throw mysterious errors. Running instances of their services seemed to be unaffected. Their team was puzzled because many API calls were succeeding. They were able to create and manage new load balancers, but not to manage existing load balancers. Some things were just failing. What was happening? They were confused. They spent some time being confused, like 4 hours, which is a long time when your systems are down. They started to put together the fact that running load balancers were ok, unless there was a configuration update, or someone tried to scale them up or down.

Once they figured that out, they mitigated, sensibly, by disabling those scaling workflows. Then they dug in. They noticed that there was missing state data. That was the root cause of the service disruption. They did the data recovery process. They merged in whatever changes had happened since the last snapshot. It took about 24 hours to get back from this. That's a pretty serious thing when you can't create new load balancers or change your existing ones, because that's a pretty important workflow. It turned out that a developer had run something in prod that should have been run in dev, and it was an accidental deletion. It was a data management issue. What I really want to take from this incident is how difficult it can be, and how long it can take, to debug when things go wrong with these kinds of control planes. This is why making the actions and the operations of these control planes as understandable and clear as possible to the human operators is really key.

Operators Need Mental Models of Both the System and the Automation

A big part of the danger and the risk that these things add is the fact that operators now need two mental models. In the old days, you just needed to know how your load balancer worked. You went in and you made the changes by hand. Now, with an Amazon ELB, the operators need a mental model of how the ELBs themselves work, but also a mental model of how the ELB control plane works. Unfortunately, when you have two different things that are now interacting, that's more than twice as complicated as your first thing, or as two of the original things. The interactions add to the complexity. This is about four times as hard to do. Nancy Leveson talks about this a lot. She's an MIT professor who talks a lot about engineering safety, very often in military and industrial contexts. She has good diagrams where you can have the system that controls the system, and humans, what we do, as part of the unpredictability and the difficulty of controlling these systems. Very often, you have control actions taken by humans, interacting and competing with the control actions that are taken by the control plane system. Especially when things are going wrong, life just gets a lot more complicated.

There's also the very old saw, which is the ironies of automation. This has been studied a lot in the context of aviation. You automate something, and that's great. You make it more consistent. You make it more reliable. It takes away some human work. It also means that humans become less familiar with the day-to-day hands-on operation of that system. Therefore, when your automation breaks, or goes wrong in some way, the humans are now further away from that system and have less chance of being able to manually intervene quickly to resolve it. The ironies of automation.

Google Compute Engine lost all external network connectivity for 18 minutes globally. That's your entire cloud network down. How did this happen? Somebody submitted a config change to remove an old unused IP block. This should be very safe. Unfortunately, there was a bug. Some race condition happened and the system decided to remove every single one of the GCE IP blocks. Very sensibly, they have a canary system. A canary system is a small part of your system where you can send upcoming changes to see what effect they have, and make sure that they're within expected parameters. They did this, which is great. Unfortunately, it seems their system just threw the answer away, which is not so great. This is the second bug that we see here. We have an initial race condition bug that produced a completely incorrect configuration. We have a canary system that was there but not working correctly. They rolled this out. The IP blocks are advertised over BGP, Border Gateway Protocol. They advertise the same blocks everywhere, all around the world. This is known as IP Anycast. What that means is you have this magical virtual IP address, servable from a bunch of places. What the internet does is route you to the nearest network point where this is advertised. Nearest in network terms doesn't mean geographically nearest, it really means cheapest.

Usually, it's also the nearest. This is fine. The thing about IP Anycast is, if it's available anywhere in the world, and you're probing it, that will still work fine. Until the last one goes away. Then it's completely down. They were probing this system, and it looked fine, even as it was gone from 99% of places. Then it gets turned off from the last place and everything is bad. Their rollout process didn't have the signal that it needed. Their control plane lacked that signal. There are other ways of monitoring your Anycast systems. For example, you can put in a special query that will tell you exactly where the request is being served from. That could give you more signal about what's happening if you suddenly notice that all your probes are coming back from a smaller pool of results. That will give you a signal. That's something that you can do, but they were not doing this at the time. That's the counterfactual. We don't like those. It's a complex systems failure, multiple bugs and latent problems. You've got multiple dynamic control planes going on here. There's the network configuration. Then there's also the control plane that's doing the canarying and checking the network configuration. Both of those went bad in interacting ways.
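The "smaller pool of results" signal can be sketched like this. It assumes some probe query exists that reports which site answered; the function name, the site identifiers, and the 50% threshold are all hypothetical choices for this example.

```python
def anycast_alert(baseline_sites, probe_results, min_fraction=0.5):
    """Return True when probes are answered by too few of the usual sites.

    `probe_results` is a list of site identifiers, one per successful probe,
    as reported by a hypothetical "where was I served from" query.
    """
    if not baseline_sites:
        raise ValueError("baseline_sites must not be empty")
    seen = set(probe_results) & set(baseline_sites)
    return len(seen) / len(baseline_sites) < min_fraction
```

A plain "did the probe succeed" check stays green until the last site disappears. Counting distinct serving sites turns the same probes into an early warning.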

What do we learn from this? We learn that testing is a real challenge. These systems have bugs as all systems do. The trick is you get your nasty failures when you have multiple bugs. It's not just testing in terms of unit testing that you need, although you need that. You need to be doing integration and systems testing. Spinning up a little sandbox with all the instances of the related systems and putting them through scenarios. In an ideal world, you would have an integration test for your control plane system that would spin the whole thing up and have it try to put through a bad configuration, and make sure that the canary system catches it. These are very expensive and quite difficult to do. They're worth the investment, I think, to make sure that your failure scenarios are all still handled. It's worth doing these in your staging systems and production systems as well, as a real disaster test. Because if you don't, one of these days it may bite you.
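One way to picture the "canary must actually gate the rollout" requirement is a push function that fails closed. This is only a sketch: `rollout`, `canary_check`, and `apply_everywhere` are hypothetical names, and a real canary evaluates live traffic rather than a simple predicate.

```python
class CanaryRejected(Exception):
    """Raised when the canary did not explicitly approve a config."""


def rollout(config, canary_check, apply_everywhere):
    """Apply `config` globally only if the canary explicitly approves it."""
    verdict = canary_check(config)
    if verdict is not True:
        # Fail closed: a False, None, or otherwise missing verdict blocks the push.
        raise CanaryRejected(f"canary verdict was {verdict!r}")
    apply_everywhere(config)
```

An integration-style test then pushes a deliberately bad configuration (for example, one with no IP blocks left) and asserts that `CanaryRejected` is raised and that `apply_everywhere` was never called.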

June 2, 2019: Google Network Outage

Google had a big network outage for cloud but also for their other Google services. Elevated packet loss and network congestion for 4 hours and 25 minutes, which is quite a long time. Google's machines are segregated into multiple logical clusters. Each of them has its own cluster management software, like the Kubernetes clusters. They had a maintenance event happening in just one physical location, and this was the trigger for the outage. It should just have brought down some infrastructure in one site for maintenance. It could have been power, or network, or any maintenance that requires downtime. At that scale, maintenance events become common, not just daily events but multiple times daily events. You automate them, right? The automation software becomes a control plane because it's taking in these configurations. It's bringing down pieces of infrastructure and draining traffic away. It's monitoring to make sure those drains are happening correctly. The automation control plane was also interacting with the software control plane for the network. They had a bad configuration, which basically said, turn off all the network control planes everywhere. That was a misconfiguration.

This shouldn't have been allowed to happen. Due to another software bug somewhere else, it was allowed to happen. The control jobs are like what we saw with the B4 SDN WAN earlier. What they're doing is this global optimization, trying to program the network to efficiently route the packets without dropping anyone's packets, and without introducing any extra latency where possible. Without the control jobs, the network will still stay doing what it was doing before, but it won't be able to adapt to things changing. After a little while, it'll turn itself off. It will withdraw the network because it's going to go, "The network control jobs, they're not there. I can't adapt. Turn off." They actually managed to root cause the incident fairly fast. The problem was their network was in very bad shape and that impeded their ability to get everything back up and running. Also, their network control plane runs as a consensus-based replicated system. The automation had incorrectly brought down all of the instances in a bunch of locations. It had lost its state. That state wasn't being backed up anywhere useful. They had to rebuild that from scratch, which lengthened this outage.

We're seeing multiple misconfigurations, bugs, permissions problems. We have multiple dynamic control planes. It was very hard to predict that particular sequence of events. Data loss, these systems are critical stewards of the state of our systems. Data in these systems has to be treated like production data. You need to have all sorts of plans for dealing with failure, or data loss, or data corruption in your control planes. It's a key consideration when you're building these things. We don't just roll these out of Janky Python, and assume that they're going to stay running forever. They need real engineering going into them. The blast radius for these incidents can be really large. It can be your entire network. That's really serious stuff.

Testing your failsafe behavior, it's scary. Scary in dev even. It's easy not to do this. It's easy to assume that the behavior you've designed in for your system, to carry on and fail static when your control plane is down, works the way you intended. It's scary to test it, because you're putting risk into your system and we hate this. You don't want to be the person to push that button on that disaster test and take everything down, but it's necessary. It is necessary to have ways to test this. This is from the report on the Challenger disaster, and shows the O-rings that had no fail safes. It's an early classic of resiliency engineering. That report is well worth reading.
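Fail-static behavior is small enough to sketch and, more importantly, to test without touching production. The class below is a hypothetical data-plane component that keeps its last known-good configuration when the controller cannot be reached; the names and the use of `ConnectionError` are assumptions for this example. Unlike the network described earlier, this sketch never withdraws itself after a timeout.

```python
class FailStaticRoutes:
    """Keeps serving the last known-good config when the controller is unreachable."""

    def __init__(self, fetch, initial):
        self.fetch = fetch        # callable that returns the latest config
        self.current = initial    # last known-good config
        self.stale = False        # True while running on old data

    def refresh(self):
        try:
            self.current = self.fetch()
            self.stale = False
        except ConnectionError:
            # Controller is down: keep the old config and flag it as stale
            # so monitoring can tell humans we are failing static.
            self.stale = True
        return self.current
```

A test can simulate the controller going away by making `fetch` raise, then assert that the old routes are still returned and that `stale` is set.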

What can we do? Is it hopeless? Are we lost? We are basically forced into having these dynamic control plane systems. If we're large enough, we need to have at least some of these in order to mitigate the human toil and all of the downsides and uncertainties that come with having humans manage our systems. Get to a reasonable scale, and that scale is probably only in the hundreds to low thousands of instances, and these systems become inevitable. We have to do something.

Use Regional or Zonal Control Systems Where Feasible and Test them

One thing that's pretty important is to use regional or zonal control systems where feasible. Instead of having one system that tries to knit the entire world together, try and have those split up, unless there's a really good reason why you need that. It means that the maximum blast radius of any of these is much less. They need to be tested at least as carefully as your main production systems and probably more. They need unit tests. They must not be made out of janky shell scripts, although shell scripts can also be unit tested. They need integration testing. They need to be tested in production. They need occasional testing that their failure modes, and data recovery processes, and all of these things are really working the way you intend.

Plan for Time

You need to plan for the time. This is significant time. Whoever is running these systems needs to have time budgeted to stay familiar with what's actually going on under the hood: time to run all of your backup tools and your manual processes for recovery. It's so important. This is something that you don't take into account in the first year or two, when you build all these systems, because everyone has been working with the underlying system for ages. They know exactly what's going on with it. They know how all these operations work. Fast forward three, four years, people have forgotten what they knew. You'll have a whole bunch of new people on the team. You need to plan constantly for exercises, drills, and time to stay familiar with underlying operations.

Put Guardrails around Your Control Systems

This is one of the hardest engineering tasks, I think, around these systems. You should put guardrails around them. Your automated system, your control system should not be able to withdraw every IP block that your cloud is managing. It should not be able to shut down every job that your system is running. You need to put limits on these systems because they can and they will have bugs. They can and they will have negative interactions with other systems. You should put guardrails in and you should test these guardrails constantly in your integration tests, because these guardrails are a really important part of the business logic of your dynamic control plane system.
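A minimal sketch of such a guardrail, in Python. The names (`check_withdrawal`, `GuardrailViolation`) and the 10% limit are assumptions made up for illustration, not anything from a real system described in the talk. The idea is only that the control system asks a small, separately tested function for permission before it withdraws anything:

```python
class GuardrailViolation(Exception):
    """Raised when a requested change exceeds the allowed blast radius."""


def check_withdrawal(active_blocks, to_withdraw, max_fraction=0.1):
    """Return the set of blocks that may be withdrawn, or raise.

    max_fraction is a hypothetical limit: the largest share of the
    currently active blocks that one automated action may withdraw.
    """
    active = set(active_blocks)
    requested = set(to_withdraw) & active  # ignore blocks that are not active
    if active and len(requested) / len(active) > max_fraction:
        raise GuardrailViolation(
            f"refusing to withdraw {len(requested)} of {len(active)} blocks "
            f"(limit is {max_fraction:.0%})"
        )
    return requested
```

Because the limit lives in one small function, an integration test can assert that a request to withdraw every block is always rejected, whatever the rest of the control system decides.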

Sometimes humans are better. Large installations will need some dynamic control planes, but you probably don't need to have 10. You don't need to build a dynamic control plane system for every little thing. Maybe sometimes what you need is a little script that a human will actuate and check the output of, and hit ok, and the operation rolls out. That's a semi-automated state, but it's not a fully automated, autonomous control plane system. Consider, where do you need them?
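The "little script that a human will actuate" could look something like the following Python sketch. The function name `run_with_approval` and its parameters are hypothetical. It prints the planned steps, waits for a person to confirm, and only then applies them:

```python
def run_with_approval(plan, apply, ask=input):
    """Show the planned steps, ask a human to confirm, then apply them.

    plan  -- list of step descriptions (or step objects) to apply
    apply -- callable invoked once per step after approval
    ask   -- prompt function; defaults to input() and is injectable for tests
    Returns True if the plan was applied, False if the operator declined.
    """
    print("Planned changes:")
    for step in plan:
        print(f"  - {step}")
    answer = ask("Apply these changes? [y/N] ").strip().lower()
    if answer != "y":
        print("Aborted; nothing was changed.")
        return False
    for step in plan:
        apply(step)
    return True
```

Anything other than an explicit "y" aborts, so the default outcome of a distracted keypress is that nothing happens.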

Make Your Control Systems Easily Observable and Overridable by Humans

When you're designing these systems, they have to be designed to be able to be understood by a human operator. There are all sorts of things you can do. One really nice pattern is to add a status page to these systems. That's literally a web page that your system will serve, that will tell you exactly what the system has done recently, and why, and what it's going to do next, if known, and why. Logs, as usual. Traces tend to be maybe less applicable to these systems but could still be useful. Another really useful thing, where you have multiple control systems interacting: make them all log all of their significant operations to one place. It could be your Slack channel. It could be any other log ingestion mechanism. Just one place where people can go and look at the sequence of events. When you're trying to debug weird interactions between multiple control systems, that's what you need. You need to be able to see the sequence of what happened and figure out why. Overridable: systems are systems. They're fragile. They are brittle. They will have bugs.
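As a rough illustration of the "one place" idea, here is a small in-memory Python sketch. `OperationLog`, `Event`, and the field names are assumptions for this example. A real deployment would send the same records to a chat channel or a log ingestion pipeline instead of keeping them in a list:

```python
from dataclasses import dataclass
import time


@dataclass
class Event:
    timestamp: float  # seconds since the epoch
    system: str       # which control system acted
    action: str       # what it did
    reason: str       # why it did it


class OperationLog:
    """Single shared record of significant operations from all control systems."""

    def __init__(self):
        self._events = []

    def record(self, system, action, reason, timestamp=None):
        ts = time.time() if timestamp is None else timestamp
        self._events.append(Event(ts, system, action, reason))

    def timeline(self, limit=20):
        """Return the most recent events, oldest first, as readable lines."""
        ordered = sorted(self._events, key=lambda e: e.timestamp)
        recent = ordered[-limit:] if limit > 0 else []
        return [
            f"{e.timestamp:.3f} {e.system}: {e.action} (why: {e.reason})"
            for e in recent
        ]
```

The same `timeline()` output could back a simple status page, since it already states what each system did and why, in order.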

Humans are adaptable. Humans are able to deal with the unexpected, and make sensible decisions. Humans always need to be in charge. A really good pattern when you're building these automation systems is to build a big, red button. A big, red button is just some state bit that you can flip somewhere that says, "System, don't run anymore. Just stop what you're doing until this bit gets flipped back." Ideally, any on-call engineer in your organization should be able to flip that, because anybody might notice that these systems are running amok. Sometimes, if you have a system that's literally taking down your entire network, or all your serving jobs in multiple locations, you just want to be able to say, stop. That's a feature that should be built in to these kinds of automation, a way to stop really fast.
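A big, red button can be very small. The Python sketch below assumes the "state bit" is simply the existence of a file at an agreed path. The class name, the path, and `run_actions` are all hypothetical, and a real system might keep the bit in a config store instead. The important part is that the automation checks the bit before every action:

```python
import os


class BigRedButton:
    """Stop switch backed by a flag file: if the file exists, automation halts."""

    def __init__(self, flag_path):
        self.flag_path = flag_path

    def press(self, reason):
        # Record why the button was pressed so the next person can see it.
        with open(self.flag_path, "w") as f:
            f.write(reason + "\n")

    def release(self):
        try:
            os.remove(self.flag_path)
        except FileNotFoundError:
            pass  # already released

    def is_pressed(self):
        return os.path.exists(self.flag_path)


def run_actions(actions, button):
    """Run actions in order, checking the button before each one.

    Returns the number of actions that were actually performed.
    """
    done = 0
    for action in actions:
        if button.is_pressed():
            break
        action()
        done += 1
    return done
```

Checking before each individual action, not once per batch, is what makes the stop fast: at most one more action runs after someone presses the button.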

If we do all these things, maybe one day we'll be able to build a cloud or a large computing system that has better uptime than a single machine does.




Recorded at:

Aug 07, 2020