Designing Fault-Tolerant Software with Control System Transparency



Jon Moore discusses four principles from the architectural paper "GN&C [Guidance, Navigation, and Control] Fault Protection Fundamentals" by Robert D. Rasmussen for building fault-tolerant software.


Over his career, Jon Moore has been a researcher, management consultant, network engineer, small business owner, tech lead, architect, and technology executive. He is equally comfortable leading and managing teams and personally writing production-ready code.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.


Moore: My name is Jon Moore. I'm a software engineer at Stripe. We're very proud of the fact that over 99.999% of requests to our API get processed successfully. We spend a lot of time thinking about reliability. What I'd like to do for you is to bring some perspectives on reliability and fault tolerance, maybe from an industry that is a little outside of our day-to-day. Does anyone recognize this? Voyager 2. This is the deep space probe, still in service 46 years after launch. This made the news a little earlier this year, because someone accidentally sent a command to it that pointed its antenna away from Earth, so it couldn't actually receive additional commands. Ingenious operators–and we all know that ingenious operators are always there to save the day for our complex systems–figured out how to put a bunch of transmitters together and they “shouted”, is how I heard it described. They got back in touch with it, and they got it repointed. What I found really interesting was that if that hadn't worked, sometime this month, here, in the month of October, there was actually a regular routine that was going to say, “I should probably check if I'm pointed back at Earth”. If it wasn't, it would have looked around at the surrounding stars, and it would have actually pointed the antenna back at Earth. I'm blown away by the fact that somebody thought about that 46 years ago, where that could have been the way that the spacecraft recovered. Obviously, the folks that work in the space industry spend a lot of time thinking about fault tolerance and reliability. This is a mission-critical aspect of what they do, because otherwise, it's very easy to create a $100 million brick flying through outer space, and that's an expensive brick.


This is one of my favorite papers. I actually reread it from time to time. It's called GN&C Fault Protection Fundamentals by Robert Rasmussen, who works for the Jet Propulsion Laboratory, which is an organization that works closely with NASA on designing spacecraft. GN&C is guidance, navigation, and control. These are the main software systems here. This paper actually distills a ton of experience spent with really thinking through how to build really fault tolerant systems into some core principles. We're going to go through them in this talk, and I'm going to explain what the principles are. I'm going to then put them into settings that are more familiar to us in a modern enterprise software system. We're going to see what this might look like, as we maybe take some of these principles and try and see what would they look like if we applied them to a microservices architecture.

What are States, Behaviors, and Models?

We're going to talk about just a few definitions. The first one is: we need to talk about system state. I like to use a chess game as a way to illustrate this. If we think about a game of chess, there are 64 squares on the board. Each square can hold one of a variety of either white or black pieces. Then if we know what's in each square, plus we know whose turn it is, that actually fully captures the state of a game of chess. Obviously, this is the initial configuration of a game of chess. We can also capture the state at any point in the middle of the game. Another key thing to understand is behaviors. Behaviors are the things that allow us to go from one state of the system to the next state of the system. Again, in the case of chess, if I look at a knight like this one, it is allowed to move to any of the red squares that I've highlighted here. That says, if I'm in this state, this is the next state that I could get to by moving one of the pieces. There's also a concept of models. In particular, models represent which states of the system are possible, not necessarily which ones are desired, but which ones are possible. What I've shown here is actually an impossible scenario. That's for a number of reasons: white has two kings on the board, black has zero kings on the board, and there is a white pawn on the first rank. None of these things are possible in a regular game of chess. We're going to get to this a little bit later where we're thinking about what are all of the possible states of the system, and not necessarily just the ones that we want it to be in, but what are all the possible states. Then, finally, the other thing that models let us do, when we put this all together, is make a plan: if we understand what state we're in and how different behaviors will lead to other states, we can then put together a plan.
For example, if white has two bishops on two different colored squares, and their king, and black just has a king, it is possible for white to force checkmate in this situation. There is a path and a plan that white can undertake to actually get where it's trying to go, which is winning the game.
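As a rough sketch of how these three definitions could look in code, the snippet below represents a state as data, a behavior as a function from a state to the squares reachable next, and a model as a check on which states are possible at all. All names (`GameState`, `knight_moves`, `is_possible`) are illustrative assumptions, not from the talk or the paper, and the rules are deliberately incomplete (no checks, pins, or castling).

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Square = Tuple[int, int]  # (file, rank), each in 0..7


@dataclass(frozen=True)
class GameState:
    # Maps an occupied square to a piece code such as "wK" or "bP".
    pieces: Dict[Square, str]
    white_to_move: bool


KNIGHT_JUMPS = [(1, 2), (2, 1), (2, -1), (1, -2),
                (-1, -2), (-2, -1), (-2, 1), (-1, 2)]


def knight_moves(state: GameState, origin: Square) -> List[Square]:
    """Behavior: squares a knight on `origin` could move to next."""
    color = state.pieces[origin][0]
    targets = []
    for df, dr in KNIGHT_JUMPS:
        f, r = origin[0] + df, origin[1] + dr
        if 0 <= f < 8 and 0 <= r < 8:
            occupant = state.pieces.get((f, r))
            # A knight may land on an empty square or capture an enemy piece.
            if occupant is None or occupant[0] != color:
                targets.append((f, r))
    return targets


def is_possible(state: GameState) -> bool:
    """Model: reject states that cannot occur in a regular game."""
    codes = list(state.pieces.values())
    if codes.count("wK") != 1 or codes.count("bK") != 1:
        return False
    # Pawns never stand on the first or last rank.
    for (_, rank), code in state.pieces.items():
        if code[1] == "P" and rank in (0, 7):
            return False
    return True
```

The impossible position described above (two white kings, no black king, a white pawn on the first rank) is a value the `GameState` type can express, but `is_possible` rejects it.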

These are going to be our fundamental definitions. Some of you may recognize some of this from another setting, which is that we have a system that's trying to control something and be in charge of making something happen. It needs to have some concept of its objectives; what is it trying to do? We're going to take observations of the system that it's trying to control, its current state. The control system has to have a model of what's going on, and what do I need to do to get that thing into the state I want it to be in. Then that's how it selects the behaviors that it wants to exert on the system under control. This is a classic control loop model, which may be recognizable to some of you. That's how we're taking this terminology, and this is what we're going to be talking about for most of the talk.


In the paper that I mentioned, the key factor here is this concept of transparency. Another way of thinking about transparency is really just making things explicit. There are, in particular, four main things that we're going to want to make transparent. Those are (1) the objectives that we talked about, (2) models of the system that we're trying to control, (3) knowledge about its current state, and then (4) actually the control system itself, the logic that we're using there. The assertion is that we want to pursue transparency across all four of these things, and that's what lets us build really reliable systems. We're going to walk through each one of these in turn.

Transparent Objectives

Let's start with transparent objectives. The paper defines a transparent objective as a contract between the issuer of the objectives and the system that's actually achieving them. If we want to put things in regular terms, the issuers are our clients. These are the callers of the things that we're doing. The important property is that success or failure is mutually obvious. What that means is, if I issue you a request, and you come back to me and say, “I did it”, I should agree you did it correctly. We should agree that you actually did what I asked you to do. Let's walk through a couple of examples where maybe we don't entirely have that property. Let's look at DNS. I can issue a “dig” command on my laptop to go resolve a hostname, and one of the things that will come back in the response is actually a time-to-live. In this case, it says, this response is good for the next 41 seconds. If I ask for the same hostname within the next 41 seconds, then, great, I get a cached response. It's fast. It's all good to go. Then the question is, what happens after 41 seconds? At this point, my cached response is stale. It's past its expiration time. My DNS client will dutifully say, “I'm going to go refresh that cache entry. Let me go ask again.” What do I do if I don't get a response? Do I try again? Do I continue waiting? Do I maybe even want to say, I don't want to wait around, because maybe there’s a situation where using a stale response is actually ok? Maybe these are internal systems and I'm trying to establish a new connection to another internal service, where for some period of time, I might be ok using a stale cache entry. The “dig” command in particular does not give me any way to say, “can I have a stale cache entry if I would like one?” In this case, not only does the cache not know what the right thing to do is, but I don't even have a way of expressing what my desires are.

A better example of this might be HTTP. If we look at Cache-Control headers on a request, there are various directives that you can supply here. Some of you may be familiar with the idea of the no-cache directive that says, “cache, don't give me a cache entry, please go all the way to the origin to give me a response.” Or max-age, which says, “you can give me a cache entry as long as it's no older, it's no further in the past than this amount.” There's actually a whole bunch of other directives that are part of the HTTP standard that you may not be as familiar with and not use as often. There are things like max-stale which says, “I actually am ok with a stale entry, as long as it's no more than this number of seconds stale.” Or min-fresh, which says, “you can give me a cache entry as long as it's not going to expire in less than this amount of time.” Or only-if-cached, which is like, “definitely don't talk to the origin. I don't want to wait for that business. Give me a response out of the cache, otherwise, give me an error.” These are all ways that the client can be a lot more explicit about what it is that it wants. It might even be interesting to say, “please respond to me within this amount of time.” I did not see any proposed standards around this. I think there are various companies that maybe have internal versions of this with custom headers and things like that. That's another example of more transparency in the objectives.
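To make the directives concrete, here is a minimal sketch of how a cache might combine them into one decision. It is a simplification under stated assumptions: `no-cache` is treated as "go to the origin" (as described above, although the standard actually allows revalidation), `max-stale` is modeled only with a numeric value, and the type and function names (`CacheEntry`, `RequestDirectives`, `decide`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CacheEntry:
    age: float                 # seconds since the response left the origin
    freshness_lifetime: float  # seconds the origin said it stays fresh


@dataclass
class RequestDirectives:
    no_cache: bool = False
    only_if_cached: bool = False
    max_age: Optional[float] = None
    max_stale: Optional[float] = None
    min_fresh: Optional[float] = None


def decide(entry: Optional[CacheEntry], req: RequestDirectives) -> str:
    """Return "serve_cached", "fetch_origin", or "error_504"."""
    usable = entry is not None and not req.no_cache
    if usable:
        remaining = entry.freshness_lifetime - entry.age  # negative = stale
        if req.max_age is not None and entry.age > req.max_age:
            usable = False
        elif req.min_fresh is not None and remaining < req.min_fresh:
            usable = False
        elif remaining < 0 and (req.max_stale is None
                                or -remaining > req.max_stale):
            usable = False
    if usable:
        return "serve_cached"
    # only-if-cached forbids contacting the origin, so the client gets an
    # error (HTTP uses 504 Gateway Timeout for this case).
    return "error_504" if req.only_if_cached else "fetch_origin"
```

With an entry that is 9 seconds past its lifetime, `decide` goes to the origin by default but serves the stale entry when the caller sends `max_stale=30`. The caller's intent, not the cache's guess, selects the behavior.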

Really, what we want to think about is these objectives are constraints on the system's behavior. In the case of an HTTP cache, I want to take the response headers I get from the server, on those responses I saw earlier, plus the request headers that I get from the client plus protocol defaults. For example, the response to a POST request is by default not cacheable. Then those things constrain what correct cache behavior might be. We want to start thinking about the fact that there may be multiple sets of objectives that we are then combining to control what we're trying to do. For each one of these principles, I'm going to talk through some more pragmatic checklists or pro tips. If you're designing a system, and you are trying to ask, “ok, in this scenario, what's the right thing to do?” If your answer is, “it depends what the caller wants”, then that might be a good indication that you have some hidden or implicit objectives here. Maybe you want to start making those a little more explicit. Whether those are in protocol headers, like HTTP. If you're using something like gRPC, you might want to consider adding additional fields and request messages and things of that sort. That's about making it very clear that our objectives fully indicate and express our intent that we want from the system.

Transparent Models

Next thing that we have to talk about are models. Here's the quote, "Models provide expectations against which evidence of state can be compared." I had to read this a few times before, I was like, what are they talking about here? I think about this as if the system is in state X, and then I do Y to it, then I should see Z as a result. This is the idea of like, when I do something to the system, what do I expect to have happen? That's really what a model is about. One of the important things, and the paper stresses this, is that we may be tempted to think about like, there's normal behavior, and then there's abnormal behavior. That we have a tendency to treat these things differently. I suspect lots of people spend a lot of time on the happy path. I'm not going to ask how well tested is your error handling path here. The paper actually says we're in the wrong ballpark here, that this is not actually the way that we should think about it, because there's just behavior. This is why, for example, like one of the possibilities for that Voyager 2 probe is that my antenna, it might not be pointed at the Earth. That is a state. I don't make deep space probes for a living, but I expect that there's actually a bunch of other stuff that that thing is doing at the same time. It has to understand that it needs to keep doing those things, while it's trying to figure out what to do about this antenna. When things are separated out, it's hard to have a coherent response where we're prioritizing things as well. It's really important that we think about models as really capturing all of the possible states of the system.

If we want to put things in terms of an example, let's say I have a service that's talking to a database. Then let's say, I start getting errors back on my queries. One way of looking at things is like, what are we going to do here? Those errors probably are showing up in my logs and metrics somewhere, and the model of that database, the model of what's going on with that thing is probably in an operator's head at this point. This is a human model in your brain. Maybe you've got a feature flag that you can flip. You can say, let's just go ahead, disable that database. Maybe we have some fallback behavior we can use instead, or at the very least, we can at least fail fast rather than waiting on timeouts or other things that might be going on here. This is one where that model, though, is not really part of the software system at this point. Let's take another step forward and let's say, what if the service had some concept, a model of the database itself. One pattern that you may be familiar with is the idea of a circuit breaker. The way that a circuit breaker works is this is going to observe the requests we're making to the database. If I get too many errors, the circuit breaker is going to trip and it will automatically turn that database off. It'll disable it automatically for some period of time. This is a pretty simple model. It's really two states: the database is up, or the database is down. That's a little bit simplifying because there's also the state where, "Maybe I'll try it again." This is a model that the software system itself is using to describe and interact with this dependency. I think we're a step ahead here from where we were with just a mental model and a human operator.
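A circuit breaker of the kind described here can be sketched in a few lines. This version assumes "too many errors" means a run of consecutive failures, and it makes the "maybe I'll try it again" state explicit as `half_open`. The class name, thresholds, and the injectable `clock` parameter are illustrative choices, not a reference to any particular library.

```python
import time


class CircuitBreaker:
    """A tiny model of a dependency: closed (up), open (down), half_open (retry)."""

    def __init__(self, failure_threshold=5, reset_after=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after  # seconds to wait before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at = None           # None means the breaker is closed

    def state(self) -> str:
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_after:
            return "half_open"
        return "open"

    def allow_request(self) -> bool:
        # While open, callers fail fast instead of waiting on timeouts.
        return self.state() != "open"

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        if self.state() == "half_open":
            self.opened_at = self.clock()  # trial call failed, reopen
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()
```

A caller would check `allow_request()` before each query, then report the outcome with `record_success()` or `record_failure()`; when the breaker is open, it can go straight to fallback behavior.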

Then the question becomes, is there something more interesting that we could do? Is there a more interesting model other than just up or down? For example, what if the database is just slow? The database is still up, so the circuit breaker is not going to help us here. Is there something useful we can do? One of my favorite things is Little's Law. It's a relationship that applies to request processing systems. It says that the average number of requests that are in flight being worked on by the system, which is N, is equal to the average transaction rate, the rate at which requests are coming in, that's X, times the average response time, which is R in this case. It's a simple equation. No one needs to do any calculus, derivatives, anything like that. What's really great is it applies to any request processing system. It applies to subsystems of request processing systems, so you can actually nest it. This is a super powerful model. Just to put things in perspective, let's say we're cruising along, and the database is responding in an average of 10 milliseconds per request. I'm getting 2000 requests per second through to it, which means an average of 20 requests in flight. It all sounds good. Now if I see that the average response time has risen to 20 milliseconds, and I can still only have those 20 requests in flight, I need to realize that I'm probably only going to be able to get 1000 requests per second to it. That's a very simple thing that Little's Law says. This is the type of logic that is very rare for systems to actually do, even though it's a pretty simple model. What you want to do about it is very dependent. It definitely depends, because maybe I only want to send 200 requests per second. Maybe I'm ok with 20 milliseconds response time average. Maybe I do nothing, literally nothing, because it's ok. I may also want to say, this is actually not ok. Maybe I need to start shedding load, or I need to start prioritizing certain queries over other ones, and things like that. 
It's only when I have some model of the idea of the way these things are related that I can start making intelligent decisions about these things.
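The arithmetic behind that example follows directly from N = X × R. The sketch below works it through, under the assumption that the number of requests in flight is capped (for example by a connection pool). The function names are made up for illustration, and response times are kept in milliseconds.

```python
def in_flight(rate_per_s: float, response_ms: float) -> float:
    """Little's Law, N = X * R: average number of requests in flight."""
    return rate_per_s * response_ms / 1000.0


def sustainable_rate(concurrency: float, response_ms: float) -> float:
    """Little's Law rearranged, X = N / R: throughput at a fixed concurrency."""
    return concurrency * 1000.0 / response_ms


def can_sustain(desired_rate_per_s: float, concurrency_limit: float,
                response_ms: float) -> bool:
    """True if the desired rate fits within the concurrency limit."""
    return desired_rate_per_s <= sustainable_rate(concurrency_limit, response_ms)


# 2000 requests/s at 10 ms average keeps 20 requests in flight.
n = in_flight(2000, 10)
# If the average rises to 20 ms and concurrency stays at 20,
# throughput is limited to 1000 requests/s.
x = sustainable_rate(n, 20)
```

Whether that limit matters depends on the objective: `can_sustain(200, 20, 20)` is true, so a caller that only needs 200 requests per second can do nothing, while `can_sustain(2000, 20, 20)` is false and calls for shedding or prioritizing load.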

We've actually looked at a couple different points on this. The model that we have of a dependency can be anywhere on a spectrum from oversimplified–meaning, might as well not have done it. It doesn't let us do anything interesting–to full fidelity, like we can model that the database has a query cache, and it's got a number of threads, and how much memory is on the database server. We could do all that. We're almost reimplementing a database at some point, when we're doing that. That's probably more effort than it's worth. What we really want are points somewhere in the middle where the model is interesting enough to be useful. As we saw with the circuit breaker, like even just up and down, that's a simple model, but it's still useful. Maybe with Little's Law, we get a slightly more powerful model that lets us do a few more things, but it's still easy enough for us to implement, and so on. We want to be able to make sure that the model is simple enough that we can actually use it and understand it relatively easily. Simon Wardley has a great quote, which he says, "All models are wrong, but some are useful." That's really what we're looking for here.

What's our checklist for models? When we're looking at a service, do I have a model of my own capabilities? Number one. When we say model it means like, I can actually look in the code and say here's the part of the code that represents my capabilities. Do I have a model for each of my dependencies? Then, I think most interesting: can I adjust my concept of my own capabilities in response to changes in what I understand about my dependencies? That's where we have now gotten a very powerful model in terms of being able to manipulate this thing.

Transparent Knowledge

The third principle is around transparent knowledge. This is the idea that we want knowledge to be explicit: representing it clearly with timeliness and uncertainty, and striving for a single source of truth. Let's talk for a little bit about uncertainty. I actually ran this command, this “sntp”, which uses the network time protocol to try to understand what the difference is between my system clock and some other time server's clock. In this case, I ran this while I was in Philadelphia, so that's the red X on the right, and the time server is somewhere in Colorado, that's the red X on the left. This is what it gave me back. What this basically said is, we think the time server's clock is ahead of your clock by 47 milliseconds, plus or minus 63 milliseconds. In other words, I don't even know which one is actually ahead.

I actually had an interesting thing where I was trying to debug something in production very recently, where we're like, why are things happening in this order? It doesn't seem like it's right. That's because this thing was happening in this data center over here, and this one was happening in this other geographically different data center over here. Sixty-three milliseconds is like pretty big when we're looking at databases that might give me responses in single-digit milliseconds. What was actually happening here, and Michelle also talked about this, like clocks are a thing. What I think is really interesting is that this “sntp” command actually gives me the plus or minus, on what the answer is. Folks that are making deep space probes, a lot of what they do is they're collecting data. They have instruments that are measuring physical things. When you're in that setting, like we're used to our measurement instruments having tolerances. We get plus or minus out of these things, when we think about it. If you remember high school science class or something like that, you learn about tolerance and accuracy and things like this. A lot of the things that we do in computer science and software, we don't think about that. We're like, no, this is the thing. Absolutely. Period. This might be something interesting for us to think about here.
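One way to carry the plus-or-minus along with a value is to make it part of the data type. The sketch below is an illustration only; `Measurement`, `sign_is_known`, and `definitely_before` are hypothetical names, and the uncertainty is treated as a simple symmetric bound.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    value: float        # best estimate
    uncertainty: float  # plus-or-minus bound, same units as value

    @property
    def low(self) -> float:
        return self.value - self.uncertainty

    @property
    def high(self) -> float:
        return self.value + self.uncertainty


def sign_is_known(m: Measurement) -> bool:
    """True if the whole interval lies on one side of zero."""
    return m.low > 0 or m.high < 0


def definitely_before(a: Measurement, b: Measurement) -> bool:
    """True only if a's interval ends before b's interval begins."""
    return a.high < b.low


# The clock offset from the example: +47 ms, plus or minus 63 ms.
offset_ms = Measurement(47, 63)
```

Here `sign_is_known(offset_ms)` is false because the interval runs from -16 ms to +110 ms, which is the "I don't even know which one is ahead" situation. The same idea applies to event timestamps from two data centers: two events 5 ms apart cannot be ordered when each timestamp is only good to within 63 ms.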

Then there's a notion of timeliness. If I go make an HTTP request to something that's cached, one of the headers that I may get back is the Age header. In this case, it says, it's actually been 95,000 seconds since I got this from the origin. Maybe that's super interesting for me to know. A really common pattern is that I've got some piece of my process that's polling something. It's like, on a regular basis, what's your current state? What's your current state? What's your current state? Then when I want to make use of it, I go grab my local cached copy. Like, what's the last thing that you saw? An interesting question is, would you do anything different if that local copy that you got, and maybe you're polling once a minute, if you figured out that like, this thing that I got is from three days ago, is the last time I successfully got the current state. If we're not annotating it, how do we know? Would we do anything different? We often don't annotate sources of data with when they were generated. That may or may not have importance to your domain, but I think it's something for us to consider.
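A minimal sketch of that annotation: keep the time of the last successful poll next to the value, and make readers say how old a value they are willing to accept. The class and method names (`Observed`, `PolledValue`, `poll`, `read`) and the choice to raise `LookupError` are assumptions for illustration.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass(frozen=True)
class Observed:
    value: Any
    observed_at: float  # clock reading at the last successful poll

    def age(self, now: float) -> float:
        return now - self.observed_at


class PolledValue:
    def __init__(self, fetch: Callable[[], Any],
                 clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch
        self._clock = clock
        self._last: Optional[Observed] = None

    def poll(self) -> bool:
        """Call on a schedule; a failed fetch keeps the previous observation."""
        try:
            value = self._fetch()
        except Exception:
            return False
        self._last = Observed(value, self._clock())
        return True

    def read(self, max_age: float) -> Any:
        """Return the last value, or raise if it is missing or too old."""
        if self._last is None:
            raise LookupError("no successful poll yet")
        age = self._last.age(self._clock())
        if age > max_age:
            raise LookupError(f"last observation is {age:.0f}s old")
        return self._last.value
```

With this shape, a caller that polls once a minute and silently fails for three days no longer gets the old value back as if it were current; it has to decide what to do about the staleness.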

Let's talk about a single source of truth. In the earlier example, I showed one service instance talking to a database. The reality is I probably have multiple instances of the service, possibly talking, and it's probably even multiple database servers running in a cluster. Even if I do have explicit models, each service instance may develop its own model of what's going on with the database. The only place that I actually get a coherent view of what's going on in that database is probably in my observability stack, in some dashboard where I'm collecting all of the errors, and logs, and metrics, and things like that. Again, that's really for human consumption, in many cases. Not all cases. We definitely have automated alarms, which start to look at things like this. Those tend to fire alarms and wake up a human. Where do we have the opportunities to start to build a little more smartness into this?

When I think about the service that I'm building, if I have a model of a dependency, do I have one place where I can find out what its current state is, or at least what I think its current state is? Has that detail been annotated with freshness and accuracy, or uncertainty? Do I have some way of keeping it updated? If we're thinking about like, what's the latency of queries to the database, and I'm sending a lot of queries to the database, fine, I can probably just observe what's going on there, and get a pretty good understanding of what's going on with the database. However, if it's a database that I only access intermittently, I may need to build something more proactive that goes with it. Maybe something that just sends a lightweight query to make sure like, I can actually still talk to that thing. That can be something useful for us.

Transparent Control

Then the fourth principle is around transparent control. This is where we tie everything together. If I have knowledge of how the system behaves, what its current objectives are, and what its current state is, then no other information is required to choose a control action. That's actually the thing that we're shooting for. This is actually a really useful property. What it says is, I can think of my control system as a function that takes my current objectives and the current state of the system, and then decides what to do. When we think about the control decisions in this setting, there's a couple things that are really important here. One, this is stateless. You can tell because we're literally passing the state in, so the function itself does not have any current understanding of state. We're going to see that that could be maybe useful later when we try to build one of these. It's also testable. Again, remember that models represent all of the possible states of the system. If that's the data type where we say state, that data type should be able to express every possible state of the system. If we're doing that, and we're explicitly passing it in as an argument, it seems like it might encourage us to have a little more shot at trying to understand if our test suite is comprehensive. More importantly, these are also deterministic. This really adds to our ability to test the control system and understand how many of the scenarios we're covering. Then to actually have confidence that it will behave that way when we get there.

Pulling it Together

Putting this into another setting that may be common to you. Here's a standard household thermostat, a non-smart one. It has all these things. It has objectives. It says, please make the temperature be 62 degrees, in this case. There are some other additional constraints as well. For example, it's like, this system is in heat mode. Which means you cannot run the air conditioner right now. That's a constraint on what the thermostat is supposed to be able to do. It definitely has state. It knows what temperature it currently is in the house. There's a model which at the very least it says like, if it's too cold, run the furnace. Now actually thermostats actually are a lot smarter than that, because they understand things like, how often should I run the furnace so that I can get the most efficiency on my heating, and things like that. Then also it has actions. In this case, the thermostat is actually calling for heat and running the furnace. This is how we put all this stuff together, in maybe a setting that's more familiar to you.
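The thermostat can be written as exactly the kind of function described in the transparent control section: objectives and state go in, an action comes out, and nothing else is consulted. The sketch below is a toy under stated assumptions; the one-degree deadband and all type names are invented for illustration and are not how any real thermostat is specified.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    HEAT = "heat"
    COOL = "cool"
    OFF = "off"


class Action(Enum):
    RUN_FURNACE = "run_furnace"
    RUN_AC = "run_ac"
    IDLE = "idle"


@dataclass(frozen=True)
class Objectives:
    target_f: float
    mode: Mode               # a constraint: HEAT mode never runs the AC
    deadband_f: float = 1.0  # assumed tolerance around the target


@dataclass(frozen=True)
class State:
    temperature_f: float
    current: Action          # what the equipment is doing right now


def control(obj: Objectives, state: State) -> Action:
    """Pure function: same objectives and state always give the same action."""
    t = state.temperature_f
    if obj.mode is Mode.HEAT:
        if t <= obj.target_f - obj.deadband_f:
            return Action.RUN_FURNACE
        # Keep a running furnace on until we pass the top of the deadband,
        # so it does not switch on and off rapidly around the target.
        if state.current is Action.RUN_FURNACE and t < obj.target_f + obj.deadband_f:
            return Action.RUN_FURNACE
        return Action.IDLE
    if obj.mode is Mode.COOL:
        if t >= obj.target_f + obj.deadband_f:
            return Action.RUN_AC
        if state.current is Action.RUN_AC and t > obj.target_f - obj.deadband_f:
            return Action.RUN_AC
        return Action.IDLE
    return Action.IDLE
```

Because `control` keeps no memory of its own, a test suite can enumerate interesting `(Objectives, State)` pairs directly, including ones like "80 degrees in heat mode" where the correct action is to do nothing.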

Now you may say, "Jon, but what about microservices? We are not running a bunch of micro thermostats. How does this really work?" I'm going to propose a way of thinking about this that maybe pulls this all together. Here's our setting. We've got a bunch of services that have a dependency on a database. Maybe there are connection pools. Each of the service instances has a pool of database connections that are in there. It probably has a maximum size to it. They might be running in an autoscaling situation, whether we're scaling at the VM level, or we're doing something more at a container orchestration level. This is a pretty common setup. Again, the way that this usually works is like, we're pulling metrics from stuff, we're sticking them in a dashboard. We got a bunch of operators that are responsible for them, like changing the way these things are configured. The operators are saying, what's going on? What do I actually want to be going on, and so what should I do about it? The operators here are functioning as the control plane of the service. They are deciding how this service should be operating. I think a pretty straightforward next question is, what if we had a microservice control plane that was a piece of software? How would this work? This is where we pull all those principles together. It needs to be aware of a set of objectives. We'll get to what those might look like. It can still collect metrics. All these things have metrics endpoints, for generating logs, notifications, events. It can subscribe to those same things. It can build a model of the world. It can understand, what does that database look like? How many service instances do I have? Really, what's going on with this service? It can have a model. We talked about Little's Law as one model that we could use here for manipulating things. Then, it can actually configure the same levers that we've already built in.

An interesting thing here, remember we said if we've got that transparent control loop, it's stateless, which is really nice because it means that I can just schedule this as a single instance and I can ask my infrastructure control plane and say like, please have one of these running. Maybe it's a single replica task in a container. It will begin observing metrics. It can build up its state. Then it can begin configuring the system from where it is. We don't actually have to do anything special to make this recover other than making sure one of these is running. That's a nice property. Also, again, if we've done a good job with defining our objectives, the objectives are things that those human operators or other systems can also use. That's actually an important property. Actually, that paper that I talked about, even though there's tons of cool automated recovery stuff that's going on, they're very clear that it's the ingenuity of human operators that saves many missions. The point is, give those operators some place to stand. A big part of that is making sure that when you have control systems like this, that they actually make sense to the humans that ultimately have to interact with them. I think by thinking through things in this way, we get that property that comes out of it.

Let's talk a little bit about what these objectives might look like. We're used to talking about service level objectives, and we might have service level agreements, things of this sort. These are slightly different things, usually. They may express things like, the throughput that can be supported, number of concurrent requests that can be in flight, certain latency profiles, maybe p99 or p999 latencies, average latencies, error rates, things of this sort. These are things that we commonly use to describe these things. SLOs tend to be self-imposed. These are imposed by the operators of the service. These are our goals. This is what we think the service should be able to do. This is what it's designed to be able to do. That's what we're striving to actually operate towards. Service level agreements are either offered to you or dictated by clients, depending on who has better negotiating power there. The idea is that your SLOs have to be tougher or tighter than your SLAs or you're going to end up breaking your promises. The way that I think about it is SLOs are really about my capabilities, what am I capable of doing? Actually, this is where service level indicators come in: they are the current state, meaning what capabilities I currently have compared to what I'm designed to have on my best day. Service level agreements, in the setting that we're looking at here, are really what I think the best use for the objectives that we're talking about would be. This is a client saying like, "I would like this." You as a service would say like, "I got you. I can get that for you."

Let's put this together. The way that we described things previously, the service was just observing what was going on with the database just by interacting with it and seeing what was going on with it. In reality, when we have a microservices architecture, we own and construct many pieces of these things. Let's say I have service A, and it's calling service B. I control these things. I can put a control plane for both of them, and now they can cooperate. Control plane A, again like we said with those SLAs, can say like, here's actually what I need out of you, in terms of throughput, or error rate, or things of that sort. Another way of thinking about it is you can say, "I'd like to make this reservation." Maybe those things have a time to live as well. Like for the next 10 minutes, here's what I would like from you. Then control plane B needs to have awareness of its own capabilities. Remember, that was one of the checklists. Do I have a model of my own capabilities? It can respond back and say, yes, I can do that for you, or, no, I can't. If it realizes that it's degraded in some fashion, it can proactively advertise that fact. Maybe it puts that on a notification system of some sort, and can say like, "I normally can do requests X, Y, and Z. Z is currently broken, but I can still do X and Y." That's actually super useful for someone that's calling me because rather than waiting for me to give them an error, they can take their adaptive action right away. They can now adjust their own expectations about what to do if they do that. Then obviously we can have other services as well. There could be a service C that also calls B. It's going to be asking for its own things. Because I've got a single control plane for B, it's in a position to be able to reconcile all of these objectives together and figure out if it can meet them, and if it can't, how to properly prioritize between them. A lot of times those priorities are not necessarily made explicit anywhere. 
When we run into problems, it's like, what will the system do? It's like, let's find out, or we end up finding out when those things actually happen.
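
The reservation exchange described above can be sketched in a few lines of Python. Everything here is hypothetical: the `ControlPlane` class, its method names, the single "requests per second" capacity model, and the rule that a lower priority number wins are all assumptions chosen to keep the example small, not a description of any real system. The point is only that one explicit place holds B's capability model, its current state, the clients' objectives, and the priority decision.

```python
from dataclasses import dataclass


@dataclass
class Reservation:
    client: str
    throughput_rps: float
    priority: int        # assumption: lower number means more important
    expires_at: float    # absolute time in seconds (the time to live has been applied)


class ControlPlane:
    """Hypothetical control plane for service B."""

    def __init__(self, capacity_rps, operations):
        self.capacity_rps = capacity_rps       # model of B's own capability
        self.healthy = set(operations)         # operations currently working
        self.reservations = []

    def _live(self, now):
        # Drop reservations whose time to live has run out.
        self.reservations = [r for r in self.reservations if r.expires_at > now]
        return self.reservations

    def request(self, client, throughput_rps, priority, ttl_s, now):
        """Answer "yes, I can do that for you" (True) or "no, I can't" (False)."""
        committed = sum(r.throughput_rps for r in self._live(now))
        if committed + throughput_rps > self.capacity_rps:
            return False
        self.reservations.append(
            Reservation(client, throughput_rps, priority, now + ttl_s))
        return True

    def mark_broken(self, operation):
        self.healthy.discard(operation)

    def advertise(self):
        """What B can still do, for callers to act on before they hit errors."""
        return sorted(self.healthy)

    def degrade(self, new_capacity_rps, now):
        """Shrink capacity; keep reservations in priority order, shed the rest."""
        self.capacity_rps = new_capacity_rps
        kept, shed, used = [], [], 0.0
        for r in sorted(self._live(now), key=lambda r: r.priority):
            if used + r.throughput_rps <= new_capacity_rps:
                kept.append(r)
                used += r.throughput_rps
            else:
                shed.append(r.client)
        self.reservations = kept
        return shed    # clients that should be notified to adapt
```

For example, if A reserves 60 rps and C reserves 30 rps out of 100, a later drop to 70 rps sheds C's lower-priority reservation and keeps A's, and that choice is written down in one place rather than discovered during an incident.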

As a good example of this, we talked about an autoscaling group. That's something that's adaptive. It's going to add and remove service instances according to the metrics and what we've set it up to do. You can also have adaptive connection pools. I've seen implementations where the size of the connection pool may grow and shrink, depending on how many requests you're making and how the database is responding. Here, the question is, ok, now my database has higher latency, what should we do? Should we add instances? That may be something the autoscaling group might try to do. Should we adjust connection pool sizes? That might be something the adaptive connection pools are trying to do. Should we adjust thread pool sizes in the service instances themselves? I don't even know if there's something that's trying to adapt that. Maybe we should do all of those things, or maybe we should actually do none of them. It all depends on what the objectives are. The reality is, we often build a bunch of these adaptive mechanisms but there's nothing that pulls them together. What will actually happen? A lot of times, we don't know. When we have incidents, it's usually not the simple incidents. The simple ones, we've got a lot of those covered. If I lose an instance of a stateless service because it just dies, we know how to start a new one. The incidents we do have are all interesting ones. It's because, like, these four things happened at the same time and now they interact in a way I totally didn't expect. Now I've got something going on. We have a tendency to build lots of mechanisms that operate independently, and we don't have a single place where we can pull them together to get a coherent response. That's the type of thing that this control plane could offer us.


A little thought exercise for you. Think about building transparency into these four things: objectives, models of our own service and the things that we depend on, knowledge of their current states, and an explicit place where we're making control decisions about these things. Then I'm going to argue that we can build, with actually, I think, not that much effort, control planes that are robust, coherent, and composable (we saw how we put them together for multiple services), but also understandable. These are also things that human operators can make use of. If we want to do MLOps, we've built all of the framework to provide the data, and the control levers, and a place for those MLOps to run, in a way that they can also be composed. If we wanted to start to do machine learning for operations, we could do that too, if those are the types of models that we want to build. I think we don't have to go all the way to complicated ML models in order to get something useful out of this. With a little bit of effort, we can get a lot more visibility and control and understanding of what our systems are going to do when the time comes. I really think this understandable bit is really important. My uncle is a retired professor of anthropology. One time he said to me, the main thing holding back human society is that we each only have 1500 cubic centimeters of brain. We are building complex systems that do not fit in any one person's head. We are hardware limited as a species. When we think about designing things, even adaptive things in software, making sure that they're still understandable and that a human can still interact with them and get them to do the things that they want is, I think, a really important property.

Dealing with Explosion of State Variables, as They Scale Up

Moore: We have potentially an explosion of state variables, and how do we basically deal with that, as those things scale up?

There is a principle in control theory. I'm not a control theorist, so I may be paraphrasing this improperly, but it says that your control system has to exhibit at least as much complexity as the system it's trying to control. To some degree, the more complex the systems we're building, I think there is no escape from that, unless we're willing to sacrifice the fidelity of our models. As we talked about, there's that spectrum of how detailed the models are. One way is we can say, maybe this is a place where we need to simplify so that we can constrain and understand the implementation, at the expense of maybe this super optimal thing that we might have done. Maybe we can get to things that are good enough. I think the other way, the way that we often think about things in computer science, is with encapsulation. The idea is that if we think about a subsystem that has these types of properties, and its own transparent control plane, to some degree that hides a bunch of operational detail from the things that are calling it. The idea is to build, to some degree, a tree of these control plane systems. Languages like Erlang and actor model systems have this concept of a supervision tree, and these sorts of relationships. That's, I think, a fundamental reason why (was it Ericsson?) all these telecom switches were built using Erlang. I'm not saying, go rewrite all your stuff in Erlang, at all. I think there's a reason why this particular idea of nested, tree-structured control systems has a lot of power. We've seen where that's been helpful in the past.

The Overcorrection/Oscillation Problem

Moore: What about overcorrection or oscillation? That's another problem here.

Control theory to the rescue here. For example, there's a very classic case of control systems called PID controllers. They can do all kinds of crazy stuff. They're really good at balancing stuff, for example, with motors. They're controlling motors. Think about like a tray with a ball rolling around on it. They're pretty simple logic. You can knock on the tray, and it can totally recover. PID controllers have three different aspects to them. One is like, how big of a problem do I have? How far away from my target am I currently? If I'm pretty far away, I probably want to do something like a little more severe here to get closer to where I'm trying to go, faster. Another one is like, how is that gap closing? If I'm approaching my target quickly, I maybe want to slow down on things so that I don't overshoot, to your point of overcorrecting. Then there's another one, it's the idea of like, how do I prevent things from just oscillating in a stable pattern above things? Because ideally, I'd like to get to a point where that eventually damps over time, and I get to a stable state. Control theory does have, I think, things that can be helpful for us. Maybe even basic PID controllers for which there are open source implementations of those things. That may be a useful thing for us to look at. This is where timeliness of state comes into play, which is, when I take a control action, how long will it take for the results to show up? Am I waiting that long before I'm deciding to do something different? Some of that stuff has to be built in when we're making adjustments.
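
For readers who have not met one, here is a minimal sketch of a textbook PID controller in Python. In the usual terminology the three aspects are the proportional term (how far from the target am I now), the derivative term (how fast is the gap changing), and the integral term (how much error has accumulated over time, which removes a persistent offset). The gains, the setpoint, and the toy "plant" in `simulate` are assumptions picked for illustration; a real system would need tuning and a realistic model of its own delays.

```python
class PID:
    """Textbook PID controller; gains and setpoint are illustrative assumptions."""

    def __init__(self, kp, ki, kd, setpoint):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.setpoint = setpoint
        self._integral = 0.0
        self._prev_error = None

    def update(self, measurement, dt):
        """Return the control output for one measurement taken dt seconds after the last."""
        error = self.setpoint - measurement               # proportional: size of the gap
        self._integral += error * dt                      # integral: accumulated gap
        if self._prev_error is None:
            derivative = 0.0                              # no history on the first call
        else:
            derivative = (error - self._prev_error) / dt  # derivative: how fast the gap changes
        self._prev_error = error
        return self.kp * error + self.ki * self._integral + self.kd * derivative


def simulate(pid, start, steps, dt=1.0):
    """Toy plant (an assumption): the control output is the rate of change of the value."""
    value = start
    for _ in range(steps):
        value += pid.update(value, dt) * dt
    return value


controller = PID(kp=0.5, ki=0.05, kd=0.1, setpoint=100.0)
final_value = simulate(controller, start=0.0, steps=200)
```

With these particular gains the simulated value settles on the setpoint instead of oscillating around it. The `dt` argument is where timeliness of state shows up: the controller should be stepped no faster than the effect of the previous action can be observed.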

Balancing Robustness and Quick Iteration, in the Control Plane

Moore: If we're looking at the core service implementation, and this is something that we're changing rapidly, how do I balance that with the need to then also adjust its control plane as a result?

As a very simple example, let's say I add a new dependency to my service. That necessitates adding a model of that dependency to the control plane. If you want to do something like this, you are signing up to maintain the control plane while you're modifying the data plane. I think where these transparency principles come into play is that they lead to control planes that are easier to understand and maintain. I think the idea is that if we're not being explicit about the control plane, then we're being implicit about it. I think some of the question is, would I rather be explicit and do the work to actually think about what I want to have happen in certain scenarios? That's work. Or do I want to just say, I'm not going to worry about it, I'm just going to have the ops team figure it out when an incident happens? Don't get me wrong, that may be the right thing to do. This is all highly context dependent: how bad is an outage? How much effort is it, compared to how likely something is to happen? That is something that we have to weigh when it comes to, do we even want to pay the expense of running an extra container instance to be this control plane, for example? That may be material: maybe it is, maybe it isn't. That's going to be highly dependent on your business, I think.

Standard Way of Modeling Microservices

Participant: Do you have a standard vocabulary or factor, like tooling in the language for modeling microservices, or is it just ad hoc as you go along?

Moore: I'm not aware of a standard way of modeling this. Part of my idea here with giving this talk is maybe to inspire us to start thinking about this as an industry. The paper introduces some terminology. I think Little's Law, for example, is a very useful way of thinking about things. We see that we have common ways and terminology of talking about pieces of these systems. We have SLOs and SLAs; those are for objectives. We do pieces of this in different places. We had lots of examples where it's like, this software system doesn't have the transparent objectives, but this one did. We do it sometimes. What I hope that you take away from this is just some additional awareness, and you have the opportunity to look for these things. I think this is not an all or nothing gig, either. I do think that you can get value if all you do is make some of your objectives a little more explicit, so that your answer to "what should I do in this scenario?" is not just "it depends on what the client wants". Have the client tell you what they want. Even incremental investments here can bring value. That's the benefit of principles like this: they're just guidelines. It's not prescriptive in that sense. It's just a way of thinking about the system.
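
As a small worked example of the Little's Law mention above: for a stable system, the long-run average number of requests in flight equals the average arrival rate multiplied by the average time each request spends in the system (L = λW). The Python sketch below uses hypothetical function names and made-up numbers; it is one simple way such a relationship could feed a capability model, not something prescribed in the talk.

```python
def in_flight_requests(arrival_rate_rps: float, mean_latency_s: float) -> float:
    """Little's Law, L = lambda * W: average number of requests in the system."""
    return arrival_rate_rps * mean_latency_s


def max_sustainable_rate(concurrency_limit: float, mean_latency_s: float) -> float:
    """The same law rearranged: the arrival rate a fixed concurrency limit can carry."""
    return concurrency_limit / mean_latency_s


# Made-up numbers: 200 requests per second at 250 ms average latency.
needed_slots = in_flight_requests(200, 0.25)

# If latency doubles to 500 ms, a pool of 50 slots carries half the traffic.
degraded_rate = max_sustainable_rate(50, 0.5)
```

Here `needed_slots` works out to 50 concurrent requests, and `degraded_rate` to 100 requests per second, which is the kind of number a control plane could advertise to its callers when a dependency slows down.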




Recorded at:

Feb 24, 2024