
Control Theory in Container Fleet Management



Vallery Lancey covers basic principles of observing systems, controller design, and PID controllers. In particular, she dives into container scaling controllers, using both first principles and proven designs from Kubernetes and Mesos.


Vallery Lancey is currently the Lead DevOps Engineer at Checkfront, working on infrastructure automation and reliability.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.


We're all here because we were promised that containers would do something for us. And obviously, we're at least a little bit optimistic, but everything hasn't magically been solved yet. So I want to talk about container orchestration, which is kind of the big promise of containers. Containers are a box that we shove stuff in and then we do stuff with that box. So a lot of what's cool about containers is the nature of having shoved the thing in the box. I want to talk about logistics of how do we take that box and then move it around?

So container orchestration encompasses a couple of different things. We want to create things whenever we want. We want to be able to manage the life cycles. We have a box. We want to move that box around and do stuff with it. So we want reproducible systems. We want to be able to cohabitate our boxes and we want to be able to automatically manage those. So traditionally, what we've had is we have a complex system and we have a whole team of people who are always looking at really nasty dashboards for that system, looking at the logs and trying to figure out what's going on and then going in and making hand changes whenever something goes wrong.

This really sucks for a lot of reasons. It doesn't scale well. People aren't fast. People make mistakes. And I don't know about you, I've been woken up at like 5 in the morning and I don't want that to keep happening. With an automatic system, we have a system that tracks its own state, makes codified judgments about that state, and then translates that into internal actions. So those internal actions are things like allocating appropriate resources. We want to bind a video card to our deployment that's doing machine learning work. We want to allocate the right CPU usage for a given web server. We want to manage our network around this. So instead of manually discovering things, we just send traffic to everything that's going to receive traffic. We want to reap unhealthy instances, rollout changes and scale up and down on demand, all without having to worry about doing any of this. We just want it to happen.

Control Theory

So how are we going to make it happen? The answer is something called control theory in engineering. Now Brian gracefully read a little bit of the Wikipedia entry. Who here is familiar with control theory? A couple of hands. It's actually less than I expected, but all good. So control theory is a branch of engineering specifically around managing systems. So I believe it was invented specifically in the industrial revolution with being able to control complex physical devices. It comes up a lot in factory design, aircraft, electrical engineering. It's around being able to take some kind of an input, translate that into a more complicated output. So instead of trying to manage every single part of a system, we now have an abstraction for what we want to tell it.

This is kind of the simplest controller you can get. We take our input, we put it into the controller, the controller translates it in some way to an output that gets applied to a process. A nice little example of this is a clothes dryer. With a traditional clothes dryer, you just say I want it to run for 40 minutes. It runs, and then theoretically, the clothes are dry. We take the time and translate it into just how long the power is on. However, there's a problem with this, which is that 40 minutes isn't always enough. A lot of the time my clothes are still wet. I want to tell it "dry the clothes," but I can't, because it doesn't know whether they're dry or not.

To solve a problem like this, we have closed-loop controllers. Closed-loop controllers introduce a concept called feedback. Feedback is some kind of readout from the process that gets integrated into the controller, so the controller knows how it's actually performing. This introduces the concept of error, which is defined as the set point (what we're telling the system to do) minus the process variable (what the system actually is). The error lets us know how we need to course-correct what the controller is doing in order to achieve the desired outcome. Positive error means you're under the set point; negative error means you're over the set point. Any magnitude of error that's not basically zero isn't good, and the controller changes what it's doing to compensate for that, instead of the user blindly fiddling with the set point to try to do the same thing.
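The sign convention she describes can be pinned down in a couple of lines (Python here purely for illustration):

```python
def error(setpoint, process_variable):
    """Control error: positive when the system is under the set point,
    negative when it has overshot it."""
    return setpoint - process_variable

# Target 10 units of something, only 7 observed: positive error, push up.
assert error(10, 7) == 3
# 12 observed against a target of 10: negative error, pull back.
assert error(10, 12) == -2
```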

Control theory, as Brian kind of indicated, gets very complicated. There's a lot of math in it, and most of that math, unfortunately, only applies to problems called linear problems, where there's a linear relationship between your input and your output. Unfortunately, almost nothing is linear, because the real world is a little bit more complicated than mathematicians would like. That includes a lot of our digital systems, because there's often not a direct correlation between a change we make and what actually happens.

Applying Control Theory to Containers

So this is all well and good, but how do we apply it specifically to containers? This is kind of the archetypical explanation of what a declarative system does in code. So any design spec on Kubernetes will have this plastered in it sooner or later. We get the current state from something, we get the desired state from something, and then we use some kind of function to mash them together. And this is a closed loop controller. I've mislabeled those. The current state is our process variable and the desired state is our set point.
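That archetypal declarative loop can be sketched roughly like this. This is a toy Python rendering, not Kubernetes' actual code, and the nudge-by-one reconcile function is invented for the demo:

```python
def reconcile_loop(get_current_state, get_desired_state, make_changes, steps):
    """One controller tick per step: observe, compare, act.  The desired
    state is the set point; the observed current state is the process
    variable."""
    for _ in range(steps):
        make_changes(get_current_state(), get_desired_state())

# Toy process: a replica count that make_changes nudges toward desired.
state = {"replicas": 0}

def nudge(current, desired):
    if current < desired:
        state["replicas"] = current + 1   # "create a container"
    elif current > desired:
        state["replicas"] = current - 1   # "kill a container"

reconcile_loop(lambda: state["replicas"], lambda: 3, nudge, steps=5)
assert state["replicas"] == 3  # converged on the desired state
```

Real reconcile loops run forever on a timer or on watch events; the fixed step count here just keeps the sketch finite.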

So a fairly simple example of that, that I've built way more times than I'd like to admit, is a readiness probe. So when we launch a container, there's a time gap between when we've made it and when it's actually ready to serve traffic. This encompasses a lot of stuff like pulling the image, booting a service, potentially syncing data, all that fun stuff. So the readiness probe knows when the application is actually functioning and it's ready to receive traffic. Typically, we look at like HTTP 200 for this, although it's customizable and should be customized.

So in order to build this, we need to look at what the config should be for the probe and we want to actually get our responses from that. It varies platform by platform, but here's kind of a generic look at what the life cycle of creating a container is. Often, there are pause or stop states that turn this into spaghetti, because almost anything can pause or stop and then go back to where it was.

So in general, we'd like to go from a nice line of schedule to creating, to running, to ready. And in particular, we're going to pay attention to the running to ready transition, because once it's booted, that's when we want to pay attention to the health check. So this is roughly what it looks like. Again, nice simple controller. We put in our input, what we want it to do, and it can do two things. It can choose to kill a container and it can choose to update the status in the cluster of what part of the lifecycle the container is at. And it gets one piece of feedback, which is the response the container gives. Now, it's a little bit nitpicky, but there's a problem with this. I don't know if anyone can see the highlight. I can't up here. So we're getting one piece of information, which is the HTTP response. Unfortunately, this isn't quite enough because, recall, we have a whack of lifecycle stages that is often more complicated than this. So what happens if we start the probe when we create the container?

Now, in a happy world, container creation is really quick and then we're waiting on the application. But we don't live in a happy world. We live in a world where DNS and so on exist. So sometimes it can take a very long time to actually get that container running. What if the grace period to boot is, say, 20 seconds? For something like plain nginx, that's more than reasonable. But what if the container isn't actually properly started by that point in time? What if it's in a paused state? What the readiness probe might do, if it's blindly operating on "fire off a request and wait for the grace period," is kill the container when it shouldn't. So we want to add something new: we specifically get the status information as well as the HTTP information, so we can judge, based on the container's status, how we should be acting. And this is a problem that I have seen go wrong in the wild, by the way, with custom-built controllers.
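A sketch of the fix she describes: a readiness controller that looks at the lifecycle status alongside the HTTP response, so boot delays don't get charged against the application. The status names, actions, and thresholds here are made up for illustration:

```python
def probe_decision(container_status, http_ok, seconds_in_status, grace_period):
    """Decide what a readiness controller should do, taking the lifecycle
    status into account rather than blindly timing out on HTTP alone."""
    if container_status in ("creating", "paused"):
        return "wait"            # boot delays aren't the application's fault
    if container_status == "running":
        if http_ok:
            return "mark_ready"  # update the status in the cluster
        if seconds_in_status > grace_period:
            return "kill"        # the app itself failed to come up in time
        return "wait"
    return "wait"

# A naive probe that only counted wall-clock time would kill this container:
assert probe_decision("paused", False, 60, 20) == "wait"
assert probe_decision("running", True, 5, 20) == "mark_ready"
assert probe_decision("running", False, 30, 20) == "kill"
```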

So another quick example is replica headcount. It's a big thing with containers. We say, "I want five. I want there to just be five regardless of what happens." So we need to make sure that we keep track of the fact that we want five. We want to look at what the system is doing. And then we need the ability to fire off commands to something saying, "Kill a container or create a container."
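A replica-headcount controller reduces to computing the difference between desired and observed counts and firing the corresponding commands. A minimal sketch, with illustrative command names:

```python
def replica_actions(desired, observed_healthy):
    """Return the commands a replica controller would fire to converge
    the observed headcount on the desired headcount."""
    diff = desired - observed_healthy
    if diff > 0:
        return ["create"] * diff      # under target: create containers
    return ["kill"] * (-diff)         # at or over target: kill the excess

assert replica_actions(5, 3) == ["create", "create"]
assert replica_actions(5, 6) == ["kill"]
assert replica_actions(5, 5) == []    # nothing to do at steady state
```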

So there are two ways you can go about creating controllers, rather like creating anything in software. On the left, you see the naive approach. We love shoving everything into one thing, be it a class or a service or a function. But this gets really bad, because we can't tell what it's doing. It's just one big box. Nothing is transparent. You can't test it. Typically, when you have a complex system, and factory design has a lot to say about this, you want basically a graph of controllers, often forming a hierarchy (though not necessarily), where each one has distinct responsibilities. It controls a minimal subset of things. It has exactly and only the feedback that it needs.

So a lot of the time, when you have complex, interrelated things, controllers will defer to one another. For example, in a lot of cloud platform design (Kubernetes does this quite heavily), there aren't a million different systems that all interact with containers directly. A lot of things just change the deployment or the replica set, and the change cascades down accordingly.


This is basically where I'm going to spend the rest of my time now: autoscaling. Personally, I think it's the most interesting use case. And it's something very widely applicable to end consumers who are customizing the platform for their app, rather than someone who's building container orchestration from scratch. When we're autoscaling something, we want to take a specific metric and then try to roughly optimize our number of instances until we match that metric. So we're doing a little bit of backwards math: we want to increase or decrease the replica count to get that metric met. Where autoscaling gets demanding is that it's easy to make something auto-scale, but how do you make something auto-scale right, and quickly? This is why autoscaling is often heavily customized, such as with a custom Kubernetes controller: the way a default is built might not meet the specific behavior and the specific expectations of your application.

So this is typically done with something called a bang-bang controller. A good example of that (and because this is San Francisco, I have to prefix it) is a traditional thermostat. You basically set a temperature that you want. Once the temperature falls too far below it, the heat turns on, the temperature rises a bit, and then the heat turns off. So instead of having a precise point where it's trying to be exactly 20 degrees, or 74, I think, for you all, it has a bound of what's good enough. The reason it has that bound is that you can't just bounce wildly around the exact point. You're never going to hit it exactly, and it would be both annoying and potentially very inefficient to be turning on and off every 20 seconds.

Our load is like that too, because we realistically can't hit the exact load. We don't totally know what's going on. We don't have a linear relationship between the replicas we have and the load on the system. And depending on your exact replica count, it might not be possible to scale the system to be at exactly 50% network saturation or something. It can be a bit tricky to visualize sometimes, so I've tried to put it in terms of a bar here. We basically have the process variable filling up. If it's below a certain point, the system turns on. When it gets above a certain point, it turns off, based on a buffer around the set point. I'm not going to get into obnoxious symbols, so I've just drawn it out straightforwardly here. We define an error band around that buffer. If the error is positive enough, meaning the process variable is too far under the set point, it's on. If it's negative enough, meaning we're too far over the set point, we turn the system off.
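The bang-bang behavior with a buffer (hysteresis) she's drawing can be written out directly; a minimal sketch:

```python
def bang_bang(process_variable, setpoint, buffer, currently_on):
    """Thermostat-style controller: only switch state when the error
    leaves the buffer around the set point, to avoid rapid flapping."""
    error = setpoint - process_variable
    if error > buffer:
        return True       # too far under the set point: turn on
    if error < -buffer:
        return False      # too far over the set point: turn off
    return currently_on   # inside the buffer: leave it alone

# Heating toward a set point of 20 with a buffer of 2:
assert bang_bang(17, 20, 2, currently_on=False) is True   # error +3, turn on
assert bang_bang(23, 20, 2, currently_on=True) is False   # error -3, turn off
assert bang_bang(19, 20, 2, currently_on=True) is True    # in the band: no change
```

The `currently_on` pass-through is what makes it hysteretic: inside the band, the controller simply keeps doing whatever it was doing.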

So that's enough, right? Not quite. "Close enough" is a problem in autoscaling that's semi-unique here, because anything in terms of saturation is a function of how many instances we have. For example, if you only have two replicas of something (although, don't do that), each replica is a huge unit of your capacity. Add one more and your app suddenly sits at, say, 66% saturation, which may not be anywhere near the happy point you were aiming for. Because of that, you need to make the controller a little bit elastic in what the bound represents, because if you have a strict bound and you're bouncing between only a few resources, you can wind up in a situation where it's not possible to get within that, say, plus or minus 5%, even if technically you have a satisfactory outcome.

Delayed response is a tricky one when scaling. It's the worst with databases, but almost any system can be impacted by it. Containers, fortunately, are nice and fast. It's faster than trying to provision bare metal or something. It's faster than most VMs. But they still take time to boot up. Best case scenario, this is like two seconds. It's frequently 30 or 40, and it can get really long if there's a problem or if you're dealing with some kind of syncing or coordination. So this impacts our ability to know what we actually need, because if we look at the system and say, "This needs three more containers right now," unfortunately, we wait, we wait, we wait. Suddenly, we're higher up on that load curve by the time they actually come up, and now we're still badly behind.

Who here has heard of something called the MIT beer game? It's much less exciting than you would think for something that references beer and a school. It's a control-theory-based game around supply chain management. It's meant to be more about economics, but it is very relevant for any kind of distributed system where there's an A to B to C pipeline. The way the game works is that you have some kind of manufacturing supply chain and a bunch of players who are basically placing orders, going from customer back to raw materials and then making fulfillments in the other direction, with delays in between that represent shipping and actually doing the work and so on.

And what this game exposes is something called the bullwhip effect: oscillations start on the demand side and get rapidly worse as they travel through the system, under most ways you'd naturally run things. So you have to specifically work against the bullwhip effect, because the intuitive ways you predict demand or stockpile things encourage it. At a rough level (there are, I think, whole theses on this subject), it works like this: everything tries to overcompensate a bit, because you want to have at least enough capacity, if not more, for your predicted demand.

So each stage overcompensates more and more, and you get a kind of thrashing effect throughout the system. This can happen if we're trying to bring up a whole bunch of stuff without necessarily knowing what's going to happen when it gets released. We can under-scale and twiddle our thumbs while something comes up and it's not enough, or we can over-scale to try to get ahead of what's going on, at the cost of suddenly having way too many instances. Now, in a best case scenario, this doesn't matter a lot. Sure, we brought in a bit of extra compute power, whatever. But if we're dealing with anything particularly slow or a heavily integrated system (again, databases in classic fashion keep being the point of pain here), it becomes very expensive or risky to start bringing up all this excess stuff for no reason whatsoever.

To deal with this, we have to account for the delay. We basically want to mathematically model what the delay should be, to try to predict what the system is going to do when the capacity actually comes up, so we know how long it's going to take, instead of the controller assuming it's going to get those containers right away. If we have no context, and this is where not modifying a platform hurts you, the platform just has to guess. It could use statistical analysis, but that's risky, because if you try to analyze anything without understanding it, you don't know what's normal and what's skewed information. You can try to just wait out the grace period, waiting until it boots up eventually, or you can define a threshold of the grace period: say, once you hit 50% of the timeout for the readiness probe, just start booting up more, because you may as well.
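The last heuristic she mentions, kicking off more capacity once a pending container has burned some fraction of its readiness timeout, is easy to sketch. The 50% threshold is just her example, not a recommended default:

```python
def should_preempt_scale(seconds_waiting, readiness_timeout, threshold=0.5):
    """Heuristic: once a pending container has used up some fraction of
    its readiness timeout, start booting more capacity rather than
    waiting out the whole grace period."""
    return seconds_waiting >= threshold * readiness_timeout

assert not should_preempt_scale(10, 40)  # 25% through the timeout: keep waiting
assert should_preempt_scale(20, 40)      # hit 50%: may as well start more
```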

If you customize your controller, you have a lot more options, because now you can actually build logic around what that given service does into how it's scaled. You can introduce that statistical expectation of, "How long does MySQL take to sync? How long does nginx take to boot? How long does this service take to run its inventory?" And you can customize a readiness probe so that the controller understands how close something is to ready. So instead of the blind process of "I've started a thing, I'm waiting, it's going to get there eventually," the controller can know the difference between something that's, say, still a minute away and something that's five seconds away from going online.

Matching demand is a huge topic. I had a really interesting talk with someone recently who works at a notifications company. In their particular use case, they will just be chugging along and then suddenly, by multiple orders of magnitude, their load increases. And they basically can't auto-scale with anything remotely out of the box because it's just too slow. They go in seconds to needing 10, 100 times the resources they did. Luckily, most of us live in kind of like the "cat gif" side of things where our curves are more organic and based on humans, so we don't have that steep of a curve. But scaling up is still hugely important because we don't know when the demand is going to hit and we have to meet our user expectations when it does.

So scaling up is basically a guess of, look at the load. Look at the replicas. Try to guess how many replicas will satisfy that load. As I mentioned, there is the delay aspect of that. We're not necessarily going to have the same load by the time the resources are there and things aren't always linear. CPU use is a good example. Measuring based off CPU use is one of the poorer metrics you can pick because that's not always directly correlated. So if you have any kind of data syncing or processing or intercommunication overhead, that will rise as you bring up a larger network of resources even if you're not raising the consumer load on it.
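That guess is essentially proportional scaling: assume load divides evenly across replicas, then scale the count by the ratio of observed load to target load. This is also roughly the rule the Kubernetes horizontal pod autoscaler documents; the sketch below is from first principles, not its source code:

```python
import math

def desired_replicas(current_replicas, current_metric, target_metric):
    """Proportional scaling guess: assume the metric divides evenly across
    replicas, so scale the count by observed / target."""
    return max(1, math.ceil(current_replicas * current_metric / target_metric))

# 4 replicas each around 90% CPU against a 60% target: guess 6 replicas.
assert desired_replicas(4, 90, 60) == 6
# Load halves, so scale back down.
assert desired_replicas(6, 30, 60) == 3
```

As she notes, this assumes the metric actually scales linearly with replica count, which CPU use on chatty or syncing workloads often does not.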

So in order to look at how we keep things actually fast and precise, I'm going to go to the classic kind of container, which is to say, ships. Ships have a helmsman who deals with steering. And once upon a time, this was a person physically turning a wheel around. There was an engineer, whose name I don't recall, who was trying to study how to build a better steering system than someone yanking on the wheel. They observed what the helmsmen were doing, and they noticed three interesting behaviors.

The first one is the obvious behavior which is that the helmsman proportionately turns his helm according to what the discrepancy is between the ship's heading and the desired heading. It's how we'd correct anything. But there were two other things. The helmsmen would steer harder and sharper if there was a long-running discrepancy. So if the turn was going slow or if there was a large discrepancy so the turn needed to be quite large. And helmsmen would also predictively modify turns based on things like the current or how difficult past turns were. So it wasn't all reactive. There was also that predictive component.

So this leads to something called the PID controller, named after its three components. The first component is the obvious proportional one; a controller with only that component is just called a P controller. You take the linear measurement of the difference between your set point, where you want to be, and the process variable, where you actually are. And that's a nice flat line, because in load balancing, your expectations of what you want the load to be are normally flat.

There's also the integral component, which is the compensation part. Just going after the current error isn't always sufficient, because sometimes that's not working fast enough. Say you have cruise control on in a car: you don't manually floor it when you start to go up a hill. The cruise control, which is a PID controller, or at least a PI controller, quickly sees that it's not being as effective: it's not reaching the right speed with the same power, so it increases the power to match. That's the integral component. If the system is failing to properly close the gap, it quickly starts putting more and more power toward it. It's best represented by the accumulated area under the error graph.

The derivative component is often minimized, but it's sometimes the most interesting one in systems. So it's the predictor of future error based on the slope of the value. So if you see that things are rising very, very sharply, you'll get a large derivative component. Whereas if things are dropping down as shown here, their derivative component will slow things down a little bit because it goes, "Okay, hey. Yes, we've got a big error but it's sorting itself out. You don't need to completely panic over this."
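Putting the three components together gives the textbook discrete PID update; a minimal sketch, with placeholder gains that any real system would need to tune:

```python
class PIDController:
    """Textbook discrete PID: output = Kp*e + Ki*sum(e*dt) + Kd*de/dt."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0       # accumulated area under the error curve
        self.prev_error = None

    def update(self, setpoint, process_variable, dt):
        error = setpoint - process_variable
        self.integral += error * dt                   # long-running discrepancy
        derivative = 0.0 if self.prev_error is None else (
            (error - self.prev_error) / dt)           # trend of the error
        self.prev_error = error
        return (self.kp * error
                + self.ki * self.integral
                + self.kd * derivative)

pid = PIDController(kp=1.0, ki=0.1, kd=0.05)
out1 = pid.update(setpoint=10, process_variable=7, dt=1.0)  # error 3
out2 = pid.update(setpoint=10, process_variable=8, dt=1.0)  # error shrinking
```

Note the second output is smaller than the first even though the integral has grown: the negative derivative term "sees" the error falling and eases off, exactly the calming effect described above.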

PID controllers are tuned on each of these three inputs, because it really varies based on what you're trying to do: what the reality of the system is, what kind of instruments you're using, and even just what's going on right now. There was a period in time when helmsmen had physical controls for tuning those on the fly, if they felt a particular component wasn't quite behaving the way it should. The derivative component is often minimized just because it's potentially the most volatile. As anyone who's dealt a lot with noisy data knows, the derivative of something can swing drastically on a tiny bit of noise. If you have a wonky sensor, or a tiny spike in the data, for a brief time that gives a very drastic result that doesn't necessarily reflect reality. And there's only so much you can solve that with smoothing. So usually, controllers are mostly driven by the proportional and integral components.

Variations of this are quite common. It's called a PI controller when you have just the proportional and integral components. PI controllers are extremely common: a lot of autoscaling, and a lot of physical systems, mostly ignore the derivative component. It's nice for some flavor of how much we're responding, but we don't trust it too much. We mostly want to focus on the proportional response, and on modifying that response according to how things are going.

In Kubernetes, I was actually surprised to learn recently that it's entirely proportional controllers. Frankly, I think it speaks to how well it's built that I had never noticed. It does have a lot of particular checks and balances and bounds to do fancy stuff, but it doesn't properly respond in an integral manner. This is a deliberate design decision: Kubernetes accepts a bit of what I believe is called steady-state error. It tries to give a good ramp-up and meet expectations, but very deliberately tries not to overshoot. Typically, if you have higher expectations and really want that fast response, you're going to overshoot: you'll have too many replicas and then a little bit of balancing out. Basically, everything we do with a PID controller is trying to minimize that oscillation as much as possible.

And as I kind of talked about at the very beginning, with not wanting to build a controller too big: again in Kubernetes' design, we see that stuff like scaling is all just updating deployments and replica sets, versus trying to be manually hands-on with containers. So if I have a horizontal pod autoscaler, it's basically a thing that just makes decisions and then tells my deployment what that decision is. "Hey, we need six replicas now. Hey, we need eight replicas now."

So to sum up what's actually applicable: for a start, so much of what we're doing in software, as always, isn't brand new. Engineers have been working with this stuff for hundreds of years, and I keep encountering people who think they've invented the concept of a closed loop. We want to ensure that any controller has only and exactly the feedback it needs to do its job. If there's a situation where the feedback isn't sufficient, or it's trying to judge based off feedback that doesn't help what it's actually trying to do, then we've mis-designed it. It can be good to think about what a human would do, because humans apply so many more pieces of judgment, like, "Oh, well, if this is going on, then we want to check this thing."

Playbooks can help with designing controllers, because we kind of codify what all the steps should be. We want to turn that judgment into code. We want to strictly define those expectations so the controller is doing one thing, and doing it really well. That way we know what to build, we can test it, and we can see exactly what's going on with it. And whenever we have any kind of shared state, it's critical that we have a consistent view of it: so a CP data store, or we're looking at one physical thing.

Who here has heard of or seen etcd split-brain problems before? Yes, Brian has probably seen some stuff. A split-brain problem happens when your data store splits into two partitions that have conflicting views of what's going on. When you have controllers looking at those, one controller might see one particular data set and go, "Okay, well, I need to turn my replica count down." Another gets a different response from the data store and says, "Okay, I need to turn my replica count up. The load is really high compared to how many pods I have." Unfortunately, this turns into chaos in the system, because nothing is really agreeing with anything else. There was a good example from Andrew Spyker earlier, around resources getting deleted because the controller wasn't necessarily aware of what was going on at any given query.

And lastly, custom controllers are going to be really common. This is kind of interesting, if hypothetical; I'm working on a project right now that's basically building a federation MVP for fun. But this isn't something where only the Titus people and the Borg people and the Kubernetes people are building controllers. As end users, we have specific expectations for what our application does, and the application's behavior is going to be distinct from the one-size-fits-all model. So definitely expect, if you're running an application with high expectations, that you're going to write some of your own controllers.

So I never introduced myself. I'm Vallery Lancey. I work on a lot of the software side of systems. Officially, I lead a systems team right now. I deal a lot with Kubernetes, cloud stuff, distributed systems, all of that kind of fun tech that's nice and shiny, although it's much more painful when you actually work with it for too long. I'm also transgender. I'd like to be open about that fact, because it's something that closes a lot of doors for a lot of people, and I like to do the small part I can to make that more normal and to open that door for other people like myself.

I'd really like to thank everyone who actually made the event happen: Brian, all the event organizers, Joe, for actually inviting me, and Tim, for helping out with some of the logistics. He answered some of my questions about the deep internals of Kubernetes that, as an end user who doesn't actually build Kubernetes itself, I don't get to see. Thanks, everyone. I'm good to take questions now.

Questions and Answers

Moderator: So real quick, I did say womp earlier and only because I actually didn't know what control theory was. I'm so happy that I got to see a slide that had a derivative and an integral on it today. So you came for containers, you got calculus.

Participant 1: I was thinking, mostly a PID controller gives a value between zero and one, I think, to have like a gradual output. How do you actually translate that to the amount of replicas you want for a certain item?

Lancey: Basically, you take that scale of how far off it is, and then you just have to apply some math of what the average load per instance is, and translate that into how many instances.

Participant 1: So basically, you look at the maximum load that you are able to hit and …

Lancey: Basically, you divide the load you have by the instances. So assume it's parceled up evenly, and then you make … it is just an educated guess. But mathematically, you know, "Okay, I have five instances at 100% load. I think the load is up to like 130%. How many more does it take, given that each instance is taking X% of load?" I can't give specific numbers.
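Her example numbers can be worked through with exactly that parceled-up assumption (the numbers are hers; the arithmetic is the standard proportional guess):

```python
import math

# 5 instances comfortable at 100% of their per-instance load,
# observed aggregate load now at 130% of total capacity.
instances, target_pct, observed_pct = 5, 100, 130
needed = math.ceil(instances * observed_pct / target_pct)
assert needed == 7  # ceil(6.5): two more instances to absorb the excess
```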

Participant 1: I understand. But you could even make that part learning. At first, it's an educated guess, and then you could maybe use statistics to even tune that.

Lancey: Exactly.

Participant 2: As an ex-physicist, I'm delighted to see calculus in your talk as well. Have you or have you seen applications or have you used them, set yourself for node elasticity as well as replica elasticity?

Lancey: For node scaling?

Participant 2: Yes.

Lancey: I haven't done it for that use case. You could apply the same principles, but I think you'd want a distinct controller with tuned logic for that exact case. I've always found the built-in ones good enough, but I don't run at especially high demand.

Participant 3: With the integral term, what were the initial and final conditions on that definite integral, and what is the significance there or how are they chosen, exactly? Is there some sort of like slicing scheme that you have across your time series?

Lancey: I mean, the bad answer I can give is that it varies by implementation. In practice, it's typically from when the current error started up to the current time. So basically, looking as far back as when the error was last zero, up to now. This time series is a little bit misleading, because technically, after that blue line, it should be cut off, because we don't know what's going to happen yet. Does that make sense?

Participant 3: It does, but I guess what I'm wondering is if you're in a case where that error curve is constantly positive, then it's not going to converge to zero. So I guess what I'm wondering is what this term is meant to represent?

Lancey: You mean, can it become infinite?

Participant 3: Yes.

Lancey: If the system was running for an infinite length of time; it's cut off by the current time. Once the error starts, the integral starts, and it runs up until the present. The benefit of it rising over time is that the longer the error goes on, the more pressure there is to get it down. So at some point, it's going to hit zero again, and the term resets.

Participant 3: So basically, when you sample the error again, you would start the clock at that point. That would be your new zero. Is that what's going on?

Lancey: Yes. The clock for the term starts whenever the error was last close enough to zero.

Participant 4: Long ago, I was working on a robotics system where we had some slop in a gearbox and it essentially introduced a time delay. And I remember doing some research and finding that people hadn't modeled time delay into control systems, but the math was over my head at that point. Have you tried actually modeling time delays into these PID controllers and seeing into your controllers and seeing if it is worth the effort?

Lancey: I've never done it "properly," because I find that a bit of a kludge, if that makes sense. The more strictly you know how the system is going to behave, the easier it is to put it into a generic response. Factory design is able to get away with a lot of stuff because they know everything about the factory. Whereas if we try to have a cloud platform where everything works out of the box, the generics don't scale well to every one of the use cases.




Recorded at:

Feb 09, 2019