The Highs and Lows of Stateful Containers



Alex Robinson walks through his experiences trying to reliably run a distributed database on Kubernetes, optimize its performance, and help others do the same in their heterogeneous environments. He looks at what kinds of stateful applications can most easily be run in containers, and a number of pitfalls he encountered along the way.


Alex Robinson is a member of the technical staff at Cockroach Labs, where he works on CockroachDB's core transactional storage layer and leads all integrations with orchestration systems. Previously, he was a senior software engineer at Google, where he spent his last two years as a core early developer of Kubernetes and GKE.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.


Robinson: Thank you all for making it out this morning to this, the container track. It's a pleasure to be here. And I am going to be talking to you guys today not about CockroachDB really, but about what it's like running stateful applications in containers, how you should approach it, what you should think about it, and so on. And to set the tone for the conversation, I'd like to start by just stressing the importance of stateful applications by calling out some outages that probably affected some of the people in this room. And so who here remembers the GitHub outage from a couple of weeks ago? I know it affected my daily work. And this was caused by a network partition that messed up GitHub's database replication. The site was offline for hours. It took them the rest of the day while running in a fully degraded mode for the site to come back to fully operational. And it was all caused by the storage system having problems and it affected the entire business application.

The same thing happened with Azure in the South Central region a couple of months ago. There was some bad weather that took the storage system offline, and that caused a whole bunch of higher-level services to be offline for the entire duration of the incident. Or if you go back to 2017, you might remember the GitLab outage where an operator accidentally deleted the data from their primary database replica and they were offline for almost an entire day and even livestreamed their attempts to recover the data and recover from this failure. In all of these cases, a stateful system went offline for some reason, a different reason in every case, and it took down the entire application, everything that depended on that stateful application. And so state is really important. If you mess up when you're managing your database or your messaging queue or whatever other stateful app you're running, it's going to have a huge business impact on whatever services you're running on top of it.

And this is true for almost all real applications. They almost all rely on state and it's almost always a very, very hard dependency where if that thing you're depending on goes down, so does your business application. And so you have to be really careful when you're managing these things. You often don't take a lot of risks with, say, how you manage your database, or which database you choose. And containers, which we're here to talk about today, are new and different. They represent change, and change is risky. If you're planning on running, say, your database in containers, you're now planning on running your database on top of thousands or tens of thousands of lines of new code that hasn't been stressed for more than a few years. I mean, Docker is relatively recent as far as software goes.

And so it's very rational to be really careful when approaching this problem. It's not something you should do without good reason. It's not something you should do without thinking really carefully about it. And I'd like to go into some of the things that you should be thinking about today as you do approach this problem. And so if you take anything away from this talk, let it be this, which is that if you want to succeed in moving state into containers, you need to do three things. You have to understand whatever stateful application you're running, you have to understand your orchestration system wherever you're running that stateful workload, and you have to plan for the worst. So if you don't understand what you're running, the environment you're running it in, and what bad things can happen, you're almost certainly going to run into some of those bad things. And you're going to experience downtime, you might lose data and everyone is going to be really unhappy about it, especially your customers. And so we're going to go over a few main topics today, following that pattern. After motivating why you would even want to do this, we will talk about the things that you should be thinking about, whatever stateful application you choose. What do you need to know about it? What do you need to know about the orchestration system that you're running everything in? And what should you plan for going wrong? What kinds of things are likely to bite you if you don't plan appropriately?

And so before getting into anything, I'm curious, who here has tried running a stateful workload in a container before? Could you raise your hand? Pretty good. And who here is running stateful code in production in containers? Way fewer hands but more than I was expecting. That's good. And so before we go into anything, just so you know where I'm coming from, my experience with stateful containers goes back to 2014 when I worked on the Kubernetes project and the GKE hosted service at Google. I was on the original team that launched GKE and worked there for a couple of years before leaving to go work on CockroachDB.

At CockroachDB, I work on the database itself, but also in my copious free time, I work on all of our container-related efforts. So I've worked on configurations to run CockroachDB on Kubernetes and DC/OS and Docker Swarm and Cloud Foundry, across all different cloud providers and on-prem environments, whether you want to run it in a single availability zone or even across multiple regions in some instances. And in addition to just creating configurations that we think people should use, I've also had to help a lot of people out with configurations that they tried to make on their own. So I've seen people do things very wrong in a number of different ways, and tried to help them out of the holes that they dug for themselves.

Why Would You Even Want to Run Stateful Applications in Containers?

And so to start things out, why should you even bother with trying to run state in containers? I mean, we've been running stateful services for decades and we've been doing a pretty good job of it, to be honest. Obviously, we have failures, but it's been working for a very long time. And how you would traditionally run a stateful service is you'd have some data center or a colo somewhere, you'd provision some really big machines with a bunch of disks, a lot of CPU and memory. You'd put those up somewhere in the data center, you'd copy whatever binaries you want to run and whatever configuration files you need onto those machines. You'd start them up, connect them together if you're running with more than just a single replica, and then you would never touch anything again. You just let things sit there until you need to apply a security patch to the binary, and you might roll out a new version of the binary every once in a while, but you're not going to be migrating across machines. You're not going to be doing daily deployments. We're not running a web server here where you're going to be pushing new code every hour or every day or every week. You typically let these things run for as long as you can and only touch them when you need to.

And this has some serious pros. I mean, this is very reliable. It's predictable. It's easy to understand. Things aren't going to be changing out from underneath you because you're changing everything yourself manually. Nothing is going to happen without you knowing about it, other than the occasional hardware failure of course. Now, the issue with this is that because you're doing everything manually, scaling can be hard. It's a lot of extra work. Recovering from hardware failures requires extra work. Someone's going to be paged in the middle of the night to resolve a situation if a machine goes down. And those manual intervention steps might not be very well practiced, right? If you only do it once in a while, it's not something that you really have down. That's part of the reason why people like pushing more frequently. With modern applications, if you push more frequently, you get better at it, you're able to roll back more quickly, you know exactly what to do when and if things go wrong. Whereas if you're only pushing once or twice every year, this is something you don't practice very well and that you might make mistakes while doing.

And if you want to think about moving to containers, you might first ask, “Can you just do the same thing? Can I just deploy a few machines in my data center, and just do docker run instead of starting binaries?” And absolutely, that's a totally legit way to start moving things into containers, or it can even be a final end state. But it's not really what people are usually talking about when they talk about running stateful containers. This is basically running things the same way you've always done, but instead of using binaries, you're just pushing Docker images and running Docker containers instead of running those binaries directly.

And this is not what you'll get by default if you're using any major orchestration system; they all introduce new levels of dynamic movement into the system. They're rescheduling things for you, they're putting other workloads on the same machines. But this is totally fine. This has some benefits to it as well. If you want to do this, you know, maybe you can start deploying everything as Docker images within your company. You don't have to deal with binaries anywhere anymore. Maybe you want to run other things on these beefy machines alongside your stateful applications. Using Docker containers makes it easy to get good resource isolation between multiple things running on the same machine. So this has some nice perks.

But we're not going to be talking about that a whole lot today. We're going to talk more about moving into fully orchestrated containers. You know, going whole hog, committing to using something like Kubernetes or DC/OS to run your stateful workloads. And you do this for some of the same reasons why you'd move stateless applications into containers. It automates a lot of the operational steps of running a workload, so it automates deployments, scheduling, scaling, fault tolerance and recovery from failures, rolling upgrades. And it takes these all from things that you used to do manually or not do at all, to things that you can do with just a push of a button. And that means less manual toil for your operators, less room for manual error. Also, as I said, this can give you better resource isolation if you don't want to devote entire machines to your database, which is probably the case if you're just starting out with a small new application.

And another reason you might do this is just because you're tired of having separate workflows for your different applications. If you're pushing all your web apps into a container orchestration system, you might want to do the same thing with your stateful application just for convenience to avoid maintaining two deployment pipelines, maintaining separate ways of collecting logs and metrics and all of that stuff.

Challenges of Managing State

So assuming you're sufficiently motivated to move into containers, what are some of the challenges you're likely to face with your given stateful application of choice? What do you need to know about it before you decide whether you should run it in containers and how to run it in containers? So there are a handful of things that pretty much all stateful systems need in order to run reliably. They almost all need some sort of process management, which just means that, say, if the process fails, something will be there to restart it. If the process gets stuck and can no longer respond to any incoming requests, maybe you want the process manager to kill it and restart it. This is something that all applications need, not just stateful ones. And then most importantly, of course, is persistent storage.

Now you need storage that is really going to stick around through hardware restarts, through power surges, through all sorts of things. You want your state to stick around. And you really want storage that is about as reliable as you could get from an underlying physical disk. If storage gets wiped, say when you delete and recreate a VM in the cloud, that's probably not very good because VMs can be deleted and recreated all the time. And if you're going to lose your data from the storage, even say 1% of the time when you do like a restart of your VMs or something, then it's not worth trusting.

If your stateful system is also distributed, then you need a few more things. All the replicas are going to have to be able to talk to each other. So the primary is going to have to be able to talk to the secondary, or all the processes in the cluster have to be able to talk to each other. They're probably going to want some sort of consistent address that they can use to talk to each other as time goes on. And they're going to need to discover what other peers exist. So they're going to have to be able to find each other somehow in the network. And some distributed systems can get by with less. So for example, CockroachDB doesn't really need a consistent name or address, and it doesn't really need peer discovery. But there are other systems that require much more. So if you have something that relies on, like, an external master election that chooses one of the replicas to be the master and the others to be the secondary replicas, then you might need something external in addition to this list here.

And if the particular application you choose requires more than this list of things, you should be very thoughtful when you're trying to move it into an orchestration environment, because these are things that the implementers of the orchestration systems have considered and thought about and tried to provide features for. If you go beyond this list, you might start running into things that they haven't thought about and haven't created features for. You might start being responsible for them yourself.

You also should probably think about how your stateful application behaves when certain things go wrong. So what happens if a node is offline for 10 minutes, or what happens if it can't connect to any of the other replicas for a few minutes? These are things that might be more likely to happen in an orchestrated environment because things are changing more often. And so if they might not have been stress-tested much in your previous environment, you might want to think about them and test them before committing to a new one.

And so process management and network connectivity are like the bread and butter of modern orchestration systems. They pretty much all provide these out of the box. They'll do it really well. You won't have to worry about them at all. But the other three things, persistent storage, consistent addresses, and peer discovery, you do have to think more carefully about. You might not get them if you use the wrong features. If you pick the wrong controller type in Kubernetes, for example, you're probably not going to get these things, and you have to be careful that you're getting exactly the semantics out of them that your application relies on.

What You Need to Know about Your Orchestration System

And so, moving on to point two, what you need to know about your orchestration system, let's consider that simple example of just deploying things the traditional way, but using Docker containers instead of binaries. What's the thing that you need to know about in that bare-minimum change? And you mostly just have to worry about storage. When you're talking about stateful systems, you always need to worry about the storage. And the main thing here is that you should never store data in the container's file system. I mean, this is kind of Docker Containers 101. I'm sorry if it's remedial for a lot of you, but never ever store data inside the container's file system. And so when you create a container and run it, all the processes inside of it have access to a file system that they can use, they can read from and write to, and it's the file system that was created when you made that Docker image. However, all of the data that you write to that file system is going to be lost when that container is stopped and thrown away. It's not going to be there if you, say, take down your container and start a new one in its place.

And these containers also use copy-on-write file systems, which are fairly slow. Sometimes in the past, they've had bugs. Just never write any data to your container's file system. You always want to do what the second and third images show, which is either mount in a directory from the host, or mount in some sort of network-attached storage and make sure your database or your message queue or whatever is writing to that mounted-in directory. So to summarize, if you're using Docker, don't just do docker run, your database name, and then start. That's going to cause the data to be written into the container's file system. It's going to be lost when you stop the container. If you care at all about the data, make sure you're mounting in a directory. So here you're using the -v flag, passing in /mnt/data1:/data, which takes /mnt/data1 from the host and puts it at /data in the container. So you can do that and then tell your application that you want it to store its data in that directory and you'll be good. You could stop this container and start a new one the same way, and it'll come back up with the data from the first run.

The only other thing you might need to think about with Docker containers is that you want to make sure that the ports that they use are accessible to containers running on other machines. You could do this through some sort of overlay network, like one of Docker's built-in networks, but that's going to incur some performance overhead, some complexity overhead. If you're trying to keep things as simple as possible, just either use the host networking stack for your process, or map in the ports that the container needs. So this way, other machines will be able to talk to this machine on port 26257 and have it connect directly into Cockroach. So that's the port that's used to send SQL queries to Cockroach.
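For reference, the two points above (a host directory mounted into the container, and a port mapped through to the host) can be sketched in Docker Compose form. This is an illustrative sketch rather than the slide from the talk: the Compose layout, the service name, and the --insecure and --store flags are assumptions, while the image name, the host path /mnt/data1, and port 26257 come from the talk.

```yaml
# Hypothetical docker-compose.yml; a sketch, not the configuration shown in the talk.
services:
  cockroachdb:                   # service name chosen for illustration
    image: cockroachdb/cockroach
    # Flags are assumptions and vary by CockroachDB version; newer releases
    # may expect start-single-node or an explicit --join list instead.
    command: start --insecure --store=/data
    volumes:
      # host path : container path. Data written to /data lands on the host
      # disk and survives the container being stopped and replaced.
      - /mnt/data1:/data
    ports:
      # Publish the SQL port so other machines can reach this node directly.
      - "26257:26257"
```

The part that matters is the pairing of the volume with the store path: if the database is not told to write under the mounted directory, the data still ends up in the container's own file system.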

So again, this way of running applications in Docker is hardly any different from running things the traditional way. It lets you automate things like packaging your binary and doing distribution of that binary and isolating resources on a machine, but everything else is still manual.

Managing State on Kubernetes

So let's talk about something a little more interesting, which is managing containers in an orchestration system. What kinds of things do you need to think about? What's likely to go wrong? What are some of the sharp edges? And I'm going to spend most of this time looking at Kubernetes, just because it's what I'm most familiar with. It's what seems to be most popular based on usage of Cockroach. And so I'll try to throw in references to DC/OS and Swarm and the others where I can, but we will mostly be looking at Kubernetes. And this talk would have been really different just a couple of years ago. Kubernetes has changed a ton over the last couple of years. If you tried to run state even just two years ago, you'd be taking a lot of risk on. Whereas today, the underlying primitives have really stabilized and are something that you can build a reliable system on top of.

And this isn't going to be just a pure 101 talk. Hopefully, that doesn't put too many people off. I'm not going to go into all the details of what a StatefulSet is, how it works. Just know that if you're trying to run a stateful application in Kubernetes, you probably want to use what's known as a StatefulSet, which is a special controller that's designed to manage a set of replicas for a stateful workload. And it gives each replica one or more persistent volumes that you can store data in that will be kept around if the replica has to be deleted and recreated for some reason. And it gives each replica its own network address and lets them easily discover their peers. So you might notice this is called a StatefulSet. It gives you the three things that I said earlier stateful applications typically need, and that's not an accident. It was designed specifically for this purpose. So you should probably use it if you're trying to run state in Kubernetes.

There are other options out there. You can just manually pin individual replicas to nodes if you want. That's not going to get you most of the benefit of Kubernetes, but it's not unreasonable. You just have to go in and read the docs and figure out how exactly you pin things onto individual nodes. Or you might use a DaemonSet, which allows you to just run one replica of something on every Kubernetes node in the cluster. And this is most useful when you're in on-prem environments and you don't have network-attached storage to use. But we'll get into that a little later. It's good if you want to use the Kubernetes nodes' disks.

But moving on, let's look at where things can go wrong. Say you've created a StatefulSet configuration; this is like a minimal StatefulSet configuration for CockroachDB. Don't worry, I'm not going to quiz you on how this all works, but it's creating a StatefulSet. It's naming it cockroachdb, it's setting up three replicas of that container, and it's using the container image cockroachdb/cockroach. It's exposing two ports for anything in the cluster to talk to. And then it is running the command cockroach start --join cockroachdb. So this will set up a functional three-node CockroachDB cluster in Kubernetes, but there's something very wrong with it. And this is something that I've had multiple users come to our GitHub repo or to our support channels asking about, why their config isn't working after they put together something like this. They'll go do this, they'll deploy it into their cluster. It'll work great for a few days and then they'll come back like a week later saying, "Hey, all my data's gone. What happened?"
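A configuration along the lines just described might look roughly like the sketch below. It is a reconstruction for illustration, not the slide itself: the apiVersion, the labels, the port names, the second port number (8080), the binary path, and the --insecure flag are assumptions. The name, replica count, image, and join target follow the description in the talk. Note that it requests no volume anywhere, which is the problem being described.

```yaml
# Sketch of a minimal StatefulSet with NO persistent storage requested.
# Shown to illustrate the pitfall; do not use as-is for real data.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: cockroachdb
spec:
  serviceName: cockroachdb       # assumes a headless Service with this name exists
  replicas: 3
  selector:
    matchLabels:
      app: cockroachdb
  template:
    metadata:
      labels:
        app: cockroachdb
    spec:
      containers:
        - name: cockroachdb
          image: cockroachdb/cockroach
          ports:
            - containerPort: 26257   # SQL and inter-node traffic
              name: grpc
            - containerPort: 8080    # assumed HTTP port
              name: http
          command:                   # binary path and --insecure are assumptions
            - /cockroach/cockroach
            - start
            - --insecure
            - --join=cockroachdb
          # No volumeMounts and no volumeClaimTemplates: every byte is written
          # to the container's own file system and is lost with the pod.
```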

And what's wrong here isn't something that we did wrong in the config. It's missing something. It's actually missing the same thing that was missing in the previous example. We're not telling it where to store the data. StatefulSets are called stateful, but by default, they don't actually give you any persistent storage. You have to ask for it, which is kind of ironic. It's understandable that they don't just create a disk for you, because they might not know what size you want or where you want it to be created. But it is somewhat unfortunate that by default, you don't get any persistent storage. And so again, always think about where data is going to be stored. Double check it. Once your application is running, check that the state is being stored somewhere reasonable, test that you can go and recreate things without the state being lost. This is absolutely critical; do this before trusting anything with your data, whether it's in containers or not. And so again, don't store data in the container. That's what'll happen if you don't explicitly ask for storage.

So one way you might ask for storage is to use a dynamic volume provisioner. So you add this volumeClaimTemplates stanza to your StatefulSet. And it'll say, "I want a storage device that I can read from and write to. I want it to be at least a hundred gigabytes and then mounted at this spot in the container." So that's all that YAML stanza is doing. And so if you do this, you will have persistent data. You can take down all your nodes and recreate them again, you can stop all the containers and restart them again, and when they come back up, you'll have your data in place. So that's great.
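As a rough sketch of what such a stanza looks like, the fragment below shows only the storage-related parts of a StatefulSet spec. The volume name datadir and the mount path are assumptions chosen for illustration; the read-write access and the hundred-gigabyte request mirror the description above.

```yaml
# Fragment of a StatefulSet spec (not a complete manifest).
spec:
  template:
    spec:
      containers:
        - name: cockroachdb
          volumeMounts:
            - name: datadir                         # must match the claim template name
              mountPath: /cockroach/cockroach-data  # assumed data directory
  volumeClaimTemplates:
    - metadata:
        name: datadir
      spec:
        accessModes:
          - ReadWriteOnce        # one node can read from and write to it
        resources:
          requests:
            storage: 100Gi       # at least a hundred gigabytes
```

Each replica gets its own claim from the template, and that claim is kept around if the replica's pod is deleted and recreated.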

There is still something that's not great about this though, which is that it's probably going to be slow, and it's another thing that's not at all obvious unless you are either really familiar with Kubernetes or have really dug through the docs. But we created a configuration file like this a couple of years ago, and a couple of early testers came to us and were asking why the performance they were getting was so much worse than when they were running the same processes directly on the machines. So same machines, same binary, just running in Kubernetes versus running outside of Kubernetes. It was something like a 30% performance difference, sometimes less, sometimes more depending on the workload. But it was pretty serious. If you could get a 30% performance improvement in a database by just changing a couple of configuration parameters, you should absolutely do it.

And what was going on here is that by default, when you ask for these volumes in most environments, you'll get a magnetic spinning hard drive. You won't be given an SSD by default, which is a pretty bad default. On GCE, you'll get their pd-standard disk type, which is not SSD. On Azure, you'll get their non-managed blob storage, which is not SSD, instead of one of their better managed storage offerings. On AWS, they're nice, they will give you an SSD, but it's not one that's optimized for high I/O workloads. I think that's gp2, not io1, so I'm sorry about that. But they won't give you one of their better SSDs, just one of their low-end SSDs.

This affects more than just Kubernetes too. If you're using Docker Swarm and you ask for a volume on AWS, they'll give you a magnetic spinning hard drive. I don't know why it differs between Docker and Kubernetes. Yeah, so you have to be really careful. Go into your cloud console, use the tools or whatever, check behind the scenes what your cluster manager is doing for you, what disks it's created, whether they're really what you want or not. So just changing from the default of using the spinning disk to using SSD can give you a huge boost in performance if your workload is at all I/O limited.

And so you fix this by adding more YAML. That's always the solution to problems in Kubernetes. Just more YAML. So on the left here we have what's called a StorageClass. This is a type of storage you can then ask for later when you want to have storage created. And so, the provisioner line here says we're using the GCE persistent disk provisioner, and the GCE persistent disk provisioner is built into Kubernetes and it understands the type pd-ssd, which tells it to create SSD disks in Google Cloud, which it knows how to do. So if we create this StorageClass and name it fast, we can then in our [inaudible 21:32] go back and add that StorageClass name to what we're asking for. If we do this, we'll then be getting SSDs. You can also, which I would recommend, mark this as the default storage class. So if you go in, you can run a kubectl command to set it to be the default storage class. And then all of a sudden, you'll always be getting SSDs even if you don't specifically ask for them. So, it's a good tip.
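A sketch of the two pieces described above follows: a StorageClass named fast backed by the GCE persistent disk provisioner, and the claim fragment that asks for it by name. The default-class annotation is included to show what marking the class as default amounts to. The claim name datadir and the 100Gi size are assumptions carried over for illustration, and newer clusters may use a CSI driver in place of the in-tree provisioner named here.

```yaml
# StorageClass that provisions SSD-backed persistent disks on GCE.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast
  annotations:
    # Optional: makes this the class used when a claim names no class.
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-ssd
---
# Fragment of a StatefulSet spec: the claim template now names the class.
spec:
  volumeClaimTemplates:
    - metadata:
        name: datadir            # assumed name
      spec:
        storageClassName: fast
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 100Gi
```

If another class in the cluster is already marked as default, it would need that annotation removed or set to "false" so that only one default remains.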

Of course, this didn’t fix all the performance problems; there are still other performance problems when you are running in Kubernetes. We don’t have time to go into a lot of them today. If you do want to learn more, we have some really good documentation on our website, that is not at all CockroachDB specific. It’s just if you’re running a stateful workload in Kubernetes, these are things you should do to try to get better performance. Some of them are simple and easy, and you should definitely do them. Some of them are a bit more out there and you might not want to do them, if you are getting, you know, 5% improvement. Some of them are related to networking, some are related to resource isolation, but if you are curious about getting more performance in Kubernetes, I recommend checking out this guy.

We've covered a couple of defaults that aren't great. So by default, you don't get any persistent storage, and by default you get slow storage. What else can go wrong? Here's something that we noticed early on when we wrote a test that would create a cluster, delete all the nodes and expect them to come back up. So you can reproduce this just by creating a three-node Kubernetes cluster and a three-replica StatefulSet, and then failing one of the machines. Just fail one of the machines once everything is running. And what should happen? If you were doing this on VMs, CockroachDB would continue plugging along just fine. If you lost one of the VMs that was running Cockroach, the other two still form a quorum, they form a majority. Cockroach is a consistent system, or a CP system in CAP theorem terms. So as long as it sees a majority of the nodes, it'll continue to function.

But in Kubernetes, we'd see that sometimes, once in a while, this wouldn't work. So while the node was down, the Cockroach cluster was no longer functioning. And what was going on was that the Kubernetes scheduler was putting two Cockroach replicas on the same machine and leaving one of the machines empty. And this is not at all what you would really want when you're running a stateful service, especially something that needs a majority of replicas still online to function. And so the Kubernetes scheduler does not give any preference to trying to spread things out by default, even if it knows that they're from the same StatefulSet. So as soon as node one got taken down, if we happened to choose node one to take down in this example, the Cockroach pod on node two would no longer be able to make progress. And what was happening here is that there were other things running on node three, so some of the kube-system pods might be running there. Maybe you already had some other applications running there, and the scheduler thought that node one had more space. It thought it was better to put two replicas on node one than to try and put one on each of them.

So this is bad. It's really bad for fault tolerance because, if two go down, suddenly your system is unavailable until they get brought back up. It's really bad if you are using local disk for your storage. So if you're using network-attached storage, which is the default for StatefulSets, then it's okay because the disks will still be fine. They can be reconnected to another machine and you won't lose any data. But Kubernetes has a feature that's currently in beta to use local disk with your StatefulSets, and then it will try to always schedule the same replica onto the same machine. If you're using that and then it puts two replicas on the same machine and that machine dies or that disk fails, you could have just lost data, because maybe you just lost two out of the three copies of a piece of data and you'll have to do some sort of manual consistency recovery to recover from that third replica, which might have been behind the other two.

So this is really bad, especially if you're trying to use local disk. Of course you can always fix it with more YAML. You add what's known as a podAntiAffinity configuration. So this is telling the scheduler to look at whether two pods have this label, app=cockroachdb, so that's the second big yellow box there, that's the label that we've put on all of our Cockroach pods. If pods have that label in common, don't put them on the same host name. That's all this is saying. It's like 10 lines of YAML for something that seems a lot simpler. If you do this though, the scheduler will do its best. It's still not guaranteed though. I haven't checked if there have been additional options in recent versions of Kubernetes that make it required or that will try to fix it if they can repair it later on. So this still doesn't guarantee it, but it gives you a way better fighting chance of having your pods spread out.
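A sketch of such a stanza, written as a fragment of the StatefulSet's pod template, is shown below. It uses the preferred (best-effort) form, which matches the behavior described above where the scheduler does its best without a guarantee. The weight value is an arbitrary assumption; the label and the hostname topology key follow the description in the talk.

```yaml
# Fragment of a StatefulSet spec (not a complete manifest).
spec:
  template:
    spec:
      affinity:
        podAntiAffinity:
          # "preferred" is a soft rule: the scheduler tries to honor it but
          # will still co-locate pods if it finds no better placement.
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values:
                        - cockroachdb
                # Treat each host name as a separate failure domain.
                topologyKey: kubernetes.io/hostname
```

The Kubernetes API also defines a requiredDuringSchedulingIgnoredDuringExecution form, which makes the rule a hard constraint at scheduling time; with that form a pod stays unscheduled if no node satisfies the rule.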

So we've covered a few bad defaults. There are still other things that can go wrong. What's something that's not a default that you might want to think about when you're creating your StatefulSet? This one was particularly painful. In early tests, if you brought down all of the Cockroach pods at once in a Kubernetes cluster, only the first one would come back up. So you could have a three node cluster, delete all the pods. Yeah, maybe a power surge hits your data center and all the machines got rebooted simultaneously. Only the first one would come back up. This was terrible because say you deployed this, everything would be running perfectly for a long time. I mean weeks, maybe months, you can do rolling upgrades and everything would keep working fine because rolling upgrades only take one pod down at a time. But as soon as all of them came down at the same time for any reason, they wouldn't recover.

Luckily, we found this in testing, but this is the sort of bug that would just destroy you in production. If it happened in the middle of the night, you have to get paged. You have to try and figure out why Kubernetes isn't scheduling your darn pod. And when we looked into what was going on, we could see that the readiness probe was failing on the pod. And you might see in this picture that it's marked as zero out of one containers are ready. And it turns out that StatefulSets only create one pod at a time and they wait for that pod to pass all readiness probes before they'll go on and create the rest of the pods. So if you ask for 50 replicas of a StatefulSet and the first one doesn't ever consider itself ready, then the other 49 will never be created.

And of course, the Cockroach health check that we were using as the readiness probe only returned healthy if the node was able to serve SQL queries. And in a CP system, a node is only able to serve SQL queries if it's connected to a majority of the other nodes. So if a node thinks it's off in a partition by itself, this poor little first pod that gets created thinks it's in a partition because it knows there are other nodes out there but it can't talk to any of them. It's going to say, "No, I'm not able to serve queries, don't consider me ready." And this is partially due to an overloading of what a readiness probe is used for.

The original idea of a readiness probe is meant to ask, should this pod be used as a back end for a service, for a load-balanced service in Kubernetes? And that way, the service will only direct queries from clients to those pods that are ready to process those queries. However, StatefulSets kind of overload what it's used for to also ask, is this pod functional and should I move on and create the next one? And this was not working very well for us.

So just to visualize what was going on, Kubernetes was health checking every pod in the cluster. And when they're all running, it's great. They all returned that they're healthy and they can all serve queries and so on. If just one node is to fail, it's fine. The other two nodes, they're still considered healthy because they can still serve queries and Kubernetes goes ahead and creates that missing pod. However, if only one of them is left, then the StatefulSet is going to wait for it to become ready and the pod is going to wait for other pods to be created before it says it's ready. And so you've gotten a classic deadlock scenario that never resolves itself.
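The deadlock can be reproduced with a toy model. This is a deliberately simplified sketch, not the StatefulSet controller's actual logic: it assumes a pod reports ready only when a majority of pods are running, and that an ordered controller refuses to create the next pod until the previous one is ready. The function name and parameters are hypothetical.

```python
def pods_started(replicas: int, ordered: bool, ready_needs_quorum: bool) -> int:
    """Return how many pods get created after every pod has been deleted."""
    majority = replicas // 2 + 1
    running = 0
    while running < replicas:
        running += 1  # the controller creates the next pod
        # A quorum-based readiness check passes only once a majority is up.
        ready = (running >= majority) if ready_needs_quorum else True
        if ordered and not ready:
            break  # ordered creation waits for readiness that never comes
    return running


print(pods_started(3, ordered=True, ready_needs_quorum=True))    # 1
print(pods_started(3, ordered=True, ready_needs_quorum=False))   # 3
print(pods_started(3, ordered=False, ready_needs_quorum=True))   # 3
```

The first case is the stuck cluster. The other two correspond to the two ways out: a readiness check that does not depend on quorum, or creating pods without waiting on each other.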

So we were able to solve this. We kept the liveness probe that we had in place. So the liveness probe was just a basic HTTP check that would return 200 any time it got a request, so just to check that it could receive and respond to any network requests. But we had to actually create a new health check endpoint in Cockroach that better matched the semantics Kubernetes wanted. So this is a severe case of needing to understand both the database and the orchestration system, and how they work together, even in the weirdest of cases.

We had to create this new probe endpoint that would return success essentially if the node had finished booting up. So if it was ready to even accept incoming connections. This wouldn't necessarily imply that it was ready to respond and serve queries, because maybe it can't talk to the other nodes still, but at least it's up and running and is as healthy as it's going to be until it can talk to other nodes. So we had to go ahead and actually modify the database to really suit Kubernetes' needs. Of course, later on, StatefulSets had an option added to them that lets you create all the pods in parallel, rather than one at a time. We've also enabled this and it would have prevented the problem from happening in the first place. So some applications do really want all their stuff to come up one at a time in a nice sequential order so they can initialize however they want to do it. That's not an issue with Cockroach. So just starting in parallel makes things go faster and avoids weird scenarios like this.
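As a rough sketch of how those two fixes might appear in a StatefulSet, here is a manifest fragment written as a Python dictionary. The podManagementPolicy, livenessProbe, readinessProbe, and httpGet field names are part of the Kubernetes API; the probe paths, the port number, and the timing values are assumptions made for illustration and are not taken from the talk.

```python
import json

# Fragment of a StatefulSet, not a complete manifest.
statefulset_fragment = {
    "spec": {
        # Create all pods at once instead of waiting on each pod's readiness.
        "podManagementPolicy": "Parallel",
        "template": {
            "spec": {
                "containers": [
                    {
                        "name": "cockroachdb",
                        # Liveness: can the process answer any request at all?
                        "livenessProbe": {
                            "httpGet": {"path": "/health", "port": 8080},  # assumed path and port
                            "initialDelaySeconds": 30,  # assumed timing
                        },
                        # Readiness: has the node finished booting and can it accept
                        # connections, independent of whether it has reached a quorum?
                        "readinessProbe": {
                            "httpGet": {"path": "/health?ready=1", "port": 8080},  # assumed path
                            "periodSeconds": 5,  # assumed timing
                        },
                    }
                ]
            }
        },
    }
}

print(json.dumps(statefulset_fragment, indent=2))
```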

So we have somewhat limited time today so I can't keep going on and on and on and on. But if you want a reference of other things you might want to look at, some things you might want to consider are make sure you've set resource requests and limits, so that your pods are guaranteed certain amounts of CPU and memory even as you add more applications to your cluster. This is another problem where when you deploy it, things might be working great. Then when you add like 50 other containers to your cluster, they start stealing and using the resources that previously had been needed by the database.
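One way to reason about requests and limits is through the quality-of-service class Kubernetes derives from them. The sketch below is a simplified, single-container version of that rule written for illustration; the function name and the CPU and memory figures are hypothetical, and quantities are compared as plain strings, so "2" and "2000m" would be treated as different here.

```python
def qos_class(resources: dict) -> str:
    """Approximate the QoS class for a pod with one container."""
    requests = resources.get("requests", {})
    limits = resources.get("limits", {})
    if not requests and not limits:
        return "BestEffort"
    for name in ("cpu", "memory"):
        limit = limits.get(name)
        # An unset request defaults to the limit.
        request = requests.get(name, limit)
        if limit is None or request != limit:
            return "Burstable"
    return "Guaranteed"


# Hypothetical sizing for a database container.
database = {
    "requests": {"cpu": "2", "memory": "8Gi"},
    "limits": {"cpu": "2", "memory": "8Gi"},
}

print(qos_class(database))                       # Guaranteed
print(qos_class({"requests": {"cpu": "500m"}}))  # Burstable
print(qos_class({}))                             # BestEffort
```

Pods in the Guaranteed class are the last candidates for eviction when a node runs short of resources, which is usually what a database wants.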

Another bad default is that you don't get what's called a pod disruption budget by default. This is the construct that Kubernetes uses when doing routine maintenance in the cluster. You can tell it the maximum number of pods that it should ever take down at one time from your StatefulSet. So for Cockroach, for example, we say we don't want you to ever take down more than one pod at a time. And this will prevent it from voluntarily taking down a node that has a Cockroach pod on it, if that would take down too many at once. So this is relevant during things like node upgrades. So if you're in, say, Google Container Engine, or Kubernetes Engine now, it'll do node upgrades automatically for you and it will take down some nodes and it relies on these pod disruption budgets to be able to tell how many nodes it can safely take down at once while doing these upgrades.
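A budget of "never more than one pod down at a time" might be expressed as follows. This is a sketch built as a Python dictionary; the helper name and the budget's metadata name are hypothetical. It uses the policy/v1 API version, which is the current stable one; clusters from around the time of this talk would have used policy/v1beta1.

```python
import json


def pod_disruption_budget(name: str, label_key: str, label_value: str,
                          max_unavailable: int = 1) -> dict:
    """Build a PodDisruptionBudget limiting voluntary disruptions."""
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": name},
        "spec": {
            # At most this many matching pods may be voluntarily evicted at once.
            "maxUnavailable": max_unavailable,
            "selector": {"matchLabels": {label_key: label_value}},
        },
    }


budget = pod_disruption_budget("cockroachdb-budget", "app", "cockroachdb")
print(json.dumps(budget, indent=2))
```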

And related to that, in the cloud, you can't depend on your nodes to live forever because when services like GKE do these node upgrades, they want to change the Kubernetes software running on each of the VM nodes. They just delete the VMs and create new ones because it's easier than upgrading things in place. And so if you were storing anything on those local disks, that data is now gone. If you were relying on that IP address sticking around, like maybe you would configure it into your DNS somewhere, the IP addresses of each node, that's now gone. So you've got to be aware of that if you're running some sort of cloud hosted environment, that your machines might be just like taken out from underneath you once in a while and all that data is gone. This is especially important if you're trying to use that local disk feature that's currently in beta.

If you're running on-prem, you kind of have an opposite problem. Well, not an opposite, but you're going to have a tough time getting fast, reliable network attached storage. You might have to go pay a vendor a bunch of money. And a lot of people that run on-prem with Cockroach tend to just use local disks because they know the machines are going to stick around, but they don't have reliable fast network attached storage. And there are things out there like Ceph and Gluster that you can try to use to provide network attached volumes on top of physical machines, but running them reliably is really hard. And if the latency spikes up above a second or two to sync a block to disk, it's going to have real bad performance effects on your database.

So more issues. Make sure if you're issuing TLS certificates for your StatefulSet replicas that you include all the necessary addresses. You know, if you want to be able to talk to it from a different namespace, you have to include a namespace-scoped address in the certificate. Don't put pod IP addresses in certs because it'll work the first time you run everything, but then if you recreate the pod and it uses the same certificate, that pod IP address is no longer going to be the right address. It's really hard to make multi-region stateful systems work. So Kubernetes' entire networking model is designed around communicating within a single Kubernetes cluster. As soon as you want to have a pod in one cluster talk to a pod that's in another cluster in some other region, it's really tough to make the point-to-point connectivity work, and also to come up with some sort of persistently addressable name for each replica. So this is much more difficult, an order of magnitude more difficult than running in a single cluster. But if you are interested in doing it, a bunch of good work is being done in the community to make it better. In particular, there were talks yesterday from Cilium and Istio. Both of those teams are doing great work, Cilium to make point-to-point communication possible, and Istio to make names addressable across clusters. So you might want to check out those projects if you're interested in doing this.
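To make the certificate advice concrete, the following sketch lists the stable DNS names a StatefulSet pod can be reached by, which are the names a certificate would need to cover instead of pod IPs. The function and the example values are hypothetical; the naming pattern assumes a headless service and the common cluster.local cluster domain.

```python
def statefulset_dns_names(name: str, service: str, namespace: str,
                          replicas: int, cluster_domain: str = "cluster.local") -> list:
    """List per-pod DNS names at increasing scope for a StatefulSet."""
    names = []
    for ordinal in range(replicas):
        pod = f"{name}-{ordinal}"  # StatefulSet pods are named <name>-<ordinal>
        names.extend([
            f"{pod}.{service}",                                  # same namespace
            f"{pod}.{service}.{namespace}",                      # from other namespaces
            f"{pod}.{service}.{namespace}.svc.{cluster_domain}", # fully qualified
        ])
    return names


for dns_name in statefulset_dns_names("cockroachdb", "cockroachdb", "default", 3):
    print(dns_name)
```

These names survive pod recreation, whereas a pod's IP address can change every time the pod is rescheduled.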

How to Get Started

So that was a lot. There's a lot that can go wrong. I could have kept adding more and more slides if we'd had more time. But getting things right is far from easy; there's a lot that can go wrong, a lot you need to know about your database, about your orchestration system, and what if you aren't already an expert on either of them? You know, it takes a long time to get to know this stuff. You could take the time to build expertise. It might take a long time if you want to really build expertise and hopefully today, coming to the container track today is a good start. And there is a lot of benefit to understanding this stuff regardless. It's great to know your database's failure modes, how it responds to different issues.

Similarly, it's great to know what your orchestration system guarantees for the applications that you run in it, but it's legitimately a ton of work to do this on your own. So when I was creating the configs for Mesosphere's DC/OS, it took me weeks to figure out everything it would have guaranteed, how to use all the features correctly, to do testing to make sure that I wasn't doing anything wrong. And I had some help from people at Mesosphere to help me, point me in the right direction along this process. And it still took me forever to get everything going. It's really hard. That was coming in with a good understanding of Cockroach. So don't underestimate the amount of work it takes to get this right.

But ideally, someone has already done all this for you. So the nature of open source is that there are configurations out there on the internet for any popular open source project you might be interested in running. Go to the documentation page for your favorite database, your favorite messaging queue, your favorite whatever. It'll probably have instructions for how to run it in the popular orchestration frameworks. And in most cases, they've been made by someone that understands that project really well. So they'll at least get half of that equation right. They'll at least understand the database or the stateful application portion really well. They've probably gotten some help from someone who understands the orchestration system really well if they don't know it themselves and it's probably been used in production by other people. So obvious flaws have probably already been found. So this is a great way to get going if you don't want to have to do all this work yourself.

Now, I don't know of any good centralized repository for these. The Kubernetes examples repo has a few, but otherwise, your best bet is to go to the pages of these applications that you want to run. However, there are, of course, problems with these. So YAML is not a great configuration language, as everyone knows these days. It forces the configuration author to make a bunch of choices that would best be left to the user. So I have to pick what storage class name you want to use if I'm writing a configuration file for CockroachDB on Kubernetes. And I don't know what storage class names you've defined in your cluster. I don't know what environment you're running it on. I can't create the storage classes myself. I can't possibly know how big you want your disks to be, or how many CPUs you want to allocate to each replica, or what application-specific configuration options you want. And any user that wants to change these has to open up this big YAML file, find the relevant spots in it, and modify things appropriately. So this is not a good thing for usability, and it's a similar problem if you're trying to use things like Docker Compose and Swarm; you get these big YAML files, users have to go in and look at them and try to find the right things that they want to modify.

But package managers have really swept over the scene over the last couple of years. So these have typically created additional formats that make it easier to parameterize configurations. So you can go into a package manager for Kubernetes and say, "I want to run CockroachDB," and you get this nice list of like, "Here are all the parameters you can change," which not only makes it easy to change them because you don't have to go modify YAML files, but it also lets you know what parameters exist. So maybe you don't know what parameters you would even want to change. Maybe you didn't know about storage classes, but you go look at this list of parameters and one of the parameters is going to say storage class. “Would you like SSDs or spinning disks?” And then you'll realize, yes, that's an option. Thank you for saving me from that.

And so the user doesn't have to muck with any weird YAML files. So this makes things significantly more reusable, better both for the package creator and the package user. And so there are great repositories out there. So if you're using Kubernetes, you'd probably want to use Helm, which has what they call charts. So a package is a chart and they have dozens, if not hundreds, of charts for any popular open source application you could imagine. Or if you're using DC/OS, you can go to their Universe package manager for stateful applications and run, you know, all sorts of different stateful things in your cluster. Or if you're using Cloud Foundry, you can go to the Pivotal Services Marketplace and find things to run there that have been vetted and approved to run safely.
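The idea of parameterizing a configuration can be sketched without any particular package manager. The example below is not Helm's template format; it is a plain Python illustration in which a small set of named values, with defaults, drives the part of a manifest that a user would otherwise have to edit by hand. All names and default values here are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Values:
    """User-tunable parameters, discoverable in one place."""
    replicas: int = 3
    storage_class: Optional[str] = None  # None means use the cluster's default class
    storage_size: str = "100Gi"


def volume_claim_template(values: Values) -> dict:
    """Render the volume claim portion of a StatefulSet from the values."""
    spec = {
        "accessModes": ["ReadWriteOnce"],
        "resources": {"requests": {"storage": values.storage_size}},
    }
    if values.storage_class is not None:
        spec["storageClassName"] = values.storage_class
    return {"metadata": {"name": "datadir"}, "spec": spec}


default_claim = volume_claim_template(Values())
ssd_claim = volume_claim_template(Values(storage_class="fast-ssd", storage_size="500Gi"))
print(default_claim)
print(ssd_claim)
```

The author of the package no longer has to guess the storage class, and the user sees that the option exists just by reading the list of values.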

So, you know, for whatever orchestrator you're using, go check out the package manager. Even if you don't end up running those things directly, like maybe you're using a really niche project or something you built yourself so you can't find anything there, you can still learn a lot by looking at these configurations, seeing what patterns they use, seeing how they overcome things. And I definitely recommend, even if you're not using them, go check them out. And Docker has even this summer announced this concept of application packages, which is meant to make Compose files easier to reuse. I haven't personally tried it out. It's still considered very experimental, but it's worth looking at as it grows over time if you're using Swarm and Compose in production.


So to close things out, hopefully you have at least some idea of how to get started if you want to try to run data in production in containers. And please don't let configuration mistakes take down your service. Don't take your database down because you misconfigured something or you missed putting something in your configuration. Do proper testing, try testing node failures, try testing network partitions. Ideally, see what happens if a disk fails. And just remember to understand your application, understand your orchestrator, plan for the worst, or you can cheat and just use a package manager where someone's already done all the hard work.

So thank you for coming. If you want any more information on any of this stuff, we have plenty of docs on Or in the CockroachDB repo, you can see the history of issues that people have had with our Kubernetes' configuration or our DC/OS configuration. You can see the history of how we've changed things over time and you can always contact me with questions. So thank you.

Questions and Answers

Moderator: What we're going to do for the next nine or so minutes is questions and answers. I have the mic, Alex said he would answer questions and I'm going to tee it up by asking the obvious question. With all the caveats that you listed in this talk, I don't know if it was like a gripe session on how bad state is in containers, would you still recommend that people investigate this?

Robinson: Yes. I wouldn't recommend that you go out and start using an orchestration system just to run your stateful workloads. But if you've already heavily bought in, it's a great way to coalesce how you're doing everything to get the benefits of containers. But first, do it with your stateless stuff. Make sure you can run those reliably, get comfortable with all the technology. And then, and only then, would I look at trying to make a move with your stateful workloads. But once you've done that, it is a good way to run them.

Participant 1: Is there any difference between running a container and mounting the volume to a local path on the host machine, and running a PersistentVolume using hostPath on Kubernetes, like also mounting a host path on the machine, maybe on an external disk, but attached to the machine? Is there any difference between the two?

Robinson: If you're using the host path on the machine in Kubernetes and somehow making sure that that replica of your stateful workload only runs on that machine where that host path is present, then not in theory. In practice, you need to be really careful to make sure it's not going to get moved to another machine. And also if you're running in the cloud, that that machine is going to live there forever. So you have to be very careful in the cloud that your machines don't all just get deleted and recreated and your data is gone.

But if you're running on-prem and you're careful to correctly use things like taints and tolerations or just directly assigning to a node, then it is roughly the same, except for the networking. So one other thing you might want to consider is using the host's network if you're doing that. So there's a field in Kubernetes where you can say hostNetwork: true and that way, rather than using the virtual container network, you'll just use the host's IP address directly because there is some performance overhead using those overlay networks. But that would be the only real difference I'd call out.

Participant 2: I have a question about pod disruption budgets. How do they work, what does it prevent in terms of a disaster and how's it actually plumbed?

Robinson: Pod disruption budgets only help with what are considered voluntary pod terminations. So it's not going to help if your hardware fails for some reason, or something shuts everything offline. But there are these certain maintenance things that Kubernetes sometimes does where it will drain a node so that the operator can do upgrades on that node, or maybe it's going to have to preempt a process because the machine is running out of resources, and pod disruption budgets will tell Kubernetes that it shouldn't take more than one of your pods offline for such reasons at any given time. Now, because we've been using them since the early days of disruption budgets, I don't have any specific stories to tell about when the lack of a disruption budget caused a problem. But the Kubernetes docs themselves are pretty firm in the sense that you should use them if it's going to be a problem for your application.

Participant 3: I have actually two questions. In the best possible benchmarks you have done, what’s the performance hit that you take running your stateful application on multiple layers of virtualization, like Kubernetes is there, then you have network attached storage, so your block requests become TCP/IP requests and then become block requests again, so there's a lot of performance that you lose. Given that you have configured your system as well as you can, what is the minimum performance hit that you'll still take?

Robinson: It kind of depends on how far you go in optimizing your system. So if you run like Kubernetes using say PD-SSDs on Google versus running directly on VMs using the same size PD-SSD for your storage, you'll see a 15% hit on one of our popular workloads, like a simple key value writing and reading workload. If you optimize everything that you can, you can get those to be identical assuming you're using the same disks. But if you're not optimizing everything you can, you will see a hit. And it always depends on the workload too.

Participant 3: I think we have a 20% hit minimum. And I have one more question. Do you have a recommended pattern you use to roll out schema changes? So what we do is we roll out schema changes in a container script so that we can have the code rollout as well as the database rollout at the same time. The pattern we use is to do any schema change in the scripts and roll it out as code. What do you guys recommend?

Robinson: I assume you mean like Kubernetes configurations schema changes?

Participant 3: No, schema changes of your database.

Robinson: CockroachDB specifically has online schema upgrades. So you can just send the DDL statement to any of the nodes in the database and it will make the changes happen in the background while you can continue doing things to that. So I don't have specific advice. Our docs probably have more specific suggestions.

Participant 4: I was wondering in scenarios on-prem where you have more than a single database, how do you do sizing if you're based on local disks, and do local disks configuration support resizing if they need to take more space?

Robinson: I don't know if the local disk feature supports resizing. I'm sorry. I could check the docs and get back to you on that, but I'm not sure. As far as using multiple databases on the same machine, as long as you properly set resource constraints and ask for the right volume size off the disk, it should do a pretty good job of isolating things. I would be worried a bit about the I/O sharing. So I don't know how good of a job Kubernetes does with sharing I/O resources to disk. So you'd have to test that if you're wanting to seriously try doing that.

Participant 5: Thanks for the detail on the talk. You talked about the convenience of the shared cluster management with ops, and how does that apply to upgrading Cockroach and how you think about Cockroach lifecycle with Kubernetes semantics? Or do you always just do that at the Cockroach layer and not worry about that at the Kubernetes layer, if that makes sense? Does Cockroach independently update itself or do you use any of those container semantics to upgrade?

Robinson: Yes. I might be missing part of what you mean by the question.

Participant 6: If I wanted to upgrade Cockroach from 1.1 to 1.2, do I use any Kubernetes commands or does Cockroach do that above it?

Robinson: So we would just recommend you do that via Kubernetes. So do the normal rolling update where you go into the config file, change the image you use from 1.1 to 1.2 and let it do its thing in a rolling way and Cockroach will just...

Participant 6: That's been pretty stable so far, and that's worked because you didn't mention that, so thanks.

Robinson: Yes. That's worked in production. I haven't had any issues with it so far.

Robinson: There's always that person, he wants to be the last. I'm going to make my way over there and you will be last.

Participant 7: It seems the use of Kubernetes operators to manage databases and things is a popular topic nowadays. What are your thoughts on operators with regards to database providers providing those in Kubernetes and does that have a future?

Robinson: Sorry for not mentioning that. What I've seen operators mostly used for, or to clarify, operators are this construct where you basically write code that does extra work and tells Kubernetes what to do to properly manage your application. So you go write a Go program, you run it in your cluster and it sends commands to Kubernetes to tell it what to do with your database or with your application. I have mostly seen that being useful when whatever application you're trying to run doesn't work well with the default Kubernetes primitives. If what you want to run can run well as a StatefulSet and doesn't need any extra help, that's probably the best way to do it rather than just adding extra code that might have new bugs. You might introduce new problems. But operators have been a great tool for introducing applications to Kubernetes that couldn't have otherwise run there.

Participant 8: Have you ever seen situations where StatefulSets need to be scaled? And is there any approach for that?

Robinson: That's one of the great things about StatefulSets, is that you can scale them with a single command. And as long as your database supports, or your stateful system supports dynamically handling new members joining, you don't have to do anything. You just run kubectl scale to the new number of replicas. Depending on the application, now you might have to think about any manual commands you might want to run after that to properly integrate new members into the cluster.




Recorded at:

Feb 08, 2019