
A Kubernetes Operator for etcd



James Laverack gives an overview of etcd and why running it in Kubernetes is difficult. After outlining what an Operator is, he discusses ways of writing Operators, why they wrote their Operator the way they did, and talks through how it works for etcd.


James Laverack is a Solutions Engineer at Jetstack, and spends most of his time working directly with clients to help them get the most out of Kubernetes. Previously, he has worked as a Software Engineer in the Financial Technology space for a number of years. He has a passion for distributed systems, and has used Kubernetes to build complex financial applications.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.


Laverack: I'm here to talk about Kubernetes, operators, and etcd. A lot of what I'm going to be talking about is pretty generic to operators. It's really about what an operator is, what problems it can solve, why you would actually want to build one, and how you can build one. I'll be using etcd as an example, but this isn't really specific to etcd at all.


I work for a company called Jetstack. We are primarily a consultancy. We also provide training. We do open source work. We're most well-known for cert-manager, which is an operator for TLS certificates. I mostly work on the consulting side of things. That's where this story comes from. This is based on a consulting engagement we started at the end of last year. We were working with a company called Improbable. They're based here in London. They're a software company. They make software for massively multiplayer games. They came to us with this problem statement, "We need to run etcd in Kubernetes." They were running part of their platform in Kubernetes. They wanted etcd alongside of it to simplify their management story.


Let's take a bit of a step back and figure out why this is difficult at all. Why am I even talking about this? First of all, there's etcd itself. Etcd is famous for being the backing store for Kubernetes. Everything you put in Kubernetes is stored in there. That's where you mostly see people talking about it. This is not that use case. It's something else. The project itself is a completely generic, distributed key-value store. It's a CNCF-hosted project. It's been going for a number of years now, originally made by CoreOS.

Why it's Difficult to Put in Kubernetes

Why is this difficult to put in Kubernetes in the first place? Let's take a look at the traditional case. This is if you're running this on either bare metal, or VMs, or cloud VMs like EC2, or something like that, so no orchestration layer of any kind. In this, we've deployed three machines. We have three etcd instances running. They have a little bit of local disk each. They can contact each other. That's all you need to configure. Once you tell them about each other using their domain names, or hardcoded IP addresses if you have static IPs, they will find each other and they will use the Raft consensus protocol internally to elect a leader. Then from there, they can start serving requests. Client applications will usually connect to all of them, pick one, and do client-side load balancing. Or, you can just give them a single domain name with a bunch of A records, one per node, and do the load balancing that way. That's pretty simple.
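The traditional setup described above might look something like this on one of the three machines. The names and URLs are illustrative; the flags come from etcd's static bootstrapping documentation, and you'd run the equivalent on the other two machines with their own names and URLs:

```shell
etcd --name infra0 \
  --initial-advertise-peer-urls http://infra0.example.com:2380 \
  --listen-peer-urls http://infra0.example.com:2380 \
  --listen-client-urls http://infra0.example.com:2379 \
  --advertise-client-urls http://infra0.example.com:2379 \
  --initial-cluster infra0=http://infra0.example.com:2380,infra1=http://infra1.example.com:2380,infra2=http://infra2.example.com:2380 \
  --initial-cluster-state new
```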

What about a Stateful Set?

If you've used Kubernetes at all, and you're looking at this, the thing you're probably thinking is, we need some persistent disk. We need some persistent network identity. There's a stateful set. Kubernetes has a native component to handle this task. What's wrong with that? This is the example diagram from the Kubernetes Icons Set of a stateful set. We have pods which are running containers, which have our application in it. In this case, etcd. We have a service that's effectively providing the DNS resolution to this. It has a service name. It helps us provide static DNS names to all of the individual pods. We have those PVCs, Persistent Volume Claims, which then your cloud provider will go and then bind those to actual disk over on the side there. As a user, you don't really need to worry about it. They end up on a disk somewhere. The stateful set orchestrates that side of things. They'll make sure that they always get reattached to the same disk. They'll make sure they always get the same name. What's wrong with this? Why doesn't it work? If you just want a completely static cluster, it does work. It stands up fine. There's no particular issues immediately.

Let's talk about something slightly more advanced you might want to do with etcd, where this starts to fall apart. One feature it has is runtime reconfiguration. You can, while it's running, add or remove nodes from your etcd cluster. How you would do this is, you have to tell etcd about the new node first. Etcd has this internal knowledge of all its members because it's not designed to run with an orchestration layer. The first thing you have to do is actually tell it you're going to add a new one. Then you can bring the new one online, and it will join the cluster. If you don't do this, the new one will be rejected. It won't join the cluster. It won't participate in quorum. It won't perform consensus. You'll get a lot of errors. It's not a nice place to be. Scale down is exactly the same in reverse. The first thing you do, going back to our three-node cluster, is you tell it you're going to remove a peer. It will then unregister it. It will stop replicating to it. If it was the leader you removed, it will redo a leader election. Then you take it offline. That's pretty simple. This is encoded in the operations guide documentation for etcd. This is how they recommend that you do this.
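As a sketch of what that operations guide describes, using etcdctl (the member ID, names, and URLs here are illustrative):

```shell
# Scale up: register the new member first, then start the new node.
etcdctl member add infra3 --peer-urls=http://infra3.example.com:2380

# Scale down: unregister the member first, then take the node offline.
etcdctl member list                      # look up the member's hex ID
etcdctl member remove 8211f1d0f64f3269   # ID shown here is illustrative
```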

Let's go back to our stateful set, and think how we might implement some of these steps. If you scale up, we're going to add a new pod. Then before we turn it on, we need to do something to the etcd cluster. In Kubernetes, we have the concept of an Init container. You can add an extra container that will execute strictly before your application starts. It'll perform some logic. We also have the concept of a pre-stop hook. This is where, before Kubernetes stops your container, you can have it execute something. You can have it exec a script in a container or whatever. We can start doing this. Already, you can look and see, what happens if you have comms failures? What happens if you stand up and you can't contact the cluster? How do we determine whether this is the first time we've launched or not? What do we do if we have an error removing a peer? Do we block? Do we not let it shut down? What do we do? Then the real thing you start to realize is that this pre-stop hook is going to get executed every time we shut down one of these pods, not just when we scale down. If Kubernetes decides to reallocate the pod somewhere, because a node gets drained and it wants to put it on another machine, it will run that hook before it stops the container. We're going to be constantly resizing this cluster, even when we're not meaning to. That approach goes out the window. It's a complete disaster. Even if you could find another way of doing this, there is a whole slew of other issues that come out of it. Things that could go wrong. Things that you're going to have to handle and deal with. It starts to come with a lot of overhead. Potentially, it's going to be overhead on the operations team that's going to be managing this. They're going to have to know how this thing works. Understand how it could go wrong. Understand what to do to fix it when it inevitably does go wrong at 3 a.m.
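A sketch of what that naive stateful-set approach would look like. The image and script names are hypothetical; this is the shape of the idea, not working configuration:

```yaml
# Fragment of a stateful set's pod template. The init container would
# register the member before etcd starts; the pre-stop hook would
# deregister it on shutdown -- including shutdowns that are not scale-downs,
# which is exactly the problem described above.
initContainers:
  - name: register-member
    image: example/etcd-admin-scripts    # hypothetical image
    command: ["/scripts/member-add.sh"]  # hypothetical script
containers:
  - name: etcd
    image: quay.io/coreos/etcd:v3.4.3
    lifecycle:
      preStop:
        exec:
          command: ["/scripts/member-remove.sh"]  # hypothetical script
```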

We Need an Operator

What can we do better? We decided we need an operator. To briefly outline the definition of an operator in Kubernetes: if you go to the official documentation and you ask it what an operator is, it tells you that it can extend the API. Which is nice, but how does that help us? If you read down the page a little bit, there is a quote that captures the key aim: to take the knowledge of how a human operator expects the system to behave, deploy it, and react if there are problems. That's a bit more like it. That's our operations guide. We have pages of documentation from the etcd project telling us how to run this thing. We want to do that. Why don't we encode it?

An Operator Encodes Knowledge

That's what the operator really does here. It encodes operational knowledge of an existing application. This is just one use case for running an operator. People use these things to build completely cloud-native applications. This is slightly different. This is taking an application that was never designed to run in an orchestration system, that was never meant to run in Kubernetes, and making it work with the Kubernetes system. The operator concept has become pretty popular. There are a bunch of them out there. OperatorHub has been mentioned by some; it lists hundreds of these things. Cert-manager, made by Jetstack, is the one for TLS certificates. Strimzi is another good example, actually written in Java. That one is for Apache Kafka. There are a whole bunch of these things out there that are used to run complex applications in Kubernetes. One interesting tidbit is that when the operator concept was first introduced, back in November of 2016 in the blog post by CoreOS, they actually used etcd as the canonical example of why you might need an operator. Which of course raises the question: if there's already an etcd operator, why are we building one? The reality is that we looked at it and it didn't quite meet our production use case with Improbable. We decided the changes we needed were too big, so we decided to write something slightly different, with a slightly different focus. It's very interesting that we're going back to this use case again.

How to Actually Build an Operator

How do you actually construct one of these things? We know we want one, but how do you actually make one? The core thing that makes this possible is the custom resource definition. This is a part of Kubernetes that lets you tell Kubernetes there is a new kind of thing it knows about. You can specify the shape of it. You can specify a spec. You can have it do basic validation. This is the thing that really enables this. Once you specify a custom resource definition and load it into your cluster, it works just like a native resource. You can ask, in this case, kubectl to list the API resources available, and as well as the things that are built in, like deployments, and replicasets, and pods, it's listed my etcd cluster resource. It knows about it. This means all the tooling knows about it, too. Kubectl works perfectly well with it. If you're using GKE or something like that, their web console knows about it. If you're using one of the visual systems for interacting with Kubernetes, like VMware's Octant, that will also show it to you alongside everything else. All your existing tooling, GitOps, even things like OPA Gatekeeper, if you're using something like that to enforce policy, will work with these resources. That's pretty good.
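A minimal custom resource definition for an etcd cluster might look something like this. The group name and schema are illustrative, not the project's actual ones:

```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: etcdclusters.etcd.example.com
spec:
  group: etcd.example.com          # illustrative group
  names:
    kind: EtcdCluster
    plural: etcdclusters
    singular: etcdcluster
  scope: Namespaced
  versions:
    - name: v1alpha1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                replicas:          # basic validation, as described above
                  type: integer
                  minimum: 1
```

Once this is applied, `kubectl api-resources` and the rest of the tooling treat EtcdCluster like any built-in resource.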

Now that we've defined this, how do we actually build the operator to do anything? Just like everything else in Kubernetes, it's just a pod running in Kubernetes. That pod right there is our operator. We put it in its own namespace. We've got a deployment there to make sure it comes up. We give it a service account. I mention the service account because it's particularly important. It's what gives it the ability to do everything it needs to do. If you look at what's in there, we have permissions to look at etcd cluster resources. We have permissions to make the things we need in response: replicasets, services, whatever we need. In particular, there's watch right there. This is a great feature with operators. Instead of having to sit there and poll Kubernetes, asking it what's there and what it needs to do, it can get notified when things change. That lets you write really efficient, cacheable code. You create an etcd cluster and the operator will be woken up by Kubernetes and asked to do things.
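The permissions described above might be granted with something like this. The role name and API group are illustrative; the point is the watch verb alongside get/list, plus create/delete on the resources the operator makes in response:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: etcd-operator              # illustrative name
rules:
  - apiGroups: ["etcd.example.com"]          # our custom resources
    resources: ["etcdclusters", "etcdpeers"]
    verbs: ["get", "list", "watch", "update"]
  - apiGroups: ["apps"]                      # things we make in response
    resources: ["replicasets"]
    verbs: ["get", "list", "watch", "create", "delete"]
  - apiGroups: [""]
    resources: ["services", "persistentvolumeclaims"]
    verbs: ["get", "list", "watch", "create", "delete"]
```

A ClusterRoleBinding would then attach this role to the operator's service account.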

What's in that pod? What is the operator actually implemented in? It's the miracle of containers. It can be anything. You could write an operator using a bash script and curl if you wanted to. I wouldn't necessarily recommend it, but you could. This is where I start talking about the exact specifics of our project. For our team, we chose to do this in Go. This is for two reasons. Firstly, Go has a lot of really good ecosystem tools for building these things. Go is what Kubernetes itself is written in. It's natural. There will be a lot of support for it. The other, and probably the biggest reason, is really about our team. The team of people who worked on this was a bunch of us at Jetstack and a bunch of people at Improbable. We all knew Go. We were all familiar with it. We were all happy with it. It worked for us. Had that not been the case, if we were working with a company that likes Java, I would have had no problem writing this thing in Java. Pretty much, you can use whatever you want. There are lots of examples out there. There are operators built in Java, in Rust, whatever you need.

The other thing we used was kubebuilder. This is what I was talking about with the Go ecosystem. This is a project that can help you scaffold out and create a lot of the resources and manage these things in Go projects. It isn't the only thing like this. There are a bunch of them out there. The Operator SDK, by CoreOS and now Red Hat, part of the Operator Framework, is another very good example. Other languages have their own examples as well of things that really help you do all the things you need to do: manage your CRDs, manage your versioning, all this stuff. We chose to use kubebuilder, largely because we really liked the documentation. The kubebuilder book goes into a lot of detail about the why of how you want to do things. It's quite opinionated. That's something that we liked; we agreed with its opinions. We moved in that way. We also found the testing story really good.

Operator Logic

We decided that we are going to make it. We're going to build this thing. Make no mistake, this is an engineering effort. It took us a few months to build, mostly working full-time. It's not a small project that you do in one week. It's something you actually have to build and maintain like any other software project. Now that we've decided we're going to do it, and we know how we're going to do it, what are we actually going to make it do? The core of any operator is the reconciler loop. This is how Kubernetes itself works internally. The first thing the operator in this pod will do is ask the API server about these etcd cluster resources, and get woken up when they change. From that it will get the desired state of the world. You have told it that you want an etcd cluster. You want three replicas. You want this version. Then based on that, it will go and make that happen. It will make the underlying resources. It will do the things it needs to do to make that a reality, based on what's already there. This is the core loop. The idea is that if you change your definition, then it will update accordingly. If you fiddle with some of those things on the side, it will correct them for you. It'll put them back in the desired state. It is always moving the world towards the state as you told it you wanted it. This is how pretty much every operator works. Ours is actually fractionally different. It all goes back to that etcd internal state thing. We need to tell etcd about things. Etcd has its own view of what the world should look like: what things need to be there, and what peers it needs to know about. We built this into our design. We actually have an almost double loop. The first thing we do is take our desired state and then go talk to the etcd cluster that we're managing. We make sure that it's in line with what we want to have. If we need to add a new peer, the first thing we'll do is tell etcd about it. Once we've done that, we can go and take what etcd expects of the world and implement that. That means etcd is always the first to know. It means that we are moving with its expectations. We are implementing what the operations guide told us to do.
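The double loop can be sketched in Go. This is a toy model, not the project's actual code: a string slice stands in for etcd's member API, a map stands in for the Kubernetes API, and the type and function names are all invented for illustration.

```go
package main

import "fmt"

// DesiredCluster is what the etcd cluster resource asks for.
type DesiredCluster struct {
	Name     string
	Replicas int
}

// reconcileMembers is the first half of the loop: tell etcd (here, a
// slice standing in for its member list) about any peers it doesn't
// know of yet. Etcd is always the first to know.
func reconcileMembers(desired DesiredCluster, members []string) []string {
	for i := len(members); i < desired.Replicas; i++ {
		members = append(members, fmt.Sprintf("%s-%d", desired.Name, i))
	}
	return members
}

// reconcilePeers is the second half: create one peer resource per etcd
// member, so the Kubernetes side catches up with etcd's expectations.
func reconcilePeers(members []string, peers map[string]bool) []string {
	var created []string
	for _, m := range members {
		if !peers[m] {
			peers[m] = true
			created = append(created, m)
		}
	}
	return created
}

func main() {
	desired := DesiredCluster{Name: "my-etcd", Replicas: 3}
	members := reconcileMembers(desired, nil)
	created := reconcilePeers(members, map[string]bool{})
	fmt.Println(members) // [my-etcd-0 my-etcd-1 my-etcd-2]
	fmt.Println(created) // [my-etcd-0 my-etcd-1 my-etcd-2]
}
```

The key property is the ordering: the member list is updated before any peer resource exists, mirroring the operations guide's "tell etcd first" rule.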

In order to help with this, we actually split that logic out. We built two CRDs: one for a cluster and one for a peer. This is largely a code clarity thing. It means that we have two different code paths: one for managing clusters, which then creates the peers, and one for managing the peers. It also means that this is exposed in our API. As an administrator, if you want to know what's really going on, you can ask the API what etcd peers exist, and it can tell you. It means that when we expose status on individual peers, you can use kubectl to describe a peer, and you can get its status information right there in the API. This helps us leverage the first thing in that quote I got from the documentation: extending the API to help an administrator run this thing. That means that the actual reconciliation loop for us looks more like this. We see a desired cluster. We talk to etcd. We create the peers and the service, because the service is per cluster, not per peer. Then we respond to the peers existing, and go through this as a second loop. Then we create the replicasets, persistent volume claims, and things like this. It's a slightly different take on it, but that core reconciler loop of desired state against actual state is still there.

Design Considerations

Some design considerations. These are things that we thought about at the start, based on our experience with operators, based on my colleagues' experience with cert-manager, things like this. Things we really wanted to make sure we get right at the start. These are things that worked for us. These are not hard and fast rules. This is just our experience.

Be Level-Triggered

The first thing was to be level-triggered. This is a piece of terminology that has been co-opted from people working on low-level embedded signal systems, with voltages and things at that level. This is the idea that you should be level-triggered, not edge-triggered. I'm not going to explain exactly what this means for signal processing. For us, it means that you shouldn't react to changes in state; you need to react to the state itself. If you scale your cluster up from three to five, you shouldn't interpret that as "add two". You should interpret that as "I want five, I have three". That seems like a really subtle distinction. It seems like I've just said the same thing twice. It's important in certain failure conditions. If you lose connectivity, or your operator pod gets restarted, when it comes back it may have missed that scale-up event. It might not have known that it happened. Instead, it just has to look, without being told to look, notice that this thing is wrong, and fix it. That means: look at state, not at the change in state.
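The distinction can be made concrete with a tiny Go sketch (the function and return values are invented for illustration): the decision is a function of the absolute desired and observed counts, never of a "scale up by two" event that could have been missed while the operator was down.

```go
package main

import "fmt"

// nextAction is level-triggered: it derives what to do purely from the
// current desired and observed state, so a missed event changes nothing.
func nextAction(desired, observed int) string {
	switch {
	case observed < desired:
		return "add-peer"
	case observed > desired:
		return "remove-peer"
	default:
		return "nothing"
	}
}

func main() {
	fmt.Println(nextAction(5, 3)) // add-peer: I want five, I have three
	fmt.Println(nextAction(1, 3)) // remove-peer
	fmt.Println(nextAction(3, 3)) // nothing: converged
}
```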

Do One Thing at a Time

The other thing is to do one thing at a time. Our operator reconciler loop starts by deciding what to do. It does one thing, and then exits. Then next time around, when it gets invoked again, which is usually immediately, it will do the next thing. This seems weird. When you first do it, you think, I know I need to make three things. Why am I reconciling three times just to make my three things? Why don't I just make them? The answer is that it's more resilient this way. It makes it easier to understand what the code is doing. It makes it easier to test. It makes it easier to debug. It means that if you then go and change one of those things, and it re-reconciles, it will only do the one thing it needs to do. If you group these changes together into larger changes, then you might miss things if you're in a partial state, which can hurt you in certain failure conditions. It's not something that you'll notice necessarily, but when you need it, you'll know.
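A sketch of this shape in Go, again with invented names: each reconcile pass performs at most one mutation and then returns, relying on being requeued to do the next thing.

```go
package main

import "fmt"

// reconcileOnce makes at most one missing resource per invocation and
// reports whether it should be requeued to continue converging.
func reconcileOnce(want []string, have map[string]bool) (string, bool) {
	for _, name := range want {
		if !have[name] {
			have[name] = true // perform the single creation
			return name, true // did one thing; requeue for the next
		}
	}
	return "", false // converged; nothing left to do
}

func main() {
	want := []string{"service", "peer-0", "peer-1", "peer-2"}
	have := map[string]bool{}
	for {
		name, requeue := reconcileOnce(want, have)
		if !requeue {
			break
		}
		fmt.Println("created", name)
	}
}
```

Because each pass re-reads state before acting, an interruption between passes leaves nothing half-done that the next pass can't pick up.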

The Cache Might Lie to You

The other thing is caching behavior. One of the things that the Go tooling and kubebuilder give you is caching. You pull information from the API about what replicasets are already there, because you need to make one. Your cache might be out of date. You might get lied to. You might ask it, do I have a replicaset? The answer will be no. Then when you go to create it, you'll get an error telling you it already exists. This can happen. This isn't necessarily something to be afraid of. It just means that you have to make sure that all of your operations are reproducible. It means that when you create names for things, they have to be deterministic. It means you have to accept that sometimes you might try to create something and it's already there, because you already did it last time. That's ok. Just wait, do it again. Eventually, your cache will update, and you don't have to worry about it. The canonical example where this could be a problem is if you're making randomized names for peers. By the time your cache is up to date, you could have created five of them. Then you have to notice that you created five and scale it back down again. It's much easier to have a deterministic name, try to create it a second time, fail, and then wait for your cache to update.
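In Go, that pattern looks roughly like this. A map stands in for the API server, and the names are invented; the point is that with a deterministic name, an "already exists" error is safe to treat as success.

```go
package main

import (
	"errors"
	"fmt"
)

var errAlreadyExists = errors.New("already exists")

// create simulates the API server, which may already hold a resource
// the operator's stale cache doesn't know about.
func create(store map[string]bool, name string) error {
	if store[name] {
		return errAlreadyExists
	}
	store[name] = true
	return nil
}

// ensure uses a deterministic name and treats "already exists" as
// success: a previous reconcile pass simply got there first.
func ensure(store map[string]bool, name string) error {
	if err := create(store, name); err != nil && !errors.Is(err, errAlreadyExists) {
		return err
	}
	return nil
}

func main() {
	// The cache claimed my-etcd-0 was missing, but the server has it.
	store := map[string]bool{"my-etcd-0": true}
	fmt.Println(ensure(store, "my-etcd-0")) // <nil>: harmless retry
	fmt.Println(len(store))                 // 1: no duplicate created
}
```

With randomized names this same retry would have created a second, orphaned resource instead of failing harmlessly.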

Deploying an etcd Cluster

Let's actually talk through how this solves our etcd problem, going back to our etcd example. If you want to deploy it, you can create a resource like this. This is a really minimal example. You can specify more things in this spec. You can specify storage information. You can specify the version of etcd you want, things like this. In this simple example, I just said I want three of them, and I'll let the operator default the rest. This is what we get. We see a cluster resource that someone just created. We make a service, which is going to give network identity to all of our etcd peers. Then we'll create three peer resources. Those of you who remember my original diagram are wondering how we just did this, because the first thing we do is talk to etcd, and etcd isn't there yet. We haven't made it. How can we do this? We're in this state where we're trying to dial etcd and there's nothing, because we haven't made it. For this particular case, we have a bootstrap mode. If the operator sees that there is a cluster desiring peers, it hasn't made the peers yet, and it can't contact etcd, then it assumes it's bootstrapping. It assumes it needs to add more peers, and it will speculatively create them. This can go wrong. If you have a network outage, your operator pod might not be able to talk to etcd, even if it's already there, so this can fail. That's ok, because the operator will recover. If you accidentally create too many pods, eventually this theoretical network fault will heal. You can talk to etcd again, and even though your etcd is confused, because it has a bunch of things trying to talk to it that it doesn't know about and doesn't expect, at that point the operator will reconcile the state of the world to etcd's expectation and get rid of them. This will eventually heal, which is why we're comfortable having this bootstrap mode. The other thing to bear in mind is that we never delete things in bootstrap mode. We only ever add peers, which means we can't delete data by accident. You can't have it scale down and accidentally drop all your pods because it thought that was all it needed.
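The minimal resource described above might look something like this. The API group and version are illustrative stand-ins, not the project's actual ones:

```yaml
apiVersion: etcd.example.com/v1alpha1   # group/version are illustrative
kind: EtcdCluster
metadata:
  name: my-etcd
spec:
  replicas: 3        # everything else is left for the operator to default
```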

These peer resources look a bit like this. It's pretty simple. There are two things to be aware of. The first is the initialClusterState there. That's something that can be either new or existing. This is what etcd wants. There is a configuration flag that you give to etcd to tell it whether it should try to bootstrap a cluster, or whether it should join an existing one. We're really just fulfilling what the operations guide told us to do. This is what it told us to provide. The other is that we also need to tell it about the other peers in this cluster. Here, we've actually done something a little bit interesting. We've predicted what their names are going to be, because we haven't made any of these things yet. This is the point where we've created the etcd peer resources, but the underlying pods, and replicasets, and everything else don't exist yet. We're predicting, based on the service that we've already created, what each DNS name is going to end up being. This is achieved by the hostname field on a pod. If you've set that, it will get a DNS name regardless of its pod name. That's how we do that.
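A sketch of the shape of such a peer resource. The field names and API group here are illustrative, not the operator's actual schema:

```yaml
apiVersion: etcd.example.com/v1alpha1   # group/version are illustrative
kind: EtcdPeer
metadata:
  name: my-etcd-0
spec:
  clusterName: my-etcd
  initialClusterState: new      # "existing" when joining a live cluster
  initialCluster:               # predicted DNS names, via the service
    - my-etcd-0.my-etcd
    - my-etcd-1.my-etcd
    - my-etcd-2.my-etcd
```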

What does the peer create? We make a replicaset. We get the pod with its hostname set, because we set it on the pod template in the replicaset. We create a PVC. To clarify, that's one replicaset per peer, not one replicaset for all of the individual etcds. The reason we're using a replicaset is because we had concerns about HA constraints in production. If our operator directly managed pods, it would have to be alive in order to bring a cluster back up in a failure condition. We didn't want that. Instead, we use a replicaset. The operator doesn't even have to be running and the cluster will heal itself, because Kubernetes will restart the pod for us, because we have a replicaset to hold it. You don't have to do this. You can directly manage pods if you want. This was a take-it-or-leave-it decision for us. It is one thing we do slightly differently. The takeaway is that you can use these higher-order Kubernetes objects with your operator; you don't have to go all the way down to managing pods and things like that.

I've drawn the line to the PVC slightly differently; that's because we don't set an owner reference. Normally, with Kubernetes resources, you can tell it that one resource is the owner of another. When you delete the parent, it will delete the children too. We wanted to avoid the case where, if you accidentally, or someone maliciously, did a kubectl delete of the cluster resource for my etcd, it would delete everything. We didn't want it to drop the data. This means that if you do that, the PVCs will be left behind. Your data will still be there. You can recover from doing that.

Now that we've got all of this, what does it deploy? If you just deployed that YAML I showed you, you'll get this. You will get three pods, each running etcd, with hostnames like that, with a PVC each and a PV each. That's the operator's view of what the world is. This is pretty much exactly what that on-prem, traditional-VM slide I had at the beginning looked like, just with "pod" written on it rather than "machine". That's pretty good.

Scale Up

Let's go into the things that were difficult before, what was hard. Scale up was awkward. We could have done it, but it was awkward. How do we do it now? We edit our resource. We tell it that instead of wanting three of these things, we want five of these things. This could just be a kubectl edit. This could be a GitOps pipeline redeploying it because you've changed it. This could be through kubectl apply. We also implement the scale subresource. You can use tooling that understands this concept of scaling to scale these things. You can just say, kubectl scale, I want five of them now. It will understand that and it will work in exactly the same way.
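Because the scale subresource is implemented, generic tooling works. The resource name here follows the illustrative examples above:

```shell
# Either of these triggers the same reconciliation:
kubectl edit etcdcluster my-etcd               # change spec.replicas to 5
kubectl scale etcdcluster my-etcd --replicas=5
```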

The first thing we do is reconcile to etcd. We can contact etcd, so we're not in bootstrap mode. We go to it and we tell it we're going to create a new peer. Again, we've predicted its name. We know we're going to call it -3, because that's the next ordinal number. It is a deterministic name. We tell etcd this peer is going to come into existence. Then we create a new peer resource, because now we reconcile etcd against the world, and etcd expects a my-etcd-3. There isn't one, so we make one for it. This now looks like this. The only real difference is that initialClusterState is now existing. We know this because we populated it from etcd. We know we could talk to etcd at the time, so the operator knows this must be an existing cluster. We just give it the names of all the other peers so it can talk to them and bootstrap with them. If you draw everything on one slide, you get this. You get our namespace. We have our cluster resources, and our peer resources, and all of our nodes, and replicasets, and PVCs, and everything else we need. Then of course it would go on and do the fifth one.

Scale Down

What about scale down? This was a big problem before. This was really painful. How do we do this? If we go back to the three-node case, and we say scale it down to one. We go to etcd, and we tell it we want to remove one. Then we go to remove it. There's a problem. We don't want to delete the PVC. If you just delete the peer resource, it'll clean up the replicaset and the pods, but the PVC, because we don't set its owner reference, isn't going to get cleaned up. This is an issue because if you scale down and then scale up again, because the PVC is deterministically named, it'll use the same PVC. Which means that you could scale down, and a few weeks or a month later scale back up on the same cluster, and pick up stale data. A property of etcd is that if it already has a data directory, it ignores bootstrap instructions, so this new etcd you've created will be completely stale. We take care of this case in particular, in response to a scale down, when the operator has intentionally decided to scale something down, as distinct from just deleting the resource. Before it deletes the peer, it will attach a finalizer. This is a hook you can attach to any resource in Kubernetes to tell it to do something before it gets deleted. Then we delete it. Then Kubernetes calls back in to us, to ask us to do the thing that we said we were going to do. At that point, we delete the PVC. We clean it up. Then we get rid of it. Then we'll get rid of number one as well. Then we'll be down to our desired state of having only one replica.
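On the peer resource, the finalizer mechanism looks roughly like this (the group and finalizer name are illustrative):

```yaml
apiVersion: etcd.example.com/v1alpha1   # illustrative
kind: EtcdPeer
metadata:
  name: my-etcd-2
  finalizers:
    - etcd.example.com/pvc-cleanup   # Kubernetes won't finish the delete
                                     # until the operator removes the PVC
                                     # and then removes this finalizer
```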

Other Features

I've covered those features mostly with worked examples. This operator does a whole bunch of other stuff too. We have version upgrade. You can specify a version in the spec, and it will do a rolling upgrade. You can do backup. You can tell it to back up your cluster, while it's running, into a cloud bucket. You can do restore. You can't do a restore in place; you need to create a new cluster, again following the operations guide. You create a restore resource and it will go and pre-populate the PVCs for you. Then it will create an etcd cluster on top of them and make sure it comes up correctly. These are all just implementing what etcd told us to do in their documentation.


The last thing I really wanted to touch on was testing. I mentioned it earlier when I talked about kubebuilder and what the testing story was like. It looks like this. Everything runs in our Go process, on your laptop or on your CI node. That was a big driving factor for us. We wanted all these tests to run on a laptop as well as in CI. We didn't want to be tied into needing a GCP cluster, a GKE cluster, or something like that in order to run our tests. What kubebuilder's default test harness does is stand up the API server. It'll stand up, ironically, an etcd, in order to back our API. This is not the etcd under test. This is just an etcd to back the API server. It just downloads the binaries and runs them. There are no pods. There's no Docker. There's no container runtime. There's nothing. You don't get to actually run things. You can create resources. You can watch them, and listen to them, and do all these things. In that code, our controller loop is running. Then we go in and test. We're going to create an etcd peer. We go [inaudible 00:31:23] API in response.

We created a replicaset. We created a PVC. We can assert on their properties; we can assert that we made them. We can test what happens if we start deleting stuff, or changing things, and how the controller responds. Of course, part of our reconciliation logic is going out to etcd itself. We end up mocking that: we have a little stub for the etcd client, so we can pretend that etcd is behaving, or pretend that it can't be contacted, in order to trigger bootstrap mode or non-bootstrap mode and cover both behaviors. This is great. These tests are just a little bit heavier than unit tests. They're not quite as fast, because they have to actually launch the API server and etcd binaries, but those are pretty efficient, so they don't take very long.

The last piece of this is a real end-to-end test. We actually run this and actually stand up etcd pods. For that, we use kind. This is another CNCF SIG hosted project. It stands for Kubernetes in Docker. People can and have done entire talks on how this thing works. It's a really interesting project. I recommend you check out the documentation. The really short version, for our purposes, is just that it lets you stand up Kubernetes on any host that has Docker. You don't need anything beforehand. These things are small. They come up in a few minutes and are entirely reproducible. It's perfect for our testing. We can run these on CI nodes. We can run these on laptops. This is great.

Putting this all together, we have this. This is everything we do when you run an end-to-end test. The first thing we do is create a kind cluster. That stands up the control plane and everything else we need for a whole Kubernetes environment. Then we run docker build to build the image that actually has our operator inside it. Then we load that image into the kind cluster, deploy the operator, and install our CRDs and everything else. Then we can deploy an etcd cluster, step back, and just watch it bootstrap everything in the way I described. Then we can start poking it and asserting on it. Because it is actually running real containers in actual pods, it's actually running etcd and talking to it. We're verifying that our communication with etcd works as we expect, and that our reconcile loop does the correct thing in the correct cases. We can go in and start blowing things up and removing things to make sure it all works. This is how we handle the end-to-end testing part of building this thing, to make sure it does actually work.

Lessons Learned

What did we learn from doing all of this? What are our key takeaways from the process of implementing this operator? Operators provide value for applications with complex runbooks. This was complicated; the operations guide is quite long. That meant there was value in writing code to implement it for us. We learned that operators work with the existing tooling. If you're using GitOps to deploy a bunch of YAML files, or a Helm chart, it will still work. If you're using Gatekeeper to enforce security policy, it will work with those CRDs too. We learned that you can use any stack. Go worked for us; that was a result of our team, our experience, what we were comfortable with, and what we wanted to do. If you're a Java company, write this in Java. Or Python: write it in Python. You don't have to use Go, and you don't have to use that tooling. It's completely agnostic. We learned that you can end-to-end test operators with kind. I can do this on my laptop. My laptop isn't that powerful, but it can still run it. We can do it in CI. It is reproducible testing between a local development environment and a remote one. The operator itself is MIT licensed and available on GitHub.

Questions and Answers

Participant 1: I assume when you talk about scale, that means you can plug it in to [inaudible 00:35:58] scaling, and it's just going to work.

Laverack: You could, but you probably don't want to.

Participant 1: Do I get that [inaudible 00:36:06]?

Laverack: It's really a property of etcd that you don't scale up to increase load, because of the way etcd works. You scale to increase resiliency. If you scale up, actually, your write performance will get worse, and your read performance will get only fractionally better. If you could autoscale on some dynamic resiliency metric, you probably want to do that. You completely could if you wanted to.

Participant 1: How much does it depend upon introspection in that platform? It sounds like you're talking to etcd and you're telling it about stuff that's happening. Does this whole thing depend upon that or would you recommend to get away with that space that doesn't have any bounds set, or less bounds set?

Laverack: We did it in this case because a load of the problems we had doing this without it were that etcd's internal state would get confused and wouldn't match the world. That's why we went down that path. If your database didn't do that, and would accept whatever its environment was, or dynamically figure it out, you probably don't need to. It depends on the use case. Most operators don't really do this. They don't need to, so they don't bother.

Participant 2: I was quite interested in your concept of doing only one unit of work per reconcile loop: if you're adding five nodes, you add one, go on to the next reconcile loop, and add another. Could you go into a bit more detail about why you decided to go that way? What were the major benefits versus the drawbacks? In OperatorHub, for instance, we don't do that. We do unit operation after unit operation. It would be quite interesting to hear your thinking on that.

Laverack: The canonical example I was talking about is when you need to create two things. Let's say you need to create a service and a pod. You might have a piece of logic that goes: if the service does not exist, then create the service and create the pod. Of course, if you then delete the pod, that branch never reruns, so the pod is never recreated. You have to account for that, so you end up with logic that says: if any of these things do not exist, try to create all of them, and of course many of them may already exist, so you have to handle that too. It just worked out to be a little bit easier to do one thing at a time. Typically, you're going to run multiple times anyway, because when we create most of these resources, we set owner references. For example, when the cluster controller creates the service, it sets itself as the owner. When anything changes in something you own, you get woken up again by that watch. So what actually happens is we create a service, and as a result of that service being created, we get invoked again immediately, at which point we start doing the next thing. Whereas if you created 10 things in one pass, you'd then get woken up 10 more times anyway and re-reconcile each time. It's actually more efficient in most of these cases to do it one by one and just let the reconcile loop re-run.
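The one-unit-of-work pattern can be sketched in a few lines. This is a toy model under stated assumptions: `world` stands in for the cluster state, "creating" is just setting a map entry, and the outer loop stands in for the watch events that would re-trigger a real controller.

```go
package main

import "fmt"

// world models cluster state: resource name -> exists.
type world map[string]bool

// reconcile creates at most one missing resource per pass and returns its
// name, or "" when everything already exists. A real controller would
// return here and rely on the watch on owner-referenced objects to wake
// it up again.
func reconcile(w world, desired []string) string {
	for _, name := range desired {
		if !w[name] {
			w[name] = true // create it, then stop
			return name
		}
	}
	return "" // settled: nothing left to do
}

func main() {
	w := world{}
	desired := []string{"service", "pod"}
	// Each create event wakes the controller again, so loop until settled.
	for {
		created := reconcile(w, desired)
		if created == "" {
			break
		}
		fmt.Println("created", created)
	}
}
```

The pass-per-resource shape means deleting any one resource later still gets it recreated: the next reconcile finds exactly that gap and fills it, without any "create everything" branch.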

Participant 3: How much effort did it take your team to create this operator? You mentioned that it is a big project. It's not a small part. You didn't mention any scales.

Laverack: The team was probably about three or four people, not quite full-time; about 80% of their time for a few months. This was over Christmas, and a load of us were away for some of that time, so not much happened then. In total, it's probably a few people for a couple of months to get it to this stage, which we consider to be pretty stable. It's not completely small, but we think it was a price worth paying for making the operations side of things easier. It's definitely not something you could bang out in a week. You could bang out a prototype; I think the original prototype was written overnight, in about three hours, by one guy. Of course, that's just a prototype. It doesn't have documentation, or testing, or anything like that. To actually bring it to a production-ready state took much longer, as is true for most software projects.

Moderator: I presume you wrote this because there wasn't an operator already available? Maybe it's worth sharing that there are hundreds available. This etcd one that you built, is it available to people in this room?

Laverack: Yes, there was an operator available. The CoreOS one is still out there, but it didn't quite meet our use case, which is why we decided to go a slightly different way. We evaluated the idea of contributing upstream and making the changes that way, but we realized that in order to do everything we wanted, we would have to change basically the entire codebase anyway, so we decided just to take this approach. The code is available on GitHub if you want to take a look. It is MIT licensed. You can use it if you want to, and we welcome contributions, all sorts of stuff.

I mentioned OperatorHub. I don't think we're on there yet; I think we need to talk to them about it. There are hundreds of these things out there. Lots of databases, in particular, have them. CockroachDB has one. There's one for CouchDB, I think, and I think there are three for Postgres, with various different focuses. They're very popular for databases, just because databases require this extra, special stateful management.

Participant 4: Were you doing this for the first time? Is this something you were able to just work out, or did you have experience doing operators already?

Laverack: Me personally?

Participant 4: Your team.

Laverack: Jetstack have done operators before. One of my colleagues, James Munnelly wrote cert-manager, and that's an operator. He's been working in this space for some time, a few years now. I myself had not written an operator before. A bunch of my colleagues had done so. We did have experience in the team to draw on.

Participant 4: Relatively easy to get into it?

Laverack: Yes. I mentioned the kubebuilder documentation, the kubebuilder book. That was pretty good. It does spend a lot of time talking about why we do certain things, because it has all these opinions. It goes into some detail about why the authors of kubebuilder think you should do things in this way. That helped quite a lot.

Participant 5: Because obviously you guys actually you've got state, what happens especially early in the project when you have to start upgrading, adding fields to your CRD? That's going to get quite painful, I imagine.

Participant 5: Yes, we got stung by it.

Laverack: If you looked, our current API version is v1alpha1, which means we're comfortable making breaking changes. We don't really want to, but we can, and we have. Once you stabilize into beta, you can't really do that. It does get easier, though. Kubernetes has a feature called a conversion webhook: you can register a piece of logic that runs whenever you create or update something, and it can run a version conversion for you. If you've only moved a field, or made certain categories of change, you can automatically migrate those forward for your users. Sometimes you can't do that, though. I think it's actually happened in cert-manager. I'm not involved in that project, but of course I talk to James, who is. They had a change recently where they had to make a breaking change, because they changed something about their API which wasn't convertible. I think they split a CRD out. They had to go through some documentation, and they had to notify users. It can be difficult, especially if you have a lot of users out there.
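The "moved a field" case is the easy, mechanical one. A toy illustration of that kind of migration, with entirely hypothetical types (a real webhook would implement this against the Kubernetes ConversionReview API, or controller-runtime's Hub/Convertible interfaces):

```go
package main

import "fmt"

// ClusterV1Alpha1 is a hypothetical old API version where the field was
// called Size.
type ClusterV1Alpha1 struct {
	Size int
}

// ClusterV1Alpha2 renames the same field to Replicas; the meaning is
// unchanged, so conversion is purely mechanical.
type ClusterV1Alpha2 struct {
	Replicas int
}

// convert is the kind of logic a conversion webhook runs for you on every
// create or update, so stored objects migrate forward transparently.
func convert(in ClusterV1Alpha1) ClusterV1Alpha2 {
	return ClusterV1Alpha2{Replicas: in.Size}
}

func main() {
	out := convert(ClusterV1Alpha1{Size: 3})
	fmt.Println(out.Replicas) // 3
}
```

The breaking cases Laverack mentions are exactly those where no such function exists: when the new shape carries information the old one didn't, there is nothing mechanical to copy across, and users have to migrate by hand.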

Participant 5: It's good to know we're not the only ones there.

Laverack: It's not a perfectly solved problem, certainly. The conversion webhooks help.




Recorded at:

Oct 13, 2020