Kubernetes Operators in Depth

Sep 25, 2020 21 min read

Key Takeaways

The Kubernetes API itself empowers cloud native adoption by providing a single integration point for all cloud resources.
There are frameworks and libraries to streamline writing an operator. These exist in multiple languages, but the Go ecosystem is the most mature.
You can make an operator for software that isn't your own. A devops team might do this to help manage a database or other external product.
The hard part isn't the operator itself, but understanding what actions it needs to take.

Abstract: Operators have been an important part of the Kubernetes ecosystem for a number of years. By moving the administration surface into the Kubernetes API they facilitate a "single pane of glass" experience. For developers looking to streamline their Kubernetes-native applications, or devops practitioners looking to reduce the complexity of existing systems, operators can be an attractive proposition. But how might you construct an operator from scratch?

Operators in Depth

What is an Operator?

Operators are everywhere today. Databases, cloud native projects, anything that is complex to deploy or maintain on Kubernetes is getting one. First introduced by CoreOS in 2016, they encapsulate the idea of moving operational concerns into software. Instead of runbooks and other documentation, the operator performs actions automatically. For example, an operator can deploy instances of a database, upgrade database versions, or perform backups. These systems can then be tested, and be relied on to react faster than a human engineer could.

Operators also move the configuration of tools into the Kubernetes API by extending it with Custom Resource Definitions. This means that Kubernetes itself becomes a "single pane of glass". This lets devops practitioners take advantage of the rich ecosystem of tools built around Kubernetes API resources to administer and monitor their deployed applications:

Change authorization and authentication with Kubernetes' built-in role-based access control (RBAC).
Reproducible deploys and code review for production changes with "git ops".
Policy enforcement over custom resources with security tools based on Open Policy Agent (OPA).
Streamline describing deployments with tools like Helm, Kustomize, ksonnet, and Terraform.

This approach can also ensure parity between production, testing, and development environments. If each is a Kubernetes cluster, then an operator can be used to deploy the same configuration in each.

Why would you want to build one?

There are many reasons to build an operator from scratch. Typically it's either a development team creating a first-party operator for their product, or a devops team looking to automate the management of third-party software. Either way, the development process starts with identifying which cases the operator should manage.

At their most basic, operators handle deployment. Creating a database in response to an API resource could be as simple as kubectl apply. But this is little better than the built-in Kubernetes resources such as StatefulSets or Deployments. Where operators begin to provide value is with more complex operations. What if you wanted to scale your database?

With a StatefulSet you could perform kubectl scale statefulset my-db --replicas 3, and you would get three instances. But what if those instances require different configuration? Do you need to specify one instance to be the primary, and the others replicas? What if there are setup steps needed before adding a new replica? In this case an operator can configure these settings with an understanding of the specific application.

More advanced operators can handle features like automatic scaling in response to load, backup and restore, integration with metric systems like Prometheus, and even failure detection and automatic tuning in response to usage patterns. Any operation that has a traditional "runbook" documentation can be automated, tested, and depended on to respond automatically.

The system to be managed doesn't even need to be in Kubernetes to benefit from an operator. For example, major cloud providers such as Amazon Web Services, Microsoft Azure, and Google Cloud provide Kubernetes operators to manage other cloud resources, such as object storage. This allows users to configure cloud resources in the same way they configure Kubernetes applications. An operations team may take the same approach with any other resources, using an operator to manage anything from third-party software services via an API, to hardware.

An example operator

For this article we'll be focusing on etcd-cluster-operator. This is an operator I contributed to with a number of colleagues that manages etcd inside of Kubernetes. This article isn't designed to introduce that operator, or etcd itself, so I won't dwell on the finer details of etcd operations beyond what you need to know to understand what the operator is doing.

In short, etcd is a distributed key-value datastore. It's capable of managing its own stability so long as:

Each etcd instance has an isolated failure domain for compute, networking, and storage.
Each etcd instance has a unique network name.
Each etcd instance can reach all of the others on the network.
Each etcd instance is told about all the others.

In addition:

Growing or shrinking an etcd cluster requires specific operations using the etcd management API to announce changes to the cluster before instances are added/removed.
Backups can be taken by using a "snapshot" endpoint on etcd's management API. Hit it over gRPC and you get a backup file streamed back to you.
Restores are done by using an etcdctl tool on a backup file and on the data directory of an etcd host. Easy on a real machine, but requires some coordination on Kubernetes.

As you can see, there's more to this management than what a Kubernetes StatefulSet could do normally, so we turn to an operator. We won't be discussing the exact mechanisms that the etcd-cluster-operator uses to solve these problems in-depth, but we will be referring to this operator as an example for the rest of the article.

Operator Anatomy

An operator consists of two things:

One or more Kubernetes custom resource definitions, or CRDs. These describe to Kubernetes a new kind of resource, including what fields it should have. There may be multiple, for example etcd-cluster-operator has both EtcdCluster and EtcdPeer to encapsulate different concepts.
A running piece of software that reads the custom resources, and acts in response.

Typically the operator is containerised and deployed in the Kubernetes cluster it's going to provide services to, usually with a simple Deployment resource. In theory the operator software itself can run anywhere so long as it can communicate with the cluster's Kubernetes API. But it typically is easier to run the operator in the cluster it's managing. Usually this is in a custom Namespace to separate the operator from other resources.

If we're running the operator using this method, we need a few more things:

A container image, available to the cluster, containing the operator executable.
A Namespace.
A ServiceAccount for the operator, granting permissions to read the custom resources, and write any resources it needs to manage (e.g., Pods).
A Deployment for the operator container.
ClusterRoleBinding and ClusterRole resources, binding to the ServiceAccount above.
Webhook configurations.

We'll discuss the permissions model and webhooks in detail later.

Software and Tools

The first question is language and ecosystem. In theory almost any language capable of making HTTP calls can be used to make an operator. Assuming we deploy in the same cluster as our resources, we just need to be able to run in a container on the architecture that the cluster provides. Typically this is linux/x86_64, which is what etcd-cluster-operator targets, but there is nothing preventing an operator being built for arm64 or other architectures, or for Windows containers.

The Go language is generally considered to have the most mature tooling. controller-runtime, which builds on the same client libraries that the controllers in core Kubernetes use, is available as a standalone library. In addition, projects such as Kubebuilder and Operator SDK are built on top of controller-runtime, and aim to provide a streamlined development experience.

Outside of Go, languages such as Java, Rust, Python, and others have tooling or projects to either connect with the Kubernetes API in general, or specifically to build operators. These projects are at varying levels of maturity and support.

The other option is to interact with the Kubernetes API directly over HTTP. This requires the most heavy lifting, but allows a team to use whatever language they are most comfortable with.

Ultimately, this choice is up to the team that will be producing and maintaining the operator. If the team is already comfortable with Go, then the wealth of Go tooling makes it the obvious choice. If the team isn't already using Go, then it's a tradeoff between more mature ecosystem tools at the cost of the learning curve and ongoing training to use Go, and a less mature ecosystem but familiarity with the underlying language.

For etcd-cluster-operator the team was already well versed in Go, and so it was an obvious choice for us. We also chose to use Kubebuilder over Operator SDK, but only because of existing familiarity. Our target platform was linux/x86_64, but Go can be built for other platforms if that is ever required.

Custom Resources and Desired State

For our etcd operator we created a custom resource definition named EtcdCluster. Once the CRD is installed, users can create EtcdCluster resources. An EtcdCluster resource, at the highest level, describes the desire for an etcd cluster to exist and gives its configuration.

apiVersion: etcd.improbable.io/v1alpha1
kind: EtcdCluster
metadata:
  name: my-first-etcd-cluster
spec:
  replicas: 3
  version: 3.2.28

The apiVersion string indicates which version of the API this is, in this case v1alpha1. The kind declares this to be an EtcdCluster. Like many other kinds of resources we have a metadata key which must contain a name and may also contain a namespace, labels, annotations, and other standard items. This allows EtcdCluster resources to be treated like any other resource in Kubernetes. For example, a label could be used to identify which team is responsible for a cluster, and then these clusters could be searched for by kubectl get etcdcluster -l team=foo, just like you could with any standard resource.

The spec field is where the operational information about this etcd cluster exists. There are a number of supported fields, but here we express only the most basic. The version field describes which exact version of etcd should be deployed, and the replicas field describes how many instances should exist.

There's also a status field, not visible in the example, which the operator will update to describe the current state of the cluster. The use of spec and status fields is standard in Kubernetes APIs, and integrates well with other resources and tools.

Because we're using Kubebuilder, we get some help generating these custom resource definitions. Kubebuilder has us write Go structs which define our spec and status fields:

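// EtcdClusterSpec is the desired state of an etcd cluster.
type EtcdClusterSpec struct {
    // Version is the exact etcd version to deploy.
    Version     string               `json:"version"`
    // Replicas is the number of etcd instances that should exist.
    Replicas    *int32               `json:"replicas"`
    // Storage and PodTemplate are optional and omitted when unset.
    Storage     *EtcdPeerStorage     `json:"storage,omitempty"`
    PodTemplate *EtcdPodTemplateSpec `json:"podTemplate,omitempty"`
}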

From this Go struct, and a similar one for the status, Kubebuilder provides tools to generate our custom resource definition, and implements the hard work of handling these resources. This leaves us only needing to write code to handle the reconciliation loop.
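To make the link between the struct tags and the resource fields concrete, here is a minimal sketch that serializes a spec with Go's standard encoding/json package. The two empty stand-in types and the encodeSpec helper are assumptions added so the example compiles on its own; they are not part of the operator.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// EtcdPeerStorage and EtcdPodTemplateSpec are empty stand-ins (assumptions)
// so that this sketch compiles by itself; the real types carry more fields.
type EtcdPeerStorage struct{}
type EtcdPodTemplateSpec struct{}

// EtcdClusterSpec repeats the struct from the article.
type EtcdClusterSpec struct {
	Version     string               `json:"version"`
	Replicas    *int32               `json:"replicas"`
	Storage     *EtcdPeerStorage     `json:"storage,omitempty"`
	PodTemplate *EtcdPodTemplateSpec `json:"podTemplate,omitempty"`
}

// encodeSpec (hypothetical helper) serializes a spec as the struct tags dictate.
func encodeSpec(spec EtcdClusterSpec) (string, error) {
	b, err := json.Marshal(spec)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	replicas := int32(3)
	out, err := encodeSpec(EtcdClusterSpec{Version: "3.2.28", Replicas: &replicas})
	if err != nil {
		panic(err)
	}
	// Unset optional fields are left out because of omitempty.
	fmt.Println(out) // prints {"version":"3.2.28","replicas":3}
}
```

The field names in the output match the keys under spec in the YAML example, which is why the same struct can drive both the generated definition and the code that reads the resource.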

Other languages may vary in their support for doing this. If you're using a framework designed for operators then this may be generated for you; for example, the Rust library kube-derive works in a similar way. If a team is using the Kubernetes API directly then they will have to write the CRD and the code to parse that data separately.

Reconciler Loops

Now that we have a way to describe an etcd cluster, we can build the operator to manage the resources that will implement that. An operator can function in any way, but almost all operators use the controller pattern.

A controller is a simple software loop, often called the "reconciler loop", that performs the following logic:

Observe desired state.
Observe the current state of the managed resources.
Take action to bring the managed resources in-line with desired state.

For an operator in Kubernetes, the desired state is the spec field of a resource (an EtcdCluster in our example). Our managed resources can be anything, inside or outside of the cluster. In our example we'll be creating other Kubernetes resources such as ReplicaSets, PersistentVolumeClaims, and Services.

For etcd in particular we'll also be contacting the etcd process directly to fetch its status from the management API. This 'non-Kubernetes' access will require a little bit of care, because it can be subject to disruptions in network access which might not mean that the service is down. This means we can't use the inability to connect to etcd as a signal that etcd is not running (if we did, we could make a network outage worse by restarting working etcd instances).

In general when communicating with services that are not the Kubernetes API, it's important to consider what the availability or consistency guarantees are. For etcd we know that if we get an answer it's strongly consistent, but other systems may not behave in this way. It's important to avoid making an outage worse by acting incorrectly as a result of out-of-date information.
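One way to encode this caution is to keep "could not reach the service" separate from "the service reported a problem". The sketch below is an assumption about how that might look, not the operator's actual code: the Health type, the ErrUnreachable sentinel, and the restart decision are all hypothetical names chosen for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// Health is a three-valued result: Unknown is deliberately distinct from Unhealthy.
type Health int

const (
	Healthy Health = iota
	Unhealthy
	Unknown
)

// ErrUnreachable is a hypothetical sentinel for a failed connection attempt.
var ErrUnreachable = errors.New("etcd unreachable")

// classify turns the outcome of a status call into a Health value.
// Any error means we learned nothing, so the result is Unknown.
func classify(reportedHealthy bool, err error) Health {
	if err != nil {
		return Unknown
	}
	if reportedHealthy {
		return Healthy
	}
	return Unhealthy
}

// shouldRestart only returns true on positive evidence of a problem.
// A network failure never triggers a disruptive action.
func shouldRestart(h Health) bool {
	return h == Unhealthy
}

func main() {
	fmt.Println(shouldRestart(classify(false, ErrUnreachable))) // prints false
	fmt.Println(shouldRestart(classify(false, nil)))            // prints true
}
```

With this shape, a network partition leaves working instances alone, and the operator simply tries again on a later reconcile.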

Features of Controllers

The easiest design for a controller would be to just re-run the reconciler loop periodically, say every 30 seconds. This would work, but has many drawbacks. For example, it must be possible to detect if the loop is still running from last time, so that two loops don't run at the same time. In addition this means a full scan of Kubernetes for the relevant resources every 30 seconds, then for every instance of an EtcdCluster you need to run the reconcile function which lists relevant pods and other resources. This approach leads to a large amount of load on the Kubernetes API.

This also encourages a very "procedural" approach, where, because it could be a long time before the next reconcile, each loop attempts to do as much as possible, for example by creating multiple resources in one go. This can lead to a complex state where the operator needs to perform many checks to know what to do, and makes bugs more likely.

To combat this, controllers implement a number of features:

Kubernetes API watches.
API caching.
Update batching.

All of these together make it efficient to do less in each loop, as the cost of running a single loop and the time you have to wait is reduced. As a result, the complexity of the reconcile logic can be reduced.

API Watches

Instead of scanning on a schedule, the operator can use a Kubernetes API "watch", where an API consumer registers interest in a resource or a class of resources and is notified when a matching resource changes. This means that the operator can be idle most of the time, reducing request load, and that it will respond nearly instantly to changes. Frameworks for operators will typically handle the registration and management of watches for you.

Another consequence of this design is that you need to watch the resources you create too. For example if we create Pods then we must watch the pods we have created. This is so that if they are deleted, or modified in a way that isn't consistent with our desired state, we can be notified, wake up, and correct them.

As a result, we can now take the simplicity of the reconciler a step further. For example, in response to an EtcdCluster the operator wishes to create a Service and a number of EtcdPeer resources. Instead of creating them in one go, it will create the Service first, and then exit. But because we watch owned Services, we are triggered to re-reconcile right away, at which point we can create the peers. Otherwise, we would create a number of resources, then re-reconcile again, once for each of them, potentially triggering even more re-reconciles.

This design helps to keep the reconciler loop very simple. By performing only one action and then exiting, we remove the need for complex state that a developer needs to reason about.
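The "one action, then exit" shape can be sketched without any Kubernetes libraries. In the example below the world struct is an in-memory stand-in for the cluster, and the loop in main plays the role of the watch that re-triggers the reconciler; both are assumptions for illustration, as is the choice to create one EtcdPeer per pass.

```go
package main

import "fmt"

// world is an in-memory stand-in for the state the operator would read
// from the Kubernetes API (an assumption for this sketch).
type world struct {
	serviceExists bool
	peers         int
}

// reconcile performs at most one action per call and reports what it did.
// It needs no memory of earlier calls: each pass decides from observed state.
func reconcile(w *world, desiredPeers int) string {
	if !w.serviceExists {
		w.serviceExists = true
		return "created Service"
	}
	if w.peers < desiredPeers {
		w.peers++
		return "created EtcdPeer"
	}
	return "nothing to do"
}

func main() {
	w := &world{}
	// Each change would normally arrive as a watch notification;
	// here we simply call reconcile again until nothing is left to do.
	for {
		action := reconcile(w, 3)
		fmt.Println(action)
		if action == "nothing to do" {
			break
		}
	}
}
```

Because every pass starts from what it observes, the function has a single decision point and no intermediate state to track between steps.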

A major consequence of this is that it’s possible for updates to be missed. Network disruption, an operator Pod restarting, and other issues may under some circumstances cause a missed event. To combat this it’s important that the operator works in terms of being "level based" and not "edge based".

These terms are taken from signal control software, and refer to acting on the voltage of signals. In our world, when we say "edge based" we mean "responding to the event", and when we say "level based" we mean "responding to the observed state".

For example, if a resource was deleted, we might observe the deletion event and choose to re-create. However, if we missed the delete event we might never try to re-create. Or, worse, have a failure later on because we assumed it still existed. Instead, a "level based" approach treats the trigger simply as an indication that it should re-reconcile. It will observe the outside state again, discarding the actual context of the change that triggered it.
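The difference can be shown with two small handlers. This is a simplified sketch: the Event type, the cluster map, and both function names are hypothetical, and a real operator would read state from the API rather than from a map.

```go
package main

import "fmt"

// Event is a simplified change notification (an assumption for this sketch).
type Event struct {
	Kind string // "created", "updated" or "deleted"
	Name string
}

// cluster is an in-memory stand-in for the set of Services that exist.
type cluster map[string]bool

// edgeReconcile acts on the content of the event. If the "deleted" event
// is lost, nothing ever re-creates the Service.
func edgeReconcile(c cluster, e Event) {
	if e.Kind == "deleted" {
		c[e.Name] = true // re-create what the event says was removed
	}
}

// levelReconcile ignores the event entirely and compares the observed
// state with the desired state.
func levelReconcile(c cluster, desired string) {
	if !c[desired] {
		c[desired] = true
	}
}

func main() {
	const svc = "my-first-etcd-cluster"
	// The Service was deleted but the delete event never arrived.
	// Some later, unrelated event wakes the operator up.
	wake := Event{Kind: "updated", Name: "something-else"}

	edge := cluster{}
	edgeReconcile(edge, wake)
	fmt.Println("edge based repaired:", edge[svc]) // prints edge based repaired: false

	level := cluster{}
	levelReconcile(level, svc)
	fmt.Println("level based repaired:", level[svc]) // prints level based repaired: true
}
```

The level based handler repairs the missing Service on any wake-up at all, which is what makes missed events survivable.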

API Caching

Another major feature of many controllers is request caching. If we reconcile and ask for Pods, then are triggered again 2 seconds later, the second request may be served from a cached result. This reduces load on the API server, but introduces further considerations for developers.

Because the results of requests for resources can be out of date, we have to handle stale reads. Resource creation in particular is not cached, and so it's possible that we could have the following situation:

Reconcile an EtcdCluster resource.
Search for a Service, not find it.
Create the Service and exit.
Re-reconcile in response to the Service being created.
Search for a Service, hit an out of date cache, and not see one.
Create the Service.

We erroneously attempt to create a duplicate Service. The Kubernetes API will correctly handle this, and give us an error indicating it already exists. As a result, we have to handle this case. Generally it's best practice to simply back off and re-reconcile later. In Kubebuilder simply returning an error from the reconcile function will cause this to occur, but different frameworks may vary. When re-running later, the cache will eventually become consistent, and the next phase of reconciliation can occur.

One side effect of this is that all resources must be deterministically named. Otherwise if we create a duplicate resource we may use a different name, which might result in an actual duplicate.
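A minimal sketch of both ideas, deterministic names and tolerating an "already exists" response, might look like the following. The fakeAPI type, the ErrAlreadyExists sentinel, and the "cluster name plus index" naming scheme are assumptions for illustration and do not describe the operator's real naming.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrAlreadyExists stands in for the API server's "already exists" error.
var ErrAlreadyExists = errors.New("already exists")

// fakeAPI is an in-memory stand-in for the Kubernetes API (an assumption).
type fakeAPI struct {
	objects map[string]bool
}

// Create fails if an object with the same name is already stored.
func (a *fakeAPI) Create(name string) error {
	if a.objects[name] {
		return ErrAlreadyExists
	}
	a.objects[name] = true
	return nil
}

// peerName derives the same name from the same inputs every time, so a
// retry collides with the earlier object instead of adding a second one.
// The naming scheme itself is hypothetical.
func peerName(clusterName string, index int) string {
	return fmt.Sprintf("%s-%d", clusterName, index)
}

// ensurePeer tries to create a peer. On "already exists" it asks the
// caller to requeue and try again once the cache has caught up.
func ensurePeer(api *fakeAPI, clusterName string, index int) (requeue bool, err error) {
	err = api.Create(peerName(clusterName, index))
	if errors.Is(err, ErrAlreadyExists) {
		return true, nil
	}
	return false, err
}

func main() {
	api := &fakeAPI{objects: map[string]bool{}}
	fmt.Println(ensurePeer(api, "my-first-etcd-cluster", 0)) // prints false <nil>
	// A second attempt, as if triggered by a stale cache read.
	fmt.Println(ensurePeer(api, "my-first-etcd-cluster", 0)) // prints true <nil>
	fmt.Println(len(api.objects))                            // prints 1
}
```

Had the name included a random suffix, the second attempt would have succeeded and left two peers where one was wanted.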

Update Batching

In some circumstances we might trigger a lot of reconciles at nearly the same time. For example if we're watching a number of Pod resources and many of them are stopped at the same time (e.g., because of node failure, administrator error, etc.) then we'd expect to get notified multiple times. However by the time the first reconcile has actually triggered and observes the state of the cluster, all the pods are already gone. So the further reconciles are not necessary.

This isn't a problem when that number is small. But in larger clusters when dealing with hundreds or thousands of updates at a time, this has a risk of slowing the reconcile loop to a crawl as it repeats the same operation a hundred times in a row, or even overfilling the queue and crashing the operator.

Because our reconcile function is "level based" we can make an optimization to handle this. When we enqueue an update for a particular resource, we can remove it if there's already an update for that resource in the queue. Combined with a wait before reading from the queue, we can effectively 'batch' operations together. So if 200 pods all stop at the same time, depending on the exact conditions of the operator and its queue configuration, we might only perform one reconcile.
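The de-duplication step can be illustrated with a small queue that holds each key at most once. This is a single-goroutine sketch under simplifying assumptions: it is not safe for concurrent use, it omits the wait before reading, and the dedupQueue type is hypothetical rather than the queue any particular framework ships.

```go
package main

import "fmt"

// dedupQueue is a FIFO queue that holds each key at most once.
type dedupQueue struct {
	order   []string
	pending map[string]bool
}

func newDedupQueue() *dedupQueue {
	return &dedupQueue{pending: map[string]bool{}}
}

// Add enqueues key unless it is already waiting, and reports whether it was added.
func (q *dedupQueue) Add(key string) bool {
	if q.pending[key] {
		return false // an update for this resource is already queued
	}
	q.pending[key] = true
	q.order = append(q.order, key)
	return true
}

// Get removes and returns the oldest key; ok is false when the queue is empty.
func (q *dedupQueue) Get() (key string, ok bool) {
	if len(q.order) == 0 {
		return "", false
	}
	key = q.order[0]
	q.order = q.order[1:]
	delete(q.pending, key)
	return key, true
}

// Len reports how many distinct keys are waiting.
func (q *dedupQueue) Len() int {
	return len(q.order)
}

func main() {
	q := newDedupQueue()
	// 200 Pods belonging to the same cluster stop at once; every
	// notification maps to the same EtcdCluster key.
	for i := 0; i < 200; i++ {
		q.Add("default/my-first-etcd-cluster")
	}
	fmt.Println(q.Len()) // prints 1
}
```

Dropping the duplicates is only safe because the reconciler is level based: one pass over the current state covers everything the 200 notifications could have told it.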

Permissions

Anything that accesses the Kubernetes API must provide credentials to do so. In-cluster, this is handled using the ServiceAccount that a Pod runs as. Using the ClusterRole and ClusterRoleBinding resources, we can associate permissions with a ServiceAccount. For operators in particular this is critical. The operator must have permissions to get, list, and watch the resources it manages across the whole cluster. In addition, it will need broad permissions on any resource it might create in response, for example Pods, StatefulSets, and Services.

Frameworks such as Kubebuilder and Operator SDK can provide these permissions for you. For example Kubebuilder takes a source annotation approach, and assigns permissions per controller. Where multiple controllers are merged into one deployed binary (as we do with etcd-cluster-operator) then the permissions are merged.

// +kubebuilder:rbac:groups=etcd.improbable.io,resources=etcdpeers,verbs=get;list;watch
// +kubebuilder:rbac:groups=etcd.improbable.io,resources=etcdpeers/status,verbs=get;update;patch
// +kubebuilder:rbac:groups=apps,resources=replicasets,verbs=list;get;create;watch
// +kubebuilder:rbac:groups=core,resources=persistentvolumeclaims,verbs=list;get;create;watch;delete

These are the permissions for the reconciler on the EtcdPeer resource. You can see that we get, list, and watch our own resource, and we can update and patch our status subresource. This allows us to update just the status to display information back to other users. Finally, we have broad permissions on the resources we manage, to allow us to create and delete them as required.

Validation and Defaulting

While the custom resource itself provides a level of validation and defaulting, it is left to your operator to perform more complex checks. The easiest approach is to do it when the resource is read into your operator, either when it is returned from a watch or just after a manual read. However, this means that your defaults will never be applied back into Kubernetes' representation, which may lead to confusing behavior for administrators.

A better approach is the use of validating and mutating webhook configurations. These resources tell Kubernetes that it must use a webhook when a resource is created, updated, or deleted before persisting it.

For example, a mutating webhook can be used to perform defaulting. In Kubebuilder we provide some extra configuration to make it create the MutatingWebhookConfiguration, and Kubebuilder handles providing the API endpoint. All we write is a Default method on the spec struct we are going to default. Then, when that resource is created, the webhook is called before the resource is persisted, and the defaulting is applied.

However, we still have to apply defaulting on resource read. The operator can't make any assumptions about the platform to know if webhooks are enabled. Even if they are, they could be misconfigured, or a network outage could cause a webhook to be skipped, or a resource may have been applied before webhooks were configured. All of these problems mean that while webhooks provide a better user experience, they cannot be relied upon in your operator code and the defaulting must be run again.
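Running the same defaulting in both places is safe as long as the function is idempotent. The sketch below uses a trimmed-down spec, and the default of three replicas is a hypothetical value chosen for illustration, not the operator's documented default.

```go
package main

import "fmt"

// EtcdClusterSpec is a trimmed-down version of the spec for this sketch.
type EtcdClusterSpec struct {
	Version  string
	Replicas *int32
}

// defaultReplicas is a hypothetical default chosen for illustration.
const defaultReplicas int32 = 3

// Default fills in unset fields. It only touches fields that are missing,
// so calling it in the webhook and again in the reconciler is harmless.
func (s *EtcdClusterSpec) Default() {
	if s.Replicas == nil {
		r := defaultReplicas
		s.Replicas = &r
	}
}

// desiredReplicas is what the reconciler would use. It defaults its own
// copy of the spec first, so it works even if the webhook never ran.
func desiredReplicas(spec EtcdClusterSpec) int32 {
	spec.Default()
	return *spec.Replicas
}

func main() {
	// The resource arrived without passing through the webhook.
	fmt.Println(desiredReplicas(EtcdClusterSpec{Version: "3.2.28"})) // prints 3

	// The user set a value explicitly; defaulting leaves it alone.
	five := int32(5)
	fmt.Println(desiredReplicas(EtcdClusterSpec{Version: "3.2.28", Replicas: &five})) // prints 5
}
```

Keeping one Default method and calling it from both paths avoids the two code paths drifting apart.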

Testing

While any individual units of logic can be unit tested using your language's normal tooling, integration testing causes particular problems. It might be tempting to consider the API server a simple database that can be mocked. However, in a real system the API server will perform a lot of validation and defaulting. This means that behavior between test and reality can be different.

Broadly speaking, there are two major approaches:

In the first approach the test harness downloads and executes the kube-apiserver and etcd executables, creating a real working API server (the use of etcd here is unrelated to the fact that our example operator is managing etcd). Of course, while you can create a ReplicaSet, we're missing the Kubernetes component that creates the Pods here, so we won't see anything actually run.

The second approach is much more comprehensive and uses a real Kubernetes cluster, one which can run Pods and will respond accurately. This kind of integration test is made much easier with kind. This project, its name a contraction of "Kubernetes in Docker", makes it possible to run a full Kubernetes cluster anywhere you can run Docker containers. It has an API server, it can run Pods, and runs all the main Kubernetes tooling. As a result tests that use kind can be run on a laptop, or in CI, and provide a nearly perfect representation of a Kubernetes cluster.

Closing thoughts

We've touched on a lot of ideas in this post, but the most critical ones are:

Deploy operators as Pods in the same cluster they operate on.
Any language will do, so select one that is best for the team. But Go has the most mature ecosystem.
Take care with non-Kubernetes resources, in particular behavior during network outages or upstream API failures that could make an outage worse.
Do one operation per reconcile cycle, then exit and allow the operator to requeue.
Use a "level based" approach to design, and ignore the content of the event that triggered the reconcile.
Use deterministic naming for new resources.
Give minimal permissions for your service accounts.
Run defaulting in a webhook, and in your code.
Use kind to make integration tests that you can run on a laptop and in CI.

Armed with these tools, you can build an operator to streamline your deployments and reduce the burden on your operations team, whether for your own applications or for third-party software you run.

About the Author

James Laverack works as a Solutions Engineer with Jetstack, a UK-based Kubernetes professional services company. With over seven years of industry experience, he spends most of his time helping enterprise companies on their cloud-native journey. He's also a Kubernetes contributor, and has served on the Kubernetes Release Team since version 1.18.
