
Q&A with James Munnelly and Matt Bates on Kubernetes Stateful Services and Navigator at QCon London

by Manuel Pais on Mar 17, 2018. Estimated reading time: 6 minutes

The advantages of Kubernetes for stateless services are well documented. However, stateful workloads have particular requirements that have not been fully addressed yet by the Kubernetes ecosystem, according to James Munnelly, solutions engineer, and Matt Bates, co-founder of Jetstack, who presented at QCon London this year.

InfoQ took the chance to ask Munnelly and Bates about their views and ongoing work to be able to configure, deploy, monitor, scale and auto-heal stateful services in Kubernetes in the same way as stateless services. In particular, the approach and implementation of Navigator, an open source Kubernetes extension Munnelly and Bates have been developing, was discussed in detail.

InfoQ: Why did you decide on a Kubernetes-only platform instead of a hybrid, i.e. Kubernetes for stateless services and cloud provider services for data storage?

Matt Bates and James Munnelly: Where managed cloud provider services exist, such as Cloud SQL, there is a good case for using them in the pattern you describe. However, as more enterprises look to use Kubernetes in multiple environments, there is a desire for deployment and operational consistency, and this is not always achievable with different flavours of managed service. It is also the case that in some environments, especially on-premises, such managed services simply do not exist. This is the situation for many of our enterprise customers.

There are already some efforts to integrate cloud provider managed services into cloud-native applications with the service-catalog project. We see Navigator integrating with service-catalog to provide higher-level layers of abstraction than we offer today.

InfoQ: What are the fundamental difficulties you have faced managing stateful workloads on Kubernetes?

Bates and Munnelly: Kubernetes provides many benefits to industry in terms of development velocity, resource utilisation and automated operations. However, it’s fair to say that these benefits have not yet carried across to stateful workloads.

Many common database systems make assumptions they will be run on machines with fixed software versions, persistent disks and network identity - pets, essentially. Few systems are designed for highly dynamic environments like Kubernetes, where pods can come and go and change identity, and services are round-robin load balanced, for instance.

Moving database systems to Kubernetes is also problematic because Kubernetes does not have the complex, application-specific operational awareness required to respond appropriately to all the various types of failure. So during these events, human intervention is often still required to ‘operate’ the database in question, and the time and efficiency savings anticipated from automation can diminish.

InfoQ: In your talk you highlighted how Kubernetes evolution has been adding features over time that help manage stateful workloads, could you expand on that?

Bates and Munnelly: Since the very early days of the project, there have been efforts to introduce and mature features to enable workloads with state. Persistent Volumes, dynamic volume provisioning and StatefulSet provide building blocks that help run applications which require persistent disks or stable network identity, such as databases. These are brilliant tools, but on their own they can be problematic to use and understand, and they cannot do everything you need to automate the operations of the many flavours of distributed database systems.
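
To make "stable network identity" concrete, here is a minimal sketch of the naming scheme a StatefulSet gives its pods: each pod is named after the set plus an ordinal, and is reachable through the governing headless Service at a predictable DNS name. The function name and the "cassandra" values are illustrative assumptions, and the sketch assumes the default cluster domain of cluster.local.

```python
def stateful_pod_identities(statefulset, service, namespace, replicas,
                            cluster_domain="cluster.local"):
    """Return (pod_name, dns_name) pairs for each StatefulSet ordinal.

    Assumes a headless Service named `service` governs the StatefulSet
    and that the cluster uses the given DNS domain.
    """
    identities = []
    for ordinal in range(replicas):
        # Pod names are <statefulset>-<ordinal>, starting at 0.
        pod = f"{statefulset}-{ordinal}"
        # Each pod gets a DNS record under the headless Service.
        dns = f"{pod}.{service}.{namespace}.svc.{cluster_domain}"
        identities.append((pod, dns))
    return identities


if __name__ == "__main__":
    for pod, dns in stateful_pod_identities("cassandra", "cassandra", "default", 3):
        print(pod, dns)
```

Because these names survive rescheduling, a database node can keep advertising the same address to its peers even when its pod moves to another machine.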


InfoQ: Despite that evolution, you've opted to develop a Kubernetes extension called Navigator. Could you tell us what features Navigator provides and how do you see the tool fitting in the existing Kubernetes ecosystem?

Bates and Munnelly: We’re very much building on this evolution. Resources such as StatefulSet and PersistentVolume, and their controllers, have brought about the building blocks for distributed stateful systems on Kubernetes. But by themselves, these primitives are not quite enough, as they do not take account of the application-specific behaviour for bootstrap, scale-up/down, backup and restore, and more. We are building extensions to Kubernetes in order to fill these gaps between platform functionality and user experience.

InfoQ: In your talk you mentioned the operator pattern. Could you summarize how this pattern can help or hinder the operation of Kubernetes stateful workloads?

Bates and Munnelly: The Operator pattern was introduced by the folks at CoreOS and they have led the way in adopting this pattern to orchestrate and manage the likes of etcd and Prometheus. In Navigator, we follow a similar pattern, but we also add a co-located binary (a ‘Pilot’) that wraps each deployed database process. It’s our eyes and ears to determine the database node’s state, and this is reported back to the Pilot resource status in the Navigator API server (built on Kubernetes API machinery).

InfoQ: How does Helm native Kubernetes application management fit in? What are its main shortcomings when it comes to applications with strong data storage requirements?

Bates and Munnelly: It’s great to see such an extensive and ever-growing library of Helm charts. Most applications can now be easily deployed from a readymade chart, and that includes stateful systems such as MySQL, MongoDB and Elasticsearch; the list goes on. However, many of these charts still require point-in-time management and lack the operational knowledge for proactive management and failure recovery. A chart will spin you up an Elasticsearch cluster, say, but it won’t be able to handle scale-down gracefully.
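
The scale-down point can be illustrated with a toy model. The sketch below is not Navigator or Helm code; the Node and Cluster classes and both functions are hypothetical, and "shards" simply stands for whatever data a node holds. It contrasts removing a node outright with first moving its data onto the remaining nodes.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    shards: int  # units of data currently held by this node


@dataclass
class Cluster:
    nodes: list = field(default_factory=list)


def naive_scale_down(cluster, target):
    """Remove the highest-ordinal nodes immediately.

    Returns the number of shards removed without being relocated first.
    """
    unrelocated = 0
    while len(cluster.nodes) > target:
        unrelocated += cluster.nodes.pop().shards
    return unrelocated


def graceful_scale_down(cluster, target):
    """Drain each departing node onto the survivors before removing it."""
    if target < 1:
        raise ValueError("at least one node must remain to hold the data")
    while len(cluster.nodes) > target:
        leaving = cluster.nodes[-1]
        survivors = cluster.nodes[:-1]
        # Relocate shards round-robin; a real system would wait for
        # the database to confirm that relocation has completed.
        for i in range(leaving.shards):
            survivors[i % len(survivors)].shards += 1
        leaving.shards = 0
        cluster.nodes.pop()


if __name__ == "__main__":
    cluster = Cluster([Node("es-0", 4), Node("es-1", 4), Node("es-2", 4)])
    graceful_scale_down(cluster, 2)
    print([(n.name, n.shards) for n in cluster.nodes])
```

The ordering is what matters here: data is moved first and the node is removed second, which is the kind of application-specific knowledge a static chart does not encode.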

InfoQ: How does the Navigator extension work, in a nutshell?

Bates and Munnelly: Navigator introduces new API types (such as ‘ElasticsearchCluster’, ‘CassandraCluster’) which represent higher level constructs for users to interact with.

We have then created an ‘operator’ which is responsible for creating and manipulating other Kubernetes resources in order to realise the ‘desired state’ (e.g. a valid Cassandra deployment). This controller continually watches the deployment and takes corrective action in response to failures, as well as driving operational tasks such as upgrade and scale-up/down.

To facilitate data collection from the databases being deployed, we use ‘Pilots’: small applications that run alongside your database processes, collect information, and store it back in the Navigator API to inform decisions made by the controller.

This separation of collection from action has been a key success for the project so far.
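
The separation of collection from action can be sketched in a few lines. This is a simplified illustration, not Navigator's actual API: PilotStatus, StatusStore, reconcile and the "db" naming are all hypothetical. Pilots only write status reports into a store standing in for the API server; the controller only reads them and decides what to do.

```python
from dataclasses import dataclass, field


@dataclass
class PilotStatus:
    """What a hypothetical Pilot reports about one database node."""
    node: str
    healthy: bool


@dataclass
class StatusStore:
    """Stands in for the API server: pilots write, the controller reads."""
    reports: dict = field(default_factory=dict)

    def report(self, status):
        self.reports[status.node] = status


def reconcile(desired_replicas, store, cluster_name="db"):
    """Compare desired state with reported state and return planned actions.

    The function has no side effects; it only decides what should happen.
    """
    expected = [f"{cluster_name}-{i}" for i in range(desired_replicas)]
    actions = []
    for name in expected:
        status = store.reports.get(name)
        if status is None:
            actions.append(("create", name))   # node missing entirely
        elif not status.healthy:
            actions.append(("repair", name))   # node present but unhealthy
    for name in sorted(store.reports):
        if name not in expected:
            actions.append(("remove", name))   # node beyond the desired size
    return actions


if __name__ == "__main__":
    store = StatusStore()
    store.report(PilotStatus("db-0", healthy=True))
    store.report(PilotStatus("db-1", healthy=False))
    print(reconcile(3, store))
```

A real controller would run this comparison repeatedly and apply the resulting actions through the Kubernetes API; keeping the decision step free of side effects makes it straightforward to test in isolation.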

InfoQ: Navigator makes use of Kubernetes CustomResourceDefinition (CRD), correct? But it also extends the Kubernetes API. Why? Could you provide an example?

Bates and Munnelly: Navigator extends the API by including its own API server that can run alongside an existing Kubernetes control plane in order to provide the API extensions.

This is a new pattern to Kubernetes, and it's being used to ‘break up the monolith’ and allow external collaborators to add native resource types to their cluster. A couple of examples of this would be metrics-server and service-catalog.

InfoQ: Do you think in the future Navigator's API could become part of Kubernetes standard API?

Bates and Munnelly: This is very unlikely to be the case. The Kubernetes API and the powerful API machinery provide the foundations to build extensions such as Navigator. Building on Kubernetes is very much regarded as the preferred pattern for future application- and environment-specific developments. The project maintainers' desire is to slow down development of the core, ensure stability, and not overburden the existing API, which already has hundreds of resources.

Looking into the future, we envisage a new generation of distributed systems built with Kubernetes primitives as a foundation - Kubernetes, container and cloud-native from the get-go. The likes of CockroachDB offer us a glimpse into this future.

InfoQ: Finally, you've received a good amount of interest in the project since you gave the talk at QCon London. Any ideas on the roadmap and maturity level in the near future for the tool?

Bates and Munnelly: We were really pleased to present the project and our developments to date at QCon and, as you say, there has been a really positive reaction. We’re fortunate to work closely with customers that are driving requirements, whilst maintaining the project in the open.

We are aiming to cut a v0.1 release in the coming weeks, which will represent the first supported API surface for Navigator. We will also include end-to-end tests for common disaster recovery scenarios, and will be extending these further to ensure we can properly handle system failures. Looking further ahead, we’re actively looking at supporting more database systems. Stay tuned!

The video recording of the talk will be made available on InfoQ over the coming months.
