
Choosing Kubernetes: Managing Risk in Cloud Infrastructure



Ben Butler-Cole tells the story of how Neo4j’s system developed, both the product and the implementation, in terms of the decisions they made about how to manage risk. He talks about their use of Kubernetes as a foundation for their stateful service: why they chose it and how they handled the risks associated with that choice.


Ben Butler-Cole is the engineering lead for Neo4j's new DBaaS product. He has spent most of the last two decades looking for imaginative ways to build software without unnecessarily increasing the number of lines of code at large in the world. He has worked in publishing, finance, retail and telecommunications.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.


Butler-Cole: One of the themes of this track in the conference is risk mitigation and in particular how the responsibility for mitigating risks is transferred between the providers of services and their consumers in modern IT systems.

One of the primary examples of this is the public cloud. When people use public cloud services, they're transferring to the cloud providers the responsibility for mitigating some of the infrastructure risks. They're outsourcing that risk, so the cloud providers take responsibility for capacity planning, for hardware management, for running data centers and for some of the security risks.

The cloud providers in their turn pass on some of that risk to providers of their own, the hardware providers, although it's notable that a lot of the big cloud providers are now manufacturing more and more of their own hardware as their needs become specialized and as they operate at a bigger and bigger scale. This is a theme that we'll return to: businesses holding on to risks which are core to their business or particularly important to them.

Ultimately, the hardware providers are delegating to quantum physicists the risk of ensuring that the components are going to continue operating in the way that they ought to. The businesses that use the cloud providers have consumers of their own who are handing on some of their risk to them, so there's a whole chain of risk delegation here, with different parcels of risk being handed on down the chain to be mitigated at a lower point in the chain.


I work for Neo4j, the company behind the open source graph database Neo4j. Currently, Neo4j is available only as shrink-wrapped software for you to download and install on your own servers. I'm running a team building a new database-as-a-service product on top of Neo4j, so that once this product is available, instead of having to install it yourselves, you'll be able to just come along to our website, stick in a credit card and start using the database.

This role places us squarely at the center of this chain of risk delegation. We accept from our customers the risks of operating the database and take responsibility for mitigating them, and then we pass on to cloud providers some of the infrastructure risks. We made a very early decision in building the system that we weren't going to run any hardware ourselves and that we were going to delegate that risk to the public cloud providers. We use other providers, other external services, where appropriate.

In this session, I'm going to talk about the architectural and product decisions we've made while building this system in the light of this risk delegation model. In particular, I'm going to talk about our use of Kubernetes: how we chose to use it, the reasons why, and what use we're making of it. This is an overview of the whole of our system, all the way up to some end user who isn't visible to us and who is using the service provided by our consumers. The key service that we're providing to our customers is the ability to use a graph database without having to operate it, so they get the benefits of storing and retrieving their information without any of the attendant effort of operating it.

The reason they want to do that is that running databases is hard, primarily because it concerns state, and state is tricky in IT systems. Installing a database tends to be relatively straightforward if you're lucky, but then keeping it running reliably, particularly in the face of faults, becomes a difficult problem. That difficult problem isn't core to the business that our consumers are trying to run. They have other things to do, and so they're very happy to hand off to us the risk of operating the database. They retain the risk of data modeling, of building queries to run against that database and so on, and that's because those queries and the data model are intimately tied up with the business that our consumers are trying to run. It's another example of businesses holding on to risk that's core to their business.

Retain Risk that Is Specialized or Core to the Business

This is a theme that we'll see time and time again: businesses choose to hold on to and mitigate risk that is core to their business, or where they have particular specialized knowledge or information that allows them to mitigate that risk efficiently. The factors that influence this can change over time; we'll see examples where the right place to mitigate the risk, and what's considered to be core to the business, change over time. For example, for our customers, database backup is not core to their business and so they're happy to delegate that to us. Our specialism is operating databases, and so we're very happy to take responsibility for that risk mitigation from our consumers.

Here's an example of one of the things that we're very happy doing. This schematic shows the upgrade of a cluster of Neo4j instances from version one of the software to version two. In a clustered service you want to be able to do this upgrade process without any downtime and with data safety, so you can be sure that you're not going to lose any data while you're doing it, and you want all this to work smoothly in the face of failures even while that upgrade process is going on.

There are two things that make doing this costly. First of all, designing the process itself is difficult. This is not a trivial problem; distributed computing is hard, and so simply designing the process correctly to upgrade this system can be complex. In fact, it's sufficiently complex that we use a simplified version of this as a discussion topic in our interviews, and we're hiring, so if this kind of thing thrills you to the core then we'd be very glad to hear from you. However, although it's a difficult task to design this, this is expertise we have in house. We built the clustering system, and we have the expertise to design this process.

The other thing that makes this expensive is the cost of engineering and automation for this process. It is a significant engineering effort, particularly considering all the fault tolerance that's necessary. It's actually a lot of work to automate this, and if you're only running one or two databases, the effort required to automate it is probably more than it's worth, and so you end up with a primarily manual process which, as a result, can be very error-prone. For us, however, we can amortize that cost of automation across many consumers, and so it becomes efficient for us to put in place some very sophisticated automation processes to run things like the upgrades of our service.

Another example is backup. Backup is much simpler in terms of the kind of operations that are required, but it's absolutely critical to the success of the database because it's vital for the safety of the data. Neo4j has an online remote backup facility to minimize the impact of the backup process on the running cluster. Automating that is very straightforward, but it requires you to provision an extra server that sits outside your cluster, and that server needs to be of a similar size to the machines that you're already running in your cluster. However, most of the time that server is going to be unused, because you're only running a backup, say, once a day, so it's very expensive to have this extra server lying around. We, however, can amortize the cost of that server over time because we have many customers, and so we can keep our backup service running hot, which makes it efficient for us to take this on.

Efficient Mitigation

We've seen that there are three things that can make it efficient for you to take on risk and take responsibility for mitigating it. Firstly, the concentration of expertise or information; secondly, economies of scale, the ability to amortize over multiple consumers; and thirdly, temporal smoothing, the ability to amortize the costs over time. I think this kind of analysis is one way to spot a viable business in IT: you look for little parcels of risk that you're well placed to mitigate and which your potential customers are happy to hand on to you. The value of the business you build is exactly the value to your customers of having those risks mitigated by somebody else.

The external services that we use are not just the cloud provider. We also use Datadog as a metrics service, we use Auth0 for authentication, so that we hand to them the risk of correctly implementing the authentication subsystems, and we use Stripe as a credit card payment provider. There's a huge amount of risk associated with credit card payments, both for financial and for regulatory reasons, and it would be very difficult for us to do the work needed to mitigate those risks, but Stripe have made a business out of it. They have economies of scale and they have expertise that makes it efficient for them to do that. Our providers, of course, hand off some of that risk to providers of their own.

Risk Movement

Another theme is that the best place to mitigate risks can change over time as the balance of costs and available expertise changes. I talked already about the cloud providers building some of their own hardware, and we expect possibly to bring back in-house some of the things that we're currently outsourcing. For example, Datadog is very expensive, so it may be that eventually, when we come to a kind of cost optimization phase of the project, we'll want to bring that in-house and run the metrics infrastructure ourselves.


To talk about something slightly different, I want to talk about Kubernetes, about our use of it and, in particular, how we decided to use it and the process that we went through. Our use of Kubernetes isn't notable in and of itself; Kubernetes is now widely deployed. What is notable is that we're using it to build a database service, which is a stateful system, and this is somewhat unusual. In fact, when we were starting to think about using Kubernetes to build this system, Kelsey Hightower had recently given a talk in which he said, "Most people get really excited about running a database inside Kubernetes. This is going to make you lose your job, guaranteed." So that gave us some pause, as you can imagine. Since then his position has maybe softened and certainly become more nuanced, and currently his position could be summed up with the Twitter sound bite, "Kubernetes supports stateful workloads; I don't." His point is that Kubernetes only solves part of the problem. The other parts must be solved by the stateful service and through operational experience.

Kubernetes, if you're building a stateful system, doesn't give you anything for free. You still have to have the expertise necessary to operate the database outside Kubernetes, for example, but if you have that expertise it is possible to operate it inside Kubernetes. We were considering using Kubernetes as the foundation for our database as a service despite these dire career-termination warnings from Kelsey Hightower. Why was that? To return to the point I was just making, it's because the hard bits, the bits for which we have the operational expertise, would have to be solved anyway wherever we built it. We were going to have to design these complex systems for doing upgrades regardless.

While Kubernetes couldn't help us with the very difficult bits of our system, it could help us with the easy bits and leave more of our energy available for solving the hard problems. The other point was that we were pretty sure that, while Kubernetes didn't support stateful systems very well a couple of years ago, things were moving in that direction, and that by the time we were done, maybe Kubernetes would have caught up with us and the support for stateful systems would have improved.

I built my first container-based system in about 2011. I was building a software-as-a-service version of a Java web app, a single-tenant web app; it was a project management tool. We took the Java web app and we stuck it into containers in order to make it multitenant for this SaaS system. This was before Docker existed, so we built it using naked LXC containers. I don't know if any of you remember LXC, but I spent more time than I care to remember wrangling iptables and defining bridge networks and getting my hands very dirty.

I remember thinking at that time that this technology seems promising, it's kind of helping, but surely there should be a better way. Right about the time we finished building this system and put it into production, Docker was released. We learned that there was indeed a better way of doing things, so you didn't have to get your hands so dirty, or at least, let's say, smart people at dotCloud would get their hands dirty on your behalf. I had already seen container technology catch up with what I was trying to do once before, and I felt reasonably confident that that was going to happen again.

Hand off Risk to Our Future Selves

There's another theme here, which is that sometimes, rather than handing on risk mitigation to someone else, you can hand it off to your future self. You're making a bet that your future self will live in a world where the risk mitigation will be cheaper than it is today. We thought that Kubernetes might be useful, but we had some reservations based on the experience of people who knew it better than us, and we had no experience of using it. We wanted to get more information to decide whether it would be right for us, and we decided to do that by punting the hard problems down the road: using it today for building a simple system and worrying later about whether it was going to be the right solution for running our actual databases.

This is a very high-level overview of the architecture of our system. At the front end, we have an end-user-facing application that we call the Console. This is a JavaScript single-page app with a Python web service sitting behind it, keeping its state inside a database which lives outside Kubernetes. Behind that sits the database manager, which is responsible for actually managing the databases. Here the domain is complex but the architecture is very simple: it's just another Python web application that exposes an API for the Console to use. Finally, right at the back end, are the Neo4j databases themselves. Neo4j runs in two modes: you can run it in single-instance mode or you can run it in clusters for greater availability and fault tolerance. The single instances are operationally very simple and the clusters, as you can imagine, are operationally more complex.

We decided to build just the simple, stateless applications at the front of our system on Kubernetes from the very beginning. That gave us an opportunity to learn Kubernetes live, running a proper system, and gave us the information to decide whether later on we wanted to go all in and run the databases at the back end on Kubernetes too. We left the single-instance Neo4j running on EC2, using EBS as the persistence mechanism.

For a system this simple, Kubernetes is overkill. I don't think it would justify the effort needed to learn Kubernetes and implement everything on top of it just in order to run a couple of relatively straightforward stateless web applications. For us that investment of effort was worth it because it was giving us the information we would need to decide whether going all in on Kubernetes later on was the right decision, and it also gave us the opportunity to learn a bit while the world was catching up.


This cost was only going to be worth it if we didn't also have to learn how to run Kubernetes itself. We had to find a hosted service, because running Kubernetes is complicated and takes a lot of effort, and it wasn't something that we wanted to invest in learning before we were even sure whether it was the future of our system. So it was only viable if we could find a hosted service, and we chose to use Google Cloud's GKE, which is a hosted Kubernetes service.

At the time when we were starting this, GKE was, I think, the only viable production-ready hosted Kubernetes service. Azure and AWS now have their own, but we decided to use GKE because it was already production-ready. This entailed moving from AWS, where we currently worked and where we had most of the expertise in the team, onto Google Cloud. This isn't an AWS versus Google Cloud talk, but I've got quite a lot of opinions on the subject, so if anyone wants to try and catch me in a corridor afterwards, I'd be very happy to go on about it at great length. Suffice to say, we were happy with our experience of using GCP.

At the risk of giving spoilers to my own story, we used Kubernetes and we liked what we found, and we ended up deciding not only to build our system on Kubernetes but also to use the architectural patterns that Kubernetes provides as the basis for the architecture of our whole system.

The first implementation of our system, running single-instance Neo4j, was very imperative, following what Martin Fowler calls the transaction script pattern. I'll just take a brief detour to explain how we went about building the system. We started off when it was just me on the team and I hadn't even hired anybody yet. The first thing I did was stand up a couple of EC2 instances and install Neo4j on them, and then I touted them around inside the company, saying, "This database-as-a-service thing I'm working on, it's kind of done. Anyone want to give it a go?" I provided that as a service internally and then to a few external people in a kind of smoke-and-mirrors way, with me doing all the beavering away behind the scenes to keep the thing running. Then gradually over time we hired some people and we automated more and more of that system, and we added monitoring and we added functionality until we had a fully running system, but we kept it running live the whole time and never took the system down while we were developing it. Effectively, we went into production before we'd written a single line of code, and that enabled us to pull forward the risk of learning how to operate the system in production and to embed into the code, as we were going along, all that we learned about operating the system.

This transaction script approach was effectively a direct translation into code of the manual operations that you carry out: create a database, here are the steps; upgrade the database, here are the steps. This worked fine for us to begin with, when the system was quite simple, but as we grew it started to show some weaknesses, primarily in error handling. You can probably imagine a system like this, which is coordinating EC2 instances, various other external resources and DNS entries: pretty much every single line of code is a callout to a remote service which may fail and which is almost certainly side-effecting. Error handling in that scenario is difficult. We found in these transaction scripts that we had to make special decisions for every line about what we would do if it failed. Can we retry, can we abort, do we have to roll back previous steps, and so on? That becomes extremely complex over time.
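To make the error-handling burden concrete, here is a minimal sketch of a transaction script in Python. It is an illustration only, not Neo4j's code: `FakeCloud`, `RemoteError` and the three step names are hypothetical stand-ins for the remote services described above.

```python
class RemoteError(Exception):
    """Raised when a (simulated) remote call fails."""


class FakeCloud:
    """Hypothetical in-memory stand-in for remote services; fails on one named step."""

    def __init__(self, fail_on=None):
        self.fail_on = fail_on
        self.resources = []

    def create(self, kind, name):
        if kind == self.fail_on:
            raise RemoteError(f"could not create {kind} for {name}")
        self.resources.append((kind, name))

    def delete(self, kind, name):
        self.resources.remove((kind, name))


def create_database(cloud, name):
    """Transaction-script style: a fixed sequence of side-effecting remote calls."""
    completed = []
    for kind in ("vm", "disk", "dns"):
        try:
            cloud.create(kind, name)
            completed.append(kind)
        except RemoteError:
            # The script itself has to know how to undo everything done so far.
            for done in reversed(completed):
                cloud.delete(done, name)
            return False
    return True
```

Even in this toy version the rollback logic is as long as the happy path, and it assumes that the rollback calls themselves never fail, which real remote calls would not guarantee.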

Secondly, we wanted to build the system so it could recover from failure, and with this approach every single kind of failure had to have its own script telling the system how to recover from it. As we started to scale the system up, and as we were looking forward to using clusters, where everything would become much more complex, we decided this design was not viable for the long term, so we looked around and thought hard about an alternative approach. We came up with a model that we called the reconciliation model. In this model, rather than having transaction scripts, you have a system of record where you might record that this database should exist. You have a process called the reconciliation process, or the reconciler, which looks at the system of record, looks at what does exist, and carries out a reconciliation between the two, making small changes to the actual state in order to pull it into line with the defined state of the system.

As an example of how that works, take the process for creating a database. You enter a record into the system of record. The reconciliation process sees that record; it knows there should be three database instances running in a cluster and it sees there are none, and so it creates one instance and then goes back to sleep. Then it wakes up again and sees that this database should have three instances but only has one, and so it creates another one, and so on until the whole cluster is running.

Another example is healing a cluster. Here's a happily running cluster. We lose one of the instances in that cluster; the reconciliation process sees the database should exist, it sees that it should have three instances, and so it adds another one to heal the cluster. You'll have noticed that the last step in both of those processes is absolutely identical, and so, with some slight simplification, you get the healing for free because you've implemented the reconciliation in this way, and this plays out not just in the creation of your instances but all across the system. The healing doesn't quite come for free, but at least it emerges naturally out of the design of the system rather than having to be specially catered for. So instead of having an imperative system, we have one that's declarative, convergent and stateless.
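The create and heal examples can be sketched with one small Python function. This is a toy model under stated assumptions, not the real reconciler: the system of record is a plain dict of desired instance counts, the "actual" state is a dict of sets, and the names `reconcile_once` and `run_until_converged` are invented for illustration.

```python
import itertools

_instance_ids = itertools.count(1)  # hypothetical source of fresh instance names


def reconcile_once(desired, actual):
    """Make at most one small change that moves `actual` toward `desired`.

    desired: database name -> wanted instance count (the system of record)
    actual:  database name -> set of running instance names
    Returns True if a change was made, False if everything already matches.
    """
    for db, wanted in desired.items():
        running = actual.setdefault(db, set())
        if len(running) < wanted:
            running.add(f"{db}-{next(_instance_ids)}")
            return True
    return False


def run_until_converged(desired, actual):
    """Stand-in for the wake/sleep loop: call reconcile_once until nothing changes."""
    steps = 0
    while reconcile_once(desired, actual):
        steps += 1
    return steps
```

Starting from `desired = {"orders": 3}` and an empty `actual`, the loop takes three steps to build the cluster. If one instance is then removed from `actual["orders"]`, the same loop takes one step to restore it: no separate healing code is involved.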

Of course, there are some costs to this; it makes the system more decoupled, but we thought that those costs would be worth it and we liked this idea. But it was a big change, and this was a big risk we were taking in the fundamental architecture of our system. It was something new, and we were looking around for other systems that worked like this and couldn't find any. If any of you know anything about the internals of Kubernetes, you'll be able to see where the story is going. We'd learned a bit about Kubernetes by this point and we liked what we saw, and we thought, "Wow, this problem we want to solve is kind of similar to what Kubernetes is doing." So we lifted the lid on Kubernetes, we checked out the code and started reading it, and this is exactly how Kubernetes works.

Kubernetes has this rich data model: containers are encapsulated in Pods, which make up ReplicaSets, which form Deployments. This very abstract model allows you to define the architecture of your system. Each of these different resources in Kubernetes has a thing that Kubernetes calls a controller, and those controllers are exactly the reconcilers of the reconciliation approach that we were proposing. For example, the ReplicaSet controller looks at the system of record to see which ReplicaSets should exist. It looks at the Pods that are recorded as running on the system and it creates Pods or destroys them as appropriate, and the Deployment controller does something similar in its manipulation of ReplicaSets.
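The create-or-destroy decision described here can be illustrated with a few lines of Python. This is a deliberately simplified toy, not the Kubernetes implementation: the function name, the tuple-based action format and the choice of which surplus Pods to delete are all assumptions made for the sketch.

```python
def plan_replica_set(spec_replicas, observed_pods):
    """Return the actions a toy ReplicaSet-style controller would take.

    spec_replicas: desired number of Pods, from the system of record
    observed_pods: names of the Pods currently recorded as running
    """
    diff = spec_replicas - len(observed_pods)
    if diff > 0:
        # Too few Pods: ask for one creation per missing replica.
        return [("create", None)] * diff
    if diff < 0:
        # Too many Pods: delete the surplus (sorted only to make the choice deterministic).
        return [("delete", name) for name in sorted(observed_pods)[:-diff]]
    return []
```

The controller only compares the recorded desired state with the observed state, so the same function covers scaling up, scaling down and replacing lost Pods.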

This gave us the reassurance we were looking for that we were on the right track. What's more, we discovered that Kubernetes exposes a mechanism that allows you to define your own custom resources and custom controllers; this is the Kubernetes operator pattern. You can use this pattern when the model that Kubernetes provides doesn't have quite what you're looking for, particularly where you have custom logic you need to implement when making changes to the system. You can build what's called an operator and plug that into Kubernetes, and you get all the facilities that Kubernetes provides to enable this model for use with your own code.

With this sympathy to the approach that we wanted to take, along with our experience of using Kubernetes in production for the simple case and the excellence of GKE, we at this point felt very happy to go ahead with moving onto Kubernetes. Just last week we finished migrating our system from the single-instance EC2 systems onto clustered databases running on Kubernetes. So we've farmed out not just the operational risk to GKE, but we've also farmed out to the Kubernetes development team some of the risk of picking the architecture for our system.

Risk Rejection

Another thing that often happens in these chains of risk delegation is risk being rejected. Sometimes providers reject risk and push it back to their consumers if it turns out that it would actually be cheaper for the consumers to mitigate those risks after all. This happens in particular when there is information or state that the consumer has in their hands which isn't available to the providers. So where do we meet that kind of risk rejection in our system? Well, the cloud providers take on a lot of risk but they reject some of it.

If you make an API call to create a VM, that API call can fail. If the provider doesn't have enough hardware capacity to run the VM that you're asking for, they will say, "No, can't do it." That's your problem, and they reject that risk back to you. A more interesting example is the case of VM failure and how systems deal with it, which has changed significantly over time. In early versions of EC2, if the hardware underlying your VM failed, your EC2 instance would just evaporate, like that. That risk was being rejected, and you had to build your system to cope with it, because the cloud providers didn't have enough information to be able to do that accurately. Over time that's improved. EC2 gave warnings and alerting and the ability to migrate your VM away from the failing hardware by stopping it and restarting it, and finally they've implemented a feature where, if you configure it, you can have your VMs moved away from the failing hardware automatically, just by being stopped and restarted with a short interruption.

Our system running on EC2 was taking advantage of that, and so we would occasionally see outages of our databases as they were stopped and restarted on a different piece of hardware. We thought that was as good as things could get, and then we had a bad August with EC2: we had a 5% failure rate of our EC2 instances in August, which seemed preposterous to us. Our system could cope, but it seemed like a lot more failure than we were really happy with, so we did some analysis to try and understand it and work out whether this was unusual or not. One of the things we did was to look at Google Cloud. We'd been running GKE on Google Cloud, and that runs on top of Google Compute Engine, which is Google's VM service.

We looked back at our historical data for our development systems to see how many GCE instances had failed. When we'd had 1 in 20 of our EC2 instances fail in August, we looked and saw that in June, July and August combined we had had no failures of any GCE instances, and we were confused. This seemed too good to be true, so we looked deeper and discovered that there was a feature there that we had been using all along without even knowing it existed, and the feature is called live migration. Through a bunch of very smart engineering and, I think, a little bit of magic, GCE has developed this facility where they will move your VMs from one host to another, from a failing piece of hardware to a good piece of hardware, without any interruption to the service at all. There is a very slight pause as they switch over and the network stack is repointed from one server to the other. They copy the memory over while the service is running, and then they keep copying the dirty pages of memory until it reaches a kind of point of no return, and then they take it down and bring it back up in the other place, just like magic. We had had no VM failures, and hundreds of these live migration events had gone on, so here's an example where the risk was rejected and then has been accepted back again.
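The copy-then-recopy-dirty-pages idea can be modelled in a few lines of Python. This is a generic sketch of iterative pre-copy migration, not a description of how GCE implements it: the function name, the fixed percentage of pages dirtied per round and the stop threshold are all assumptions chosen for illustration.

```python
def precopy_rounds(total_pages, dirty_percent, stop_threshold, max_rounds=30):
    """Simulate iterative pre-copy while the VM keeps running.

    total_pages:    pages of memory to transfer in the first round
    dirty_percent:  assumed share (0-100) of just-copied pages dirtied again during a round
    stop_threshold: once this few pages remain, pause the VM and copy the rest
    Returns (rounds_run, pages_left_for_the_final_pause).
    """
    to_copy = total_pages
    rounds = 0
    while to_copy > stop_threshold and rounds < max_rounds:
        # While this round's pages are sent, the running VM dirties some of them again.
        to_copy = to_copy * dirty_percent // 100
        rounds += 1
    return rounds, to_copy
```

With 1,000 pages, 10% dirtied per round and a threshold of 5, the remaining set shrinks to 100, 10 and then 1 page, so only a tiny final copy happens while the VM is paused. If the VM dirties memory as fast as it is copied (`dirty_percent=100`), the loop never converges and gives up after `max_rounds`.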

We reject some risks too. We've reluctantly decided to ask our customers to define the resource allocation for the databases that they're running. We have a vision in our heads of an elastic system which adapts to your needs and where we charge you based on usage, on the benefit you get from the system, rather than on arbitrary infrastructure cost. We've reluctantly decided that at the moment we don't have the information to build a system like that, and so we are pushing back to our customers the responsibility for sizing the databases that they're running.


In talking about this idea of a chain of risk delegation I've covered a number of themes. Firstly, businesses retain risk that is very specialized to what they do. Secondly, efficient mitigation relies on concentration of expertise or information, on economies of scale, or on temporal smoothing, amortization over time. Sometimes, instead of handing risk off to other people, we can hand it off to our future selves, in effect betting that they will be living in a world where that risk mitigation is cheaper. Some risks are pushed back to the consumer if it actually turns out to be cheaper for them to mitigate, and finally, the responsibility for risk mitigation moves up and down this chain over time.

I've talked about our system and about the flow of risks. I think this is a useful way to think about our system, and I encourage you to go home and do the same thing for the systems that you're working on. Don't just think about the risks that are mitigated in the component of the system that you're working on, but think about the chain as a whole, think about the important risks all up and down that chain, and think about where the system as a whole could actually be improved by moving the point of risk mitigation. I hope you find that useful.

Questions and Answers

Participant 1: Really good talk. One thing I was wondering when I saw you talking about risk transfers is that this kind of thinking seems like it would be really useful when you're plotting the competitive landscape, when you're comparing the services that you're offering with what other companies might be doing. Do you use that kind of thinking and calculus to think about what your competitive advantages are, or how you might compete against other companies?

Butler-Cole: Yes, absolutely. In particular, the point that I was talking about at the end, about wanting to build this service where our customers don't have to think about the risk of scaling the system: if we achieved it, that would be very unusual for a database as a service. Some do, but if you think about others, if you've used Amazon's RDS, their relational database service, there are a huge number of complicated decisions you have to make when you're creating and running an RDS instance about the infrastructure and about configuring the database. We think we will have a competitive advantage by taking that risk ourselves, by not pushing it back to the consumer, and while we're pushing some back, we're certainly already trying to take on as much of that as possible.

Participant 1: May I add to that? Do you find that your risk goes down over time, as the open source tools you're consuming get better and better?

Butler-Cole: I guess so. Generally, the ability for people to delegate risk to other people has improved over the last 10 or 20 years, through the emergence of open source, where you can just delegate the writing of a whole bunch of your software to somebody else, and in particular through the rise of both the public cloud providers and software as a service, and those services are only getting better. Certainly, five years ago we wouldn't have been able to outsource our Kubernetes operations risk, and we probably wouldn't even have considered this five years ago.

Participant 2: I have a question about the sort of issues that you encountered using Kubernetes for stateful services. Could you go into more detail about the problems you encountered that could possibly have terminated your job but didn't? How did you mitigate those problems?

Butler-Cole: Unless you're doing something very clever, the state comes down to disks. You just have to make sure that a disk exists and that it's reliable. The problem with using Kubernetes is that there are an awful lot of layers of abstraction between you and your disk. Running Kubernetes on GKE, we're actually using GCE's network disk facility under the covers, so there are actual physical disks sitting in some complex physical arrangement in Google's data centers, then there are GCE's APIs and facilities, on top of that there are Kubernetes' own abstractions, and then we build our own abstractions on top of that.

The biggest challenge is just the number of layers. This stuff is a lot easier now than it was a couple of years ago. A couple of years ago, if you were going to run Kubernetes, you had to make all sorts of decisions about which networking plug-in you were going to use, which storage plug-ins, and so on. Now, using GKE, you just stand it up and it just works, so it's a question of thinking hard about the disk: about the physical disk, and then also logically about the identity of those disks within the cluster, and making sure they all stay lined up.




Recorded at:

Jun 08, 2019