
Develop Hundreds of Kubernetes Services at Scale with Airbnb



Melanie Cebula identifies key problems that make out-of-the-box Kubernetes less friendly to developers, and strategies for addressing them, based on Airbnb’s experience empowering one thousand engineers to develop hundreds of Kubernetes services at scale. She focuses primarily on four problem areas: Configuration, CI/CD, Service lifecycle and Tooling.


Melanie Cebula is an infrastructure engineer at Airbnb, where she works on service orchestration. She loves building systems that make it easy for developers to create and operate their own services securely and reliably.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.


Cebula: I'm really excited about this track, DevX and DevOps. Personally, it's very exciting to me because at least at Airbnb, we did have this historical problem of people building DevX tooling and DevOps tooling separately, but not really unifying those two sets of tooling. That's something I've really been championing at Airbnb. So yes, just seeing this track, bringing these two worlds together for me is super exciting.

I'm going to talk about developing Kubernetes services at Airbnb scale and, what does that look like? Raise your hand if you know what Kubernetes is, if you've heard of it? Keep your hands raised. Who here has used Kubernetes? Who here has used Kubernetes in production? So for their real services? Over 10 services, 50 services, 250 services? All right, some people, a few hands. Okay, so that's kind of the scale we're looking at right now, is rapidly adopting Kubernetes for hundreds of services, some more critical than others, but all of them serving production traffic. So, I'm Melanie, I work on the service orchestration team at Airbnb, and our team's goal is to empower the rest of our engineering team to create, build, and operate their own services at our scale.

A Brief History

First, I'm going to give a brief history of infrastructure at Airbnb. We had our monolith, and who of you saw Jessica’s talk a few days ago? Yes. I'm not going to spend too much time on this. Basically, monolith got kind of scary around 2015, we knew we had to do microservices. Our engineering team was over 1,000 engineers last year and that's also pretty scary. Then I also saw Ben's talk yesterday talking about why microservices, and he's totally spot on about engineering velocity and trying to scale engineering and shipping your org chart, and how do you orchestrate that.

Basically, we need to scale our continuous delivery cycles. Each of these is a service, rather than one big monolith. I really like this chart as well. This is actual data. I can barely fit it onto our slide because of the exponential growth. But this is essentially all of our deploys per week, for all of our projects and all of the environments they're in. The reason this is exponential is both because we have exponentially more services, and also because we have exponentially more environments. And that's actually very interesting to me because we don't want engineers to just test in production or just develop locally in development; we actually want them to test their changes on a suite of pre-production tiers. So we have canaries, staging, load testing environments, dev-type environments, etc.

The other thing you'll notice on this slide is there are big dips around Christmas. So we actually take a holiday break every year because we don't want to ruin Christmas by shipping bad code. I also really like this slide. When we were first working on services, there was a bit of an existential moment where monolith contributions were still going up exponentially, and we weren't sure if we were doing enough to stave this off. But we actually have staved off contributions to our monolith. Yes, we're looking at 125,000 production deploys per year, which is both exciting and very scary. So how do we make that a good experience for our developers, and also keep everything up and running?

The reason I split these slides into why microservices and why Kubernetes, is our decision for microservices predates Kubernetes or any mainstream adoption. You can actually think of these decisions separately. If you go back a decade ago, our infrastructure was not evolved in any way, just like typical startup, no configuration management to speak of. Maybe five years ago, probably a little bit earlier than that, we realized we needed some kind of configuration management and we standardized on Chef. For us, we needed to obviously manage our different services. The decision we came to was basically, this huge Chef cookbook, a series of recipes and all that. The interaction between all these recipes and cookbooks was pretty non-standardized. So it was pretty frequent where someone would modify one recipe, and it would just take down another service when they converged and stuff like that.

So this very complex hierarchy of services, inheriting configuration from each other, was not working for us. Additionally, because it was a mutable infrastructure, someone could SSH into a box and apt-get install some package, and that package was just hanging around. It's not declarative. So things accumulate. It's kind of scary. So we landed on this decision that we needed to re-think our configuration story, and get some additional properties that Kubernetes gets us. And so I'm just going to break down those properties, what made Kubernetes in particular compelling to us compared to other options.

I talked about that declarative, ideal state. So something like we know we want 10 replicas, and so we're declaring that we have 10 replicas. If one of your data centers is struck by lightning and you lose five replicas from that data center, then it knows what to do. It knows that you have 5 and you need 10. Efficient scheduling, just like saving lots of money, because we spent a lot on our hardware costs. We needed something that was extensible. So a nice feature of Kubernetes is it's built with extensibility in mind, so you can kind of make it work for your infrastructure.
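To make the declarative idea concrete, here is a minimal Python sketch of a reconcile step: compare the declared replica count with what is observed and work out the correction. The `Plan` type and `reconcile` function are hypothetical names for illustration; this is not how the Kubernetes controllers are implemented, only the shape of the idea.

```python
from dataclasses import dataclass


@dataclass
class Plan:
    """How many replicas to start or stop to reach the declared state."""
    start: int = 0
    stop: int = 0


def reconcile(desired: int, observed: int) -> Plan:
    # The declared state never changes here; only the observed state drifts.
    if observed < desired:
        return Plan(start=desired - observed)
    if observed > desired:
        return Plan(stop=observed - desired)
    return Plan()


# Ten replicas declared, five lost with a data center: start five more.
plan = reconcile(desired=10, observed=5)
```

The point of the sketch is that nobody tells the system "start five replicas"; that number falls out of the difference between what was declared and what is running.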

And then obviously, Kubernetes uses containers. Containers have portability, immutability, and reproducibility. So having that immutable state where things don't change, just being able to roll back and know that we're rolling back to this particular version of our code. And then YAML, for all of its defects, it is human friendly and it's a standard format, and was definitely better than ERB format, which is what we were using before. So that's how we landed on Kubernetes.

As of a week ago, 50% of our services are now in Kubernetes. This is about the services we know, so several hundred services. About 250 of those are in the critical production path. Some of these are more critical than others. But for example, when you go to the Airbnb site and you look at that homepage, that's a Kubernetes service. When you type into the search bar, "I am looking for homes in London this weekend to stay at," that's also a Kubernetes service. Similarly, when you're booking that Airbnb and making those payments, that's also going through Kubernetes services.

Challenges with Kubernetes

The point of this talk. Kubernetes out-of-the-box worked for us, but not without a lot of investment. There were a lot of issues that we noticed right away and we had to figure out how to solve them. So, number one, the configuration itself is actually quite complex. The tooling you use to interact with your Kubernetes files and also with the cluster is complex. Even though there's a way to integrate with your infrastructure, it's not quite obvious how to do it, especially when you're looking at an infrastructure that's over a decade old at this point. There are a lot of interesting things that can develop in that time. Kubernetes itself has open issues. You can look on the Kubernetes GitHub issues page, some of those issues are quite frightening.

And then Kubernetes, we knew that it would work at the scale that we were at when we started looking at it two years ago, but would it work for us today, or a year from now, or five years from now? It wasn't quite clear if it would work at our scale, because we were looking at other companies and trying to figure out if they were using it at a scale similar to or bigger than ours, and more. But what I want to drive home to you today is that if you have engineers working on these problems, they are just problems at the end of the day, and so they're solvable.

And I just want everyone to know that you are not alone. Lots of people are working on these problems. Just look at the startups in the Kubernetes space, and the number of companies working on open source projects in the Kubernetes space is really exciting and reassuring to me. This is a big bet that the industry is making together. So, yes, we've got to work on it.

What I'm focusing on today are the solutions we have so far for developer tooling. That's both developer experience locally, as well as ops tooling in production. Our solutions include abstracting away the configuration, generating service boilerplate, versioning and refactoring that configuration, building opinionated tooling, and then having custom CI/CD and validation to validate your configuration.

Abstract Away Configuration

Problem number one, abstracting away configuration. Basically, you take Kubernetes files. So I talked about how we have all these environments, production, canary, development. I actually can't even fit all of our environments on this slide, so I didn't try. But we have this N by N grid of environments and the files you need to apply. It's quite a lot, and these files have a lot in common with each other. It's a grid of resources by environments: lots of boilerplate, repetitive by environment. How do we solve that?

The problem is reducing k8s boilerplate. For us, we were using file inheritance with Chef. We decided we did not actually like file inheritance that much after all, so maybe we should use something like templating. We wanted to template YAML files. And there were other things we had to optimize for that maybe you wouldn't have to if you're an average consumer, but we're looking at hundreds of services and over 1,000 engineers at this point. How do we actually make this configuration work through these cases?

In the middle is the Kubernetes files, and then on the left is our internal abstraction. And there's a lot of similarities here like files, volumes, containers, that sounds pretty familiar to you. We also have the project, which is sort of where most of the information is stored about your service. And then the apps, you can think of those as the different workloads. So we call it kube-gen internally. Disclaimer, I think this term is used externally for an open source project. This is not that. This is something we just use internally.

How does kube-gen work? The project or the kubegen.yaml file, sets the params per environment. So on the top right here is the bonk service I've created. There are a few things you'll notice about this file. It has a version, and it has all the environments which I cut off and simplified to fit on the slide. We have the production environment, the staging environments. Production environment has 10 replicas, staging environment has 1 replica, and then all the other files can access these parameters. For example, the workload stanza deployment replica is accessing the replica specified in the kubegen.yaml.

We access those parameters and then we generate them. So when I run my generate command, I get bonk canary, bonk development, bonk production, and bonk staging, etc. It generates all of these lovely Kubernetes files, including admin role binding files, config maps, the deployment, etc. When you do .m.params.replicas, the staging namespace uses 1 replica and the production namespace uses 10 replicas. So we get standardized namespaces based on environments. These are actually all of our namespaces that we're going to be working with, with bonk.
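A minimal Python sketch of that generate step follows. The dictionary layout stands in for the project file, and the `generate` function is a hypothetical illustration rather than kube-gen itself (which is internal to Airbnb); only the bonk name, the two environments, and the replica counts come from the talk. The manifests are trimmed to the fields under discussion and are not complete Deployments.

```python
# Stand-in for the per-environment params in the project file (layout assumed).
PROJECT = {
    "name": "bonk",
    "environments": {
        "production": {"params": {"replicas": 10}},
        "staging": {"params": {"replicas": 1}},
    },
}


def generate(project: dict) -> dict:
    """Return one trimmed Deployment manifest per environment, keyed by namespace."""
    manifests = {}
    for env, settings in project["environments"].items():
        # Standardized namespace: <project>-<environment>.
        namespace = f"{project['name']}-{env}"
        manifests[namespace] = {
            "apiVersion": "apps/v1",
            "kind": "Deployment",
            "metadata": {"name": project["name"], "namespace": namespace},
            # The workload reads its replica count from the environment params.
            "spec": {"replicas": settings["params"]["replicas"]},
        }
    return manifests


manifests = generate(PROJECT)
```

Here `manifests` holds one entry for `bonk-production` and one for `bonk-staging`, each with its own replica count, from a single workload definition.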

One other thing that we ran into was basically trying to reduce the amount of YAML developers had to write. We call that components. If you're creating a bunch of services that use Ruby on Rails, and they use Nginx, you're going to start noticing that you're creating an Nginx config file and Nginx container a lot. You can actually just reuse these components. You define them in one place and then use them in a bunch of other places.

So the common patterns are abstracted into a component, and the component YAMLs are merged into the project on generate. And the components can require or set default parameters. So that's what we came up with internally, but there's a lot of things that have evolved in the open source space since then. So there's Helm, Kustomize and Kapitan. You can choose whichever one works for you. Our project has sort of predated these open source options. So takeaways: reduce Kubernetes boilerplate and standardize on environments and namespaces.
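One way to picture merging a component into a project is a recursive dictionary merge where the project's values win over the component's defaults. This is a sketch under that assumption; the talk does not describe the actual merge rules, and the component contents, the `nginx_workers` parameter, and `deep_merge` are all hypothetical.

```python
import copy


def deep_merge(base: dict, override: dict) -> dict:
    """Merge override into a copy of base; nested dicts merge, other values replace."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# A reusable component, defined once, with a default parameter (hypothetical).
NGINX_COMPONENT = {
    "containers": {"nginx": {"image": "nginx", "port": 80}},
    "params": {"nginx_workers": 2},
}

# The project adds its own container and overrides the default parameter.
BONK_PROJECT = {
    "containers": {"bonk": {"image": "bonk"}},
    "params": {"nginx_workers": 4},
}

merged = deep_merge(NGINX_COMPONENT, BONK_PROJECT)
```

After the merge, the project carries both the `nginx` and `bonk` containers, and its own `nginx_workers` value replaces the component default.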

Generate Service Boilerplate

We started with the first problem, which is too much Kubernetes boilerplate. But we realized we landed on a more powerful idea, which is basically everything about a service is in one place in Git and managed with one process. To me, this is a really powerful idea because once you start storing things in Git, it unlocks so many opportunities for you. And once you start deploying everything with the same process, you have a standardized way of storing and applying configuration.

Here's our bonk service. Everything about a service is in one place in Git. For us, this is the _infra directory. All configuration lives in _infra alongside project code. So you edit code and configuration with one pull request. It's easy to add new configuration. It's statically validated as part of CI and CD. It'll basically support all of this different configuration just by storing it in Git. So bonk is created with one command. It's a collection of config generators. For example, documentation is markdown stored in Git, CI is stored in Git. And it's also a collection of framework specific generators. So like Rails and Dropwizard.

This is the point where I say you can generate any service in any language with our framework, but we only support a paved road of different languages for the framework specific generators. So Rails, Dropwizard, and Node are our paved road of languages. What's important here is you can make the best practices the default. Having a deploy pipeline, having services use autoscaling, having documentation for your services: all of that can be generated out-of-the-box. And you can run the generators individually or as a group. I run a group of them to create the service, but I can run one that just adds documentation to bonk.

And finally, we use the Thor framework which has support for review, update, and commit, so you can actually regenerate your project to update it. I think that's pretty neat. This whole service was generated with one command. So once everything about a service is in one place in Git, you can make best practices the default by generating configuration.

Version Configuration

One other thing I wanted to talk about, is versioning configuration. Storing it in Git is one thing, but you actually should version the configuration itself. Let me dive into that. Why do we version our kube configuration? On the left here, we have the kube.yml version one, and kube.yml version two. Why do we version it? If we want to add support for something like a new Kubernetes version, if we want to change something like our default deployment strategy, like from rolling update to something else, if we want to drop support for something, for example, maybe we realized we're using something that's less secure, and we want to drop support for it now. And then a little embarrassingly, if we know that we've been working on our tooling and some of our previous versions are bad, we know which versions are bad and which services are using it.

Finally, we can actually support a release cycle and cadence, kind of similar to Ubuntu. We have a stable, we have latest, and we have a minimally supported version. Every month, we can sort of update our release and have a new stable. That's actually super powerful for keeping things going and evolving and making sure services aren't just stuck on v0 of our project.

How do we version our kube configuration? One, the version field. So this is bonk.yml. And then we publish binaries for each version. And then we have something called channels which point to binaries. For example, we have the stable channel, and we can generate and apply using the appropriate binary. So kube-gen v29.14 would be used to deploy this project. That's versioning our kube configuration, but we need to version our generated configuration as well. What our project generators generate changes over time, because best practices change. Then also sometimes bugs are found in our generators. For example, the kube.yml generated by the generator at sha1 is different from the kube.yml generated by the generator at sha2. And the generator at sha2 has a bug. So we need to know which services were generated at sha2 in order to find them and fix them.
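The channel lookup can be sketched as a small mapping from channel names to pinned versions. Everything named here is an assumption for illustration: the `CHANNELS` table, the `latest` entry, the binary naming scheme, and the choice to treat 29.14 as the stable version (the talk mentions that version but does not say which channel points at it).

```python
# Hypothetical channels table: channel name -> published version.
CHANNELS = {"stable": "29.14", "latest": "30.0"}


def resolve_version(version_field: str, channels: dict) -> str:
    """A version field is either a channel name or an explicit pinned version."""
    return channels.get(version_field, version_field)


def binary_for(version_field: str, channels: dict) -> str:
    # Hypothetical naming scheme for the per-version binaries.
    return f"kube-gen-v{resolve_version(version_field, channels)}"


stable_binary = binary_for("stable", CHANNELS)
pinned_binary = binary_for("28.3", CHANNELS)
```

Bumping a channel then only means editing the table; projects that name the channel pick up the new binary, while projects pinned to an explicit version stay where they are.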

The way we do that is we just tag the file with information about the generator, the sha, the timestamp, etc. This is one way to solve this problem. But it has pointed us to another thing, which is that we need to make sure we test our generators and the generated code, and make sure that it's working across a wide variety of services. We actually have what we call a verification fleet now, which is extremely automated integration tests that test every time we generate. So we have CI that's just constantly generating new services, getting those services to build and deploy and talk to each other across a variety of our paved road language frameworks. So Rails, Java and Node.
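As an illustration of tagging generated files, the sketch below prepends a comment line carrying the generator name, sha, and timestamp, and reads the sha back so affected services can be found later. The header format and both function names are assumptions; the talk only says that this information is recorded in the file.

```python
from datetime import datetime, timezone
from typing import Optional

HEADER_PREFIX = "# generated-by:"  # hypothetical header format


def tag_generated(body: str, generator: str, sha: str, when: datetime) -> str:
    """Prepend a provenance comment to a generated YAML file."""
    header = f"{HEADER_PREFIX} {generator} sha={sha} at={when.isoformat()}\n"
    return header + body


def generator_sha(text: str) -> Optional[str]:
    """Return the generator sha recorded in the first line, if any."""
    lines = text.splitlines()
    if not lines or not lines[0].startswith(HEADER_PREFIX):
        return None
    for token in lines[0].split():
        if token.startswith("sha="):
            return token[len("sha="):]
    return None


tagged = tag_generated(
    "kind: Deployment\n", "kube", "sha2",
    datetime(2019, 1, 1, tzinfo=timezone.utc),
)
```

With a tag like this, finding every service produced by a buggy generator becomes a search for `sha=sha2` across repositories.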

Actually, one other thing I would have added, is just when you build all this tooling, you need to have really robust integration testing, and a lot of people skip building integration tests, and that's just a bad practice. So if you're building developer tooling, you need to also be a good developer and build good developer tooling for yourself. That's version configuration.

Refactor Configuration

One other thing we landed on is refactoring configuration. So, why refactor configuration over time? Basically, we want services to be up-to-date with the latest best practices. We know what the best practices were before and what they are now, so we can actually automate that transition. For example, if we bumped stable, we know that we want everyone to be using the stable version so we can actually try to refactor everyone to stable. Security patches are huge. Say we realized there's this huge regression in our Ubuntu base image, and we want everyone to bump to the latest Ubuntu base image. So like, we actually have a schedule; we need to have everyone running that patch in like a week. And finally, it's just something that should be automated. I was watching this study, or I read this study from Stripe about some crazy percentage of a developer's time being spent on refactoring their own code and refactoring configuration. I just think it's a big waste of time if it can just be automated for them. So why not do it?

And then finally, we don't want to manually refactor 250 plus critical services or 500 services in Kubernetes. So how do we do it automatically? We have something called refactorator, which is a collection of general purpose scripts. You can just look at the names here. It's like get-repos, list-PRs, refactor, update, close, status. They're all modular, so they're meant to be combined in different ways, so you can close all PRs, create all PRs, update all PRs, etc. Then it covers the lifecycle of a refactor. So what does the lifecycle of a refactor look like?

Basically, you run the refactor itself, which checks out a repo. This is from the point of view of one project, but we actually do this for all of our projects. So check out the repo, find the project in the repo, because we also have mono repos here. Run the refactor job, tag owners, and then create the PR. That's step one. Just create a PR that refactors your project.

Step two, and this is actually a cycle, is update the PR. We have a GitHub bot user that can comment on the PR, remind all of those owners to verify, edit, and merge the PR. And then the owners are expected to actually do this step. But finally, most of our refactors have a lifecycle guarantee, which after seven days, if it hasn't been merged, it'll just merge it on their behalf. Tagging owners is another thing because we store a project.yml which has ownership information. We know how to find owners for our projects, so that's really valuable here.
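The update cycle can be sketched as a small decision function that a scheduled job might call for each open refactor PR. The seven-day window comes from the talk; the function name, the action labels, and the use of plain dates are assumptions for illustration.

```python
from datetime import date, timedelta

MERGE_AFTER = timedelta(days=7)  # lifecycle guarantee described in the talk


def next_action(opened: date, today: date, merged: bool) -> str:
    """Decide what the bot should do with one refactor PR."""
    if merged:
        return "done"
    if today - opened >= MERGE_AFTER:
        # Owners did not merge in time, so merge on their behalf.
        return "merge"
    # Otherwise keep reminding owners to verify, edit, and merge.
    return "remind"


opened = date(2019, 3, 4)
actions = [next_action(opened, opened + timedelta(days=n), False) for n in (1, 6, 7)]
```

In this example the bot reminds the owners on days one and six and merges on day seven.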

So how do we refactor configuration? We have refactorator, a collection of scripts, but what actually does the refactor? That's a refactor job. So refactorator takes a refactor job, and it will run that job for all services. That job is responsible for updating files in _infra. For example, upgrading kube version to stable. That's the example I'm going to run through, bumping the stable version.

Basically, we have a channels file in our tooling docket. That channels file has the stable channel pointing to a version. When we bump that file, this whole process is automated from there. We have a cron job that runs daily on weekdays, and it calls refactorator. Most of the time when it calls refactorator, all projects are already on stable because of our previous refactor, and so it doesn't do anything. It doesn't create any PRs, no changes, nothing to do. As soon as we bump stable, it starts running on projects, and it knows that this project is behind stable. It runs the refactor, which actually creates the PR, updating their version and then handling any breaking changes.
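The "nothing to do unless a project is behind stable" check can be sketched as a version comparison. The dotted version format, the project names other than bonk, and both helper functions are assumptions for illustration.

```python
def parse_version(text: str) -> tuple:
    """Turn '29.14' or 'v29.14' into (29, 14) so versions compare numerically."""
    return tuple(int(part) for part in text.lstrip("v").split("."))


def behind_stable(projects: dict, stable: str) -> list:
    """Return the names of projects whose version is older than stable."""
    target = parse_version(stable)
    return sorted(
        name for name, version in projects.items()
        if parse_version(version) < target
    )


# Hypothetical project versions.
projects = {"bonk": "29.13", "search": "29.14", "payments": "v28.2"}
needs_pr = behind_stable(projects, "29.14")
```

On most days the returned list is empty and the job exits without creating PRs; right after a bump of stable it contains every project that needs one.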

Then we finally have another cron job, which actually handles updating and merging PRs. So we have two cron jobs, one that creates PRs, one that does update and merge. There are other considerations here. If you are going to force merge a PR, maybe don't have that cron job run on Fridays or weekends, and so it's just another thing. My takeaway here is that configuration should be versioned and refactored automatically. This isn't the most exciting thing to build, but it actually really unlocks so much behavior that you want, like repeatable deploys. If we have this configuration and deploy it at sha1, every time you deploy it at sha1, it should have the behavior of being deployed at sha1. So we can't just change our deployment tooling under the hood, and then have it deployed differently, because then we lose that version guarantee.

When you make these changes, it's a little bit more disciplined, where if we change our deployment behavior, people have to bump their project to get that deployment behavior. Similarly, if our change is a breaking change, people have to update to a breaking version change to get that. So it's kind of more communication via configuration with our users. I think that's really powerful.

Opinionated Kubectl

I talked about configuration a lot, and I want to talk about the tooling you use to interact with Kubernetes, which is kubectl. We built something that's basically Opinionated kubectl. We talked about this problem, but kubectl is also kind of verbose and repetitive by namespace. We want to introduce some duct tape to make this easier for people. Here's our duct tape. So our tooling is going to take some opinionated stances, which just makes everything more streamlined. This is stuff that kubectl does not provide out-of-the-box.

Number one, the k tool assumes you're running in your project directory. So you cd to bonk. It takes environment variables as arguments. So, k status ENV=staging, and then it will actually print what it does. What does k status ENV=staging do? It actually calls get pods with namespace bonk-staging. So with standardized namespacing, we're in bonk, we pass staging, and we know the namespace is bonk-staging. All of our commands can basically drop knowing what namespace you're in.
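The translation from `k status ENV=staging` to a kubectl call can be sketched as follows. The function names are hypothetical and the real k tool is internal to Airbnb; the sketch only shows how the project directory and the environment determine the namespace.

```python
import os


def namespace_for(project_dir: str, env: str) -> str:
    """Derive the standardized namespace from the project directory and environment."""
    project = os.path.basename(os.path.normpath(project_dir))
    return f"{project}-{env}"


def status_command(project_dir: str, env: str) -> list:
    """Build the kubectl invocation that a status wrapper would run and print."""
    return ["kubectl", "get", "pods", f"--namespace={namespace_for(project_dir, env)}"]


command = status_command("/src/bonk/", "staging")
print(" ".join(command))
```

Because the namespace is derived rather than typed, every wrapped command can omit it.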

The k tool does a few other neat things. By the way, the k tool started out as a Makefile that we used on the Kubernetes team just to iterate on our Go project, and then it kind of just blew up from there and now we distribute it. So, k generate generates our Kubernetes files. k build will actually build your projects, do a docker build, and then docker push. And then k deploy has some logic, like create the namespace if it doesn't exist. Once the namespace exists, or if it already existed, apply or replace Kubernetes files and then sleep and check the deployment status. All these commands can also be chained with k all. So if you're lazy like me, you can just type this to literally just update your service.
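The deploy logic can be sketched as a function that plans the kubectl commands to run. This is an assumption-laden illustration: the `generated` directory name and `deploy_plan` are hypothetical, and `kubectl rollout status` is used here in place of the sleep-and-check step the talk describes.

```python
def deploy_plan(project: str, env: str, namespace_exists: bool,
                manifest_dir: str = "generated") -> list:
    """Return the kubectl commands a deploy wrapper might run, in order."""
    namespace = f"{project}-{env}"
    steps = []
    if not namespace_exists:
        # Create the namespace only when it is missing.
        steps.append(["kubectl", "create", "namespace", namespace])
    # Apply every generated manifest for this environment.
    steps.append(["kubectl", "apply", "-f", manifest_dir, f"--namespace={namespace}"])
    # Wait for the rollout instead of sleeping and polling by hand (substitution).
    steps.append(["kubectl", "rollout", "status", f"deployment/{project}",
                  f"--namespace={namespace}"])
    return steps


first_deploy = deploy_plan("bonk", "staging", namespace_exists=False)
later_deploy = deploy_plan("bonk", "staging", namespace_exists=True)
```

Returning the plan as data keeps the sketch testable; a real wrapper would print each command and then execute it.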

Finally, the k tool, we also use it as a debugging tool. It's both a development tool and a debugging tool. One thing it does is it assumes when you run commands, that you don't care which pod you're running in. It's any pod and it tries to find the main container and then do stuff there. We had a lot of developers ask us, how do I just SSH into my service? You can actually type k ssh ENV=staging. It knows to go to the bonk container of any pod for bonk staging.

And finally, you can specify a particular pod and a particular container. To get the logs for a container, you just type CONTAINER=bonk or CONTAINER=statsd or whatever it is that you want your logs for. Then you can automate debugging with this k diagnose command, which I'm going to dive into. One thing I wanted to point out on this slide is we do differentiate a main container versus sidecar containers. This is a pattern that other companies are using as well. But there's no official support for it in Kubernetes or Docker. This is something that we landed on that we needed, having a difference between the main container and sidecar containers.

One example of this is, if you have a Datadog sidecar container, and you do a Datadog operation which brings down Datadog, you don't want all of those Datadog sidecar containers to take down your service because it's failing health checks. You want to have a way to say, “This is the main container, this is the important process. If this process is unhealthy, then my service is unhealthy.” Some of these other processes and other containers are less important. We can have graceful degradation where maybe we're having a problem with logging. That doesn't mean that your whole service needs to have a problem.
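The main-versus-sidecar distinction can be sketched as a health rule in which only the main container decides whether the service is healthy, while unready sidecars are reported as degradation. The readiness map and both function names are hypothetical; this is not a Kubernetes feature or Airbnb's implementation.

```python
def service_healthy(container_ready: dict, main: str) -> bool:
    """The service is healthy exactly when its main container is ready."""
    return bool(container_ready.get(main, False))


def degraded_sidecars(container_ready: dict, main: str) -> list:
    """Sidecars that are not ready: worth reporting, not worth failing the service."""
    return sorted(
        name for name, ready in container_ready.items()
        if name != main and not ready
    )


# The metrics sidecar is down but the main process is fine.
readiness = {"bonk": True, "datadog": False, "statsd": True}
healthy = service_healthy(readiness, main="bonk")
degraded = degraded_sidecars(readiness, main="bonk")
```

In this example the service stays healthy and `datadog` is listed as degraded, which is the graceful degradation described above.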

Then, yes, kubectl diagnose is a plugin I wrote recently. It was a collection of bash scripts before, but I kind of learned about this kubectl plugin thing. Who here has heard of kubectl plugins? Not that many people. That or everyone's fallen asleep. Actually it's pretty new. So we have this collection of bash scripts, and I was like, “There must be a better way to do this”. I feel kind of, like, it's cool. It's like a neat tool. But I'm like Kubernetes is all about extensibility. How do you extend kubectl to run commands you want? That's what kubectl plugins are, is custom kubectl sub commands.

I was like, "Okay, that's cool. How do I build a kubectl plugin?" I read this and I was like, "Wow, this is so easy." Basically, it's just a standalone executable file whose name starts with kubectl-. And it just needs to be on your PATH somewhere on your computer. You can build any executable, like a bash script. Then call it kubectl-whatever, and now you have kubectl commands. So that's pretty cool.
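The discovery rule is simple enough to sketch: scan the directories on PATH for executable files whose names start with `kubectl-`. The `find_plugins` function is a hypothetical illustration of that rule, not kubectl's own code.

```python
import os


def find_plugins(path: str) -> list:
    """List executables named kubectl-* in the directories of a PATH-style string."""
    found = []
    for directory in path.split(os.pathsep):
        if not os.path.isdir(directory):
            continue
        for name in sorted(os.listdir(directory)):
            full = os.path.join(directory, name)
            if (name.startswith("kubectl-")
                    and os.path.isfile(full)
                    and os.access(full, os.X_OK)):
                found.append(name)
    return found


if __name__ == "__main__":
    # Show the plugins visible on the current PATH.
    print(find_plugins(os.environ.get("PATH", "")))
```

An executable named `kubectl-podevents` in one of those directories would then be callable as `kubectl podevents`.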

Here's the setup for k diagnose. I needed to intentionally break my bonk service. So, yes, my generator worked, it created a valid Hello World service. Then I changed the command to be heyo, which means nothing. Then I deploy it. And now my pod is in CrashLoopBackOff. So, when you have a pod in CrashLoopBackOff, you might take a few of many steps to debug it. So for me, I might do kubectl get pods --namespace=bonk-staging -o=yaml, pipe that to grep, look for ready: false, kind of search around like that. Or maybe I'll just do kubectl logs, with the namespace, for this pod, for that container. This is obviously before any of my shortcuts.

Maybe there's something wrong with the cluster. Maybe that's why it's in CrashLoopBackOff, I don't know. So I might also get Kubernetes events related to this pod and filter them by whether they're normal or not. So look for un-normal events. Yes, and if you look at this command it's off the screen, I could barely capture it. I never want to type this command ever again. It's kind of painful. So I created kubectl podevents, my very quick plugin that took me 30 seconds to write. It calls kubectl events with my namespace, with my pod, and then looks for any interesting information about it and prints it out.
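The event filtering can be sketched over event objects shaped like the items in `kubectl get events -o json`, where each event has a `type` (such as `Normal` or `Warning`) and an `involvedObject`. The pod name, the sample events, and the `interesting_events` function are made up for illustration; the actual podevents plugin is a bash script not shown in the talk.

```python
def interesting_events(events: list, pod: str) -> list:
    """Keep the events for one pod whose type is anything other than Normal."""
    return [
        event for event in events
        if event.get("involvedObject", {}).get("name") == pod
        and event.get("type") != "Normal"
    ]


# Sample events (hypothetical) in the shape of Kubernetes Event objects.
events = [
    {"type": "Normal", "reason": "Pulled",
     "involvedObject": {"kind": "Pod", "name": "bonk-abc"}},
    {"type": "Warning", "reason": "BackOff",
     "involvedObject": {"kind": "Pod", "name": "bonk-abc"}},
    {"type": "Warning", "reason": "BackOff",
     "involvedObject": {"kind": "Pod", "name": "other-xyz"}},
]

reasons = [event["reason"] for event in interesting_events(events, "bonk-abc")]
```

Only the `BackOff` warning for the pod in question survives the filter.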

That's podevents. But I wanted to build a diagnose kubectl plugin, and basically it's a Go program where you type kubectl diagnose --namespace=mynamespace and it diagnoses your problem. Basically it takes the namespace and figures out the rest. So, yes, I use the Cobra Go CLI library, it's pretty cool. It takes a namespace parameter. Then it just starts cycling through all of your containers and looking for ones that are unready. You also want to cycle through init containers because your pod can get stuck there as well. Then, yes, it just prints out all the last logs of unready containers, and then it uses my podevents plugin to get all of the events and filters out the normal ones. I realized I could have also implemented podevents in the Go program, but I was running out of time and I'm actually just quicker with bash. But yes, you can just call kubectl podevents from your Go program, and then here it is.
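The container scan can be sketched in Python over a pod object shaped like `kubectl get pod -o json` output, whose status lists `initContainerStatuses` and `containerStatuses`, each with a `name` and a `ready` flag. The plugin in the talk is a Go program; this function and the sample pod are only an illustration of the loop it describes.

```python
def unready_containers(pod: dict) -> list:
    """Return names of init and regular containers that are not ready."""
    status = pod.get("status", {})
    unready = []
    # Check init containers first: a pod can get stuck before the main ones start.
    for key in ("initContainerStatuses", "containerStatuses"):
        for container in status.get(key, []):
            if not container.get("ready", False):
                unready.append(container["name"])
    return unready


# Hypothetical pod whose main container keeps crashing.
pod = {
    "metadata": {"name": "bonk-abc", "namespace": "bonk-staging"},
    "status": {
        "initContainerStatuses": [{"name": "setup", "ready": True}],
        "containerStatuses": [
            {"name": "bonk", "ready": False},
            {"name": "statsd", "ready": True},
        ],
    },
}

to_inspect = unready_containers(pod)
```

A diagnose command would then fetch the last logs for each name in `to_inspect` and append the filtered pod events.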

So I'm calling kubectl diagnose --namespace=bonk-staging, and it gets all the unready container info, it gets podevents, and it gets the pod logs for the unready containers. Basically, all three of these say something like, executable not found, what is heyo? What are you doing? So I finally know what's wrong with my service. And this is kind of a contrived example. But there are lots of reasons why our services got unhealthy. Having an engineer just type one command immediately tells you there is something wrong with this service and this is what it is.

Distributing Opinionated Kubectl

One other thing I want to get into is distributing opinionated kubectl. You can actually use a package manager for this. If I want to find, install and manage kubectl plugins or distribute plugins on Linux and Mac, or if I want to share my plugins with others, how do I do that? I found this open source project that's called Krew. It's an unofficial Google open source project. It sounds like brew, I know what brew does. So, it's like brew for plugins, and it itself is a plugin and so you can just install it and start using it. There's actually a master list of plugins, and so once I installed it, I started running. It has a search command so you can search for plugins.

One thing I wanted to look for was, we kind of have this hacky way of interactively debugging with a pod with a sidecar that kind of just like sticks around. It has an interactive shell, and I realized someone else had already created this. I didn't even need to do that. One other thing I was trying to do is create a tool for our developers to SSH in as a particular user. For the majority, we run our containers as a non-privileged user for security reasons. But sometimes people want to debug their containers as a user with some set of privileges. So kubectl makes it really hard to SSH in as a user that is not the user running the program. But in this case, we actually do want to run SSH in as a different user. That actually also exists as a kubectl plugin. I also don't have to write that. I think that is super cool, not having to write things.

The takeaway in this section is that I think you should create opinionated tooling on top of Kubernetes. You can think of Kubernetes, both the configuration and the tooling, the wrapper, the kubectl, as sort of a base platform. You can build your own platform on top of that, which is actually what people, including yourself, interact with. You can automate common K8s workflows with kubectl plugins or with this bash wrapper script.

And the thing that I want people to take away from this section is that this isn't just something that I built for other developers, people who aren't interested in learning Kubernetes. It's something I use myself, it's something all of our infrastructure engineers use. This is actually a useful tool for any engineer, incurious, curious, whatever. That k diagnose command, I use it all the time now, because why would I do it manually? Then finally, you need some way to distribute and discover this custom functionality. You can build your own package manager or you can use one of these open source options.


Now that we have a standardized tool, I want to get into what we get with CI and CD. So we basically run validations, build and doc build and a few other things. We use Buildkite to dispatch our jobs. But our actual CI is something we built in house. Each step in our CI/CD jobs is a RUN step in a build Dockerfile. So, yes, everything is a Dockerfile. Yes, we use Docker to build things. It's kind of confusing. And this Dockerfile that builds your project with another Dockerfile just runs k commands. The commands that you run locally in development are the same commands that run in CI, and are also the same commands that our deploy tool uses.

Everything through development, CI and production, uses the same commands. What I think is really important about that, is that people are using these commands every day as part of their normal development. If something breaks in CI, they know how to test and debug that. Even more importantly, if something breaks in production, they already have their DevOps tool flow. It's the k command again. So when they're kind of freaking out internally, they already know what to do, k SSH, k logs, k diagnose in production as well to debug their service.

One other question I get a lot is, how do you optimize your containerized Java build? We have this huge Java monorepo with millions of lines of code. It's a lot. So, yes, without any optimizations, our builds would probably never finish actually. So we basically have a base Dockerfile. It's updated on a schedule, and it installs everything it needs, exports environment variables, and does a dummy build. That's the important part because it caches our compiler stuff. Then our build Dockerfile just inherits from this base Dockerfile. I think there was another talk that dived into this earlier today; I didn't have the opportunity to attend it. But you could spend a whole talk talking about optimizing your Java build. But definitely taking advantage of Dockerfile layer caching would really help. That's what we do, that's builds.

Deploy Process

Our deploy process looks like something you'd expect. What's really powerful about storing your configuration as code is, the way you do configuration is the way you do code which is develop, merge, and deploy. So we actually have all this configuration. On our left is the Kubernetes configuration and in the middle is our custom configuration. How do we apply that Kubernetes configuration? Basically you apply all the files. There's some weird stuff in Kubernetes. I won't get into it, but sometimes you need to apply, sometimes you need to replace, so we have an algorithm that kind of figures that out. Then we always restart pods on deploy to pick up changes. I think this is really important. The reason we do this is so that if you change something, that change is reflected immediately on deploy. For example, if you change something and it causes issues, you know immediately that it causes issues.

The other strong opinion we take is returning an atomic success or failure state. For example, if one of your containers restarts for 30 seconds at the beginning of a deploy, for us, that's a deploy failure, even if after 30 seconds your service is up and totally fine. So that's another strong opinion we took: your service can't start off its life crash looping. One reason for that is if you have 10 large services, each with 100 replicas that do that, and they all deploy at the same time, you're possibly looking at taking down your cluster because you really don't want 1,000 pods in CrashLoopBackOff at the same time. That's just a scaling problem.
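The "any restart counts as a failure" rule can be sketched as a single pass over pod statuses. This is only an illustration under assumptions: the function name `deploy_succeeded` is hypothetical, and a real deploy tool would poll until a timeout, where this evaluates one snapshot of `kubectl get pods -o json` output. `restartCount` and `ready` are standard container status fields.

```python
def deploy_succeeded(pod_list):
    """Atomic verdict: True only if every container is ready and never restarted."""
    pods = pod_list.get("items", [])
    if not pods:
        return False  # nothing came up, so this cannot count as a success
    for pod in pods:
        statuses = pod.get("status", {}).get("containerStatuses", [])
        if not statuses:
            return False  # containers have not been started yet
        for status in statuses:
            if status.get("restartCount", 0) > 0 or not status.get("ready", False):
                return False
    return True
```

Note that a container that restarted once and is now healthy still fails the check, which matches the opinion described above.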

One way we restart pods on deploy is, we just append a date label to the pod spec, which always changes and so always convinces Kubernetes to restart all of the pods. So that's the Kubernetes configuration. But I talked about our custom configuration, so how do we apply that? We use something called custom controllers, which let you define your own configuration. One of our configurations is aws.yml. You basically use kubectl apply [inaudible 00:39:13] command to apply to the cluster. And, there are three steps your configuration developers need to use, which is creating a custom resource definition, creating it, and then this is the stuff we provide. It's a generic controller that calls a webhook. So for all of our configuration, it's the same kind of controller that calls a webhook. And then for the actual logic, you need to build a service or a function that exposes that webhook. Then that service or function is in charge of applying those changes.
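The date-label trick mentioned above can be sketched as a small transformation on a Deployment manifest before it is applied. It relies on the fact that any change to a Deployment's pod template triggers a rollout. The label key `deployed-at` and the function name `stamp_deploy_time` are assumptions for illustration, not the names Airbnb uses.

```python
import copy
from datetime import datetime, timezone


def stamp_deploy_time(deployment, now=None):
    """Return a copy of the Deployment with a fresh label on the pod template."""
    now = now or datetime.now(timezone.utc)
    stamped = copy.deepcopy(deployment)
    template_meta = stamped["spec"]["template"].setdefault("metadata", {})
    labels = template_meta.setdefault("labels", {})
    # Label values cannot contain ':', so use a compact timestamp format.
    labels["deployed-at"] = now.strftime("%Y%m%dT%H%M%SZ")
    return stamped
```

Because the label lives on the pod template and not in the selector, existing selector labels keep matching while every deploy still produces a new pod template.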

In this case, our security team is in charge of IAM roles. They built an AWS Lambda function that, for a service, will go in and create an IAM role or attach certain policies to it. So they just need to expose a webhook that our controller can call. And by the way, yes, here's a practical use case for serverless, it does exist. And, yes, it could have also been a Kubernetes service. So our controllers don't care, just expose a webhook, and then we'll know how to apply aws.yml, project.yml, documentation, etc.
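To make the "generic controller that calls a webhook" idea concrete, here is a minimal sketch of the forwarding step only. It assumes the controller receives Kubernetes watch events of the form `{"type": ..., "object": ...}`. The payload shape, the function names and the injectable `post` parameter are all hypothetical. Watching the API server and retrying failures are left out.

```python
import json
import urllib.request


def post_json(url, payload):
    """Default transport: POST the payload as JSON and return the HTTP status."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status


def handle_event(event, webhook_url, post=post_json):
    """Forward one custom resource change to the webhook that owns its logic."""
    resource = event["object"]
    payload = {
        "action": event["type"],  # ADDED, MODIFIED or DELETED
        "name": resource["metadata"]["name"],
        "namespace": resource["metadata"].get("namespace", "default"),
        "spec": resource.get("spec", {}),
    }
    status = post(webhook_url, payload)
    return 200 <= status < 300
```

The controller stays the same for every configuration type; only the webhook URL changes, so the receiving end can be a Lambda function or an ordinary service.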

The takeaway here is that when code and configuration are deployed with the same process, it kind of just all follows the same flow. That flow is the developer flow that developers are familiar with. So it's important for deploying changes, but it's also important for rolling back. If you deploy an aws.yml change, and you've messed up and taken away all the perms for your service, you'll know that pretty quickly, and then if you roll back, you'll get those permissions back again. If you use custom resources and custom controllers, that's sort of the way that you get Kubernetes to apply all these custom files. So we're going through the process of getting all of our configuration to be applied this way.


I kind of want to split this out into a separate section, validation. So CI/CD, you're building and deploying your projects, you may have already figured out a way to do this. But have you done this one neat trick of validating all of your configuration? One other strong opinion we took is that configuration should be validated. If we can catch something wrong with your files, we will. We can do that in two different ways. We want to enforce best practices, at build time, and at deploy time.

Here's one example. Basically, we enforce that all projects have a team associated with them, and then that team is valid. If you look at the bonk service - actually, this is kind of hard to read, but - the team that I've assigned to the bonk service is my team, which is info production platform developer productivity service orchestration, and that is a valid team name. When I validate bonk, it passes validation. All of our team names are stored in another repo, and that's the source of truth for team names.

So how do we validate configuration at build time? Well, we just have a global jobs repo. That's where we store all of the global validation scripts. So you just define your global job in the global jobs repo. And then our job dispatcher always dispatches global jobs to every project. From bonk's perspective, it configures some of its CI checks, and that's the build and maybe a few other things, their test suites. I hope they have tests, maybe not bonk, because I just created it right now. Then all the global jobs also just run kind of automatically. The other thing is you can't opt out of these global validation jobs, they're required.

What do we validate? Well, if you have a YAML error, so if you have invalid YAML, we'll catch that. That's pretty helpful. Sometimes there's Kubernetes configuration that we know is wrong. There's something wrong with that configuration. It's caused an incident before or whatever, and we can validate on that. Well, we have released a few bad versions of our own configuration in the past. So if we know that version 25.2 of kube-gen is broken and you're running 25.2, we can catch that. Similarly, with Kubernetes versions that we know have problematic behavior, we can prevent you from deploying those. And just other things, like namespace length.
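A global validation job along these lines might look like the sketch below. The project fields (`team`, `kube_gen_version`, `namespaces`) and the function name are hypothetical, and the real checks live in Airbnb's internal global jobs repo. The 63-character limit is the standard Kubernetes limit for namespace names, which must be valid DNS labels.

```python
MAX_NAMESPACE_LENGTH = 63  # Kubernetes namespace names must be DNS labels


def validate_project(project, valid_teams, broken_kube_gen_versions):
    """Return a list of human-readable errors; an empty list means valid."""
    errors = []
    team = project.get("team")
    if not team:
        errors.append("project has no team owner")
    elif team not in valid_teams:
        errors.append(f"unknown team: {team}")
    version = project.get("kube_gen_version")
    if version in broken_kube_gen_versions:
        errors.append(f"kube-gen {version} is known to be broken")
    for namespace in project.get("namespaces", []):
        if len(namespace) > MAX_NAMESPACE_LENGTH:
            errors.append(f"namespace too long: {namespace}")
    return errors
```

A CI step would fail the build whenever the returned list is non-empty and print the errors so the owner knows what to fix.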

And then obviously, some more internal things that are specific to us. We know what a valid project name looks like. You can't use emojis as a project name. So we'll validate that, and we'll validate that you have a team owner. What do we do for deploy time? Actually, Kubernetes has something already built in for this. And that's admission controllers. So, an admission controller can just stop things from happening. You actually try to persist your API changes and it'll be like, "Nope, not persisting. Not going into that data store."

What we do is, basically, a lot of the metadata is encoded as annotations. And the admission controller just checks for whether those annotations exist. Or it checks if it's missing required annotations. It also rejects specific conditions that it can detect. What do we validate with the admission controller? Project ownership, annotations, whether it has a Git URL annotation, so it has to be stored in Git, it has to have a remote, so you can't just deploy random things from your laptop into production. We have a lifecycle for our configuration. If you're below our minimally supported version, we'll not deploy your project.

Also, these are more gotchas than maybe things we have strong opinions about, but for example, we want to make sure production images are uploaded to ECR, which has certain guarantees that other registries that we have don't. Prevent deployment of unsafe workloads, prevent deployment of development namespaces to production clusters. So standardized namespacing helps us out again. By the way, this is just one of the ways we prevent having development and production interfere with each other. But it's just yet another way we can sort of enforce that they're separate environments and separate clusters.
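The deploy-time checks described above map naturally onto a validating admission webhook. The sketch below builds the `AdmissionReview` response such a webhook returns, assuming the webhook is registered for Pod objects. The annotation keys are hypothetical placeholders, the ECR host check assumes the standard `<account>.dkr.ecr.<region>.amazonaws.com` registry form, and the `admission.k8s.io/v1` API version is the current one rather than necessarily what was in use at the time of the talk.

```python
REQUIRED_ANNOTATIONS = ("example.com/owner", "example.com/git-url")  # hypothetical keys


def review(admission_review):
    """Build an AdmissionReview response rejecting under-annotated or non-ECR pods."""
    request = admission_review["request"]
    pod = request["object"]
    annotations = pod.get("metadata", {}).get("annotations", {})
    problems = [f"missing annotation {key}" for key in REQUIRED_ANNOTATIONS
                if not annotations.get(key)]
    for container in pod.get("spec", {}).get("containers", []):
        registry = container["image"].split("/", 1)[0]
        if ".dkr.ecr." not in registry or not registry.endswith(".amazonaws.com"):
            problems.append(f"image not hosted in ECR: {container['image']}")
    response = {"uid": request["uid"], "allowed": not problems}
    if problems:
        response["status"] = {"message": "; ".join(problems)}
    return {"apiVersion": "admission.k8s.io/v1", "kind": "AdmissionReview",
            "response": response}
```

When `allowed` is false, the API server refuses to persist the object and surfaces the status message to whoever ran the deploy.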

CI/CD should run the same commands that engineers run locally. It should run in a container. Additionally, you want to validate configuration as part of CI and CD. Those validation scripts at build time, the admission controller at deploy time, is really the way you enforce good practices.

Ten Takeaways

10 takeaways. I'm going to leave this slide open, because I know I click through really quickly, but here it is. The 10 takeaways that I wrote individually at different points in the talk. We started with the configuration itself, abstracting away that complex Kubernetes configuration. Then one strong opinion we took is standardizing on environments and namespaces, and that really helped us at different levels, especially with our tooling. Everything about a service should be in one place in Git. Once we store it in Git, we can get all these other things for free. So we can make best practices a default by generating configuration and storing that in Git. We can version it and refactor it automatically, which just reduces developer toil and also gets important security fixes, etc., in on a schedule.

We can create an opinionated tool that basically automates common workflows, and we can also distribute this tool as a kubectl plugin. We can integrate with Kubernetes for the tooling. CI/CD should run the same commands engineers run locally in a containerized environment. You can validate configuration as part of CI and CD. Code and configuration should be deployed with the same process. In that case, that's our deploy process, custom controllers to deploy custom configuration. And then you can just use custom resources and custom controllers to integrate with your infrastructure.

I don't want to make it sound like this is all solved. Today, 50% of our services are in Kubernetes, and our goal is by the end of H1 this year, for the rest of our services to be in Kubernetes. We have a lot of work to do. Just the scale of operating with that many services, we have some services that would run with several hundred replicas, so, that's quite a bit of scale. Moving all our configuration to this GitOps workflow.

Some examples that aren't in this flow yet, but that we want to be, are dashboards and alerting. If you could have your dashboards and your alerting alongside your code, and then we could also generate default alerts, that'd be really neat. So, yes, scaling the cluster, we've got some etcd problems we're working on. We're investigating multi-cluster because we're not going to be able to support everything with one production cluster. Some of our services that have TBD on the migration list are just because they have really high memory requirements. So we've actually already solved this for machine learning jobs, which have really high GPU requirements. They also have their own cluster.

But what about scheduling services with different memory requirements on the same cluster? Do you want heterogeneous cluster with multiple instance types, or do you want separate clusters for different needs? So that's something we're figuring out. Stateful services, tighter integration with kubectl, because that's kind of a new thing. And then, yes, a bunch of other problems that I've been working on and I'm happy to talk more about it. You could DM me, you could talk to me after the chat.




Recorded at:

Apr 18, 2019