Facilitating the Spread of Knowledge and Innovation in Professional Software Development

Write for InfoQ


Choose your language

InfoQ Homepage Presentations Panel: Kubernetes at Web Scale on the Cloud

Panel: Kubernetes at Web Scale on the Cloud



The panelists discuss what they have learned scaling their own workload in the public cloud. Topics include capacity and workload management, security integration, and homegrown PaaS integration.


Harry Zhang is Tech Lead of the Cloud Runtime Team @Pinterest. Ramya Krishnan is Staff Site Reliability Engineer @Airbnb. Ashley Kasim is Tech Lead of the Compute team @Lyft.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.


Watson: I'm Coburn Watson. I'm Head of Infrastructure and Site Reliability at Pinterest. I'm the track host for the cloud operating model. Each of the panelists has experienced running Kubernetes large scale on the cloud.

We have three panelists from three companies. We have Ashley Kasim from Lyft. We have Ramya Krishnan from Airbnb. We have Harry Zhang from Pinterest. Obviously, you've had a journey at your company, and your career to Kubernetes running at large scale on the cloud. If you could make a phone call back to your former self when you started that journey, and say, "You should really take this into account, and it's going to make your trip there easier." What would be the one piece of advice you'd give your former self. Feel free to take a little time and introduce yourself as well and talk about what you do at your company.

Kasim: I'm Ashley. I'm the Tech Lead of the compute team at Lyft. Compute at Lyft is basically everything from Kubernetes in an operator webhook, all the things that run on Kubernetes wire, all the way down to AWS and Kernel. It's a space that I'm involved with. Looking back, what started as a Kubernetes journey back in I think circa 2018, with the planning starting in 2017. I think that the one thing to consider carefully is this huge transition for us it was from just like Amazon EC2 VM based to containerizing and move to Kubernetes. Just to think about when you have orchestrating infrastructure, it's different than building this large deployment of Kubernetes from scratch. Think carefully about like what legacy concepts, or other parts of unrelated infrastructure too that you're planning on bridging, what you're planning on reimplementing, fork better in Kubernetes, and what you're going to punt down the line. Because, spoiler alert, the things that get punted, it sticks with you. Then also, like when you decided that, we're just going to adapt something, you can quickly get to this realm of diminishing returns, where you're just spending too much time on this adaptation. I think how you build that bridge is as important as the end state of the infrastructure that you're building.

Krishnan: I'm Ramya. I'm an SRE with Airbnb. I've been with Airbnb for about four years. We started our Kubernetes journey about three years ago. Before that, we were just launching instances straight up on EC2. I was back then managing internal tools that just launch instances. Then about three years ago, we started migrating everything into Kubernetes. What would I say to my past self? Initially, we put everything in a single cluster the first one year, then we started thinking about multi-clusters. We were a little late in that, so think that you are going to launch about 10 clusters and then automate everything that you're going to do for launching an instance. Terraform, or any other infrastructure as code tool, use them efficiently. Do not do manual launches and use ad hoc startup scripts because they are going to break. Think about how you're going to split deployments across availability zones and clusters. Customers are going to be ok about splitting deployments across three availability zones. These are the two things I would tell myself about three years ago.

Zhang: I'm Harry Zhang from Pinterest. Currently, I am the Tech Lead for the cloud runtime team. My team builds a compute platform leveraging Kubernetes, which serves a variety of workloads across Pinterest, different organizations. My team solves problems related to container orchestration, resource management, scheduling, infrastructure management and automation, and of course, compute related APIs, and problems related in this area. I'm personally a Kubernetes contributor. Prior to joining Pinterest, I worked at LinkedIn data infrastructure and focused on cluster management as well.

Pinterest started the Kubernetes journey also about three years ago in 2018. If I have something to tell my past self about it, I would say that Kubernetes is a very powerful container orchestration framework. Power really means opportunities to you, as well as responsibility to you as well. Letting things grow wildly may lead to a fast start, but will slow you down very soon. Instead, I would suggest to my past self to take an extra step to really think carefully about how you architect your system. What are the interfaces? What are the capabilities you want to provide within the systems and to the rest of your businesses? What is the bridge you're going to be building between your business and the cloud provider? If you use Kubernetes wisely, you're going to be surprised to see how significant a value it can bring to your businesses.

How to Manage Multi-Cluster Deployment and Disaster Recovery

Watson: How do you manage multi-cluster deployment and disaster recovery?

Kasim: On the whole, multi-cluster somewhat gets into like cluster layout as well. For us, for our main clusters, which are like all the stateless services, we have them like one cluster per AZ, and they're supposed to be entirely independent and redundant. The theory is that we should be able to lose one AZ, and everything will be fine. We use Envoy as a service mesh so that customers just drop out of the mesh, and the other clusters will scale up to pick up that slack. We haven't really experienced AZ outage instances, but we have in staging had bad deployments or something that go to a single staging cluster and take it out. We've found that because we have these independent and redundant clusters, we can just basically go so far as to like blow away the cluster that gets broken or whatever, and just re-bootstrap it. I think keeping your failure domain very simple, logically has worked well for us, and having this redundancy so that you can afford to lose one.

Zhang: I can share some stories. We run multi-cluster architecture in Pinterest as well. Starting in 2020, we also get into those single cluster scaling problems, so we started a multi-cluster architecture with automated cluster federation. Just like what Ashley talked about, we also do single zonal clusters, and we have zonal clusters, and we have our regional API entry point for people to launch their workloads. The Zonal cluster brings very good and very easy ways for people to isolate the failures, and the blast radius to zonal or sub-zonal domain. We have zonal clusters and one or more clusters in each zone. Our federation control plane is going to be taking the incoming workloads and split them out into different zones, and to take the zonal health, zonal load into its smart scheduling decisions. If a zone goes into a crappy state or something, we have human operational or automated ways to cordon the entire zone and to load balance the traffic to the others and healthy zones. We do see a very big of a value for the cross-zone multi-cluster architecture that we can bring to the platform with a better scalability and easier operations, and many more.

Krishnan: Right now we don't do zonal deployments. That's something that we're going to strive next quarter.

How to Split Kubernetes Clusters

How do you split the Kubernetes clusters? We don't split it by team. When a particular cluster reaches a particular number for us, we've picked an arbitrary number of 1000 nodes, we mark the cluster as full and not available for new scheduling, new deployments. We stamp out Terraform changes to create a new cluster. It's not split by team, the cost attribution happens by CPU utilization and other stuff and not by the cluster. Cost attribution does not happen at the cluster level. That way the teams are split across multiple clusters.

Watson: Do you have anything else you want to add on that one of how you break apart Kube Cluster workloads?

Kasim: Let's tease it a bit differently, where it's not like we have this cutoff and then on to the next, we instead roughly chunk out clusters based on their intended purpose. For stateless services, and it's all around like interruptibility, since we find that that is like, how well does a Kubernetes delimiter? We have stateless services on our redundant core clusters that are per AZ, and each one of them is a replica of each other that basically deploys to five targets instead of one. Then we have a stateful cluster, which is a little bit more high touch because it's for stateful services that don't Kubernetes very well. Then we have dedicated clusters for some of the ML or the ETL long running jobs stuff, so it's kind of different interruption policies. We just found that the thing that we split on is just like interruptibility, which works well for batching things into different maintenance schedules.

Zhang: In Pinterest, we provide Kubernetes related environment in multi-tenanted setups. We run them mixed off the long running and stateless workloads, and also batch workloads, such as workflow and pipeline executions, machine learning, batch processing, and distributed training and all the things to that. We treat the Kubernetes environment as an entire we call federated environment, which is a regional entry point for the workload submissions and executions, including new workload launches, and workloads updates. We totally hide the details of the zonal clusters, which we call the member clusters for executing the workloads away from our user. We've built a layer of federation sitting on top of all the Kubernetes clusters that is responsible for workload executions, and there's smart scheduling decisions and dispatching logics and updates to the workload executions. Also, of course, the zonal cluster workload execution statuses will be aggregated back to our regional API endpoint, and people can know how their workloads are executed, or what are the status they have from the compute perspective?

Spot and Preemptible Workloads

Watson: I know internally at Pinterest, we've dealt with this. At a previous company I did where basis, let's make it efficient, use Spot. I know capacity management is one of the challenges with Kubernetes. Let's say you're on Amazon, you basically become the EC2 team. Everybody wants it when they want it. I'm interested to answer that question particularly because efficiency is always a concern. Does anybody have any experience with running using either Spot or preemptible workloads internally?

Kasim: Spot has been something that we've been very interested in. Currently, pretty much most of staging, it's all Spot because interruption was a concern. This is again, something that works very well for stateless service and doesn't work so well for stateful services, or like long running jobs. One of the things we've looked at is with interruption of batch using like the minimum lifetime, where you guarantee the runtime of like one hour or something. Then just using annotations to run lower priority jobs or jobs that we expect to be done in an hour on those things. I think it's less of a limitation of Kubernetes for Spot, but more like, what do your workloads look like? In general, Kubernetes favors interruptibility, so the more interruptible you can make things, and using things like checkpointing, the more Spot friendly your clusters will be.

Krishnan: We are also trying to introduce Spot for staging and dev clusters. We'll probably launch that by the end of this year.

Zhang: In Pinterest, we also have a great interest about the Spot. Not only Spot, to me we call it opportunistic compute. We have teams plan their capacities for sure, every once a while, but businesses can grow out of the bound, or people just want to have things to execute it opportunistically. Opportunistic computing to me has two categories. One is those provided from the cloud provider, which is Spot Instances directly. The other part is like all the reserved capacity from the company that is not actually currently being used. Those capacities can be aggregated as a whole pool of resources for those workloads who can afford opportunistic computing, that can tolerate interruptions and retries, such as your dev, your staging workloads, or any other batch processing that is not very high tiered. Currently, we are exploring the possibilities of both categories. We do have some experimental results inside and we try to move forward with those.

Watson: When I was at a previous job at Netflix, and we had a container runtime, Kubernetes was not there yet. We actually had to use the term internal Spot to get people to adopt that concept of, we have hundreds of thousands of instances, 97% are covered under reservations. Don't worry about it, like launch it, and we'll make sure that you're running in some compute area. Because that nondeterministic behavior of Spot became a pain. At Pinterest, we create blended cost rates, so we roll together on-demand and RIs to normalize cost for people, because we find that most individuals can't even control their on-demand very well on a large shared account.

Service Mesh for Traffic Management

Do you run Istio or another mesh in your clusters for traffic management?

Zhang: In Pinterest, we have our traffic teams who have built very internal mesh systems based on Envoy, we don't use Istio, but we have our Pinterest specific mesh systems to connect all the traffic.

Krishnan: Here at Airbnb we have our traffic team work with Istio. We are just migrating from our internal legacy traffic service discovery into Istio based mesh system.

Kasim: Envoy was born at Lyft, so we of course use Envoy. We don't use Istio, we have our own custom control plane. Actually, we had an older control plane before we moved to Kubernetes. That was, for the longest time, probably the most basic thing out there. We actually had to rewrite and redesign it for Kubernetes to keep up with the rate of change of infrastructure in Kubernetes. We use Envoy as a service mesh, and we are very happy with it.

Starting with Mesh from Day One on Kubernetes

Watson: Thumbs up if you think people should start with mesh from day one on Kubernetes regardless of the mesh solution, or it's too much overhead.

Kasim: It depends on where you are coming from. We started with mesh beforehand, so would have been hard to rip out that stack. Then for us, it's very important to not run any overlay networks. We needed something to do that [inaudible 00:19:00].

Krishnan: Definitely don't run overlay network. Setting up Istio and mesh requires a considerable investment from your traffic team. If they are ready to undertake that as you migrate to Kubernetes might be a lot to ask from your traffic team. It depends on how much time and investment bandwidth you have on you.

Zhang: I would echo what Ramya said as well because mesh could be a complicated system, and it really depends on what is the scale of your businesses and how complex your service architecture is. If you only have a very simple web server plus some backends plus some storage, probably like mesh is too much overkill. If you have a lot of microservices and want to communicate with each other, and like Ramya said, your traffic team is able to take the big responsibilities with all the communication and traffic, probably there's more values in mesh.

Watson: I know at least in Pinterest we have mesh. Like you said, Ramya, if you have a traffic team you can put in the cycles, that's really important. Given our huge architecture outside of Kubernetes, it's like trying to replace someone's skeleton. It's pretty painful. What I've seen people do is there's a capability of a mesh you get, maybe you use mTLS, and you have secure traffic communication, so you try to find that carrot that out the gate, they get that. Yes, going back to people and saying, in all your spare time, why don't you adopt mesh on everything? It's a painful composition.

Kasim: For us, it's the other way out where we already had that skeleton of the service mesh, and was putting the Kubernetes on top of that.

Managing the Kube Cluster for Multiple Teams

Watson: How do you manage the choice to have Kube cluster for one or multiple teams when you see that a team needs more services than the number of their apps? Does anybody have a perspective on that?

Krishnan: If a team has more services, just split them across clusters. I'm a strong believer of don't put too many eggs in a single cluster. If a team has too many services, or too many important services, don't put all your level zero, subzero services in a single cluster.

Zhang: To us, currently for the dedicated cluster use cases, we evaluate very carefully. Because we are putting this investment into a federated environment, we want to build on one single environment that is very horizontally scalable, and easy management. We try to push people onto this big compute platform we built. However, within the compute platforms we do have those different tiers of resources, different tiers of quality of services we provide to our users, so if people really want a level of isolation. When people talk to you about the isolations, we usually ask them, what is the exact thing that you're really looking for? Are you looking for the runtime level isolations? Are you looking for control plane isolations, or you simply want to get your clusters because you want to control? More clusters and sporadic clusters across the company may bring you extra burdens into supporting them, because Kubernetes provides you with very low level abstractions, and it can be used in very creative or a diverged way. This could be hard for people to support in the end. Currently, in Pinterest, the direction we want to push people is to guide people to the multi-tenanted and the big platform we are using. As we clean up those low hanging fruit and moving forward, I do see the potential that some people really need the level of isolations. Currently, it's not the case here.

How to Select Node Machines by Spec

Watson: How do you select the node machines by spec? In your clusters, do all the nodes, are they the same machine type? I assume in this case, we're talking about like EC2 instance type, or so.

Krishnan: All this while, we had a single instance type C5, 9xlarge. Now we are going multi-instance type this second half. Now we have moved to larger instances. Now we have added GPU instances. Now we have added memory instances all in a single cluster. We have Cluster Autoscaler that scales up different autoscaling groups with different node types. It works most of the time.

Zhang: In Pinterest, we also have very diverse instance profiles, but we try to limit the number of instances, which means like user cannot arbitrarily pick the instance types. We do have a variety of different combinations like compute intense, GPU intense, or those that have local SSDs, or different GPU categories. We do have those different instance types. We try to guide our user to a way to think about the resources that, what exactly is the resources you're using? Because sometimes when we bound the workloads to the instance types, we can suffer from the cloud provider outages, but if you try to get your workloads away from the particular instance types, there are more flexibilities at the platform side for the dynamic scheduling to ensure the availabilities. There are our workloads that wants the level of isolations, or their workload is tuned to particular resource categories or types and they pick their instances, or they tune their workload sizes to exactly that instance type so they get scheduled onto that instance. In Kubernetes, we do manage a variety of instance types, and we leverage those open source tools, like Kubernetes itself can manage those different instance types scheduling, and autoscaler as well to pick whatever instance groups to scale up as well.

Kasim: A limiting factor, I think, [inaudible 00:25:44] is the Cluster Autoscaler. It only allows node pools with homogeneous resources. The way that we get around this, because we do want to hedge against things like AWS availability, and like being a fallback to other instance types in the launch template. Because we just have formal pools that are just sewn together via just like the same labels on them so that the same workloads can schedule there, and they don't really know where they're going to end up. This allows us do interesting things that are maybe beyond the scope of what autoscaler is able to do. For example, we were interested in introducing AWS Graviton instances, which is the Arm instances now being supplied by AWS, and Cluster Autoscaler doesn't handle this whole architecture concept very gracefully. We ended up just using AWS launch templates to have multi-architecture launch templates. Then the autoscaler just doesn't know the difference, and just like boots nodes to either arch, and so we prefer Arm because the price is better. We can fall back to Intel still if we run out of Arm since demand is high and so we don't want to get squeezed out of instances.

Watson: Amazon just keeps releasing instances, so it's something we'll deal with.

Preferred Kubernetes Platforms

There are many Kubernetes platforms like Rancher, OpenShift, which one do you prefer and what makes them more special when compared to others?

Kasim: We use just like vanilla Kubernetes, that we self-manage ourselves on top of AWS. I know that cloud providers all have their own special offering out there like EKS, GKE. Part of this is historical concern. When Lyft started the Kubernetes journey, EKS was not really as mature as it is now. We weren't comfortable running production workloads on it. Probably if we're doing the same decision process today, we might take a closer look at some of the providers. I think the key thing is all about the backplane. If you manage your own, run your own backplane, you have a lot of control, but also a lot of responsibility. Debugging SED is not fun. There are many ways to completely hose your cluster, if you do something, cut SED the wrong way. There is a tradeoff where if you don't want to deal with operating the backplane, upgrading the backplane, then looking at a managed provider even just for backplanes may make sense. On the other hand, if you need to run a lot of very custom codes, particularly like patching a GET server code, then probably you want to host your own just because you can't really do that with most providers. A consideration for Lyft in the beginning is that we wanted that debuggability of being able to actually get on to that API server and tailor logs and run commands, that you just weren't comfortable having to file a ticket with some provider somewhere. That's just consideration as well.

Watson: In Pinterest, we have a similar journey of evaluating things, and we're constantly evaluating the question about what layers of abstractions we would like to offload to the cloud providers. For Kubernetes, particularly, the majority of the work we've been spending on is to integrate this Kubernetes ecosystem with the rest of the Pinterest. We have our own security. We have our own traffic. We have our own deploy systems. There are a lot of things that we need to integrate, and also metrics and logging, all the things we need to integrate with the existing Pinterest platform. At the end, the overhead of provisioning clusters, as well as operating clusters compared with all the other work is not that significant. Also, like to have our own clusters, we have more control over the low level components. We can quickly turn around with the problem, just like Ashley described before, to provide our engineers with a more confident environment of running their applications. Currently, up to now, we're still sticking with our own managed Kubernetes clusters.

Krishnan: We have a similar story. We started our Kubernetes journey about three years ago. At that time, EKS was under wraps, and we evaluated it, and it did not meet any of our requirements. Particularly, we wanted to enable beta flags, feature flags, for example, port topology spreaders and a beta flag at 1.17, and we have enabled it in all our clusters. We cannot enable such flags on EKS. We also run to a patched scheduler and patched a API server, we cherry picked bug fixes from future version and patched our current version API server and scheduler, and ran it for a couple of months. We feel that if we just use EKS, we may not be able to look at logs and patch things as we find problems because we cannot do all this. We are little bit hesitant about going to EKS. If we have to reevaluate everything right now we may make a different decision.

Watson: I'll just double down on what Harry said. We had conversations a few months back with the EKS team, because much like the question of, why are we not using EKS? If you take your current customer, you say, here's the Amazon Console, go use EKS. They say, where's my logs? Where's my dashboards? Where's my alerts? Where's my security policies? It's like, no, that's our PaaS. That's about what 80% of our team does is integrate our environment to our container runtime. When we talk to the EKS team, we're like, how do you guys solve and focus on user debuggability and interaction? They say, here's a bunch of great third party tools, use these. I think it's that tough tradeoff of the integration.

Managing Deployment of Workloads on Kubernetes Clusters

How do you manage deployment of workloads on Kubernetes clusters? Do you use Helm, Kube Control, or the Kubernetes API?

Krishnan: We use a little bit of Helm to stand up our Kubernetes clusters itself. For deployments of workloads, customer workloads, we use internal tools that generate YAML files for deployment and replica sets. Then we just apply those during runtime. We just straight up call the API server and apply these YAML files. That's what we do.

Zhang: In Pinterest, we have a similar way of abstracting the underlying compute platforms away from the rest of the developers. After all, Kubernetes is one of the compute platforms sitting inside of Pinterest that the workloads can deploy to. We have a layer of compute blended APIs sitting on top of Kubernetes, which abstracts all the details away, and our deploy pipelines just call that layers of API to deploy things.

Kasim: Yes, similar at Lyft. Hardly any developers actually touch any Kubernetes YAMLs. There is an escape hatch provided for people who want to do custom or off-the-shelf open source stuff, but for the most part, service owners manage their code. Then there's this manifest file that has a YAML description of general characteristics of their service. Then that feeds to our deploy system. Our deploy system generates Kubernetes objects based on those characteristics. It applies those to like what clusters it should be deployed to based on if it's a stateless service, in a Kubernetes cluster. I think that helped smooth our transition as well, where developers didn't actually have to learn Kubernetes. It just preserved this abstraction for us. I think another way of looking at this is like this is somewhat CRD-like, not necessarily yet. It's a similar concept where there's just one layer above all of it that can raise objects for developers to interact with.

StatefulSets for Stateful Services on Kubernetes

Watson: StatefulSets are great for stateful services on K8s, do you find that that's the case, or is there some other mechanism used to support stateful services?

Kasim: StatefulSets work well for small stateful components, the big problem that we have for developers is if you have StatefulSets that have hundreds of pods in them, it'll take a day for you to roll your application. Developers using them will be like, "Developer velocity. I can't be spending a day to roll out a change. Yet, it's not really safer for me to roll faster, because I can't handle more than one replica being done at a time." We've looked at a variety of things. There's been many misuse of StatefulSets, meaning people who just didn't want things to change ever, but actually were in danger of data loss or something, in case of a replica going down. Just making sure everybody who's using a StatefulSet really needs to be using a StatefulSet.

Then looking at StatefulSet management strategies like sharding, or looking at third party components. We have some clusters running the cruise controller here, which runs some extensions on the built-in StatefulSet object, like clone sets, and advanced StatefulSets, which just are basically StatefulSets with more deployment strategies. They're basically breaking down StatefulSets into shards, and these shards can all roll together so that you can roll faster. That has helped address some of the concerns. There's also a lot of issues there. Bugs with Nitro and EBS, and just volumes failing to mount, which could go smoother and [inaudible 00:36:56], actually ends up taking your node and cordoning it, and then you have to go and uncordon it. Yes, stateful I think is one of the frontiers on Kubernetes, where I think a lot more can be done to make it work really smoothly at scale.

Krishnan: For our stateful there are very little StatefulSets in our infrastructure, Kafka and everything else is still outside of Kubernetes. We are right now trying to move some of our ML infrastructure into StatefulSets, into Kubernetes. The problem is we cannot rotate instances in 14 days. They are very against killing pods and restarting pods. We are still in conversations about how to manage StatefulSets and still do Kubernetes upgrades and instance maintenance. It is a big ask if you're just starting out.

Zhang: Pinterest does not run StatefulSets on Kubernetes. I was working at a previous company in LinkedIn, I worked in data infra, and my job is to do the cluster management for all the online databases. A couple very big challenges at a high level to upgrade StatefulSet, one is about the churn. How do you gracefully shut down a node? How do you pick the optimal data node started to upgrade them? How do you move the data around? How do you ensure that all the data shards have their masters on? If you wanted to do leadership election and to do a leadership transfer for different replicas of the shard, how do you do that efficiently and gracefully? This would involve a lot of deep integrations with the application's internal logics to achieve that. Also, just like Ramya said, to update the StatefulSets, we needed to do very smooth and in-place upgrades of the things, and if you put other side cars inside the pod, if you just shut down the pod and bring it up, it will finally converge, but it's about the warm-up overhead. It's about the bigger risk that the top-line success rate of all the data systems is going to be having a dip during the upgrade. There are a lot of challenges that's not very easily resolvable.


See more presentations with transcripts


Recorded at:

Jul 30, 2022