Transcript
Guo: When talking about Alibaba, many people may know about the Double 11 event, which is a special promotion that Alibaba hosts for all its vendors each year on November 11th, just like Black Friday in the United States. In 2018, in a single day, the total online shopping sales were about $30 billion. To make a comparison, the 2018 North American Thanksgiving sale was around $3.7 billion. Basically, it's about nine times that revenue, from one company, in one day.
Many people think this is a miracle, including me, and I think it is probably the largest business and technology collaboration in the world. To support this event, the Alibaba infrastructure scales the normal compute capacity by about 10 times, reaching a capacity of about 10 million cores, and the storage subsystem is scaled to support about a million parallel IOPS. Our infra team needs to maintain more than a million containers running more than 10,000 applications. In other words, the Double 11 event is indeed a campaign for the infrastructure teams.
My name is Fei Guo, I'm a senior staff engineer at Alibaba Cloud. I'm working on the cloud container platform team, which is the core team supporting the Double 11 events that I mentioned. My current work focuses on scalability and on building service solutions using Kubernetes. I'm also leading several open-source projects about workload automation and multi-tenancy in Kubernetes.
Here is the outline of today's talk. I'm going to first talk about how Alibaba is moving towards Kubernetes, and the architecture innovation. Before going into the details, I will list several challenges that we faced during the integration, and then I will talk about some solutions to overcome those challenges, including some controllers - by the way, they are all open-sourced - an in-house scheduler, why we built it and how we built it, and a special colocation system, which is essentially a system that allows us to mix online jobs and offline jobs together to achieve high resource utilization. After talking about some contributions we made to upstream during the integration, I will conclude with some lessons that I have learned during the process.
Principles
Before talking about the Alibaba container platform architecture, I want to go back a little bit to discuss some of the design principles behind infrastructure design. I think many of the people in this room are experts in infrastructure, and you may come here from startups, or web scale companies like Google or Facebook, or from cloud providers, like AWS. Let's see if we can come to some agreement on this topic. In my opinion, there are four domains that infrastructure can focus on, based on your needs. The first is utilization, then stability, standardization, and DevOps.
For startups, since they move very quickly and their applications iterate very quickly, it is very important that the infrastructure provides convenient workflows for DevOps. For web scale companies that run their own data centers, the cost of maintaining such a resource pool is so high that resource utilization can be the primary concern. For cloud providers, there's a slight difference, because cloud providers need to provide an SLA to their users, and sometimes breaking the SLA can be catastrophic, so stability may be the most important thing. For some enterprise users who cannot invest too many human resources in infrastructure, standardization is key to keeping the cost of maintaining the infrastructure low.
It is actually interesting; Alibaba has gone through almost all four stages. It was a startup, then grew to a web scale company hosting a lot of large scale applications. Recently, we started the process of moving the entire Alibaba container platform to Alicloud, and we need to provide an SLA to our users. Meanwhile, more and more BUs within Alibaba want to leverage our container platform to run their business, so they ask us to provide standard interfaces. In the end, Alibaba's infrastructure has to consider all four factors.
The Path to Kubernetes
That being said, our story actually goes back to 2010, when we simply assigned dedicated resource pools to different BUs. They ran their own workloads, their businesses grew very rapidly but at different paces, and they implemented their own schedulers. Soon after, we found that having multiple small pools was a big waste of resources as the scale increased. In 2012, we started to use a centralized cluster, under the code name Sigma, to combine all the small clusters into one big cluster with a unified scheduler. The primary goal was to use a single system to mix all offline and online jobs together in order to improve cluster utilization.
It seemed to work well until 2018, when we found the old system very hard to extend because it doesn't support standard interfaces. We started to think about using an open-source solution, and Kubernetes became the natural choice because it has a very large ecosystem. It supports standard APIs, and the entire architecture is very extensible, which can help us realize the stability and utilization requirements of our system. We already have a few large scale Kubernetes clusters in production.
In this figure, I give you a high-level overview of our container platform architecture using the Kubernetes-based approach. Within the red dashed line are the Kubernetes components we built. You will see some terms that look familiar, like the API server and controller manager. All the blocks outside of the red dashed line are Alibaba internal services that we need to collaborate with for this transition, and it is totally okay if you don't know any of them; I will explain. We need to convert our PaaS layer to use the Kubernetes API directly. We need to build a set of controllers to interact with the existing node services for storage and networking. We need to hook the existing systems that already use the Kubernetes API into the new system, and we need to make some changes on the nodes. Basically, we slightly modify the Kubelet and the CRI shims; I will explain later why we need to do that.
We need to build our own CRI plugin and CNI plugin. We need to work with another existing system, which is the system that runs all the batch jobs in Alibaba; I'll just call it MaxCompute Master. We need to make sure these two systems work together; that's the reason we introduced our colocation API server, which I will also explain later. Lastly, an offline placement engine needs to work with the unified scheduler, which is used when we build a site for promotion events like the Double 11 event that I mentioned earlier.
Challenges
Overall, the highlight I want to give you is that an infrastructure upgrade is indeed a very complicated task. It can be challenging in many aspects. For us, there were three major challenges in this infrastructure innovation. I remember a few of you raised your hands, so you also did an infrastructure upgrade to Kubernetes; I bet you may have had problems similar to the ones I'm mentioning now. The first one is scalability. Our clusters typically have 10K nodes and run more than 100K containers. Nearly no one runs Kubernetes at this scale, so we may hit problems that nobody has ever hit before.
The second is feature parity. Our old system has some very critical features that do not exist in Kubernetes. In the new system, we have to provide feature parity; otherwise, we are not even allowed to do the infrastructure upgrade. The third part is operation. Kubernetes lacks standard operation tools, so in practice many people have to implement their own operators to maintain the system. This can be risky, because system upgrade and operation problems can easily drain a lot of resources to fix.
For these challenges, we came up with a few strategies for the integration. Our primary goal is to provide standard Kubernetes pod API support. This is the bottom line that we cannot compromise on; otherwise, there is no point in doing the entire infrastructure upgrade. In terms of feature parity, we first want to make sure that in upgrading to the new system we are not sacrificing utilization, which means our new system has to work with the existing colocation system so that online and offline jobs can run together. We leveraged standard Kubernetes CRDs, controllers, and scheduler plugins to implement all the features we needed to extend upstream Kubernetes. When we designed these new controllers and new CRDs, we put scalability as the first constraint in building the new components.
In terms of operation, we built some internal tools using the data collected from Prometheus, and we built an inspection system to detect all kinds of problems and errors happening in the cluster. Lastly, we tried to keep our codebase upstream native, which means we don't change any core API server code, and for the Kubelet we make minimal changes. The purpose of doing this is that when we want to upgrade the Kubernetes version, the upgrade process will be as easy as possible. This is the strategy that we followed when we did the integration.
Solutions
Next, I'm going to talk about some solutions that we built to overcome these challenges. Let's start with the controllers. Most of our online shopping applications are stateful applications; therefore, statefulsets are widely used in our infrastructure. The upstream statefulset has a few problems. The first problem is that if you want to upgrade the pod image of a statefulset, the controller will recreate a new pod for it. If you want to upgrade the whole statefulset, the controller will upgrade all the pods sequentially in the reverse order of the pod index. The problem with this process is that it can be slow if your statefulset has a lot of pods. In our deployments, statefulsets can have thousands of pods, so an upgrade takes a really long time.
We enhanced the upstream statefulset in two aspects, in a new controller which we call the advanced statefulset. The first thing we did is introduce a so-called "in-place upgrade," which means that if the container image is the only thing that changes in the spec, the controller will not recreate a new pod. Instead, the Kubelet will just restart the container with the new image, and the pod stays the same. The benefit of this feature is that the pod state, including the IP and all the PV configuration, is preserved during the upgrade, and you avoid a lot of unnecessary rescheduling, because starting a new pod means calling the scheduler and going through the whole scheduling path again. With an in-place upgrade, that overhead is eliminated.
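As a rough illustration of the idea (not the actual OpenKruise code), a controller could decide whether an update qualifies for an in-place upgrade with a check like the following Go sketch; the helper name is made up, and it assumes the standard k8s.io API types:

```go
package inplace

import (
	corev1 "k8s.io/api/core/v1"
	apiequality "k8s.io/apimachinery/pkg/api/equality"
)

// canUpgradeInPlace is an illustrative check: an update qualifies for an
// in-place upgrade only if the old and new pod specs differ in nothing but
// container images. Anything else (volumes, resources, env, ...) forces a
// normal recreate-style upgrade.
func canUpgradeInPlace(oldSpec, newSpec *corev1.PodSpec) bool {
	if len(oldSpec.Containers) != len(newSpec.Containers) {
		return false
	}
	// Blank out the image fields, then require the rest of the spec to match.
	oldCopy := oldSpec.DeepCopy()
	newCopy := newSpec.DeepCopy()
	for i := range oldCopy.Containers {
		oldCopy.Containers[i].Image = ""
		newCopy.Containers[i].Image = ""
	}
	return apiequality.Semantic.DeepEqual(oldCopy, newCopy)
}
```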
Another thing that we found, at least in our use case, is that the upgrade order doesn't matter. The upgrade order is something that the upstream statefulset insists on, but it doesn't matter to us. For this, we introduced a new feature called MaxUnavailable. This flag simply specifies the maximum number of pods that can be upgraded concurrently. For example, if you set MaxUnavailable to three, the controller will upgrade three pods at the same time. This can speed up the entire upgrade if the upgrade order is not a big concern for the workload.
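A minimal sketch of how such a budget might be applied when choosing pods to upgrade in one reconcile pass (the function and its arguments are illustrative, not the real controller API):

```go
package rollout

// pickPodsToUpgrade illustrates the MaxUnavailable idea: at any moment the
// controller only lets maxUnavailable pods be down, so it starts as many
// concurrent in-place upgrades as that budget allows, ignoring pod ordinal
// order entirely.
func pickPodsToUpgrade(needUpgrade []string, currentlyUnavailable, maxUnavailable int) []string {
	budget := maxUnavailable - currentlyUnavailable
	if budget <= 0 {
		return nil // no budget left; wait for in-flight upgrades to finish
	}
	if budget > len(needUpgrade) {
		budget = len(needUpgrade)
	}
	return needUpgrade[:budget]
}
```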
Next, we use sidecar containers in almost all our production pods, but upstream doesn't provide a standard controller to manage sidecars. For this reason, we built a new controller which we call the SidecarSet. Basically, it injects sidecar containers into all the pods matching a pod selector at pod creation time, using a standard mutating admission webhook. The SidecarSet controller manages the sidecar containers using in-place upgrades, so if you want to upgrade the sidecar [inaudible 00:15:12] image, you don't have to recreate the pods. The sidecar lifecycle management is independent from your main container lifecycle management. Lastly, if the SidecarSet is removed, all the containers it injected will be removed as a result. This SidecarSet is also very important for another purpose, which I will talk about later: getting rid of the rich container.
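The injection step can be pictured with a small sketch like the one below, which only shows the label-matching and container-appending part of a mutating webhook; the bookkeeping the real SidecarSet does to track and upgrade injected containers is omitted, and the function is an assumption for illustration:

```go
package sidecarset

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
)

// injectSidecars sketches what a SidecarSet-style mutating webhook does at
// pod creation time: if the pod matches the SidecarSet's label selector, the
// declared sidecar containers are appended to the pod spec.
func injectSidecars(pod *corev1.Pod, selector *metav1.LabelSelector, sidecars []corev1.Container) error {
	sel, err := metav1.LabelSelectorAsSelector(selector)
	if err != nil {
		return err
	}
	if !sel.Matches(labels.Set(pod.Labels)) {
		return nil // pod is not targeted by this SidecarSet
	}
	pod.Spec.Containers = append(pod.Spec.Containers, sidecars...)
	return nil
}
```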
One or two days ago, right during KubeCon Shanghai, we formally announced the release of the OpenKruise project, which is essentially an automation suite for workload management in Kubernetes. As a first step, we open-sourced three controllers, including the advanced statefulset and the SidecarSet that I mentioned before. The third controller is the broadcastjob, which you can imagine as a combination of a daemonset and a job. This is our first step. Through this project, we hope the community will send us requirements and use cases, and even contribute in this area to complement upstream Kubernetes. Feel free to give it a try, and please give us some feedback. I hope I can work with you on this project as well.
Next, I'm going to talk about the scheduler. Our scheduler runs in a separate pod, using a scheduler plugin mode. It maintains its own cluster cache, and it has a queue for all the pending pods. There are a few reasons why we chose this architecture. The first one is that when we started the project, the upstream scheduler was not scalable at all. It's getting better now, but at that time it was not scalable, so we decided to build our own scheduler to solve the scalability issues, and we had to implement feature parity with what our old system provides. Running as a separate pod with a scheduler plugin decouples the release cycle, so the release of our new scheduler can be decoupled from the API server release. Lastly, if we somehow introduce a bug in the scheduler, at least we want to make sure the API server is not going to crash because of it. This is why we introduced the new scheduler as a plugin in this effort.
Compared to the upstream scheduler, the biggest difference is that our scheduler is CPU topology aware. This means our scheduler can place pods on specific cores in a node. We do this because we want very strong performance isolation: many of our online shopping workloads have very strict latency requirements and want the highest possible performance isolation. This is a requirement that we cannot deny, so our new system has to provide the same capability. What we do is keep the details of each node's CPU topology in our scheduler cache. Say a pod asks for one core; our scheduler will update the cache to record that this pod is placed on socket zero with one core. If another pod needs two cores, the scheduler will put that pod on another socket of the node and mark those two cores as occupied. This is just the scheduler side.
In order to honor the scheduler decision on the node, the scheduler writes the decision, the bitmask of the CPU set the pod should use, into a pod annotation, and the Kubelet is slightly changed to pass the pod annotation to the CRI shim. After the CRI shim starts the processes that run in the containers, it just calls the standard taskset tool to bind those processes to the CPU set the scheduler specified. In this way, we can make sure the containers in the pods are pinned to specific cores on the nodes. The whole process looks a little complicated, but in fact, the changes to the Kubelet and the CRI shim to support this feature are minor.
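A simplified sketch of the scheduler-side bookkeeping might look like the following; the annotation key, the socket struct, and the helper names are all assumptions for illustration, not Alibaba's actual names:

```go
package cpuset

import (
	"fmt"
	"strconv"
	"strings"
)

// Hypothetical annotation key; the real key used inside Alibaba is not named
// in the talk.
const cpuSetAnnotation = "example.alibaba.com/allocated-cpuset"

// socket models the free cores of one CPU socket in the scheduler cache.
type socket struct {
	ID        int
	FreeCores []int
}

// allocateCores tries to place the whole request on a single socket, as
// described above: mark the chosen cores as occupied in the cache and return
// them so the scheduler can record the decision on the pod.
func allocateCores(sockets []*socket, want int) ([]int, error) {
	for _, s := range sockets {
		if len(s.FreeCores) >= want {
			cores := s.FreeCores[:want]
			s.FreeCores = s.FreeCores[want:]
			return cores, nil
		}
	}
	return nil, fmt.Errorf("no socket has %d free cores", want)
}

// cpuSetValue renders the decision as "0,1,2", the kind of value a CRI shim
// could read from the pod annotation and hand to `taskset -c` (or
// sched_setaffinity) when it starts the container processes.
func cpuSetValue(cores []int) string {
	parts := make([]string, len(cores))
	for i, c := range cores {
		parts[i] = strconv.Itoa(c)
	}
	return strings.Join(parts, ",")
}
```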
Next, I'm going to talk about some algorithm changes on the scheduler side. From the scheduler's perspective, at least in Kubernetes, the scheduler does, in my opinion, a simple job: for a pod, choose the best node to place it on. The typical workflow is something like this. It goes through a bunch of predicate checks across all nodes in the cluster and finds a list of feasible nodes. Then it sorts the list of feasible nodes based on the priority functions and finds the best node to run the pod. The predicate checks are the part that takes the longest. If you have a cluster with 10K nodes, this whole process can take a really long time.
A very straightforward optimization is to divide the node array into multiple regions and start multiple threads to do the predicate checks for each region in parallel, then collect the feasible nodes. In fact, even if you do this, if your cluster has more than 10K nodes - because at most you can start, let's say, 10-16 threads to do the parallel check - it's still slow. Another way to reduce the time is to introduce a flag, the minimum number of feasible nodes to find. Let's say the number is 100. This means the parallel search stops early: once you find a total of 100 feasible nodes, you stop the entire parallel search. In normal cases, this effectively reduces the algorithm complexity from the order of N to order of one [O(N)->O(1)].
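Here is a small, self-contained sketch of that pattern: parallel workers running the predicates and stopping once enough feasible nodes have been found. It is only an illustration of the idea, not the actual scheduler code (which partitions the node array into fixed regions rather than using a shared index):

```go
package sched

import (
	"sync"
	"sync/atomic"
)

// findFeasible runs the predicate checks in parallel and stops early once
// minFeasible nodes have been found, which is what turns an O(N) scan into
// roughly O(1) work on large clusters.
func findFeasible(nodes []string, predicates func(node string) bool, workers, minFeasible int) []string {
	var (
		mu       sync.Mutex
		feasible []string
		found    int32
		next     int32 = -1
		wg       sync.WaitGroup
	)
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				if int(atomic.LoadInt32(&found)) >= minFeasible {
					return // enough candidates, stop early
				}
				i := int(atomic.AddInt32(&next, 1))
				if i >= len(nodes) {
					return // all nodes checked
				}
				if predicates(nodes[i]) {
					atomic.AddInt32(&found, 1)
					mu.Lock()
					feasible = append(feasible, nodes[i])
					mu.Unlock()
				}
			}
		}()
	}
	wg.Wait()
	return feasible
}
```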
I think in August last year, the upstream scheduler actually implemented almost the same idea, but it was not there when we started the project. There is one difference between our scheduler and upstream here: upstream uses round-robin node selection, but our scheduler uses randomization. What I mean by randomization is that for each pod, before we do the predicate checks, we shuffle the node array to completely randomize it. The idea of the shuffle is to distribute the pods across nodes as uniformly as possible; I think the same idea was used in the Borg scheduler. But in some cases we don't do this: if the pod has affinity rules, or it has a specified node, we skip the randomization because it doesn't help.
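The randomization itself is just a shuffle guarded by those two exceptions; a minimal sketch, with illustrative parameters, could look like this:

```go
package sched

import "math/rand"

// maybeShuffle randomizes the node order before the predicate pass so that,
// combined with the early stop above, pods spread roughly uniformly across
// the cluster. Pods with affinity rules or a pinned node name skip this,
// since the randomization buys nothing for them.
func maybeShuffle(nodes []string, hasAffinity bool, nodeName string) {
	if hasAffinity || nodeName != "" {
		return
	}
	rand.Shuffle(len(nodes), func(i, j int) {
		nodes[i], nodes[j] = nodes[j], nodes[i]
	})
}
```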
The next feature we introduced is called group scheduling. The idea is that if two pods have the same spec, they can share the same feasible nodes. What we do is use the hash of the pod labels, pod annotations, and pod spec to form a key, and we use that key to group all the pending pods into groups. Then, for the pods in the same group, we only do the predicate check for the first pod. In this example, we do the predicate checks for pod A and find pod A's feasible nodes. After placing A on one of the nodes in that array, for pod B we use exactly the same feasible node array. Unless it has affinity rules to pod A, it will be placed on another node, based on our spreading strategy.
In this way, we save the time of doing the predicate checks for pod B. In practice, this optimization is very useful, because if you use a deployment or statefulset to manage your workload, most of your pods are created by the controller, so they share the same spec. In fact, quite a few pods can have the same key, and you can save a lot of time on the feasible-host search. Overall, our scheduler can achieve a throughput of about 150 pods per second in a cluster with more than 10K nodes and 100K running pods.
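As a sketch of this equivalence-class idea (the hashing scheme and cache type below are illustrative assumptions, not the actual implementation), the key could be computed like this:

```go
package sched

import (
	"crypto/sha256"
	"encoding/json"
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// groupKey hashes the pod labels, annotations, and spec. Pods created by the
// same controller share all three, so they land in the same group and can
// reuse one predicate-check result.
func groupKey(pod *corev1.Pod) (string, error) {
	h := sha256.New()
	enc := json.NewEncoder(h)
	for _, part := range []interface{}{pod.Labels, pod.Annotations, pod.Spec} {
		if err := enc.Encode(part); err != nil {
			return "", err
		}
	}
	return fmt.Sprintf("%x", h.Sum(nil)), nil
}

// feasibleCache maps a group key to the feasible-node list computed for the
// first pod of the group; later pods in the same group skip the predicate pass.
type feasibleCache map[string][]string
```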
Next, I will talk about the colocation system that I mentioned before. This is the technique we use to run both online applications and offline applications on the same nodes. Currently, all our online applications are using Kubernetes, but all our offline jobs, like batch jobs, are using another scheduling system, which I just call MaxCompute Master. In order to make these two systems work together, we have to resolve conflicts at two layers: the first is the scheduler layer, and the second is the execution layer on the node. At the scheduler layer, we introduce a so-called colocation API server. This is actually just an aggregated Kubernetes API server. It provides an API for users to specify the amount of resources on each node that can be assigned to different types of jobs.
The colocation API server has a full view of the resource allocation of both systems, so it can perform admission control to make sure each type of job doesn't exceed its quota. On the node side, we have a colocation agent, which reads the quota from the colocation API server and adjusts the node's cgroup hierarchy accordingly to make sure the performance of online jobs and offline jobs is isolated. We also use some operating system features; we use hardware cache partitioning to further make sure our online applications are not affected by offline applications [inaudible 00:26:52] scheduled. I'm not going to talk about those details in today's talk.
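To make the node-side idea concrete, here is a minimal cgroup-v1 sketch of what a colocation agent could do with the quota it reads: cap the CPU of the offline job class so batch work cannot steal cycles from latency-sensitive online pods. The Quota shape, the function name, and the cgroup directory are assumptions; the real agent manages a whole cgroup hierarchy plus memory and cache settings:

```go
package colocation

import (
	"fmt"
	"os"
	"path/filepath"
)

// Quota is an illustrative shape for what the agent might read from the
// colocation API server: how many milli-cores of the node each job class
// may use.
type Quota struct {
	OnlineMilliCPU  int64
	OfflineMilliCPU int64
}

// applyOfflineCPUQuota caps the offline job class by writing the CFS quota
// under its cgroup directory (cgroup v1 cpu controller).
func applyOfflineCPUQuota(offlineCgroupDir string, q Quota) error {
	const periodUS = 100000 // default CFS period: 100ms
	quotaUS := q.OfflineMilliCPU * periodUS / 1000
	if err := os.WriteFile(filepath.Join(offlineCgroupDir, "cpu.cfs_period_us"),
		[]byte(fmt.Sprintf("%d", periodUS)), 0o644); err != nil {
		return err
	}
	return os.WriteFile(filepath.Join(offlineCgroupDir, "cpu.cfs_quota_us"),
		[]byte(fmt.Sprintf("%d", quotaUS)), 0o644)
}
```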
Overall, when we mix online applications and offline applications, the node CPU utilization can reach about 40%. Without doing this, the CPU utilization of our online application clusters alone is only about 15%, so this is a big jump in utilization, which is critical for us. As you can see, this is not an ideal solution; it introduces a lot of complexity. It would be much nicer if we could convert MaxCompute Master to use Kubernetes directly, since in my view, that would be much, much easier. But in fact, it's very common in big IT organizations like ours that it takes time to convince other people to switch to a new infrastructure. We have to make some compromises and build such solutions in order to make progress, because we cannot afford any utilization loss during this transition.
Contributions
During the whole process, we did find some issues, and we submitted most of the solutions back upstream. Most of the issues we found fall in the scalability area. For instance, we found that etcd has concurrency, locking, and algorithm issues, and we found that in large clusters with more than 10K nodes, the amount of [inaudible 00:28:36] sent to the Kubelets is a lot. Many of our controllers use [inaudible 00:28:42] and we found that in large clusters, the amount of list and watch events handled by the API server is huge. The scheduler side is generally ok because we built our own. We found scalability problems in all the other three major components.
Here, I've listed some PRs that we sent to fix these problems. For etcd, we submitted a fix for the locking problem. I just want to emphasize one change, which is an algorithm change in etcd. This change can improve etcd performance by about 24 times in the large-scale data scenario. In the API server, we added indexing for the pod list operation. This optimization improved the pod list operation by about 35 times, and the solution will be out very soon. We also added bookmarks to the watch operation, so watcher performance is much better when the API server is restarted. We also fixed a Kubelet heartbeat problem. It is interesting that the heartbeat problem that I mentioned before was later fixed by upstream, and we cherry-picked the fix into our build. Overall, I think this is a nice story about working with the open-source community: we contributed together, and we benefited each other.
Lessons Learned
Next, I want to talk about some lessons we learned during the whole process. The first is that sometimes we have to make good trade-offs and compromise with the older systems. In the case of the colocation system I mentioned, it was very important to come up with a compromise solution. But in other cases, we may have to insist on asking the old system to change. One example is the so-called rich containers used in much of the Alibaba infrastructure. In fact, it is a monolithic Java system. The container starts with systemd, and it has the application and [inaudible 00:31:01] other services in one big, giant container. People typically work with rich containers in this way: they SSH into the container and call a script to start the application, and then a stop script to stop the application. This is the way they operate the rich container.
The problem with rich containers is that they do not follow the [inaudible 00:31:27] operating model. The application lifecycle is completely decoupled from the pod lifecycle, which causes a lot of trouble for us in maintaining rich containers. During the process, we decided to change the way the upper layer organizes the apps. We use one container to run the app, move all the auxiliary services into a bunch of sidecar containers, and then use the SidecarSet that I mentioned before to manage those sidecars. We use the standard container API to specify the LivenessProbe, and the scripts to start and stop your application. In this way, the entire pod lifecycle management becomes much, much cleaner.
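Using the current k8s.io/api types, one plausible mapping of that SSH-and-scripts workflow onto the standard container API might look like the sketch below; the image, script paths, port, and probe endpoint are all illustrative:

```go
package appcontainer

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// appContainer sketches the replacement for SSH-ing into a rich container:
// the start script becomes the container command, the stop script becomes a
// preStop lifecycle hook, and the health check becomes a liveness probe, so
// the kubelet drives the application lifecycle instead of a human.
func appContainer(image string) corev1.Container {
	return corev1.Container{
		Name:    "app",
		Image:   image,
		Command: []string{"/app/bin/start.sh"},
		Lifecycle: &corev1.Lifecycle{
			PreStop: &corev1.LifecycleHandler{
				Exec: &corev1.ExecAction{Command: []string{"/app/bin/stop.sh"}},
			},
		},
		LivenessProbe: &corev1.Probe{
			ProbeHandler: corev1.ProbeHandler{
				HTTPGet: &corev1.HTTPGetAction{
					Path: "/health",
					Port: intstr.FromInt(8080),
				},
			},
			InitialDelaySeconds: 30,
		},
	}
}
```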
Some concerns. During the process, I think scalability can be a problem, but it is not a hard problem, because the entire community is trying to solve the scalability issues. In fact, operations and upgrades can be more risky. Therefore, we built a lot of tools for analyzing and inspecting the entire cluster. We did upgrade our codebase from Kubernetes 1.10 to 1.12 successfully, but the process was not that smooth. We still got some of our users complaining that all their containers were suddenly restarting, and they didn't know why: “I didn't do anything, why are my containers all restarting?” At first, we didn't know why either. After some analysis, we found it was because we were upgrading the API server, and in the new version the pod spec API changed. The pod signature then changed, and the Kubelet thought the pods needed to be recreated. The problem is that this is unexpected behavior, and our users were not prepared for the container restarts.
Overall, I think Kubernetes is a big project - I think it has about 1.3 million lines of code - and even we are still learning. During this learning process, some unexpected behavior can be risky, at least in my opinion. In summary, I think moving to Kubernetes is not as hard as it looks, although it is complicated work, because using CRDs, controllers, and plugins can ease the integration effort, and Kubernetes is able to manage large clusters. Our in-house experiments show that the latest etcd throughput can reach about 5000 QPS with 100 clients doing parallel reads against a million random KV pairs. The scheduler throughput can reach about 150 pods per second in a large-scale cluster with more than 10K nodes and 100K running containers.
The last thing I want to point out is: just try to stick to upstream, which means do not fork. This means we should keep the core Kubernetes API server as intact as possible. We should honor the declarative API and the controller pattern for doing extensions. We should also honor the Kubelet and the CRI for running the containers. Of course, if you stick to upstream, every few months you have to pay the cost of rebasing your code onto it, but I think that's definitely worth the effort, because in the end you will have an entire community supporting the system that you are running. With that, I conclude my talk, and I'm happy to answer any questions.
Questions & Answers
Participant 1: You mentioned with your scheduler you were able to raise your utilization from 15% to 40%. Do you regard that 40% as an upper bound that you're shooting for, or do you have ambitions to go beyond that?
Guo: In fact, if you look at it, first, we are not using a unified Kubernetes for all workloads. There are two systems: one system runs the online workloads, and one system runs the batch workloads. If you think about the online workload pattern, it's busy at noon or at night, when people are doing their shopping, but it's nearly idle at other times, when people are sleeping. The offline jobs can leverage that time window to do their work. Our offline jobs can tolerate failures and longer execution times; once the online jobs get busy, the offline jobs have to be [inaudible 00:36:59]
If you look at it in this regard, the 40% is pretty much bounded by the online jobs, because as I said, they have strong requirements on the resources they want. They even require us to assign them to specific cores, and they want to occupy the full core. If for some reason they don't configure the core number in the right way, the utilization cannot be pushed to a higher value. So we have a lot of constraints. If you want to bring the utilization to an even higher value, it needs something more than the scheduler, more than the infrastructure. We need to tell everybody who uses the system to set the resource reservation they really need. That is a bigger effort than the infrastructure upgrade itself. This is the story behind the 40%. Reaching that 40% is really hard work, in fact. I think even Google's utilization is about 50%. That is my story; if I reach 50%, I'd be really happy.
Participant 2: Yesterday, Uber presented Peloton, an orchestration platform like yours. How do you compare your solution to Peloton?
Guo: I think I need to learn more about the Peloton before I answer your question. I would like to discuss this with you offline.
Participant 3: You mentioned that you're colocating the batch containers with stateless containers. What are you doing for the batch containers for the disk I/O isolation?
Guo: I didn't mention that because some of the solutions are very Alibaba-specific. We have our own OS teams; they have some QoS features implemented in the storage stack layer, and I don't know the details. But this is something we don't do in the scheduler layer or in the Kubernetes layer; it's fully in the OS layer, because we somehow have to interact with the operating system to provide the performance isolation. On the CPU and memory side, we typically just play with the cgroup settings, but for networking and storage, they make some changes on the driver side.
Participant 4: The other question I had was are you doing any resource sharing, elastic resource sharing in your scheduler? What are you doing for the quota management?
Guo: For now, we don't do very sophisticated quota management, because we typically have big pools, and we typically honor your request as long as the pool has capacity. That's it. In general, for running a business like this, our infrastructure team just tells the business guys we are out of budget, we need more machines, and we will get more machines. But in fact, that's a very good question. Nowadays, because the cost is so high, we are thinking about very careful quota assignment to the BUs, and sometimes we may provide solutions to third-party users, not BUs. For those customers, you have to enforce the quota. We do have quota controllers inside, which I didn't mention today. This is a separate CRD, purely for quota purposes, and with it you can do very fine-grained quota management. We can discuss this offline, if you like.
Participant 5: You mentioned that to maintain feature parity, you made the scheduler CPU topology aware, and in your scheduler you maintain a cache of the CPU topology of all the nodes. My question is, how often do you update your cache? How do you deal with cache inconsistency?
Guo: That's a very good question. Cache consistency is key, because there are two things. First, you have to feed the cache from the pod informer, because every time the API server has a change on a pod, you need to update the cache accordingly. For now, at least, we completely rely on the stability of the pod watch. If there is a network issue and we miss one pod event, we will probably introduce an inconsistency in our cache, so we need to periodically reconcile to make sure the cache is consistent. But the consistency requirement is not that strict, because in general, network problems that cause us to miss one or two pod events are rare, and we can somehow tolerate this kind of misinformation. As long as we have a mechanism to back this up, to resync every 10 minutes, something like that, I think we are fine.
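In client-go terms, that safety net is essentially a shared informer with a resync period; a minimal sketch (the 10-minute value just mirrors the ballpark mentioned above) could look like this:

```go
package schedcache

import (
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
)

// newPodInformerFactory sketches the periodic-reconcile safety net: the pod
// informer delivers incremental watch events, and the resync period makes it
// periodically re-deliver everything in its local store to the registered
// event handlers, so a downstream cache (such as a scheduler cache) can be
// reconciled on a schedule instead of trusting that every event was seen.
func newPodInformerFactory(client kubernetes.Interface) informers.SharedInformerFactory {
	factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
	factory.Core().V1().Pods().Informer() // instantiate and register the pod informer
	return factory
}
```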
Participant 6: [inaudible 00:43:22]
Guo: Yes. This is a very common perception. The question is about Double 11 - for such important events, if you have a problem with the scheduler, then the pods cannot start and the service is down. This is [inaudible 00:43:42] right? This is a very common perception. I asked the same question of the scheduler guys: “If you guys don't do a good job, how do you handle Double 11?” In fact, for Double 11, think about it: mostly, the pressure is on the online applications, and they've already been launched, let's say, 10 days before the event. The actual pressure on the scheduler is minimal during the Double 11 event; it stresses the rest of the system, but not the scheduler.
Participant 7: Actually, a lot of the talk here is about the group scheduler, and also topology-aware scheduling. In the open-source community, there are projects driving this. One is driven by the [inaudible 00:44:32] scheduler, and another one is driven by SIG Node, where I am the tech lead. I really encourage you guys to upstream and contribute to those things, because there are many proposals made by the community, and the only reason they're not happening is just a lack of contributors. I think a lot of people attended the earlier keynote speech and the talk about leadership. In the open-source community, we lean on that leadership; someone stands up and says, “Oh, here it is, I want to supervise a certain project…” I just wanted to follow up.
Participant 8: You guys maintain quite a lot of clusters, right?
Guo: Yes.
Participant 8: How similar are all those clusters, and how do you schedule between them? How do you decide that a given workload's going to run on one cluster versus another?
Guo: We currently don't have a multi-cluster scheduler, like the federation stuff. We don't do that at this moment. The way I'd answer a question like this is that typically, our clusters are big. You rarely get a case where you want to deploy your application across two clusters. We solve this problem by just building big clusters, to work around the need to spread things across clusters.
Participant 9: In your algorithm, you mentioned that you do a randomization over all the nodes, and then find the feasible nodes. What do you do for the bigger containers? This will actually lead to a fragmented cluster, right? Because you would not be able to place the bigger containers, since once you're randomly choosing the nodes, you're actually fragmenting the cluster.
Guo: A few things. At least currently, our scheduler doesn't have the notion of preemption, rescheduling, these kinds of things. Here I need to solve two problems. The first is that if the cluster is fairly empty, the load is likely very low. I need to be fast, so I use randomization, because if the entire cluster is kind of empty, this won't introduce too many fragmentation issues. Then, I have another optimization I didn't mention here: I constantly check the allocation rate of the cluster. If I find that the cluster allocation rate is quite high, basically close to full, fragmentation can become an issue, so I just don't do the randomization. I fall back to first come, first served behavior to reduce the chance of the exact problem that you mentioned. But I skipped those details in this talk. I have some thoughts about this, but this is a hard problem for a scheduler.