
Global Capacity Management through Strategic Demand Allocation


Summary

Ranjith Kumar discusses the abstractions and guarantees presented to service owners under global capacity management, the design and implementation for managing workloads across tens of regions, categorizing and modeling different demands, and achieving global capacity management by shifting demand across regions.

Bio

Ranjith Kumar works on building systems that enable fungible capacity management. He's passionate about distributed systems @scale, and in the past has worked on autoscaling, capacity management, cluster management, and more. Currently, he's focused on enabling Meta's global capacity management across geo-distributed data centers.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Kumar: I'm Ranjith. I work in capacity management. Specifically, my goal has been around how we can make capacity management fungible within a region and across regions. I've been working in distributed systems for the past seven years, and I'm always fascinated by how the problems continue to evolve over time. I'm here to talk to you all about global capacity management: what it is, why we need it, how we achieve it and transition from the current way we manage our capacity, and what our longer-term vision is.

History of Capacity Management at Meta

A little bit of history of capacity management at Meta. This is not a design talk about how we got here, but I hope it gives some context of our landscape. We have a single regional cluster, which means we manage our capacity contracts at the regional level. This also means that various intents and properties are often stored per region. For service owners, this means they have to manage and keep context of different regions, availability, and so on. It also allowed us to build a perpetual region-wide resource allocation system that satisfies the capacity requirements for the various regional contracts we do have. The advantage is that we are super fungible within the region, service owners can be mission agnostic within the region, and infrastructure has the flexibility to abstract away infrastructure-related operations like buffer maintenance, system maintenance, hardware refresh, and so on. While we abstracted some toil away, these fundamental systems provided the capability for service owners to be mission agnostic within a region. What happens when there are more regions?

Service Architecture

From a service management perspective, we have a large microservice architecture consisting of thousands of services. It's like a few big whales and lots of little fishes. There are a lot of shared platforms used by multiple different products. We have many different products, like Facebook, Instagram, and so on, and many services are shared across a bunch of them rather than designed for one particular use case. Looking into some of the dependencies, we also notice that the call depth of the service dependency chain can sometimes go beyond 10. That shows the scale of the problem we are dealing with. In such scenarios, the growth and placement of our services and capacity becomes challenging, and it impacts others on the dependency chain as well. Whenever services and regions grow, this leads to a lot of local optimizations, because the growth of one service impacts another, given how much is shared. For example, feed services are present wherever the Facebook web frontend services are. Feed cannot arbitrarily get capacity; rather, its growth is tied to the web frontends. This means that sometimes when we think of growth, it might not be possible in the regions we are thinking of. This is a problem for infrastructure as well, because when we order capacity for the future, it is very constrained, since it is tied to how the various demands and products intend to grow. From an infra management perspective, utilization across our regions and data centers is not well balanced, which means we are losing money on idling resources. It also means our buffer costs more, because the imbalance means we need a higher disaster readiness buffer and operational buffer.

Right now, we have 21 data centers. Over the coming years, our scope will continue to grow. At some point on this trajectory, the operational toil from managing services through per-region contracts would become intolerable. For the company, from an efficiency perspective, we need to operate our data centers more optimally to get the most out of them. This challenge prompted us to take a step back and think about how we can approach this more holistically, both reducing toil for customers and achieving better placement for the infrastructure. It made us think from a more global perspective. What if we could abstract things further, so that our customers are not only mission agnostic within a region but across regions too, and we are able to think, operate, and optimize placement globally? The vision is to think of the world as a computer. I'm sure many of you have read "The Datacenter as a Computer." If you haven't, the idea is to treat the data center as just a pool of resources. What if we could think of the world as our computer, and look at and utilize the resources across all our data centers? This gives us a lot of fungibility and availability for our systems. It also allows service owners to reduce their toil and focus on the expectations they have of the infrastructure, and it allows us to make better infrastructure decisions. Getting there involves a lot of challenges.

The biggest problem is changing our trajectory from how we have been operating, thinking regionally, to thinking globally, at the scale we are currently operating. It all starts with understanding the service spectrum we have. Looking at some of our services: if you use Facebook or Instagram on your phone, there are parts of the product where we expect a much faster response. For instance, when we search for something, or when we respond, we want a much faster response from the product. These are typically latency-sensitive services. However, parts of the product are also latency-tolerant. For instance, when we upload a video, it's ok if it takes a few minutes. If someone comments on your photo, maybe the notification arrives a minute late, or things like that. These are typically latency-tolerant services. For latency-tolerant services, there is not a lot of restriction on how and where we place them. The upstream and downstream dependencies are less restrictive, allowing us to adopt a global traffic pattern and an arbitrary regional capacity distribution, which means we can place those services and move them without impacting reliability or the actual workloads running on them. This is not true for latency-sensitive services. We have very tight requirements for both our upstream and downstream services, so we cannot arbitrarily move them around. We have to adopt regional traffic routing for all the services in the chain, along with regional capacity modeling. Any capacity changes we make, or any capacity we want to provide for growth, need to be strategically allocated rather than arbitrarily placed.

For one category of services, we have an easier way to provide global capacity management, so let's look deeper at how we solve this for latency-sensitive services. The insight is that we need to know what is actually driving each service; we need to understand our demand. Knowing what a service is driven by allows us to make better decisions for it. Once we know the service and its demand, and are able to attribute that demand, we can better understand the colocation and geographic requirements for the service. This fundamentally allows us to operate these services from a global capacity perspective. The process by which we do this is called demand shift. Demand shift is essentially understanding the various traffic sources we have by categorizing our demand, and coming up with ways to move and place that demand that take into account the global value rather than a local value. It is much harder for a single service to know all this and change both its infrastructure and its placement. Doing this at the traffic level, at the demand level, means we automatically move everyone who is impacted by the demand and continue to iterate towards a more global placement. Effectively, on one end, we need attribution to understand our demand sources. Attribution alone is not enough; we also need the ability to model our services. Without modeling, we won't be able to figure out how to safely move this demand, because ultimately the safety of the services and the reliability of our products come back to impact us. Constraints and optimization functions are what we use to tune the demand shift to the purpose we are looking for. We will go a little deeper into that with an example later.

Attribution (RPC Tracing, and Demand Sources)

Starting with attribution, we rely on RPC tracing to understand our products. Traffic typically enters through the web frontends. These entry points are our products; in this simple example, they are Facebook and IG traffic. After entering through the frontend, the application service does some business logic and starts issuing RPC requests to tens of services to perform other actions, which then continue to fan out further. We typically end up with a large reach. This is where, as I mentioned, we see call depths of up to 10 levels in some scenarios. The average fanout of services at each level is around 100. This complexity exists at scale because of the sheer number of demand sources we have, and the number of shared platforms used between different services. Demand isn't necessarily just external; it can also be internal. Once we sample our RPC traces, we attribute each trace to a product and follow it all the way through its propagation. In this example, you can see there are some services, like I mentioned, specifically serving one particular product, like IG and foo. There are also services like web and feed that cater to two different products. In such scenarios, understanding this attribution gives us the capability to know what is driving a particular service, and how much of that service's capacity is tied to each product.
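As a minimal sketch of this attribution idea, assuming sampled traces arrive as call paths tagged with the product that entered at the frontend (the trace format, service names, and numbers below are hypothetical stand-ins, not Meta's actual systems):

from collections import Counter, defaultdict

# Each sampled trace: the product at the frontend entry point, plus the
# services touched as the request fanned out.
sampled_traces = [
    ("facebook", ["web", "feed", "ranking"]),
    ("instagram", ["ig", "feed", "media"]),
    ("facebook", ["web", "feed"]),
]

per_service = defaultdict(Counter)
for product, services in sampled_traces:
    for service in services:
        per_service[service][product] += 1

# Fraction of each service's sampled work attributable to each product.
for service, counts in per_service.items():
    total = sum(counts.values())
    print(service, {p: round(c / total, 2) for p, c in counts.items()})
    # e.g. feed -> {'facebook': 0.67, 'instagram': 0.33}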

Once we are able to perform this tracing, we can construct a call graph that represents the entire view. Most of these RPCs stay within a region, because, as we saw in the previous categorization, we have latency-sensitive workloads, which means these requests typically do not leave the region once they enter it. Similar to the example we saw before, feed needs to exist where web is, because it is more expensive for the RPC to exit the region than to stay within it. This is what we categorize as a demand source. A demand source essentially provides us two values. The first is a proportional relationship between the capacity of a service and the units of independently shiftable demand. In this example, not only do we know the traffic distribution of the various demands, but we also know what percentage of the capacity is impacted by a specific demand. The second is the full capacity footprint required to place a single unit of demand. If I want to move IG traffic from region A to region B, this tells me, for a single unit of traffic, what capacity I need, where I need to place it, which services are impacted, by how much, and so on. As for the properties for categorizing these demand sources: they need to be independently placeable, and they must be mutually exclusive and exhaustive. They generally do not escape a region; if they typically escape the region, that again goes back to the particular demand being latency-tolerant rather than latency-sensitive.
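As a sketch of those two values under the linear traffic-to-capacity assumption; the services, numbers, and units here are invented for illustration:

# Capacity attributed to the "facebook" demand source in region A, per service,
# derived from tracing plus profiling (illustrative numbers, in servers).
attributed_capacity = {"web": 800, "feed": 400, "ranking": 200}
demand_units = 1000  # e.g. requests/sec of this demand landing in region A

# Value 1: proportional relationship -> capacity footprint per unit of demand.
footprint_per_unit = {svc: cap / demand_units for svc, cap in attributed_capacity.items()}
print(footprint_per_unit)  # {'web': 0.8, 'feed': 0.4, 'ranking': 0.2}

# Value 2: full footprint needed to place some amount of this demand elsewhere,
# e.g. moving 100 units from region A to region B.
to_move = 100
print({svc: per_unit * to_move for svc, per_unit in footprint_per_unit.items()})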

Service Modeling

Once we have the demand attribution, we need to model the services tied to that demand. Not only do we need the demand, but in this case we need the ability to model the IG service, or foo, or web, to tie it back to our solution. The goal of service modeling is to understand the optimal distribution of a service's capacity across the regions it is present in, considering its demand. This is challenging mainly because many services are driven by multiple demands. Like I said, some demand is internal, some is external. You could have a service where 90% of the load is driven by one major product use case, but the other 10% is driven by others. This is often the challenge in trying to model services. The way we approach modeling has two parts: we understand the baseline of the service, and leveraging the baseline, we construct a model for the service. The baseline itself consists of two parts, a demand baseline and a capacity baseline. For the demand baseline, we rely on RPC requests to construct the call graph that tells us which products are driving each service. Once we identify those product gateways, we also know their current distribution. For the capacity baseline, we rely on profiling to understand the peak usage of each service and attribute it to the corresponding traffic driving it. In the example we saw before, for feed we would understand what percentage of feed's capacity is tied to the traffic driven by web and the traffic driven by IG. We are assuming a linear relationship here between traffic and capacity. The baseline can include multiple demand metrics, but the challenge is to pick the metric that, A, reflects the workload on the service, and, B, reduces the modeling error.
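As a rough sketch of the capacity-baseline step under that linear assumption: fit per-unit capacity coefficients from observed traffic and peak usage per region. The metrics and numbers are hypothetical.

import numpy as np

# Rows = regions; columns = demand metrics for the products driving the service
# (e.g. Facebook RPS, Instagram RPS). Illustrative data only.
traffic = np.array([
    [1000.0, 200.0],
    [ 600.0, 500.0],
    [ 300.0, 900.0],
])
peak_capacity = np.array([860.0, 730.0, 780.0])  # observed peak usage per region

# Least-squares fit of capacity ~ traffic: one coefficient per demand metric.
coef, _, _, _ = np.linalg.lstsq(traffic, peak_capacity, rcond=None)
print(coef)                            # capacity needed per unit of each demand
print(peak_capacity - traffic @ coef)  # per-region residual the model can't explain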

Going a little further into the capacity model: the service model predicts the amount of capacity we need in each region to support the traffic we get in that region. We are assuming the product traffic is fungible across regions: if you are getting some traffic in region A and need 100 instances to support it, moving that traffic to a different region should cost us the same.

That is one of the assumptions we take into account in this model. Here we are suggesting a linear model, where the total capacity for a service in a given region is based on the capacity needed for all the products it is serving. You can also see there is a model residual which we add here. The residual often arises because, in some cases, the linear assumption between traffic and capacity does not hold. In such scenarios, there is a gap in our modeling, and this shows up as a modeling error. There are also other capacity imbalances that lead to residuals. Our current placement of services often stems from back-of-the-envelope calculations or historical ways of thinking about and managing capacity. It is not an optimal global placement; services are often either underprovisioned or overprovisioned. This excess capacity imbalance is also reflected in the modeling. The solver has multiple capabilities for dealing with those residuals: there is always the option to solve them immediately, balancing those residuals as part of the plan, or to handle them in a separate action.
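Written as a sketch (the notation here is mine, not the talk's), the linear model with its residual is:

C_{s,r} = \sum_{p} a_{s,p} \, T_{p,r} + \epsilon_{s,r}

where C_{s,r} is the capacity of service s in region r, T_{p,r} is the traffic of product p landing in region r, a_{s,p} is the capacity of s needed per unit of p's traffic (from the capacity baseline), and \epsilon_{s,r} is the residual capturing nonlinearity and existing over- or underprovisioning.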

Model Regional Demand Shifts

With both the demand attribution and the capacity modeling, we can move towards generating plans. To generate plans, we need to think of both constraints and optimization functions. The constraints are often derived from our service modeling; we need to model both service and traffic in the same way. We take into account two constraints. The first is a hard constraint: the net capacity we allocate in each region cannot exceed the available supply in that region. This ensures that whatever placement plan we come up with is feasible. The second is that the capacity demand for each service is met. This is a soft constraint, because it is possible that some services are overutilized, and they can remain overutilized. The plan doesn't change the overall utilization of the service; rather, it balances your utilization across regions. Technically, you're not using this to drive your utilization up, but your average utilization across the regions could end up much higher. We also have the capability to leverage future demand, which is another reason this is a soft constraint.
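In the same sketch notation (mine, not the talk's), with x_{s,r} the capacity the plan allocates to service s in region r, S_r the available supply of region r, and D_{s,r} the modeled demand, the hard constraint is

\sum_{s} x_{s,r} \le S_r \quad \text{for every region } r,

and the soft constraint is x_{s,r} \ge D_{s,r} for every service and region, allowed to fall short at a penalty rather than making the problem infeasible.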

With respect to the optimization objectives, these help us maximize the global benefit. There are three objectives we particularly consider. The first is service capacity balancing: we want the capacity for each service to be balanced across regions. This is essential to reduce buffer cost; if a service is unbalanced, we pay more for disaster readiness and other buffers. The second is unused capacity balancing. You can think of it as available buffers. This is also essential, since if our capacity footprint varies across regions, the corresponding buffer needs to be proportional to the regional size. If you have 50% in one region and 30% and 20% in others, your buffer should reflect the same proportions to withstand placement errors or other disruptions. The third is a stability objective. While it is great to come up with a plan to shift demand, we should be mindful of the downstream impact and the toil of thrashing services. There is a lot of operational cost that comes with such shifts, so it is essential to keep the stability objective in mind and avoid back-and-forth thrash. Once we have all this, it is fed into a solver; it essentially becomes an assignment problem. We leverage a solver to produce a placement plan that covers the various demand sources. Like I said before, at our scale and with the products we support, we have many different demand sources; there is no single demand source that rules them all. We need the capability to come up with an optimal placement on a continuous basis, and it depends on the various constraints we saw, like the available supply, the capacity modeling, and so on. The output of this assignment problem is what we call a demand shift plan. Once the plan is generated, we also need to execute it end-to-end in a way that is both safe and efficient.
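A toy version of that kind of problem, hedged heavily: a single service, a balancing objective, and the supply constraint, solved with an off-the-shelf linear programming solver. The real system covers many services and demand sources at once, and all the numbers below are made up.

import numpy as np
from scipy.optimize import linprog

supply = np.array([120.0, 80.0, 100.0])  # available supply per region (hard constraint)
demand = 210.0                           # total capacity the service needs
n = len(supply)

# Variables: x_0..x_{n-1} = capacity placed per region, t = max capacity in any region.
# Objective: minimize t, i.e. balance the service across regions.
c = np.concatenate([np.zeros(n), [1.0]])

# x_r - t <= 0 for every region (defines t as the maximum).
A_ub = np.hstack([np.eye(n), -np.ones((n, 1))])
b_ub = np.zeros(n)

# sum_r x_r = demand (modelled as a hard equality in this toy example).
A_eq = np.concatenate([np.ones(n), [0.0]]).reshape(1, -1)
b_eq = [demand]

bounds = [(0, s) for s in supply] + [(0, None)]
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds, method="highs")
print(res.x[:n])  # e.g. a balanced placement close to [70, 70, 70]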

Strategic Demand Placement

Let's walk through an example of a demand shift plan to get some idea. Consider a very simple use case, to simplify what the solver would spit out as a demand shift plan. In this case, we are considering a single demand source, Facebook; there are no other demand sources. It's split between two regions, 60% and 40%. We have three services: web, feed, and bar. These services are driven by the product; you can see how much they are driven by it below: 80% in region A and 75% in region B. We have available capacity in region B. In this example, our optimization function is going to be something like: balance these regions. I have 60% in region A and 40% in region B. The solver takes into account that there is available supply in region B, and it can come up with a plan to move things around. Essentially, the plan distributes the available capacity to region B, because that region has to withstand the new traffic it will receive. We distribute the capacity in accordance with the traffic, again using the linear proportionality between traffic and capacity. Once the capacity is distributed, we are able to size up the services and start shifting the demand. We can see the various services size up, and we can start shifting the demand.

Demand is a zero-sum game: traffic goes either to one region or to another. The moment we shift the traffic out of region A, we no longer need the same capacity we once had there. We can shrink the capacity in region A, creating available capacity in region A, and thus reclaim that capacity at the end of it. We started with a footprint where our capacity was 60/40 and we had some available capacity in one region. We did the shift, where we actually moved the capacity without teleporting servers in a FedEx truck or something. This is the capability of the system. In this case, our optimization function was to balance the traffic, but the real capability is making these decisions as tradeoffs. Maybe we wanted available capacity in region A to support an out-of-region hardware refresh, or maybe we needed that capacity to support a product launch. This is about the ability to make these decisions and move our demand by controlling the demand, rather than having just balancing in mind. This is what we mean by a globally optimal plan for the infrastructure rather than local service management.
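To make the arithmetic of that example concrete, here is a toy walk-through with made-up absolute numbers; only the 60/40 split and the "balance the regions" objective come from the talk.

traffic = {"A": 60.0, "B": 40.0}                      # % of the demand per region
footprint_per_pct = {"web": 10, "feed": 8, "bar": 6}  # servers per 1% of demand (assumed)
shift = 10.0                                          # move 10% from A to B -> 50/50

# Step 1: distribute capacity and size up services in region B before shifting;
# this has to fit inside region B's available supply.
grow_in_B = {svc: per_pct * shift for svc, per_pct in footprint_per_pct.items()}
print("size up in B:", grow_in_B)  # {'web': 100.0, 'feed': 80.0, 'bar': 60.0}

# Step 2: shift the demand, then shrink region A by the same amount; demand is
# zero-sum, so whatever B gained, A no longer needs and can be reclaimed.
traffic = {"A": traffic["A"] - shift, "B": traffic["B"] + shift}
reclaim_in_A = grow_in_B
print("traffic:", traffic, "reclaimed in A:", reclaim_in_A)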

Orchestration

We saw how complex this plan is even for a simple two-region setup with one product. In reality, when we have so many demand sources and thousands of services, the sheer amount of effort and toil generated from such an execution is enormous, and the amount of automation we need is significant. Service automation here is crucial; without it we won't be able to automate this end-to-end, especially with thousands of services involved. The automation system we have leverages integrations with various systems like autoscaling, traffic management, and so on, to make this as seamless as possible, always with a focus on automation, visibility, and metrics, because we need to tie the feedback back into the system. The plan that we saw is often represented as a DAG, a directed acyclic graph. It is represented that way because it allows us to sequence various actions in a specific order. For instance, in the example we saw, the very first step is to distribute the capacity to the services. Once we distribute the capacity, we upsize the actual services, that is, the jobs running on that capacity, and so on. This is crucial, because when we start shifting the traffic, we want to avoid any failover or reliability issues; the services need to be in a state to withstand the new incoming traffic. Once we sequence these actions, we move on to performing the validations for the demand shift. This includes pre-validations: can we shift this demand, are there any other gaps, and so on. Once our validations are successful, we shift the demand, run post-validations as well, and then go back to downsizing the services. Whether you do this for one demand source or 10 demand sources, the process is all the same. This gives us a very repeatable template that we can automate and refine over time.
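A minimal sketch of such a plan DAG and its ordering; the node names mirror the steps described above but are illustrative, and a real orchestrator would also handle retries, checkpoints, and metrics.

plan_dag = {
    "distribute_capacity": [],
    "upsize_services":     ["distribute_capacity"],
    "pre_validate":        ["upsize_services"],
    "shift_demand":        ["pre_validate"],
    "post_validate":       ["shift_demand"],
    "downsize_services":   ["post_validate"],
    "reclaim_capacity":    ["downsize_services"],
}

def topological_order(dag):
    # Simple DFS-based topological sort: every action runs after its dependencies.
    order, seen = [], set()
    def visit(node):
        if node in seen:
            return
        seen.add(node)
        for dep in dag[node]:
            visit(dep)
        order.append(node)
    for node in dag:
        visit(node)
    return order

print(topological_order(plan_dag))
# ['distribute_capacity', 'upsize_services', 'pre_validate', 'shift_demand',
#  'post_validate', 'downsize_services', 'reclaim_capacity']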

Frequency of Plans

Until now, we saw why we are creating demand plans and what some of these demand shift plans look like, and we walked through an example. How frequently do we run them, and how frequently do we have to run them? The advantage of these plans is that they can run periodically. We execute them periodically to keep shifting our global footprint towards a more optimized place. An optimal global footprint is not an exact state we can reach once and for all, because every day there are new products, new services forming, baselines changing, and service profiles varying. What is an ideal placement today might not be an ideal placement tomorrow. This requires us to operate in a continuous fashion. The advantage of this approach is that it is self-correcting: any changes or errors from one run are balanced out and taken into account when we model and plan the next run. Another advantage is the capability to do this in multiple steps. When we think of shifting demand, the first thing that comes to mind is safety. Can we do this without causing production outages? Can we do this without disrupting end-user workflows? That is more critical for us than efficiency. With multi-step plans, we can think of a larger change but break it into smaller steps or smaller plans that let us continuously optimize at each step. This is where the self-correction comes back again, because to get from state A to state B we can create 10 plans or thousands of plans, depending on our scale and the kind of change we want to make. Thinking in multi-step plans is also helpful because one of our objectives, the stability objective, is to reduce thrash. Having the larger picture in mind means we slowly make incremental steps towards getting there without a lot of back-and-forth movement. With respect to duration, when we started executing this, it took us a lot of time. I still remember the very first day we ran such a plan: everything was manual, and there was so much coordination involved. Over time, we were able to reduce the duration of executing such large infrastructure plans, bring it down by a lot, and do it more consistently.
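A toy illustration of the multi-step, self-correcting idea: each periodic run re-reads the current (possibly drifted) share and moves at most a small amount of demand toward the latest target. The step cap and numbers are invented.

def next_step(current_share, target_share, max_step_pct=2.0):
    # Cap how much demand a single run is allowed to move (the stability knob).
    delta = target_share - current_share
    return max(-max_step_pct, min(max_step_pct, delta))

current = 60.0  # % of demand currently served from region A
target = 50.0   # % the latest plan wants in region A
for run in range(6):
    current += next_step(current, target)
    print(f"run {run}: region A now serves {current:.1f}%")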

Recap

Demand shift enables us to optimize globally, trading off local inefficiencies for better global outcomes. It homogenizes our hardware footprint towards a standard set of region types, like storage, compute, and so on. From a service owner perspective, it allows them to become more region agnostic: they don't care what their placement is, they care about the guarantees and constraints they will have.

Live in Action

This all sounds good, but can we see it in action? Yes. So far, we have used this for three different scenarios, and in all of them we have found a lot of value. In case one, we saw that service utilization improved after a single run of such a demand shift plan. We are showing five regions here for readability, but we can see that the regional utilization for the service is, on average, better after the shifts than before. Like I said, if your service is underutilized or overutilized, the plan is not going to change that, but it improves the average utilization across regions. This also allows us to perform out-of-region hardware refresh. For instance, in the example we saw with the animation, capacity became available in region A that was not available before. That is often a very challenging step, because every use case needs to be preempted. The last one is balancing our existing capacity and available capacity to serve future needs or to act as buffer.

Public Cloud Applications, and Challenges for Demand Placement

In terms of how this could be leveraged for public cloud, we have a couple of ideas. It can be super helpful for public cloud consumers with a large pool of capacity, to intelligently place services and improve utilization. We can also get insights on which regions and which demand are most useful for us, and how we need to accommodate them better. Cloud providers could offer a new type of contract at lower cost, where the provider can change how or where the capacity is placed. We would lose the regionality, but it would give us the ability to get a better contract. While these demand shifts give us a lot of capability and opportunity to perform such large operations, there is also a set of challenges we need to consider. We assumed a linear relationship between traffic and capacity; whenever that is not true, it is quite challenging for us from an accuracy standpoint. There are also challenges for some services when it comes to building capacity models. We might not be able to identify a demand metric that reflects the service completely, leaving a lot of residuals in place and impacting the baseline. All the inaccuracies in our demand baseline or capacity baseline impact the accuracy of the plan, and whenever a plan is inaccurate, we often end up with more issues than benefits.

What Else is on the Spectrum?

So far, we have seen two types of services on the spectrum: the latency-tolerant ones, where we can do arbitrary global placement, and the latency-sensitive stateless ones, for which we can use demand shifts. Does this cover all the services in the spectrum? There are some more. We have AI training workloads: the services themselves are stateless, but the data the training workloads rely on is often locality specific. There are also stateful multi-tenant services, such as caches and databases, which are challenging. Data moves typically take much longer than shifting compute traffic, so when data moves take that long, how can we safely perform the shifts over multiple days or weeks? That is quite challenging as well. There are many more types of services that might not be covered here. Our vision has always been about how we can reduce the toil for customers and improve the value for the infrastructure, about seeing the world as a computer. What abstractions do we need to build and provide to support this vision of global capacity management?

Evolving Service Expectations

Today, many service owners still provide their contracts at the regional level, which means a service owner needs to understand the various availability zones and how much capacity they need in which region. They are thinking in terms of a contract that happens per region. This is a big cost to the service owner; in a company with a lot of services, it is a lot of time spent by each and every team trying to think about and manage capacity, and this is what leads to local optimizations. We want to move from this state to a state where the contracts that are built are global, and the jobs and our executions are global too. That becomes the abstraction that bridges the infrastructure operations of shifts to the service owner. We will be able to build global reservations that take into account various constraints from the user, like latency constraints, sharding constraints, and more; there are definitely many more constraints. Effectively, all of this is represented as a single global requirement, which is then broken down into parts. A capacity regionalizer looks at these components and constraints and figures out a regional mix for that particular service. This is very similar to the attribution and modeling we saw. Once we come up with a regional mix, we can orchestrate the change through an orchestrator. Here, we are basically thinking about how, in the longer term, we want to extend the capability we built, but in a more generic way, to support the various kinds of services and templates we see.
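As a sketch of what a global reservation and a capacity regionalizer might look like; every field name and the even-split placeholder below are hypothetical, standing in for the attribution, modeling, and constraints described earlier.

from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class GlobalReservation:
    service: str
    total_capacity: int                   # expressed globally, not per region
    latency_sensitive: bool = False
    colocate_with: List[str] = field(default_factory=list)  # e.g. downstream deps
    sharding: Optional[str] = None        # e.g. "global" or "regional"

def regionalize(reservation: GlobalReservation, regions: List[str]) -> Dict[str, int]:
    # Placeholder regionalizer: an even split. The real one would derive the
    # regional mix from demand attribution, service models, and constraints.
    share = reservation.total_capacity // len(regions)
    return {region: share for region in regions}

res = GlobalReservation("feed", total_capacity=900, latency_sensitive=True,
                        colocate_with=["web"])
print(regionalize(res, ["region-a", "region-b", "region-c"]))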

As an example, let's see how we can adopt global sharding into the demand shift pattern. This would look very similar to how we orchestrate regional shifts today: we first distribute capacity and add replicas, and as the next step we also add shards. Once the shards are built, we move on to shifting traffic. After we shift the traffic, the downsizing is the exact opposite: we wait for the shards to be dropped, and once they are dropped, we can downsize the service and then the capacity as well. We can see how our model is extensible to adopt other capabilities. The extensibility of the orchestration system is one of its biggest value propositions when it comes to evolving the pattern to adopt different kinds of shifts for different patterns of services. Much of this work is forward-looking, still being designed and iterated on, and there are a lot of challenges on the way ahead.
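Extending the earlier DAG sketch for this sharded case: shard placement sits between sizing up and shifting traffic, and shard drops gate the downsizing. Node names are illustrative.

sharded_plan_dag = {
    "distribute_capacity": [],
    "upsize_services":     ["distribute_capacity"],
    "add_replicas":        ["upsize_services"],
    "add_shards":          ["add_replicas"],
    "shift_traffic":       ["add_shards"],
    "drop_shards":         ["shift_traffic"],
    "downsize_services":   ["drop_shards"],
    "reclaim_capacity":    ["downsize_services"],
}
# The same topological_order helper from the earlier sketch yields the
# grow -> shard -> shift -> unshard -> shrink ordering.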

Current Global Capacity Management Challenges

How can we come up with the right abstractions that capture the intent but also provide flexibility to the infrastructure? For the many stateful storage services, database services, and platforms that provide resources to tenants in the form of virtual resources, how do we model them, and how can we shift them in a safe and accurate way? With this growing complexity, and with the number of regions growing as well, how does our disaster readiness capability scale and change over time? The progress is 1% done, and we are super excited to see where this challenge takes us.

 


 

Recorded at:

Apr 30, 2024
