Transcript
Joseph Lynch: I'm Joey, and this is Argha. We're going to share how Netflix shapes our fleet for efficiency and reliability. How many of you have been asked to save money on your services? How many have been asked to make it work all the time? How many have been a little frustrated by how that's like an inherent tension? Our hope is that at the end of this talk, you'll have some tools and techniques that you can use to try to both be efficient and reliable. Argha is a fantastic engineer at Netflix who works across our infrastructure stack on our compute and network problems, all the way from clients down to services. I'll be bringing the perspective from the services down to the datastores. I'm also a nerd.
We're going to walk you through how we achieve efficiency and reliability. It's really fundamentally a balancing problem between supply of computers and hardware, via things like capacity planning and fleet planning, balancing that with demand from your workloads and traffic patterns, and then finally, finding the balance between those two parts of the equation. It helps that, at the end, when you're wrong, we have some good techniques that you can use to allow you to take a little more risk and still be ok with it.
Why does Netflix care so much about this? It's because we're a fundamentally global business. We serve customers all over the world. They expect Netflix to work. How many of you would be frustrated if you wanted to watch KPop Demon Hunters and you couldn't do it? We also have to handle a diverse set of use cases, things like mobile devices, PCs, and TVs. These have very different demand characteristics on our cloud. In order to support this diverse set of use cases, we have a fully active deployment across four Amazon regions that handles all of the control plane activities, getting you to press play. We pair that with a world-scale CDN called Open Connect, which brings the playback bytes to your computer from as close to you as we possibly can.
Goal: Efficiency and Reliability
In the talk, we're going to see how we leverage this combination of global infrastructure across multiple Amazon regions with a fully distributed CDN to provide both efficiency and reliability. Sounds like a tough challenge. Let's start by defining what we think efficiency actually means, because I think that sometimes the struggle with this tension is that we're not thinking about efficiency in the right way.
At Netflix, we think about it like this. Workloads create business value. They do something useful. They have a cost. We have to run them on computers. We have to pay humans to tend to them and keep them running. That cost is often proportional to some form of resource usage. This talk won't tell you how to manage human costs. I don't know how to do that. It will give you some techniques for how to manage the CPU, network, RAM, disk, the compute costs. A key insight at Netflix is that failures also have cost. When a service fails, that has a cost to your business. Fun statistic here. Amazon has, I believe, $700 billion in revenue a year. Every second that they are not accepting orders costs about $20,000. Downtime has real costs to businesses at scale. Our revenue numbers are not yet that big, but hopefully they will be. Here's the first bit of math, but I promise it will be fun. We can think of it like this: failures of services have some cost.
In this case, we're modeling it with a beta distribution. You can pick your favorite distribution. They have some frequency. Like they happen at some rate, some Poisson process. Then, finally, you can integrate those things together and you can look at the tail of that distribution. When things fail, they have impact on your business. We usually care about the tail because we don't want to be tweeted about. That would be the worst possible circumstance. Then we can use this to construct a mental model around the cost of failure. We use this to create what we call the risk-adjusted net value. Think of it like this. A service has value. It has cost. Then we're willing to spend money if it mitigates substantial business risk. In this case, the efficiency of the service is the value minus that loss-adjusted cost. Different services have different characteristics. For example, services that have no fallback.
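One way to write the risk-adjusted net value described above, as a sketch in our own notation rather than Netflix's exact formulation:

\[
\text{RANV} \;=\; V \;-\; C_{\text{compute}} \;-\; \underbrace{\lambda \cdot \mathbb{E}\!\left[\,L \mid L > \ell_{p}\,\right]}_{\text{loss-adjusted cost of failure}}
\]

Here \(V\) is the business value the service creates, \(C_{\text{compute}}\) is what it costs to run, \(\lambda\) is the incident rate (the Poisson process above), and \(\mathbb{E}[L \mid L > \ell_{p}]\) is the expected impact in the tail of the loss distribution, for example a Beta-distributed severity above the p-th percentile.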
For example, if you can't hit playback services, you can't hit play. It doesn't work. Those ones have a very high loss function. Ones that offer degraded service, like maybe instead of serving you a personalized thumbnail we instead serve you a fallback thumbnail, have a lower loss function.
Finally, we have services that are best effort. They have very little cost. You have to respond to the on-call incident, but there's no real business impact if that service is down. The goal of putting this here is that if you're not thinking about the loss function, then you're not really reasoning about the efficiency. Because when there is no loss, the efficiency is easy to define. You take your value, subtract your cost. You want to spend a few dollars and get lots of value. A highly efficient service, therefore, is easy. It's a case where the value is significantly higher than the cost. A highly inefficient service is easy to understand. If your value and cost are inverted, you've done a bad job. That's inefficient. You notice, though, that I didn't use the word utilization. I didn't define efficiency as running your fleet really hot. Because it turns out that running your fleet really hot has risk. You have to find balance between the risk and the cost.
Now we get into some really fun math. Here's utilization. Utilization is, in fact, one component of how efficient a system runs. Kingman's approximation is a good queuing theory finding that essentially tells you what the quality of your service is. How many people have been on the right side of that curve where your service is heavily utilized and you're up the hockey stick and you're having a bad day? That's what utilization is useful for. You want to stay out of the red zone. In steady state, the thing that defines the efficiency of your product is actually the other parts of this equation. For example, how fast is your service? What is the average service time? We know how to make services run faster. There's a whole discipline on it. It's this thing called performance engineering. Throughout these slides, we'll have some QR codes. They just link to resources if you want to learn more about what's on the slide. We can make our services faster. We can make them do more for less money. We also can make services consistently fast. Another element of the equation is how variable the performance of your service is.
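For reference, the standard textbook form of Kingman's approximation for a single queue ties together the levers discussed in this and the next paragraph; this is the general result, not a Netflix-specific formula:

\[
\mathbb{E}[W_q] \;\approx\; \left(\frac{\rho}{1-\rho}\right) \left(\frac{c_a^2 + c_s^2}{2}\right) \tau
\]

where \(\rho\) is utilization, \(\tau\) is the mean service time, and \(c_a\), \(c_s\) are the coefficients of variation of inter-arrival and service times. Making the service faster shrinks \(\tau\), making it consistent shrinks \(c_s\), smoothing arrivals shrinks \(c_a\), and only then is it safe to push \(\rho\) up the hockey stick.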
If your service is reliably consistently performant, then you're going to be able to run it more efficiently. Finally, there's the variation of arrival rate. This is why Netflix gives so many talks about load balancing and how we spread load. Because if you can take bunched up load and spread it out, then you can take more risk with your fleet. You can essentially run them at a higher utilization safely. It all comes back to that graph. The reason why I emphasized all the components of this equation that aren't utilization is because utilization is often a lie. Here are three systems that all have 30% CPU utilization. If your strategy for running this efficiently is just running all these services hotter without understanding the shape of the utilization, without understanding the variation of the service time or the arrival rate, you're going to have a bad day. In particular, there's only one service on this graph that's particularly safe to run hotter, and it's definitely not the yellow one.
To conclude this, instead of raising utilization, ask, where do we spend our dollars? Do those dollars mitigate risk? If you don't think they mitigate risk or business cost, then what lever out of that equation do you pull to reduce cost? I gave you three that aren't raising the utilization of the service. To try to drive this home, let's look at what I call a very inefficient fleet. Here are four; at Netflix we call them tiers. They're basically tier 0, we have no fallback; tier 1, the service is degraded; tier 2, best effort. If we run the whole fleet at 50% utilization, I contend this is a very inefficient fleet shape. It makes no sense. I want this fleet shape. I want to save dollars on those unimportant services and I want to spend them to pay down risk on the most critical services at Netflix. Playback needs capacity. Personalization, if we have it. If we don't have it, I'm willing to sacrifice it to ensure business outcomes.
This brings us to defining reliability, which is that we want services at Netflix to respond with low latency within the business domain that they serve. We schedule batches with low latency. We care a lot about failures. Failures should be rare in frequency, they should recover quickly, and they should have low impact. I showed you the math that backs up the intuition behind this. The reason that we want incidents that don't happen and recover quickly is because they minimize that loss function and they allow us to run more efficiently.
Similarly to utilization, instead of measuring nines, we want you to ask, how often does it fail? What is the impact? How long does it take to recover? If you're curious to dive into another example where averages are deceptive, look at these three services. They all have the same number of nines. They're all five nines or four nines or whatever. How you would fix them, how you would make them more reliable is dramatically different. For example, in one case you might use backpressure or load shedding, and in another you might target it with failover. Suffice to say, similarly to how utilization hides the shape of the distribution, availability measured in nines also hides the shape of the distribution. The shape of the distribution matters.
Understand Hardware Supply
Now that we understand how we think about efficiency and reliability and balancing those, let's dive into how we achieve it. How do we understand hardware supply? Through typical things like capacity planning and fleet planning. To motivate this, Netflix has chosen a new way of talking about service headroom. Specifically, we talk about it in terms of something called buffer. We define buffer as the multiple of its current offered load that a service can accept successfully. For example, in this case, you can double the traffic on the service and it will be ok. It will respond successfully. Or the multiple that we can shed while still not entering congestive failure. That would be our failure buffer. In this case it's 4x. Think of it like this: as a system gets loaded, we can handle requests and then we can shed requests. Once we get over the failure buffer, it's a little undefined. We might enter congestive failure. We're not sure. We really want to stay away from that. Argha will give some good techniques for that.
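A minimal sketch of that buffer arithmetic, with made-up numbers:

```python
# Success and failure buffer as multiples of current offered load (illustrative).
def buffers(offered_rps: float, max_success_rps: float, max_shed_rps: float):
    """Return (success_buffer, failure_buffer).

    success_buffer: multiple of current load the service can still serve successfully.
    failure_buffer: multiple it can absorb (serving plus shedding) before it
                    risks congestive failure.
    """
    return max_success_rps / offered_rps, max_shed_rps / offered_rps

# At 1,000 RPS offered, this service can serve 2,000 RPS successfully and shed
# up to 4,000 RPS before collapsing: a 2x success buffer and a 4x failure buffer.
print(buffers(1_000, 2_000, 4_000))  # (2.0, 4.0)
```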
Another key insight about buffer is, thanks to the magic of the cloud, buffers can expand. It's not instant. There's some time period that it takes for you to inject more capacity into a service that has entered its success buffer. A little bit formally, we can think of it like buffer is a function of three really important pieces of context. One, how important is the service? Two, to which part of your business? Those two things basically lead to that loss function. Then, third, how quickly, when we enter the buffer, can we recover it? Low utilization is a tradeoff, not necessarily inefficient. For example, fast starting services can run with less buffer. There's less risk to them running hotter. Slow scaling services, like maybe the ones that have to shuffle a bunch of bytes around when they fail, like the stateful services, you might run them with more buffer. That's not necessarily inefficient. It might be for a really good reason, which is that recovery time is an important component of this equation.
Now that we understand buffer, let's talk about how we set the buffers up. One of the easiest ways is to go and reserve it. Go and ask your cloud provider, like I guarantee you I will pay you money. You guarantee that I have computers. We're very happy. Asterisk on the guarantee. Our slow scaling services are all mostly going to run on reserved hardware. That's very efficient. Our fast scaling services, though, do not. It is not maximally efficient to reserve the entirety of your need. It turns out that depending on the ratio of the price you pay for reserved instances relative to the price that you pay for on-demand instances, relative to the duration that you expect to use that capacity, the sweet spot for efficiency is somewhere between the bottom and the top of that curve. This sets up two really important things. Number one, we're going to be taking risk, capacity risk, during that on-demand phase. There's no guarantee that you get those computers if you're not already running on them.
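A hedged sketch of that sweet-spot calculation: given a demand curve and the price ratio between reserved and on-demand capacity, pick the reservation level that minimizes total cost. The prices and the sinusoidal demand here are illustrative assumptions, not Netflix data.

```python
import numpy as np

def best_reservation(hourly_demand, reserved_price_per_hr, on_demand_price_per_hr):
    """Reservation level (in instances) that minimizes total cost.

    Below the level you pay for idle reserved capacity; above it you pay
    on-demand rates and take capacity risk.
    """
    candidates = np.arange(0, int(hourly_demand.max()) + 2)
    costs = [
        r * reserved_price_per_hr * len(hourly_demand)
        + np.maximum(hourly_demand - r, 0).sum() * on_demand_price_per_hr
        for r in candidates
    ]
    return candidates[int(np.argmin(costs))]

# Diurnal demand between ~100 and ~300 instances; reserved costs ~60% of on-demand.
hours = np.arange(24 * 30)
demand = 200 + 100 * np.sin(2 * np.pi * hours / 24)
print(best_reservation(demand, reserved_price_per_hr=0.06, on_demand_price_per_hr=0.10))
```

The answer lands somewhere between the trough and the peak of the demand curve, which is exactly the sweet spot described above.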
Then during that reservation trough where your demand is below the amount of computers you've reserved, you've created an opportunity to relocate work for free. We're taking risk when we're on-demand, and we're relocating work to the trough when we're under the reservation. This sounds good. How do we think about reserving efficient hardware? Because I heard that those new computers are pretty cool and cost efficient. They are cost efficient, but they have a risk. Newer computers have less capacity. Think of it like the cloud's not actually real. Somebody has to actually rack and stack computers, and that takes time. Is the seventh generation or eighth generation the most efficient per dollar? Yes. Are you taking a massive systemic risk if you run your entire fleet with no fallback options on those newest computers? Also, yes. You're also taking a risk at the back end of that curve, because securing large amounts of capacity on old computers is similarly difficult.
At Netflix we think a lot about how does the shape of our fleet evolve along this capacity demand curve. We don't want every workload running on the newest computer. We don't want every workload running on the oldest computer. We need to have this balance that evolves as our hardware providers evolve.
Another interesting cloud reality here is that computers have variable supply over time. If you show up to Amazon and ask them for lots of computers on Black Friday, I bet you other people are doing that, too. You have to think about when we are going to locate our demand relative to when everyone else in the industry is locating theirs. If you need to do it for a business reason, there are some tools you can use, like reservations or on-demand capacity reservations. You can also leverage some of those buffers that you've kept in reserve. One of my favorite ones is that you can preferentially allocate servers. Argha will show some techniques later on how we can say, you get a computer, but you don't get a computer.
Another interesting reality here is that this isn't even consistent between different kinds of computers within a generation. This data is not real. I don't think you can actually get this data. For example, at any given time, the capacity availability of different shapes of computers is highly time variable. For example, this could be it now, and this could be it now, because someone goes and runs a bunch of jobs on that particular kind of computer. The key insight here is not that this is risky and we should be scared of the cloud. The key insight here is that if you have flexibility in where your workloads can run, then you can access more capacity, save dollars, and build reliability. Similar to how the industry moved away from naming computers, from pets to cattle, we have to move away from naming the particular computer we want and move towards naming the shape of the computer that we want.
This leads us to an unfortunate realization, which is that these computers are meaningfully different. When you take your workload from an m6i and you try to run it on an m7a, you rapidly run into things like, m7a cores are not Hyper-Threaded. Those vCPUs are roughly twice as effective as the vCPUs on the Intel instances. You run into things like the instructions per clock and the clock frequency are different. At Netflix we've spent a lot of time on this, and we've open-sourced a library where we provide an apples-to-apples comparison that allows us to say if we take a workload that's running on this computer and we run it on that one, how do we expect it to look? Do we expect it to use more cores, fewer cores? Will it work at all? We can take that shape data, combine it with the workload context that we talked about earlier. We talked about those reservation differences, like whether computers are reserved or not has a big difference in price. You feed all three of these into a model, and then you can output the cost optimal computer.
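A hedged sketch of what such an apples-to-apples normalization can look like; the scaling factors are illustrative assumptions, not the open-sourced library's real data:

```python
# Effective throughput per vCPU relative to a baseline Hyper-Threaded Intel vCPU.
NORMALIZATION = {
    "m6i": 1.0,   # Hyper-Threaded Intel vCPU as the baseline
    "m7a": 2.0,   # full AMD core per vCPU, so roughly 2x a Hyper-Threaded vCPU
}

def equivalent_vcpus(observed_vcpus_used: float, src: str, dst: str) -> float:
    """Estimate how many vCPUs a workload needs on `dst` given its usage on `src`."""
    return observed_vcpus_used * NORMALIZATION[src] / NORMALIZATION[dst]

# A workload using 32 vCPUs on m6i should need roughly 16 vCPUs on m7a.
print(equivalent_vcpus(32, "m6i", "m7a"))  # 16.0
```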
I think an important saying comes to mind here, which is that most models are wrong, but sometimes they're useful. This is definitely one of those cases. Is this really the exact optimal computer? No. If you apply this across your whole fleet of compute, then you have good overall efficiency and reliability outcomes. If you want to learn more, I gave a talk a couple of years ago on the details of this. Warning, there's a lot of math in that talk, too. It's also pretty useful because you can use this technique to plan for uncertain needs. If you have concepts like buffer and capacity planning, then you can start asking questions like, what if the demand was between these ranges? You can start calculating regret functions. At the end of it, you end up with a computer that, if you launch on it, leaves you least regretful in the most circumstances.
Another interesting thing that pops out of this math is that cloud providers typically price linearly, but the efficiency of the computer shapes is not linear. This is a consequence of some interesting queuing theory. TL;DR, larger computers you can run hotter to achieve the same buffer, because they have more cores. The probability that a job arrives and finds a non-busy core is higher on larger computers. To drive this home, let's try it on the x-axis here. We're going to graph risk.
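That pooling effect is standard M/M/c queueing; here is a small Erlang-C sketch of it, with illustrative numbers rather than Netflix data. At the same utilization, the bigger box has a much lower chance that an arriving request has to wait:

```python
from math import factorial

def erlang_c(cores: int, utilization: float) -> float:
    """Probability an arriving job finds all cores busy (M/M/c queue)."""
    a = cores * utilization  # offered load in Erlangs
    top = (a ** cores) / (factorial(cores) * (1 - utilization))
    bottom = sum(a ** k / factorial(k) for k in range(cores)) + top
    return top / bottom

for cores in (4, 16, 64):
    print(cores, round(erlang_c(cores, 0.70), 3))
# roughly: 4 cores -> ~0.43, 16 cores -> ~0.13, 64 cores -> well under 0.01
```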
On the left-hand side, we're taking high risks. We're running our services with low buffer. On the right-hand side, we're running with low risk. The y-axis is efficiency. The bigger the computer we run on, the hotter we can run it for the same buffer. So we just run the whole fleet on m7a.32xlarges? We're done. Wait, there's a problem. Just like trying to find contiguous memory for a memory allocator, trying to find contiguous cores on a cloud provider is more challenging. This is not real data, but it expresses the tradeoff that you're making. When you're moving to that more efficient hardware, you're taking risk in being able to acquire that capacity. Put differently, a lot of fragmented CPUs are easier for your cloud provider to find for you than many colocated cores. We see the tension: efficiency, reliability. Now we can start balancing those once we incorporate the final element of this, which is that buffer differs by workload. In particular, a datastore, which has to do a bunch of things with data in the background, needs to leave some background buffer, some free CPU for those background activities. How many people have had an incident where Postgres autovacuum knocked their cluster over? How many have had Cassandra compaction? It's a pretty common problem for stateful services. Stateless services don't have as many of these problems.
The key insight here is that if you look at the utilization of those two services and you're like, the Java stateless service is running more efficiently because it's hotter, you've totally missed the point. Do not heat your Cassandra database. You're taking systemic risk. Or put differently, you're picking up pennies in front of a steamroller. You have to understand the buffers in order to solve the equation. What does that look like? We can feed each one of these demand curves, maybe it's long running and variable, long running and stable, or maybe we're just using it for a few minutes, into a model that optimizes for financial cost and capacity guarantees. Depending on the risk, it might choose to not reserve the capacity but still issue a capacity reservation, or it might choose to reserve some number between the projected minimum and peak demand. To tie this all together, the takeaways are that buffer is the relevant configuration for efficiency, not CPU utilization. Capacity must be prioritized by risk. Time to recover matters deeply to the efficiency of your fleet. Finally, if you do the right math, compute is highly fungible and you can mitigate capacity risks.
Understand Software Demand
Argha C.: Now that we have a way for us to understand supply, let's see how we can build a mental model for understanding demand. Because to be able to balance the two, we need to understand both. The first key component to understanding demand is that you need to understand how individual workloads behave, which means that you need to understand their profiles at a workload level. This is essentially the amount of resources across your fleet that a service needs to do its work. Most of you are familiar with this. There are three typical components to this. You're talking CPU, memory, and network. CPU is pretty standard: regardless of whether you're running on containers or VMs, you have standard CPU utilization metrics. Memory can get interesting.
For example, in this case, we learned when doing this for ourselves, Netflix runs a large fleet of JVM apps. You can't just be looking at live set size. You need to factor in allocation rate, because different apps based on their profile have a different allocation rate. When we do the memory budgeting, we need to factor in allocation rate. It depends on things like, what garbage collector are you using? Is it a pacing collector, non-pacing? It's fun. Network utilization, on the other hand, is again interesting, because for stateless, it doesn't really matter that much. There are very rarely stateless apps that are network bound. Stateful, on the other hand, can be interesting, because depending on what activity they're doing, while they have a baseline, they can have these periods of bursts. We need to factor that in too. The second is, once we've understood a workload profile, how do we understand how they scale? Because that's essentially what we're talking about here. If you don't know how they scale, how do you allocate capacity? To do that, the simplest technique is to observe the workload in production. Look at what load it serves.
Typically for us, the load can be quite predictable. Based off of the actual traffic and the service's behavior, we can do this fitting, which is very useful. Essentially, you do correlations between CPU and requests, which is the amount of work that a service is doing. What that allows us to do is derive scaling targets, which we'll look into. Of course, this is a function of load balancing. In this example, I'm oversimplifying and assuming we know how to load balance fairly. The third component is often missed, but it can be very important. You need to factor in startup times, because it's not just enough to say that I need compute, bang, schedule compute, and it's there. Startup times can have many different components, which we'll talk about. Essentially, if your service is slow to start, that by definition means that you need to preserve more buffer, to buy that headroom. Then the other piece is that it's quite practical that a library update introduces a regression in the service and affects startup time. Unfortunately, we do have to compensate for that, too.
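A minimal sketch of that fitting step: regress observed CPU against observed request rate, then derive how many requests one instance can take at a chosen scaling target. The data points and the 60% target are illustrative assumptions.

```python
import numpy as np

rps = np.array([200, 400, 600, 800, 1000], dtype=float)  # observed per-instance load
cpu = np.array([12.0, 22.0, 31.0, 42.0, 52.0])            # observed CPU %

slope, intercept = np.polyfit(rps, cpu, 1)  # CPU% ≈ slope * RPS + intercept

target_cpu = 60.0                           # where we want to start scaling out
rps_per_instance_at_target = (target_cpu - intercept) / slope
print(f"one instance handles ~{rps_per_instance_at_target:.0f} RPS at {target_cpu}% CPU")

def instances_needed(forecast_rps: float) -> int:
    """Capacity required for a forecast load, assuming fair load balancing."""
    return int(np.ceil(forecast_rps / rps_per_instance_at_target))
```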
Now that we've spoken about supply and demand at a service level, we also need to understand demand holistically, because we are talking of thousands of microservices here. To understand demand holistically, we need to understand how service-level scaling actually impacts the fleet. We have to look at traffic analysis, and this is a global construct. Before I dive in, this is a very simple, oversimplified view of our high-level streaming architecture. We have traffic coming in globally from the clients on your left, across the globe. They hit our front door, which is Zuul, an open-source API gateway. We then go into a federation layer, and that federation layer typically calls out to the actual services. Those services obviously need to talk to their datastores, and they do that through data gateways. This is interesting in that this architecture, like microservices in general, has implications for the call graph, or very different call patterns. In this example, what you'll see is that though you have a 4x load spike coming in through the front door, it actually means very different things depending on the type of service you are.
For our tier 0 critical playback service, which means if this thing is down, you can't play a title on Netflix, that means double the load. For something that is discovery, which means you're browsing the titles, that potentially means 3x the load. For something like personalization, it can just mean a quarter of the load, maybe. The point here is that microservice call patterns have a very significant effect on demand. You can't just linearly extrapolate a 4x spike across the fleet and provision capacity for that.
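A toy sketch of that fan-out effect; the ratios are illustrative assumptions, not Netflix's real call-graph data:

```python
EDGE_SPIKE = 4.0  # 4x load arriving at the API gateway

# Fraction of the edge spike each downstream service actually sees.
CALL_GRAPH_RATIO = {
    "playback (tier 0)": 0.50,         # 4x at the edge -> ~2x here
    "discovery (tier 1)": 0.75,        # -> ~3x here
    "personalization (tier 2)": 0.25,  # -> ~1x here, barely moves
}

for service, ratio in CALL_GRAPH_RATIO.items():
    print(f"{service}: ~{EDGE_SPIKE * ratio:.1f}x of normal load")
```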
In addition to the architecture itself, there are other things with regard to how we serve traffic globally that matter. Because we're running in four different global control planes, like four different regions as we like to call them, you end up with this situation both within a region and across regions. The regional peaks are temporally distributed. Pay attention to the left side, where there is a 10x delta in traffic on a single region between peak and trough. What that means is that if we were to provision and buy compute to scale for the peak traffic, that would be highly inefficient at our scale. It's not what we want to do. We do not want to buy for peak. However, there's value in that predictability. This is like a diurnal pattern. It allows us to plan for things like failover.
One of my fun examples is like, recently there was an AWS us-east-1 outage. You didn't hear us trending on Twitter, and we've learned this through many years. When there's a problem, we get out of that region. It helps us plan for that. This part is now interesting. I just spoke about predictable demand. We also have to account for things that we cannot control. One great example is content. How many people here have binge-watched The Witcher or Stranger Things when we launch? Like season efforts. Exactly. This is great for viewers, but we need to factor that in, too. It's not something you can predict.
Balance: Manage Supply, Manage Demand
Now that we've spoken about understanding and having mental models for demand and supply, we actually get to the fun part. The analogy here is like, we're basically asked to manage demand and supply and make them fit perfectly. It's like someone telling us, go predict the future, or, how hard can it be to time the market? What could possibly go wrong? It's very easy. We'll see. We'll talk about strategies. Before we dive into strategies for balancing, let's agree on tenets. In terms of efficiency, the main thing to remember is that you can make decisions that are locally efficient, but they add global risk. When you zoom out and you have this maybe single service or a set of services that are provisioning huge amounts of compute, and that's eating into 40% of your performant hardware capacity, it might be very locally efficient for that service because we have good hardware.
If you look at the global level, you're basically starving a bunch of your other critical services of that capacity. You definitely want to avoid that. The next tenet is that for our most critical services, we are tasked with ensuring that buffers exist where it matters and when it matters most. This is a big capacity problem. It essentially means that these critical services have the capacity available to them at the times when it matters most for them to scale out. Then the final piece is that to be able to do this balancing act and manage both demand and supply, infrastructure, which we are both part of, needs to do the heavy lifting. This is not something that, as a solution, scales or can even potentially scale at a service level when you have orgs of thousands of people and teams.
Now to the actual balancing act. We've learned ways to understand supply, and we're aiming for balance, so let's dig into how we manage the supply of compute so we can allocate better. Joey talked about hardware, and let's look at what that means. We effectively have developed capacity models, at minimum one for stateful and one for stateless, because they behave very differently. What these models essentially do is take two primary sources of input. One is the supply context, which means, how much compute does your provider or data center have? The second one is hardware specs. Every single hardware instance type that we choose has different clock frequencies; they can be Hyper-Threaded or not; cores, memory, everything else. We have to factor that in. We refer to these as shapes.
Once we do that, the output looks something like this at a workload level. In this example, we are essentially saying that you're a stateless workload, and we've found what we think is the optimal instance type for you. That happens to be a c7a.2xl. This basically just means that we have inferred that this is a CPU-bound workload, not a memory-bound one. Otherwise, we would be recommending an r7a or an m7a. The key thing to note is we also have pricing data in here, which means that when we say optimal, we are saying both performant but also price efficient for Netflix. It's a great idea that we can have optimal hardware for a fleet of services and assume the capacity is there, but that isn't reality. We also need to have fallback types, which is, what do we do when our preferred instance type isn't available? In our models, we not only generate preferred instance types, we also generate a list of fallback types. More importantly, we need to make sure that everything we're recommending is actually validated. We invest a lot in validation, like through squeeze tests, which means we'll run these actual workloads on this hardware and note the results. That's a very important thing we invest in.
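A hedged sketch of that selection step: rank the instance types that fit a workload's normalized CPU and memory needs by price, take the cheapest as preferred and keep the rest as fallbacks. The catalog, specs, and prices here are placeholders, not real AWS pricing.

```python
CATALOG = {
    # name: (normalized vCPU performance, GiB memory, $/hr)
    "c7a.2xlarge": (16.0, 16, 0.41),
    "m7a.2xlarge": (16.0, 32, 0.46),
    "r7a.2xlarge": (16.0, 64, 0.53),
    "m6i.2xlarge": (8.0, 32, 0.38),  # Hyper-Threaded vCPUs count for less
}

def rank_instances(needed_perf: float, needed_mem_gib: float):
    """Return instance types that fit the workload, cheapest first."""
    fits = [
        (price, name)
        for name, (perf, mem, price) in CATALOG.items()
        if perf >= needed_perf and mem >= needed_mem_gib
    ]
    return [name for _, name in sorted(fits)]

# CPU-bound workload: c7a wins; the remainder becomes the fallback list.
preferred, *fallbacks = rank_instances(needed_perf=14.0, needed_mem_gib=12)
print(preferred, fallbacks)  # c7a.2xlarge ['m7a.2xlarge', 'r7a.2xlarge']
```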
Finally, after us claiming to have successfully done the matching game, we need to now drive these recommendations at a fleet level. We refer to this as shaping. We've come up with a list of preferred instance types and fallbacks for our workload. We need to drive this through the fleet. The right side is interesting. We've taken pricing context, capacity, which is supply, business context, app tier, and then we spit out a rough capacity demand. This is broken down per region, per instance type. In an ideal world, we'll just have it: unlimited compute, elastic cloud, whatever. That's not true. What ends up happening is that there will be capacity context, for example, in this case, where based on what you desire, there's actually a delta of about 10k for m7a.xl instances. This is a problem we have to solve. What this means in practice is, we need to figure out solutions to orchestrate some reallocation of capacity, and we have to approach it in ways that minimize risk while we do that.
If you pay attention, while it's saying m7a is short, you'll see it's also saying that r7a and m6i, completely different generations, AMD versus Intel, have healthy pool depth. What this tells us is that we can be a little more adventurous and move some of our less critical services, non-tier 0, onto older generation, less performant hardware, like Hyper-Threaded Intel, the m6i. We will not touch our tier 0 if we can avoid it. Also, fundamentally, we can do what we call strictly safe fallbacks. m7a to r7a is actually a very safe migration, if, and this is the big if, your app is memory bound. The model allows us to identify candidates that will be eligible for that move, both from 7a to 6th gen for the less important services, and 7a to r7a, which is a safe fallback. This is a great visualization of what I'm talking about, but in a global context.
Essentially, everything that we're saying is that we need to be able to run this loop globally, across our four regions, and do many of these migrations continuously, as context changes. The trouble is, this is effectively possible, but it operates on the order of months, because we have pricing negotiations. The time it takes for this context to flow, for us to do migrations with our service owners, this is not something you can just flip a switch on and do on the order of days or hours.
We've talked a lot about the balancing act and managing supply; the other side of it is, how do we manage demand? We'll start with the simple things first. What can we do? We can just pre-scale the fleet, because we have, and I'll talk about it, reasonably good estimates of what load looks like. How do we do those estimates? It's funny. I think of this as a crystal ball, like prediction models, where we basically try and estimate how much viewership we are going to get, like how many viewers are going to watch this popular event or title we're doing. Then we take that and convert it into what that means for traffic globally. Then we take that and convert it into what it means for traffic at a regional level. Once we've done all of that, we're like, this is the number of computers we need, let's scale up the fleet.
The important thing to note is this doesn't scale at a service level; we need automated fleet scale-up. This is the simplest thing we can do. We have demand, we just scale up the fleet. However, even pre-scaling, a simple thing like that, can be interesting. In practice, all we're saying is that service was running 25 instances, and we think it should be running 300 instances. We call it min pinning: raise them up, traffic comes in. What's very interesting, if you pay attention, is that there is still incremental scale-up, because we don't want to pin it so high that there's no autoscaling left; that would just be us being inefficient. Then when the traffic has passed, everything scales down, nice and slow towards the right. We've served the traffic, done a good job. However, pre-scaling itself can be inefficient or efficient.
On your left, you have an example where we've over-scaled the service so much that we are running well below the target, which means we could have pushed the efficiency of that, like the tier 0, tier 1, tier 2, higher. We've just over-scaled. We're running with way more buffer than we want. In the middle is something that we call efficient, because for tier 0, we are not pushing it into the success buffer, and tier 1 and tier 2 are in a tolerable zone. We call it efficient pre-scaling. On the right is what we want to avoid, because we've under-scaled it so much that there's a risk that tier 1 falls over. We want to avoid that.
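A hedged sketch of tier-aware pre-scaling: the same traffic forecast, but a different buffer per tier. The buffer values and per-instance capacity are illustrative assumptions.

```python
from math import ceil

TIER_BUFFER = {0: 2.0, 1: 1.5, 2: 1.1}  # tier 0 keeps the most headroom

def prescale_min(forecast_peak_rps: float, rps_per_instance: float, tier: int) -> int:
    """Minimum instance count to pin ahead of a predicted traffic event."""
    return ceil(forecast_peak_rps * TIER_BUFFER[tier] / rps_per_instance)

# Forecast: 90,000 RPS at peak; one instance handles ~400 RPS at the target.
for tier in (0, 1, 2):
    print(f"tier {tier}: pin min to {prescale_min(90_000, 400, tier)} instances")
```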
In practice, an efficient pre-scale, this is a real service, looks like this: you can see that we've scaled up the service, the traffic comes in, and that blue line is very interesting, I'll go into details here a bit later. It's essentially that for a very brief period, there is this transient but minimal load shedding. What that means is, before the service scaled up, we had to shed some load, which is ok. We don't mind that, because it was transient. Once that happened, once the service scaled up, there's RPS normalization, which means the new instances started taking traffic and we've redistributed the traffic load. This is good.
Again, pre-scaling, like I said, is the simplest solution. What more can we do? What can we actually do with our global traffic to control it before it ingresses our data centers? I've always referred to this as dynamic traffic shaping, because dynamic sounds cool. There are two parts to this. One is, you have to redistribute the existing traffic that's still coming into your data centers. It's predictable, but you have to have a pattern to redistribute it. If we did not do anything, which is like if we did not do shaping, this is what it would look like. What it says in this graph is, you have two regions where the peaks are separated, huge variances in global traffic. If you pay attention or squint at eu-west and us-west-2 in this, that's the red line, there is so much divergence. It's not great for us, because it means we cannot forecast compute demand with confidence; the variance is wide.
On the other hand, when we did apply shaping, you see how much more balanced this traffic looks. These are real production graphs, where we were able to redistribute our existing traffic, and we are much tighter across the four regions. Again, this has significant implications for how we provision compute.
We spoke about existing traffic; what do we do about new traffic? There's a big talk from Sergey, talking in detail about how we steer traffic into our regions. Essentially, the TL;DR is we have authoritative resolvers and we control DNS steering. In a typical scenario, this is what it looks like. We will always optimize for latency. We want you to play that title as fast as you can with minimal delays. This doesn't necessarily mean balance, because look at how hot us-east-1 is compared to the other regions. We want to change that. We want to use steering again, but to balance this traffic better.
In this case, we're basically taking new traffic that would have hit us-east-1 and moving it to us-west-2 and us-east-2 instead, because we have more capacity there. What this means in practice is, while I spoke about default steering versus balancing, look at how much the delta is when you take on incremental traffic. We had our regular traffic as well, and then we had this big event on your left. If we had done nothing, there would be this wild variance.
Inevitably, the region in green, which is us-east-2, I think, would have run short of capacity. Because we balanced, see how much we were able to normalize both the traffic and, as a result, compute demand. This has significant implications for us, because if we don't do that well, there is going to be streaming impact. As anyone who's built systems for a while will know, this is one of my favorite slides. If I were listening to this, I would be like, this claim is a lie. How can you keep talking about managing demand and supply so perfectly? For all the problems that we get thrown, we do need to have contingency plans, because hope is not a strategy. Let's look at what happens. What are some final levers we can pull if none of this works? That's actually pretty interesting.
Recap
To recap, so far, we've tried pre-scaling our fleet and we've tried re-steering traffic, but now we're at the point where even that's not enough. We still have more traffic or more demand than we expect. What do we do? The answer is we need to have mechanisms to add new capacity, and really fast. Enter reactive autoscaling. What does this mean? This just means that, again, buffers for critical services are provisioned using mathematically derived thresholds. You have to be extremely good about not just how fast your service scales up, but at what threshold, which means we have to determine three things. The first thing is what we call target tracking. It's a number that basically says, for a given service on given hardware, at what CPU percentage you should start scaling out. This is us adding capacity, but controlled. The second is, when you're pushing into the right end of that success buffer, which means if you went further you would have to shed load, you also need emergency injection of capacity, and that's what we call hammers. We do this both for CPU and RPS.
Finally, even if the hammers don't work and are not sufficient, we have to rely on load shedding, which I'll talk about. In practice, what this means is you have three numbers. You have a target-tracking CPU baseline number, you have an RPS number, and you have a CPU hammer. What this looks like in practice is you have a bunch of scaling events when those thresholds are hit, and you can see us raising our mins. We're saying that we want X amount of more compute, and the blue line is those instances being provisioned. Similarly, on the other side, when the traffic is gone, we are also firing events, these are all automated, saying you can now release that compute.
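A hedged sketch of how those three numbers might combine into a scaling decision; the setpoints, the 1.5x hammer step, and the per-instance RPS are illustrative assumptions, not Netflix's actual thresholds.

```python
from math import ceil

TARGET_CPU = 45.0        # target-tracking setpoint (%)
RPS_PER_INSTANCE = 400   # derived from the CPU-vs-RPS fit
HAMMER_CPU = 70.0        # emergency threshold: inject a big chunk of capacity

def desired_instances(current: int, avg_cpu: float, fleet_rps: float) -> int:
    if avg_cpu >= HAMMER_CPU:
        # Hammer: don't inch up, jump the minimum by a large step.
        return ceil(current * 1.5)
    # Target tracking: scale so projected CPU returns to the setpoint,
    # but never below what the RPS-based floor demands.
    cpu_based = ceil(current * avg_cpu / TARGET_CPU)
    rps_based = ceil(fleet_rps / RPS_PER_INSTANCE)
    return max(current, cpu_based, rps_based)

print(desired_instances(current=100, avg_cpu=58.0, fleet_rps=52_000))  # 130
print(desired_instances(current=100, avg_cpu=75.0, fleet_rps=52_000))  # hammer: 150
```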
Back to the fleet scaling loops, what this means is that we've made a massive change. We've moved from something that we control on the order of months to minutes, which means we can react much faster as an infrastructure org. However, for all the nice things that we said, autoscaling can be slow. Ryan goes into a lot more detail in his talk. It's a function of so many things that we don't control. We have to use something else to cover what happens when autoscaling isn't fast enough. If we did it right, what does it look like? This is a tier 0 service that was taking 8 to 15 minutes to recover from a 10x spike; after all our investments last year, that thing is at three to four minutes. It's like a 70% reduction. Fantastic. We actually achieved this. We like measuring things at Netflix; this is real data.
The last lever is, when autoscaling can't keep up, what do we do? This is load shedding. Load shedding comes in two types. You can shed load in bulk, which is indiscriminate shedding; you don't care about which type of traffic. This is how a service operates: this is good, this is still good, eating into success buffer, this is not so good, and this you absolutely want to avoid, this is congestive failure. Never take the 101 to the city, you take the 280. I'm just very opinionated. However, what we want the shedding to look like in practice is that we start with the least critical traffic, which is what we call bulk, and progressively move to best-effort and degraded traffic. The last traffic you want to shed is critical traffic. How do we do that? Enter priority-based load shedding.
The main emphasis here is that in this mode, when you're taking on more traffic in the success buffer, you want to let an ambulance through on the highway, but there's going to be a host of people who are driving their sports cars around, and they can wait. That's what prioritized load shedding does. In production, what it looks like is your non-critical shedding starts much earlier, way before you have to even touch critical shedding. Note that while your errors go up in between, fundamentally, you've not dropped successful RPS. This means we've done an excellent job: there's no degradation. We've shed load without compromising service capacity. Finally, back to the fleet loops. This is our final thing, so between hammers and shedding, we've reduced the feedback time to seconds now. Imagine, we went from months to minutes to seconds. All of these three things operate independently, and they're reacting to very different pieces of context.
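A hedged sketch of priority-based shedding: lower-priority buckets start shedding at lower utilization, and critical traffic is only touched on the brink of congestive failure. The bucket names and thresholds are illustrative assumptions.

```python
SHED_THRESHOLDS = {
    # priority bucket: CPU % above which this bucket starts being shed
    "bulk": 60.0,       # prefetches, background refreshes
    "degraded": 75.0,   # best-effort traffic with a fallback
    "critical": 90.0,   # e.g. playback; shed only as a last resort
}

def should_shed(priority: str, cpu_percent: float) -> bool:
    return cpu_percent >= SHED_THRESHOLDS[priority]

for cpu in (55, 65, 80, 92):
    print(cpu, {p: should_shed(p, cpu) for p in SHED_THRESHOLDS})
# At 65% only bulk is shed; at 80% bulk and degraded; critical survives until 90%.
```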
Lessons Learned
To summarize, we are at the lessons. My biggest ask, or experience learned from this, is: please invest in both proactive and reactive levers. One is not sufficient. I like to also think of this, being a network person, as pre-ingress, things that you can do before traffic lands in your data centers, and things that you must do after the traffic has landed, which is post-ingress. It's a good mental model I've followed for this. Lesson two is that everything we are talking about needs holistic solutions. You have to have systems thinking. You can't just buy raw compute to buy down risk. That's not enough. You have to understand that efficiency, by definition, is a compounding game at scale. I'm very opinionated about this, just being spoiled by Netflix, that you need end-to-end control over your traffic. A lot of what we are talking about is not possible unless you do that. Finally, this is my favorite trademark line: math, for both of us, and for Netflix, is our safety blanket. When you combine math with some Econ 101, you can solve hard problems and make a lot of progress.
Questions and Answers
Participant 1: It seems like you're using EC2 instances directly instead of containers in this situation?
Joseph Lynch: We do both, but the containers run on EC2, ultimately.
Participant 1: I could see the bin packing on the containers. These things can get complicated. Could you talk about that?
Argha C.: Bin packing is hard. It doesn't matter whether AWS does it. Your provider is also doing bin packing. I think for us, what it means from a capacity context is when we bin pack ourselves, like the compute that we buy, because you're buying fixed sizes, it reduces the capacity problem quite a bit. What it also does is, if you remember the peak to trough, it allows us to hold on to that compute, and if we can reallocate capacity, it makes us use things like trough better. Bin packing is going to be a hard problem regardless, but fungibility is the thing to focus on, where when you hold on to a single instance, it's easier if you make that fungible versus have scattered instances.
Joseph Lynch: I think an important learning for us as part of this project is that for very critical, latency sensitive workloads needing very consistent performance, direct EC2 is a much better fit. We also have those multi-tenant container systems for the batch workloads, for the less critical workloads, and that creates a lot of our fungibility because we can reallocate capacity from those container pools into the latency sensitive EC2 pools. The answer is we use both, and the system actually balances both supply and demand for both those pools. Not so much the supply for containers. For example, that model that matches shapes of workloads is also used to right-size container footprints. A big problem for us with efficiency was that people would over-allocate their containers. They would say, I need 80 gigs of RAM, and our container platform would give them 80 gigs of RAM. Then we would observe that they only use 10 gigs of RAM, and then we would suggest, what if you didn't allocate 80 gigs of RAM?
Argha C.: There are some advanced concepts we didn't get into. The other part of fungibility is you can leverage things like preemption. When we talk of criticality and you put preemption together, that means you can preempt non-critical workloads to schedule critical workloads, and that's where a lot of value comes from.