
Using Traffic Modeling to Load-Balance Netflix Traffic at Global Scale


Summary

Niosha Behnam and Sergey Fedorov discuss how Netflix has shifted from geo-based DNS load-balancing to a latency-based approach, relying on real-user measurements and building a model of Netflix traffic.

Bio

Niosha Behnam is Staff Software Engineer @Netflix. Sergey Fedorov is Director of Engineering @Netflix.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Fedorov: The problem of traffic management is something that's pretty familiar to the majority of engineers. It's a problem that usually starts relatively small. For a new product, we have relatively few users, and they're often localized. Choosing a data center and routing the traffic to it is relatively simple. As the product and service grow, we have to deal with more users, and those users become increasingly geographically distributed, adding more complexity to the architecture of the data center or cloud locations, and adding more complexity to the process of connecting the devices with the proper server. That problem grows with the number of users and with the scale in terms of number of requests. One of the questions that fundamentally needs to be answered is how to connect a particular device with a server that needs to provide data for that device, and which composition of server locations, data centers or cloud regions, needs to be used. This talk is going to share Netflix's journey on that as we grew over time. We've had fantastic growth: in less than 12 years, our number of users grew by over 10x. That changed the locations of these users, as well as the complexity of our service and our offering.

What we ultimately had to solve is the problem of traffic management. What we want to do is to make sure that, for each user, we choose a server and a location that is performant, that has the lowest possible latency, to ensure a good quality of experience. We also want to choose the composition of servers, and map and load balance the traffic between devices and servers, in a way that leverages our infrastructure efficiently by lowering the costs. Of course, all complex systems may fail at some point. We want to make sure that whatever we do is resilient to the failures of individual subcomponents. As part of operating the system, we want a good ability to control, observe, and monitor the traffic.

Keys to Success

What do we need for that? First of all, we need to know who our users are: where they are, how many of them we have, and how they access our infrastructure. Then we need to know the fundamentals of our backend: on which principles we build our cloud or data center services. Finally, we need the proper technology to route and load balance the requests between devices and servers. Building that technology requires close collaboration between the teams that are responsible for routing the requests across the internet, and the infrastructure team that handles capacity planning and load balancing of the requests within the server-side locations. That's why we are here.

Background Info

My name is Sergey Fedorov. I'm a Director of Engineering, working on the content delivery team. We are responsible for connecting the devices, for steering the devices from their user locations to the destination in the cloud.

Behnam: My name is Niosha Behnam. I'm part of the cloud infrastructure team. My team is responsible for determining where we steer. Specifically, we own cloud traffic management as well as our ability to regionally fail over.

Once Upon a Time, at Netflix

I joined Netflix 11 years ago, and our cloud infrastructure was very different at the time. What I didn't know was that 3 months into my tenure, we would have a transformative event. This wasn't a content launch or a new product feature, but a failure in our cloud infrastructure that changed the way we decided to architect it. At Christmas of 2012, we had a major load balancer outage in the U.S.-East-1 region that we operated in. This resulted in the vast majority of our customers being unable to access our systems for the better part of 24 hours. You can imagine that at the time there was a lot of impact to the brand, as well as a lot of unhappy customers, because of the length and the timing of the outage. In the aftermath of this, you can imagine we wanted to rearchitect the cloud in a way that we wouldn't suffer from regionally isolated outages like this one. Josh Evans, a director of engineering at the time, actually presented on this topic in March of 2016, at QCon in London.

Regional Fault Domains Improve Resiliency

What we ended up doing is establishing regional fault domains for our cloud infrastructure. That allows us not only to withstand single-region outages, but also to provide a very similar customer experience to our users regardless of which region they interact with. In order to do this, we had to rearchitect not only our stateless tier, but our stateful tier as well. From a stateless perspective, this includes all our microservices, hundreds of microservices that are responsible for things like the discovery experience within the UI, things like search or playback. What we had to do there was ensure that these systems themselves are region agnostic: there's no region-specific configuration or business logic that would make one region different from another. At the stateful tier, we had to rearchitect the way we were doing storage such that data would be replicated globally. That meant extending our Cassandra rings globally, as well as building regional data replication for things like our Memcached layer. The changes to enable us to evacuate regions and withstand regionally isolated outages weren't limited to just our cloud infrastructure, though. They impacted the way that we steer traffic mechanically, as well as the tradeoffs that we make between things like quality of experience, capacity management, and availability risk.

Cloud Capacity and Cloud Cost

Regional failover itself is mechanically relatively simple to think about. If we're operating in three regions, we have some traffic distribution, where we steer users to the three regions to balance for various reasons: for quality of experience, for managing cost, for availability. When there is an impact to a region, what we want to do is steer the impacted users to healthy regions to restore service. This has an impact on our cloud capacity and cost. In order to do this safely, we need to ensure we have enough guaranteed capacity in all regions to withstand any single region failure.

At a high level, our regional peaks drive our cloud cost. Let's assume that if Netflix were to run in a single region, we would need to serve peak global traffic in this one region. That would translate to enough capacity to cover our peak usage. Just to make this analogy a little simpler, let's assume the total cloud cost to run Netflix at peak is $10 million. If we redistribute our global traffic among multiple regions, we may drive incremental cost. The reason is that each regional peak can drive its own cost profile, because we need to cover it with guaranteed capacity. In this case, with this traffic distribution, we're actually showing $2 million of incremental cost when compared to that single-region world. If we factor in our failover cost on top of that, our capacity requirements at a regional level increase and the cost could get driven even higher, in this case, $20 million. Is this $20 million efficient? In order to understand that, let's look at how cloud cost changes as we add additional regions.

In the single-region use case, we have $10 million to cover our global peak traffic, and we can't enable failover. That's our cost. If we have 2 regions, we could redistribute traffic evenly to have $5 million of cost in each region to cover our normal usage. We'd need another $5 million in each region in order to enable failover. Now, if we expand to 3 regions, we again can evenly distribute that $10 million for the nominal state, but the cloud cost needed to cover our failover conditions actually goes down, because if we were to lose only one region, the two remaining regions could cover our peak at $5 million each. Cost is driven across the different modes that we're operating in, and it depends on the steering policy that we've implemented. In the case that we looked at before, that $20 million is well in excess of the optimal $15 million, which is not great. Ideally, our traffic management solution would be able to steer traffic such that failover capacity doesn't drive additional cost.
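To make the arithmetic concrete, here is a small sketch of how guaranteed capacity cost scales with the number of regions when every region must be able to absorb an evenly distributed failover. The $10 million global peak is the talk's illustrative figure, not a real number.

```python
def failover_capacity_cost(global_peak_cost: float, num_regions: int) -> float:
    """Total guaranteed capacity so that the surviving regions can absorb an
    evenly distributed peak if any single region fails."""
    if num_regions == 1:
        return global_peak_cost  # a single region cannot fail over at all
    # Each surviving region must be able to carry peak / (num_regions - 1),
    # and that capacity is provisioned in every region.
    return num_regions * global_peak_cost / (num_regions - 1)

for n in (1, 2, 3, 4):
    print(n, failover_capacity_cost(10_000_000, n))
# 1 region: $10M (no failover), 2 regions: $20M, 3 regions: $15M, 4 regions: ~$13.3M
```

This reproduces the numbers from the talk: two regions cost $20 million with failover, three regions cost $15 million.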

Availability Risk, and Quality of Experience

Another consideration is availability risk. When we balance evenly, we're minimizing the availability risk to our customers. The reason is that we are not putting too many eggs in one basket, basically. We're not exposing more customers to a potential failure than is necessary. In the example on the right, we don't have an even distribution, so we're incurring some additional availability risk, specifically in the EU-West-1 region. The availability risk that we're incurring is equal to the amount of traffic we're steering above that one-third. The area in red indicates the additional availability risk based on the steering policy. There might be many reasons we end up doing this. It might be in order to provide a great customer experience, because of the latency preferences of our customers. Last but not least, quality of experience is very important to us. We want to ensure that users have a great experience when they're interacting with the Netflix service: that their play delays are low, that their interactions with a server are fast, and that the UI is performant. At the end of the day, we have a tradeoff between quality of experience, availability risk, and cloud capacity, and managing this tradeoff is part of what our traffic management solution needs to do.
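A rough sketch of the "traffic above the even share" idea described here; the regions and percentages are illustrative, not Netflix's actual distribution.

```python
def availability_risk(traffic_share: dict[str, float]) -> float:
    """Extra traffic exposed to a single-region failure: the amount steered
    above a perfectly even 1/N split, summed across regions."""
    even_share = 1.0 / len(traffic_share)
    return sum(max(0.0, share - even_share) for share in traffic_share.values())

# Uneven distribution: eu-west-1 carries more than its even third.
print(availability_risk({"us-east-1": 0.30, "us-west-2": 0.25, "eu-west-1": 0.45}))
# ~0.117, i.e. about 12% of global traffic is at extra risk
```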

Steering and Balancing Device to Server Traffic: Geo-based DNS Steering

Fedorov: What we need right now is a solution that solves for that tradeoff. The effectiveness of a solution is heavily influenced by the level of control that we have over the traffic: how exactly we can direct the devices to individual regions. Let's go over some of the fundamentals. How can we actually do that? What are the options? The first option is based on DNS with a geo database. It's something that Netflix had been using for many years. The way it works is that all devices are configured with the same hostname. In order for the device to reach the region, it needs to resolve that hostname into an IP address, for which DNS is used. Going over the DNS resolution path, the device first contacts the recursive DNS resolver. The recursive DNS resolver can see the client IP, exactly identifying its network location. Where the transformation between the hostname and IP happens, though, is the authoritative DNS server. That server only has access to the resolver IP, which is quite essential for the rest of our presentation. Then, on the authoritative DNS server, what needs to happen is the conversion between the IP address and the geographic location that is thought to be associated with that IP address. The operator can then configure the mapping between different geographical locations and the regional preferences for the users' traffic. At that point, a regional IP address is returned back to the device, and with that IP, the device can establish an HTTP connection and have an exchange with a server.
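A much-simplified sketch of what the geo-based decision at the authoritative DNS server looks like conceptually. The geo database, country-to-region mappings, and IP addresses are made up for illustration, and real geo lookups do proper CIDR matching rather than exact prefix keys.

```python
GEO_DB = {                 # vendor-provided resolver-IP -> country mapping (illustrative)
    "203.0.113.0/24": "BR",
    "198.51.100.0/24": "US",
}
COUNTRY_TO_REGION = {      # operator-configured geography -> region preference
    "BR": "us-east-1",
    "US": "us-west-2",
}
REGION_VIP = {             # IP handed back to the device (illustrative)
    "us-east-1": "192.0.2.10",
    "us-west-2": "192.0.2.20",
    "eu-west-1": "192.0.2.30",
}

def resolve_geo(resolver_ip_prefix: str) -> str:
    # Note: only the *resolver's* IP is visible here, not the client's.
    country = GEO_DB.get(resolver_ip_prefix, "US")
    region = COUNTRY_TO_REGION.get(country, "us-east-1")
    return REGION_VIP[region]

print(resolve_geo("203.0.113.0/24"))  # -> 192.0.2.10 (us-east-1)
```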

There are a few challenges with that solution. First, as I mentioned, the granularity of the decision is at the recursive resolver level. That's something that's outside of the control of the service provider. It's something that users configure, and that device configuration is independent of us. It means that the recursive resolver might be close to the client or it might be pretty far away, influencing how precise the identification of the location is. There is an even bigger problem: geographical distance and network distance are not necessarily the same. Network paths can be configured in many different ways, and the traffic can actually travel a pretty interesting route through the internet, heavily dependent on the destination region the traffic is reaching.

To recap, this solution has the big advantage of being easy to integrate. Every device has the same configuration, the same hostname. The latency efficiency is relatively poor, for many reasons: the resolver-based granularity, and the dependency on geo. Also, even the mapping between an IP address and a geographical location is a very hard problem, which is ultimately solved with some loss of precision. That conversion also introduces a lack of control, because typically that mapping is done by an external vendor. The frequency of their updates to the geographical database might not be aligned with the frequency of our server changes or with our knowledge of how the user base is changing over time.

Per-Device Steering

How can we do better? One approach is to change the granularity of steering to the device level. For that we need a specific service that provides, for each device, an answer about which region that device should go to, based on the client IP address. Once the device has that configuration, it still needs to connect. DNS is still typically used here: the device is provided with a regional hostname, and then the DNS resolution path is relatively simple, just mapping that regional hostname to the IP address for that particular region. That's the IP the device uses to connect. This approach can be quite accurate, because it has the ultimate granularity of control, on a per-device basis, which gives a lot of options for how to optimize the traffic. At the same time, it's also a problem, because at Netflix scale, with over 200 million users and even more devices, you need to manage a lot of moving parts in the critical path of the system, and deliver and update all of that in real time. It also means that, because the devices need to have the new configuration, the integration costs are higher. That is a problem for Netflix, which commits to supporting devices for many years. We have devices which are 10, 12 years old, where we cannot even update the hostname that's used to handle the HTTP interactions.
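A minimal sketch of per-device steering under these assumptions: a hypothetical steering service maps each client IP to a regional hostname, which the device then resolves through plain DNS. The hostnames and the lookup stub are invented for illustration.

```python
REGION_HOSTNAMES = {
    "us-east-1": "api-us-east-1.example.com",   # illustrative regional hostnames
    "us-west-2": "api-us-west-2.example.com",
    "eu-west-1": "api-eu-west-1.example.com",
}

def lookup_best_region(client_ip: str) -> str:
    # Stand-in for a real per-device decision (latency, geo, load, ...).
    return "eu-west-1" if client_ip.startswith("81.") else "us-east-1"

def steer_device(client_ip: str) -> str:
    """Return the regional hostname this particular device should use."""
    return REGION_HOSTNAMES[lookup_best_region(client_ip)]

print(steer_device("81.2.3.4"))  # -> api-eu-west-1.example.com
```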

Load-Aware Load Balancing

The third category of solutions is based on adding a load balancer between devices and the cloud infrastructure. In that case, devices are configured with the same hostname, but the request path goes through the load balancer. That load balancer is aware of the latencies between the load balancer and individual regions. It may also be made aware of the load of each region, allowing it to create a PID loop where it dynamically balances the load across regions based on the latency and the load that each region experiences. In realistic scenarios, there will be many load balancers in different locations, actively shifting traffic back and forth. That approach is very adaptable. It allows us to quickly adjust to changes in the popularity and source of our traffic across regions. As you can imagine, it has relatively high integration costs, not only because of the integration of additional components, but also because, in order to work well, it requires a signal from each region about the relative load of all the services working in that region. That's just not something that is aligned with the Netflix architecture, where we have relatively independent microservices and it's hard to get a single metric for how loaded a region is. Because of how dynamic the system is, the observability story is not easy either, because many things change all the time. If you want to debug and troubleshoot particular decisions, you may need to build pretty sophisticated tooling for that.

Solution: Latency Based DNS Steering

With those three solutions, as always happens, there are some good things and there are some challenges. We did not have a clear path forward to advance beyond the Geo DNS solution that we had. We ultimately asked ourselves a question: can we combine most of the strengths of the different solutions while avoiding most of the weaknesses? We're realistic; we're not trying to be perfect at everything. What combination of factors would work well for our particular use case, for our architecture, for our users? That's where we landed on a DNS-based solution that introduces latency-based decision making instead of geographic decision making. Let's see how it works. With that solution, devices are still configured with the same hostname. We still use the DNS path for the resolution, where the authoritative DNS server gets an input from the recursive resolver. Then, on the authoritative DNS site, we have the latency database, which gives us insight into the expected latency from the users behind that resolver to each one of the different regions. We leverage that data to make a latency-informed choice about which server IP we will hand back for that particular request. The rest of the path follows the same as before. The secret sauce, and the key to the solution, is building this latency map. Without that, this simply wouldn't work. Building the map and integrating it into DNS are the two big questions here.
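Conceptually, the authoritative DNS decision now becomes a lookup into a latency map keyed by the recursive resolver. This is only a sketch with invented data structures and numbers; in the actual system the choice is a precomputed policy produced by the solver described later, built from the latency map discussed next.

```python
# Hypothetical latency map: expected latency (ms) to each region for the
# population of users behind a given recursive resolver.
LATENCY_MAP = {
    "resolver-203.0.113.53": {"us-east-1": 145, "us-west-2": 210, "eu-west-1": 95},
}
REGION_VIP = {"us-east-1": "192.0.2.10", "us-west-2": "192.0.2.20", "eu-west-1": "192.0.2.30"}

def resolve_latency_based(resolver_id: str) -> str:
    latencies = LATENCY_MAP[resolver_id]
    best_region = min(latencies, key=latencies.get)  # lowest expected latency wins
    return REGION_VIP[best_region]

print(resolve_latency_based("resolver-203.0.113.53"))  # -> 192.0.2.30 (eu-west-1)
```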

Let's walk through how we solve that problem. First, we need to collect the data. Ultimately, what we need to collect is latency measurements between devices and our server infrastructure. Luckily, at Netflix, we've built a system called Probnik, which allows us to run controlled latency measurements from the devices to pretty much any server out there. There are a few past presentations on that, so let me just go very quickly over the high-level details. What Probnik essentially is, is a piece of code that's deployed alongside the Netflix application, which we can remotely configure to run network latency tests. We can tell an individual device: go ahead and run a few network checks. Download a few pieces of data in parallel from each one of the regions that we operate in, and measure the time it takes to do so. In addition to that, we send another request with a unique hostname that allows us to trace the DNS resolution path, giving us an association between the latency measurements, the resolver IP, and the location of the DNS server that was making the resolution.
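A minimal sketch of what such a probe agent might do. The endpoints and hostnames are hypothetical, the real Probnik recipes are remotely configured and run the downloads in parallel, and this sequential version is only meant to show the shape of the measurement.

```python
import time
import uuid
import urllib.request

# Hypothetical per-region test endpoints serving a small object.
TEST_TARGETS = {
    "us-east-1": "https://probe-us-east-1.example.com/test-object",
    "us-west-2": "https://probe-us-west-2.example.com/test-object",
    "eu-west-1": "https://probe-eu-west-1.example.com/test-object",
}

def run_probe() -> dict:
    results = {}
    for region, url in TEST_TARGETS.items():
        start = time.monotonic()
        urllib.request.urlopen(url, timeout=5).read()        # download a small test object
        results[region] = (time.monotonic() - start) * 1000  # elapsed time in ms
    # A unique hostname lets the server side trace which recursive resolver and
    # which authoritative DNS site handled this client's resolution.
    trace_host = f"{uuid.uuid4().hex}.trace.example.com"
    urllib.request.urlopen(f"https://{trace_host}/ping", timeout=5)
    return {"latency_ms": results, "trace_host": trace_host}
```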

That data is quite powerful. Once we collect all the information from individual users, we can send it to the server for processing. That allows us to aggregate the data. Now we know the latency from each user to each server, and we know how to group users by the units of control that are at our disposal, which could be the recursive resolver IP, or the location of the DNS site. One important aspect of the Netflix infrastructure is that we run our own CDN. We run the authoritative DNS service at over 80 locations of that CDN, which are very geographically distributed. With that aggregation of latency by recursive resolver, or by DNS site, what we are essentially doing is clustering all of the users that we have into individual buckets, into areas that we can control the traffic for, and mapping those areas to individual regions. The key part here is, how do we actually make the selection? How do we allocate a particular group of users to a particular region? That is the next big problem that we need to solve.

Implementation

Behnam: Probnik provides a wealth of information that we use as the basis for building a model. This model needs to decide how it should steer traffic: should it steer at the recursive resolver grain, or at a different grain? Before we make that choice, we need to figure out what kind of model we want to build. Working with our data science partners, we chose a linear programming approach. In order to use linear programming, we can't really have three separate objective functions. We can't simultaneously optimize for quality of experience, availability risk, and cloud capacity constraints. What we want to do is identify a single objective function. Given that we want to provide a great customer experience, we decided that quality of experience will be our objective function, and we'll constrain for availability risk and cloud capacity. We'll use the Probnik data to inform how we view quality of experience, while using the traffic buckets defined by either recursive resolvers or DNS sites to inform the availability and capacity constraints.

Steering by Authoritative Resolvers vs. Recursive Resolvers

As Sergey mentioned, we have an option: we could steer either by authoritative resolver or by recursive resolver. In order to make that decision, let's dig into the differences between steering by one or the other. First, there are hundreds of thousands of recursive resolvers around the globe, based on what we've seen in our Probnik data, whereas we have roughly 80 points of control if we look at our authoritative DNS sites. This has implications for various aspects of steering, for example the precision. Given that we have potentially 100,000 recursive resolvers that we could independently steer, that gives us finer-grained control of traffic steering. It also allows us to get a bit more optimal in terms of the quality of experience that we provide to our users. When we measured this, I think it worked out to something like 10 milliseconds at the median across our customer base. However, it comes at a huge cost in terms of runtime complexity. As a point of comparison, an authoritative DNS site-based model can be solved in a few minutes, whereas one that's using 100,000 points of control can take days or weeks to generate a new result. That doesn't provide us a lot of flexibility. In addition, observability becomes more challenging, given there are many points of control; it's harder to understand whether your system is operating in the way that you expect. For these reasons, we decided that steering by authoritative resolvers simplifies a lot of our problems and provides great observability.

Steering Solver Core

Now that we've made this decision, we need to build the core of our steering solver. The steering solver gets a few inputs: the telemetry that Probnik provides, and a set of policy configurations that it consumes in order to generate our steering policies. At a high level, our steering policies are simply a mapping of authoritative DNS site to an AWS region. The wrinkle here is that, for our use case, we not only care about how we want to steer traffic during the normal operating mode for the cloud, we also want to steer sites to regions in various failure modes. In case a region is unavailable, we need to know where we want to serve that user base.

Steering Solver Overview

Let's talk a bit about what the solver is. The solver is an integer linear program. Since I'm not a data scientist, I asked Bard. Bard said that it's a mathematical optimization technique that helps you solve NP-hard problems. The specific difference with an integer program is that the decision variables are integers. In our case, they're actually binary decision variables: they can be a 0 or a 1. If you think about the decisions that we're making at a high level, we're basically deciding whether a site should be steered to a specific region. We could think of that decision as a true or false. Should San Jose go to U.S.-East-1, yes or no? U.S.-West-2, yes or no? Or EU-West-1, yes or no? That's basically it. At a high level, this is the type of model that we want to establish. Now we have to establish the constraints that frame the problem, as well as identify our objective function, in order to optimize.

Fundamentally, we want to steer a site to only a single region. That's our first constraint. We want the model to generate a policy where we're mapping each site to a single region. If we look at the way we could define this in code, we want to iterate across all our DNS sites and set up a constraint where the sum of the decision variables for that site across all regions equals 1. Next, we want to define our objective function. Our goal is to minimize latency for clients, because that's ultimately what we want to do: provide a good customer experience to our users. Let's take a look at this objective function definition. We want to ensure that, for the site-to-region mapping we've defined, we are minimizing the latency penalty of steering a given site to a region, multiplied by the site's daily traffic. One thing that we haven't talked about, though, is what these mean. What is the latency penalty? Why are we multiplying it by the daily site traffic as part of our objective? Let's dig in there.
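The talk doesn't show the actual solver code or library; as an illustration, here is roughly what the binary decision variables, the one-region-per-site constraint, and the traffic-weighted latency-penalty objective could look like using the open-source PuLP package, with toy data. The latency_penalty and daily_traffic inputs used here are exactly the quantities explained in the next few paragraphs.

```python
from pulp import LpProblem, LpVariable, LpMinimize, lpSum

sites = ["sjc", "lhr", "gru"]                        # authoritative DNS sites (toy)
regions = ["us-east-1", "us-west-2", "eu-west-1"]

daily_traffic = {"sjc": 120_000, "lhr": 90_000, "gru": 40_000}   # probe counts (toy)
latency_penalty = {                                  # median extra ms vs. best region (toy)
    ("sjc", "us-east-1"): 60, ("sjc", "us-west-2"): 0,   ("sjc", "eu-west-1"): 140,
    ("lhr", "us-east-1"): 70, ("lhr", "us-west-2"): 130, ("lhr", "eu-west-1"): 0,
    ("gru", "us-east-1"): 0,  ("gru", "us-west-2"): 60,  ("gru", "eu-west-1"): 90,
}

prob = LpProblem("steering", LpMinimize)

# Binary decision variable: is site s steered to region r?
x = LpVariable.dicts("assign", [(s, r) for s in sites for r in regions], cat="Binary")

# Constraint: each site is steered to exactly one region.
for s in sites:
    prob += lpSum(x[(s, r)] for r in regions) == 1

# Objective: minimize the traffic-weighted latency penalty across all sites.
prob += lpSum(daily_traffic[s] * latency_penalty[(s, r)] * x[(s, r)]
              for s in sites for r in regions)

prob.solve()
print({s: next(r for r in regions if x[(s, r)].value() == 1) for s in sites})
```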

The latency penalty can be best described by looking at it in relation to a single probe measurement. Let's assume this client in Africa generated a probe, and now we understand its connectivity and its preferences for the three Amazon regions we currently run in. Its most preferred region is the EU region, and its least preferred is on the west coast of the United States. With our steering approach, we can't improve this selection beyond the client's most optimal choice. So instead of using latency directly, we can define a latency penalty as the additional latency incurred by the client if we steer them away from their most preferred region. In this case, the latency penalty for steering this client to the EU is zero, whereas the penalty for steering this client to the West Coast is 150 milliseconds.

The Probnik data provides us a wealth of information, but in order to calculate the latency penalty, we can just focus on the latency and the DNS site information for each measurement. What we do is convert these latencies to latency penalties at a site-and-region grain. What we get is three histograms that represent a single site's regional preferences. In this case, visually, you can see that this site prefers to be steered to U.S.-East-1. We can't use these histograms directly; what we want is a statistic to represent the latency penalty for the site to each of these regions. You could use something like a mean, a median, or some other percentile. In our case, we decided to use the median to represent the latency penalty for a given site. Now we know its most preferred region is U.S.-East-1, its least preferred is the EU, and the cost of sending this site to the EU is 300 milliseconds.
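A sketch of how raw probe latencies might be turned into per-site, per-region median latency penalties. The field names and numbers are illustrative.

```python
from collections import defaultdict
from statistics import median

# Each probe: which DNS site handled it, plus measured latency (ms) per region.
probes = [
    {"site": "lhr", "latency_ms": {"us-east-1": 95,  "us-west-2": 160, "eu-west-1": 25}},
    {"site": "lhr", "latency_ms": {"us-east-1": 110, "us-west-2": 170, "eu-west-1": 30}},
    {"site": "sjc", "latency_ms": {"us-east-1": 80,  "us-west-2": 20,  "eu-west-1": 150}},
]

penalties = defaultdict(list)
for probe in probes:
    best = min(probe["latency_ms"].values())           # the client's most preferred region
    for region, ms in probe["latency_ms"].items():
        penalties[(probe["site"], region)].append(ms - best)  # extra latency if steered here

# Collapse each (site, region) histogram into a single statistic: the median.
latency_penalty = {key: median(values) for key, values in penalties.items()}
print(latency_penalty[("lhr", "eu-west-1")])  # 0   - lhr's preferred region
print(latency_penalty[("lhr", "us-east-1")])  # 75.0
```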

The next step is to understand the traffic volume each site represents. For that, we actually don't need most of this information. All we need to know is the frequency with which these DNS sites occur in the probe data, because the probe data is normalized to represent the actual traffic volume. Once we have probe counts, we can look at them in different ways. We could look at them daily, or even hourly, if we want to establish things like capacity constraints at an hourly level, or balance between regions hourly to address our availability risk. Now that we've defined these two things, why are we multiplying them together as part of our objective function? We want to make sure that we are making a utilitarian decision when looking at the latency penalties, so that large sites and small sites aren't treated the same. We want to ensure that we're steering most of our users as optimally as we can. So we multiply the latency penalty by the amount of traffic, really the number of users, that a site represents. That way, the optimization will choose site-to-region mappings that are most optimal across our entire user base.
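A tiny sketch of deriving those traffic weights from probe counts, both daily (for the objective) and hourly (for the constraints that follow). The probe records are invented for illustration.

```python
from collections import Counter

# Probes carry the DNS site that served them and the hour they were taken in.
probes = [
    {"site": "lhr", "hour": 20}, {"site": "lhr", "hour": 20},
    {"site": "sjc", "hour": 3},  {"site": "gru", "hour": 20},
]

daily_traffic = Counter(p["site"] for p in probes)                 # weight for the objective
hourly_traffic = Counter((p["site"], p["hour"]) for p in probes)   # for capacity/availability constraints

print(daily_traffic["lhr"])          # 2
print(hourly_traffic[("lhr", 20)])   # 2
```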

Next, we care about cloud capacity and defining that constraint. As we discussed before, cloud capacity, or cloud cost, is really driven by how high each regional peak is. Via this constraint, we want to ensure that we can dampen the amount of traffic each region is serving, such that we minimize cost for the business. Visually, we want to establish a limit and have the solver steer sites to regions such that we don't exceed that peak traffic. The difference here is that now we have an hourly constraint: for each hour of the day, for each region, we want to ensure that the combined traffic of the sites steered to that region is less than the peak traffic that we serve globally times a regional traffic percent limit.

Lastly, we want to define an availability constraint. If you remember back when we talked about availability risk, what we're really trying to do is evenly balance traffic, so that each region ideally serves an even amount of the global traffic we're currently serving. Let's assume this is our traffic shape, given the number of regions that we're steering towards. We can't steer traffic precisely to this, because we have 80 points of control, nor would we want to from a quality of experience perspective, because customers will have regional preferences and we want to give the model flexibility. So we establish bounds within which the model has some flexibility in how it steers traffic at a regional level. That way, it can build a policy where the regional traffic fits within our bounds. Similarly, here we establish an hourly constraint: for each region, we want to make sure that the absolute difference between a perfectly balanced traffic distribution and the amount of traffic the region is actually serving, based on the region selection of the sites, is within the bounds that we established. In this case, we're expressing the bounds as a percentage of global traffic for that hour of the day.
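Continuing the illustrative PuLP sketch, here is roughly how both hourly constraints could be expressed: the capacity limit against global peak, and the availability bound, with the absolute-value condition linearized into two inequalities. All numbers are toy values and the objective from the earlier sketch would be added on top of these constraints.

```python
from pulp import LpProblem, LpVariable, LpMinimize, lpSum

sites = ["sjc", "lhr", "gru"]
regions = ["us-east-1", "us-west-2", "eu-west-1"]
hours = range(24)

# Toy hourly traffic per site (probe counts normalized to traffic volume).
hourly_traffic = {(s, h): base for s, base in
                  {"sjc": 100, "lhr": 80, "gru": 40}.items() for h in hours}

regional_percent_limit = 0.50   # each region may serve at most 50% of global peak
availability_bound = 0.20       # stay within 20% of a perfectly even split

global_peak = max(sum(hourly_traffic[(s, h)] for s in sites) for h in hours)

prob = LpProblem("steering", LpMinimize)
x = LpVariable.dicts("assign", [(s, r) for s in sites for r in regions], cat="Binary")
for s in sites:
    prob += lpSum(x[(s, r)] for r in regions) == 1
# (the traffic-weighted latency-penalty objective from the earlier sketch goes here)

for h in hours:
    global_hour = sum(hourly_traffic[(s, h)] for s in sites)
    for r in regions:
        regional_hour = lpSum(hourly_traffic[(s, h)] * x[(s, r)] for s in sites)
        # Capacity: sites steered to r must stay under the regional peak limit.
        prob += regional_hour <= regional_percent_limit * global_peak
        # Availability: |even share - actual| <= bound, written as two inequalities.
        even_share = global_hour / len(regions)
        prob += regional_hour - even_share <= availability_bound * global_hour
        prob += even_share - regional_hour <= availability_bound * global_hour
```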

Policy Configs

Now that we've defined our steering solver and the constraints that shape the way we solve the problem, the only thing remaining is to define our policy configurations. These are actually fairly simple. For a 3-region policy, for example for streaming, we first define the regions that we want to steer to. In this case, this is a 3-region steering policy, so we've defined all three of our regions. Then we establish our regional capacity limit; in this case, we're allowing each region to grow to 50% of our peak traffic. Next, we define an availability bound; we're basically telling the model that it can shape traffic such that we're up to 20% off from what is ideally balanced. The last little bit is that we want to define the operating modes that this steering policy should support. We have a nominal mode, as well as multiple failure modes. In this case, we're defining a U.S.-East-1 failure mode. We're telling the model that for this mode, it can only use U.S.-West-2 and EU-West-1, because it's a U.S.-East-1 failure mode.
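One hypothetical way to represent such a policy configuration; the talk doesn't show the exact format Netflix uses, so the structure below is just an illustration of the parameters described above.

```python
streaming_policy_config = {
    "regions": ["us-east-1", "us-west-2", "eu-west-1"],
    "regional_capacity_limit": 0.50,   # each region may grow to 50% of peak traffic
    "availability_bound": 0.20,        # allow up to 20% deviation from an even split
    "modes": {
        "nominal":           ["us-east-1", "us-west-2", "eu-west-1"],
        "us-east-1-failure": ["us-west-2", "eu-west-1"],
        "us-west-2-failure": ["us-east-1", "eu-west-1"],
        "eu-west-1-failure": ["us-east-1", "us-west-2"],
    },
}
```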

We can change the parameterization for different purposes, however. In this case, it's a latency-optimized policy: we set the regional capacity limit and availability limit to give the model as much flexibility as it needs to steer optimally. We're basically saying it could steer 100% of peak traffic to one region, if that is most latency optimal for us. Or we can shape traffic in interesting ways. Here, we're telling the model to apply the capacity limit to individual regions: we want to dampen U.S.-East-1 to at most 20% of peak traffic, while allowing the other regions to get as large as they need to be to serve global traffic. In this way, we can generate policies that cover not only the nominal mode, which is great, but also all our failure modes, which are key to us providing the resiliency that we need in the cloud. More than that, we generate not just a single policy that represents streaming; we generate many policies for the different use cases across the business, for example for gaming, for customer service, for partner integration, you name it. We can generate policies for the different use cases by re-parameterizing policies to meet their needs.

Full Integration Flow

Fedorov: At this point, we have all the components of the solution. We have a way to measure the latency and we have a way to model the traffic, so we know how we want to steer the production requests. Now let's describe how we actually productize it, how we deploy the system into our infrastructure. The way it works is that on each device, as I described, there is a probing agent. It keeps measuring the latency continuously and sends the results of these latency measurements to our data center, so they can be aggregated and transformed into the steering policy using the solver. That configuration goes into the map that defines the site-to-region mapping for the different types of requests that we have, for different policies, for different states. Then we have a highly available service which deploys this configuration to the authoritative DNS servers. The configuration is pretty simple. It just says that a request of this class needs to be steered to this region at this point in time. Then, when actual production requests come in from devices, they follow the DNS resolution path, and that's where the region configuration is applied to choose the proper IP address to return to the device, so it talks to the region that was configured. In case a failure occurs and the target region goes down, we already have the policy precomputed. It's the responsibility of the configuration service to send the updated configuration, knowing the current state of the cloud infrastructure. In that case, all new requests from the devices are mapped according to the failure-mode policy, changing the shape and configuration of our traffic. For existing requests, we have a few options to reroute them. Eventually, the connections are reestablished, and new connections follow the updated path. We also ensure that we use low TTL values for the DNS records.
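A rough sketch of that failover flow under the assumptions above: policies for every operating mode are precomputed by the solver, and a configuration service simply pushes the mapping that matches the current state of the cloud. Mode names, sites, and mappings are illustrative.

```python
# Precomputed site -> region mappings, one per operating mode (illustrative).
policies = {
    "nominal":           {"sjc": "us-west-2", "lhr": "eu-west-1", "gru": "us-east-1"},
    "us-east-1-failure": {"sjc": "us-west-2", "lhr": "eu-west-1", "gru": "us-west-2"},
}

def select_policy(healthy_regions: set[str]) -> dict[str, str]:
    """Pick the precomputed mapping that matches the current cloud state."""
    if healthy_regions == {"us-east-1", "us-west-2", "eu-west-1"}:
        return policies["nominal"]
    if "us-east-1" not in healthy_regions:
        return policies["us-east-1-failure"]
    raise LookupError("no precomputed policy for this failure mode")

# The configuration service would push this mapping to the authoritative DNS
# servers; short DNS TTLs mean new resolutions pick up the change quickly.
print(select_policy({"us-west-2", "eu-west-1"}))  # gru is now steered to us-west-2
```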

Impact on Netflix Traffic Once We Moved to This System

We deployed the system a few years ago, going from the geo-based to the latency-based DNS solution. We were quite pleased with the results, because we were able to reduce the overall latency to our infrastructure by 1% to 3% across all users, while also achieving a much better traffic distribution, leading to millions of dollars in infrastructure savings on an annual basis. Beyond that, the system enabled us to have much better control over, and confidence in, how our traffic is being managed. With that confidence came simplicity, because we can see visually what is happening with the traffic right now, and what would happen to our traffic if some parts of our infrastructure were to fail. As we kept improving the system, we realized that it's actually even more powerful, because it allows us to model changes that are not even in production yet.

Steering Model Provides Flexibility and Enables Exploration

Behnam: This model-based approach to steering is extremely flexible. It allows us to explore and experiment, and not only experiment in production: we can generate many policies for experimental use cases that don't even steer actual client traffic. Two interesting use cases are, first, that we can generate policies for specific events in the world or within our business. For example, world events like the World Cup or a Super Bowl may have different traffic distributions than we typically see, and we can generate steering policies that address the needs of those days. Or, for content launches or live events, we may want to steer in a different way that trades quality of experience for things like additional headroom in the regions, or better balance to provide lower availability risk to the business. Second, this same system can be used to do forward-looking experimentation. We can analyze how the state of the system would change if we changed, for example, our footprint.

Currently, we run in three regions, but Amazon has many regions across the world. One question that we asked is: what would be the ideal footprint for Netflix? If we didn't have a system where we could build steering policies using a data-driven approach, this would be very difficult. Because we have Probnik, we can enable Probnik recipes that collect the latency preferences of our users across Amazon's entire footprint. Then we can feed these into a model. We can create specific policy configurations that steer to a subset of the regions that Amazon operates, and go over a number of combinations and permutations of the regions to understand the various benefits that they provide. We get to understand which of these combinations provide good quality of experience for our users. Then we can marry that information with pricing information about the regions to create a business case for the regions in which we should operate.
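A simplified sketch of that footprint exploration: for each candidate combination of regions, estimate the traffic-weighted latency cost by assigning every site to its best region in the subset. This greedy scoring ignores the capacity and availability constraints the full solver would also enforce, and all of the data is illustrative.

```python
from itertools import combinations

candidate_regions = ["us-east-1", "us-west-2", "eu-west-1", "ap-southeast-1", "sa-east-1"]
sites = ["sjc", "lhr", "gru", "sin"]
daily_traffic = {"sjc": 120, "lhr": 90, "gru": 40, "sin": 60}

# Median latency (ms) from each site to each candidate region (toy numbers).
latency = {
    "sjc": {"us-east-1": 80,  "us-west-2": 20,  "eu-west-1": 150, "ap-southeast-1": 180, "sa-east-1": 190},
    "lhr": {"us-east-1": 90,  "us-west-2": 150, "eu-west-1": 20,  "ap-southeast-1": 200, "sa-east-1": 210},
    "gru": {"us-east-1": 120, "us-west-2": 180, "eu-west-1": 200, "ap-southeast-1": 330, "sa-east-1": 20},
    "sin": {"us-east-1": 230, "us-west-2": 180, "eu-west-1": 170, "ap-southeast-1": 20,  "sa-east-1": 330},
}

def footprint_cost(regions: tuple[str, ...]) -> float:
    # Traffic-weighted latency if every site simply goes to its best region in the subset.
    return sum(daily_traffic[s] * min(latency[s][r] for r in regions) for s in sites)

ranked = sorted(combinations(candidate_regions, 3), key=footprint_cost)
for regions in ranked[:3]:
    print(regions, footprint_cost(regions))
```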

Learnings from Building a Model-Based Steering Approach

In our journey to develop our own steering solution, we learned three key lessons. First, know your customers. The information about our users' preferences was key to the way we approached building the steering solution, and we also had to understand the needs of our internal customers and the various use cases that steering needs to support. Second, understanding our architecture was important, in that steering isn't a one-size-fits-all problem. Depending on the needs of your architecture, for example failover in our case, or your relative cloud costs, you may build a steering solution that's very different from that of other folks who run on a public cloud or in their own data centers. Lastly, data was key to the way that we built this model-based approach to steering. Without data about customer preferences, we wouldn't have been able to build a steering map, nor would we know how well our steering performed in production.

 


 

Recorded at:

Dec 14, 2023
