Netflix Networking: Beating the Speed of Light with Intelligent Request Routing

Summary

Sergey Fedorov discusses how to build the Internet latency map, using network protocols and edge infrastructure, and how to use a data-driven approach to evolve your client-server interactions.

Bio

Sergey Fedorov is Director of Engineering at Netflix.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.

Transcript

Fedorov: This presentation is about improving the performance of network requests. It's been known for years that the latency of network interactions has a large impact in many business areas. For example, internet retail, with Amazon showing the correlation between latency and online sales revenue, or Google now including page speed in their ranking algorithm. Companies are willing to invest hundreds of millions of dollars to reduce latency by just a few milliseconds. Over the last year, all of us directly felt the growing importance of interactions over the internet in our day-to-day lives.

Static Content Delivery with Open Connect CDN

When one thinks of latency in the scope of Netflix's business, video streaming naturally comes to mind first, and for a good reason. Over 10% of all downstream traffic on the internet is Netflix's video streaming. To make it performant and efficient, Netflix has built and scaled one of the most well connected content delivery networks, or CDNs, in the world. This network is called Open Connect and is used today to deliver pretty much all of Netflix's static content like videos or images. This data is delivered from servers called Open Connect Appliances. They use a combination of hardware components and software optimizations to deliver content with incredible efficiency.

These servers are deployed in thousands of locations all around the world, embedded into ISP networks or placed into internet exchange locations. The internet exchange locations are themselves connected together via the private Open Connect backbone network. This network also connects CDN servers to the control plane in the Amazon cloud. This powerful infrastructure has been deployed and optimized for the best streaming delivery over 10 years, and is one of the core building blocks of the smooth Netflix streaming experience.

UI Personalization Is Powered By the AWS Cloud

Network dependencies for Netflix service are not limited to video streaming. Before a user starts the stream, they need to choose a title to watch. For that, they interact with the content discovery experience within Netflix UI. These experiences also depend on data delivered over the network. However, these interactions are heavily personalized for each user, with data provided by a collection of services running in the Amazon cloud infrastructure. Today, Netflix's cloud infrastructure runs on three Amazon regions. In those regions, hundreds of microservices work together to provide this personalized metadata sent to UIs on user devices. From user measurements, Netflix knows that delays to get this personalized data from cloud endpoints may contribute quite significantly to user perceived delays when they interact with the Netflix UI. For example, calls to AWS services can take up to 40% of user wait time to render the homepage on Android.

With that in mind, Netflix engineers were faced with an interesting question. Would it be possible to leverage the widely distributed CDN edge infrastructure to improve the performance of network requests between user devices and the cloud infrastructure in AWS? Standard HTTP caching techniques wouldn't work because personalization makes these interactions practically non-cacheable. While there are many options that involve a larger re-architecture, the ideal solution would minimize disruption to device and server teams, so they could focus on delivering product features. Ideally, the system itself would be easy to maintain, no matter how distributed it is.

Background & Outline

My name is Sergey Fedorov. I'm a Director of Engineering on the Content Delivery Team at Netflix. I'm going to present how Netflix has solved this problem. First, I'm going to explain how to leverage distributed CDN edge infrastructure to accelerate network requests between devices and data centers. Then, I'm going to describe in detail, Netflix's solution, all the way from the concept to deployment and to operations of the system. Lastly, I'm going to highlight what all of you can learn from our experience to make sure that your network requests can fly through the internet as quickly as possible.

How to Accelerate Requests Using CDN Edge

First, let's talk about some fundamentals of network acceleration. In our case, we have three regions in the cloud, but we have clients all around the world. We have a far more distributed CDN edge infrastructure across thousands of locations, generally quite a bit closer to users. Due to personalization, we cannot position the content on the CDN servers, but what if we route the request from the client to the cloud via a proxy on the closest CDN server? Can it be faster? In order to explain the answer to that question, let me get back to some networking 101. In order for the client to start getting the data from the server, it needs to first establish TCP and TLS connections. That time depends on the round trip time, or RTT, between the client and the terminating server. In the situation when that latency is 100 milliseconds, it will take at least 200 milliseconds to establish a connection, and another 100 milliseconds to start passing useful data from the server to the user, for a total of 300 milliseconds. Now, if we install a CDN server in between with a proxy that terminates that connection, and let's say it's 30 milliseconds away, the overhead of establishing the connection is only paid on the shorter segment of the path, and we only travel the full 100 milliseconds of latency once, to get useful data. As a result, the total latency to start getting the data is reduced to 160 milliseconds, quite a bit lower.
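The arithmetic in this example can be written down as a small model. This is only a sketch under simplifying assumptions that are not stated in the talk: the handshake costs exactly two round trips, the edge proxy already holds a warm connection to the origin, and the end-to-end round trip through the proxy is no longer than the direct one. The function names are made up for illustration.

```python
def direct_ttfb_ms(rtt_to_origin_ms, handshake_round_trips=2):
    """Time to first useful byte when the client connects straight to the origin.

    The TCP and TLS handshakes and the request/response all pay the full RTT.
    """
    return handshake_round_trips * rtt_to_origin_ms + rtt_to_origin_ms


def proxied_ttfb_ms(rtt_to_edge_ms, rtt_to_origin_ms, handshake_round_trips=2):
    """Time to first useful byte when a nearby edge proxy terminates the connection.

    Handshakes only pay the short client-to-edge RTT; the request/response
    still travels the full path to the origin once (assumed equal to the
    direct RTT, over an already established edge-to-origin connection).
    """
    return handshake_round_trips * rtt_to_edge_ms + rtt_to_origin_ms


if __name__ == "__main__":
    print(direct_ttfb_ms(100))        # 300, the direct case from the talk
    print(proxied_ttfb_ms(30, 100))   # 160, the proxied case from the talk
```

The model also shows when a proxy cannot help: if the edge is no closer than the origin, the handshake term does not shrink.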

Reducing Data Transfer Times

The benefits do not stop at the connection establishment. When the data is already flowing, if there is any loss in the network, which could be quite common on wireless networks, the recovery from that loss depends again on the round trip time between the client and the terminating server. Then between the CDN edge server and the cloud, we could leverage our private backbone network. We can configure specific rules to prioritize latency sensitive traffic, and help to avoid the congestion on the internet. Then, we can also multiplex several requests from the client on the same connection between the CDN edge and the origin server, getting even more benefit out of that network link. On one side, this is all networking fundamentals. From the user perspective, this can look like magic, because we are managing to make data fly faster, without changing anything on the client or on the server side. It may appear like we're defying the laws of physics. That's why we decided to call this project faster than light.

Measure What You Can, Isolate the Rest

Now with a strong theory and a cool sounding name, all that's left to do is to launch this project in production. Not so fast. At Netflix, proving your case with theory is not enough, because we know that reality may bring you many surprises, especially when you operate at Netflix's scale. Which meant that for our project, we had to validate the improvements promised by the theoretical background with real world data. We had to go and invest into building the network measurement system. For that measurement system, our ultimate goal is to understand the patterns of connectivity between our users and origin servers. In a way, we wanted to build the latency map of the internet, the one that we could then use to validate the performance of different routing options for our requests.

Real User Monitoring

For that system, the critical part is to have estimates that cover the full range of our users and the Netflix devices they use. Of course, we want to do it as quickly as possible, and do it in a way that minimizes the risk to our production systems. First, we looked at the available options. One of them is to use real user monitoring, where you observe the performance of network requests for your production traffic. It's nice because you get the full coverage of all locations and devices for your users. However, you can only test the production path. In order for us to test the performance of our CDN edge proxy, we would have to productize it first.

Synthetic Monitoring

An alternative is to use synthetic monitoring, where you have test devices in the lab, and you can test any server and any path you want. You have full control. However, building a lab representing all of the locations and devices for Netflix users is pretty much impossible. So we built Probnik, a synthetic measurements platform running on user devices. It consists of an agent that we call a probe that gets deployed alongside the Netflix UI, and allows us to run controlled network experiments. The way it works: at some moment, ideally with little network activity in the UI, the probe makes a call to the control plane in the cloud, and gets the test configuration that we call a recipe. The recipe specifies the set of targets that the client needs to download a specific piece of data from, and the probe measures how long each request takes. It then reports the results for analysis. There is more information about Probnik available online, in earlier presentations and on the Netflix open source page on GitHub.
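The probe loop described here (take a recipe, fetch from each target, time it, report) can be sketched in a few lines. This is not Probnik's actual API; `run_recipe`, the recipe shape, and the result fields are hypothetical names chosen for illustration, and the transport is passed in as a function so the sketch stays independent of any HTTP library.

```python
import time
from typing import Callable, Dict, List


def run_recipe(targets: Dict[str, str],
               fetch: Callable[[str], bytes],
               clock: Callable[[], float] = time.perf_counter) -> List[dict]:
    """Fetch the same payload from every target and time each request.

    targets maps a target name (for example "direct" or "edge") to a URL.
    fetch is whatever the platform uses to download a URL; it should raise
    on failure. Returns one result dict per target, ready to be reported.
    """
    results = []
    for name, url in targets.items():
        start = clock()
        try:
            fetch(url)
            ok = True
        except Exception:
            # A failed request is still a data point: it feeds reachability.
            ok = False
        duration_ms = (clock() - start) * 1000.0
        results.append({"target": name, "ok": ok, "duration_ms": duration_ms})
    return results
```

On a real device the `fetch` argument would wrap the platform's HTTP client, and the returned list would be sent back to the control plane for aggregation.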

Prototype - Verify Performance Expectations

Once we had a measurement system, it was time to build a prototype for our CDN edge proxy. For the prototype, our guiding principle was simplicity. We went with a Go based proxy, choosing Go for the power of its networking library and the ability to deploy it as a static binary, so we could just drop it on our CDN server, open the port, and start routing traffic. Then we also had to figure out how to connect devices with one of the thousands of edge servers. For the cloud server selection, we've used GeoDNS to choose one of the AWS regions for each user. GeoDNS wouldn't work so well for our CDN edge. We went with TCP Anycast, where we configure a single IP address for all of the servers on the CDN, and let the client be connected to one of them based on the network distance. Now, for each client, we have two options: going to the cloud directly, or being proxied over the CDN edge proxy.

The question is, which path is faster, and that's where we're using our probing system to measure the performance. For each user, we are asking them to send two requests. Download the same piece of data from the cloud over a direct path, and have the request go via the CDN edge proxy, and compare the time. Then we aggregate the results over all users and use this visualization to look at them. On the vertical scale, we're looking at different regions where users are located. On the horizontal scale, we're looking at the percentage of acceleration that the proxy gives compared to our control experience with clients going to the cloud directly. The intensity of the color shows the number of users achieving this specific percentage of acceleration for the request compared to control. Ideally, we'd like to see all the users to the right of the red dotted line, meaning that we have an equal or faster experience. However, on this heatmap, we see a good number of users to the left of the line, meaning that for them, the proxy actually results in a slower experience. This is not what we want to have. We don't want to have a compromise. We don't want to make the performance slower for some users to make others faster.
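The heatmap described above boils down to a per-user acceleration percentage, bucketed and counted per region. The sketch below assumes one particular definition of "percentage of acceleration", namely time saved relative to the direct (control) path; the talk does not spell out the exact formula, and the function names and bucket width are illustrative.

```python
from collections import Counter


def acceleration_pct(direct_ms, proxied_ms):
    """Positive when the proxied path is faster than the direct (control) path."""
    return (direct_ms - proxied_ms) / direct_ms * 100.0


def heatmap_counts(samples, bucket_pct=10):
    """Count users per (region, acceleration bucket).

    samples is an iterable of (region, direct_ms, proxied_ms), one per user.
    Buckets are labeled by their lower edge, so -30 means [-30%, -20%).
    Negative buckets are the users left of the "no change" line.
    """
    counts = Counter()
    for region, direct_ms, proxied_ms in samples:
        pct = acceleration_pct(direct_ms, proxied_ms)
        bucket = int(pct // bucket_pct) * bucket_pct
        counts[(region, bucket)] += 1
    return counts
```

Any non-zero count in a negative bucket corresponds to users for whom the proxy made things slower, which is exactly what the first prototype revealed.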

At this point, we learned that a solution purely based on CDN edge proxying wouldn't work for our users. We got to this point using quick prototyping. We're quite confident in our data. Most importantly, we didn't risk anything in production. Yet we were not ready to give up on the idea. We went back to the drawing board. One thing that we've learned from the first prototype is that many users had much faster performance with our proxy, but not all of them. What if we could intelligently choose the most optimal path depending on the user, depending on their connectivity? In other words, we'd have to route or steer them intelligently. We'd have to make that decision without making any API calls to the cloud, because that's where we are trying to send traffic to. We still wanted to maintain the easy integration with the client, so no complex logic.

Using DNS

When we looked at the available options, DNS came to mind. DNS converts the hostname used by the device to the IP address the device needs to connect to. What if we plug into that process and either return an IP address of the AWS server for the direct path, or the TCP Anycast IP for the CDN proxy path? One complication with DNS is that the authoritative DNS server, the one that does the conversion between hostname and IP address, doesn't have visibility all the way to the client. Instead, it only sees the IP address of the recursive resolver that is configured by the user. Which means that we'll have to make a routing decision for the groups of users sitting behind the same recursive resolver. In order to do that, we'll still use the same testing recipe with our probe, where we have two paths measured between the users and the cloud. One going directly to the cloud, another one using the proxy. Then we'll aggregate the latency measurements and group them by resolver. Then for each resolver, we'll have to make a single decision to route all the users for that resolver either over the proxy path or to the cloud directly. This map is built based on the resolver IP and is then loaded onto the authoritative DNS server, which then uses that map to route future requests by responding with the IP address based on the map decision.
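The grouping step can be sketched as follows. Everything beyond "group by resolver and make one decision per resolver" is an assumption made for illustration: comparing medians, requiring a minimum number of samples, and defaulting to the direct path when data is thin are plausible choices, not details given in the talk.

```python
from collections import defaultdict
from statistics import median

DIRECT = "direct"
PROXY = "proxy"


def build_steering_map(measurements, min_samples=3):
    """Decide one path per recursive resolver from paired probe measurements.

    measurements is an iterable of (resolver_ip, direct_ms, proxied_ms).
    Returns {resolver_ip: DIRECT or PROXY}.
    """
    direct_by_resolver = defaultdict(list)
    proxied_by_resolver = defaultdict(list)
    for resolver_ip, direct_ms, proxied_ms in measurements:
        direct_by_resolver[resolver_ip].append(direct_ms)
        proxied_by_resolver[resolver_ip].append(proxied_ms)

    steering = {}
    for resolver_ip, direct in direct_by_resolver.items():
        proxied = proxied_by_resolver[resolver_ip]
        if len(direct) < min_samples:
            # Assumption: with too little data, keep the existing direct path.
            steering[resolver_ip] = DIRECT
        elif median(proxied) < median(direct):
            steering[resolver_ip] = PROXY
        else:
            steering[resolver_ip] = DIRECT
    return steering
```

The authoritative DNS server would then answer with the Anycast address for resolvers mapped to `PROXY` and with the AWS address for the rest.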

What we've done is we've built this control loop where we are looking at the data collected from the users, latency measurements that are then sent to the cloud, going through the data pipeline, where the results are aggregated by resolver. Then the map is loaded onto our DNS infrastructure, which is then used to make the decision for production traffic. This is the idea. Of course, we had to measure whether this would work. That's where we use probing again. This time, we're using a different recipe where we compare our control users going to the cloud directly with the users following the decision made by our DNS steering. Then we're comparing the results from all users, and looking at them in the same visualization that we used before. That's the case where we see the picture that we want to see. All of the users are to the right side of the red dotted line, meaning that we either significantly accelerate the network performance, or keep the performance the same as it was. That's the reason to celebrate.

At this point, we've found a solution that works. It's based on DNS. It doesn't require any complicated changes on the device or server side, and produces a faster network experience for a majority of our users without a compromise. We've been able to test and evolve the system using quick prototyping and analysis based on data from our users. As we've been doing that, we didn't have to change anything in production, minimizing negative user impact.

Productize

To some of you, it may seem like the hard part is done. In reality, things were just getting started, as we had to roll out the system in production. What we had to do is to change the routing of millions of requests per second carrying data critical for the Netflix service. That data is coming from more than 200 million users across thousands of different devices. These requests would be dynamically routed over tens of thousands of CDN edge locations on the way to the cloud infrastructure. While that is being deployed, hundreds of device and server engineers keep evolving the Netflix services with dozens of changes per day. If that's not enough, the faster than light space force consists of only three team members who would have to build, deploy, and maintain that system. In order to be successful, we had to focus on a few things. First, we had to test well. Second, we had to embrace failure as a core component of our architecture. When operating over a distributed edge infrastructure, it's better to expect that something is going to break. We focused on minimizing the scope of each failure, and when it happened, failing gracefully.

In a nutshell, integration of our request acceleration system was quite simple. Make devices use a different DNS hostname that would follow an intelligent routing decision. In addition to that, we asked for a fallback mechanism where clients would monitor connectivity errors over the accelerated path, and at the signs of consistent failures, switch to the default path. That system allowed us to move faster, to be more risk tolerant when making changes, while still protecting us if we missed any edge case condition during our synthetic testing. Yes, despite all of our efforts, we did discover a few edge cases that we missed. Of course, we didn't blindly rely on the fallback only, we still followed all of the best deployment practices. Making small changes. Testing them with the probing system first. Then proceeding with an A/B test or canary to catch any edge conditions. Then, going into a progressive rollout to catch any potential capacity concerns.
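A client-side fallback of this kind can be very small. The sketch below is a guess at the general shape rather than Netflix's client code: "consistent failures" is interpreted as a configurable number of consecutive connectivity errors, the hostnames are placeholders, and recovery back to the accelerated path is deliberately left out.

```python
class FallbackSelector:
    """Pick the accelerated hostname until it fails consistently, then fall back."""

    def __init__(self, accelerated_host, default_host, failure_threshold=3):
        self.accelerated_host = accelerated_host
        self.default_host = default_host
        self.failure_threshold = failure_threshold  # assumed policy, not from the talk
        self._consecutive_failures = 0
        self._fallen_back = False

    def current_host(self):
        """Hostname the client should use for its next request."""
        return self.default_host if self._fallen_back else self.accelerated_host

    def record_result(self, success):
        """Report the outcome of a request sent over the accelerated path."""
        if self._fallen_back:
            return
        if success:
            # A single success resets the streak: only consistent failures count.
            self._consecutive_failures = 0
            return
        self._consecutive_failures += 1
        if self._consecutive_failures >= self.failure_threshold:
            self._fallen_back = True
```

Keeping the logic this simple matters when it has to be implemented separately on every client platform.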

Operate - Minimize Operational Overhead

Once we rolled out this faster than light acceleration system, our job was still not over. We still had to operate it. When you operate on a distributed infrastructure having thousands of points of presence, depending on the global internet network, issues are going to come. We didn't want to spend most of our time troubleshooting these issues, especially the ones that can be avoided automatically. We realized that in addition to using user based probes as a network performance signal, we can also use them to route around outages. For that, we would still be relying on the same client based probe configuration that we used to observe network performance over different network paths to the cloud. Instead of looking at network latency, this time we focused on reachability. Having the visibility into user connectivity over various network paths allows us to detect the type of network failure, and then distinguish the cases when only one network path is affected, meaning that it's still possible for a user to reach their destination. Like this example, when an ISP loses connectivity to an AWS region only, or another example of a failure of our backbone link. In these cases, the appropriate response is to change the user path. Have them follow an alternate route that is still reachable. In other cases, like a failure of the last mile link, there is no routable path for the user. There is literally not much our team can do to help.

As we didn't want to be overwhelmed with individual failures of a widely distributed network, we integrated the reachability signal into our request routing. For that, in addition to considering the performance of network requests, we would also look at errors and automatically reroute users over a path that's still reachable, if there is one. The response time of our request steering pipeline is 5 to 15 minutes, which could still be a problem. That's where our client fallback process comes in handy again, failing gracefully and maintaining user connectivity. It may only result in a slightly slower experience for a few minutes, while significantly reducing operational overhead for our team.
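Combining reachability with latency in the steering decision might look like the sketch below. The error-rate threshold, the input shape, and the function name are assumptions for illustration; the talk only says that errors are considered alongside performance and that users are rerouted over a path that is still reachable, if one exists.

```python
def choose_path(path_stats, max_error_rate=0.5):
    """Pick the fastest path among those that are still reachable.

    path_stats maps a path name to {"error_rate": float, "median_ms": float},
    aggregated from probe results for one group of users.
    Returns the chosen path name, or None when no path is reachable
    (for example a last mile failure, where rerouting cannot help).
    """
    reachable = {
        path: stats
        for path, stats in path_stats.items()
        if stats["error_rate"] <= max_error_rate
    }
    if not reachable:
        return None
    return min(reachable, key=lambda path: reachable[path]["median_ms"])
```

Returning `None` rather than guessing makes the "nothing we can do" case explicit, so the existing answer can simply be left in place.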

Recap: Device Cloud API Acceleration at Netflix

To travel faster than light, we rely on hundreds of thousands of synthetic measurements performed on user devices. That data is then aggregated in the cloud data pipeline, where we determine the most optimal network path across 200,000 segments at DNS resolver granularity. The output of that model is then used by DNS to route millions of production requests per second. HTTPS requests on new connections are accelerated by 25%, and on existing connections by 10%, at the median. The system is built, deployed, and maintained by a small team of three engineers. It required a minimal amount of changes on the device and server side to deploy. Operationally, it's mostly hands off.

Lessons Learned

Some of you may be thinking, optimizations like that are great for companies like Netflix. What if I don't have a CDN, or a private backbone? What if I don't run my own DNS service? What can I learn from that? My answer to you is that you can still learn quite a bit. The majority of what I've talked about, the process, the workflow, the mindset, can still easily be applied in your domain, at most with some minor modifications. Even if you don't run your own infrastructure, you have plenty of choices. There are many cloud vendors, hosting providers, or CDN edge networks. My point is, as you think of evolving your data delivery infrastructure or your network architecture, question your intuition and question marketing claims. Make your decisions based on data collected from your users, and on metrics that are relevant for your application. It doesn't limit you to only network routing. As a matter of fact, the same team of three engineers that was working on the DNS based steering solution tested and evolved dozens of other network improvements. Starting from testing new application protocols like HTTP/2, to playing with different transport protocol options, or rebalancing how we send traffic to Amazon regions, or migrating to a different DNS provider.

The key to the team efficiency is a short loop that allows us to measure different ideas with the data collected from devices without touching production. For the DNS steering system that I've described, it took us less than six months to iterate on different prototypes and come up with a final solution that we knew is going to work. It took less than six months, including building the measurement system. Then it took us over two years to productize it, because productization takes time and it's also risky.

Embrace Failure

As you productize your ideas, think about failures first. Because when you deal with networks, and especially operate over a distributed edge infrastructure, something is going to break. You probably don't want to be in the middle of it. For our team, despite managing millions of requests per second of critical Netflix traffic, running dozens of experiments, and changing the network configuration, the team receives fewer than one page per week. The secret to that is embracing failure as part of our architecture, having the fallback, failing gracefully.

Summary

I encourage you to invest in your networking workflow. Follow a data driven process to evolve your data delivery architecture. Have an operational mindset. Get ahead of failures by automating the response as much as possible. Keep learning every step of the way. I believe that following this workflow will make your network requests fly through the internet faster than you could ever imagine, make your day-to-day operations free from fire drills, and leave your customers delighted with a smooth user experience, free from network delays.

I hope that you found the learnings and experiences at Netflix useful. If you want to learn more or share your story, you can always find me on Twitter or LinkedIn.

Questions and Answers

Bryant: Always love learning more about networking. I really like the callout to experimentation. That is key to success in many areas of life.

I know there were only three of you, but you mentioned it took over two years to fully productize this. What were the main challenges that made it take this long?

Fedorov: I think the main difference is that while we were prototyping and proving that the concept would work, it primarily concerned only those three folks that were in the picture. Meaning that we could iterate very quickly, with fewer dependencies, and we could be really focused on the core aspects of it. When you go into production, like I've shown in some of the stats, we have millions of requests per second, and the CDN that we run on is built for videos, which is a lot of traffic. That's very critical. First, in many ways, you have to explain and make sure that all the risks are understood, so just the communication overhead is quite substantial. In our case, we have at least four different client platforms. The teams that write applications for iOS, for Android, for web, for TVs, those are completely different teams. They have completely different code bases. Every integration point, even though we had a relatively lightweight integration, is already quite a bit of work. Also, understanding what needs to be done. There is a lot of stuff that can be lost in translation.

Then, as we are actually approaching the routing changes, you have to think about stuff like provisioning and capacity. We are leveraging the backbone. Historically, we've built that backbone to pre-position the video content on the CDN edge. The pre-positioning is mostly done during off peak. It's a relatively small amount of data, even though it is videos, but it's actually not that much in total, because we're pretty effective at placing that. Here, even though we are accelerating API requests, which are relatively tiny in size, in aggregate that's actually a lot of traffic. We have to augment the backbone. When you augment the network, every step takes months, because you have to preorder. You have to install it. Before that, you have to do the modeling and properly understand which links have to be augmented. Those are just a few examples. Then, there are some of the issues about load balancers and properly identifying all of the ingress points, like all the ways traffic gets into the cloud infrastructure. There are a few talks about Zuul, our edge gateway. There are quite a lot of nuances there that we had to go through.

Right now I've optimized the presentation for 25 minutes. You need to understand quite a bit more to properly do that. Lots of communication, lots of understanding, lots of learning. Yes, it does take time. What really helped us is that we had the data. We could just show: this is what we have in terms of improvements from real users, and that's what drives attention. Without that, I'm not sure how long it would take, because at Netflix we run pretty lean. Teams always have to juggle multiple things they have to work on. In our case, we had a project and an idea that was really promising from the data viewpoint. That also helps. Even though two years might sound like a lot, it might not be that long for the scale of it, and for how many things we changed.

Bryant: I agree, Sergey, and the high risk situation. I totally get that as well.

Do end users opt into this? Are they like beta testers, or do you just use production traffic?

Fedorov: We use production traffic. The first step that everyone should take before they add extra measurements is to make sure that it doesn't harm users. In that case, the first thing that we did, when we had an implementation of the probe, was a very small scale test, where we tested for any quality of experience impact. We validated when and how many requests we can send without impacting the smoothness of the UI or the video streaming. With that, we had specific limits on how many requests, how many tests we can run at any point in time.

Bryant: I did the same analysis in the past in terms of performance. Nevertheless, in some countries, sometimes the IPs are being reused across multiple regions in the same country, which actually can impact the proxy, non-proxy approach. How do you solve this problem?

Fedorov: Generally, there are two things that are core to our solution. First, that's the part where TCP Anycast actually excels, because you are giving the user a single IP address. Even though it appears like a single location, multiple users behind the same resolver can actually go to different servers based on the local connectivity. That's one nice part of Anycast. The second part is that we are not limited to our proxy solution. Some of the negative aspects that we've shown, where we are seeing the degradation of experience, might have come from those situations. There are multiple reasons why it wasn't always faster. At our scale, we physically don't have the ability to investigate every single case. It just wouldn't scale. We may look at some of the patterns. In that case, our solution was purely data driven. We look at the latency at the finest granularity that we can use. We measure if it works, and if it gets an acceleration, we go and do that. In general, if for some reason this really messes up the proxy solution, we would just steer those users directly to the cloud without leveraging the CDN edge.

Bryant: Can you elaborate on what your DNS mapping looks like? How you compute it, and how is it loaded?

Fedorov: In a nutshell, the DNS map is a map from the resolver IP, so just the simple IP address that the authoritative DNS server sees, to the path. To share a number, we see about 200,000 or so recursive resolvers that hit our infrastructure. We map each of them to its decision, that is, to the IP address that we need to give. It's either a TCP Anycast address, or it's a CNAME or IP address of the AWS server. That's what we load onto the authoritative DNS. It's a key-value pair: resolver IP address to destination. As for how it's computed, we basically run the loop: the latency measurements run on the devices continuously, all the time, feeding measurements from all over the world into the cloud. Then we have the data pipeline that, in real time, aggregates all of this data by resolver and decides, for each resolver, which path we should take. Then it goes and loads into our CDN. It's not instant, obviously. We have currently reduced the latency of this whole loop to about 5 to 10 minutes. We have a few ideas how we could make it faster but, in general, we're also trying to make data driven decisions. We are trying to see how many situations we are not capturing with that. The lower the latency you aim for, the more complicated and fragile the system becomes.

That's where it's beautiful that we have this fallback on the client side, so that if for some reason something happens within those 5 to 10 minutes before we have the chance to update the routing, clients would gracefully fall back.

Bryant: Could you identify if users are using a wireless network like LTE or 5G? Is there any way to improve user experience if the bottleneck is a wireless network?

Fedorov: I think there are probably two questions here. First, if we are purely talking about wireless networks, then generally, we can identify that. Quite often they would use different resolver IPs, so with the resolver infrastructure we can automatically determine that. In our case, we don't have a special treatment because it's purely latency based. Our edge proxying is more likely to help for wireless networks, so the system is more likely to prefer the edge path when it is actually faster.

I think the second part of the question is the in-home wireless setup. The link between the device and the router, like conventional WiFi and stuff like that, or the microwave situation when someone turns on the microwave and then we have an instantaneous blip. In our approach, we intentionally don't look for those situations because those instantaneous in-home situations, which are very unique to the time and the users, will not benefit from our solution. What we are looking for with our approach is the long-term connectivity patterns on the internet. If we need to address any instantaneous issues, we would likely have to go with a client based solution. Right now we are playing with some ideas on how we can do smarter things on the client side to choose different paths. Generally, it doesn't scale as nicely as our solution because then each client platform would have to do its own thing.

Bryant: It's that constant tradeoff between client and server.

Changing tack a little bit. Do you see in production that clients using HTTP/2 have clear advantages over HTTP/1.1 in terms of latency?

Fedorov: That's one of the other things that we've tested. It's orthogonal to the routing question, because with HTTP/2 we're changing the protocol, we're not changing the route. That's what we use the probing and our client-side measurements for. We did see improvements with HTTP/2. It's not clearly one side or the other. There are some cases where HTTP/2 helps more: when we have multiplexing of requests, when we have longer distances, all of that stuff. We also learned that there are some downsides to HTTP/2. It can be a little bit more processing intensive. If you're running on old hardware or an old device, it may ultimately result in a worse experience, because it can be a little bit more memory intensive and a little bit more compute intensive, both on the client and on the server side.

We did see network improvements with HTTP/2. It depends somewhat on the pattern of interactions. If you only send one request, you'll see far less improvement with HTTP/2 than with parallel requests going from the device to the server. You also have to keep the resource constraints in mind in some situations.
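A deliberately simplified model can show why a single request gains little from multiplexing while parallel requests gain more. It assumes one connection, ignores bandwidth, handshakes, and server time, and compares strictly sequential requests against fully overlapped ones. The function names and numbers are illustrative, not measurements from the talk.

```python
def sequential_time_ms(n_requests, rtt_ms):
    """One connection without multiplexing: each request waits for the previous."""
    return n_requests * rtt_ms


def multiplexed_time_ms(n_requests, rtt_ms):
    """One connection with multiplexing: all requests are in flight together."""
    return rtt_ms if n_requests > 0 else 0


def saving_ms(n_requests, rtt_ms):
    """Time saved by multiplexing under this toy model."""
    return sequential_time_ms(n_requests, rtt_ms) - multiplexed_time_ms(n_requests, rtt_ms)


single = saving_ms(1, 100)    # one request on a 100 ms round trip
parallel = saving_ms(6, 100)  # six requests on the same round trip
```

Under these assumptions a single request saves nothing, while six requests save five round trips. Real clients often open several HTTP/1.1 connections in parallel, so the actual gap is smaller than this model suggests.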

Bryant: That is something we all forget, not being at Netflix: you can be running on a mobile device, or on some really old television that's got Netflix built in.

Fedorov: Welcome to the world of 10-year old smart TVs with all the players.

Bryant: I can only imagine the challenges.

What was the ultimate problem that motivated you to carry out the project? Was it an idea that came up or was there a real problem in production that came through Netflix teams?

Fedorov: The origin of that is, we want to make sure that all of the UI transitions are smooth. We want to minimize any wait time for the users caused by the network. There are obviously multiple ways we can improve that, like device-side prefetching and all of those things. Ultimately, we do have a global customer base, and we have a relatively localized cloud infrastructure. It would be good running in more data centers. Ultimately, we have thousands of edge locations, and we could maybe have thousands of cloud regions. Still, you can't beat the speed of light.

Generally, we've had a strong intuition, and later we've had the hard data, showing that interactions with the cloud are on the critical path. Basically, they are in the face of users, and minimizing the data transfer time would benefit that. In many ways, we had this infrastructure sitting there. It was a logical step to leverage the very well-distributed edge ecosystem to do network optimizations. Sadly, we still have to go to the cloud. Because of heavy personalization, we cannot do HTTP caching and things like that. Even so, as I've shared, there are some wins and some promise that we've seen on data transfers. Those are medians. In some of the edge cases, such as specific regions or specific lossy networks, the improvements are 50% to 60%, especially if you are establishing a connection.

Recorded at:

Oct 02, 2021

Sergey Fedorov
