
Evolution of Edge @Netflix


Summary

Vasily Vlasov reviews Netflix's edge gateway ecosystem - multiple traffic gateways performing different functions deployed around the world. He touches upon the motivation behind such a topology and highlights the challenges it introduces. He shows where the value is added, what its operational footprint is, and what happens when things go wrong.

Bio

Vasily Vlasov works at Netflix, where he leads the Cloud Gateway team. The team provides API gateway and push notification services for Netflix's customers and employees. Prior to Netflix, he designed and built gateways and software load balancers for iCloud and iTunes at Apple.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Vlasov: Imagine for a second that you and I start a new company. We've got an idea, we've got some funds, and we've gathered a talented team of engineers to work on our idea. It's a great place to be. We will work hard for sure, but success is not guaranteed. What is guaranteed is that over time we will evolve - our product, our customers, our infrastructure and our Edge. Today I want to talk about the evolution of Edge, and I want to use Netflix as a case study for that. Granted, Netflix is not a startup, but it went through some changes, and with those changes, its infrastructure and business evolved.

My name is Vasily [Vlasov] and for the last seven years I have worked on Edge services. I started at Apple, where I built Edge services for iCloud and iTunes. In 2018, I joined Netflix to work on the API platform and later on push notifications and API gateways. Yesterday I learned that if a speaker tweets, the speaker is more credible, so last night I wrote my first tweet.

Full disclosure. Since we are going to cover a lot, I'm not going to go deep. It's going to be an overview session.

What Is Edge

When I was preparing for the talk, I went online to the InfoQ website and checked what people were interested in, and one of the questions that was very popular was: what is Edge? For the purpose of this talk, let's agree that Edge is something that is close to the client, but it's not a category. We will not say, "Something is Edge and something is not." Let's treat it as a quality that is more or less pronounced for a certain concern. For example, take DNS and take a database. Both start with D, but one is Edgier than the other. DNS is Edgier than a database, unless you expose the database to the client, which is probably a bad idea.

Time to Market Is Key

Let's go back to our business. We have money, but we don't have a lot - usually that's the case. The key thing we optimize for is time to market. We need to move fast. How do we do that? Basically, we do not overcomplicate things. We introduce sane practices and we rely on the good judgment of the people we started the company with. What does that mean? Let's start simple; let's start with a three-tier architecture. A three-tier architecture means that we have a client, we have an API, and we have a database that this API talks to. All the concerns go into this API application.

The beauty of this approach is that we don't really need to spend time on building standards or tooling. Whatever we put in the codebase of this application is the standard. Our Edge right now is our API application. It's the only app, and all we have is Edge. We also have a load balancer, which usually terminates encrypted traffic and sends plain-text traffic to the application itself, and our DNS is usually very simple. We just need our clients to be able to find us.

If I flip slides, we will see how Netflix's architecture looked in the early days, so please try to find some differences. The difference is that the name of the application is NCCP and not API. NCCP stands for Netflix Content Control Protocol; it was the only application that was exposed to the client, and it powered all of the experience. There was a hardware load balancer terminating TLS in front, and there was one domain name, a simple record. That was good enough to start the business, and I hope it's good enough for us to start our business.

Summarizing: a three-tier architecture with a monolithic application is fine. Our Edge concerns are very simple: our Edge is the API, DNS, and the load balancer. We don't need more; we are good. If we are successful enough, we get our customers, and then it's time to grow.

Scale and Engineering Velocity

With the growth of the customer base and the growth of engineering, what happens is we add features. With added features, we add engineers. We add features and engineering - money flows. Beautiful. At this point in time, we want to preserve engineering velocity. We are still relatively small and we don't want to step on each other's toes. How does this manifest in the changes to our ecosystem?

Whenever we see something big that a lot of people work on, we have a tendency to split it apart, and lately we have been splitting everything into microservices, so let's introduce some microservices. We took some concerns from the API application and put them into separate apps. They probably have separate databases, they probably have dedicated teams to work on them, so it works well, and our clients didn't notice that we did that. That's because the API is our Edge and it is a level of indirection for all of our clients.

Over time, as we build more and more separate applications, we still build the logic to orchestrate them within API. A request coming from a client hits API, and then API needs to execute some code to call the underlying microservices. Over time, the amount of this code grows, and while we can say, "Thank you, API team, for making everyone's life easier," we know that it will become a monolith very soon. In a talk a couple of years ago, Josh Evans referred to this problem as the return of the monolith. The monolith never went away, it just grew slower.

Over time, API gets bigger and bigger. What do we do with bigger things? We try to split them. Let's split API, but unlike the services that we introduced on the previous slide, API is an Edge service and it's a contract with our clients. What do we do? We need to change the clients. There are at least a couple of ways to do that. On the client side, there is an orchestration that needs to happen. If we split the Edge application, one way to split it is to introduce an additional domain name and say, "Everything for this concern goes to this domain name; everything for that concern goes to the other one. We are good." The only challenge comes when the client that does this needs to change.

A good practice is to introduce another service that tells the client where to go for each concern. In the case of Netflix, it could be: playback goes to this service, discovery goes to that service. An alternative way would be to introduce an API gateway and route traffic transparently. Transparently routing traffic is essential for splitting our monoliths - our Edge monoliths - further.
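
As a rough illustration of the client-side option, here is a minimal sketch (the concern names, domains, and the idea of fetching them from an init endpoint are made up for this example) of a client that keeps a concern-to-domain map and routes its own calls:

```java
// Hypothetical sketch of client-side orchestration: at startup the client learns
// which host to use for each concern, then routes its own calls accordingly.
import java.net.URI;
import java.util.Map;

public class ClientSideRouting {

    // In reality this map would be fetched from a discovery/init endpoint at startup
    // and cached on the device; it is hard-coded here for illustration.
    static final Map<String, URI> ENDPOINTS = Map.of(
            "playback",  URI.create("https://playback.example.com"),
            "discovery", URI.create("https://discovery.example.com"));

    static URI resolve(String concern, String path) {
        return ENDPOINTS.get(concern).resolve(path);
    }

    public static void main(String[] args) {
        // The client decides which Edge domain to call for each concern.
        System.out.println(resolve("playback", "/play/start"));
        System.out.println(resolve("discovery", "/browse/home"));
    }
}
```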

Let's do a small quiz. Who thinks Netflix did client-side orchestration? A few of you. Who thinks that Netflix did API gateway orchestration? The majority of people. Interestingly enough, you are all right. The beauty of these approaches is that they are not mutually exclusive. Initially, Netflix did the split using client-side orchestration. NCCP, the Netflix Content Control Protocol, was split into two: NCCP stayed there for the playback experience, while API started to handle the discovery experience. There were multiple separate domain names and there were multiple load balancers. Over time, as the ecosystem evolved, Netflix introduced an API gateway to the picture, as there was a need to split a lot more functionality on the Edge. That paid off in the sense that over time more and more services were added. On this slide, you can see there are Node.js servers that were added. We call them backends for frontends. They allow UI engineers to run their own services that can call other services on the Edge, get data from them, and form a response in the format devices expect.

The introduction of the API gateway was a big step forward. What does it help with? First of all, an API gateway reduces coupling between the client and the ecosystem services. How does it do that? By providing two contracts and bridging them together: there is a contract with the client and a contract with the services. Let's take a look at an example; authentication is a perfect example for an API gateway.

At Netflix, there are two different use cases. One use case is streaming clients - when you open the Netflix application, this is called streaming - and there is also the content engineering use case, which is the enterprise part of the business. Satyajit [Thadeshwar], a colleague of mine, gave a talk yesterday on how authentication works in the streaming world, so I will focus on the content engineering, or enterprise, world. In the enterprise world, we have several types of authentication. To name a few, there are OAuth and mutual TLS. You can imagine what it would be like to implement authentication on every Edge service. Not only that; if you implement it on every Edge service, sometimes you find a vulnerability, and imagine what it takes to rotate, let's say, 60 services and deploy a security patch.

The API gateway - we call it Zuul because we have this open source technology called Zuul, so we use these terms interchangeably - is the one that handles the authentication flow. In the case of OAuth, it can be: the user comes in, gets redirected, authenticates, and gets back to the service. Then, once the user is authenticated, the request from the user goes to the underlying backend service. The backend service doesn't need to know how the user authenticated, doesn't need to know about the flow, but it needs to know about the identity. The identity is passed as a header. We craft an end-to-end identity, that's what we call it. It's a token that is signed, and the signature can be verified at any layer, and we pass this identity token with the request. The service that receives it can forward the request, and forward the identity with it, even further. The whole invocation chain knows on behalf of whom the call is executed without worrying about the nitty-gritty details of authentication.
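
As a minimal sketch of the end-to-end identity idea - not Netflix's actual token format, just an HMAC-signed header for illustration - the gateway mints the identity once and any downstream service can verify and forward it:

```java
// Illustrative only: gateway signs the identity after authentication succeeds;
// downstream services verify the signature and pass the header along unchanged.
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class EndToEndIdentity {

    private static final byte[] SHARED_KEY =
            "demo-key-not-for-production".getBytes(StandardCharsets.UTF_8);

    // Gateway side: after authentication (OAuth, mutual TLS, ...) succeeds,
    // craft a signed identity value to pass downstream as a header.
    static String mint(String userId) throws Exception {
        String payload = "user=" + userId;
        return payload + ";sig=" + sign(payload);
    }

    // Any downstream service: verify the signature without knowing how the user
    // authenticated, then keep forwarding the same header to further services.
    static boolean verify(String header) throws Exception {
        int i = header.lastIndexOf(";sig=");
        if (i < 0) return false;
        return sign(header.substring(0, i)).equals(header.substring(i + 5));
    }

    private static String sign(String payload) throws Exception {
        Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(new SecretKeySpec(SHARED_KEY, "HmacSHA256"));
        return Base64.getUrlEncoder().withoutPadding()
                .encodeToString(mac.doFinal(payload.getBytes(StandardCharsets.UTF_8)));
    }

    public static void main(String[] args) throws Exception {
        String header = mint("user-123");      // set by the gateway
        System.out.println(header);
        System.out.println(verify(header));    // checked by any service in the chain
    }
}
```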

Another feature that helps a lot is routing. Remember, we want to decouple clients from knowing the underlying structure and shape of our Edge. That's why we send everything to the API gateway, Zuul, and let it figure things out. Imagine at some point an engineer on the API team knows that API is huge and handling too much traffic. It needs to be sharded, so we create a new API cluster and decide we will only send traffic that is for TV devices there. The dotted line here is a potential route for the traffic.

That's not all; that's just the Edge layer. What about mid-tier services? API needs to talk to AB. AB is a service that tells you whether a customer is part of any AB test. Imagine I am an engineer on the AB team and I want to test my bleeding-edge changes. I don't want to merge them to master yet, but I really want to see how the end-to-end flow works. Maybe that's not the ideal way to test, but I really want this. What I do is create my own cluster, call it with my name, and I need to route some traffic there. It's probably not a good idea to route customers' traffic to the bleeding-edge application that I just built on my own, so I go to Zuul and configure a routing rule saying that this particular customer, who is myself, should be routed to this bleeding edge. When the request comes from the TV that I own with my account, Zuul checks which rules have to be applied. It applies the rule to route to the API TV cluster, but it also checks that I am the specific user that is whitelisted to go to this bleeding-edge application, and it adds a header that says I should be sent to this bleeding edge instead of the standard cluster. It's called a VIP override rule. Then my traffic will be routed to this bleeding edge - perfect.
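
A simplified sketch of how such rules might be evaluated (the rule shapes, cluster names, and header are illustrative, not Zuul's real API): a broad device-based rule picks the Edge cluster, and a per-customer override adds a header telling downstream routing to use a different cluster for one service.

```java
// Illustrative rule evaluation: device-based cluster selection plus a per-customer
// override that travels downstream as a hypothetical header.
import java.util.Map;
import java.util.Optional;

public class RoutingRules {

    record Request(String deviceType, String customerId, Map<String, String> headers) {}

    // Broad rule: all TV traffic goes to the dedicated API cluster.
    static String chooseCluster(Request req) {
        return "tv".equals(req.deviceType()) ? "api-tv" : "api-global";
    }

    // Narrow rule: only the whitelisted customer has calls to the AB service
    // redirected to the bleeding-edge cluster.
    static Optional<String> abOverride(Request req) {
        return "my-test-account".equals(req.customerId())
                ? Optional.of("ab-bleeding-edge")
                : Optional.empty();
    }

    public static void main(String[] args) {
        Request fromMyTv = new Request("tv", "my-test-account", new java.util.HashMap<>());
        System.out.println(chooseCluster(fromMyTv));                          // api-tv
        abOverride(fromMyTv).ifPresent(v -> fromMyTv.headers().put("x-ab-override", v));
        System.out.println(fromMyTv.headers());                               // {x-ab-override=ab-bleeding-edge}

        Request fromLaptop = new Request("web", "someone-else", new java.util.HashMap<>());
        System.out.println(chooseCluster(fromLaptop));                        // api-global
        System.out.println(abOverride(fromLaptop).isPresent());               // false
    }
}
```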

Now, imagine another situation. Someone else opens their laptop. The rule to route to the TV UI cluster is not triggered, and we go to API global. Since this is not the customer that is whitelisted, the traffic is routed to the standard AB cluster. This feature is very important because it allows us to decouple routing and sharding, and it enables everyone at Netflix to manage their own traffic within their data center and at the Edge.

The question is, what can go wrong? Routing is configuration. Raise your hand if you ever had to deal with an incident where the wrong config was pushed to production. Ok, the majority of you - so you can relate. This is exactly what happens, and it has happened several times. The config is pushed to production; this is a config. The route is applied globally, and you can send 100% of the traffic to this AB bleeding edge or somewhere else. The problem is that people usually don't know how much traffic they affect. Instead of closing down this functionality and dedicating an operator who would sit and configure routes for everyone all day, we decided to educate people. Instead of applying rules right away, we run a job that estimates how much traffic will be affected by the rule being applied. Then there is a popup that says, "You're going to send 100% of your traffic to your local machine. Do you want to do that?" You can still do that - freedom and responsibility are in the Netflix culture - but at least you do it consciously, and after that we will talk.
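
A sketch of that safeguard, assuming a recent traffic sample is available to replay the proposed rule against (the threshold and data shapes here are placeholders):

```java
// Illustrative "estimate before you apply" check: replay a recent traffic sample
// against the proposed rule and warn if the matched share is large.
import java.util.List;
import java.util.function.Predicate;

public class RoutePreflight {

    record Sampled(String deviceType, String customerId) {}

    static double estimateAffectedShare(List<Sampled> recentSample, Predicate<Sampled> rule) {
        if (recentSample.isEmpty()) return 0.0;
        long matched = recentSample.stream().filter(rule).count();
        return (double) matched / recentSample.size();
    }

    public static void main(String[] args) {
        List<Sampled> sample = List.of(
                new Sampled("tv", "a"), new Sampled("tv", "b"),
                new Sampled("web", "c"), new Sampled("ios", "d"));

        // Proposed rule: route ALL TV traffic somewhere new.
        double share = estimateAffectedShare(sample, r -> "tv".equals(r.deviceType()));

        if (share > 0.01) {
            // In the real workflow this surfaces as a confirmation dialog before the
            // config is pushed; you can still proceed, but you do it consciously.
            System.out.printf("Warning: this rule would affect %.0f%% of traffic. Apply anyway?%n",
                    share * 100);
        }
    }
}
```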

The second bold statement is that the API gateway provides insights and client-perceived resiliency. How does it do this? The API gateway is centrally located; all the traffic goes through it. It's a chokepoint. Whatever concern you apply there is applied to all of your traffic - it's a perfect place. What do we do? We report metrics. When we report metrics from one place, they are always consistent. For all the backends, for all the domains, for all the clients, we have consistent metrics. All of the other teams at the company can build their tooling on these metrics. We can build dashboards, alerts, etc. You can do so much with that.

We have a system called Atlas. It's an open source real-time dimensional database and if you don't know about that, if you haven't used it, I recommend you check it out.

Metrics are great. Something is happening: we see metrics spike, errors spike. What do we do? We debug. How do we debug? We need to see individual requests. How do we do that? At Netflix, there is a system called Raven. Raven is a UI where you go and create a filter saying, "Whatever request matches this filter, send it to Mantis." Another term - Mantis is an open source technology. It's a platform to build real-time processing applications. It's very powerful; I recommend that you check it out.

Here, for example, I have an outage. I see that errors for iOS devices spiked. What do I do? I create a filter. I say, "Whatever starts with iOS, and if the response code is higher than or equal to 500, sample at 5% and send me those requests, and the responses as well." This is the way for me to debug without paying the price of logging everything and indexing everything.
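
A sketch of that kind of filter - the Raven UI and the Mantis wiring are not shown, only the matching and sampling logic, with illustrative field names:

```java
// Illustrative debug filter: match iOS requests with a 5xx response and capture
// roughly 5% of them for inspection, instead of logging everything.
import java.util.concurrent.ThreadLocalRandom;

public class DebugFilter {

    record Event(String devicePlatform, int statusCode, String requestSummary) {}

    static boolean matches(Event e) {
        return e.devicePlatform().startsWith("ios") && e.statusCode() >= 500;
    }

    static boolean sampled(double rate) {
        return ThreadLocalRandom.current().nextDouble() < rate;
    }

    static void onEvent(Event e) {
        if (matches(e) && sampled(0.05)) {
            // In the real system the full request and response are streamed to Mantis.
            System.out.println("captured: " + e.requestSummary());
        }
    }

    public static void main(String[] args) {
        onEvent(new Event("ios-13", 502, "GET /browse/home"));
        onEvent(new Event("android-10", 503, "GET /play/start")); // not matched
    }
}
```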

Insights - beautiful. Since we already have an integration with Mantis, which is a streaming platform, what can we do? Of course, we can build an anomaly detection mechanism, because all the traffic goes through one single place. We have this uniform picture and we can alert and react to it. We stream all the errors to Mantis, and there is a job that runs - we call it RAJU. It's a service that calculates the acceptable error rate for every single backend, and if we cross the threshold for a long period of time, an alert is sent.
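
A sketch of the thresholding idea, with a fixed acceptable rate and window standing in for what RAJU computes per backend from its own data:

```java
// Illustrative error-rate watch: alert only when the observed error rate stays
// above the acceptable rate for a sustained number of intervals.
import java.util.ArrayDeque;
import java.util.Deque;

public class ErrorRateWatch {

    private final double acceptableRate;   // in reality, learned per backend
    private final int windowSize;          // number of consecutive intervals
    private final Deque<Double> recent = new ArrayDeque<>();

    ErrorRateWatch(double acceptableRate, int windowSize) {
        this.acceptableRate = acceptableRate;
        this.windowSize = windowSize;
    }

    // Feed one interval's observed error rate; returns true when an alert should fire.
    boolean observe(double errorRate) {
        recent.addLast(errorRate);
        if (recent.size() > windowSize) recent.removeFirst();
        return recent.size() == windowSize
                && recent.stream().allMatch(r -> r > acceptableRate);
    }

    public static void main(String[] args) {
        ErrorRateWatch watch = new ErrorRateWatch(0.02, 3);
        System.out.println(watch.observe(0.05));  // false, only one bad interval
        System.out.println(watch.observe(0.06));  // false
        System.out.println(watch.observe(0.07));  // true, sustained breach -> alert
    }
}
```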

Let's quickly talk about resiliency. When you build an API gateway, there are so many features that you can put in, so we decided to put in custom load balancing. We decided to go with a random choice-of-two approach, and that helped a lot to mitigate certain issues during deployments of services that didn't go well. Choice-of-two load balancing is basically: you randomly choose two instances, and then you decide which one to send traffic to based on criteria that you control. We really wanted to control these criteria, and there is an agreement between the API gateway and the backend service where the backend service can send some health information to us, and we can use it if we don't have our own view of that instance.
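
A minimal sketch of choice-of-two balancing, using in-flight request count as the criterion (the real criterion can also incorporate the health information reported by the backend):

```java
// Illustrative "power of two choices": pick two instances at random, send the
// request to the one that currently looks healthier.
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicInteger;

public class ChoiceOfTwoBalancer {

    static final class Instance {
        final String name;
        final AtomicInteger inFlight = new AtomicInteger();
        Instance(String name) { this.name = name; }
    }

    static Instance choose(List<Instance> pool) {
        ThreadLocalRandom rnd = ThreadLocalRandom.current();
        Instance a = pool.get(rnd.nextInt(pool.size()));
        Instance b = pool.get(rnd.nextInt(pool.size()));
        // Prefer the instance with fewer outstanding requests.
        return a.inFlight.get() <= b.inFlight.get() ? a : b;
    }

    public static void main(String[] args) {
        List<Instance> pool = List.of(new Instance("i-1"), new Instance("i-2"), new Instance("i-3"));
        Instance target = choose(pool);
        target.inFlight.incrementAndGet();   // before proxying the request
        System.out.println("routing to " + target.name);
        target.inFlight.decrementAndGet();   // after the response comes back
    }
}
```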

Another thing that we do is retry on behalf of clients, and this is why I said client-perceived resiliency. We don't necessarily improve resiliency, but if we chose an instance that is bad and we sent some traffic and it returned a 500, we can retry on behalf of the client. It's not always possible, but in more cases than not, we do retry on behalf of the client.
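
A sketch of retrying on behalf of the client, simplified to "retry once on a 5xx"; in practice the decision also depends on idempotency and whether the request body can be replayed:

```java
// Illustrative gateway-side retry: if an attempt returns a 5xx, try again
// (each attempt may pick a different instance); the client only sees the final result.
import java.util.function.Supplier;

public class GatewayRetry {

    record Response(int status, String body) {}

    static Response callWithRetry(Supplier<Response> attempt, int maxAttempts) {
        Response last = null;
        for (int i = 0; i < maxAttempts; i++) {
            last = attempt.get();
            if (last.status() < 500) {
                return last;               // success, or a client error we don't retry
            }
        }
        return last;                       // give up and surface the error
    }

    public static void main(String[] args) {
        // First attempt fails with 503, the retry succeeds; the client only sees the 200.
        int[] calls = {0};
        Response r = callWithRetry(
                () -> calls[0]++ == 0 ? new Response(503, "bad instance")
                                      : new Response(200, "ok"), 2);
        System.out.println(r.status() + " " + r.body());
    }
}
```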

Let's summarize stage two. We wanted to optimize for engineering velocity. We started by introducing microservices, and we said thank you to the API team who helped us hide the complexity, but then we started splitting the Edge and introducing additional services. To support that, we introduced additional domains, we introduced an init service so that our application can, before it starts, fetch some information about what to call and then do its work, and we also introduced the concept of an API gateway, with Zuul as one of the implementations. Zuul reduces coupling between clients and services, and it's a leverage point for all the cross-cutting concerns such as authentication, rate limiting, enrichment of requests, etc.

We have a business, a successful one. We scaled our engineering organization. What's next? Next is resiliency and quality of service.

Resiliency and QoS

Before we talk about resiliency and quality of service, I want to put on the screen another bold statement, which is, "Most of the incidents are self-inflicted." If our underlying infrastructure promises a lot of nines, it does mean something, but it doesn't mean a lot, because the only way to prevent outages is to not do anything. We don't want to not do anything. We want to be very agile and deploy very often. How do we do that? Let's look at the Netflix example and what Netflix did.

In 2013 and 2014, Netflix invested a lot of effort into supporting multi-region deployment. Not only did they start deploying in multiple regions, but they also staged deployments and made sure that all the regions are active. What does that mean? If one region goes bust, we can simply reroute the client to a different region and they will get service. For that, not only did active-active data replication have to happen, but a lot of Edge concerns had to evolve as well. Let's talk about an Edgier concern than the ones so far. We talked about the API gateway; now we get to the layer of DNS. Before, we only had multiple DNS records; let's take a look at one and see how Netflix did its failover.

api.netflix.com is a canonical name (CNAME) pointing to api.geo.netflix.com. It is resolved by a system called UltraDNS based on the physical location of the resolver - assume that is the physical location of the client. If you are in the western United States, you will be sent to the US-West AWS region. North America east will be sent to east. South America will be sent to east, and, more or less, the rest of the world, some parts of Asia excluded, will be sent to Europe. That resulted in east being a heavier region, because more traffic goes there, but the introduction of this virtual fourth region also allowed us to split traffic more granularly.

Since I mentioned that US-East was the biggest region, let's try to evacuate it. Let's see what happens when there is an outage and we decide that it's time to evacuate the region. First of all, we change records to point to load balancers in different regions. Our UltraDNS still returns the CNAME that is specific to your region, but the underlying IPs are not the same. The underlying IPs start pointing to the load balancers in the region you are rerouted to. US-East and North America would be routed to US-West, and South America would be routed to Europe. Simultaneously with that, Zuul, our API gateway, will open HTTP connections to regions that are healthy. This is very important because DNS has the property of being cached, and there is this property called TTL on DNS records. We set it at 60 seconds. We assume that within 60 seconds clients will come and refresh their DNS, and they usually do. DNS TTL is a myth, but it's a myth that is widely accepted; many resolvers believe in it, but some do not. That's why you see a little bit of traffic still going to the region which was evacuated. We never saw 100% of the traffic evacuated from a region. We always see a trickle of traffic, and we cannot punish customers or our clients for their resolvers. That's why we stay in the state of cross-region proxying for that traffic.
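
A sketch of the geo-steering table and of what evacuating US-East does to it; the group and region names follow the example above, but the code itself is illustrative:

```java
// Illustrative geo-steering table: each geo group normally maps to a region, and
// during an evacuation the entries pointing at the sick region are rewritten.
import java.util.HashMap;
import java.util.Map;

public class GeoSteering {

    static Map<String, String> normal() {
        Map<String, String> m = new HashMap<>();
        m.put("north-america-west", "us-west");
        m.put("north-america-east", "us-east");
        m.put("south-america",      "us-east");
        m.put("rest-of-world",      "eu-west");
        return m;
    }

    // Evacuate us-east: its geo groups are redistributed to the remaining regions.
    static Map<String, String> evacuateUsEast(Map<String, String> m) {
        Map<String, String> out = new HashMap<>(m);
        out.put("north-america-east", "us-west");
        out.put("south-america",      "eu-west");
        return out;
    }

    public static void main(String[] args) {
        System.out.println(normal());
        System.out.println(evacuateUsEast(normal()));
        // Resolvers that ignore the 60-second TTL keep sending a trickle of traffic to
        // us-east; the gateway there proxies those requests to a healthy region.
    }
}
```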

Let's talk about our stage three. We wanted to focus on resiliency, and while focusing on resiliency, we also improved latency a little bit, because now we have three regions. We can send clients to the region that is closer to them, and potentially they will have a bit better service. Active-active data replication is very complicated, but thanks to the great engineers who did that, it is working.

Edge concerns had to evolve. We needed to get to the level of DNS, or geo-DNS, traffic steering. It's not just one record, it's not multiple records. It's a dynamic system now, and it has to be managed carefully. We built tooling around this DNS management. Why? Simply because we introduced so many domain names. Imagine doing this flip manually - we really need a tool. We implemented cross-region traffic proxying between gateways to help customers who cannot rely on their DNS.

Let's recap really quickly. We started with a three-tier architecture and a monolith, introduced microservices, split our Edge, and worked on our resiliency. What's next? The next concern that bothers us is the speed of light.

Speed of Light

The speed of light is finite, and we didn't find a way to solve that. Therefore, we need to work around it. Distance affects round-trip time: between any two places on Earth, there is latency if we want to transfer information. Let's get back to our example: a customer in South America trying to connect to AWS in the US. There is latency, and let's say the latency is 100 milliseconds round-trip - which is very optimistic, by the way. What happens when you establish a connection to a server? Most of the communication these days - let's wait for the QUIC protocol, or HTTP/3, but these days at least - happens over TCP. Most of the communication happens, thankfully, over a secure protocol called TLS. In order to send bytes to a server, what I need to do is establish a TCP connection. Then, after I establish the TCP connection, I usually need to do a TLS handshake. There are tricks to work around this, but not all clients support them.

This is how the TCP handshake and the SSL/TLS handshake work. If we assume a latency of 100 milliseconds between the client and AWS, we spend 100 milliseconds on the TCP handshake, and we spend 200 milliseconds on the TLS handshake, because the client needs to send a client hello, the server responds with a server hello and certificate, and they finally finish the key exchange - so it takes two round-trips to do TLS. Only then do we send the request. In order for me to send my first request, I need to spend 300 milliseconds, as I said, in the very optimistic case.

There are other challenges with our clients sitting far away from the data centers. For example, we all use wireless networks, and wireless networks are lousy. Whenever I have a connection problem, I usually have a connection problem between me and my ISP. In order to repair a TCP packet loss on my long connection to AWS, which is 100 milliseconds away, I need to pay quite a bit of time. Usually, it's one round-trip to detect the TCP loss and 1.5 round-trips to fix the packet loss.

Another problem is - you've heard this metaphor of the internet and pipes - pipes get congested. The longer the distance between two points, the higher the chance that the pipes will be congested. TCP, TLS, lousy connections, congestion - quite a few problems. How do we solve them?

We trick clients. We put an intermediary between the client and our data center. We refer to it as a PoP, a Point of Presence. Think about a PoP as a proxy that accepts the client's connection, terminates TLS, and then, over a backbone - another concept that we just introduced - sends the request to the primary data center or AWS region. What is a backbone? Think about a backbone as a private internet connection between your Point of Presence and AWS. It's like your private highway: everyone is stuck in traffic and you are driving on your private highway.

How does this change the interaction and the quality of the client's experience if we put a Point of Presence in between? It's a much more complicated diagram, but you can see that the round-trip time between the client and the Point of Presence is lower. This means that the TCP and TLS handshakes happen faster. Therefore, in order to send the first bytes of my request, I need to wait only 90 milliseconds. Compare this to 300 milliseconds.
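
A back-of-the-envelope check of those numbers, assuming one round-trip for the TCP handshake and two for the TLS handshake, with a 100 millisecond round-trip straight to the region versus roughly 30 milliseconds to a nearby PoP (the 30 milliseconds is an assumption consistent with the 90 millisecond figure above):

```java
// Rough arithmetic only: time before the first byte of the request can be sent,
// ignoring TLS 1.3, session resumption and other tricks that cut this down.
public class TimeToFirstRequestByte {

    static int handshakeMs(int roundTripMs) {
        int tcp = 1 * roundTripMs;   // SYN / SYN-ACK / ACK
        int tls = 2 * roundTripMs;   // client hello, server hello + certificate, key exchange
        return tcp + tls;
    }

    public static void main(String[] args) {
        System.out.println("direct to region: " + handshakeMs(100) + " ms");  // 300 ms
        System.out.println("via nearby PoP:   " + handshakeMs(30) + " ms");   // 90 ms
    }
}
```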

When we send a request to the Point of Presence, the Point of Presence already has a pre-established connection with our primary data center. It's already scaled and we're ready to reuse it. It's a good idea to also use protocols that allow you to multiplex requests, so that on the same TCP connection we can send multiple requests. One of them is HTTP/2. Not all clients can speak HTTP/2, but here we control the client within the Point of Presence, and we control the server on the other end - in our case, Zuul. We control the codebase. Zuul does HTTP/2 and the Point of Presence does HTTP/2. We can take HTTP/1 traffic, convert it into HTTP/2, and improve the client's connectivity.

Summarizing stage four - we wanted to improve clients' connectivity by reducing the time spent doing TCP and TLS handshakes, avoiding congestion, and improving recovery from TCP packet loss. For that, we introduced the concept of a Point of Presence. We introduced the concept of a backbone, the private internet connection. It can be built, it can be bought, it can be rented. Some CDN providers these days allow a backbone to be rented. AWS has a service called Global Accelerator; the idea is pretty much the same.

I didn't talk too much about the steering of traffic to the PoP, but there are alternative ways of steering that you can explore other than DNS. Because we control the codebase in the PoP and we control the codebase in the data center, it is possible for us to introduce protocols that we cannot roll out to all the clients. Just to summarize: in roughly 40 minutes we made a journey that some companies only go through in several years. It's quite a success, in my opinion. If there is one takeaway that I want to close with, it would be the statement that a well-designed Edge enables the evolution of the business - so think wisely when you make choices that affect your Edge.

Questions and Answers

Participant 1: You mentioned that PoPs nowadays can be rented. If that's the case, does TLS still terminate at the PoP, and how secure is it to do so?

Vlasov: The question is, if you don't control the PoP and you put it in an ISP location, how do you terminate TLS? In this case, your best bet would be TLS sessions, because you would probably offload TLS termination to something that you control. You do this once, then issue a TLS session ticket to the client, and then you use that ticket. I think that's the best approach.

Participant 2: I think I heard you mention earlier in your slides how Zuul and the ALB were considered interchangeable in a way. In some other slides, I saw the ALB being a node located between the Zuul below it and the api.us-east1.netflix above it. Can you elaborate on the difference between the ALB and Zuul?

Vlasov: The ALB is still in the picture. The ALB is used to terminate TLS, because that's not a concern that Zuul wants to have, and then it routes traffic to a Zuul instance. In some cases, that's not what's happening. In some cases, because we need HTTP/2 support and we want to support ALPN, we terminate TLS on Zuul itself, simply because Amazon, as of now, does not support ALPN. We do this on Zuul. Most of the traffic goes through the ALB.

Participant 3: You were talking earlier about having a routing configuration, and when traffic comes into Zuul, Zuul looks at the routing config to figure out where to send the traffic. Is that only from Zuul to wherever it goes after that, or is that for all of the internal layers that the traffic might go through?

Vlasov: The question is whether we can route only to the next hop after Zuul, or beyond that. There are two different use cases. The first use case is to steer a lot of traffic from one place to another, and that's mainly done on the first hop, but there is also a rule type that we apply at the Zuul layer that is called CRR, Custom Request Routing. If a request matches the criteria, I can override the target not only for the next hop but for any call in the chain that wants to call some other service. We have this concept of a VIP at Netflix. Basically, at the Zuul layer, we will say, "For the AB system, for the AB application, please override the VIP to this new VIP." This rule will be honored down the chain of invocation.

 


 

Recorded at: Dec 17, 2019
