
Scaling Patterns for Netflix's Edge


Summary

Justin Ryan talks about Netflix's scalability issues and some of the ways they addressed them. He shares successes they've had from unintuitively partitioning computation into multiple services to get better runtime characteristics. He introduces useful probabilistic data structures, innovative bi-directional data passing, and open-source projects from Netflix that make this all possible.

Bio

Justin Ryan works in Playback Edge Engineering at Netflix. He works on some of the most critical services at Netflix, specifically focusing on user and device authentication. Years of building developer tools has also given him a healthy set of opinions on developer productivity.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Ryan: My talks can be a little less dense on the slides here. We're going to bring it down a little. I guess we're going to talk about laundry. Everyone knows how to do laundry, we all know how it works. We get a big pile of clothes, they're all over the place, and you have to start folding them, and you have different clothes. I've got a family of four, so I have my wife's clothes, her pants. I've got kids' socks all over the place and I need a way of folding it. I have a certain way of folding it, and then my spouse has a certain way of folding it. The way she goes about it is she folds, which is great - it's great that she's folding laundry, and I don't have to do it - but what she's going to do is stack it all up in a big pile. Every time she gets a piece of laundry and folds it, she just puts it onto the pile, and on goes the next one, and you end up with a pile of some kid pants and my shirt, and to put it all away, you have to walk all over the house. Admittedly, I do appreciate the exercise, can't argue with that, but parents, we're busy - we don't have time for that. I have a slightly different way of doing it. I take over the whole living room - I'm talking the couch, the coffee table, the ottoman. I make separate piles for each kind of clothing - kids' pants, kid number one's pants, kid number two's pants, something like that. I take over the whole living room, then I can go around the house putting my pants away, my shirts away, my kids' pants away, kids' shirts away, and it goes really fast.

What we're doing here is making a space-time compromise. It sounds like software, but it's not. It's t-shirts and pants and stuff like that. I was able to take the same amount of time folding - it didn't take me crazy long to fold the clothes - I just took up more space, and then when I put them away, I was able to save time. That's this classic trade-off that we as architects have to think about. What happens when you have 158 million t-shirts?

Last reported by Netflix, we have 158 million memberships with the service. When I started eight years ago, we had less than a million, so at least you can get a sense of this growth and what happens over time and how things aren't the same as they used to be. It is not one monolith server we run in one little data center. It is a full-blown microservice architecture with tons of features, with services calling services, almost, I'd like to say, ad infinitum, because it does feel like that sometimes. That's a lot more complex than a few servers. There are a lot of servers, and there are problems that come up when you have that many servers and that many systems running all together, as you can imagine.

You might hear me say edge. Hopefully, you read the title of the talk - edge is in there. That means the edge of this graph. For most people, that's the far left here where it says edge. I'm talking about these four nodes over here, the edge where the data comes into the service. There are a lot of other services going on here; I'm just not generally talking about those. I'm talking about edge. It's where I work, and it's where we saw the most scale.

Scaling Problems

When we talk about that scale, there are problems. That's what I was alluding to, and I also want to get to it a little bit. Like debuggability. When you have 1,000 servers, you have 1,000 logs to go through, and there's no way you're going to go to every single machine and look at the logs anymore. That's just out, that's not an option anymore. Debugging these problems becomes harder because a lot of problems can lurk in 1,000 servers.

Infrastructure - when you can keep some piece of infrastructure in one box and it tracks everything, that's great, but as the number of servers grows and you deal with 10,000 instances, 20,000 instances, it gets more difficult. Things start to fall apart. The third point there is managing. We subscribe to DevOps: I'm the developer, I'm operating it, and I'm deploying it. I now have to deploy 1,000 instances. There's automation around it, but things just take longer now, and I have to monitor more servers. Things just get harder.

On the topic of scalability, it's the ability to take more traffic without noticeable degradation. I should be able to take more traffic and not have the whole system collapse. It means that I have problems that don't exist on a smaller system, but it also means that I want to give myself some headroom. I don't need to design for 10X scale, I just want 2X. I want to be able to failover to a region. I want some new feature to ship and not have the system fall over.

That's where I want to talk about scalability and where I'm coming from here. I'm trying to give myself a little leeway as my system grows because, as you saw, the system is going to grow. That's a given. The way we do that, we make some trade-offs. We're architects here, this is what we do. We look for a degree of freedom that we can move around in to make the solution work without costing lots of money, or lots of servers, or something like that. A classic trade-off here would be the CAP theorem. At Netflix, we have a service discovery service called Eureka. It's open-source, feel free to check it out. It leans heavily on the A and the P, available and partition tolerant. When it does that, it takes all the instances it's keeping track of and distributes them out to all the other instances, meaning if you get a partition, it still knows about these instances in general. That's a trade-off. It also means that when you have 10,000 instances and you add one more, you have to go tell the other 10,000 about that one new one, and that gets into the infrastructure side of things.

I'm going to focus on the architectural things that we can do as architects and not on the policy decisions. If you have a policy to keep data for six months and you turn it to three months, that's less data, that will cost you less money, go for it, save us money. That's not what I'm talking about. That's a turning the dial kind of thing, like just turning the dial down. Not super helpful. Not what I'm talking about, I should say. Likewise, if you run less servers, there'll be less problems. Don't run less servers just to run less servers. Don't run them hot. We've learned that this has caused additional latency, and that's not great either. I'm talking about being smart about this complexity as we add it.

The things that I'm going to show, I have some case stories I'm going to go through. Each of them was implemented with one or two engineers over the course of a few weeks, maybe a month. The number of person-months here is pretty low. I would say the total amount of work is not a lot. It's something that I think you could walk away with a little bit. I would say that all the technologies we used were generally off-the-shelf. This is not some academic paper you have to implement. You should feel perfectly capable of implementing the things that I'm going to show here.

Melnitz

I'm going to go through five use cases today. I'm going to focus a little more at the beginning, and as I go on, they're going to get a little shorter and faster just so I can get through enough material. First one, how do you tell if someone is logged in? How do you tell if they're a member? That's pretty important. You're going to show a very different website depending on whether someone is logged in or not, or whether their membership is active or not. And you have to tell this on every request; this is a stateless system. They're going to come with a request, and another request. You have to know if they're a member or not. You could say, "Every request that comes into the system, we just go to the database. Look it up. No big deal." It does become a big deal when you're doing 2 million requests per second. That is the scale that I'm talking about here. There are databases that scale like that. I would like to avoid it. My scalability message is that I want to give myself a lot more leeway. If I don't have to put pressure on the database, let's not.

Let's go back to what we have. We have cookies; that's how most of our devices talk to the Netflix service, and the service gives each of them cookies. These are long-lived cookies. They sit on your box for a year, nine months, but in there is a little expiration date that we put in, and this expiration date tells us how long we can trust the cookie. If it comes into our system, we crack it open and we look for its expiration. That's the dot in the middle that you see up here. If it has not expired, we're just going to trust it. This fits our business use case. At Netflix, if you're familiar with it, you're going to load it up, and maybe the last time you used the service was last night, so it's been eight hours. We're going to need to go make sure you're still a member. After that, we can give you a new cookie that's good for eight hours. For the rest of your TV-watching binge that night, we can trust that cookie, and that fits our use case. If you're a member at the beginning of the night, you're probably a member at the end of the night. Works great.

Let me just walk through how that works because we're going to turn it up a notch later. What we have here is a layer 7 proxy. This would be equivalent to an ALB, Nginx, something like that. In our case, we're using Zuul; it's another open-source project from Netflix that you can use. It lets us put custom logic in there. If you see me putting code in this box and you can't do that, or you'd have to write it in Lua or something, that's what's going on here - we're doing this in Zuul. Cookies come in, we crack one open, and we see if it's expired. If it's not expired, let it through. If it is expired, we're going to go to the account service - just look it up, hit the database, see if they're still a member. It's very straightforward. Either way, we're going to send it down, and once we have this refreshed answer of whether they're a member or not, we send out a new cookie that's good for eight hours, and that buys us some time.
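To make that flow concrete, here is a minimal sketch of the "trust until expired, otherwise refresh" decision. It is not Netflix's actual Zuul filter; the cookie, membership, and account-service types are assumptions made up for illustration.

```java
import java.time.Instant;

// Hypothetical sketch only: the "trust until expired" decision a Zuul-like inbound filter makes.
// ParsedCookie, AccountService, and the eight-hour policy are illustrative, not Netflix's code.
public class MembershipCheck {

    interface AccountService { boolean isMember(String customerId); } // hits the database

    record ParsedCookie(String customerId, boolean member, Instant expiresAt) {}

    private final AccountService accountService;

    MembershipCheck(AccountService accountService) { this.accountService = accountService; }

    /** Returns whether the caller is a member, going to the database only when the cookie expired. */
    boolean isMember(ParsedCookie cookie, Instant now) {
        if (cookie.expiresAt().isAfter(now)) {
            return cookie.member();   // cookie still fresh: trust it, no database call
        }
        boolean member = accountService.isMember(cookie.customerId());
        // ...here we would also re-issue a cookie good for another eight hours,
        // e.g. one expiring at now.plus(Duration.ofHours(8))
        return member;
    }
}
```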

The thing is, that wasn't good enough. A product manager will come back and say, "No, we want to log someone out immediately." There are a lot of different use cases for this. You might be on the phone with customer service, saying, "I see some weird activity. I just want everyone logged out. I don't want any activity on this." Maybe you've given your account to your college kid and they're watching some risque stuff, and you just don't want to know about it. "Sign them out. I don't want it. I want it gone right away."

The business requirement now is we have to log them out instantaneously, but we don't want to hit the database. That's what we're trying to save here. We came up with a system called Melnitz, and it's focused on that first eight hours. What I have to do now is identify who has had activity on their account in those first eight hours and let the expiration thing take care of itself. That's done, we figured that out. Let's focus on those first eight hours and build the set of customer IDs of who's changed their account in the last eight hours. 150 million members - how many have really hit the sign-out-of-all-devices button or called customer service? It's a small number. Here's our set.

The thing is, that set isn't that small, and I have to get it out. There are two pieces to this: one, I'm going to build the set, and two, I'm going to send it out to my layer 7 proxies. Let's talk about the size here; this goes a little bit to the scalability thing. Even at a modest cadence of four changes per second, that's still about 115,000 entries over eight hours that I need to keep track of. It might be more, it might be less, depending on time of day, stuff like that. That's a lot of data to ship out to every single one of our layer 7 proxies - there are thousands of them. I don't want to ship that much data, and I don't want to do it every minute, which is what we're going to be doing.

We introduced a Bloom filter. If you haven't heard of it, that's ok. Let me talk about why it's ok. What it gives us is the ability to answer this question: is something possibly in a set, with a high degree of probability? Is "what is a Bloom filter" a bad interview question? Yes. You shouldn't have to know about it. Does it feel like it's out of an academic paper? Yes, it does, because it's shown like this. Is it available on Wikipedia? Yes. You can read about it, and that's where I want you to go for more details; I'm not going to get too much into it. It's also sitting in the Guava library, so if you're a Java developer, it's sitting in Guava and you can use it. It's in 10 different libraries, and I promise you it's in your language of choice. It's easy to use.

What it's giving us is we can take this set of IDs, X, Y, Z, and we splay it all over this byte array, and this byte array is going to take up a lot less space than those individual IDs. There's going to be some places where they conflict. That's ok, it's going to happen with a certain probability that we can tune, and we turn it way down or way up, whatever works.

Let's talk about how this works in here. Same layer 7 proxy, same expired logic, but now if you're not expired, we're going to send you through this Bloom filter, and the way the Bloom filter works is, it's a set. You ask, "Is this thing in the set?" That's the only thing we're trying to answer. It says no or maybe. In the no case, just flow it through, and that is 99.99% of all requests. For that small 0.01% of the time, we get the maybe answer, and we go to the account service and look it up. This means we're actually hitting the database a little bit more, but not much. This is very reasonable.
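Since the talk points at Guava, here is roughly what that no/maybe check looks like with Guava's BloomFilter. The sizing numbers are assumptions for illustration, not Netflix's production tuning.

```java
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;

// Illustrative only: Guava's BloomFilter answering "no" or "maybe" for recently changed accounts.
// The expected-insertions and false-positive rate below are assumptions, not the real settings.
public class RecentlyChangedAccounts {

    // Sized for roughly the 115,000 entries an eight-hour window might hold.
    private final BloomFilter<String> filter = BloomFilter.create(
            Funnels.stringFunnel(StandardCharsets.UTF_8), 120_000, 0.0001);

    void record(String customerId) {
        filter.put(customerId);
    }

    /** False means definitely not in the set (skip the database); true means "maybe", so look it up. */
    boolean maybeChanged(String customerId) {
        return filter.mightContain(customerId);
    }
}
```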

I'm going to talk about how I set it up. I'm not going to go into as much detail on the other use cases, but I think this is really relevant because I want to show how accessible this is. We had an account service; that already existed. This doesn't work without an account service. It was already sending events out over some Kafka-like stream - this already existed for us. We wrote a service that just listened to events and then saved them in a database. If things go right, this is like five lines of code. This is pretty simple, straightforward. We then read from this database every one minute, say, "Give me all the entries for the last eight hours," we put that into the Bloom filter, and we ship it out. We ship it out using a PubSub system. We have one internally; it's not open-sourced, but there are other ones available. Apache has one that came out of Yahoo, which is very similar to the one that we have. Redis has a PubSub, Google offers one - PubSub is generally available. We just throw it out there: here's the new list of all the customer IDs from the last eight hours who might have changed their account, and now it's on our layer 7 proxy.
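Here is a minimal sketch of that refresher side, assuming some store of recent account-change events and some PubSub client - both hypothetical stand-ins here, not the actual Netflix services: every minute, rebuild the filter from the last eight hours of changes, serialize it, and publish it for the proxies to pick up.

```java
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.time.Duration;
import java.time.Instant;
import java.util.List;

// Sketch only: rebuild and publish the Bloom filter every minute.
// RecentChangeStore and PubSubClient are hypothetical stand-ins for the event store and PubSub system.
public class MelnitzRefresher {

    interface RecentChangeStore { List<String> customerIdsChangedSince(Instant since); }
    interface PubSubClient { void publish(String topic, byte[] payload); }

    private final RecentChangeStore store;
    private final PubSubClient pubsub;

    MelnitzRefresher(RecentChangeStore store, PubSubClient pubsub) {
        this.store = store;
        this.pubsub = pubsub;
    }

    void refresh() throws IOException {
        List<String> ids = store.customerIdsChangedSince(Instant.now().minus(Duration.ofHours(8)));
        BloomFilter<String> filter = BloomFilter.create(
                Funnels.stringFunnel(StandardCharsets.UTF_8), Math.max(ids.size(), 1), 0.0001);
        ids.forEach(filter::put);

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        filter.writeTo(out);                               // Guava can serialize the filter compactly
        pubsub.publish("melnitz-recent-changes", out.toByteArray());
    }
}
```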

We now have this near real-time data set we can look at to log people out. We haven't increased the number of calls to the database, so that was one of our goals here. We're pretty happy with it, but we did have to make some trade-offs. One, there's now a probabilistic data structure in there, and I promise you, every new employee who joins the team is going to have to go to Wikipedia and read up on it. It is a library you just call, but you should understand what's going on. We traded a little more code complexity to get this neat space savings.

Also, off-the-shelf components. We could have sent those events from the account service, the one that's doing this account stuff, straight to the layer 7 servers, those proxies. That would work, that would be basically instantaneous. There are a lot of problems that could happen there, and we could handcraft it, but we can also just use off-the-shelf components. I used off-the-shelf Kafka and PubSub, and I wrote a service that's like five lines of code - or two services that are five lines of code each. That was worth it. I was much more willing to have two simple services than to take my account service, which is super important, and my layer 7 proxy, which is super important, and put a whole bunch of complicated code in them. That's the trade-off I made here. That was the first use case. I'm going to go through four more.

Mantis

We're going to talk about the needle-in-the-haystack kind of problem. I talked about how, when you have 1,000 servers and you're trying to find a log in there, how do you find it? How do you find this problem? We need to solve this all the time. This is just a classic debugging kind of thing: you've got to find an error message. If I do this at the previously stated 2 million requests per second, and let's just say every request with some metadata is about three kilobytes - and I showed you that graph, they're going to call other microservices, and we want to record all of those - you multiply that out, and that's 4.6 petabytes a day. You could do that, you can put that in Elasticsearch. I trust you there's a large company out there who is willing to take your check and store all this. I'd like to avoid that.

Let me just start with the basic architecture - the pictures. You have a service, it's publishing all these events, all these logs, and let's say we put them into Elasticsearch so you can search for them. For those who can't see in the back, there's a thick red line between the service and Elasticsearch and a thin line to, let's say, Kibana or your UI over it, but you have to load all this stuff into Elasticsearch, and I don't want to put that 4.6 petabytes in there.

In comes Mantis, a recently open-sourced tool from Netflix that we've been using internally for a while, and I absolutely love it, and I think you will too afterwards. It is event stream processing. It is not specific to logs. It's there for events, and it's going to give us two big constructs. It is going to put an agent in your service. This is a local agent that you run that you give it the event and it is responsible for getting that event out of your box, so think of it as for every request that comes in, we create an event, that would be our equivalent of a log, and it's going to send it to a Mantis source job. It has a cluster that's dedicated to receiving these events and sending it on to somewhere else. I've drawn it with really thick lines because I haven't told you anything else about why this makes it special.

The thing that makes it special is that Mantis has a query language: MQL - Mantis Query Language, pretty obvious. You can write this query, you can say, "Show me all the people in Brazil watching Episode 3 of 'Stranger Things' on an iPad." Super specific - that's a select statement right there with some where clauses. That will get loaded into the source job, and it's only going to allow things that match that filter to go out. Now we're talking about what's going to Elasticsearch. It's a lot smaller. For the people in the back, there's a thin red line now going to Elasticsearch. There's a trickle of data - only those people in Brazil on an iPad watching Episode 3 of "Stranger Things."
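As a rough idea of what such a query might look like - purely illustrative, with the stream and field names made up rather than taken from Netflix's actual event schema - it really is just a select with where clauses:

```java
// Illustrative only: an MQL-style query as it might be submitted to a Mantis source job.
// The stream and field names (playbackEvents, country, deviceType, title, episode) are assumptions.
String mql = "select * from playbackEvents "
           + "where country == 'BR' "
           + "and deviceType == 'ipad' "
           + "and title == 'Stranger Things' "
           + "and episode == 3";
```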

What gets even better with Mantis is that it takes that MQL and pumps it down to all the agents. You might have 1,000 servers, 10,000 servers, or 5. That query is going to go down to the agent, which is going to run the pre-filter and only send off-box those messages that match, and that is a trickle of events. I can get extremely specific tracking. I can see these requests come in live. There's a browser plugin where you can watch this stream. I'm getting it live, with no huge overhead. I'm not storing tons of data here. This is what makes Mantis a big win for us.

The thing that it does is avoid work. We're not shipping all this data off-box. We had to put an agent in our box, but we avoid a lot of network traffic going on here. I'd say those are the two pieces here. I'll start with the scales-from-zero sort of solution here. I could have this thing on all night, and it's only going to match those iPads in Brazil - that's awesome. I could leave it on all month, no big deal. I could go home at night-time and turn everything off, and there's no additional impact to the system. The trade-off here is, I don't have long-term historical data, but I do get a really easy way of starting up a new query. I don't care about the cost because it's going to be so optimally efficient for what I am looking for. That's the trade-off.

The other trade-off is this Mantis query language. I had to put an agent in my box, there's something in my runtime now. They control it, if I want to get more complicated and it's not part of the query language, I can't do it, but this is something where I'm willing to hand it off to them. I trust this team to put code in my box. I trust this team to enhance this query language over time. I could have put custom code in my box, but I couldn't put that custom code in all the boxes, and so I'm willing to make the trade-off.

Passport

That was two case stories; let's talk about the third one. This one is a little specific to when you have extremely common data for your requests. Let me explain that a little bit. Membership, that's a common one, or user preference. Let's say they are a premium member, and you will always give them a little more data on every request. Every request comes into your box, and based on the membership, maybe you do something different. Something equivalent for Netflix might be membership levels. There's a two-stream plan and a four-stream plan. Knowing if it's a one-stream or four-stream plan helps us change our behavior a little bit. This is important data for us, but you might have more important data, like a user preference or something like that. Super common data.

You can imagine how you load it. A request comes into the system, gets loaded into API, and he requests the service plan from the database. The call goes to mid-tier A, he requests it. The call goes to mid-tier B, he requests it. It goes to mid-tier C, she requests it. We've got a bunch of requests hitting the database. One request in, four requests to the database. This is the kind of scalability thing I'm trying to avoid a little bit. Let me layer in one thing. We have this thing called passports; we introduced this for identity. It tells you who this call is made on behalf of, and we do this right at this first layer 7 proxy. We figure out who this customer is, and what their customer ID is, that kind of stuff, and we're passing it through our system. We're going to make a change here. We're going to take this very common data, and we're going to load it up right at the beginning. We're going to shove it into this passport and have it passed down. Now, one request comes in and one request goes to the database, so we've actually gotten much more linear scaling. This is important to us. Everybody along the way doesn't have to make a database request, so now it's a little bit faster. He doesn't have to request it; he's getting handed this data structure now.

You should all be thinking from the keynote, "But what if you write to it?" You want some consistency here. This is the non-linearizability that he was talking about. If mid-tier B along the way changes the plan - you're in the process of changing the plan, and now you want to render something for them - you want a consistent experience across them. If mid-tier B here - I've shown it in maroon on the right-hand side - is writing to the database and then a request goes to mid-tier C, he's still going to see the blue dots, the blue membership. That could be our inconsistent experience. These are eventually consistent databases, and we want to do better. We don't want quorum on this database; we don't want to go through all the trouble you saw in the keynote to try and give a consistent experience.

It turns out the passport has this contract where, if you are changing the passport, you have to send it back up. What that means is when mid-tier B is making a change, he's going to create a new passport with this new data and send it back up to A. A is now responsible for using that data when it calls mid-tier C. Now, mid-tier C is getting a consistent experience regardless of the state of the database. He might get back to you next week. It might be made out of pigeons, who knows? It's going to come later. What's not shown in this slide is that the passport is actually passed up to API and up to the L7 proxy, and he can do something. He could update your cookies with this new data or some aspect of this data. It doesn't have to be the full data. We can make a reference to an immutable data ref object. We don't need the full object going around, but the key is we're passing this data up and down this stream.
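A minimal sketch of that contract, with hypothetical types (not Netflix's actual Passport implementation): the caller passes the current passport down with each call, and any service that changes the underlying data hands back an updated passport, which the caller must then use for subsequent calls.

```java
// Sketch only, with hypothetical types: the "send the passport back up" contract.
// The Passport carries the common data (e.g. the plan) that was loaded once at the edge.
record Passport(String customerId, String plan) {}

record Request(String payload) {}

// Every downstream call receives the passport; a service that changes the data returns a new one.
interface MidTier {
    Passport handle(Passport passport, Request request);
}

class MidTierA {
    private final MidTier b;
    private final MidTier c;

    MidTierA(MidTier b, MidTier c) { this.b = b; this.c = c; }

    Passport handle(Passport passport, Request request) {
        // If B changed the plan, it hands back an updated passport...
        Passport updated = b.handle(passport, request);
        // ...and A must use that one when it calls C, so C sees a consistent view
        // regardless of how far the database write has propagated.
        Passport result = c.handle(updated, request);
        // The updated passport also flows back up toward the L7 proxy, e.g. to refresh cookies.
        return result;
    }
}
```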

We made this small change where we're passing around some more heavy data, and we did that to put something in the layer 7 proxy, but now we get this great experience where we can get some consistent experience. This is near the land of caching, I get that. If you need to cache things, you should cache it. This is only applicable if you're going to, at the first entry point, be willing to look it up. Use caching if you need to. This is a little bit of a unique use case.

We got there via two trade-offs. One is data passing. We had to move past "I am an isolated mid-tier microservice" and "I will call this other microservice, and we will not share any state." I'm saying, let's switch that up, let's put a little extra data on the wire. Yes, it's nice to keep your request small, but if I grow my request size by one or two K, it's not going to take down the network. At least I haven't taken down the network yet.

The other thing here is a heavier data structure. I'm not just passing along a 64-bit int through my system, I have made a heavier data structure. All the people in the system now have to understand this passport. They need to know how to read it, some of them need to know how to change it. That's a trade-off. Now, I need to coordinate with a lot more teams, but I'm willing to make that trade-off to get this consistent database request. That's three. Let's go on to the fourth one.

Device Types

Netflix, you can watch on many different types of devices. There are over 2,000 different devices you can watch on. We have learned that it is useful to categorize these devices - not just individual devices, my iPad versus your iPad, but all iPads, and all old iPads and new iPads. We want to group these things together. It looks like this. The numbers aren't important; all you need to know is there's some gobbledygook on the left, and we have to convert it to some useful number on the right. That's what this service is doing for us.

Let me make it really clear why it's important to us. This is an error metric, a real one taken from yesterday. Something is going on, I don't know what device is doing it, and to my point earlier, when something goes bad, it goes bad for all types of devices. If it's for Android devices, which it turned out to be, you can see really quickly that the bulk of this is Android. I can say, "Group these things together." This grouping is really important for telemetry. We need this to debug issues in production, so we consider it critical data that we need to have available to us. It's also considered ridiculously cacheable. This doesn't change very much. Keep that in mind.

When I joined the team, every mid-tier service would call this DTS service and say, "Do you have anything new?" When he replied, he'd give you 100 megabytes of XML. Somewhere in there, they felt like, "Let's put in a backup system in case DTS falls over," which it could do because everybody is asking for 100 megabytes of XML. They put in a PubSub system as a backup and said, "If there's a change, we'll just push it out over the PubSub system." It didn't take long to realize the backup was better than the real thing, so we just threw out the real thing and used the backup. What that did is let us take DTS out of the critical path. If DTS isn't around, no big deal. We have this PubSub system that is managed by another team, and it's rock solid. It is infrastructure from our point of view, and it pushes out what was 100 megabytes of XML. We did better: we moved it to proto, and we compressed it. It got better, but the PubSub was the big thing.

Even there, we started to look a little bit deeper and realized, "What is this data? What's the real business requirement? How fast do I really need to get these updates out?" It didn't take long; you started asking them, "When does new data flow in?" "It flows in all the time, and we should see results." Ok, what is this data? This data is, someone in China wants to create a new TV, and they tell Netflix about it. They want to go through certification, so in March they tell us about a new kind of TV and they start the certification process. They eventually build the TVs, they certify the Netflix app on them. It shows up at Best Buy, and then you buy it. That's March to December; that's a nine-month lag before we see this device in production. This data can be nine months stale, and no one's really going to know about it.

That took a little bit of the five whys. We were like, "Why do you need this data?" It turns out they didn't need it for nine months. We started to actually embed it in the jar. We had a client library, we'd ship it out, and we'd just put old data - whatever the data of the day was - in there, and we had compressed it small enough at this point that it really wasn't a big overhead. Now the data actually lives in the jar. You can still use PubSub, and you can pump out updates, but if PubSub goes down, this highly critical data is available to everybody. It might be old, it might be nine months old, but that's fine.
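A rough sketch of that pattern, with hypothetical names (the resource name and update wiring are assumptions, not the real DTS client): the client library ships with a snapshot of the mapping baked into the jar as a resource, and PubSub updates, when they arrive, simply replace the in-memory copy.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.Properties;
import java.util.concurrent.atomic.AtomicReference;
import java.util.stream.Collectors;

// Sketch only: critical but slow-moving data baked into the jar, with PubSub updates layered on top.
public class DeviceTypeResolver {

    private final AtomicReference<Map<String, Integer>> mapping =
            new AtomicReference<>(loadSnapshotFromJar());

    /** Fallback data shipped inside the client library; possibly months old, but always available. */
    private static Map<String, Integer> loadSnapshotFromJar() {
        Properties props = new Properties();
        try (InputStream in = DeviceTypeResolver.class.getResourceAsStream("/device-types.properties")) {
            if (in != null) {              // the baked-in snapshot is expected to be on the classpath
                props.load(in);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.stringPropertyNames().stream()
                .collect(Collectors.toMap(k -> k, k -> Integer.valueOf(props.getProperty(k))));
    }

    /** Called by a PubSub listener when a newer mapping is pushed out. */
    public void onUpdate(Map<String, Integer> newMapping) {
        mapping.set(Map.copyOf(newMapping));
    }

    /** Gobbledygook device identifier in, useful category number out. */
    public Integer resolve(String deviceId) {
        return mapping.get().get(deviceId);
    }
}
```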

Along the way, we tried a whole bunch of other things like, "We'll send incremental updates." We were getting really fancy, but it was getting really complex. I'd much rather simple over easy in this case. We just ditched all those plans and just put it in the jar, and we're done.

There are two trade-offs we're talking about here. One is business-tuned fallbacks. This is really going back to the business and asking, "Exactly what do you need?" If we're going to be architects and we're looking for that freedom of moving things around, we really do need to know what they're really asking for, because if they ask for something real-time - if you go back to the log-all-the-users-out example earlier, they wanted real-time. It didn't take long to say, "Real-time is within 10 minutes." It's a very different definition of real-time. Likewise, leveraging existing infrastructure. One was the PubSub piece, which is still important - we do like to get updates out if you are testing internally or something like that - but I'd much rather use that existing infrastructure whenever possible than have my own service where I can tune it and change it immediately. I will use existing infrastructure; that is the trade-off we made, and we were very happy with not having to get paged for a data service.

Sharding

That is four use cases. Let's talk a little bit about sharding. There are different definitions of this. I'm going to give one out there, and this is runtime refactoring. If you can imagine refactoring your code, you might hopefully not change your tests. You change your code, tests stay the same. You've just moved things around. Functionally, it is the same thing. I'm talking about this at a runtime level. I'm going to take my runtime system, and I'm going to move things around, but it still has the same functionality. That's what I mean by sharding it out, so taking chunks of code and moving it around. In this context, I'm going to make reference to a message security layer. That is literally the name, you can find it on GitHub, it’s another open-source project from Netflix. It's shortened to MSL. It is a secure message sharing framework. It has encryption and authenticity and replay protection and stuff like that. It's like TLS, that's probably the best equivalent I can give you. In our case, it has some additional business things that it gives us that we want from our business, but think of it like a TLS. Unlike TLS, you can't just go to Amazon and say, "Give me a load balancer that supports MSL." You can't terminate at the load balancer, so we're going to terminate at API. Same architecture around the edge. You get a layer 7 proxy sending traffic to the first level.

This API service is doing a lot of things. It's going to terminate the security layer. It's going to run some Groovy scripts, because that's just how it works. It's going to have clients for everything else in the system. You saw that diagram; it spreads out really fast. This API service is going to have client libraries for everybody out there. It's a beast of a machine, it's doing lots of things, and one of those things was MSL. We had our code, we had to get it in there, and we wanted to get it out. That's what we're going to do. We refactored it, sharded it, took that logic out, and put it into its own service. You can't entirely put it off into its own service; in our case, we have to put a small little piece in the L7 proxy. It is a security layer, but it's also encryption. We just need to decrypt and encrypt. We keep those simple operations up at the layer 7 proxy right there.

Of the MSL functionality we're talking about here, there's the encrypt-decrypt, which we're going to leave at the L7 proxy, and then the more complicated pieces - we've got to do key exchange, we have to authenticate devices - we're going to move off to their own service. It does all the same stuff; we've just moved things around.

We were doing this for our own operational benefits - we didn't think much would change - but then people were like, "You must have messed up. Look what happened to our CPU, it's on the way down. People must not be calling us." It turns out it just got cheaper to run API; it took less CPU to run API. We anticipated this, but we did not expect a 31% drop. "We'll take it, these are free." Right off the bat, that should be enough. Presentation over; you should refactor your code. Then we saw it in latency. "Ok, that's interesting." It's always nice to get better latency, but there isn't huge overhead to the encrypt-decrypt, or if there was, it doesn't account for this much of a change: a 30% decrease in average and 29% in P99. If you can get a P99 reduction, that's awesome. We were really happy about that.

When you look at latency, it doesn't take long to start looking at garbage collection, and that's what we looked at, and we found this drop. The only reason you see a delay here is that we were doing a slow rollout of the feature. The top - I know there are no axes here - is a nontrivial number, like hundreds of milliseconds if not seconds, down to what we consider zero. The amount of time we spent in garbage collection got a lot better, which is what gave us the latency, which gave us the lower cost.

What we couldn't see when MSL and Groovy scripts and all these things were all in one API service was that it had some really bad GC characteristics, and it was to the point where the JVM couldn't really deal with all these things going on. When we separated it out, the JVM could say, "That's what you meant to do. I can deal with that, I can optimize around that." We actually are now running one MSL service and one API service where we were previously running three, because the JVM - in our case, our VM - can be more efficient now. It knows what your code is doing, it can do analysis on it. We can do analysis on it. I can now go to this MSL service, profile it, and see what it's doing. When it was all in API, too much was going on; I couldn't look at a flame graph and reason about it. By moving it out, I can now get simpler operations out of it, and that's what makes it a big win, I would say - you can now reason about the operations piece of this, which you can't do when you have this monolith, and that's why you should do it.

The big overhead, the trade-off, is that I take on a lot more operational overhead. I have to write a whole new service that is critical to the running of Netflix, but my other critical service now becomes more reliable, and that's a pretty good trade-off. We can also now separate teams: the team running API doesn't have to be the team that runs MSL. That was worth the trade-off in this case.

We went over five use cases that we did at Netflix, and I get that our scale isn't the biggest scale ever, but it's also not trivial. These aren't necessarily problems that you will see all the time. It won't affect everybody, I get that. I'm just hoping to give you some insight into how we think about it. It goes back to that scalability: I want to make sure I give myself some leeway. If the idea is to just hit the database more in a linear fashion, I'm going to think twice about it and try to do something sublinear - something that grows at a more reasonable rate - so that as our customers keep coming in and we get more memberships, we don't have to keep rearchitecting our entire system.

Bonus trade-off: do laundry. The trade-off is you have to do laundry, but the benefit here is your family loves you for it, so definitely do laundry. You can reach me at this email or on Twitter. I know I did go fast over some of this stuff, it is a lot. The lesson here is there are tools available for you.

Questions and Answers

Participant 1: For the Mantis system, you guys are aggressively assembling and doing queries and all this stuff. My question to you is, with this system, you guys are effectively dropping a lot of data that could help you perform forensic analysis. How do you approach that?

Ryan: The general question is that we are filtering a lot of data, which means we're throwing away a lot of data. How do we do further analysis on this useful data? We still can. I don't have to be the only one writing a query. Anyone can write a query. There are still streams that flow off that do get recorded, but then you can change it. I mentioned that it's a query language; it has a select statement, and you can say "select this field and this field" and get very small bits of data, and the team that's interested in the small bits of data can just take 100% of all the data - I could say select star on the small bit of data. We actually both can play in the same box and run parallel requests and still get what we're looking for. Yes, if you truly have to get the data off the box, maybe the agent in that diagram does still send all the data off-box, but it still goes to this Mantis source job. That can then filter it and send it to the right places. If you can't get the benefits at the first layer, the agent, you can get them at the second layer, and that's fine.

Participant 1: What about in the eventuality that you don't necessarily know what you're looking at or looking for?

Ryan: The question is about the eventuality where I find out later that I needed this data. That's the trade-off that we're making. We have to say, that said, we've been doing analytics for a while; we know the things that we do want. Usually, what I find is that it's some new service, some new feature I turned on, and it's new stuff, and that's where we're focusing most of our time, but there is some older data, historical data, that's used for analytics for sure.

Participant 2: My question is about the sharding part. You split the code into another service, and then the performance improved? I don't understand how the performance is hugely improved.

Ryan: The same amount of work is happening, so you would think this would just be literally one plus one equals two. In our case, we run in a virtual machine, and our code gets JITted over time, and so one of the things that actually happened to us was that we had memory being allocated, and in the scope of a request, the JVM couldn't tell if this data was new data in the eden space or belonged in the long-lived space, because what was going on in the span of a request was so much. When it was on two different machines, running in a smaller request scope, the JVM and the JIT and the memory analysis could go in and say, "You actually don't use this memory very long. I can throw it out, garbage collect it for free," and this guy could do it too. Both of them are making much smarter garbage collection decisions, and that's where a lot of the performance came from.

It also opened up a Pandora's box. Now that we had a dedicated service, we could start performance tuning it and getting the real kind of benefits of tweaking your code and finding really bad hot loops and stuff like that. Some stuff came for free because of garbage collection, but there was also a plethora of new ability to do performance tuning.

Participant 3: My question is about your passport payload. Is it something that you would put on the header? What is the mechanism behind how you would actually store that data so it would be transitive across requests? Then secondary follow-up is, when the service needed to update the passport, what was the interface mechanism to get it back? Was it in the response, or was it in some other way?

Ryan: They're two very related questions. How do we pass this passport through? A vast majority of requests are HTTP. You can send headers, so it's considered metadata as far as your request goes. We tried to leave the request untouched, unscathed, and it was sent as HTTP headers. In gRPC, which we also use, there are metadata fields we can use, but we also at some point said, "This is so useful. Let's just make it part of the request." If your service wants a passport, it wants to know who you're acting on behalf of, take that as an argument. As much as we did use it as a sideband channel, we moved towards, "Let's just move it into the request if you can." If you're writing a new service, we would ask you to put it directly in the request, because if you're going to try to reproduce that request, it'd be nice to just say, "Here's the payload. Here's the passport." On the way back up, it is HTTP headers also, or you put it into the response on the way back up.

Participant 4: I'm curious about your thoughts on the performance of serving recommendation-engine-based results, specifically with two scenarios. One where they are sorted, filtered, and largely around items whose state, particularly around availability, changes very rapidly. In the scenario where an item is available for maybe two minutes, and now it goes away, or it goes into a different state, that means it needs to be way further down because it's just a different class of thing, and then it comes back a minute later.

Ryan: I don't think I follow the part about the recommendations and data changing versus becoming unavailable.

Participant 4: Specifically, recommendation around the fact that if you have all this stuff in Elasticsearch, you could be updating it and you can do your search, and that's really easy, but when it's a recommendation engine type of scenario, the sort is highly influenced by the user-item relationship.

Ryan: I'm a little bit familiar with the whole recommendation engine; it is different. It has different trade-offs for sure, and it is not very cacheable. Every customer has a different set of recommendations that we want to serve. In our case, nothing I was showing is really customer-specific. In that case, they do lean a lot more towards Memcached-like solutions, pushing updates to Memcached. I think it gets a little more into what the keynote was talking about: if you want that different characteristic, you choose one of those systems.

Participant 5: For passport, you mentioned that you implemented some contract, and I can imagine that contract involves being able to deal with the same message twice, essentially message replay because you're basically taking a request that you've mutated and then shoving it back through all the previous stages.

Ryan: It does not go through previous stages; it's for subsequent requests. In the model in that picture, mid-tier A is going to call B, then C, then D, let's say. If something happened in B, then A calls C. They're not happening in parallel, it's serial.

Participant 5: Right, but the preceding stages have to deal with the same request essentially happening multiple times, so if there are any side effects - how difficult was it to get buy-in to the contract, essentially? How much fallout was there from implementing this new contractual thing that's not really obvious?

Ryan: There are two pieces. There's the carrot and the stick. A big piece of it is, if you are going to change data and you want other people to see it, you have to follow this contract. If you want to be a service changing the data and it's not being reflected in other places, it's not getting saved with that cookie, it's not getting back up to that layer 7 proxy, your data doesn't go out - so does your service even exist? There was a reason why they'd want to participate, for sure, and very few services are changing it and needing the changes reflected. We only had to talk to very few teams on that front. The other part of the contract was just: you receive the passport and then you send it back down. This is just classic tracing, flowing through like that. That was very easy; we already had tracing mechanisms that we could use. And for the flowing-back-up part of it, it was, if you want your data to be updated, you'd better send the passport back up, so it melded really nicely for us.

Participant 6: My question is around the device types. You mentioned that when you have a subscription, after being used for nine months, you create a PubSub model and you create a jar, or something, you mentioned. Do you store that PubSub model and communicate back when the device comes online, or in production? The entire cycle, the performance, or the management of getting something into production from the day it's noticeable - how do you do that?

Ryan: Let me see if I get the question right. How do we manage getting that data pushed out to production? It flows two ways.

Participant 6: You need to store somewhere, and you need to communicate back?

Ryan: We have a source of truth, a database that stores the truth of it. There are 2,000 devices, but truly, that's not a big database. That is just a source of truth that we live with, that we look at, and it's a matter of, the system that makes a change to the database also knows how to push a change to us so we can push it out over the PubSub. That coordination wasn't too bad. They know they made a change, so they make sure to tell us. It does give us this dual world where the source of truth is over here, which is not production data, and we're pushing data out to production. We really want to keep those two worlds separate. The team that's optimized for making those source-of-truth changes isn't the team that knows how to really run things in production, or doesn't have that training or that experience. I don't think I'm answering your question per se, though.

Participant 6: Do you have an orchestrator?

Ryan: Sure. They call us, I don't call them. When they make that change, and they've made a change to the UI, they tell us that there's a change. That was our decoupling. We wanted a nice clean decoupling between the team that was not doing production changes and us, the team that was doing production changes. They would just call our service and say, "There's new data." There wasn't a lot of coordination on that front.

Participant 7: I was really impressed by the scaling from 1 million to 150 million. This is less about engineering, more about ideation - I'm sure at 1 million you guys had thoughts about how to architect this thing that you wouldn't have imagined doing at 158 million. I'm curious, from an ideation perspective, over the years your team drew inspiration about practices that you would need to learn from because obviously, your team at 1 million has a limited knowledge of what the scale needs to be and then over the years, you need to either internally be really creative or look outside for solutions. In terms of the eight years, how did you guys get inspiration for practices in ideation?

Ryan: I wasn't working on the product the whole time, but I can tell you that it comes from a few different veins. One is the willingness to rewrite things. If I'm telling you that scalability is giving yourself some wiggle room and you can grow 2X, it means that you have to be willing to rewrite it when you get to 2X and prepare for 4X. There is a willingness to rewrite. I think when people would propose, "We need to rewrite it," we were pretty open to the idea. There has to be a willingness to rewrite; there's value in that. The other half of it, I would say, is that teams would partition themselves and take ownership of it. If my team is a small team and we're seeing growing pains, that's how we know that it's time to re-architect within our smaller domain. We don't have an architect saying, "We think there's a general re-architecture needed." We did it at a smaller scale, and so that really let teams know when and how they should start thinking about the scalability issue, because they have the responsibility. If their service goes down, that's on them. They knew that pain.

 


 

Recorded at:

Dec 16, 2019
