
Facilitating the Spread of Knowledge and Innovation in Professional Software Development


Authorization at Netflix Scale


Summary

Travis Nelson discusses Netflix’s approach to scaling and shares techniques for distributed caching and isolating failure domains.

Bio

Travis Nelson is an engineer in the AIM (Access and Identity Management) team at Netflix. He’s been there four years, having done a tour of Silicon Valley companies.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.

Transcript

Nelson: We're here to talk about authorization at Netflix scale. My name is Travis Nelson. I work as a senior software engineer at Netflix. I work for a team called AIM, which deals with the identity, authorization, and authentication of our users. We do that for the website and for streaming applications, and also for mobile games. It's the consumer side of the house and not the corporate identity side of the house.

Here are some questions to think about while we go through this presentation. The first is, how do you authenticate and authorize 3 million requests per second? It's a lot. We have to deal with that all the time. How do you bring design into an evolving system? Our system, as many systems do, evolved over time. When things evolve, it can be hard to bring in a good design that wasn't thought of at the beginning. Our particular problem is, how do we bring a more centralized way of thinking about authorization into our evolving product? How do we bring in complex signals, things like fraud and device signals that might be ambiguous? How do we think about that? Finally, how do we build in flexibility? The last thing we want to do is lock our system down to the point where it's hard to make changes. We actually want to make changes easier to implement, and make the authorization system an enablement mechanism for product changes.

Outline

Here's what we're going to talk about. First, where we were: the state of our system before we started this effort. Then, what our authorization needs are in the consumer space. We had to think about the requirements: what do we actually need to do? Then we'll talk about centralizing authorization, what that means, and how to go about it, or at least how we went about it. Finally, how we think about things like scaling and fault tolerance, because we have such a large amount of traffic coming in all the time.

Authorization in the Netflix Streaming Product Circa 2018

Here's where we were circa 2018. We had authorization built into the system as the system evolved. The client passes an access token into our edge router, which transforms that into what we call a Passport, our internal token-agnostic identity, which contains all the relevant user and device information. That would go to any number of microservices. Each of these microservices would potentially have its own authorization rules based on that identity. It might also call out to our identity provider to ask for the current status of that user, which might mean that several different microservices are all calling into that same identity provider. The trouble with this, of course, is that the authorization rules might be different per microservice, or they might be the same, so you might end up with several different microservices all having the same authorization rule. That rule might be replicated potentially hundreds of times in a very large system. That's where we were.

Pros and Cons of this Approach

There are some pros and cons to doing that. First, the pros. It's pretty scalable. If you put authorization rules in every one of your microservices, and these microservices are all independently scalable, you have a pretty easy way to scale your system: you just scale up all the services at once. You never really run into roadblocks with authorization being a limit to scaling. You also have good failure domain isolation. What I mean by that is, if somebody in one of those microservices messes up one of those authorization rules and then pushes their service live, it only affects that particular service, and the other services around it are resilient to that failure. It might not be noticed, or it might cause a small partial outage somewhere, but nothing catastrophic. That's actually a good thing. We don't want catastrophic failures that impact the entire system all at once.

There are some cons too. One is that the policies can be really too simple. One of the most common policies we found was: if your membership is current, then you get the stuff. We have that check everywhere throughout the code. It's often a good thing to check that, but that's not really the question that people are asking. The question really should be more like, can I access this particular thing? Is this particular feature on the site or in the mobile app accessible by this member, or this user, or this device? The questions are far more nuanced than they're actually expressed in code. Then, when you add complexity, the complexity gets distributed. If you want to change something like that, if you want to say that they must be current and not in this country, or current and not on an anonymous device, then that change ends up getting distributed among all the different services that also contain that rule. Of course, if it's distributed, then it's also difficult to change. Tens of services might need to make the same change, instead of potentially a single point of change. That means you're going to make a lot more changes, you're going to involve a lot more teams, and your timeline is going to increase because of that.

Exceptional access in that scenario is also quite difficult. If you have some special piece of content, or some special feature that you're trying to enable just for a certain set of users, that can be hard to do, because your access controls are all that simplistic and replicated everywhere. If you have something that violates that constraint, and you still want to give a certain level of access, then you can't do that. If you want to enable a certain privilege, and do that privilege check at various points in the system, you'd also have to distribute it. It becomes untenable at a certain point, if your rules become more complex. You can't really do that thing anymore. Also, fraud awareness ends up being localized in that model. When you have complex fraud detection systems, you want to make sure that you can put them at the various points where the checks for fraud are useful. If you only do that at certain points, you can easily miss other points. By putting that more in a central location and knowing when your fraud checks are being executed, you can have a much clearer picture and much better coverage of that signal.

Some Netflix Authz Concerns

We also had some specific Netflix Authz concerns. We wanted to take a step back and look at what we actually need for our consumer-facing experience. One is that we needed a way to model the things that a user can and can't do. Maybe some users on some devices can access some functions, like a function to link your mobile device to your TV, and others cannot for various reasons. Maybe their device is insufficient. Maybe their plan is insufficient. There are all kinds of different reasons. We also wanted to be able to get the same answer everywhere. We wanted all the microservices to have a common point of reference to be able to get that same answer: if everyone's looking at the same feature, they should get the same answer for it. We also wanted to look at how we think about the corporate identity versus the consumer identity versus the partner identity. There are different places where they intersect, and we needed to make sure that we accounted for that. We also have things like API access, and special access for special users, where we give them some amount of privileged access that normal users don't have. We also needed to accommodate things like device, presence, and fraud signals. We basically wanted a way to write policies around all of this: the user, their status, their plan, their device, their presence, fraud signals, all of it.

What are we authorizing? We're authorizing things like product features, the different things that you can do via the website or via a mobile app or a TV app, things like the profile gate, or whether you have access to mobile games or not. We also want to be able to authorize the videos that you have access to play back or download. Those were also part of the equation. If you're authorizing that, then it also becomes a discovery issue, because what the user can see and interact with in their UI is directly related to what they're authorized to use, so we want policies around discovery as well. For example, a title that you have early access to will appear in your list of movies, whereas if you don't have access to it, it won't appear. That's the thing we need to be able to express in these policies. We want the discovery and playback rules to be coherent.

Approaches to Authz in the Corporate Applications Space

There are some different approaches to authorization that we looked at, and there are obviously some pros and cons to each, and some applicability here and there. The first one is role-based access control. This is where you assign each user a role, and you give the role permissions. Then the client basically asks, does this user have this permission or not? Which is a fine model; it actually is really nice. We looked at this extensively. We realized that we wanted to be able to accommodate a much broader variety of signals, and create policies that were based on many more things than just the user and some role assignments. We ended up with something beyond that.

The next approach is attribute-based or policy-based access control. This is where you look at all the different attributes, and then you assign permissions to sets of attributes through policies. That's pretty close to what we're doing. We do have policies that look at all of those different signals. There's another kind of access control built on top of that, called Risk-Adaptive Access Control, or RAdAC. We're moving in that direction, where we include fraud and risk signals as well. It's not only attribute based, it's also based on risk scoring and that sort of thing. We want to make sure that we are able to make different decisions based on various kinds of attributes of the user, including things like risk-based signals.
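To make the contrast concrete, here is a minimal sketch in Java of what a policy-based check layered with a risk score might look like. This is not Netflix's policy engine; all of the types, attribute names, and the risk threshold are hypothetical, purely for illustration.

```java
// Hypothetical sketch, not Netflix's policy engine; all names and thresholds are illustrative.
import java.util.Map;
import java.util.Set;

record AccessRequest(String feature,
                     Map<String, String> userAttributes,   // e.g. membershipStatus, plan
                     Map<String, String> deviceAttributes, // e.g. deviceType, authenticated
                     double fraudRiskScore) {}

interface Policy {
    boolean allows(AccessRequest request);
}

// A policy over attributes plus a risk threshold, rather than a role-to-permission lookup.
final class MobileGamesPolicy implements Policy {
    private static final Set<String> ELIGIBLE_PLANS = Set.of("STANDARD", "PREMIUM");

    @Override
    public boolean allows(AccessRequest r) {
        boolean memberCurrent = "CURRENT".equals(r.userAttributes().get("membershipStatus"));
        boolean planEligible  = ELIGIBLE_PLANS.contains(r.userAttributes().getOrDefault("plan", "UNKNOWN"));
        boolean trustedDevice = "true".equals(r.deviceAttributes().get("authenticated"));
        boolean lowRisk       = r.fraudRiskScore() < 0.8;   // the risk-adaptive part
        return memberCurrent && planEligible && trustedDevice && lowRisk;
    }
}
```

The point is that the decision is a function of user, plan, device, and risk attributes together, rather than a lookup of a role-to-permission mapping.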

Role Based Access Control (RBAC)

RBAC is simple. We could have started with that, but we realized the complexity was a little bit beyond that. Our system looks a little more like this. We have clients coming in to our Authz provider. Then we have different identity providers that we need to work with. We have our customer identity provider; sometimes those identities are linked to a corporate identity where the user might have additional features enabled. The customer might also have billing attached, so we sometimes need to check whether they're current on their billing or not. We have all these different dimensions to think about in terms of just the user. We also have a fraud detection system that we call into to determine if the user has engaged in suspicious behavior before; that provides a score, and we can write policies based on that. We also have customer presence. We have ways of knowing which country a user is in right now, and whether they're potentially on a VPN or not. We have all of that coming in. There are also device identity signals. Many of our devices are fully authenticated cryptographically, some are not, and we have different levels of trust for each of them. We know at least what the device is purporting to be and whether it has proved to be that. We can use that in our policies too. That's the lay of the land. We have a lot of different signals coming in. The question is really, how do you make that coherent? How do you scale it?

Enter PACS

What we did was decide to create a central authorization service, which we called PACS, our Product Access Service. We have several different types of clients for it. We have the UIs. We have different kinds of offline/online tasks that call into it, and they're asking different questions. The UIs are asking things like, what are the features that are enabled for this user overall? Can they do things like look at the profile gate? Can they browse the [inaudible 00:13:37]? Can they link their mobile and TV device? We also have offline tasks. One of those is something like building [inaudible 00:13:48] offline. We have the ability to classify our content into groups, and the groups are entered into the calculation that determines the kinds of content the user can see. That goes into the calculation to build that.

Messaging can also ask questions like, can this user see this title or not? That might tell them whether or not they should message about that title. Potentially, we also have some fraud-type messages that could go through that path as well. Then playback will ask questions like, can this user play back or download this particular title? We have different kinds of APIs that represent that, but it doesn't really matter; in the end, it's different use cases and different things that we're processing. We do that through a coherent set of policies, and the policies can reference each other inside of PACS. PACS can then reach out to fetch the user state from our identity provider, and can also look at request attributes, things like the Passport. Then we look at various other systems, like presence signals and fraud signals, to come up with our determination of whether or not we allow that access.
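As a rough illustration of what such a centralized decision interface could look like, here is a hypothetical sketch; the types and method names are invented for this example and are not the real PACS API.

```java
// Hypothetical sketch of a centralized product-access interface; not the real PACS API.
import java.util.Map;

enum Decision { ALLOW, DENY }

record Passport(String accountId, String deviceId, Map<String, String> claims) {}

interface ProductAccessService {
    // "Is this product feature enabled for this user on this device right now?"
    Decision authorizeFeature(Passport passport, String featureName);

    // "Can this user play back or download this specific title?"
    Decision authorizePlayback(Passport passport, String titleId, boolean download);
}
// UIs, messaging, playback, and offline jobs all ask their questions through an interface
// like this, so the answers come from one coherent set of policies rather than from rules
// copied into each microservice.
```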

Failure Domain Isolation

How do we get back the failure domain isolation? Having the authorization rules in each service was a good thing in some ways, because if somebody screws up one of the rules and pushes the code, it's not going to bring the whole system down. How do we achieve that same thing with a centralized authorization service? At Netflix we have different regions where we deploy basically the full stack of all our services; here they're listed as regions A, B, and C. Each region has a full stack of services. We've taken this further: within each region, we've sharded the PACS clusters by use case. We have one cluster for online discovery in region A, another one for playback in region B, and so on. I haven't listed them all here, but there are a number of them. We tend to push code or policy changes one shard at a time, so that we get this isolation effect. If one of them goes bad for some reason, and that's never happened, and never will, we can easily figure out what the problem was, roll back the push, and get back to a clean slate. We can do that without bringing the whole system down. That's how we were able to isolate failures.
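A minimal sketch of the idea of routing each use case to its own shard might look like the following; the use-case names and endpoints here are made up for illustration.

```java
// Illustrative only: routing each caller to a use-case-specific cluster so that a bad
// policy push to one shard cannot take every use case down at once.
import java.util.Map;

enum UseCase { ONLINE_DISCOVERY, PLAYBACK, MESSAGING, OFFLINE_TASKS }

final class PacsShardRouting {
    // Hypothetical per-use-case cluster addresses for a single region.
    private static final Map<UseCase, String> REGION_A = Map.of(
            UseCase.ONLINE_DISCOVERY, "pacs-discovery.region-a.internal.example",
            UseCase.PLAYBACK,         "pacs-playback.region-a.internal.example",
            UseCase.MESSAGING,        "pacs-messaging.region-a.internal.example",
            UseCase.OFFLINE_TASKS,    "pacs-offline.region-a.internal.example");

    static String endpointFor(UseCase useCase) {
        return REGION_A.get(useCase);
    }
}
```

Because pushes go out one shard at a time, a bad policy would only surface in, say, the discovery cluster of one region before it is caught.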

Scaling and Cache Consistency - Approach 1

Next, let's talk a little bit about scaling and cache consistency. This is the first approach we took. We don't want every client to call the PACS server for every request. If we did that, we would have very large clusters, it would be really expensive to maintain, and there would be an undesirable amount of latency. So we implement two different levels of caching. One is the request-scoped cache: if a client asks for the same result more than once within the scope of the same overall request, we just return the response immediately. That's all contained locally within the client. The second level is a distributed cache, represented here as Memcached. Whenever the PACS client gets a request, it first reaches out to Memcached and asks for that record. If it doesn't find it, it goes to PACS to calculate it. If PACS calculates something, it writes it back to Memcached. We do that to take the pressure off the client; we could theoretically write the response from the client as well, but we prefer to do it from the server side.
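A simplified sketch of that read path, assuming hypothetical stand-ins for the Memcached client and the PACS call, could look like this:

```java
// Minimal sketch of the two cache levels described above; all types are hypothetical stand-ins.
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

interface DistributedCache {               // stand-in for a Memcached client
    Optional<String> get(String key);
}

interface PacsServer {                     // stand-in for the PACS call
    String evaluate(String key);           // the server writes the result back to the cache itself
}

final class PacsClient {
    private final Map<String, String> requestScopedCache = new HashMap<>(); // level 1
    private final DistributedCache distributedCache;                        // level 2
    private final PacsServer server;

    PacsClient(DistributedCache distributedCache, PacsServer server) {
        this.distributedCache = distributedCache;
        this.server = server;
    }

    String authorize(String cacheKey) {
        // 1. Same answer within one overall request: no network call at all.
        String local = requestScopedCache.get(cacheKey);
        if (local != null) {
            return local;
        }
        // 2. Distributed cache, falling through to the server on a miss.
        String result = distributedCache.get(cacheKey)
                .orElseGet(() -> server.evaluate(cacheKey)); // server populates Memcached
        requestScopedCache.put(cacheKey, result);
        return result;
    }
}
```

In this sketch the server is still the one that populates the distributed cache after a miss, matching the write-from-the-server preference described above.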

That's all well and good, and actually it works pretty well. If you have a certain number of cached entries in your distributed cache, then the odds of hitting the server become lower, and your latencies improve, which is all good. The problem, of course, is that this is an authorization service and it has lots of impact on various pieces of the infrastructure. If the underlying data behind a decision changes, then you need some way to evict that cache entry. Our first approach to eviction was to do it asynchronously. Our identity provider would send us a stream of events, and we would process those events and go and delete the entries in our cache. What we realized was that this was not going to work, because the clients were asking for the answers again too quickly. We would have a quick page refresh after a change to a user's data, and we would need that change to be reflected almost immediately. The latency was really too high with this approach, so we ended up not doing that.

Scaling and Cache Consistency - Approach 2

What we've arrived at is something a little bit different, but it achieves the same result. Instead of having the IDP send events on changes, we have the IDP produce a dirty record, which is keyed by the account ID and basically just records the fact that the account is dirty. We put that into Memcached directly from where the user data is stored. That way, when the PACS client is looking for records, it always does a bulk read, and it always includes the account's dirty-record key as part of the request. If we get that dirty record back, we know we can't use the results from the cache and have to go back to the server. In doing that, we achieve the near-synchronous behavior we were looking for. We also didn't need to couple the two systems too much, because all the IDP really has to know is that it needs to write this dirty record; it doesn't need to know anything about the internals of the caching structure that PACS uses. There were two reasons for that: one, we didn't want to put too much knowledge into the identity provider; two, we wanted this near-synchronous approach. Doing that achieved the ends we needed. You can see some statistics here: we had about 4 million RPS on the client, and about 400k RPS on the server, so we get a very high cache hit rate on this.
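Here is a rough sketch of what that bulk read with a dirty-record check could look like; the key format and interfaces are hypothetical.

```java
// Sketch of the dirty-record check, assuming a bulk-read-capable cache client;
// the key formats and types here are illustrative, not the real ones.
import java.util.Map;
import java.util.Set;

interface BulkCache {
    Map<String, String> getBulk(Set<String> keys);   // missing keys are simply absent
}

interface PacsServer {
    String recompute(String accountId, String recordKey);
}

final class DirtyAwarePacsClient {
    private final BulkCache cache;
    private final PacsServer server;

    DirtyAwarePacsClient(BulkCache cache, PacsServer server) {
        this.cache = cache;
        this.server = server;
    }

    String authorize(String accountId, String recordKey) {
        String dirtyKey = "dirty:" + accountId;          // written by the IDP on any change
        Map<String, String> hits = cache.getBulk(Set.of(recordKey, dirtyKey));

        // If the account is marked dirty, the cached decision may be stale: recompute on the server.
        if (hits.containsKey(dirtyKey) || !hits.containsKey(recordKey)) {
            return server.recompute(accountId, recordKey);
        }
        return hits.get(recordKey);
    }
}
```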

Dynamic Fallback (Certain Use Cases)

We also have something called dynamic fallback, which means that we can potentially produce a result directly from the client if a call to PACS fails. Our Passport, our token-agnostic identity, contains some fallback data that we can use, along with some very simple rules that we keep in the client. For certain use cases, ones that are not highly sensitive, we can authorize the use of a particular feature without going to the server, if the server call were to fail. It's only for very non-critical, non-sensitive things. If we were to get a certain percentage of failures on the PACS server, we would still be able to keep moving this way. That's how we handle fallback.
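A minimal sketch of that fail-open behavior, with hypothetical names and an invented fallback rule, might look like this:

```java
// Sketch of a fail-open fallback for non-sensitive features, assuming the Passport
// carries a small amount of fallback data; all names here are hypothetical.
import java.util.Map;
import java.util.Set;

record Passport(String accountId, Map<String, String> fallbackData) {}

interface PacsServer {
    boolean authorizeFeature(Passport passport, String feature) throws Exception;
}

final class FallbackPacsClient {
    // Only these low-sensitivity features may be decided locally when PACS is unreachable.
    private static final Set<String> FAIL_OPEN_FEATURES = Set.of("PROFILE_GATE", "BROWSE");

    private final PacsServer server;

    FallbackPacsClient(PacsServer server) {
        this.server = server;
    }

    boolean authorizeFeature(Passport passport, String feature) {
        try {
            return server.authorizeFeature(passport, feature);
        } catch (Exception rpcFailure) {
            if (!FAIL_OPEN_FEATURES.contains(feature)) {
                return false;   // sensitive features stay fail-closed
            }
            // Very simple local rule over fallback data embedded in the Passport.
            return "CURRENT".equals(passport.fallbackData().get("membershipStatus"));
        }
    }
}
```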

Summary

In large-scale applications, a simplistic approach to authorization might not be good enough; there's just not enough flexibility there. You need a way to introduce this complexity without putting the complexity everywhere. You want things to be more risk adaptive, more policy driven, more easily changed, so it's probably a good idea to start centralizing those things. Centralizing the policies around Authz is a good idea, but you still want to handle things like failure domain isolation; you still want to create the right surface area for failures. Distributed caching is a good idea to help with performance, but if you're going to do that, then you need to worry about cache consistency.

Questions and Answers

Westelius: How do your services interact with PACS? Is the client embedded into applications, or is it a standalone process?

Nelson: We actually use a gRPC type model, mostly. We have a lightweight client that's generated from the proto definition. Then the lightweight client ends up getting embedded into those services. We actually ended up doing some custom filters inside of the gRPC client, so it's not entirely generated. We had to do that because we were doing more advanced caching. It's still a pretty lightweight client, so it really just has that simple gRPC type interface.

Westelius: Mike is asking about dynamic fallback. When the call fails open, are these rules encoded into the PACS client?

Nelson: For the dynamic fallback, we did have to do some custom code inside of the client, inside of those gRPC filters. The rules ended up being fairly simple. We just picked a few things that we knew would be called very commonly and for which we could use a fail-open model. Really, those just check the fallback data inside of the input, and run the rule if the gRPC call fails.

Westelius: I think that's really interesting as well, given how you're dealing with localized rules in the fallbacks.

Does PACS provide APIs for other services or systems to validate client permissions?

Nelson: Yes. We have a number of different APIs that we provide in PACS, so we can allow access to what we call features, we can allow access to videos. It's not really about asking if something has a particular facet or something, it's more about can I do this particular thing? Those are the kinds of questions that we like to answer. Can I play back this video? Can I show this cast type button on the screen on my UI? That's the kind of question that we're generally handling.

Westelius: I'm also a little bit curious, expanding on that a little, as we're thinking about things like trust for today's track: how much of these other signals that you mentioned, device information versus fraud information, is taken into consideration as you're showing these different types of content?

Nelson: We have different kinds of policies that apply in different cases. Often, it's the status of the user, whether they're paying or not, and the plan; these are the basic attributes. Then there's the plan metadata associated with that, so we know how many streams they're paying for, that sort of thing. On the device side, we have whether the device was fully authenticated or not, or whether it's a completely untrusted device. From that we also know what kind of a device it is and what that device should be doing, so we can use that in our policies. Then fraud is a layer on top of that, where we can say, if all those other attributes are correct and we think you don't have a high fraud score, then we're going to let you do this particular action. A layering approach is how I would think about it.

Westelius: It's also interesting given that it's not just in terms of what the user should be able to access in terms of permissions, but also, what are they able to access and in what format are they able to access it? It's really complex that way.

Nelson: We ended up doing that because we realized that a pure permissions based approach just on the user was just not expressive enough with what we were trying to do. Often, these device and other signals were really impacting the user experience and what they should be able to do, so we ended up with a more complex model.

Westelius: As you're thinking about the fragments that make up that identity, is there a way in which you consider it as a whole when you have a certain amount of data? Are you able to make those decisions when you might not have high trust, or might not have parts of that dataset?

Nelson: I think that's true. We don't always have a full picture of what's going on. For example, if a device was unauthenticated, then we don't really know the device identity. In those cases, we have to account for that. We don't really know what this device is, but what are the things that an unknown device can do? It ends up being that kind of a model.

Westelius: How do you handle internationalization, if different countries have different auth requirements? Do you handle it in a centralized form, or how do you take that into consideration?

Nelson: Yes, we do. We do have rules sometimes that are based on country. It really depends. There are a lot of different systems that deal with country at Netflix, and we have to work in the context of those other systems as well. We have our catalog system, which assigns the content windows to each title; we take advantage of that and look at it. We have our plan metadata, which allows users to sign up for particular plans, and those might be different per country. Then we have the actual request that's coming in, which is from a particular country, and it may or may not be on a VPN. We have to think about country in a very nuanced way, is what I'm saying. It's rare that we could say, just in this country, you can only do this thing. It's more like, all these other things have to be true as well. The video might have to be available in that country, the plan might have to be available, and the user might have to be coming from that country at that point in time. It's a complex thing, but we do include country in those rules.

Westelius: Expanding on that a little bit, is it so that the resource then is defined to be available from a certain area or country? Are those rules localized to the resource or is that something that you maintain in PACS?

Nelson: We take a higher level approach in PACS. We don't usually directly encode a particular resource, not in terms of like a video. The rule would be more like if this video is supposed to be available in this country and the user is coming from this country, then we're going to allow it. That would be the kind of thing that we would express. There might be certain APIs that we might only open up to certain countries. It's quite possible we would do that, and that might be encoded in a PACS policy.

Westelius: I didn't quite understand why it's better for the IDP to mark records as dirty rather than deleting the records. Could you elaborate on the differences a little bit?

Nelson: What we were trying to do was not build really tight coupling between PACS and the IDP. We wanted the contract to be very simple, so that we could expand in many different ways. In PACS we actually have many different kinds of records that we deal with as well. For every one of our different APIs, we have a different record type in the cache. That's because we need to encode the full response in the value: if the question is different, then the value we encode in the cache has to be different. What we didn't want to do is build nine different record types into the IDP and have it go delete them directly. It was too much. We figured if we just had one record that it wrote, that would be a sufficient level of coupling, without being a brittle system where, if we added another record type, we'd have to go and update the IDP again. Or maybe it would slow down the IDP to delete that many records. Really, it was just about coupling. I think we achieved a pretty good level of coupling with that single dirty record.

Westelius: With the PACS clients connecting directly to Memcached, how do you protect Memcached from being overwhelmed with the number of connections/requests?

Nelson: We just have very large Memcached clusters. We understand what our traffic patterns are, so we have very large clusters that can handle that, and we have autoscaling built in as well. It's not really been a problem so far. We're just used to operating at scale, and there are other services at Netflix that also operate at that same scale with Memcached.

Westelius: Are you saying that maybe this isn't for those who can't maintain really large Memcached clusters?

Nelson: It depends on the level of scale that you're dealing with, I think. If you have that kind of scale, you probably have some distributed caching that can also scale up to that level as well. The other way to think about it is we also have ways of failing out of a region. If we really had stress in a particular region, the entire Netflix service will fail over to another one. That's also a possibility.

Westelius: Have you considered JWT based access tokens so clients don't even need to hit Memcached?

Nelson: Yes. We originally thought, ok, what if we just did all our authorization rules at the beginning and then passed down that information, say, in our Passport. The JWT that you're talking about is equivalent to our Passport; they're very similar concepts. We ended up not doing that, because there was just a lot of information that we would need to encode inside of it, and we didn't really want to pass that around. We also didn't want the gateway to have too much knowledge about what things would be downstream. We settled on this approach where we just put in fallback data as necessary for the really critical use cases, and then pass that down. That model has worked a little better for us. You're right, it depends on what you're doing. If you have a simpler model and a simpler system, and fewer things that you need to authorize, then doing that right at the edge and passing it down is a perfectly valid model.

Westelius: Yes, very lengthy JWTs can definitely become a hassle. I know that for sure.

Do all server calls at Netflix include a Passport, including offline tasks or scheduled jobs?

Nelson: No. The Passport is really about the access tokens; it's our token-agnostic identity that comes through. It's not necessarily true that every call will have a Passport, because we don't always have one for the offline tasks. We have ways for those tasks to create them, but often an offline task is something that goes through almost every user that we have and might want to do an operation on each of them. Having them call a service to create a Passport and then call PACS is a bit too much overhead. We have different APIs that don't necessarily require a Passport. Those wouldn't have the same kinds of fallbacks either, because it doesn't make sense to have a fallback if you're just providing, say, a user ID. It's a little bit of a different model, but we can also support that with the same policies; just the interface is a little bit different.

Westelius: How does PACS work? Cross data center?

Nelson: PACS is deployed in each region. We tend to think about things in regions: we have a few major regions, and we make sure that everything is deployed in each of them. There are availability zones within each region, but we don't really worry about that too much. We have basically the equivalent model in each region, and each one can answer the questions exactly the same. The thing that could be a little bit different when a request gets moved from one region to another is that there's a slight possibility of a race condition. That's something that we just have to deal with. The case might be that if you changed the user's data in one region and then immediately went to another region, you might see something slightly different. In our experience, it's not the end of the world. We don't bounce between regions very often; we tend to keep the user within a given region, and it's only if we have an evac that we'll move them to another one. Generally, chasing those replication delays is just not a big issue. It's pretty fast. It's not a case where large numbers of users are going to see problems.

Westelius: Given that you mentioned that as a resolution to the failure domain isolation when that occurs, if by mistake, I've released something that turns out to not be effective, what does that look like for the users or for the system?

Nelson: If we did have a massive problem, where we were literally denying everything, then when we pushed to a given shard, the clients would start seeing failure responses from it. We would notice within probably about 20 seconds that that region was going bad, and then core would start evacuating the region if we didn't catch it before then. If we caught it before, we could roll it back. That's what would happen.

Westelius: How do you change the rules embedded into the PACS client? How do you deploy rule sets, essentially? Do you need to redeploy every service, or any of these services, when you make changes to these rules?

Nelson: We have a separate rule set repository that we use for our production rules. Those are what end up getting deployed. We have a separate Spinnaker pipeline that is capable of deploying those, so we actually deploy the code and the rules entirely separately. The rules use an internal rule language called Hendrix that was developed at Netflix for other purposes, but that we've adapted for use in PACS. It allows us to express the very complex trees of rules that we like. Then we have tests that run against those rules at a high level: is this rule valid? We also have tests that run the rules against the PACS code, so we know that the rule is valid in the PACS code as well. Our pipeline does CI on that, and then publishes to the regions in a specific order. That's how we achieve that.
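Hendrix itself is internal, so as a stand-in, here is a tiny sketch of the general idea: rules expressed as a composable tree of predicates, with a CI-style check that the tree gives the expected answers before it is published. Everything here is invented for illustration.

```java
// Illustrative only: not Hendrix. A sketch of rules as a composable tree of predicates,
// validated against known inputs before a pipeline would publish them region by region.
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// A rule is a predicate over the attributes of a request.
interface Rule extends Predicate<Map<String, String>> {
    static Rule allOf(List<Rule> rules) {
        return attrs -> rules.stream().allMatch(r -> r.test(attrs));
    }
    static Rule anyOf(List<Rule> rules) {
        return attrs -> rules.stream().anyMatch(r -> r.test(attrs));
    }
}

final class RuleSetCheck {
    public static void main(String[] args) {
        // A made-up "early access" rule: member is current AND (premium plan OR beta tester).
        Rule earlyAccess = Rule.allOf(List.<Rule>of(
                attrs -> "CURRENT".equals(attrs.get("membershipStatus")),
                Rule.anyOf(List.<Rule>of(
                        attrs -> "PREMIUM".equals(attrs.get("plan")),
                        attrs -> "true".equals(attrs.get("betaTester"))))));

        // CI-style checks: the rule tree gives the expected answers for known inputs.
        check(earlyAccess.test(Map.of("membershipStatus", "CURRENT", "plan", "PREMIUM")));
        check(!earlyAccess.test(Map.of("membershipStatus", "EXPIRED", "plan", "PREMIUM")));
        System.out.println("rule set checks passed");
    }

    private static void check(boolean condition) {
        if (!condition) {
            throw new AssertionError("rule produced an unexpected decision");
        }
    }
}
```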

 


 

Recorded at: Jul 08, 2022
