
User & Device Identity for Microservices @ Netflix Scale


Summary

Satyajit Thadeshwar provides useful insights on how Netflix implemented a secure, token-agnostic, identity solution that works with services operating at a massive scale. He shares some of the lessons learned from this process, both from architectural diagrams and code.

Bio

Satyajit Thadeshwar is an engineer on the Product Edge Access Services team at Netflix, where he works on some of the most critical services focusing on user and device authentication. He has more than a decade of experience building fault-tolerant and highly scalable, distributed systems.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Thadeshwar: I'll ask a question, first of all. How many of you are subscribers or have been a subscriber in the past? Can we have some show of hands? Ok, most of you. As a Netflix subscriber, maybe you have experienced this: you are trying to watch a title on Netflix, let's say on your TV, and you found out that you have been logged out of your account on that TV. We can imagine how annoying that can be for our customers, because logging back in on the TV using a TV remote is no fun, especially if your Netflix account email includes your full name and your full name is as long as mine.

On the server side, we categorize the errors that cause this as user reauth errors. If these user reauths were to happen on a larger scale, it would cause a drop in some of our core streaming metrics, because if people get logged out, they can't stream. It might trigger an outage incident, an outage that would take much longer to resolve even after the server-side issues are fixed, because people may not log back in on their devices, or maybe they don't remember their password. Above all, this just adds more friction between our customers and the content that they're trying to watch. Netflix tries to eliminate all such friction all the time.

My name is Satyajit, and I'm on the Product Edge Access Services team at Netflix, PEAS for short, and that's our team logo. Yes, we love acronyms and we love creating logos. Our team operates some of the most critical services at Netflix, which manage the life cycle of the authentication tokens that we send out to customer devices. We ensure that all Netflix subscribers around the world remain logged in on their devices and are not logged out because of any server-side issues or any changes that we make to our services. In a large distributed services environment, failures are inevitable. In fact, a couple of years ago, whenever an authentication or identity-related issue was reported, it used to take much longer to resolve because our systems were very complicated.

Here is a screenshot from one of the prod incidents from back then. We use Jira for prod incidents. Most of the information here is redacted, but what I want to highlight are these two things. Just look at the number of people and number of teams involved to resolve this incident. This is not usual for Netflix. That's our team, by the way, playback access, which is shown with the highlighted arrow. We were part of the bigger playback org back then. Whenever such an issue was reported, we had to include people from multiple teams because there were multiple systems that were dealing with authentication tokens and resolving identities. Also, we did not have enough visibility into these systems, which just made it harder to find the root cause. At some point, we wanted to simplify this and streamline the authentication and identity resolution parts and hence, we decided to rearchitect our system. That is what I'm going to talk about today.

Also, this talk is not about service-to-service communication within the Netflix ecosystem; for that, we use mutual TLS with self-signed certs. This is about authenticating Netflix subscribers, like some of you folks here, and the devices that they use.

Today I'll talk about where we were with regard to access and identity management, what we did as part of that rearchitecture, and what wins we saw as a result.

Where We Were

To better explain where we were, let me run you through a very simple user login flow, which will illustrate how we used to issue the authentication tokens and how we used to validate those tokens server-side. Let's say you are trying to log into the Netflix app on your TV. You enter your email, you enter your password, and you hit Next. That request then lands on Netflix's edge proxy, which is shown here as Zuul. Zuul is our primary edge proxy, which is a gateway for all traffic coming into Netflix. Think of Zuul as something similar to an L7 proxy, which routes requests to the origin services that are deployed behind it. Additionally, Zuul also allows us to run custom filter code that can act on a request or a response depending on the type of filter. Throughout this talk, I use Zuul and edge interchangeably.

Coming back to this flow, from Netflix's perspective, your TV on which the Netflix app is installed has a unique identity, which we call the ESN, or Electronic Serial Number. Along with your credentials, this ESN is also passed in the login request. That request then gets routed to one of our origin servers, that is the API, which then calls one of the mid-tier services, not surprisingly called auth service, to validate your credentials. Once auth service validates your credentials, it issues an HTTP cookie with the customer and device identity information in it, and eventually that cookie is sent back to the device as a set-cookie header, which the device stores in its cookie store.

This is, at a very high level, how the user login flow used to issue the cookie, which is the authentication token here. Once you log in, in order to present you the Netflix homepage, the app makes another request with the previously acquired cookie. Again, this request lands on one of the API servers as routed by the edge proxy. To authenticate this request and to extract the identity present in this cookie, the API server needs to decrypt the cookie because it is encrypted. In order to do that, it needs access to a specific cryptographic key which is provisioned by the key management service that we have built at Netflix. For those of you who are not familiar, a key management service provides storage and access control for cryptographic keys. The server takes this key, cracks open the cookie, and if the cookie is valid and not expired, it sends the customer ID and ESN information downstream, eventually generating a Netflix homepage personalized for that customer ID. All the tokens that we send to these customer devices have a set expiration. If the cookie was expired, the API server would additionally renew it before sending the customer ID and ESN information downstream.
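
To make that flow concrete, here is a minimal, hypothetical Java sketch of the per-request cookie handling just described: decrypt the cookie with the key from the key management service, renew it if it has expired, and pass the resolved identity downstream. Class and method names such as CookieCrypto and CookieIssuer are illustrative assumptions, not Netflix's actual code.

```java
import java.time.Instant;

// Illustrative sketch only; types and names are assumptions, not Netflix's API.
public class CookieAuthenticator {

    /** Identity extracted from a valid cookie: the customer ID and the device ESN. */
    public record Identity(long customerId, String esn) {}

    public record AuthResult(Identity identity, String renewedCookieOrNull) {}

    // Collaborators, kept minimal so the sketch is self-contained.
    public interface CookieCrypto { DecryptedCookie decrypt(String encryptedCookie); }
    public interface CookieIssuer { String renew(DecryptedCookie expired); }
    public record DecryptedCookie(long customerId, String esn, Instant expiresAt) {}

    private final CookieCrypto crypto;   // wraps the key provisioned by the key management service
    private final CookieIssuer issuer;   // mints and renews cookies

    public CookieAuthenticator(CookieCrypto crypto, CookieIssuer issuer) {
        this.crypto = crypto;
        this.issuer = issuer;
    }

    /** Decrypts the cookie, renews it if expired, and returns the resolved identity. */
    public AuthResult authenticate(String encryptedCookie) {
        DecryptedCookie cookie = crypto.decrypt(encryptedCookie);  // fails if tampered or invalid
        String renewedCookie = null;
        if (cookie.expiresAt().isBefore(Instant.now())) {
            // Expired but otherwise valid: renew before sending identity downstream.
            renewedCookie = issuer.renew(cookie);
        }
        return new AuthResult(new Identity(cookie.customerId(), cookie.esn()), renewedCookie);
    }
}
```

Every service that consumed cookies had to run something like this in its own process, which is exactly the redundancy and key-provisioning concern discussed next.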

This is how we used to authenticate the request. As you can see, it's a pretty straightforward and standard flow. But we had some problems. First of all, we had more than one service consuming cookies. This is a very high-level architecture diagram of a part of the Netflix ecosystem. Each block that you see here represents a cluster of AWS EC2 instances. As you can see, we have more than one of what we call origin services deployed behind Zuul. We have an API, we have a legacy API, and we have a service called device auth service. API servers formed our primary API service layer, which hosted endpoints for different functionalities for different device types or platforms like iOS, Android, CE devices, etc.

Device auth service plays a very special role in performing device auth, which I'll talk about later. Legacy API service, as the name suggests, is used by older legacy devices. All these services were consuming cookies. There were even some mid-tier services consuming cookies because those systems wanted to extract some information that was present in the cookie. All this means that we had to provision key access for all these services, which was a security concern. Also, these requests go to multiple mid-tier services and multiple origin services, so not having a central place for identity resolution caused discrepancies and a lot of errors, which were exacerbated by the lack of visibility into these systems. Also, as is apparent, all these services were doing the redundant work of decrypting tokens and resolving identity, which was very inefficient, especially at the scale we operated at.

To give you an idea of scale, Netflix has more than 158 million subscribers worldwide. There are more than a billion devices with a Netflix app on them, and all those devices make requests to Netflix services. We absorb more than 2 million requests per second at peak at the edge layer. What this number translates to is that the code decrypting the token and extracting identity is getting called 2 million times per second, and with more than one service doing it, that number multiplies that many times. Even though we use a symmetric cryptographic key here, there is still a CPU cost incurred for each request. On top of multiple services consuming cookies at this massive scale, we had one more problem, which was the biggest factor in making our systems more complicated: the cookie was not the only token type being used. We support multiple types of tokens for different categories of requests.

Let's look at the types of tokens we support. We already saw cookies; cookies are used in requests for the signup flow, meaning when you subscribe to Netflix, for login, and for discovery. When I say discovery, that means a user is trying to discover some content to watch, either by browsing the homepage, or a genre page, or using the search functionality. All these requests use a cookie as the form of authentication.

Then we have something called Message Security Layer, which is an internal security protocol developed by Netflix, abbreviated as MSL, and we call the corresponding tokens MSL tokens. Think of MSL as something equivalent to TLS in terms of the features that it provides. Without going into too much detail on MSL, at a very high level, MSL provides us with device authentication and encryption. Netflix has a very critical business requirement of authenticating devices, not just users, because we have contractual obligations with studios as well as content creators to secure the content that we show on our service.

Authenticating a device means validating a device's identity, which is done by employing a particular device auth scheme. We use different device auth schemes for different device types, and we put an appropriate level of trust in a device's identity based on the auth scheme used. Sometimes these auth schemes also leverage the existing DRM on the device, so the more secure the DRM, the more trust we put in the device's identity. Eventually, this trust or assurance is used by services to make authorization decisions. To give you an example, the level of trust that we put in a Chrome browser on macOS is very low compared to, let's say, the latest TV that you can buy from Best Buy, because the TV, most likely, will employ a much more secure auth scheme. Because of this low level of trust in that Chrome browser on macOS, we will not serve 4K streams on that browser, even if 4K is available for that title.
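
As a hedged illustration of that kind of trust-based authorization decision, the sketch below gates 4K playback on the device's trust level; the enum values and method are assumptions made for this example, not Netflix's actual API.

```java
// Illustrative only: trust levels and the 4K rule are assumptions for this sketch.
public final class StreamAuthorization {

    public enum DeviceTrust { LOW, HIGH, HIGHEST }

    /** Only devices whose auth scheme earned at least HIGH trust are served 4K streams. */
    public static boolean mayServe4k(DeviceTrust trust, boolean titleHas4k) {
        return titleHas4k && (trust == DeviceTrust.HIGH || trust == DeviceTrust.HIGHEST);
    }

    public static void main(String[] args) {
        System.out.println(mayServe4k(DeviceTrust.LOW, true));      // false: e.g. Chrome on macOS
        System.out.println(mayServe4k(DeviceTrust.HIGHEST, true));  // true: e.g. a TV with secure DRM
    }
}
```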

We use this MSL protocol for license and playback requests, which are very critical from Netflix's business point of view. Then we have tokens called CTicket tokens, which are part of the legacy authentication protocol that Netflix developed during the early days of streaming. These tokens are used by very old legacy devices which are still in use, and we have no way of updating the Netflix app on these devices. Do you folks still own the TV that you bought 10 years ago with the Netflix app on it? I still do. If the Netflix app was hitting a legacy service endpoint back then, chances are that it is still hitting it, because we can't upgrade the Netflix app on it. Because of this, we have to continue to support the legacy service, the corresponding authentication protocol, and the tokens.

Then we have the JWS and JWE variants of JWTs that we use for some partner integrations. Lately, Netflix has been partnering with device manufacturers to merchandise Netflix content to members as well as nonmembers, and sometimes from the partner's UI itself. For these integrations, we built specific APIs and we chose to use an open standard like JWT to better integrate with partner infrastructure. When these tokens come to our services, we have to extract identity from them as well.

I talked about at least four different types of tokens which are used for these categories of requests, which looks something like this. This is the same architecture diagram that I showed you earlier, but with all the tokens flowing in our system like this. Just to summarize, we had multiple services consuming multiple types of auth tokens at a very massive scale, which just made it very inefficient, insecure, and complicated. If this was not enough, we were thinking of building a client side services layer.

Earlier, I mentioned that the API server hosted these endpoints for different device types or platforms. With this new initiative, there was a plan of migrating these endpoints to this new platform, where they would be deployed as independently running Node.js servers on containers. We were also planning to build new API stacks for these services. We definitely didn't want these new services running JavaScript code to start consuming tokens like the existing systems. As you can see, we did not consciously choose this architecture that we had for authentication. It happened gradually and organically over time as business requirements grew. We had reached a point where we wanted to simplify this. Even though we wanted to simplify it, we had to keep the existing protocols intact. We couldn't start from scratch and abandon those billion-plus devices already being used.

What We Did

What we did was move authentication to the edge. What I mean by that is, instead of all these tokens flowing into the origin servers and mid-tier services, we started terminating these tokens at the edge layer, where we resolve identities and send them downstream, so that any service deployed behind Zuul does not have to worry about or deal with these tokens. We also created new microservices to handle the life cycle of these tokens, where each microservice handles the life cycle of just one type of token. As you can see from this diagram, there's a dedicated service which handles the life cycle of cookies, a dedicated one for MSL tokens, and so on. When I say the life cycle of a token, it involves creating a new token, updating it, or renewing it if it is expired.

We call this shared authentication code, running in Zuul as well as in these microservices, EAS, or Edge Authentication Services. Let's zoom into this EAS layer a little bit, which is shown here by the dotted line. In this architecture, the EAS layer is structured in such a way that the code running in Zuul can handle authentication as well as identity resolution for 95% of the total requests without making any remote calls to any of these edge authentication servers. These are the requests where the token coming with the request is valid and not expired, so all it takes for the code running in Zuul is to decrypt the token, resolve the identity, and send it downstream.

For the remaining 5% of the requests, however, the code running in Zuul does need to make a remote call to one of these EAS servers. These are the requests where the token coming with the request is expired, so we need to renew it, or where, say for an MSL request, we need to perform a device auth or some sort of cryptographic key exchange. This split of 95% and 5%, where for only 5% of the requests we need to make remote calls from Zuul, helped a lot with the resiliency of Zuul, because at the end of the day, Zuul is our primary edge proxy and the entry point for all requests into Netflix. Resiliency is very important to us, because if Zuul is down, Netflix is down, and nobody can stream or even browse.
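
A rough sketch of that 95%/5% split, under the assumption that the Zuul filter decrypts and resolves valid tokens in-process and only calls out to an EAS microservice for the expired or device-auth cases. The class names are illustrative, not the actual EAS code.

```java
// Hypothetical shape of the edge-side decision; names are illustrative.
public class EdgeAuthFilter {

    public record DecodedToken(long customerId, String esn, boolean expired) {}
    public record Passport(long customerId, String esn) {
        static Passport from(DecodedToken t) { return new Passport(t.customerId(), t.esn()); }
    }

    public interface TokenDecryptor { DecodedToken decrypt(String token); }        // local, no network
    public interface CookieServiceClient { Passport renew(DecodedToken expired); } // remote EAS call

    private final TokenDecryptor decryptor;
    private final CookieServiceClient cookieService;

    public EdgeAuthFilter(TokenDecryptor decryptor, CookieServiceClient cookieService) {
        this.decryptor = decryptor;
        this.cookieService = cookieService;
    }

    public Passport resolve(String token) {
        DecodedToken decoded = decryptor.decrypt(token);
        if (!decoded.expired()) {
            // ~95% of requests: token is valid, resolve identity in-process, no remote call.
            return Passport.from(decoded);
        }
        // ~5% of requests: renewal (or device auth / key exchange) requires a remote EAS call.
        return cookieService.renew(decoded);
    }
}
```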

To further improve its resiliency, we also built fallbacks while designing this layer. You might've heard that there is no fallback for authentication, which is true: if the main auth service that is validating your users' credentials is down, there's nothing you can do about it. But there are some cases wherein we have some leeway to issue fallbacks when we can't talk to one of our dependency services. To explain a fallback scenario, let's take expired cookies as an example. By the way, whenever we renew a cookie, or for that matter any token, we also verify whether the user is a paying member. We do that by making a remote call to a service called subscriber service, which eventually does a database lookup.

Let's say the cookie coming with the request is valid, meaning we could decrypt it and verify that we had previously issued it, but it is expired. The EAS code running in Zuul tries to renew that cookie by making a remote call to the cookie service. That call to the cookie service might fail because, let's say, the cookie service is down, or it can't talk to one of its dependencies like subscriber service, or there is a network error. Network errors are a fact of life in a distributed services environment.

Let's say that call fails for whatever reason. We will not fail the original incoming request; instead, we will mark this renewal call as Rescheduled and still send the resolved identity downstream. While responding back to the device, we will send a cookie, which we call a rescheduled cookie, with a very short expiration time. Because of this very short expiration, the cookie will expire pretty soon, and when the same device makes another call with this expired rescheduled cookie, that is when we will renew the cookie, assuming that the server-side issues are resolved. This is an example of a fallback that we have in place which prioritizes user experience over security.
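
Here is a small sketch of that fallback, assuming a short rescheduled TTL and illustrative names; the actual expiration values and minting mechanics at Netflix are not specified in the talk.

```java
import java.time.Duration;

// Sketch of the "rescheduled cookie" fallback; duration and names are assumptions.
public class CookieRenewalWithFallback {

    private static final Duration RESCHEDULED_TTL = Duration.ofMinutes(15); // illustrative value

    public record DecodedCookie(long customerId, String esn) {}
    public record RenewalOutcome(String setCookieValue, boolean rescheduled) {}

    public interface CookieServiceClient {
        String renew(DecodedCookie cookie);                        // remote call to the cookie service
        String mintShortLived(DecodedCookie cookie, Duration ttl); // local fallback mint at the edge
    }

    private final CookieServiceClient cookieService;

    public CookieRenewalWithFallback(CookieServiceClient cookieService) {
        this.cookieService = cookieService;
    }

    public RenewalOutcome renew(DecodedCookie expiredButValid) {
        try {
            return new RenewalOutcome(cookieService.renew(expiredButValid), false);
        } catch (RuntimeException e) {
            // Cookie service down, subscriber service unreachable, or a network error:
            // do NOT fail the request; issue a short-lived rescheduled cookie so the
            // device retries renewal soon, and still send the resolved identity downstream.
            String rescheduled = cookieService.mintShortLived(expiredButValid, RESCHEDULED_TTL);
            return new RenewalOutcome(rescheduled, true);
        }
    }
}
```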

Going back to this current architecture diagram, the next logical question would be: in what form do we send the identity information downstream? For this, we created a new identity structure called Passport. What is Passport? It is an identity structure created at the edge for each request and consumed by services in the scope of that same request. It contains user and device identity information. It is internal to the Netflix ecosystem, meaning it is an internal identity token that we don't send back to the device.

For security, it is integrity-protected by HMAC. For those of you who are not familiar with it, HMAC stands for hash-based message authentication code. In cryptography, it is used to verify the authenticity and integrity of the data being transmitted, and it involves using a cryptographic hash function and a symmetric key. We added HMAC protection to Passport for any service that wants to do an additional verification of the integrity of the data present in the Passport.
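
For readers unfamiliar with HMAC, here is a minimal example using the JDK's javax.crypto API. The key handling is simplified; in practice the symmetric key would come from the key management service, and this is not necessarily the exact scheme Passport uses.

```java
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.security.MessageDigest;

// Minimal HMAC-SHA256 sign/verify; a simplified illustration, not Passport's exact scheme.
public final class PassportIntegrity {

    private static final String ALGO = "HmacSHA256";

    /** Computes the integrity bytes stored alongside the serialized identity data. */
    public static byte[] sign(byte[] infoBytes, byte[] hmacKey) throws Exception {
        Mac mac = Mac.getInstance(ALGO);
        mac.init(new SecretKeySpec(hmacKey, ALGO));
        return mac.doFinal(infoBytes);
    }

    /** Verifies that the identity bytes were not tampered with in transit. */
    public static boolean verify(byte[] infoBytes, byte[] expectedMac, byte[] hmacKey) throws Exception {
        // Constant-time comparison avoids timing side channels.
        return MessageDigest.isEqual(sign(infoBytes, hmacKey), expectedMac);
    }
}
```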

Lastly, it is in Protobuf format, because most backend services at Netflix use gRPC, so Protobuf was a natural choice. We can just put the Passport binary proto data in a gRPC request and the framework takes care of serialization and deserialization for us. Here is what a Passport Protobuf message looks like. For those of you who are not familiar with Protobuf, just think of this as something similar to a data class in Kotlin or a plain old Java class which has some member variables and holds data. The indices there are part of the Protobuf syntax, wherein you need a unique index for each field in your message.

Let's look at what information Passport contains. Firstly, it has a header with some metadata about the Passport. Then it has two buckets of information: User Info and Device Info. As the name suggests, User Info stores the user or customer identity information, mainly the customer ID and the account owner ID. Device Info stores the device identity information, mainly the ESN and the device type. If you notice, both User Info and Device Info have two fields which have a very special purpose for us: Source and authentication level. Source indicates how we resolved the identity, or what claim was presented to resolve the identity, meaning did we resolve a user's identity using a cookie, an MSL token, or some other claim?
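
Since the Protobuf screenshot isn't reproduced in this transcript, here is a rough Java mirror of the structure as described: a header plus User Info and Device Info buckets, each with a Source and an authentication level. Field and enum names are approximations, not the exact proto schema.

```java
// Approximate shape of the Passport described above; not the actual proto definition.
public record Passport(Header header, UserInfo userInfo, DeviceInfo deviceInfo) {

    public enum Source { COOKIE, MSL, CTICKET, JWT, USER_CREDENTIALS }   // illustrative values
    public enum AuthenticationLevel { LOW, HIGH, HIGHEST }

    public record Header(int version, long createdTimeMs) {}

    public record UserInfo(long customerId, long accountOwnerId,
                           Source source, AuthenticationLevel authenticationLevel) {}

    public record DeviceInfo(String esn, String deviceType,
                             Source source, AuthenticationLevel authenticationLevel) {}
}
```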

Here is a list of some of the claims that we use to resolve the identity. We launched Passport with the Source field and got feedback from our customers who wanted a higher abstraction than Source. What we did was group these sources into three authentication-level buckets: low, high, and highest. An authentication level signifies the level of trust that we put in a particular claim. For example, an authentication level of highest means we resolved the identity using an MSL token or a user-credentials claim. If you remember, earlier I mentioned that we use the MSL protocol for license and playback requests. A license server, while using the identity in the Passport, can do an assertion on the authentication level for user info or device info. It can assert that it has to be an authentication level of highest; otherwise, it can fail the request.
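
That assertion might look something like the following sketch, in which a license or playback service refuses to proceed unless both user and device identities were resolved at the highest level; the method and enum here are assumptions for illustration.

```java
// Illustrative guard; the enum and method are assumptions based on the description above.
public final class LicenseRequestGuard {

    public enum AuthenticationLevel { LOW, HIGH, HIGHEST }

    /** Fails the request unless both user and device were resolved with the strongest claims. */
    public static void assertHighestTrust(AuthenticationLevel userLevel, AuthenticationLevel deviceLevel) {
        if (userLevel != AuthenticationLevel.HIGHEST || deviceLevel != AuthenticationLevel.HIGHEST) {
            throw new SecurityException("License requests require MSL or credential-based identity");
        }
    }
}
```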

Lastly, as I mentioned, the identity information in Passport is integrity-protected. This is where we store the user and device integrity information: we take the HMAC of the User Info object and put those bytes in User Integrity, and we take the HMAC of the Device Info object and put those bytes in Device Integrity. This Passport is passed in a gRPC request either as a binary blob of bytes or as a Base64-encoded string representing those bytes. We also built a wrapper on top of this Passport binary data called Passport Introspector. This is what the Java interface for that Passport Introspector looks like: it has a bunch of getter methods to access the device and user identity information. We built this wrapper so that if, in the future, we want to move to a different wire format for Passport, we can do that without affecting every consumer, because of this abstraction. When a service gets Passport binary data in a request, it creates an instance of this Passport Introspector using a factory that we have provided.
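
The interface itself isn't reproduced in the transcript, but based on the description it might look roughly like this sketch: a set of getters over the binary Passport data plus a factory that hides the wire format. Method names are illustrative assumptions.

```java
// Hypothetical sketch of the Passport Introspector contract described above.
public interface PassportIntrospector {

    long getCustomerId();
    long getAccountOwnerId();
    String getEsn();
    String getDeviceTypeId();
    String getPassportAsString();   // Base64-encoded form, for passing the Passport along

    /** The factory hides the wire format so it can change without breaking consumers. */
    interface Factory {
        PassportIntrospector create(byte[] passportBinaryData);
        PassportIntrospector create(String base64EncodedPassport);
    }
}
```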

Passport Introspector is for services to consume Passport data programmatically at runtime, but there was also a need for humans to introspect the Passport data for debugging purposes. Here is a screenshot of one of the tools that we provide, which teams can use to decrypt a Base64-encoded Passport string and see what is in there. If you folks noticed, both User Info and Device Info have a list of something called Actions. Along with Passport, we also introduced an interesting concept of actions, which we call Passport Actions, because the Passport here is the carrier for these actions.

Earlier I mentioned that Passport is an internal identity token which is not sent back to the device. Whenever an identity mutation happens, we need to send the updated identity back to the device in a token that it understands, and that is where Passport Actions come into play. When a mutation happens, the downstream service that actually performs the mutation in that request sends an explicit signal in the form of a Passport Action while responding to the request. This signal is used by the EAS layer at the edge to create or update the corresponding type of token, and that token is then sent back to the device.

Let's look at some of the examples where we use these Passport Actions for identity mutations. Can you folks think of a very basic example of an identity mutation? User login. Before user login happens, we don't know the identity of the user, and we call them a non-member. After user login, once we have verified that they are a paying member and have authenticated their credentials, we call them a Current Member. We need to send this current-member identity back to the device in the form of a token that the device understands. If the device understands cookies, we need to send current-member cookies back, and if the device understands the MSL protocol, we need to send current-member MSL tokens back.

Let's look at the same user login flow in this new architecture. Just like before, to log into the Netflix app you enter your credentials, you hit Next, and that request lands on one of the API servers as routed by the edge proxy. The new thing in this architecture is that the EAS code running in Zuul creates a Passport for this login request and sends it to the API. The Passport just has device identity information, that is the ESN, because we don't know the user identity yet. The API then calls auth service to validate the credentials and sends the Passport along. Once auth service successfully validates the credentials, unlike before where it used to issue cookies, it now sends back a Passport with the updated current-member identity along with the user login Passport Action. The EAS code running in Zuul sees this user login action along with the updated Passport, and that is an explicit signal that it needs to create new cookies for the current-member identity. The cookie service then creates a new cookie, and that cookie is sent back to the device as a set-cookie header as usual.
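
The edge-side handling of that explicit signal could be sketched as follows, with the enum values and client names being assumptions: the auth service returns an updated Passport plus a user-login action, and the EAS code turns the action into a set-cookie for the device.

```java
import java.util.List;

// Illustrative sketch of reacting to a Passport Action at the edge; names are assumptions.
public class PassportActionHandler {

    public enum PassportAction { USER_LOGIN, USER_LOGOUT, PROFILE_SWITCH }

    public record Passport(long customerId, String esn) {}
    public record DownstreamResponse(Passport updatedPassport, List<PassportAction> actions) {}

    public interface CookieServiceClient { String createCookie(Passport passport); }

    private final CookieServiceClient cookieService;

    public PassportActionHandler(CookieServiceClient cookieService) {
        this.cookieService = cookieService;
    }

    /** Runs at the edge after the origin responds; emits set-cookie header values if needed. */
    public List<String> toSetCookieHeaders(DownstreamResponse response) {
        if (response.actions().contains(PassportAction.USER_LOGIN)) {
            // Explicit signal from the downstream service: mint current-member cookies
            // for the updated identity and send them back to the device.
            return List.of(cookieService.createCookie(response.updatedPassport()));
        }
        return List.of();
    }
}
```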

As you can see, this Passport Action acts like an adapter pattern: you have an internal token, the Passport, and an external token, the cookie, and the Passport Action bridges the gap between the two. It also provides us with a very clean separation of concerns. Auth service is only responsible for authenticating users and validating credentials, and token creation is taken care of by the EAS layer. Because of this, the auth service does not need to know about the different types of tokens that need to be sent.

Another example of an identity mutation is a profile switch. You folks probably have multiple profiles on your Netflix account; it's a nice feature. Each profile has its own identity because we show personalized content for each profile. Whenever you switch to a different profile, a downstream service is switching your identity and fetching a new homepage for that identity. When this happens, we need to send the switched-profile identity back to the device in the form of a token that it understands. These are just a couple of examples of where we use these Passport Actions, out of 30-plus different flows which we have at Netflix.

These Passport Actions have two benefits. Firstly, as I mentioned, they provide a clean separation of concerns, and second, since all identity mutation now happens via these Passport Actions, we can instrument these actions with much better logging and metrics, which gives us much greater visibility than we had earlier.

Just to summarize, we moved authentication to the edge and streamlined the identity resolution as well as the mutation parts, making the consumption of user and device identity via Passport much more efficient, secure, and simple.

Wins

Let's look at some of the wins that we saw as a result of this rearchitecture. First and foremost, we moved to a token-agnostic identity model where the systems consuming identity did not have to worry about any of the authentication concerns. This was a very big win in itself. It greatly simplified their codebases because they could now remove all the authentication-related code, and it also simplified their operational model because authentication and identity were no longer their concern. They would just consume the identity present in Passport and focus on their business logic.

It also greatly simplified authorization, because we now put enough claims in the Passport while resolving the identity, which the downstream services can use to make authorization decisions. Earlier, these services were just handed two values, a number for customer ID and a string for ESN, and there was no way to know how the identity was resolved just by looking at those values.

We also moved to a model that is much more extensible. It's no longer just two values for user and device identity; services now consume Passport as the form of identity, which is much more extensible. If we have to add new attributes about a user or device identity, we can do so. In fact, we see a pattern at Netflix where, once a service gets an identity in a request, it takes that identity and makes a remote call to a service called Subscriber Service to get more information about the subscriber, meaning their membership status or which plan they are on. There are multiple services in the same request path making the same redundant call to Subscriber Service. Looking into the future, we could possibly call Subscriber Service from Zuul and put the latest subscriber data in the Passport for other services to consume, thereby avoiding all these redundant calls to Subscriber Service. The point here is that since we have this extensible structure, it gives us more opportunities for such optimizations.

As I mentioned, we offloaded token processing from all mid-tier services and all origin services, which resulted in significant gains for many of their system metrics. We were also able to fine-tune our EAS services based on the token processing profile, which was not possible earlier because, for example, an API origin service would process all the tokens in the same JVM. To give you an example, the MSL protocol is very CPU-intensive and adds a lot of heap pressure. We were able to fine-tune the MSL server by choosing an instance type with more compute units and a more aggressive GC settings profile.

Speaking of gains, let's take the API server as an example. Here is a graph of the CPU-to-RPS ratio as well as load average for an API server; for both of those metrics, a lower number is better. We saw a 30% reduction in CPU cost per request and a 40% reduction in load average, which roughly contributed to half a million dollars in savings per year in EC2 costs. Those gradual drops that you see are because we launched token termination at the edge during those periods via a percentage-of-requests dial.

Here is another graph, which shows the response time for the API server. We saw a 30% reduction in average latency for API responses, and P99 dropped by 20%. Since most of our servers run on the JVM, we have garbage collection; we saw a significant reduction in GC pressure and GC pause times for the API cluster. We now have much greater visibility into the identities flowing in and out of the Netflix system and also into the identity mutations happening in a given request. That means we can resolve identity-related issues much faster now. This is because we now own the token creation as well as identity resolution parts, and we could instrument both of those with much better metrics and logging, which provides us with this increased visibility.

I cannot emphasize this point enough. Earlier, we had to consult multiple teams and touch multiple services' code bases in order to make an authentication- or identity-related change. Now, it just involves one of our services, greatly simplifying the developer experience. A change that used to take multiple weeks now takes just a few days.

Last but not least, this architecture created a separation of concerns among all the teams. Because of this separation of concerns, our team could focus on server-side security and gradually make it better and more secure. Other teams did not prioritize security because it was not their primary charter and, more importantly, they had other product features to focus on.

Just to summarize, these are some of the key learnings from this rearchitecture. If you are in a similar boat to us and you don't have the luxury of implementing your authentication from scratch, maybe some of these learnings could help you with some of your decisions.

Questions and Answers

Participant 1: I have two questions. The first one: I have an edge case for you, and I want to hear how you handle it. Suppose I use a TV to log in to Netflix, and you mentioned that your cookie service could be down. I was logged in, then your cookie service goes down, and you send me a rescheduled cookie with a short expiration time. Suppose your expiration time is 10 days; I turn off my TV first, then 10 days later I turn on my TV again. Should I see your login page, or am I already logged in?

Thadeshwar: Your question is, if you get a rescheduled cookie and you turn your TV on after that many days, will you get a login page or not? First of all, our expiration time is not 10 days, it's a few hours. It's very short in general, even without the rescheduled cookie, and the rescheduled cookie has an even shorter expiration time. It will not be 10 days, it will be much sooner than that. Even if you have a rescheduled cookie, you will not get the login page. We'll try to renew that cookie if we can reach one of our services. That's the answer to your question, basically: we will try to renew that cookie and you will not be logged out in that case.

Participant 2: All of what you had there was assuming that requests are always coming from outside through Zuul. Are there any requests that originate inside of Netflix that need to use the authorization services, so they're not passed through from client requests but originate inside, and how do you deal with that situation?

Thadeshwar: We don't have that many use cases as of now. We are trying to do push notifications wherein the requests originate from within Netflix, and for that, they use a different form of tokens. We do still own those tokens, but they don't use these EAS services.

Participant 3: When you talk to external services, do you pass the Passport, or do you have something else, like a JWT token, in that case?

Thadeshwar: We never send Passport to external services. It's only within Netflix.

Participant 4: Given that you're just one team, how did you convince management of the importance of this, and how did you coordinate this transition? Since all the authentication was in all these different microservices, I'm assuming there are different teams, and now you have to form a new team maybe to do this work? How did you safely go through that transition?

Thadeshwar: It goes back to my initial point in the first few slides: we were seeing so many issues and it was a complicated architecture, so even management saw that there is a lot of pain involved when issues like this pop up. As for the eventual migration, it was not a quarter's job, it took many years. It took two to three years for us to migrate to this new path.

Participant 5: In between, you talked about moving the cookie service, MSL service all to the edge. What about the KMS? What happened to the KMS?

Thadeshwar: KMS is owned by a separate team, and they own and operate it. You're talking about the key management service, right?

Participant 5: No, the cookie service, the MSL service, everything moved to the edge; couldn't you move the KMS to the edge?

Thadeshwar: When you say KMS, you mean the key management service?

Participant 5: Yes.

Thadeshwar: It's not moved to the edge, but it's a separate team that owns and operates that service.

Participant 5: The cookie service and all that is at the edge, but it calls the KMS, which is within Netflix?

Thadeshwar: Yes. We have a dependency on KMS.

Participant 6: My main question is related to the testing of these microservices. When you moved all the cookies and the other tokens to different microservices, how did you make sure that your testing covered all the scenarios, and what kinds of testing did you do, strategy-wise?

Thadeshwar: We rely heavily on unit tests and integration tests and we also test these new services by end-to-end tests. That is the first gate. We make sure there is enough test coverage there. Then when we roll out this feature, we don't do it all at once. We have these dials which are based on percentage of requests, we also have dials wherein we roll out certain features only for certain device types. For example, we may roll out this token termination just for a small percentage of iOS devices or a small percentage of Android mobiles and things like that. It was a very controlled rollout and testing was done before that.

Participant 7: Since there are many teams that are interested in the Passport and a few that are updating it, I'd be interested if you can share some of your experience in evolving the Passport structure. How do you decide which use cases will be backed by the Passport, and what data goes in there and what doesn't?

Thadeshwar: We are very particular about what data goes in there because we don't want Passport to become a kitchen sink of things that people want. We choose based on how frequently the attribute is used and whether it changes often. For example, customer ID or account owner ID does not change, but membership status can change. That's why we don't put membership status in Passport, even though many teams have asked us to. We go by the use case and we look at what is feasible and what is not. Generally, we are very strict about what goes in Passport. For services that want additional information that is not in Passport, we encourage them to make a remote call to the owning service, let's say subscriber service: give it a Passport and it'll give you more information. That's the model that we have been following so far.

Participant 8: Can I ask a follow up question to that? When you evolve the Passport Protobuf schema, how do you coordinate that across all of the clients?

Thadeshwar: That was the whole point of the Passport Introspector abstraction, wherein the teams are not directly consuming the proto. In a way they are, but it is consumed via this Passport Introspector, so we maintain that contract. That interface that you saw, we maintain that.

Participant 8: What if someone's using an old client library, how do you go track that down?

Thadeshwar: The good thing about Protobuf is that additive changes are backwards compatible. I think some of that helps us.

Participant 9: Within Netflix services, how do you pass around Passport, in what format?

Thadeshwar: It's passed explicitly in a request. Let's say you have a gRPC request to a backend service, we pass Passport either as a Base64 encoded string or as a byte array.

Participant 9: Within the header?

Thadeshwar: No, explicitly in the API. In the payload.

Participant 10: For decoding the Passport in the services, you provide that introspector interface. I'm curious if you guys require developers of the services to all use Java or a JVM language so you only have to write library support for that, or do you support the library in different languages?

Thadeshwar: The interface that I showed was for Java, but we also have similar interface for JavaScript. Those are the only two stacks which use Passport.

Participant 11: Authentication is one transformation you saw that you could refactor across the services. Are there any others that you've identified?

Thadeshwar: They are always there, but our team specifically focuses on authentication aspects. There might be other teams that might be doing such optimizations, but I may not be the best person to answer that.

Participant 12: How do you validate the Passport when you receive it on a service?

Thadeshwar: What do you mean by validate?

Participant 12: You include these authentication fields in the Passport, but how do you validate it? How do you make sure the Passport is something you generated?

Thadeshwar: As I mentioned, it's not a token that is coming from an external device or an external service, it is minted within Zuul. Unless we introduce some bug in Zuul, we can trust the Passport or the identity present in Passport.

Participant 13: When you did the migration, were there any specific steps you followed, especially because of the legacy APIs? Were there any challenges in maintaining the legacy APIs and incorporating the change in them, because they work differently compared to what you have?

Thadeshwar: Yes, definitely. When it comes specifically to the legacy APIs, we saw a lot of pain while migrating those services to this new Passport. What we did was refactor the legacy API code in such a way that it can operate in both modes, without Passport and with Passport. Without Passport means the old way of operating, and the new way uses Passport, wherein the token gets terminated at the edge. We tested that, we found a lot of issues, and it was a slow and painful migration.

Participant 14: Once a Passport is created and if another service modifies it, how do all the other services keep in sync?

Thadeshwar: There are only certain services that have a need to modify the Passport; not just anyone can modify it. Since the identity information in Passport is integrity-protected, we don't provide the key permission to all these services. A service that needs to mutate identity in the Passport then passes that Passport along in the same request. If there is an internal orchestration happening, the service ensures that it is passing the new Passport and not the old one. There are a handful of services that have the ability to modify a Passport in the Netflix ecosystem.

Participant 15: You said you're using this in Zuul, so for every request that comes in, you intercept it, and do you attach a Passport to every request?

Thadeshwar: Yes.

Participant 15: If Zuul is external-facing and you were trusting the Passport token going there, then you have an extranet and an intranet in between, in Zuul itself?

Thadeshwar: When we create a Passport, we are looking at a token that we issued at some point in the past. The token is also cryptographically encrypted, so once we are able to decrypt it and verify the identity, we can safely create a Passport from that.

Participant 15: It is still in your extranet? Zuul is in your extranet?

Thadeshwar: No. Zuul is internal. It is attached to ELBs which are external facing.

Participant 16: Let's say you have Zuul inside a cluster, so you have a fully trusted environment. At some point, you might cross boundaries where you have to have zero trust. Do you convert your Passport back in order to send it to another cluster that could be a potential middleware?

Thadeshwar: We don't send Passport to any external services, as somebody asked earlier as well. The only time we need to send an external token is when we are responding back to the device, and that's where Passport Actions come into play. We don't send it to any external services. Passport is within the Netflix ecosystem, and we don't send it to any middleware or any other services that are not within our VPC.

 


Recorded at:

Dec 16, 2019
