Unwinding a Decade of Assumptions - Architecting New Experiences

Summary

Cole Turner shares his experience in implementing new experiences across dozens of Netflix microservices, how they navigate assumptions, from ideation to delivery, and how those assumptions impact decision-making.

Bio

Cole Turner is a senior software engineer at Netflix, focusing on user interfaces and experimentation. When he’s not doing that, he’s chasing how to improve developer ergonomics and productivity, and mentoring early career developers in engineering and career growth.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.

Transcript

Turner: I will be presenting, "Unwinding a Decade of Assumptions - Architecting New Experiences." I love puzzles. This is one of my favorite puzzles because it's one of my favorite shows. It depicts a group of bank robbers celebrating their big score. There was just one problem with this puzzle. When we pulled it out of the box, I found that the edges, it was hard to see where the pieces would go. The contrast, it was hard to see the differences between colors. This was a really hard puzzle to look at. Even in the best light, I found myself having trouble putting it together. This is a lot like working in microservices. When you work on a service, you know how your service fits with another service, but it can be hard to see the bigger picture.

Background

My name is Cole Turner. I'm a senior software engineer at Netflix. I build client side applications for the non-member website, which is where our customers go when they are logged out or don't have an account. A lot of the work that I do is building UIs that interface with upstream microservices. In my experience, I've seen this interaction between how microservices work together.

A Brief History of Microservices at Netflix

I'm going to briefly describe what a microservice is so that we're all on the same page. When we arrange our application into loosely coupled components, developers have more leverage. We have a client that talks to our API, or edge, which is responsible for creating all of the data that the client needs by talking to microservices. These services can depend on each other, or operate independently. Services are often designed around modular functionality or business domains. It's no secret that at Netflix, we love microservices. They are a fundamental part of our architecture, allowing us to operate and scale globally in the cloud. They're crucial to our business, where we seek to entertain the world, reaching billions of people. A lot of you know that Netflix was an early pioneer of microservices, but it first began as a Java monolith when we built our website for DVD in 1998. When the application was deployed, all of the libraries across all of the engineering teams were pulled together, built, and deployed in one go. This would take a lot of time. The biggest problem with this was that if there were any issues with just one of those libraries, all of the engineering teams had to be on alert and investigate the issue.

In 2009, our engineering teams set out for the migration to microservices. It began with our non-customer facing parts of Netflix, and then later our consumer product. Over the next few years, Netflix would also carve out its edge API, which is where our clients talk through the gateway to upstream microservices. This division between microservices allowed all of our engineering teams to focus on what matters most, what they're good at, and what their workload is. The diagram you're seeing here represents the flow of requests from the client through our gateways to upstream microservices. As requests come in from clients, they pass through our routing and API tiers, and onward up to the services.

Netflix Microservices Explained

I'm going to talk about what a microservice looks like in Netflix. Our microservices are made up of a client library, a persistence layer, and a cache layer. When a request comes in through REST or gRPC, it will use either the service client or the cache client to fetch data. The cache client will pull from one of our EVCache stores, which is a Netflix open source solution for distributed in-memory cache. If the data is found there, then it will return from there. If not, then it will use the service client to pull from a persistence datastore such as MySQL or Cassandra. This combination of cache and persistence layer allows our services to scale to thousands of requests per second. That's a lot.
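
The read path described above (check the cache, fall back to the datastore on a miss) can be sketched roughly as follows. This is only an illustration: the class name `CacheAsideReader`, the in-memory dictionary standing in for EVCache, and the `TITLES` store are all assumptions and not Netflix's actual client library.

```python
from typing import Callable, Dict, Optional


class CacheAsideReader:
    """Try the cache first, then fall back to the persistent store."""

    def __init__(self, load_from_store: Callable[[str], Optional[str]]) -> None:
        # A plain dict stands in for a distributed cache such as EVCache.
        self._cache: Dict[str, str] = {}
        self._load_from_store = load_from_store
        self.store_reads = 0  # counts how often we fall through to the store

    def get(self, key: str) -> Optional[str]:
        cached = self._cache.get(key)
        if cached is not None:
            return cached  # cache hit: the datastore is never touched
        self.store_reads += 1
        value = self._load_from_store(key)
        if value is not None:
            self._cache[key] = value  # populate the cache for later requests
        return value


# Hypothetical persistent store (MySQL or Cassandra in the talk).
TITLES = {"title:1": "Example Show"}

reader = CacheAsideReader(TITLES.get)
first = reader.get("title:1")   # cache miss, read from the store
second = reader.get("title:1")  # served from the cache
```

After the second call, `reader.store_reads` is still 1, which is the point of the pattern: repeated reads do not reach the datastore.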

The Hidden Cost of Simple Assumptions in Microservices

I'm going to tell you a story about the hidden cost of simple assumptions when it comes to a microservice architecture. In 2019, we unveiled a new user experience where visitors coming to netflix.com could watch free content. We wanted them to be able to watch from a limited selection of select shows, right on their mobile device. They would be able to watch from their browser without having to download the Netflix app. Here's the catch, the users would not be asked to log in or sign up for an account. They'd be able to watch free content without even having to provide an email address, or selecting a plan. This was 15 years of Netflix history that we were trying to unravel. We unlocked DRM protected playback for non-members without an account. We enabled playback on mobile web browsers for the first time. We did all of this without having to rebuild our entire microservice architecture.

It sounds easy. Just flip a switch and allow non-members the ability to do what our members have been doing for over a decade. On the surface, it sounds like a few lines of code, a piece of cake. We knew going into this project that our backend teams were built around members and subscription plans. We knew we would have to change that, or at least confront it. Before we could do that we needed some answers. We needed to understand how all the pieces fit together. Playback has always been a member feature, and this was the first time that our non-member teams needed that domain knowledge. We knew that this was going to be a significant effort. What we didn't know was the assumptions we were making about the level of effort and the impact that it would have on our backend teams. Lastly, we needed to know which teams to talk to. We needed to know, are we doing the right thing? Who is this going to impact? What I love about Netflix is that our culture embodies this highly aligned, loosely coupled teams that collaborate well together. Our product and engineering teams are amazing at collaboration, because we socialize on what we're trying to do.

Now that we had our answers, it was time to get to work. At a high level, the puzzle pieces looked like this. Our UI and streaming client would talk through the gateway to the play API orchestration layer. To allow non-members, we needed to coordinate with those engineering teams. We partnered with them, and for months, we worked on orchestrating the non-member, getting playback to work. While this isn't everybody who was involved in the project, we worked diligently with our other microservice teams to make this all work.

How to Get Around the 'User' Problem

I want to tell you about the biggest challenge that we faced. Many of our services upstream had this built-in assumption that anytime they would be called there would always be a user object. That's because for the longest time, they were only ever called with a user object. These assumptions were prolific, and one of the most challenging for our team to address because we're dealing with non-members and non-members aren't logged in. If they're not logged in, there is no user object. Our team started working together to figure out how to get around this problem. We looked at several options. The first option that we looked at was, could we create a user for every non-member coming to the website? This would prove to be expensive, because we expect high volumes of non-member traffic. It would be wasteful, because the account would only last as long as their browser session. We needed to explore other options.

The second option was, could we work with the services to provide support for non-members? Would they be able to support if there is a user object or not? This was also expensive, because it would take them months, if not years, to build support for this. That's a risk because while we're trying to get something out the door, this would delay our project. This would delay us years. We had to look at another option. The third option that we looked at was, would we be able to fake it? Could we mock a user object with a non-member? We liked this solution because it's easier. It takes less work. It's a good shortcut that would hold us over, while the long term strategy is planned out. This was the option that we felt was best for us because between the options, it would allow us to get our feature out the door.

Netflix Playback Microservices

We looked at how we would make this option work. When a client makes a request to the gateway, the service call contains a context object to identify the user and their device. The request then goes through the playback API orchestration layer, where it talks to the subscriber service for the source of truth. This source of truth tells us whether or not the user is eligible for playback based on their account and device settings. What if there is no user? Our playback API team started to mock the non-member user, which meant that they no longer needed to call the subscriber microservice. By mocking the user, the play API orchestration layer would be able to fulfill all of the requirements to meet the playback lifecycle. One outcome of this option was that this would now be the terminal stop for non-members in the service call, which would reduce the pressure of requests on upstream services. Another outcome, which is my favorite, is that those teams would no longer need to do any work on their side for us to get our experiment out the door. This would hold us over for the short term. We liked this, but we really were hungry for a more long term solution. We started to think about how users would access Netflix in the future.
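
As a rough illustration of the mocking approach, the orchestration layer could substitute a fixed stand-in user whenever the request context carries no user, and skip the subscriber lookup entirely. The types `User` and `RequestContext`, the `MOCK_NON_MEMBER` constant, and the `resolve_user` function are hypothetical names for this sketch; the talk does not describe the real shape of the mocked object.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class User:
    user_id: str
    is_member: bool
    can_play: bool


@dataclass(frozen=True)
class RequestContext:
    user_id: Optional[str]  # None when the visitor is not logged in
    device_type: str


# Hypothetical stand-in user returned for every logged-out visitor.
MOCK_NON_MEMBER = User(user_id="non-member", is_member=False, can_play=True)


def resolve_user(ctx: RequestContext, fetch_subscriber: Callable[[str], User]) -> User:
    """Return a user object for every request, logged in or not."""
    if ctx.user_id is None:
        # Terminal stop for non-members: no call to the subscriber service.
        return MOCK_NON_MEMBER
    return fetch_subscriber(ctx.user_id)


subscriber_calls: List[str] = []


def fake_subscriber(user_id: str) -> User:
    # Test double that records each lookup it receives.
    subscriber_calls.append(user_id)
    return User(user_id=user_id, is_member=True, can_play=True)


visitor = resolve_user(RequestContext(None, "mobile-web"), fake_subscriber)
member = resolve_user(RequestContext("u-42", "tv"), fake_subscriber)
```

Only the member request reaches `fake_subscriber`, so downstream services that expect a user object keep working while seeing no extra non-member traffic.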

Access and Identity Management

In 2020, our access and identity management team set out to unwind these assumptions and look for a more long term solution. They partnered with engineering and product teams across the company to align on a vision for how we think about users accessing Netflix. As a result, we all came to the conclusion that we needed more leverage to control user friction with granularity and more flexibility. Since then, they've set out to refine our access control strategy. We started by looking at what strategies we already had in place. Our primary strategy for access control was a role based strategy. A role based strategy has largely revolved around whether or not the user is a member and has an active subscription plan. When a user tries to play back, our services are checking these things and deciding whether or not they are eligible. This strategy works well when the user experience has very clear boundaries across very clear role types. This strategy is easy to use, because it doesn't require a lot of complexity across your services. However, it's harder to scale, because as the number of access points increase, as the number of roles increase, there is a lot of ambiguity and a lot of work to be done. At Netflix, we found that this role based strategy was hard for us to scale, to do the user experience and learnings that we wanted to do. We needed to explore another strategy.

We started looking at attribute based access control. This strategy is one where the permissions are contextually aware, and can take in factors from multiple sources, such as, who is the user? What are they trying to access? How are they trying to access it? What environment are they in? Let's look at an example. In this example, we have a permission of whether or not the user can play back. This permission leverages several properties about the user such as, do they have an account? Do they have a subscription plan? Which device are they trying to access from? Where in the world are they? What's nice about this strategy is if we wanted to evolve this permission, we can take into account other factors, such as expanding the policy to include A/B allocations, or any other factor that we can model in our backend.
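
A playback permission of this kind might be modeled as a single function over the request's attributes. The sketch below is an assumption throughout: the attribute names, the device and country sets, and the `free-playback` A/B cell are invented for illustration and do not reflect Netflix's actual policy.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AccessRequest:
    has_account: bool
    plan: Optional[str]            # None when there is no subscription plan
    device_type: str
    country: str
    ab_cell: Optional[str] = None  # A/B allocation, if any


# All values below are invented for illustration.
SUPPORTED_DEVICES = {"tv", "mobile-app", "mobile-web", "desktop-web"}
AVAILABLE_COUNTRIES = {"US", "BR", "IN"}
FREE_PLAYBACK_CELL = "free-playback"


def can_play_back(req: AccessRequest) -> bool:
    """Attribute-based decision combining user, device, and environment."""
    if req.device_type not in SUPPORTED_DEVICES:
        return False
    if req.country not in AVAILABLE_COUNTRIES:
        return False
    if req.has_account and req.plan is not None:
        return True
    # Policy extension: visitors in the experiment cell may play without an account.
    return req.ab_cell == FREE_PLAYBACK_CELL


member = AccessRequest(True, "standard", "tv", "US")
visitor = AccessRequest(False, None, "mobile-web", "US")
visitor_in_cell = AccessRequest(False, None, "mobile-web", "US", ab_cell="free-playback")
```

Adding the A/B allocation changed one function rather than every service that asks whether a user can play, which is the kind of evolution the strategy is meant to allow.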

When we compare these two strategies side by side, there are a lot of reasons to prefer attribute based access control. This strategy provides more granularity, which gives us more leverage over user friction. This strategy gives us more flexibility that if we want to come up with new ideas or implement new permissions, we don't have to do a lot of work. Moreover, this strategy gives us the ability to scale better because rather than making a change across our entire architecture, we can make a change just in one service. On the other hand, if your user experience only has a limited number of access points or a limited number of roles, then a role based access control strategy is still an excellent choice because of the simplicity it provides.

User Friction with Permissions

Looking back at our experiment, we would have avoided a lot of time and heartache with an attribute based access control strategy. With permissions, our play API can fetch from the permissions service to determine if the user is eligible. The permission service will look at the request context, such as, who is the user? Which device they're trying to access from, and any other number of factors. It can also pull information from other services. It will then cache and return the result back, and now we know whether or not the user is eligible. However, there are times where we want to show more friction or add some extra steps when the permission is denied. A good example is when your kids are logged into Netflix, and you don't want them to accidentally spend more of your money on a more expensive plan. When our permission service is getting requests from hundreds of services and UIs, we want to make sure that it doesn't become a single point of failure, and take all of Netflix down. Our teams have thoughtfully leveraged multiple strategies to make sure that doesn't happen, including caching and fallbacks, as well as standard practices of scaling, availability, and performance.
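
One way to picture the caching and fallback behavior is a thin client that remembers decisions and falls back to a static default when the permission service cannot be reached. This is a sketch under stated assumptions: `PermissionClient`, its `fetch` callable, and the fallback table are hypothetical, and the choice to deny by default and not cache fallback answers is a design decision of this example, not something stated in the talk.

```python
from typing import Callable, Dict, Tuple

Key = Tuple[str, str]  # (subject, permission name)


class PermissionClient:
    """Caches decisions and degrades gracefully when the service is down."""

    def __init__(self, fetch: Callable[[str, str], bool], fallback: Dict[str, bool]) -> None:
        self._fetch = fetch
        self._fallback = fallback
        self._cache: Dict[Key, bool] = {}

    def is_allowed(self, subject: str, permission: str) -> bool:
        key = (subject, permission)
        if key in self._cache:
            return self._cache[key]
        try:
            decision = self._fetch(subject, permission)
        except ConnectionError:
            # Service unreachable: use the static default, deny if none exists.
            # The fallback is not cached, so recovery is picked up on the next call.
            return self._fallback.get(permission, False)
        self._cache[key] = decision
        return decision


def healthy_service(subject: str, permission: str) -> bool:
    return permission == "playback"


def unavailable_service(subject: str, permission: str) -> bool:
    raise ConnectionError("permission service unavailable")


client = PermissionClient(healthy_service, fallback={})
degraded = PermissionClient(unavailable_service, fallback={"browse": True})
```

With the service down, `degraded.is_allowed("u1", "browse")` returns the configured default and `degraded.is_allowed("u1", "playback")` is denied, so callers get an answer instead of an error.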

The Future of Microservice Permissions - Service Driven Attributes

All of this really excites me because I see how it sets our teams up for success. When we have a single service governing access control, then our source of truth becomes centralized instead of distributed. When we use attributes for access control, then we have more leverage to control user friction with more granularity and flexibility. This lets us tailor our user experiences and really try out new things with ease. This feels like the future to me. I would call this the future of microservice permissions, because this model lets us scale our efforts. We can rapidly experiment and learn how to deliver the most exceptional experiences for our customers.

With all of that said, I want to talk about one assumption we often make about microservices. You only ever hear people talk about how good they are and how they speed up innovation. That's not always true. It would have taken a great deal of time and effort for us to have done this all before our experiment. We would have had to gather all of the engineering teams, assemble, share context, agree on timeframes and milestones. And that's before we even get into development. Even developing it would have taken years. We optimized our plan for what we could do in the short term. While our teams were invested in a long term strategy, we had to get something out the door. Then we would follow up with a long term solution. The message here is that we need to plan ahead, and that our technical problems are not always embedded in technical solutions. There are always non-technical dimensions here.

Key Takeaways

The first takeaway is that assumptions are natural. It's our job to unwind them together. We need to collaborate and gather all of our stakeholders. We need to ask ourselves, who needs to be involved in the discussion, and who is missing from the conversation? This way we make sure we're gathering enough context. Also, we know which solutions exist already and which ones don't. This will also inform us what is going to be the level of effort so that we can evaluate the tradeoffs and make the best decision. All of you can be facilitators in this process.

The second key takeaway I want to talk about is we need to invest in long term strategies that can be paid off over time. It's not enough just to solve this one problem. We need to anticipate how this might snowball or how this solution will continue to evolve as our needs evolve. Because the conversation will continue and our customers will come back with additional requests. We want to make sure that this works for the long term. This will also help us with our next key takeaway.

Our third key takeaway is that we need to size our efforts by risk and reward. A long term strategy is great, but we need to get something out there because we can't spend forever trying to work on the best, most ideal solution, because we might never get there. Priorities change. Businesses have new needs, or maybe, you're in a company where you're running out of money. You want to make sure that you're sizing your efforts by what can we get done today, and what will pay off. Then invest in those long term strategies.

The last thought I'll leave you with is that programming and engineering are vastly different. Programming is just writing code and solving problems. Engineering is all of the collaboration and communication that happens to make us more effective in doing that. Engineering is about bringing the group together, so that we're all on the same page and we're doing what's best for our business. When we do this, we pursue the non-technical realm. We're talking to people. We're engaging. We're having a good time while we're doing it, because that is what makes us better engineers.

Questions and Answers

Watt: How much time did it actually take to overhaul the authentication and move from role based access control to attribute based? What challenges did you find moving between those two? Was it easy, or what were the challenges there?

Turner: When you're faced with such a large change, and you're dealing with all of these assumptions that are baked into your microservices, the biggest challenge is you want to get everything done at once. Businesses have many needs and priorities, so we really have to strategize, what can we get done today and what can we pay off over time? The major theme is, we really want to make sure this works. We will start with some small initiatives that we think will leverage this new system, this new architecture, and in this case, the new permission service. By testing that out, we suss out the performance, the scalability, the availability, before we start to implement this across all of our systems. It is still an ongoing effort. Our access and identity management team is amazing. They've planned out a multi-year strategy to integrate the permission service throughout our Netflix architecture. This will include the consumer product, as well as how the different users will access netflix.com.

Watt: That means you're not finished with the migration yet from RBAC to ABAC, it's still in process. It sounds like it's been going on for a year, months, a while?

Turner: It began about a year ago, and as we have new initiatives, as Netflix continues to reinvent itself, they're continuing that migration or integration.

Watt: A few questions around some of the technologies and things in Netflix. What standardization does Netflix follow across microservices? Is there any standard tooling and/or approaches that are followed the way you do things? What standardization exists?

Turner: We have a platform team that provides us with a toolkit for creating applications and services at Netflix. This toolkit contains various blueprints for common application types. Whether you're starting a Java application or a Node application, a client or a service, the platform provides a common infrastructure for developers to follow a paved path. Really, this is about providing the best outcome. Our culture at Netflix embodies freedom and responsibility. Our teams are free to choose any application type, or invent their own. They may decide that they find one technology really suits their needs, so they're free to pick that and prove it out.

Watt: Can you mention something about your caching strategy for user permissions? You spoke about how you cached some of the permissions there, and maybe how you deal with invalidated tokens for users who have logged out if there's no centralized server?

Turner: When it comes to caching, we have our standard practices of availability, scalability, and performance: having multiple instances, having multiple cache instances. The client library will first check from the EVCache before it actually fetches from the permission service, which is the primary throttle for preventing too much load onto the permission service.

The second question is, how do we handle invalidation for tokens for non-members?

Watt: Specifically dealing with invalidated tokens for users who have logged out.

Turner: We have different strategies for how we treat our logged in users versus how we treat our logged out users. They rely on two different kinds of tokens at Netflix. What I can say is, in addition to that, we also leverage cache expiry times to make sure that whatever permissions we do give to a user, they live in a cache, but only for a certain length of time. That length of time can be customized to each individual permission.
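
The per-permission expiry described here could look something like the following sketch. The class `TtlPermissionCache`, the TTL values, and the injectable clock are assumptions made for illustration; the talk does not say how the actual expiry times are configured.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TtlPermissionCache:
    """Stores permission decisions that expire after a per-permission TTL."""

    def __init__(
        self,
        ttl_seconds: Dict[str, float],
        default_ttl: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl_seconds
        self._default_ttl = default_ttl
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: Dict[Tuple[str, str], Tuple[bool, float]] = {}

    def put(self, subject: str, permission: str, decision: bool) -> None:
        ttl = self._ttl.get(permission, self._default_ttl)
        self._entries[(subject, permission)] = (decision, self._clock() + ttl)

    def get(self, subject: str, permission: str) -> Optional[bool]:
        entry = self._entries.get((subject, permission))
        if entry is None:
            return None
        decision, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[(subject, permission)]  # expired: force a re-check
            return None
        return decision


now = [0.0]  # fake clock value, in seconds
cache = TtlPermissionCache({"playback": 30.0}, default_ttl=300.0, clock=lambda: now[0])
cache.put("u1", "playback", True)
cache.put("u1", "browse", True)
fresh = cache.get("u1", "playback")        # still inside the 30 second window
now[0] = 31.0
expired = cache.get("u1", "playback")      # past its TTL, so the entry is dropped
still_valid = cache.get("u1", "browse")    # default TTL of 300 seconds applies
```

A short TTL on a sensitive permission bounds how long a stale grant can survive after logout, even without explicit invalidation.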

Watt: Do you do explicit cache invalidation, or you pretty much just wait for the timeouts to occur?

Turner: I believe it's a combination of both, but I'm not too familiar with that area. I know that we leverage expiry. I'm not too sure on the invalidation manually, but if it were me and I was in that position, I would assume we would.

Watt: How do you scope your microservices? What principles do you use for figuring out what belongs in a microservice? Because we hear a lot about domain driven design and various different ways to put things together. How do you guys work out what goes in a microservice and what doesn't?

Turner: When I look at all of the microservices we have today, a large part of how they are divided is by business domain, so which services serve the consumer facing product, which services serve our studio, or some of the services behind the scenes that deliver great content in front of users. However, sometimes a service becomes so big that teams decide that how they treat this functionality is not how they would want to treat it. They decide to break it up. They might create even smaller services, and this revolves around the modular functionality of each feature. We have services for videos and different kinds of videos, artwork as well as metadata, and so we do tend to follow the modular functionality or domain area.

Watt: It sounds like the areas that are in particular domains will be in certain microservices but at the same time you've got specific technical reasons for why you might choose to put a microservice by itself, maybe it needs to scale more, or something like that. There's various reasons around that.

Turner: Exactly.

Watt: What kind of technologies, languages, and frameworks are used by Netflix? I think you've said every team is free to use their own. You do have a lot of libraries and stuff that you have actually built at Netflix. If people wanted to learn more about the Netflix microservice architecture, where could they go to find out about some of the stuff that you guys are doing?

Turner: If you're interested in learning more about Netflix architecture, or any of the technologies that we leverage at Netflix, there are two destinations that I highly recommend. We have the Netflix technology blog, where you will hear from our service teams, our data scientists, frontend, backend. All of our teams at Netflix are welcome to publish at this blog. It's actually where I myself have learned a lot about our architecture and how our services work together. Moreover, you can also check out the Netflix open source GitHub, where you will find many of the solutions that we've open sourced or our architecture from when we moved to microservices. You'll find open source repositories like Eureka, Hystrix, as well as anything that we have open sourced that we have found to help us deliver exceptional user experiences.

Watt: There was a question around testing. There were a lot of changes especially when you moved from your RBAC to ABAC. How did you go about actually testing these changes given that there could be so many wide ranging implications of where things could go wrong? How does the testing work?

Turner: I tend to think about the technical versus the non-technical. When we look at the technical, there's obvious things like automation, and alerting, and monitoring. We have a lot of observability into both the life cycle of how our users are enjoying Netflix, as well as how our systems are handling all of these users. Then there's the non-technical components, which are, we have so many people interested in projects, or we collaborate very well across cross functional stakeholders, that there are always a lot of eyes on what we're building. There have been times where I'll hear some feedback from somebody across product, and we'll figure it out, and we'll adjust from there. It's a lot of collaboration.

Watt: You've got quite a lot of automated test suites and things on services, but you can't test everything, because there are so many. Do you have any strategies as to what you test versus what you don't, in an automated way?

Turner: We employ different automation strategies across the stack. Across all of our backend, middle-tier, and client applications, each layer is doing its own level of testing to make sure that the integration works. If you look more towards the client side of the stack, we often write user story focused automation that covers what we expect our customers to be able to do, as well as what they should not be able to do. We also try to make sure that this experience translates cross platform. Our clients will be writing automation for our website, for our mobile apps, for our TV, as well as different devices. We have a mobile automation lab where Netflix engineers can run automation on various devices. For an experience like the one that we launched, we made sure that we were running our automation across a large variety of devices to make sure that the user experience was consistent from device to device.

Watt: What is your API gateway layer? The technology that's used there? Is that Zuul still or is it Amazon stuff? What is the actual technology in use for the API gateway?

Turner: Our API gateway layer is what we call edge, and it's basically the edge of the client. It's the routing. It's anything in the middle tier that handles authentication and request context. We have various API edges that service our different Netflix domains. The consumer product has its own edge. The studio has its own edge, as well as other parts of Netflix. I believe Zuul might be open sourced, which is our router. Then you have other technologies that power the service to service communication, Eureka being one of those, Hystrix. Then the edges will talk to the upstream microservices through REST or gRPC.

Recorded at:

Oct 17, 2021

Cole Turner
