
How Netflix Scales Its API with GraphQL Federation


Summary

Jennifer Shin and Stephen Spalding discuss Netflix’s API unification process using GraphQL Federation.

Bio

Jennifer Shin is an API engineer for Netflix's consumer and studio experiences. Stephen Spalding is an API engineer for Netflix's consumer and studio experiences.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.

Transcript

Spalding: In the beginning, the cloud was formless and void. The engineers provisioned a server. It talked to the database and served up the webpage. It was good. The users came and the business grew. As the company grew, they formed more engineering teams. As the server grew, it became a monolith. They divided the monolith into microservices in order to increase the autonomy of the teams so that they could move faster. Then the engineers created apps that used those services. The engineers saw that it was not good for each app to have to talk to every service on its own, so they created an API gateway to bind those services together. In the seventh year, they rested. Not for long, because they continued to innovate. They saw that REST was insufficient so they created graph query languages for the apps to fetch data from the API. It was all good. Time passed, and the company continued to grow. Teams were fruitful and the services multiplied. The API gateway that bound them together grew as well in order to compose the many services. There was temptation. In order to handle failure gracefully, they added fallback logic into the gateway. Simple caches gave way to complex in-memory datastores, along with the business logic. Before they knew it, the API gateway had become the new monolith.

Shin: What Stephen has just described is the state of Netflix microservices today. We have hundreds of mature services providing APIs for UIs to consume. Yet they're all aggregated into a single API monolith. This architecture might sound familiar to you, if your organization also implements a microservices architecture with a single API aggregation tier. We had done all of this work to break apart our system into microservices. Yet, we still found ourselves with an API monolith. We asked ourselves, now what?

Background

I'm Jennifer.

Spalding: I'm Stephen. We're engineers at Netflix on the API systems team. You might also know us as Edge engineering. We work on this API aggregation layer. The nexus point between UIs and the universe of Netflix microservices. There's all these services on the back-end and all these different UIs running on different device types. Our team represents this tiny aggregation point in the middle. We take the APIs exposed by microservices and weave them together into one big graph API. This is that graph.

Netflix's Big Bet on the GraphQL Federation Architecture

Shin: The clients can simply pretend that Netflix is a single service. We're actually just the middleman. We simply aggregate information from all of these different data sources. This architecture has served us well for many years, but we're starting to see that it's reaching its limit. In order to scale even further, Netflix has placed a big bet on an architecture called Federation. We are here today to tell you what Federation is, to show how it has enabled us to scale to unprecedented levels, and to convince you that Federation is the future of APIs.

Spalding: First, let me tell you a story. One year ago, I was on-call for the Netflix API service team. It was Thanksgiving and I was visiting family. Everyone was bustling trying to prepare dinner when I get paged. I log on to see what the problem is. I'm told that the trickplay images are appearing in Mandarin, on a big movie that we had just released. Naturally, the first thing I did was Google, "What are trickplay images?" Then I started searching through an API to figure out where it was exposed and how it was used. I knew I'd seen the word trickplay before somewhere. Even though I'd been on the API team for several years, I never knew exactly what they were. Yet, right now, I needed to become an expert enough to fix them, and quick before everyone starts tweeting that Netflix is in Chinese.

Shin: That's the problem. Our graph had grown so large that no single human understands the entire surface area. Yet the entire graph is owned by a single team. What if we could break the API apart, so that domain experts could own their part of the graph, and still expose the entire Netflix ecosystem from a single unified access point?

Spalding: This is precisely what Federation enables. It's a way of breaking apart the implementation of your API, while preserving the facade of a unified API for clients. It allows you to remove the business logic from the core aggregator, so that it becomes an appliance, like a reverse proxy. That is something you can scale.

GraphQL

Shin: The title of this talk is how Netflix scales its API with GraphQL Federation. Let's talk a little bit about what GraphQL is. Then we'll talk about how Federation works within that context. If you've seen Netflix talks in the past, you might know that the Netflix app actually uses a different graph API technology called Falcor. It's conceptually very similar to GraphQL, but back in 2012, GraphQL didn't exist yet, so we created Falcor. In 2020, GraphQL is now pretty much taking over the world. Netflix is using it too. Federation can actually be applied to both, but today we'll be talking about GraphQL.

Here's an example of a really simple graph API. First, from the very root of the graph, which for GraphQL is called query, you can fetch the recommended videos for a user. From there, you can traverse into each one of those videos. The video type has further fields that we can fetch, like title, or rating. Some of these fields just return scalar values, but others express a relationship to another object such as trailer, which would then be another video object. One of the key distinctions of a graph API is that we can selectively choose the properties that we want as a client, and then follow relationships and recursively select properties from other objects.
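The graph shown on the slide is not reproduced in this transcript. As a rough illustration of selecting fields and following relationships, here is a minimal Python sketch. It assumes a hypothetical in-memory dataset and expresses the client's selection as nested dicts rather than GraphQL syntax; the names `VIDEOS`, `RECOMMENDED`, `select_video`, and `recommended_videos` are made up for this example and are not Netflix code.

```python
# Hypothetical in-memory data; IDs, titles, and ratings are made up for illustration.
VIDEOS = {
    1: {"title": "Video One", "rating": "TV-14", "trailer": 2},
    2: {"title": "Video One: Trailer", "rating": "TV-14", "trailer": None},
}
RECOMMENDED = {"user-1": [1]}

# Fields that point at another video rather than holding a scalar value.
RELATIONSHIPS = {"trailer"}


def select_video(video_id, selection):
    """Return only the requested fields, following relationships recursively."""
    video = VIDEOS[video_id]
    result = {}
    for field, sub_selection in selection.items():
        value = video[field]
        if field in RELATIONSHIPS and value is not None:
            # A relationship: recurse into the related video with its own selection.
            result[field] = select_video(value, sub_selection)
        else:
            result[field] = value
    return result


def recommended_videos(user_id, selection):
    """Root field: start at the query root and traverse into each video."""
    return [select_video(video_id, selection) for video_id in RECOMMENDED[user_id]]


# The client asks for the title plus the trailer's title, and nothing else.
selection = {"title": {}, "trailer": {"title": {}}}
response = recommended_videos("user-1", selection)
```

The `rating` field exists in the data but is absent from the response because the client did not select it.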

Spalding: The actual Netflix graph is a bit more complex. You might say it's like something from The Upside Down. Let's break it up. With GraphQL Federation, each distinct domain or logical business portion of the graph is served by a different service. The API aggregation layer composes these together into a single unified graph.

Let's take a look at our example again. This is what that graph would look like in a federated architecture. Each of these colors represents fields fulfilled by a separate service. These different chunks of the graph represent a portion of the graph that one domain service is responsible for serving up. Then a service called the gateway, which you can think of as the aggregator, binds these separate schemas or graphs into a single composed graph. Each service only provides the part of the schema it is responsible for. The video service provides the title, description, and trailer for a single video. The images service provides image URLs or box art URLs. The recommendation service provides the top recommended videos for a user. For each video, it only knows the video ID. That's all it knows about videos. That's all it needs to know. The gateway does the rest.
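One way to picture the composition step is a map from each field to the service that owns it. The sketch below is an assumption-laden illustration, not the gateway's real data model: the service names follow the talk, while the field names (`recommendedVideos`, `boxArtUrl`) and the `compose` function are hypothetical.

```python
# Each hypothetical service declares only the fields it owns.
SERVICE_SCHEMAS = {
    "recommendations": {"Query": ["recommendedVideos"]},
    "video": {"Video": ["title", "description", "trailer"]},
    "images": {"Video": ["boxArtUrl"]},
}


def compose(service_schemas):
    """Merge per-service schemas into one map of (type, field) -> owning service."""
    owners = {}
    for service, types in service_schemas.items():
        for type_name, fields in types.items():
            for field in fields:
                key = (type_name, field)
                if key in owners:
                    # Two services claiming the same field is a composition error.
                    raise ValueError(
                        f"{type_name}.{field} is claimed by both {owners[key]} and {service}"
                    )
                owners[key] = service
    return owners


OWNERS = compose(SERVICE_SCHEMAS)
```

With a table like `OWNERS`, the aggregator needs no domain logic of its own; it only needs to look up where each requested field lives.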

The Idea behind Federation

Shin: With such clear service boundaries, we now know who to talk to when, say, video metadata stops appearing for a given video. In fact, their PagerDuty ID and Slack support channel are actually embedded directly into the graph metadata, so we know exactly who to call when something goes wrong. Also, the video service team is not bound by the speed of any other team in exposing their APIs. When they're ready for their API to be consumed by clients, their APIs are available.

Spalding: That's Federation. You take your API and break it into chunks that can be developed independently. You can think of each chunk as a micro-aggregator that just handles a single domain. These can be implemented by domain experts. Then a graph-aware gateway ties them together into a single API. Then this graph gateway is still a central junction in our architecture, but there's a key difference. It doesn't contain any business logic. It just follows a declarative configuration that tells it which data comes from which service. This means, crucially, that the team managing the graph gateway doesn't need to scale along with the size and complexity of the graph that's being exposed. That's the idea behind Federation. Let's go a little bit deeper into how it's implemented.

How Federation Is Implemented

Shin: There are three components to a federated architecture: graph services, the schema registry, and the graph gateway. Let's take graph services first. Graph services are simply GraphQL servers. They expose a small portion of the overall schema and publish it using what's called a schema registry.

Spalding: The schema registry has one essential task, to hold the schemas for all your services. Along with each schema, it holds some configuration like URL or discovery identifier. We also like to register the contact information for the team behind the service, because then that can be embedded automatically into the documentation for the graph. Before a schema can be updated, it has to be validated. Beyond basic linting, we also catch things like breaking changes, and conflicts that arise when combining the schema with the rest of the graph. Finally, the registry provides the schemas and configuration to the gateway.
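The registry's responsibilities described here (store schemas with their configuration and contact details, validate before accepting an update, hand the result to the gateway) can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: schemas are modeled as plain dicts of type name to field names, "breaking change" is reduced to "a previously published field was removed", and the class and method names (`SchemaRegistry`, `publish`, `gateway_config`) are hypothetical rather than Netflix's actual API.

```python
def field_set(schema):
    """Flatten {"Type": ["field", ...]} into a set of (type, field) pairs."""
    return {(type_name, field) for type_name, fields in schema.items() for field in fields}


class SchemaRegistry:
    def __init__(self):
        # service name -> {"schema", "url", "contact"}
        self.entries = {}

    def publish(self, service, schema, url, contact):
        """Validate a schema update and store it only if validation passes."""
        new_fields = field_set(schema)

        # Breaking-change check: a field this service already published must not vanish.
        previous = self.entries.get(service)
        if previous is not None:
            removed = field_set(previous["schema"]) - new_fields
            if removed:
                raise ValueError(f"breaking change, removed fields: {sorted(removed)}")

        # Conflict check: no other service may already own one of these fields.
        for other, entry in self.entries.items():
            if other == service:
                continue
            clash = field_set(entry["schema"]) & new_fields
            if clash:
                raise ValueError(f"conflict with {other}: {sorted(clash)}")

        self.entries[service] = {"schema": schema, "url": url, "contact": contact}

    def gateway_config(self):
        """What the gateway loads: schemas plus where to reach each service."""
        return {
            name: {"url": entry["url"], "schema": entry["schema"]}
            for name, entry in self.entries.items()
        }


registry = SchemaRegistry()
registry.publish("video", {"Video": ["title", "trailer"]}, "http://video.example", "#video-support")
registry.publish("images", {"Video": ["boxArtUrl"]}, "http://images.example", "#images-support")

try:
    # Dropping Video.trailer would break existing clients, so this update is rejected.
    registry.publish("video", {"Video": ["title"]}, "http://video.example", "#video-support")
    rejected = False
except ValueError:
    rejected = True
```

The rejected update leaves the previously stored schema untouched, which is the property that lets the gateway trust whatever the registry serves.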

Shin: Finally, we have the graph gateway. This is where the magic happens. The job of the gateway is to take a single incoming client query and break it into sub-queries that can then be executed against the downstream GraphQL servers. Remember, this graph gateway is an appliance. Until it loads a configuration from the schema registry, it knows nothing about Netflix or about our API in particular. A traditional HTTP proxy is generally considered a layer 7 proxy in reference to HTTP belonging to the application layer of the OSI reference model. GraphQL queries, unlike REST, are abstracted from the HTTP layer. You could think of this as a layer 8 multiplexing proxy.

The Query Plan

Spalding: How does it work? The gateway processes a request in two stages: query planning, and query execution. The query planner traverses a client's entire request, and recursively collects the fields that belong to each service. It identifies the ones that can be fetched in parallel, and the ones that have to be retrieved sequentially.

Let's look at an example. Given our initial schema, if we wanted to take the top 10 videos for a given user, and then for each one we want to fetch the title from the video service and the box art images from the images service, the query plan would look something like this.

Shin: We know we have to fetch the top recommended videos first, because we need those video IDs in order to know which titles and image URLs to fetch. That's precisely how the query plan is constructed. The recommended videos are fetched first, and then in parallel, title is fetched from the video service and box art URLs are fetched from the images service.
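The query plan slide is not reproduced in this transcript. The planning step described above can be approximated as follows: look up the owner of each requested field, fetch the root field first, then fan out to the remaining services in parallel. This Python sketch uses plain dicts for plan nodes, and the `OWNERS` table, field names, and `plan` function are assumptions made for illustration only.

```python
# Hypothetical ownership table, as the gateway might derive it from the registry.
OWNERS = {
    ("Query", "recommendedVideos"): "recommendations",
    ("Video", "title"): "video",
    ("Video", "boxArtUrl"): "images",
}


def plan(root_field, child_fields):
    """Build a plan: fetch the root field first, then fan out per owning service."""
    root_service = OWNERS[("Query", root_field)]

    # Collect the requested video fields under the service that owns each one.
    by_service = {}
    for field in child_fields:
        by_service.setdefault(OWNERS[("Video", field)], []).append(field)

    # Fields owned by the root's service ride along with the first fetch.
    first = {
        "kind": "fetch",
        "service": root_service,
        "fields": [root_field] + by_service.pop(root_service, []),
    }
    followups = [
        {"kind": "fetch", "service": service, "fields": fields}
        for service, fields in sorted(by_service.items())
    ]

    if not followups:
        return first
    if len(followups) == 1:
        return {"kind": "sequence", "nodes": [first, followups[0]]}
    # The follow-up fetches depend only on the first fetch, so they can run in parallel.
    return {"kind": "sequence", "nodes": [first, {"kind": "parallel", "nodes": followups}]}


query_plan = plan("recommendedVideos", ["title", "boxArtUrl"])
```

For this query the result is a sequence whose first step fetches from recommendations and whose second step is a parallel node covering the images and video services.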

Spalding: The gateway sees a query plan in this form. It's simply a tree of fetch, parallel, sequence, and flatten nodes. There are three fetch nodes in this query plan. You'll notice the parent of the very first fetch is a sequence node, signifying that any sibling of this first fetch will run after that fetch is executed. The sibling of this initial fetch is a nested parallel node, signifying that after the first fetch from recommendations, the gateway should then execute the subsequent fetches in parallel. That is then wrapped in a flatten node, which signifies how the results should get stitched back together.
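The four node types can be written down as a small set of Python dataclasses. This is a sketch of the tree as it is described in the talk, not the gateway's real types: the class names, the `path` attribute, and the exact placement of the flatten node around the parallel node are assumptions drawn from the description.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Fetch:
    service: str        # which graph service to call
    fields: List[str]   # which fields to request from it


@dataclass
class Flatten:
    path: str           # where in the response the child's results are stitched in
    node: "PlanNode"


@dataclass
class Parallel:
    nodes: List["PlanNode"]   # children may run concurrently


@dataclass
class Sequence:
    nodes: List["PlanNode"]   # children run one after another


PlanNode = Union[Fetch, Flatten, Parallel, Sequence]

# The example plan: recommendations first, then titles and box art in parallel.
QUERY_PLAN = Sequence([
    Fetch("recommendations", ["recommendedVideos"]),
    Flatten("recommendedVideos", Parallel([
        Fetch("video", ["title"]),
        Fetch("images", ["boxArtUrl"]),
    ])),
])


def count_fetches(node):
    """Walk the tree and count fetch nodes."""
    if isinstance(node, Fetch):
        return 1
    if isinstance(node, Flatten):
        return count_fetches(node.node)
    return sum(count_fetches(child) for child in node.nodes)
```

Calling `count_fetches(QUERY_PLAN)` confirms the three fetch nodes mentioned in the talk.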

Query Plan Execution

Shin: Executing the query plan is pretty straightforward then. We simply traverse the entire query plan starting from the root node, execute the fetches in parallel or in sequence, and merge the results together into the overall response. Here's pretty much the actual code that's responsible for this. It's simply a recursive function that traverses all the nodes in the query plan. When the node is a sequence node, we simply execute whatever is inside that node. When the node is a parallel node, the code execution block is wrapped in an async block. That tells the Kotlin compiler to execute the nested code asynchronously, or in parallel. Flatten nodes, as mentioned, tell the executor where to stitch the result of the query execution back into the overall response. Fetch nodes actually do the fetching against the graph service. That's pretty much it.
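The Kotlin code shown on the slide is not included in this transcript. The same recursive shape can be sketched in Python, with `asyncio.gather` standing in for the Kotlin async block. Everything here is an illustrative assumption: the three coroutine functions fake the graph services, plan nodes are plain tuples, and the stitching logic is deliberately simplistic.

```python
import asyncio


# Hypothetical stand-ins for the three graph services.
async def recommendations(_ids):
    return {"recommendedVideos": [{"id": 1}, {"id": 2}]}


async def video(ids):
    return [{"title": f"Title {i}"} for i in ids]


async def images(ids):
    return [{"boxArtUrl": f"https://img.example/{i}.jpg"} for i in ids]


SERVICES = {"recommendations": recommendations, "video": video, "images": images}

PLAN = ("sequence", [
    ("fetch", "recommendations"),
    ("flatten", "recommendedVideos", ("parallel", [
        ("fetch", "video"),
        ("fetch", "images"),
    ])),
])


async def execute(node, response, path=None):
    """Recursively walk the plan, filling in `response` as fetches complete."""
    kind = node[0]
    if kind == "sequence":
        # Children run one after another.
        for child in node[1]:
            await execute(child, response, path)
    elif kind == "parallel":
        # Children run concurrently; gather waits for all of them.
        await asyncio.gather(*(execute(child, response, path) for child in node[1]))
    elif kind == "flatten":
        # The flatten path says where the child's results get stitched in.
        await execute(node[2], response, node[1])
    elif kind == "fetch":
        fetch = SERVICES[node[1]]
        if path is None:
            response.update(await fetch(None))
        else:
            entities = response[path]
            rows = await fetch([entity["id"] for entity in entities])
            for entity, row in zip(entities, rows):
                entity.update(row)
    else:
        raise ValueError(f"unknown node kind: {kind}")


response = {}
asyncio.run(execute(PLAN, response))
```

After the run, each entry under `recommendedVideos` carries its `id`, `title`, and `boxArtUrl`, merged from three separate services.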

How Federation is used at Netflix

Spalding: That's Federation. Let's talk about how we've been using this at Netflix and what we've learned. It all started a couple years ago. In 2018, the Netflix API team was exploring ways to break apart our API monolith. We prototyped a federating graph gateway for Falcor. This was really exciting because it demonstrated a potentially transformative way to scale our API. Meanwhile, there was another organization at Netflix that was building their own API aggregation layer. This rapidly growing organization is Netflix Studio. Netflix Studio engineering makes a bunch of apps that facilitate the creation of all the content you enjoy in the Netflix app. This includes custom software for things like scheduling, talent management, dubbing, animation. There were dozens of services providing all this functionality, and Netflix Studio decided to make a GraphQL aggregation layer to tie them together.

Shin: You might have noticed that the number of Netflix originals has exploded over the last few years, so has the Netflix Studio graph. In only a few months, this studio graph had grown to a point that it took the Netflix consumer graph years to get to. The studio API team was already feeling the pains of a monolithic architecture, so the company placed a big bet on implementing Federation here first. The two API teams joined forces to create a scalable API platform for studio. Right around this time, a company called Apollo had just released a spec for something they called GraphQL Federation. Studio was already using GraphQL at the time, so this seemed like a perfect fit. In July of 2019, the combined API teams started building a GraphQL gateway based off of Apollo's reference implementation.

Federated Graph Service

Spalding: We chose to implement our gateway using Kotlin. This would give us access to Netflix's Java ecosystem, while allowing us to rapidly develop a robust solution with language features such as coroutines for efficient parallel fetches, and an expressive type system that handles null safely. As we started to implement the gateway, we had one lingering question, would it be fast enough? We wanted to make sure that we weren't going to add too much latency. As soon as the basic functionality was complete, we did some benchmarking. The core gateway activities of query planning and execution, were clocking in at under a millisecond. This gave us enough confidence to move forward. Within a few months, we had an initial release of the gateway ready to go. We took the former API monolith and put it behind the gateway. This became our very first federated graph service.

Shin: Next, we set up one new graph service alongside the API monolith that exposed one small portion of the monolith's schema. We marked this new schema with a directive called @Override. This schema directive instructed the gateway to route to this new service instead of the old one when constructing and executing back-end queries. From there, we opened up the platform for wider adoption.
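The effect of the override directive on routing can be illustrated with a small table-building function. This Python sketch is only a model of the behavior described: the service names, the `Movie.schedule` field, and the representation of the directive as an `"override": True` flag are all hypothetical.

```python
# Hypothetical schemas: the new service re-declares one monolith field and marks it as an override.
SCHEMAS = {
    "monolith": {"Movie.title": {}, "Movie.schedule": {}},
    "scheduling": {"Movie.schedule": {"override": True}},
}


def routing_table(schemas):
    """Map each field to the service the gateway should route it to."""
    table = {}
    # First pass: ordinary declarations establish the default owner.
    for service, fields in schemas.items():
        for field, directives in fields.items():
            if not directives.get("override"):
                table[field] = service
    # Second pass: overrides take precedence over the default owner.
    for service, fields in schemas.items():
        for field, directives in fields.items():
            if directives.get("override"):
                table[field] = service
    return table


ROUTES = routing_table(SCHEMAS)
```

Here `Movie.schedule` routes to the new service while `Movie.title` keeps going to the monolith, which is what allows functionality to be moved out one field at a time.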

Spalding: Over the past year, we've had more graph services taking over functionality for the former API monolith and adding brand new functionality to the graph. If you look at this chart of the number of graph services behind the gateway, you can see that we're looking at exponential growth. There are now over 50 services in production contributing to the graph. That graph is now being used to power over 60 studio applications. With all the teams behind the services contributing, the graph has exploded. The number of nodes in the graph has grown from about 800 in October of last year, to almost 7,000 today. There would have been no way that one team could have added this much functionality to the graph in only one year. Yet, this was precisely what a federated platform enabled.

Shin: That is the software you want to be building. You want the effect of your efforts to be multiplicative, not linear, especially when your growth goals are as big as Netflix's have been, and as big as we still envision for the future. This multiplicative effect is precisely what you're after. Finally, not only has the federated platform enabled such explosive growth, but that old API monolith that used to be a bottleneck, that was complex and hard to reason about, and that so many considered to be a blocker is now slated to be deprecated and fully removed sometime this quarter.

Using GraphQL for the Consumer Netflix App

Spalding: Now that the Federation platform was built and the studio graph was taking off, it was time to circle back to the Netflix consumer API. There had been growing interest over the last few years in using GraphQL for the consumer Netflix app, and our internal Falcor implementation has been evolving in that direction. In fact, we are already using GraphQL schema definition language to describe our APIs, so we don't have to maintain our own parser anymore. Could we use the exact same Federation infrastructure that we've built for studio to power the Netflix app?

Shin: A small working group was formed to build out one page, the search page on mobile devices, on GraphQL. Here's what that screen looks like. The client here is using Apollo client to speak GraphQL to a gateway. This gateway is federating requests to three separate graph services on the back-end, each exposing a different portion of the graph. We then sent a small amount of production traffic into this new stack. Initial results are looking good, so the next steps are, A, to fill out the graph even further and make more data available on this platform. B, to send more traffic to the stack overall, so that we can collect more data from the wild. We're really excited about the results so far.

The Challenges with Federation

Spalding: By now, I hope your Lizard Brain is starting to tingle. You're thinking, "There's no way a project like this is all roses." It's true. Federation isn't some magic potion. It's not going to solve problems like climate change or world hunger. In fact, it comes with some real challenges. First, you're going to need a team that's devoted to building and operating a brand new platform. We dedicated three engineers to building out the core components like the gateway and the registry.

Shin: We also dedicated an entire team to the developer experience and tooling of these graph services. Our colleagues, Paul and Kavitha are actually presenting their work on this effort on November 18th, so be sure to check it out. They've worked on some really great, cutting edge stuff.

Spalding: We also dedicated resources for instrumentation like distributed tracing, so that engineers can investigate and troubleshoot problems that are happening in near real-time. For more information on how we did that, check out our colleague Elizabeth Carretto's talk on distributed tracing at Netflix. She just presented it yesterday, November 10th, so be sure to check that out.

Shin: With a distributed graph, you're going to have a lot of engineers contributing to a massive graph. If you don't have a strong sense of controlled chaos, best practices, documentation, even a schema working group, you can end up with one gigantic, highly messy graph.

Spalding: Finally, you'll be distributing the concerns of the API layer instead of shielding your engineering teams from them. This could be potentially a radical and costly shift, especially if most engineers in your organization are not generally concerned with matters like defensive programming or security.

Shin: We invested heavily into this new architecture. We even merged two teams that used to be part of two totally separate organizations in order to deliver this new product. Was it worth such a heavy investment, especially when you consider all of the tooling and instrumentation that had to be built in as well? We posit that it was. As Netflix grows its subscriber base to 300 million global subscribers and more, it's crucial that no single layer of our architecture is a bottleneck for growth and innovation.

Summary

Spalding: What have we covered? We talked about Federation, and we went into some detail about how we've implemented it at Netflix. Our hope is that in all of the technical detail, we haven't lost the essence of what Federation is. It's more of a philosophy than a specific technology. You could call it a philosophy of illogical aggregation. You remove the business logic from the core aggregator, you restore API ownership to domain experts, and yet still maintain a single unified API for clients to access your entire ecosystem. This vastly increases productivity. It enables loosely coupled teams and systems. It restores separation of concerns to your microservices architecture.

Shin: Is Federation the future? It is a future. We're not actually that naïve to believe that a single technology or pattern could be the right choice for everybody. We've already seen that it doesn't solve problems like world hunger or COVID-19. We've taken a journey through the hype cycle, past the peak of inflated expectations, through the trough of disillusionment, to a place of extreme productivity. That is a pretty awesome future.

Spalding: One question remains, is this your future?

Questions and Answers

Fisher-Ogden: What was the biggest challenge, technologically, with moving to this Federation architecture?

Shin: Technologically, I would just say, the actual architecture was not super difficult. I think part of the challenge in bringing this in is that because it was brand new, this was from Apollo, a lot of the challenges actually came from marketing this to the wider organization. Making sure that we had a broad alignment across the entire org, so that the domain service teams would actually be on board to own those graph services, and that we had the framework teams building out platforms that would become available to these other teams to consume and use.

Fisher-Ogden: What was the biggest people challenge? Do you want to expand on that? You didn't have this awesome presentation with cool music and graphics, so how did you build that alignment across the organization?

Shin: The studio edge team was originally the GraphQL owner. They were the ones who first created the initial monolith for the studio side of the organization. Robert and Tejas are our teammates. They really formulated the idea of having this gateway layer. Created a doc, shopped it around, basically. They really did a grassroots level effort starting with pretty much every single team and marketing it from there.

Fisher-Ogden: The people side is very one-on-one, very relational. The technology side is the big scale part. Once you get that alignment, you can scale it up.

Shin: Exactly. Especially with Netflix where individual contributorship is so important. I think having those relationships, working really closely with individuals is pretty important.

Fisher-Ogden: Where should people consider starting, if they're considering doing Federation?

Spalding: The first question to ask yourself is, why are you interested in Federation? Are you committed to doing microservices already? If a monolithic architecture is working for your company, then don't let us be the ones to convince you otherwise. If you are down this microservices road, then this is a really great option. To start, you'll need a gateway. We chose to implement our own using this GraphQL Federation spec, but there is one that is open sourced by Apollo. There's that. Then you'll need some way also to register your schemas. This could be as simple as a Git repo that you all put your schema files in or you can create a registry service, like we did. Or, also, there's an offering from Apollo that you can use for that. These are the practical steps that you would start out with.

Fisher-Ogden: Is the gateway open source? You mentioned Kotlin. You mentioned works at Netflix scale. Tell us about that.

Spalding: Our gateway is not currently open source, but it is something that we're definitely open to. It's been on our roadmap. If this is something that you'd be interested in using or collaborating on, then do let us know, and that will help us prioritize the work of supporting that as an open source project.

Fisher-Ogden: Boris had asked, "So ours isn't open source right now but where could somebody use an open source gateway?"

Spalding: As I mentioned, Apollo gateway. It's a TypeScript Node.js server that implements the same spec. Ours is actually interchangeable with that in terms of the spec talking to the GraphQL servers.

Fisher-Ogden: Bart Lawson, he's asked the question, how should you think about the granularity level of the Federated layer, the Federated GraphQL servers? When you're breaking this up, how should you think about splitting that up?

Shin: This is a two part question, we can talk about it in terms of the Netflix Studio organization, and then broader. For Netflix Studio, each team was already working in silos, just because of the way that Netflix Studios grew organically. Because of that they were owning their own part of the graph anyway, when we were dealing with the GraphQL monolith. We were able to take that model and apply it into the Federated world. In terms of where you might think of your organization, I would say, I think you would think of it similarly. Who owns the data? Is there a way that you can organically create these relationships between your front-end engineers who are consuming the graph, and the people who are actually delivering it from the back-end side?

Fisher-Ogden: Then two related questions on that schema management, and that ownership. John and Christian have both asked. John asked, how do you avoid teams creating similar schema entities in the federated environment? Is there any governance around the schemas?

Shin: This is really interesting, because this is pretty new. Apollo just open sourced their Federation spec, it was last year, sometime. Netflix, I think, has one of the biggest federated graphs out there, pretty much in the world. One of the things that we're seeing is that you create a person entity or a movie entity, and any other graph service can extend it. Then just start adding fields to it. We actually have a schema working group that meets weekly. We have a schema architect that understands the entire surface area of the graph of studio. We talked about controlled chaos, it aligns the teams to have a similar ideology or methodology behind the schema.

Spalding: To add on that, it's one of the advantages of these architectures that you're separating your schema and the design of your API from the implementation. That makes it possible to have this working group that can be front-end engineers, back-end engineers, a data architect that's not focused on the engineering part of it. They can have that API discussion separately from the back-end engineers implementing it.

Fisher-Ogden: Then similarly, on that, how about authorizers, approvers, security dimensions, how do you think about that? After that, we'll go back to some questions about the gateway, and there's questions about back-end data sources behind it?

Spalding: As far as authorizing and approval for the API itself, I think that is a question that is really for your organization and how you do things. Netflix tends to lean into freedom and responsibility. We try to provide context to allow people to make good choices for their API, and then give them the freedom and the responsibility to do that. That maybe isn't the right choice for every organization, and so you can put more oversight on that. As far as authorization, we do the authentication at the gateway level. That's centralized. Then the authenticated user information is provided to these individual federated graph services. They are able to make authorization decisions that are right for the domain of the data that they're providing.
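The split between centralized authentication and per-service authorization can be sketched as two functions. This is a toy model under obvious assumptions: the token table, the role names, and both function names are hypothetical, and a real gateway would verify credentials against an identity system rather than a dict.

```python
# Hypothetical token store; a real gateway would verify tokens with an identity service.
TOKENS = {"token-abc": {"user": "alice", "roles": ["scheduler"]}}


def gateway_authenticate(headers):
    """Centralized step: turn a credential into authenticated user information."""
    user = TOKENS.get(headers.get("Authorization"))
    if user is None:
        raise PermissionError("unauthenticated")
    return user


def scheduling_service_resolve(field, user):
    """Domain step: the graph service decides what this user may read."""
    required_role = {"schedule": "scheduler", "budget": "finance"}[field]
    if required_role not in user["roles"]:
        raise PermissionError(f"{user['user']} may not read {field}")
    return f"{field} data for {user['user']}"


user = gateway_authenticate({"Authorization": "token-abc"})
allowed = scheduling_service_resolve("schedule", user)

try:
    scheduling_service_resolve("budget", user)
    denied = False
except PermissionError:
    denied = True
```

The gateway never needs to know which roles guard which fields; that knowledge stays with the team that owns the data.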

Shin: Part of it too is we shouted out to the talk that's going to happen on the 18th for Paul and Kavitha as well. For Netflix, we'd have this idea of a paved path. It's basically like, if you really want to do things the way that Netflix does them, the platform team is going to make it super easy to just get on that paved path, so everybody has a single way of doing authentication, for instance. That helps us at least even though we have many different graph services out there, have a similar stack across them.

Fisher-Ogden: Let's talk about the gateway itself a bit more. There's some questions around, why did we choose to implement it ourselves? Then also the performance, the caching, a comparison to Apollo's version?

Shin: One of the things I think was just that this was going to be such a huge bet for Netflix. We were really betting our entire future of the architecture of studio on this new technology and this framework. Because it was such a big bet, we wanted to make sure that we owned that code, and we had control over that going forward. That was one big concern for us.

Spalding: Whenever you're doing something like this, if there are things in the community you can leverage, you want to do that. We considered all layers of leverage, including using the code. Then there's also internal leverage that we could get. Within Netflix, we have platform teams that develop a lot of libraries that are for our microservice ecosystems, just service discovery and things like that. We could leverage those best by writing our own, that runs on the Java Virtual Machine, and still get a lot of leverage from the community by the fact that we're using GraphQL, and we're using this open spec as opposed to doing our own Federation protocol on top of GraphQL.

Fisher-Ogden: Then, how about the caching concerns?

Spalding: As far as caching, we are letting the Federated services themselves make the decisions on caching. Because whenever you start caching things, you open up a Pandora's Box of questions about how you manage and expire those caches. We want to make sure that those domain experts are able to make the right decisions there.

I think there was also in that same question, asking about optimization, query optimization. Then that is something that the gateway does, as far as the query plan is optimized to minimize the end-to-end latency of a query. It's broken up, and parts that can be done in parallel, are done in parallel. Then, also, anything that can be batched together is batched together, so you have a minimum number of requests that are going to the GraphQL services.
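The batching idea mentioned here, collapsing many per-entity lookups into the fewest possible requests, can be shown with a short Python function. This is a generic illustration, not the gateway's optimizer; the `batch` function and the shape of its input are assumptions.

```python
from collections import defaultdict


def batch(lookups):
    """Group (service, entity_id) lookups into one de-duplicated request per service."""
    grouped = defaultdict(list)
    for service, entity_id in lookups:
        # Keep first-seen order and skip IDs already queued for this service.
        if entity_id not in grouped[service]:
            grouped[service].append(entity_id)
    return dict(grouped)


# Four individual lookups collapse into two requests.
requests = batch([("video", 1), ("images", 1), ("video", 2), ("video", 1)])
```

In this example the video service receives a single request for IDs 1 and 2, and the images service receives a single request for ID 1.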

Shin: If you remember the part of the talk where we were talking about the OSI reference model, we really want to keep the gateway as dumb as possible. Basically, just make it as a routing layer, as opposed to something that holds on to information and becomes stateful.

Fisher-Ogden: There's a bunch of questions about how to think about data sources or how to think about connecting to REST, how to think about HTTP/1.1 and HTTP/2.

Spalding: Our team just released a blog post on this. If you look for our Netflix tech blog, GraphQL. You'll find it. Check that out. It provides a lot of great details.

Fisher-Ogden: What's one thing you want everybody to remember as they go off to the rest of their day and the rest of the talks and everything else in life?

Shin: The presentation hit it at the very end. I think that my point would be that Federation is like a philosophy. It doesn't have to just be about GraphQL or about a certain technology, but really if you have an aggregation layer, as Stephen said, take the logic outside of the aggregation layer so that you can have a single API, but also, it's not beholden to the speed of the team that can actually open it up.

Spalding: If you're going to do something like this, as with all big projects, start small and move incrementally. Then, finally, go big or go home because this is ultimately about bringing your organization to another level of scale.

 


Recorded at:

Nov 23, 2020
