Go Far, Go Together - Growing the Netflix Federated Graph

Summary

Kavitha Srinivasan discusses the challenges of building a developer-friendly ecosystem in a sustainable way, to scale not just the graph but also the developers working with it.

Bio

Kavitha Srinivasan is a senior software engineer on the API Systems Team at Netflix. Over the past few years, she has been working on the Domain Graph Services framework, an open source framework for building Spring Boot-based GraphQL services, and related GraphQL tooling.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Srinivasan: House of Cards was a huge game changer for Netflix. This political drama premiered in 2013. Since then, Netflix has gone on to produce several more successful shows, such as The Witcher and Stranger Things. Producing original content involves many different functions, such as talent management, budgeting, and post-production. To facilitate this, we have many backend and frontend teams collaborating closely together in the studio domain. We adopted GraphQL fairly early on as a way to expose a unified GraphQL API for our clients, so that they didn't have to talk to individual backend services to fetch data. We implemented this API as a GraphQL monolith. Our initial graph was fairly small, with only a few teams participating in it. Over time, this graph grew, and we found ourselves having to add more custom business logic in the monolith. This model didn't scale very well for us. We started looking into Federated GraphQL as a way to distribute ownership of this graph across several teams, while still maintaining a single unified API. Three years later, our adoption of Federated GraphQL has been fairly successful. Not only do we have the Studio Graph, which continues to grow, we also have more graphs: one for our internal platform teams, and another in the consumer space, which powers the discovery experience on the Netflix UI.

Outline & Background

Today's talk is not so much about Federated GraphQL itself, but more about the platform, including the framework and the tooling that we've built in order to facilitate and drive adoption of Federated GraphQL. I'll talk through some of the challenges we faced at various stages of adoption, and how our platform has evolved to keep up with these challenges. I'm Kavitha Srinivasan. I'm an engineer on the API platform team at Netflix. Our team is responsible for owning and operating the Federated GraphQL gateway, and also for driving adoption of Federated GraphQL within the company.

GraphQL Schema Definition

To help set some context, let's spend some time reviewing the basics of GraphQL. Here I have an example schema for a show service. We have three main root types: Query, for defining APIs to fetch data; Mutation, for defining APIs to mutate data; and Subscription, which allows clients to subscribe to events from the server. Here I have a shows query that takes in a collection of show IDs and returns a collection of show objects. My Show type is defined as having the following fields: showId, title, and reviews. The Review type has a rating field. Given the schema, here's an example GraphQL query that a client might send to the service: I can request shows, specifying my showId. In my response, I'm selecting my showId and the rating that belongs to the Review type. You'll note that I haven't actually requested the title field. Therefore, I don't get back that data in my response. This illustrates the power of GraphQL, wherein clients can specify exactly the data that they want in the response. They get nothing more and nothing less.
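
As a rough sketch, the schema being described might look like the following. Field names such as starRating are borrowed from the later examples in this talk, and exact types and nullability are assumptions:

```graphql
# Mutation and Subscription root types omitted for brevity
type Query {
  # Takes a collection of show IDs and returns the matching shows
  shows(showIds: [String!]!): [Show]
}

type Show {
  showId: String!
  title: String
  reviews: [Review]
}

type Review {
  starRating: Int
}
```

An example client query against this schema, which requests only showId and the review rating and therefore gets no title back in the response:

```graphql
query {
  shows(showIds: ["1"]) {
    showId
    reviews {
      starRating
    }
  }
}
```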

Why GraphQL?

Why GraphQL? As we just saw, we don't have the classic problem we have with REST APIs, where we under-fetch or over-fetch data: you get exactly what you asked for. It is strongly typed, and you have a clear schema contract. All GraphQL services are able to publish their schema to make it easily discoverable by clients. This facilitates better schema collaboration and forces a conversation early on, so that we can design an API that meets both the client's and the backend developers' needs for the business use case at hand. There's no versioning with GraphQL. Any new fields are additive. If you want to remove fields, you can follow a nice deprecation workflow. In our company, we implement this by tracking client usage stats, and we can remove fields from the schema once we have determined that there are no usages of that field.
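
As an illustration of the deprecation workflow mentioned here, GraphQL's built-in @deprecated directive marks a field for removal; the originalTitle field below is purely hypothetical:

```graphql
type Show {
  showId: String!
  title: String
  # Hypothetical field, kept in the schema until client usage stats drop to zero
  originalTitle: String @deprecated(reason: "Use title instead")
}
```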

Federated Ownership

This is what our initial studio architecture looked like in the early days. We had clients talking to one or more backend services over many different mechanisms: gRPC, REST, and in some cases even GraphQL. There was no consistency in the implementation of fetching this data. There were multiple sources of truth. We instead decided to switch to using a unified GraphQL API. We implemented this as a GraphQL monolith. Now clients could interact directly with this monolith instead of reaching out to the backend services. The monolith did all the heavy lifting of translating incoming GraphQL queries into the corresponding calls out to the backend services. This was great for the clients. However, it didn't scale for us. We found ourselves having to add more custom business logic in this monolith. This is when we started exploring a different federated ownership model. What we wanted, ideally, was for the backend service teams to also own the GraphQL API that they exposed. We wanted to slim down this monolith by removing all the custom business logic. We wanted teams themselves to stand up GraphQL services and expose their APIs, while still being part of the same unified API that's exposed as a federated graph. The gateway now simply becomes a basic router that translates incoming GraphQL queries into outgoing GraphQL calls to different services. In 2018, Apollo published the federation spec that allowed us to do exactly that. It provided a way to share types within the same graph, while being able to distribute ownership of the graph such that each team owned a subgraph. With that, we have clients still talking to the federated gateway, and the federated gateway in turn talks to individual GraphQL services. We call these domain graph services, or DGSs, in this architecture. The DGSs would be owned by the corresponding backend teams that own the data itself, and would be responsible for providing and registering their schema with the gateway. This is how the gateway would know which services to talk to in order to fulfill an incoming GraphQL query.

Federated GraphQL Architecture

Let's take a look at a more concrete example. Here I have the shows query from our previous example. We have the client asking for a show given an ID of 1, and requesting the title and a reviews field with a starRating. In this example, we have two DGSs. One is the shows DGS. It exposes a Show type, with an id field and a title field. We have a reviews DGS that extends the Show type using the @extends directive, and adds a reviews field to the Show type. Both of these schemas are registered with the gateway. For an incoming query, the gateway is now able to know that it needs to contact the shows service in order to fetch the title field, and the reviews service in order to fetch the reviews field. It then collates the responses from these two services into one single response and sends it back to the client. This is how federation works at a very high level. Naturally, this is a significant departure from our previous architecture model. It involved a whole migration.
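
A hedged sketch of the two subgraph schemas described here, using the Apollo Federation v1 style directives that the DGS framework supports (@key, @extends, @external); exact field names and key fields are assumptions:

```graphql
# Shows DGS: owns the Show type
type Query {
  shows(showIds: [String!]!): [Show]
}

type Show @key(fields: "id") {
  id: ID!
  title: String
}
```

```graphql
# Reviews DGS: extends the Show type owned by the shows DGS and adds a reviews field
type Show @key(fields: "id") @extends {
  id: ID! @external
  reviews: [Review]
}

type Review {
  starRating: Int
}
```

For an incoming query selecting title and reviews { starRating }, the gateway resolves title from the shows DGS, resolves reviews from the reviews DGS, and merges the two partial results into one response.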

How Do We Make It Easy?

This didn't come without challenges. We were asking more than 40 teams who were already part of this unified API graph to now own new GraphQL services and to learn GraphQL, which was fairly new to most folks. In addition to that, we were asking them to learn Federated GraphQL, which was completely new for most of us at the company. We were also asking them to rethink their APIs, because we wanted to focus on schema-first development, such that the APIs made sense in a composed graph. How do we make it easy? We wanted to focus on ease of onboarding so that teams could easily stand up and implement their GraphQL APIs. We wanted to provide Netflix integrations out of the box, because we have a lot of custom metrics and AuthZ that developers need to integrate with whenever they're standing up a service. We wanted to enforce best practices and some consistency of patterns in the implementations. In short, we wanted to provide a great developer experience to make it as easy as possible to migrate to this new architecture.

Domain Graph Service Framework

The first thing we set about doing was making it easy to implement a GraphQL API. We implemented the domain graph service framework to do exactly that. The domain graph service framework is built on top of Spring Boot and GraphQL Java, and makes it easy to wire up your code with GraphQL Java. We provided several features out of the box. The first is a Spring annotation-based programming model, wherein by using these annotations, the framework automagically wires up the GraphQL Java code in your service. We also implemented a code generation Gradle plugin that takes your schema and generates the corresponding Java classes that can be used in your implementation. We provided some basic federation support out of the box. We also implemented a lightweight test framework to be able to test your GraphQL data fetchers, or resolvers, easily without having to write heavyweight smoke tests. We provided tracing and metrics out of the box as well, so these didn't have to be integrated manually by developers. Over time, we started adding more features such as subscriptions and file uploads.

We organized our project into several modules so that developers could opt in and out of certain features. To begin with, as I mentioned, we built on top of Spring Boot. The original idea behind the framework was to provide Netflix integrations out of the box, so we started adding support for AuthZ, metrics, tracing, and logging. We also wanted to make it easy to wire up GraphQL Java code, and so our DGS core module provided the bulk of these Spring-style annotations to do exactly that. Then, as I mentioned, we built on top of that and started adding more features such as subscriptions and support for file uploads. Here are some examples of the annotations I was referring to earlier. We have @DgsQuery, @DgsMutation, and @DgsSubscription. These let you automagically wire up your data fetcher code and register it with GraphQL Java, so that for an incoming GraphQL query, it is able to invoke the implementation that you've registered with it. We have @InputArgument, which lets you extract input parameters easily in your implementation of the data fetcher. We have many more such convenience annotations that help you implement your GraphQL service.
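
To complement the query example that follows later in this talk, here is a hedged sketch of what a mutation and a subscription data fetcher look like with these annotations. The addReview mutation and reviewAdded subscription fields are hypothetical, and the sketch assumes the usual DGS convention that a subscription data fetcher returns a Reactive Streams Publisher:

```java
import com.netflix.graphql.dgs.DgsComponent;
import com.netflix.graphql.dgs.DgsMutation;
import com.netflix.graphql.dgs.DgsSubscription;
import com.netflix.graphql.dgs.InputArgument;
import org.reactivestreams.Publisher;
import reactor.core.publisher.Flux;

import java.time.Duration;

@DgsComponent
public class ReviewLifecycleDataFetchers {

    // Minimal stand-in for a Review type that would normally be generated from the schema
    public record Review(Integer starRating) {}

    // Hypothetical mutation field `addReview(showId: String, starRating: Int): Review`
    // on the Mutation root type, wired up by @DgsMutation
    @DgsMutation
    public Review addReview(@InputArgument String showId, @InputArgument Integer starRating) {
        // A real implementation would persist the review; here we just echo it back
        return new Review(starRating);
    }

    // Hypothetical subscription field `reviewAdded(showId: String): Review` on the
    // Subscription root type; the framework consumes the returned Publisher
    @DgsSubscription
    public Publisher<Review> reviewAdded(@InputArgument String showId) {
        // Emit a dummy review every second as a stand-in for a real event stream
        return Flux.interval(Duration.ofSeconds(1)).map(tick -> new Review(5));
    }
}
```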

Build a DGS

Let's take a quick look at how you can actually build a DGS, to demonstrate how easy it is. I start with a Spring Boot project initialized using the Spring Boot initializer. I go ahead and add my DGS dependencies, which in this case is the starter. I can optionally add in the code generation plugin. This lets me generate Java classes based on the schema defined in my GraphQL service. Here I have a quick demo set up to show how we can implement a DGS. The very first thing I do is set up my dependencies. In my build.gradle file, I go ahead and add my optional code generation plugin. This is a Gradle plugin. Then I can go ahead and add my DGS starter and the platform dependencies that the DGS requires. The next step is to define a schema. Here, I'm going to implement a reviews service to complement the show service. I have a reviews query that returns a Review type. The Review type has a starRating field. The next step is to implement my data fetcher itself. Here, I've pre-populated it with some static hardcoded data, just to return from within my data fetcher implementation. You'll notice that the Review class doesn't actually exist yet. Once I build this project, code generation is going to kick in and build all the Java classes based on the schema that I've defined. After I've built, I should have the generated Review class available for me to use. These are generated in your build/generated folder. All the types that you've defined in your schema are viewable there. My generated class also comes with a nice builder-style API, so I can construct these objects in my code easily.
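
As a sketch, the build setup and schema being described look roughly like this. Versions are omitted, and the artifact coordinates shown are the ones the DGS project has used publicly; treat them as assumptions that may have changed since this talk:

```groovy
// build.gradle (sketch; plugin and dependency versions omitted)
plugins {
    id 'org.springframework.boot' version '<spring-boot-version>'
    id 'java'
    // Optional DGS code generation plugin: generates Java classes from the schema
    id 'com.netflix.dgs.codegen' version '<codegen-version>'
}

dependencies {
    // BOM that keeps the DGS module versions aligned
    implementation platform('com.netflix.graphql.dgs:graphql-dgs-platform-dependencies:<dgs-version>')
    // The DGS starter itself
    implementation 'com.netflix.graphql.dgs:graphql-dgs-spring-boot-starter'
}

generateJava {
    // Package for the generated types such as Review; schema files are picked up
    // from src/main/resources/schema by default
    packageName = 'com.example.reviews.generated'
    // Also generate the type-safe query builders used later in the tests
    generateClient = true
}
```

And the schema for the reviews service, placed under src/main/resources/schema:

```graphql
type Query {
  reviews(showId: String): [Review]
}

type Review {
  starRating: Int
}
```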

Let's go back to the data fetcher itself. Now that we've generated these classes, I'm able to use them in my implementation. The next step is to actually indicate to the framework that this is going to be my data fetcher implementation. I first annotate the class with @DgsComponent. Any classes annotated with this are processed specially by the framework. Now I can go ahead and implement the code for my query. I start with @DgsQuery to wire up my data fetcher. Here I'm going to set up the implementation for the reviews query. The reviews query returns a list of Review objects, and it takes in a showId. I can use the @InputArgument annotation to extract this input parameter from the data fetching environment that GraphQL Java provides. In the schema, the showId is defined as a string, so I can just use that in my code. Given the showId, all I need to do is look up the corresponding Review objects and return those. That's all it takes. It's just that simple. Once we build and start the service, I should be able to start executing GraphQL queries against it. GraphiQL is an editor that lets you execute GraphQL queries, and it's integrated out of the box. Here, I can browse my schema, and I can also craft queries to execute against this service running locally on my machine. Here, I have a reviews query specifying a showId. In this case, I'm going to use a showId of 2, because I know I've set up some data for that showId. Then I'm going to select the starRating field. Given this query, I can now execute it and get some data back from my service. I also have some tracing extensions integrated; that's the extra data that you see here in the response. If I provide an invalid showId, I get no data back.
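
Putting those pieces together, a minimal sketch of the data fetcher being described. The hardcoded data is illustrative, and the nested Review record stands in for the class that codegen would normally generate:

```java
import com.netflix.graphql.dgs.DgsComponent;
import com.netflix.graphql.dgs.DgsQuery;
import com.netflix.graphql.dgs.InputArgument;

import java.util.List;
import java.util.Map;

@DgsComponent  // Tells the DGS framework to scan this class for data fetchers
public class ReviewsDataFetcher {

    // Minimal stand-in for the generated Review type
    public record Review(Integer starRating) {}

    // Static, hardcoded reviews keyed by showId, standing in for a real data source
    private final Map<String, List<Review>> reviewsByShowId = Map.of(
            "1", List.of(new Review(5), new Review(4), new Review(3)),
            "2", List.of(new Review(5), new Review(2)));

    // Wired up as the data fetcher for the `reviews` query; @InputArgument extracts
    // showId from the data fetching environment that GraphQL Java provides
    @DgsQuery
    public List<Review> reviews(@InputArgument String showId) {
        // An unknown showId simply returns no data, as in the demo
        return reviewsByShowId.getOrDefault(showId, List.of());
    }
}
```

The query run in GraphiQL against the locally running service would then look something like:

```graphql
{
  reviews(showId: "2") {
    starRating
  }
}
```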

Test a DGS

We took a look at how to implement a DGS. The next step is to test it. Here I have a simple test set up. It's a Spring Boot test. You'll note that I've only specified the classes that I want to use in this test. Next, I use a DGS query executor, with which I'm going to invoke the API to execute a query. The first thing I want to set up is my query string itself. Let's take the same query string example that we just saw in the demo earlier. I have the reviews query. I specify the showId. The showId itself is a string-typed input, so that's why I'm escaping it here. Then I select the starRating field. With this query string, I should be able to pass it to the query executor, execute the query, and test the output. Instead of manually writing the query string, I can also programmatically construct it using the code generation plugin. It generates a nice type-safe client API and gives you a builder-style API that you can use to construct your query. Here, I have a reviews GraphQL query, and I pass in a showId of 1. I can also select the fields that I want in my response using the ProjectionRoot. In this case, it's the starRating. Once I serialize this, I get the exact same query string that I had passed in earlier. Now I can go ahead and run this test, and I should be able to get back the result and set up some assertions to validate it as part of my test. Instead of the plain execute call, where I'd have to inspect the execution result, I can also use a different API available for tests on the query executor. In this case, I can use executeAndExtractJsonPathAsObject and specify the path into my result data. Here, I have data.reviews. This is extracted and given to me as a list of reviews that I can go ahead and inspect and write assertions on. With this, I'm able to get back a list of reviews. In my test, I can now go ahead and assert that I get, for example, a certain number of reviews for showId 1. In this case, I know I'm going to get three reviews, because that's how we've set it up in the service. This should set up a passing test. That's pretty much it. That's how simple it is to write your test.
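
A hedged sketch of the test being described, assuming the codegen-generated client classes (ReviewsGraphQLQuery, ReviewsProjectionRoot) and the Review type exist in the project; exact class and package names may differ between DGS versions:

```java
import com.jayway.jsonpath.TypeRef;
import com.netflix.graphql.dgs.DgsQueryExecutor;
import com.netflix.graphql.dgs.autoconfig.DgsAutoConfiguration;
import com.netflix.graphql.dgs.client.codegen.GraphQLQueryRequest;
import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;

import java.util.List;

import static org.assertj.core.api.Assertions.assertThat;

// ReviewsDataFetcher, Review, and the generated client classes are assumed to be
// on the classpath (e.g. in the same package as this test)
@SpringBootTest(classes = {DgsAutoConfiguration.class, ReviewsDataFetcher.class})
class ReviewsDataFetcherTest {

    @Autowired
    DgsQueryExecutor queryExecutor;

    @Test
    void reviewsForShow() {
        // Build the query with the generated, type-safe client API instead of
        // hand-writing the query string
        GraphQLQueryRequest request = new GraphQLQueryRequest(
                ReviewsGraphQLQuery.newRequest().showId("1").build(),
                new ReviewsProjectionRoot().starRating());

        // Execute the query and extract data.reviews directly as a list of Review objects
        List<Review> reviews = queryExecutor.executeAndExtractJsonPathAsObject(
                request.serialize(), "data.reviews", new TypeRef<List<Review>>() {});

        // The data fetcher seeds three reviews for showId 1
        assertThat(reviews).hasSize(3);
    }
}
```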

Domain Graph Service OSS

We just saw how to easily implement a GraphQL service using the domain graph service framework, and how to test your data fetcher code. While our initial idea behind the framework was to offer Netflix integrations out of the box, over time we started adding more features, and it became evident that the bulk of the framework could be easily used by the general GraphQL community itself. That's when we considered open sourcing the framework. We were able to take advantage of our modular structure to easily create open source and Netflix versions of our starters. Originally, we just had the Netflix starter that bundled all the Netflix-specific modules and the core modules together. We were able to separate these out into a separate OSS DGS starter that contained the core, non-Netflix functionality. Then we had a Netflix starter that depended on this OSS starter and added additional features as modules, such as metrics, tracing, and AuthZ. This way, we were able to avoid forking our repository, and we had a nice, easily consumable model to maintain both the open source framework and the internal Netflix framework together. We open sourced the framework in early 2021. Since then, we've gotten some amazing feedback, contributions, interest, and discussions from the community. This has really helped evolve the framework and keep it moving forward. The open sourced version of our domain graph service framework does not need a Netflix gateway to work with it. It works just as well with the Apollo gateway or router. It also works very well for implementing a plain GraphQL service, with or without federation. To check out our projects, you can take a look at GitHub for both the DGS framework, https://github.com/Netflix/dgs-framework, and the DGS code generation tool, https://github.com/Netflix/dgs-codegen.

DGS IntelliJ Plugin

To complement the framework, we also built a DGS IntelliJ plugin. Today, it lets you navigate between the schema and the corresponding data fetchers. It also provides some basic code hints and auto-fixes, and gives you a nice view of all the DGS components in your project. We built this on top of the existing js-graphql plugin, which is widely used by the community. We were able to leverage a lot of the features provided by the js-graphql plugin for GraphQL-specific functionality, and simply added DGS development features on top of that in our DGS IntelliJ plugin. Let's take a quick look at what this plugin does for you. I mentioned navigation between schema and data fetchers. Here you'll see that by clicking on the DGS icon, you can navigate between the exact schema definition and the corresponding implementation you have in your DGS service. I can also take a look at all the components I have defined in my project in the window on the side.

Here I have an example of some code hinting and fixing. We made use of @InputArgument earlier. The reviews query takes a showId as input. Let's say I didn't actually have that in my implementation: I get hints and auto-fixes suggesting that I can use an @InputArgument in my data fetcher implementation. It also gives you some basic detection of types and checking of names. If I have a different name for my argument, it will detect that and hint about it. If I have mismatched types, it can catch those bugs as well. IntelliJ plugin development has taken off in a big way at Netflix. It gives you access to the abstract syntax tree, thereby helping you build inspections in order to highlight issues, automatically refactor your code, and enforce best practices. It's also been really instrumental in migrating deprecated code to newer-style code. Our IntelliJ plugin is also open sourced and available on GitHub.

Schema Registry (Reggie)

So far, we saw how we could easily implement and test a GraphQL service. The next step is to actually make the service part of the graph. We use the schema registry to manage DGSs and their schemas. Reggie is a UI that fronts our schema registry, and it's a one-stop shop for DGS developers and clients alike. With Reggie, to begin with, I'm able to browse all the DGSs that are part of this federated graph. Here's the list for the graph that you chose. You can look at all the DGSs, and you can look up a particular DGS. In this case, I have an example DGS set up. For my DGS, I can go ahead and browse the schema. The registry also keeps track of the revision history for the schema, so I can take a look at all the changes that have been made, when they were made, and diff between the different versions of the schema. I have client usage stats, which tell me which fields are being used by which clients and how often. This is useful for enforcing deprecation workflows: I can monitor to see when my client usage goes down to zero before I remove a field from my schema. It integrates nicely with your code repositories, so I can easily navigate there. I also get a view into the metrics dashboard that's available out of the box with the DGS framework. Here I can monitor the GraphQL errors, the latencies, and what queries my service is fulfilling. I also have a GraphiQL editor here that lets you run federated queries against the gateway itself. I can browse the entire federated graph, and craft queries based on it that will be fulfilled by the gateway directly.

Federated Tracing

We took a look at how you can register a DGS to be part of the federated graph. Once your service is deployed, you want to be able to monitor it. We built an in-house tool for federated tracing, built on top of Zipkin traces. Here you get a nice consolidated view of the call graph. Given a query, it shows you the fan-out pattern of calls that the gateway makes to all the services. It also gives you consolidated logs for the services that the query was fulfilled by. It also gives you a nice view into the timeline, so you can drill down into the latencies for executing the query at each step along the way.

Documentation & Training

Last, but most importantly, there's documentation and training. We put a lot of effort into providing comprehensive documentation for our tooling and framework. We've also invested a lot of effort in putting together training-style videos and boot camps, to help developers self-serve and onboard quickly. This has been key and highly instrumental to the success of adopting Federated GraphQL at Netflix.

Growing Pains

Three years later, here we are: we have the Studio Graph, which contains almost 200 DGSs and continues to grow. We have the enterprise graph, primarily used by our internal platform teams. We have the newest consumer graph, which powers the discovery experience that you see on the Netflix UI. Over time, we started seeing newer patterns. We had clients wanting to request data from multiple graphs, in this case the studio and the enterprise graph. We also started seeing DGSs wanting to be part of more than one graph. It just made a lot of sense to merge these two together into one bigger federated graph, the one graph, which today has more than 250 DGSs. This came with a newer set of challenges. To scale this graph, we had to make sure we were able to collaborate on a much larger schema, and we were not set up for this just yet. Schema discovery was becoming a lot more painful because our tooling couldn't keep up with this large a graph. Schema governance was also becoming challenging. This has been largely manual for us at Netflix, and it was becoming more difficult to keep track of all the types across the federated graph and ensure it all made sense from a one-graph standpoint. We had challenges with newer deprecation workflows, and also with migrating fields from one service to another; we needed better workflows to facilitate this. Also, as I mentioned, because our schema is now much larger, the tooling has been impacted with respect to discovery and code generation; all of it hadn't scaled well to suit this larger graph.

Improved Schema Workflow

At Netflix, we lean heavily on the schema-first approach. We have backend and frontend developers iterate on the schema to ensure that the API makes sense for that particular business use case. We have a schema review committee, consisting of a group of folks who monitor these schema changes and PRs, and make sure that it all makes sense from a composed-graph perspective. Then, once it has passed the initial review, we have the backend teams go ahead and implement the schema. These changes are staged and tried out by our UI teams, and then they iterate on the implementation a few more times before it actually makes it out into production. We wanted to automate some of the manual work here, and so we introduced more tooling. We introduced Graph Doctor to automatically catch schema errors and enforce some best practices, to alleviate the burden on the schema review committee. Then we also introduced Graph Labs, which allows developers to stage their schema changes for trial by UI teams. This avoided the need to have your service completely deployed in the test environment in order to do a full end-to-end test.

This is what the Graph Doctor integration looks like today. I have a PR proposing some schema changes, and with the Graph Doctor integration, I automatically get feedback around best practices and annotations, and it catches errors in my schema proposal. We also wanted to enable local federated testing with the schema. Today it is pretty easy, once you've made local schema changes to your DGS, to test those changes directly with GraphiQL. In a federated setup that involves the gateway and other DGSs, this is much more challenging to do. Initially, we tried to set up a mechanism in which developers could run a local gateway on their host as well. At first, they had to clone the gateway repository and run the gateway alongside their local DGS. This was proving to be challenging, so we provided an image that they could run in a Docker container. This was still fairly heavyweight. You had to run the gateway and your local DGS, and run a set of commands to indicate to the gateway that it needed to communicate with your local DGS. It was all fairly cumbersome to set up for the developer, and also really hard to troubleshoot when the setup didn't quite work right. We improved this with Graph Labs. Graph Labs allowed us to set up a nice sandbox environment by eliminating the need to run the gateway locally. In this new setup, the gateway is actually running in the cloud. With a setup command, we are able to indicate to the gateway that it needs to talk to the DGS running locally on your host. For all other fields, it still reaches out to the other DGSs that are registered with it and deployed in test. This greatly simplified the workflow of testing federated queries against your local DGS changes. It has also worked great as a tool to help developers unfamiliar with federation understand how it works end-to-end.

Graph Lenses

I mentioned that we had challenges with discovery in the graph and with our tooling as well. Working with the entire graph became quite cumbersome. We introduced the concept of Graph Lenses, which basically gives you a subview into the graph. With predefined lenses, you are now able to browse, discover, and work with just a subset of the types that are part of the entire graph. With the introduction of lenses, we updated our schema registry UI, Reggie, to support them. This is what it looks like today. We have predefined lenses that give you subviews into the entire composed graph. Here I'm selecting my Spotlight lens. With the Spotlight lens selected, instead of viewing the entire graph, I'm able to select from just the subset of queries that are supported by this lens.

What's Ahead?

What's ahead for us? So far, I've described the GraphQL platform, consisting of the DGS framework and several tools we built to help with schema workflows and with working with the federated graph in general. Next, we're helping build out the consumer graph. This is the API that powers the discovery experience you see on the Netflix UI today. We are in the process of migrating away from the older architecture to adopt Federated GraphQL for this. The consumer graph is unique in that it comes with new requirements: it deals with very high scale. We are in the process of identifying performance bottlenecks and improving the framework and tooling to keep up with this. We're also investing heavily in improving observability so that we can monitor these services more effectively. Finally, this graph comes with complex schema design requirements, and we're trying to improve our schema workflow tooling to assist with this effort. We continue to learn as we go and hope to share what we've learned in the near future.


Recorded at: Oct 11, 2023
