Scaling GraphQL Adoption at Netflix

Summary

Tejas Shikhare discusses how Netflix migrated to GraphQL and some of the problems they had to solve scaling it.

Bio

Tejas Shikhare is a Senior Software Engineer at Netflix where he works on the API Systems team. He has spent the last 4 years building Netflix's Federated GraphQL platform and helped to migrate Netflix’s consumer facing APIs to GraphQL. Aside from GraphQL, he also enjoys working with distributed systems and has a passion for building developer tools and education.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.

Transcript

Shikhare: GraphQL is an alternative communication protocol for APIs between the client and the server. In GraphQL, we have a schema that describes the data graph. This data graph is all the data that you can fetch through the schema. The schema is formed of types, which contain fields. These fields reference other types. This is what a schema looks like in the GraphQL schema definition language. We also have a set of root types that we call Query, Mutation, and Subscription. These are the entry points to the graph. With these entry points, we can construct a query. This query is a tree-based structure. The query can be as big as you want, or as small as you want. In this particular query, we want a list of movies. For each movie, we want a title. Now we can take this query, and it's typically packaged into an HTTP POST request and sent to the GraphQL server. The GraphQL server processes the query and then returns a response, typically a JSON object. Very simply put, GraphQL gives you the ability to fetch exactly the data you want from the server, not more, not less. That's it. That's GraphQL in a nutshell.
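As a rough sketch of the request and response shape described above: the query is a string carried in a JSON body, and the server returns only the fields the query selects. The schema, field names, and sample data here are illustrative assumptions, not Netflix's actual API, and the selection tree is written by hand because the sketch does not include a query parser.

```python
import json

# Hypothetical schema in GraphQL SDL, for illustration only.
SCHEMA = """
type Query {
  movies: [Movie]
}

type Movie {
  title: String
  year: Int
}
"""

# The query is a tree: a list of movies and, for each movie, its title.
QUERY = "query { movies { title } }"

# A GraphQL query is typically sent as the JSON body of an HTTP POST.
request_body = json.dumps({"query": QUERY})


def select(data, selection):
    """Return only the fields named in `selection` (a nested dict tree)."""
    if isinstance(data, list):
        return [select(item, selection) for item in data]
    return {
        field: select(data[field], sub) if sub else data[field]
        for field, sub in selection.items()
    }


# Stand-in for server-side data; the selection mirrors QUERY by hand.
DATA = {"movies": [{"title": "Movie A", "year": 2019},
                   {"title": "Movie B", "year": 2021}]}
response = {"data": select(DATA, {"movies": {"title": None}})}
print(json.dumps(response))
```

The `year` field exists on the server but is absent from the response because the query did not ask for it.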

Benefits of GraphQL

What's all the hype about? Why is everyone so excited about GraphQL? Let's talk about some of the benefits. One of the big benefits of GraphQL is to minimize roundtrips with aggregation. Since the query can be as big or as small as you want, we can fetch all the needed data in a single roundtrip. We can take a look at this quick example. These are the movie recommendations for me on the Netflix UI. You can imagine, we might have two APIs to support this UI. We have a movie recommendation API, and for each of the movie IDs that the recommendation API recommends, we can have an image API. Let's say they are deployed in U.S.-West in California, and I'm visiting my parents in Singapore, which is approximately 8600 miles from the server. When I open the Netflix app, I have to make two sequential requests to be able to render this UI. Now there are many ways to solve this problem. You could imagine, we can use something like GraphQL to aggregate so that the client can write a single query for the recommended movies and an image for each of them. Maybe in the future, we add badges. Then we add a badge API. Then we just update the query. It's still a single query. GraphQL is not the only way to solve this problem, this can be done with a REST API as well. But then you get this complex BFF architecture, and the REST API is not reusable. GraphQL provides a more reusable pattern for this aggregation or orchestration. Hence, GraphQL is a really good fit for consumer applications like Netflix.
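The roundtrip argument can be sketched with a toy model that only counts client-to-server hops. The service functions, IDs, and URL below are hypothetical stand-ins, and real latency is not simulated.

```python
# Hypothetical backend services; names and data are illustrative.
def recommendation_api(user_id):
    return ["m1", "m2", "m3"]


def image_api(movie_id):
    return f"https://example.invalid/images/{movie_id}.jpg"


class Client:
    """Counts client-to-server round trips (the slow, long-distance hops)."""

    def __init__(self):
        self.round_trips = 0

    def call(self, fn, *args):
        self.round_trips += 1
        return fn(*args)


def aggregated_query(user_id):
    # Runs next to the backend services, so these calls are short hops
    # inside the data center rather than client round trips.
    return [{"id": m, "image": image_api(m)} for m in recommendation_api(user_id)]


# Without aggregation: recommendations first, then images (sequential).
direct = Client()
ids = direct.call(recommendation_api, "user-1")
images = direct.call(lambda: [image_api(m) for m in ids])

# With aggregation: a single request returns everything the UI needs.
aggregated = Client()
rows = aggregated.call(aggregated_query, "user-1")

print(direct.round_trips, aggregated.round_trips)
```

Adding a badge API later would change `aggregated_query` on the server side, while the client would still make one request.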

This is a sample GraphQL schema. It might look similar to model classes in Java, because we can actually generate model classes, both on the server side and the client side. This gives us much more ease to write the code and send it to our API and back. There's also a clear indication of what's nullable and what's not. It's built into the language, so you can add an exclamation point to mark a field non-nullable. This reduces the churn caused by bugs in loosely typed APIs. It also forces collaboration between the client and the server teams. The strong typing is not just great as a contract. You can also build developer tools and power them using it. One other big benefit of GraphQL is it shines when it's implemented as a single graph for your organization. Because first it becomes a visual aid for all the data in your organization. Then it also becomes the connecting dots for all the different domains. Then you can write a query that crosses these domains. This is a really powerful paradigm.
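To illustrate how a typed schema maps to model classes, here is a hand-written sketch of what a generated client model might look like. The `Movie` type, its fields, and the `from_response` helper are assumptions for illustration; real code generators produce their own shapes.

```python
from dataclasses import dataclass, fields
from typing import Optional


# Hypothetical SDL type this model class mirrors:
#   type Movie { id: ID!  title: String!  tagline: String }
@dataclass
class Movie:
    id: str                         # ID!     (non-nullable)
    title: str                      # String! (non-nullable)
    tagline: Optional[str] = None   # String  (nullable)


NON_NULL = {"id", "title"}


def from_response(payload: dict) -> Movie:
    """Build a Movie, rejecting a null in any non-nullable field."""
    for name in NON_NULL:
        if payload.get(name) is None:
            raise ValueError(f"non-nullable field '{name}' was null")
    known = {f.name for f in fields(Movie)}
    return Movie(**{k: v for k, v in payload.items() if k in known})


movie = from_response({"id": "m1", "title": "Movie A"})
```

Because nullability is part of the contract, the client knows it must handle a missing `tagline` but can rely on `title` being present.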

Background & Outline

My name is Tejas Shikhare. I'm a senior software engineer at Netflix. For the past three years, I've been blessed to be part of this amazing team. I've been working on our Federated GraphQL platform. My focus has been on GraphQL and distributed systems. Most recently, I've also been working on developer tools and developer education. I'm a big fan of API stewardship. For our talk, we're going to start by cataloging two of the common architecture patterns for GraphQL in the industry. We're going to dive deep into the Federated architecture, which is what we're doing at Netflix. Then we'll jump into some of the migration challenges, and some strategy recommendations for you.

Monolithic vs. Federated GraphQL

GraphQL was open sourced by Facebook in 2015. Since then, two core patterns have emerged across the industry. The most common way to implement GraphQL is in the monolithic architecture. Why? Because we want the one graph that we saw earlier. In small companies, the GraphQL service is just part of your core monolith. It is built within it. In some bigger companies you can have the GraphQL layer separate talking to the monolithic layer, or it could be talking to your microservice architecture. We've also seen that GraphQL service can be a BFF, Backend for Frontend owned by the UI teams. Or it could be a backend service, an aggregation service. Really, it's always owned by an API or a GraphQL team. This is how we started at Netflix too. This is an oversimplified view of the Netflix architecture. After we adopted microservices to scale our teams, we quickly discovered the need for an API layer to bring together and orchestrate everything for the UIs. We created this service called DNA API. Except, GraphQL was not invented yet. Facebook was still working on it internally, and it was not open source. We developed a similar technology called Falcor, which is actually open source. It just works like GraphQL, but it just didn't take off like GraphQL did. Both Falcor and GraphQL actually came from the same problem space. At Facebook, it was the newsfeed team trying to orchestrate data from multiple sources. At Netflix, it was the TV UI team trying to lay out the TV UI.

Then, over the years, this monolith started growing as we added more features, and eventually it became bigger. Along the way, we started seeing some problems. First, for every new feature, we needed a code change both in the service layer and in the API layer. This was often done by different teams. Because of this, the API team had to become experts in many domains. They were also the first line of support because it's a single runtime and handles all the requests. There were frequent code changes, and as we added more backend services, we needed to connect to them, so there were more dependencies. This resulted in slow build times. Oftentimes, when you have a single runtime, a memory leak in one area can cause problems in completely unrelated areas. We saw these cascading failures. These are some common problems of a monolithic architecture. This is what we saw with the API layer. To fix this, you can imagine, let's say we have this API. It's owned by the same API team, but aggregates across many domains. What if we could still have this one graph, but then split the implementation of all of these subgraphs across different teams?

This is where we entered Federated GraphQL. What's the simplest way to explain this concept? Let's say we have this type Movie in the monolithic GraphQL API. It has three different fields, fulfilled by three different services. The monolith API team would go and implement resolvers to resolve these fields and aggregate data from multiple sources. What if we could break this type apart and let the type be extended across service boundaries, so that each team can implement their own part of the API? That's exactly what Federation is. Using this idea, we envisioned an architecture. There are three main components to this architecture. The first is a DGS, or a Domain Graph Service. It just implements the subgraph that we saw. The Domain Graph Service can be a separate service that calls into the microservice, or it could be the microservice itself. All it does is implement the GraphQL API pertaining to that team's subgraph.
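The idea of one type split across services can be sketched as a merge of per-service fragments. This is a deliberately simplified model with made-up service and field names; real Federation composition also handles shared key fields and directives, which are omitted here.

```python
# Each hypothetical subgraph contributes some fields of the same Movie type.
SUBGRAPHS = {
    "catalog-dgs": {"Movie": {"id": "ID!", "title": "String"}},
    "ratings-dgs": {"Movie": {"rating": "Float"}},
    "artwork-dgs": {"Movie": {"boxArt": "String"}},
}


def compose(subgraphs):
    """Merge per-service type fragments and record which service owns each field."""
    schema, owners = {}, {}
    for service, types in subgraphs.items():
        for type_name, type_fields in types.items():
            for field, field_type in type_fields.items():
                key = (type_name, field)
                if key in owners:
                    raise ValueError(f"{type_name}.{field} defined by both "
                                     f"{owners[key]} and {service}")
                owners[key] = service
                schema.setdefault(type_name, {})[field] = field_type
    return schema, owners


schema, owners = compose(SUBGRAPHS)
print(sorted(schema["Movie"]))
```

Clients see a single `Movie` type, while each team owns and implements only its own fields.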

Next, we have the schema registry. The schema registry is responsible for validating that each of these individual subgraphs is valid, and then merging and composing them into a supergraph, which is then exposed to the clients by a highly available service, the GraphQL gateway. The clients write queries against the gateway, and the gateway is responsible for breaking these queries apart into subqueries that are sent to the Domain Graph Services. My coworkers Stephen and Jennifer gave an amazing talk at QCon Plus about two years ago, explaining Federation and the architecture in great detail. You can learn about query planning and query execution. I definitely urge you to check that talk out if you haven't already.
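A minimal sketch of the gateway's job, assuming the registry has already recorded which service owns each field. The ownership map and service names are hypothetical, and real query planning also deals with nested selections, entity keys, and ordering between subqueries.

```python
from collections import defaultdict

# Hypothetical field-ownership map, as a schema registry might record it.
FIELD_OWNERS = {
    ("Movie", "title"): "catalog-dgs",
    ("Movie", "rating"): "ratings-dgs",
    ("Movie", "boxArt"): "artwork-dgs",
}


def plan(type_name, requested_fields):
    """Group the requested fields into one subquery per owning service."""
    subqueries = defaultdict(list)
    for field in requested_fields:
        owner = FIELD_OWNERS.get((type_name, field))
        if owner is None:
            raise KeyError(f"{type_name}.{field} is not in the composed schema")
        subqueries[owner].append(field)
    return dict(subqueries)


query_plan = plan("Movie", ["title", "boxArt", "rating"])
print(query_plan)
```

The gateway would then send each group to its Domain Graph Service and stitch the partial results back into one response.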

GraphQL is Thriving: 3 "One Graph"s

Where are we today? GraphQL is used widely across the company. If you pull out your phones today and open the Netflix app, it's powered by GraphQL. It's using our member and the gaming graph. On the production and the studio side, we have a lot of people working on different parts of the production process, such as pre-production, post-production, on the set, and we build a lot of apps for them. These apps are also powered by GraphQL. It's powered by our Studio Graph. Then most recently, we have started also building an internal tools graph, which is for apps that are workforce facing, and we build them with GraphQL as well. We're dealing with multiple dimensions of scale here, over a billion requests per day, tens of thousands of types and fields, and 500-plus active developers. It's been over two years since Jennifer and Stephen presented. We've been operating and scaling Federation. Did we solve all our problems? Not quite. I think using Federation has just introduced some new ones. What I've learned from this experience is software engineering is largely about understanding the benefits and the tradeoffs, and then applying them to the situation at your company. No technology is the silver bullet. I want to take quite some time to share with you the challenges we are facing with Federation.

Challenges with Federation

In the monolithic GraphQL world, when you have this monolith API layer, only the API team needs to be GraphQL experts. In the Federated world, the domain teams also need to learn GraphQL. The initial barrier to entry is just too high. Imagine one day going to your backend teams, who are implementing their APIs in REST or gRPC. You tell them, start implementing your APIs with GraphQL and make sure they also merge into this unified graph. This is really hard. To address this, we leaned heavily into developer education. We created bootcamps, example code, and lots of documentation for people to get started. Then we also provided first-class Slack support and weekly office hours. I think what really helped with the initial barrier to entry is we actually embedded with the domain teams. My team knew how to do GraphQL, so we worked with the other teams to help them spin up these services. Then they became the champions of the architecture. Federation sounds cool. You can just decentralize ownership. Actually, driving adoption is pretty hard. Over time you overcome the developer education problems, and the developers start to get the hang of it. Then you start getting subgraphs in your ecosystem. Then more developers come to the party. In the studio ecosystem, we have 159 subgraphs, so that many Domain Graph Services in that ecosystem.

It feels like we are successful, we have this one graph. Does it feel like one? It just feels like a hodgepodge of things stitched together, instead of the highly leveraged one graph created by a monolithic team. That one seems more curated, cohesive, high leverage. Why do we see that? Because in the monolithic world, schema design is a single player game: we have one team understanding all the product requirements and exposing a unified API for the clients. In the Federated architecture, however, it's a multiplayer game. In studio, it's a massively multiplayer game with 159 teams building out the schema. This can lead to some issues. In GraphQL, there's no specific way to handle errors. The spec doesn't say anything about that. You can do it in a myriad of ways. Let's say we have this simple user API, and we want to return a user not found error. One way to do it is alongside your response object, you can return an errors object, saying, user not found, just like this. Another way to do it is you can model the user not found into your schema. The user query can either return a user object or a user not found object. There are pros and cons to each approach. I think when you have 159 teams working on the graph at the same time, it's difficult to enforce one approach, so teams use different approaches. Who is affected by this? The clients, because they can't reuse their code for error handling across different features. We see the same pattern across other API design things, like pagination.
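The two error-handling styles can be sketched as response payloads, along with the extra client code needed when both exist in one graph. The `UserResult` union, the `UserNotFound` type, and the payloads are illustrative assumptions.

```python
# Style 1: a top-level "errors" list alongside "data".
errors_style = {
    "data": {"user": None},
    "errors": [{"message": "User not found", "path": ["user"]}],
}

# Style 2: the error is modeled in the schema, e.g. a hypothetical
#   union UserResult = User | UserNotFound
# and the client switches on __typename.
union_style = {
    "data": {"user": {"__typename": "UserNotFound", "message": "User not found"}},
}


def user_missing(response):
    """Client code has to handle both styles when teams pick differently."""
    user = response.get("data", {}).get("user")
    if user is None:
        return any(e.get("path") == ["user"] for e in response.get("errors", []))
    return user.get("__typename") == "UserNotFound"


print(user_missing(errors_style), user_missing(union_style))
```

With a single agreed style, `user_missing` would need only one branch and could be reused across features.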

Another big thing is the graph has become too big to collaborate on. It has so many types, queries, and mutations, and so many teams building it. The first thing that happens when you have such a big graph with a global namespace is you have naming conflicts. To address that, we decided we'd namespace our types and fields to avoid the naming conflicts. Then I think we just dug ourselves into a hole, because now the namespace hides what's been implemented. We tend to repeat the same feature sometimes, because the namespaces create these informational silos. How do even new developers get on board? They get overwhelmed by the size of the graph, and struggle to start collaborating in here. We start to ask ourselves if the leverage is still there.

Achieving Cohesion in the Federated World

Federated GraphQL gives us the freedom to move fast. I think in return, we are trading off a well curated API. Are we still the responsible stewards of the API? I think API inconsistency and collaboration issues are a big price to pay. How can we achieve cohesion in the Federated world? I think solving this has been the core focus of my team for the last two years. I'd like to share some ideas with you. First, we came up with this workflow. This workflow is what we call collaborative schema design. People don't just follow a workflow, so we have to create mechanisms and tools to make it happen. Before GraphQL, we used memos for defining product specs and requirements, collaborating with product managers. That still worked great, so we kept it. To address collaboration challenges between the client and the server teams, we created GraphHub, a schema collaboration tool. The goal of GraphHub is to escape from the shackles of the implementation details, focus on the API, and design a schema collaboratively with your UI partners.

What is GraphHub? GraphHub is just a monorepo. It's just a Git repo that has all the schemas, and it syncs with the schema registry. The schema registry is dynamic, and always has the latest schema from prod. In this Git repo, any developer can file a pull request, or what we call a schema update proposal. Pull requests are a great form of collaboration. They communicate intentions clearly and crisply, and remove the hand-waviness from collaboration. In this PR, I added a new query for the QCon demo. Now I can just grab the branch name, and access the live mocked API for the schema changes. I can share this with my UI partners or whoever else. I can dig in and run queries and get back random data. This is all done just with schema changes. No code was written, no implementation details were discussed. GraphHub is a runaway success. So many teams are using it to collaborate on their schemas. That's GraphHub.

To supplement GraphHub and improve collaboration even further, we created the schema working group with individuals who are super passionate about GraphQL. With Federated GraphQL, everybody has to learn GraphQL, but not everybody needs to be passionate about it. This group gave us some of the benefits of the monolithic GraphQL team. This group is open for anyone to join. They review schema changes and document best practices for things like pagination and error handling. We highly recommend this. To solve for schema inconsistency, we created a tool called GraphDoctor, which is a schema linter. The goal of GraphDoctor is to help create a consistent API in the world of massively multiplayer schema development. GraphDoctor listens to every PR with a schema change. It runs consistency checks and makes recommendations directly in the PR. GraphDoctor is powered by schema guideline proposals. Schema guideline proposals can be created by anyone. They have a unique identifier and can be accessed on our docs site. We can codify these schema guideline proposals into linter rules. Then GraphDoctor can use these rules and vet pull requests with them.
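A schema linter of this kind can be sketched as a list of rules, each tied to a guideline identifier, applied to every type and field in a proposed change. The two rules, the `SGP-` identifiers, and the schema representation are made up for illustration and are not GraphDoctor's actual rules or API.

```python
import re

# Two made-up guidelines for illustration; the identifiers are hypothetical.
RULES = [
    ("SGP-001", "type names use PascalCase",
     lambda type_name, field: re.fullmatch(r"[A-Z][A-Za-z0-9]*", type_name)),
    ("SGP-002", "field names use camelCase",
     lambda type_name, field: re.fullmatch(r"[a-z][A-Za-z0-9]*", field)),
]


def lint(schema):
    """Return (rule_id, location, message) for every guideline violation."""
    findings = []
    for type_name, type_fields in schema.items():
        for field in type_fields:
            for rule_id, message, check in RULES:
                if not check(type_name, field):
                    findings.append((rule_id, f"{type_name}.{field}", message))
    return findings


# Proposed schema change, represented as type name -> list of field names.
proposed = {"Movie": ["title", "box_art"], "movie_rating": ["score"]}
findings = lint(proposed)
for finding in findings:
    print(finding)
```

In a pull-request workflow, each finding would become a comment that links back to the guideline it came from.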

Now that the schema is designed, we go into the implementation phase. Then we might discover blockers, and change our plans. To help with that, we created a tool called GraphLabs, to facilitate rapid prototyping and feedback loop between the client and the server. Let's take a look at an example here. Let's say I want to add some changes to the movie DGS. I made the schema changes. I implemented the code to handle those changes, and I file a pull request. What GraphLabs does is it creates a sandbox environment that is created and destroyed automatically for that particular pull request. This environment is blended with the rest of the components of the architecture in test, and it's isolated. Then we can share this environment with client teams. The client teams can then integrate with this environment directly from their UI, and test everything end to end, while the backend code is still in pull request. This enables extremely rapid prototyping.

Lastly, API design is not a perfect process. We'll make mistakes, requirements evolve. To power the deprecation workflow, we created graph stats and notifications. We count how many times deprecated fields are used, and send an email notification to the client teams who are using deprecated fields. These stats can be leveraged by many tools also, for example, GraphDoctor uses these stats to make sure that people are not making breaking changes. All of these tools revolve around schema collaboration and making the schema better, because the schema is your API. We don't want to have the information silos in our schema and fall prey to Conway's Law. We want the schema to feel more like this, a cohesive unit. I don't think we are there yet. The hope is that with the tools we've created, we can get there one day.
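The deprecation stats described above can be sketched as a counter over field usage. The usage log format, team names, and field names are assumptions; in practice this data would come from query telemetry.

```python
from collections import Counter

# Hypothetical inputs: the set of deprecated fields and a usage log of
# (client team, field) pairs extracted from incoming queries.
DEPRECATED = {"Movie.boxArtUrl"}
USAGE_LOG = [
    ("ios", "Movie.title"),
    ("ios", "Movie.boxArtUrl"),
    ("tv", "Movie.boxArtUrl"),
    ("tv", "Movie.boxArtUrl"),
    ("android", "Movie.title"),
]


def deprecated_usage(log, deprecated):
    """Count deprecated-field usage per (client, field)."""
    return Counter((client, field) for client, field in log if field in deprecated)


def safe_to_remove(field, log):
    """A field is removable only when no client uses it any more."""
    return all(used != field for _, used in log)


stats = deprecated_usage(USAGE_LOG, DEPRECATED)
teams_to_notify = sorted({client for client, _ in stats})
print(teams_to_notify)
```

A linter could call something like `safe_to_remove` to flag a pull request that deletes a field clients still query.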

Another big challenge with our graph is it's growing too quickly. It's hard to discover. When you have a client team come in and the clients of the API are trying to see what's available, there's just too many things. Let me show you what I mean. This is our UI for testing out an API, maybe you're familiar with Swagger. On the left-hand side, you can see all the different APIs that you have available. There's a lot of them. We created this thing called Lenses, which is a magnified view. It gives you a much more manageable set. Then you can select which API you want to use. What are Lenses? Remember these subgraphs and the domains that we have, it allows you to magnify into a domain. Then, it doesn't mean that you don't have access to the rest of the graph, you still do. It allows you to start small, look at a smaller view, and then start building your queries from there. This is just one way we're trying to solve the big graph problem. If you have other ideas, we would love to know what they are, because this is one of the biggest problems of having a big graph.
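One way to picture a lens is as a filter over the graph by domain. This is only a sketch of the idea with a hypothetical type-to-domain mapping, not how the Lenses feature is implemented.

```python
# Hypothetical mapping from each type in the graph to its owning domain.
TYPE_DOMAINS = {
    "Movie": "catalog",
    "Show": "catalog",
    "Rating": "ratings",
    "Shoot": "production",
    "Location": "production",
}


def lens(type_domains, domain):
    """Return the types in one domain, sorted, as a smaller starting view."""
    return sorted(t for t, d in type_domains.items() if d == domain)


print(lens(TYPE_DOMAINS, "production"))
```

The full mapping stays available, so a developer can start from the lens and still reach the rest of the graph.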

I tried to highlight some of the core challenges with Federation. There are a lot more. For example, it's much easier to share types in the monolith than it is between subgraphs in the Federated architecture. Federation also has a lot of limitations. The main takeaway is Federation is not free. It's not going to solve all your problems magically. We had to build a lot of additional tooling, documentation, and developer education to make it work. We're still trying to make it work.

Reflection on the Monolithic API Layer

Now that we've looked at some of the problems with Federation, it's important to reflect back on having a monolithic API layer. This is the slide from earlier where we talked about the problems with the monolith. We don't have to solve all the problems with a new architecture. For example, instead of the API team becoming experts in many domains, we can have a contribution model, and domain teams can contribute to it. Instead of having centralized support, we can create decentralized support. For example, even if it's the same runtime, we can create different metrics, and page different teams based on those metrics. We can also improve our developer tools to address slow build times. Lastly, we can build more resilient systems. For example, at Netflix, we actually shard our monolithic API layer by device type, so that failures don't cascade to another area. We've done a lot of these improvements and more in our API layer. The takeaway is you don't need a new architecture to solve all your problems, you can improve incrementally. I still think Federation is a great fit for us, because it goes hand in hand with the microservice architecture. Netflix has a decade of experience building microservices. We've invested so much in improving operability, observability, and resiliency of our microservice architecture. While Federation gives us the loose coupling, we did have to ensure a high level of alignment with workflow and developer tools and the schema working group. Is Federation the right choice for you? That's only up to you to decide.

GraphQL Migration Can Be Painful

We're not done yet. Remember this DNA API, which implemented Falcor, we still need to move to GraphQL. As you all know, migrations are non-trivial. Whether you're moving to Monolith, or Federated, moving from REST APIs to GraphQL. In our case, it was the Falcor API. We've all been parts of migration during our software career. It looks something like this. The plane is flying midair, and you're changing out all its parts, and it can go down. On the left-hand side is a very familiar Netflix discovery UI that everybody loves. This is what we decided to migrate first onto GraphQL. The first big challenge was conflicting priorities. Do Netflix developers work on a new product feature, or do they work on a tech migration? Thankfully, the iOS and the Android teams were pretty excited about GraphQL, so we've managed to get them on board to move things along. Then we have dimensions of scale. Over 200 million members use this canvas to discover what to watch. You can only imagine how much traffic the screen receives. We had to absolutely maintain engagement metrics and feature parity.

Migration can take months, sometimes years. Any new feature we add during the migration needs to be exposed via both GraphQL and Falcor. While Roman riding is a fun spectator sport, I don't think developers are super excited about maintaining their features in two discrete systems. This is the price we have to pay. We decided we needed to understand all the requirements and build a solid plan going into the migration. We had two sets of migrations going on. First, we were moving from the Falcor API to the GraphQL API. We were also moving from the Monolithic API layer to a Federated one. We decided to focus on the first one, because it involved both UI and backend teams. Here is the plan we came up with. Remember the DNA API at the bottom. That's the service that serves the devices directly today. First, to move to GraphQL, we had to make sure that the devices knew how to talk GraphQL. This is actually a non-trivial task. It involves prototyping different client libraries, figuring out client-side caching and normalization, and making sure that the client performance is acceptable. Next, we had to build a transformation DGS, a Domain Graph Service that translated from Falcor to GraphQL. It was another layer that we needed to add in there. Then we needed to break the monolithic API apart and move the logic into Domain Graph Services.

To further complicate matters, the devices don't talk to the monolith directly. They actually talk to the monolith through the BFF (Backend for Frontend) layer, which is owned by a device team. For GraphQL, however, we wanted to have a single flexible API, where each device can have its own query. This added further complexity to our modeling. Now the API needed to be more flexible. The BFFs don't just go away. The logic either moves into the client, or it moves into the server. That has paid off for us, because our architecture is a lot simpler now.

Next, we also had to make sure the migration was safe. There are so many moving parts: we are adding a new way to call the server using a GraphQL client, and we have this new extra service layer that might add latency to our calls. These users are also highly contextual. A lot of things were changing. It's hard to test and make sure that every single thing works. We decided, why don't we use A/B testing? This detects changes in our core metrics. This was an excellent idea because it allowed us to make a decision quicker instead of ironing out every single issue before launch. This doesn't mean we didn't test. We didn't skip writing our tests. We did end-to-end testing. We also used another technique called shadow traffic testing, specifically for moving from the monolith to Federated GraphQL. Since the API and the query are exactly the same, but the underlying implementation is different, we can use this technique. We have this tool called Mantis, with which we can sample random traffic events. With this, we can build a simple event processor that sniffs these random traffic events and sends the same request to both the Monolithic and the Federated implementation, pretending to be an actual user. Then we match the responses. If they match, great. If they don't, then we investigate further. This allowed us to iron out a lot of functional correctness issues like localization, for instance, and gave us high confidence before launch.
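The shadow traffic comparison can be sketched as a function that replays sampled requests against both implementations and collects differences. The two implementations, the request shape, and the planted localization bug are invented for the demo, and plain Python sampling stands in for Mantis, which is not used here.

```python
import random


# Stand-ins for the two implementations behind the same API and query.
def monolith(request):
    return {"title": request["movie"].title(), "locale": request["locale"]}


def federated(request):
    # Deliberate bug for the demo: the locale is ignored.
    return {"title": request["movie"].title(), "locale": "en-US"}


def shadow_compare(traffic, sample_rate, old, new, rng):
    """Replay a random sample of requests against both and collect mismatches."""
    mismatches = []
    for request in traffic:
        if rng.random() >= sample_rate:
            continue  # not sampled
        expected, actual = old(request), new(request)
        if expected != actual:
            mismatches.append((request, expected, actual))
    return mismatches


traffic = [{"movie": "movie a", "locale": "en-US"},
           {"movie": "movie b", "locale": "fr-FR"}]
mismatches = shadow_compare(traffic, 1.0, monolith, federated, random.Random(0))
for request, expected, actual in mismatches:
    print(request, expected, actual)
```

Only the non-English request surfaces the difference, which is the kind of localization issue that is easy to miss with hand-written tests.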

There's nothing to sugarcoat here. You need to figure out the list of steps and execute on them iteratively. It's a long process. It took us almost a year, but we're close to the finish line now. What I've observed is there are two core things that complicate migrations, first, is the baggage of your legacy architecture. The more complex it is, the more complex the migration is going to be. Second, is the dimensions of scale. The more dimensions you have, the more safety mechanisms you need. That said, not all our GraphQL migrations were this complicated, especially on the studio and the internal tools. They just didn't have this much baggage or the dimension of scale. As you're thinking about GraphQL adoption in your company, you don't want to underestimate, but really have a solid plan going into the migration. If the migrations are too hairy, moving to GraphQL may not be a good choice for you at the moment. Though the migration was painful, the silver lining is new product development was relatively smooth sailing. For example, the Netflix Games platform is built entirely with GraphQL, and Federation.

GraphQL Adoption in your Org

We've covered a lot of ground. We've covered multiple architecture patterns, their tradeoffs, migration pains. Let's talk about GraphQL adoption in your org. This next set of slides are just recommendations based on my experience working with GraphQL. Just some ideas and not a hard set of guidelines. Whether you're an early stage seed startup, or a large company with deep microservices, you really want to start with a monolithic GraphQL API, one graph. This gives you the good foundations to build a unified API. You want to resource this effort into a single team, to plant the seeds and grow and tend to the GraphQL API. Ideally, this team is a mix of backend and UI engineers, because you want both a scalable service, but then also a great API for the client developers. I think this is the most successful strategy for having a unified GraphQL API. Then as your graph grows, you can think about Federation, and if it makes sense in your organization.

It is also important to plan a coordinated GraphQL effort in your organization. What I've seen to be very common, is many teams will start building GraphQL for their own domains, and you often end up with many different GraphQL APIs within the same company. If you have n different teams building their own GraphQL APIs, you're not going to unlock its leverage. We saw this at Netflix too, especially in our studio domain. When you have so many different teams, and you have new technologies, people are going to start building GraphQL in their silos. With that, you're not going to be able to do this where you can query across domain boundaries with a single query. If you see this happening in your company or your organization, you need to get into the room and start collaborating, trying to merge your multiple graphs into a single graph, if it's possible. It's not always possible, because even at Netflix, we have three different graphs at Netflix. One of the efforts in my team is to merge the studio graph and the enterprise graph into a single graph to unlock even more leverage, because there's points of connection there.

Then, as you've seen throughout my talk, our team has focused a lot on schema design, because it's your API, it's absolutely table stakes. The amount of effort you invest in schema design will directly affect the success of GraphQL in your company. We've discovered a few best practices that might be valuable. First, try to adopt a schema-first approach. Sometimes, as you're adopting GraphQL, you might feel the urge to generate the GraphQL API from an existing API you have, or from backend model objects, or even from a database schema. I think it's important to resist that urge. Even though it might save you time in the short term, it makes your API very tightly coupled with the server implementation. Then it can't evolve independently. GraphQL comes with an excellent schema definition language. Start designing your schema there and keep it loosely coupled from the server implementation. In GraphQL, you can add fields and deprecate old ones. This allows the UI teams or the client teams to move to the new field over time, and stop using the deprecated field. Once the deprecated field is no longer used, we can simply remove it from the schema. This is what we call the deprecation workflow. I showed some tools earlier, like how we count stats on deprecated fields. It's just to make it easier. We typically don't version our GraphQL APIs like we do in REST, but instead we try to follow the deprecation workflow.
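In the schema definition language, deprecation is expressed with the standard `@deprecated` directive. The sketch below uses a hypothetical `Movie` type and a deliberately simple regular expression that only handles one field per line without arguments; it is not a real SDL parser.

```python
import re

# Hypothetical schema; @deprecated is the standard GraphQL directive.
SDL = '''
type Movie {
  title: String
  boxArtUrl: String @deprecated(reason: "Use artwork instead.")
  artwork: Image
}
'''

# Matches "name: Type" with an optional trailing @deprecated directive.
FIELD = re.compile(r'^[ \t]*(\w+)[ \t]*:[^@\n]+?(@deprecated\b.*)?$', re.MULTILINE)


def split_fields(sdl):
    """Return (active, deprecated) field names from a simple one-field-per-line SDL."""
    active, deprecated = [], []
    for name, directive in FIELD.findall(sdl):
        (deprecated if directive else active).append(name)
    return active, deprecated


active, deprecated = split_fields(SDL)
print(active, deprecated)
```

The new `artwork` field and the old `boxArtUrl` field coexist in the schema, so clients can migrate at their own pace before the old field is removed.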

Next, it's important to think about the product when designing a GraphQL API, whether it's looking at product specs, requirements, and not just thinking about what your backend provides. Go a step further, collaborate with your UI engineers. Lastly, GraphQL can be used for server to server APIs, microservice messaging. It's not meant for that. It truly shines when we use it for consumer APIs or device facing APIs. If you do decide to use GraphQL for server to server APIs, make sure you keep a clear separation of concerns between your backend APIs and your product API.

Takeaways

I think GraphQL is great for device-to-server APIs. It really shines when one graph is part of your strategy. We also got the opportunity to catalog two of the most popular ways to implement GraphQL, Monolithic and Federated. We dove really deep into the Federation problems because that's what we've been doing at Netflix. I shared some tools to make Federation easier. They're mainly about managing your schema better. Lastly, migration is not free. Migration complexity is deeply tied to the architecture at your company. I didn't want this talk to be "Netflix uses Federated GraphQL, or GraphQL, so everyone should use it." It may not even be the right decision for you or your organization. My goal was to share our journey and put a heavy focus on the challenges. I hope you're able to take these ideas back to your organization and do what's right for you.

Questions and Answers

Betts: With implementing subgraph providers from a traditional REST API, have you tried tools that take OpenAPI and Swagger specs and auto-generate a GraphQL schema based on those?

Shikhare: It's a very common thing to do at the beginning, because it's an easy way to get started with GraphQL. I think it's totally fine. There are actually tools out there that you can use to generate GraphQL from both REST and gRPC. The main thing you want to watch out for, and I mentioned this in the Best Practice Guide, is you don't want to always generate the GraphQL API. You can generate it once the first time, and then you want to be able to modify it independently. Because if you always keep generating the API from the REST API, or the gRPC API, the GraphQL API is going to be tightly coupled with that. You can use these tools to generate it, but you still want to independently evolve and build your GraphQL API separately. That's what we try to follow. Sometimes it's easy to generate and you use the mapping code easily. There's another trick you can do, actually. Once you generate it, you can still return the model objects from your REST API or gRPC API to the GraphQL data fetcher. Then, let's say there are fields that match exactly, the GraphQL data fetcher would be smart enough to use those fields from the model object. I think this is supported in most languages. That's a neat trick to use. You don't have to reconstruct and redo the model objects and transform between them. Then every new field you add, you would have to either add a data fetcher for it or a [inaudible 00:41:35]. This is a very common thing that everyone wants to do.
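The model-object trick can be sketched roughly as follows in Python. This is a simplified illustration of a default data fetcher, not the behavior of any particular GraphQL library: `MovieModel`, `CUSTOM_FETCHERS`, `resolve_field`, and the `runtimeMinutes` field are hypothetical names invented for the example.

```python
from dataclasses import dataclass


@dataclass
class MovieModel:
    # Hypothetical model object returned by an existing REST or gRPC client.
    title: str
    runtime_seconds: int


def runtime_minutes(movie):
    """Custom fetcher for a schema field with no matching model attribute."""
    return movie.runtime_seconds // 60


# Only fields that do not line up with the model need an explicit fetcher.
CUSTOM_FETCHERS = {("Movie", "runtimeMinutes"): runtime_minutes}


def resolve_field(type_name, field_name, source):
    """Use a custom fetcher if one is registered, else read the attribute."""
    fetcher = CUSTOM_FETCHERS.get((type_name, field_name))
    if fetcher is not None:
        return fetcher(source)
    # Default behavior: the schema field name matches the model attribute.
    return getattr(source, field_name)
```

Fields such as `title` resolve straight from the model object, while the schema stays free to expose `runtimeMinutes` even though the backend model only stores seconds.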

Betts: It sounds like one of those things that makes it easy to get started, but once you get to the scale that Netflix is at, those Getting Started tools aren't as applicable for your scenario.

Do you have any GraphQL schema standards that teams should follow when they provide a subgraph?

Shikhare: Yes, absolutely. At Netflix, freedom and responsibility is one of our big values. Yes, we do have standards, we do communicate them. We actually build linters. GraphDoctor is the tool that makes sure that some of these standards that I covered in the talk are enforced. Ultimately, you can still bypass it. Unless it's a breaking change or something like that, you can still do whatever you want. We have the working group, we have the linter to make sure that those standards are enforced. There are many ways to do pagination in GraphQL. We do follow the Relay standard, which is really nice, but it's complicated to implement. We create helper functions and things like that to help people implement it. Definitely having standards is great. Error handling is another area. I know someone asked a question about using HTTP error codes. We can use those, but we use them at a GraphQL error level. You can still have HTTP error code mapping, because one field could be not found, so it's a 404, but something else could be a different error. Since you have all these different requests that are happening, you can have different partial errors. We did try to do some standardization there, but it's challenging.
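As an illustration of the kind of pagination helper mentioned here, the following Python sketch builds a Relay-style connection (edges, cursors, and `pageInfo`) from an in-memory list. It is not Netflix's helper: `connection_from_list`, the offset-based cursor encoding, and the forward-only `first`/`after` arguments are assumptions made to keep the example short.

```python
import base64


def encode_cursor(offset):
    """Encode a list offset as an opaque cursor string."""
    return base64.b64encode(f"offset:{offset}".encode()).decode()


def decode_cursor(cursor):
    """Recover the list offset from a cursor made by encode_cursor."""
    return int(base64.b64decode(cursor).decode().split(":", 1)[1])


def connection_from_list(items, first, after=None):
    """Build a forward-paginated, Relay-style connection over a list."""
    # Start just past the item the `after` cursor points at.
    start = decode_cursor(after) + 1 if after is not None else 0
    page = items[start:start + first]
    edges = [
        {"cursor": encode_cursor(start + i), "node": node}
        for i, node in enumerate(page)
    ]
    return {
        "edges": edges,
        "pageInfo": {
            "hasNextPage": start + len(page) < len(items),
            "hasPreviousPage": start > 0,
            "startCursor": edges[0]["cursor"] if edges else None,
            "endCursor": edges[-1]["cursor"] if edges else None,
        },
    }
```

A client asks for the next page by passing the previous page's `endCursor` as `after`. A production helper would also need backward pagination (`last`/`before`) and cursors that stay stable when the underlying data changes.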

Betts: I think anytime you have that aggregation point of I'm going to have this one request, and it becomes five requests, and I'm hiding that implementation detail, it's not always a simple answer to come back. Maybe it's ok to give back some of the data, even if you get a 404 from one endpoint, is it still ok to return part of that graph, or do you just fail explicitly and say, ok, none of it came back.

Shikhare: One of GraphQL's superpowers is being able to return partial data, so you can still render, and the UI can still decide. Now the UI might decide that a particular field has to be present for it to render, and they can make that decision at the client level, but the API itself can still operate with partial failure. You would return that particular field set to null, and you can add an error saying it was not found. You can do all sorts of things.
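A partial response of this shape can be sketched in Python as below. The `data` and `errors` keys and the `message`, `path`, and `extensions` entries follow the GraphQL response format, while the `NOT_FOUND` code, the `NotFound` exception, and the `execute` function are hypothetical conventions chosen for this example.

```python
class NotFound(Exception):
    """Raised by a fetcher when the underlying record does not exist."""


def execute(fetchers):
    """Resolve each top-level field, turning failures into partial errors.

    `fetchers` maps a field name to a zero-argument callable, a simplified
    stand-in for real field resolution.
    """
    data, errors = {}, []
    for field_name, fetch in fetchers.items():
        try:
            data[field_name] = fetch()
        except NotFound as exc:
            # The failed field becomes null; the rest of the data survives.
            data[field_name] = None
            errors.append({
                "message": str(exc),
                "path": [field_name],
                # Hypothetical code playing the role of an HTTP 404.
                "extensions": {"code": "NOT_FOUND"},
            })
    response = {"data": data}
    if errors:
        response["errors"] = errors
    return response
```

The client receives every field that resolved successfully and can decide for itself whether the null field prevents rendering.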

Betts: I think you mentioned earlier that the BFF pattern, Backend for Frontend is one of the first use cases that people use. This is definitely frontend focused. Any thoughts on using GraphQL for backend to backend communication?

Shikhare: It can be used. It's fine. That's not what it was invented for. If it's convenient to just use the GraphQL API, it's fine to use it. Obviously, you probably don't want to rewrite the same API in gRPC. Even at Netflix, for things like backend use cases, we do use gRPC, even today, and we plan to do that for the foreseeable future. That said, we do have use cases that have emerged around moving data from a data warehouse, or moving data to something like a search index. The GraphQL schema acts as a really nice entry point between data warehouse schemas, which are usually Avro or those kinds of things, and the Elasticsearch schema. People have started using GraphQL for data movement. You have these tradeoffs again. Again, it's not close to the backend, it's designed for the UI. The data model is not going to be quite what you want to move to the data warehouse exactly. You might have to recreate some data models within GraphQL to make that happen. You have to evaluate backend usage on a case-by-case basis.

Betts: I hadn't heard of that case before. I can see why it makes a little bit more sense for the data warehouse, because it's that data aggregation consumer, you probably want different rules. gRPC is usually fairly tightly coupled between two services, like the contract is very well defined. It could change very easily.

Has latency become an issue? Does it introduce any latency because you're adding an extra step in the network?

Shikhare: I think the whole goal of GraphQL is to reduce your latency, time to render, basically, because you want to be able to make as few requests as possible. If you observe, the GraphQL server sits in the data center alongside all your microservices, so all of these calls are within the data center. Adding an extra hop, as long as you're within the same region, within the same data center, adds no observable latency. If you don't have data in the same data center and you have to hop across regions, then this kind of addition starts to get slower. You have to pay the cost somewhere, whether the client is paying the cost or someone else is paying the cost. We do a lot of observability tooling. When developers are developing their APIs with GraphQL, they can see exactly where the time is spent. Really, you shouldn't be spending more than 5 to 10 milliseconds doing any of the GraphQL stuff across the board. It should be your business logic that takes up the time, your data queries, and things like that.

Betts: Why is it difficult to share types between subgraphs? Don't they have access to all the other types in the monorepo?

Shikhare: The subgraph itself is an individual freestanding GraphQL server, so it doesn't have all the types, because you don't want to spin up the subgraph service with all the types when it doesn't care about them. Then, when you make changes, you need to be able to validate that these types merge into the unified graph. Now, only the client cares about the unified graph, because clients are the ones who are writing these queries. Really, if the subgraph is using an existing type, you can just redefine it in your subgraph and it merges seamlessly, or you just have the types that are spun up in your subgraph service.
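The idea of redefining a shared type in a subgraph and validating that it merges can be illustrated with a deliberately simplified Python sketch. This is not how a real federation composition step works in detail: `merge_subgraphs`, the subgraph names, and the dictionary representation of types are assumptions made for the example.

```python
def merge_subgraphs(subgraphs):
    """Merge per-subgraph type definitions into one unified graph.

    `subgraphs` maps a subgraph name to {type_name: {field_name: field_type}}.
    A type may be redefined by several subgraphs as long as any field they
    share is declared with the same type.
    """
    unified = {}
    for subgraph_name, types in subgraphs.items():
        for type_name, fields in types.items():
            merged = unified.setdefault(type_name, {})
            for field_name, field_type in fields.items():
                existing = merged.get(field_name)
                if existing is not None and existing != field_type:
                    raise ValueError(
                        f"{subgraph_name}: {type_name}.{field_name} is "
                        f"{field_type}, but was already defined as {existing}"
                    )
                merged[field_name] = field_type
    return unified
```

Each subgraph only declares the fields it cares about, and the check runs at merge time, which is where a conflicting redefinition would be caught before it reaches clients of the unified graph.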

Recorded at:

Dec 13, 2022

Tejas Shikhare
