
Using DevEx to Accelerate GraphQL Federation Adoption @Netflix



Paul Bakker and Kavitha Srinivasan discuss how they made certain Build vs Buy (open source) trade-offs and the socio-technical aspects of working with many teams on a single shared schema.


Paul Bakker is a senior software engineer at Netflix. He has a long history in the Java community, is a Java Champion and author of “Java 9 Modularity” and “Modular Cloud Apps with OSGi”. Kavitha Srinivasan is a senior software engineer at Netflix, with more than 15 years of experience in distributed systems, currently working on developer experience on GraphQL.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.


Bakker: My name is Paul.

Srinivasan: I'm Kavitha.

Bakker: We're both part of the developer experience team at Netflix. We're going to talk about how we used developer experience to drive a large architectural migration.

How It Started

Before we go into the details of what we did and what we learned from it, we have to give a little bit of context on how we got involved, and why we did what we did. In the space of show production, there are a lot of people involved in producing a show, and Netflix produces many shows. To support these efforts, we have many in-house built apps to keep track of all the things. At one point in time, the architecture we used for this had many different microservices, each responsible for part of the data. These microservices would typically expose their data using a gRPC endpoint and, in some cases, REST. From the perspective of devices, we wanted to have one single API, specifically a single GraphQL API with a single schema. The way we accomplished that was by adding a GraphQL server API in front of those microservices.

It is an architecture that many companies use, and it generally works out very well. For us, too, it worked well for quite a while. At some point, we did start to see limitations in this architecture. They had a lot to do with the scale we were operating at and the number of teams involved. Every change made in these back-end services, specifically adding new data, so new fields, to the schema, would also require a code change from the team managing the GraphQL server API, because they had to add code to make a new gRPC call to get the data, and then add it to the schema so that it would be exposed in the GraphQL schema for the devices. The fact that this team had to do that manual work meant that they were a bit of a blocker for all the changes being made. Because of the sheer number of things involved, that team became a bottleneck by itself.

Another problem with this approach is that the device teams, the user interface developers, and the back-end teams responsible for the data are quite disconnected from each other, because we have this team sitting in the middle that actually takes care of the GraphQL schema. We didn't have a lot of natural collaboration going on between the device teams and the back-end teams, to really collaborate on a schema that works best for everyone.

Apollo Federation

Srinivasan: Right around this time, Apollo came up with a Federation spec. It described how we can split a single shared graph across many different services, each owning a portion of the graph. We call these Domain Graph Services, or DGSs, in this architecture. Let's say, for example, you have a type, Movie, that's owned by a movie DGS. It defines the movie ID and title fields. Now you have another service, the review service, that wants to extend this Movie type by adding another field called reviews, and provides the data for it. You can have an incoming GraphQL query that goes to the gateway. The gateway is configured so it knows it needs to talk to the movie DGS and the review DGS in order to fulfill the query, get the data, and send the response back. That's how Federation works at a very high level.
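
As a rough illustration of the routing idea described above (this is not code from the talk, and it is not how a real Federation gateway is implemented), the sketch below uses plain Java to show a gateway that holds no business logic and only knows which service owns which field of the movie type. The class name, the field-to-service table, and the sample data are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;

// Toy model of Federation routing. All names and data here are hypothetical.
class FederationSketch {

    // Stand-in for the movie DGS: it owns the movieId and title fields.
    static Object movieDgs(String field, int movieId) {
        if (field.equals("movieId")) return movieId;
        if (field.equals("title")) return "Movie #" + movieId;
        throw new IllegalArgumentException("movie DGS does not own " + field);
    }

    // Stand-in for the review DGS: it extends the movie type with reviews.
    static Object reviewDgs(String field, int movieId) {
        if (field.equals("reviews")) return List.of("Great", "Would watch again");
        throw new IllegalArgumentException("review DGS does not own " + field);
    }

    // The gateway's only knowledge: which service owns which field.
    static final Map<String, BiFunction<String, Integer, Object>> ROUTES = new LinkedHashMap<>();
    static {
        ROUTES.put("movieId", FederationSketch::movieDgs);
        ROUTES.put("title", FederationSketch::movieDgs);
        ROUTES.put("reviews", FederationSketch::reviewDgs);
    }

    // Fulfil a query by asking the owning service for each requested field
    // and merging the answers into a single response.
    static Map<String, Object> query(int movieId, List<String> fields) {
        Map<String, Object> response = new LinkedHashMap<>();
        for (String field : fields) {
            BiFunction<String, Integer, Object> owner = ROUTES.get(field);
            if (owner == null) {
                throw new IllegalArgumentException("No service owns field: " + field);
            }
            response.put(field, owner.apply(field, movieId));
        }
        return response;
    }

    public static void main(String[] args) {
        System.out.println(query(1, List.of("title", "reviews")));
    }
}
```

Adding a field in this model means changing one service and one routing entry, which mirrors why the gateway team stops being a bottleneck. A real gateway derives that routing from the composed schema rather than from a hand-written table.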

This solved the big problem with the bottleneck for us, because now the gateway has no business logic. All it does is route requests and reach out to the appropriate services. All the schema collaboration, and any changes related to it, is now done by the back-end and front-end teams. The gateway is completely oblivious to this process.

To learn more about Federation, check out this talk by Jennifer and Stephen on how Netflix scales its API with GraphQL Federation.

How to Onboard 30 Teams on New Technology

Bakker: With this new architecture in mind, we envisioned that it would increase the rate of change that we could make and the rate of innovation that we could go after. However, this new architecture also brought new challenges, and we introduced a completely new problem. So far, the back-end service teams, and at the time we had about 30 teams involved, only had to expose a gRPC endpoint. That was something they were already very familiar with doing. They weren't too concerned about a GraphQL schema or what the data would look like in a GraphQL endpoint. They weren't worried about GraphQL in the first place. It wasn't really something they needed to think about, because another team was taking care of that. With this new architecture, we did require those teams to all onboard onto this new technology and this new architecture. They would now be responsible for owning part of the schema, which also means the schema design that comes with it, and for writing the code to actually implement a GraphQL endpoint. On top of that, GraphQL was a very new technology for back-end development, because we didn't have a lot of GraphQL going on in back-end microservices in the company at the time. The question that we were facing was: how do we get 30 teams excited about onboarding onto all this new technology and this new architecture, when it implies a lot of extra, new work for them?

When You Give a Developer Carrots

Srinivasan: The answer to that is providing developers with lots of incentives by way of a great developer experience. You want your developers to be happy, and you want to spark developer joy. How do you go about doing that? First and foremost, is the ease of onboarding. In our case, we had a mix of developers who were already familiar with GraphQL, and many who were not. Added to that, you have Federation, which is a completely new concept to introduce. You want your users to focus on the business logic, and not so much on the wire up and the setup details.

Next, you want to give your users a familiar ecosystem to work with. At Netflix, we have a whole slew of tools that we use for debugging, tracing, and logging. We wanted to make sure we provide tight integration so users can use the tools that they are already familiar with during this migration. The third aspect is consistency. You want to try to make all the code look the same. You can easily enforce best practices as a result. It also makes it much easier when it comes to debugging, and sharing what you've learned with other teams. This also fosters more collective learning for patterns, and you're not reinventing the wheel so much. You want to build a strong sense of community, and this makes the journey much more exciting for everyone involved.

The Paved Path for Back-end Services

Bakker: Looking a little bit at how back-end development is typically done at Netflix, we have what we call the paved path. The paved path is the easiest and best supported way to get things done. As an example, looking at this picture, it might be a little bit more exciting and more fun to go off the road, climb these boulders, and find your way around them. It's definitely more efficient to just follow the paved path and get to your destination. Similarly, we have a paved path for back-end development. Back-end development at Netflix is standardized on Java. We use Spring Boot. On top of Spring Boot, we have built many integrations, such as distributed tracing, distributed metrics, and logging. We have integrations with security systems to do authorization and authentication. We have a host of different IPC clients to do gRPC and REST, and also server implementations to do gRPC and REST. Note that there is no GraphQL here, because GraphQL really wasn't something that we were using a lot within the company for back-end server development.

Set Up a GraphQL Service

Srinivasan: How do you go about setting up a GraphQL service? First of all, you want to initialize your Spring Boot app. Then you have to create a schema, which is the API that you want your service to expose. Then you write a data fetcher to return data in response to requests. Then you need to build your schema using the parser. Lastly, you want to set up your HTTP endpoint to respond to /graphql queries.

Bakker: Let's look at an example and see what that would actually look like in code. The example that we will look at is implementing a schema. It's a very simple schema. We have a query type defined, and on the query we have one field that we can query for, which is hello. It takes an argument and returns a message, as a string. If you want to implement this in Spring Boot using GraphQL Java, the code looks like this. It's a working example and it's quite a bit of code, so we are not going to look at the details. The important detail that I want to point out here is that there's only a single line of business logic on this slide. All the other code is just setup code. It is boilerplate. Obviously, this would be code that every team would have to figure out and write. Another important detail is that this code, although it's a working example, is really a naive implementation. It's not thinking about error handling. It is just taking care of the simplistic happy path. Doing this in a more production-ready way means quite a bit more code to write. It would be code that every team would have to reinvent and get right, if we did not provide anything else.

Srinivasan: You have the happy path, but now you want to add authorization. You want to add tracing, logging, metrics. You want to add error handling, custom exception handling, and so on. All of this is just repeated boilerplate code that users can easily do away with and not have to write each time in each of their services.

The Domain Graph Service Framework (DGS)

Bakker: Very clearly, we could do much better here. Early on in this architectural migration, we decided to invest in a Domain Graph Service framework, or DGS. The DGS framework is really GraphQL integration for Spring Boot Netflix. The Netflix part in that is also important. It's not just about general Spring Boot usage, but also about integrating with all the different components that we already have at Netflix. This framework uses GraphQL Java internally. GraphQL Java is a well-maintained open source library that takes care of the lower-level details of doing GraphQL in Java.

Let's take a look at what the DGS framework looks like. The same code example that we saw previously would look like this using the DGS framework. You immediately see it's a lot less code. It also looks a lot more declarative. In familiar Spring Boot fashion, we use annotations for a lot of things. We use @DgsComponent to mark the class as a component for the framework. We use @DgsData to say that this method provides a data fetcher for a specific field in our schema. In this example, it's the hello field that we had specified. We have integration with things like @Secured to get authentication or authorization at the field level out of the box. This integrates with the Netflix implementation of these security concerns. What's left is really just the business logic that the developer should be concerned about. Then there are all sorts of conveniences, for example, to easily get input arguments using the @InputArgument annotation. It makes your code nice and concise and, most of all, really focused.
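
The slide itself is not part of this transcript, so the following is only a sketch of the annotation-driven style described above. To stay self-contained and runnable, it declares its own local stand-ins named after @DgsComponent, @DgsData, and @InputArgument; these are not the real DGS framework annotations, and the small reflective fetch method is a hypothetical, much-simplified picture of the wiring a framework does on the developer's behalf. Security integration such as @Secured is left out.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.lang.reflect.Parameter;
import java.util.Map;

class AnnotationWiringSketch {

    // Local stand-ins for illustration only, NOT the real DGS annotations.
    @Retention(RetentionPolicy.RUNTIME) @Target(ElementType.TYPE)
    @interface DgsComponent {}

    @Retention(RetentionPolicy.RUNTIME) @Target(ElementType.METHOD)
    @interface DgsData { String parentType(); String field(); }

    @Retention(RetentionPolicy.RUNTIME) @Target(ElementType.PARAMETER)
    @interface InputArgument { String value(); }

    // What the developer writes: one method that holds the business logic.
    @DgsComponent
    static class HelloDataFetcher {
        @DgsData(parentType = "Query", field = "hello")
        public String hello(@InputArgument("name") String name) {
            return "hello, " + (name == null ? "stranger" : name) + "!";
        }
    }

    // Hypothetical framework side: find the method registered for a field
    // and bind the query arguments to its annotated parameters.
    static Object fetch(Object component, String parentType, String field,
                        Map<String, Object> arguments) {
        if (!component.getClass().isAnnotationPresent(DgsComponent.class)) {
            throw new IllegalArgumentException("Not a @DgsComponent");
        }
        for (Method method : component.getClass().getDeclaredMethods()) {
            DgsData data = method.getAnnotation(DgsData.class);
            if (data == null || !data.parentType().equals(parentType)
                    || !data.field().equals(field)) {
                continue;
            }
            Parameter[] params = method.getParameters();
            Object[] values = new Object[params.length];
            for (int i = 0; i < params.length; i++) {
                InputArgument arg = params[i].getAnnotation(InputArgument.class);
                values[i] = arg == null ? null : arguments.get(arg.value());
            }
            try {
                return method.invoke(component, values);
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("Data fetcher failed", e);
            }
        }
        throw new IllegalArgumentException("No data fetcher for " + parentType + "." + field);
    }

    public static void main(String[] args) {
        System.out.println(fetch(new HelloDataFetcher(), "Query", "hello", Map.of("name", "QCon")));
    }
}
```

The point of the sketch is the split of responsibilities: the annotated class contains only business logic, while lookup, argument binding, and error wrapping live in one shared place.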

A huge benefit that we get out of this is that there's a lot of consistency between code bases. There were initially about 30 teams that needed to be onboarded onto this new architecture. We have fewer than a handful of developers in the developer experience team. That's the team that's supporting all these teams, helping them whenever problems come up, and getting things going. With that ratio, it is really important for us that if a developer comes to us with a question on, for example, our Slack channel, we can go into their code repository, quickly understand how it all fits together, and focus on the code that might be problematic. We get that because of this consistency: because all our code bases are structured in a very similar way, we can quickly point out the code that we actually need to look at.

Another benefit that we're not seeing in this code, but that we are getting for free, is that we get a lot of integrations with existing Netflix systems. For example, this code example integrates with distributed logging, distributed tracing, and distributed metrics. There's no setup required at all from the developer. It all works out of the box.


Srinivasan: Next, let's talk about testing. Testing can have a big impact on developer velocity. With the framework, we were able to provide a nice, easy way to eliminate any setup code and just focus on the business logic that you want to test. In this case, the GraphQL query. The rest of the test setup purely involves any validation steps you want to add. On top of this, we also make it really easy to construct your GraphQL query string. We've not shown it here for the sake of simplicity, but we provide a nice type-safe way to construct your GraphQL query string, as opposed to manually handcrafting it, which can be quite painful. Not only this, this just talks about testing your local DGS. This has to operate in a federated system, which has many moving parts. We also offer our developers many easy ways to test manually, end-to-end, their queries in a federated setup.
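
The type-safe query construction mentioned above is not shown in the talk. As a minimal sketch of the general idea, assuming the hello field with a single string argument called name, a small builder can produce the query string so that tests do not handcraft it. The class HelloQuerySketch and its methods are hypothetical and are not part of the DGS framework.

```java
// Hypothetical query builder for the hello field; not a DGS framework class.
class HelloQuerySketch {
    private final String name;

    private HelloQuerySketch(String name) {
        this.name = name;
    }

    // The compiler checks that the argument exists and is a String,
    // instead of a test author concatenating strings by hand.
    static HelloQuerySketch withName(String name) {
        return new HelloQuerySketch(name);
    }

    // Render the GraphQL query text, escaping backslashes and quotes
    // so that the argument cannot break out of the string literal.
    String serialize() {
        String escaped = name.replace("\\", "\\\\").replace("\"", "\\\"");
        return "{ hello(name: \"" + escaped + "\") }";
    }

    public static void main(String[] args) {
        System.out.println(HelloQuerySketch.withName("QCon").serialize());
    }
}
```

A builder like this keeps query text out of individual tests, so a schema change is fixed in one place rather than in every handcrafted string.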

Schema Management

Bakker: These are just some examples of what we did at the coding level, the way developers run their code. Being in a federated architecture, there are many more parts that are required. One of the things to really think about is schema management, because each team is managing their own part of the schema, and it all has to come together. Looking at the broader GraphQL community outside of Netflix, Apollo has some really good tooling that helps with things like schema management. We actively collaborate with Apollo on both the Federation specification and the tooling around it, to make developers' lives easier.

However, because we have so many existing systems within Netflix, and we wanted to deeply integrate with these systems, we decided to build this tooling ourselves. What you get with this tooling is really the one-stop shop for everything around your DGS and your schema. You get schema efficiency. You get statistics on how your schema and the fields in your schema are being used, and by whom. There are integrations with build pipelines, distributed tracing, metrics, logging, and all these things. This really becomes the one place to find all your information and manage your schema and your DGS.

Tracing and Logging

Srinivasan: Then comes tracing and logging in this distributed architecture. As you can see, we have our tool here. This is a custom in-house tool that developers have already been using. With the DGS framework, we were able to integrate with this existing tool and make this available for all developers who have migrated onto this new architecture. This tool shows the request fanout pattern. As you can see, it comes in, the gateway is now forwarding it to many different DGSs, which in turn can go ahead and talk to other back-end services. All of this is captured in this view. It's all color coded, so you can see yellows for warnings, and red for errors. It also has the correlated request logs on the left-hand panel. We're able to get correlated logs, because in the DGS framework, we ship all the logs to Elasticsearch clusters, and we're able to view all the relevant pieces of information in just this one view. Edgar also allows us to drill down and analyze performance. At every stage of the processing of the request, you can see how much time it took, all the way down to your data fetchers.
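
How the in-house tool collects its timings is not described in the talk. Purely as an illustration of what measuring "all the way down to your data fetchers" can mean, the sketch below wraps a data fetcher call and records how long it took per field. The class FetcherTimingSketch and its members are hypothetical; a real setup would report to a tracing system instead of an in-memory map.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical per-field timing wrapper; not the Netflix tooling.
class FetcherTimingSketch {

    // Accumulated time spent in data fetchers, keyed by field name.
    final Map<String, Long> nanosByField = new LinkedHashMap<>();

    // Run the fetcher and record its duration, even when it throws.
    <T> T timed(String field, Supplier<T> fetcher) {
        long start = System.nanoTime();
        try {
            return fetcher.get();
        } finally {
            nanosByField.merge(field, System.nanoTime() - start, Long::sum);
        }
    }

    public static void main(String[] args) {
        FetcherTimingSketch timing = new FetcherTimingSketch();
        String message = timing.timed("Query.hello", () -> "hello");
        System.out.println(message + " " + timing.nanosByField.keySet());
    }
}
```

Because the timing lives in a wrapper rather than in each data fetcher, it is the kind of concern a shared framework can apply uniformly without any setup by service teams.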

To learn more about observability at Netflix, please check out Elizabeth Carretto's talk.


Bakker: Another key aspect of developer experience, and something we spend a lot of energy on, is getting really great documentation. We have about 30 teams as part of this migration, and we have fewer than a handful of developer experience developers. How can we scale supporting that many teams and developers? Documentation is really key there, we believe. It is critical that developers can easily get started by themselves, can learn about the more advanced use cases by themselves, and, for the most part, can find solutions to problems they might run into by themselves as well. The documentation is really the key part there.

Learnings - Which Path to Pave?

Srinivasan: So far, we talked about the DGS framework that we built, and the several tools that we developed in order to make a smooth developer experience. For the remainder of the talk, we'll focus on the learnings from this journey. Starting off with, which experience do you want to build out or which path do you want to pave? In our case, we started really small with a very narrow set of users, a very narrow use case as well. To begin with, all we wanted to do was reduce boilerplate code. Starting with GraphQL Java, we wanted to add more features to it, but make sure that we don't take away anything. In order to do that, we needed to provide escape hatches at the right places, so developers still have access to all the low-level features in GraphQL Java, in the event that we didn't already support it in the framework. This was in the initial phases.

Then, as we moved along, and as the adoption started growing, we heard about more feature requests, and we incorporated a lot more features along the way. Now, gradually, we're up to a point where we have almost 30-plus teams who have onboarded onto this new architecture, all using our framework and tools. Not only that, we also have users that have nothing to do with Federation. They're not part of this migration, but they're also using our tools and the framework to just be able to easily start up a GraphQL service.

Bakker: Especially early on in this migration, we had to figure out what to solve for and which features to add exactly, because the teams actually implementing these DGSs started at the same time as we, the developer experience team, started talking about, maybe we should do a framework, and we should build all these tools. Prioritization was really important: what to do first. So early on, the developer experience folks embedded with the teams that were actually doing the implementation work, and were part of basically the first DGSs that were built. That way we got a really deep understanding of what problems to solve and what needed smoothing out. It also just built developer empathy.

Later on in the process, we adopted this philosophy that we never say no. Of course, there's a lot of subtlety to that. Essentially, there are many ways to say yes. What we mean by this is that we always prioritize the users first. Of course, we have the bigger features that we are working on, and things that we want to get out in the framework. Whenever a developer comes to us with a question, and that can be a bug report, but it can also be a feature request, and in many cases, they have some idea how to improve their developer experience. The easy thing to do would be to create a Jira, put it on your backlog, and come back to it a few weeks later whenever you have time.

Instead of doing that, what we do is we basically try to interact with these users right away. That means coming up with a design, basically, of whatever we need to do together with them. Then go and implement these features when possible, just the same day or the next day. With having that really short feedback cycle, we get a lot of benefits. First of all, our users, they feel like you're actually listening to them. Not just listening to them, but they get immediate benefits from being a user of the framework, because they have this idea on how they can improve their own work. Instead of having to put in the time to actually implement those features in the infrastructure, they talk to us, and they basically get it for free. The benefit for us, obviously, is that we get really good feedback, and we know what to focus on, because that's what our users need. This does slow down the work on bigger features a little bit, but that is a trade-off that worked out really well for us.

Engage With Your Users

Srinivasan: Finally, you want to engage with your users continuously. We have a support channel on Slack, where we hear from our users on an almost daily basis, whether it be feature requests, requests for bug fixes, or sometimes even design discussions with a narrowly focused set of users. We also do surveys to glean general patterns of how our tools are being used, and how effective they are across different types of users. We also found user interviews to be extremely effective, regardless of where you are in this process. Small, targeted focus groups of users, to learn about their workflows and general needs, can go a long way in informing your priorities and even figuring out what the next new experience to build out would be.

The Risk of Reinventing the Wheel

Bakker: This migration has been a huge success story for us, one that has been going on for about a year now. We also want to give you a little bit of a word of warning. Before you dive into building all this custom tooling and these frameworks yourself, there is definitely a risk of trying to reinvent the wheel. The first thing that you really have to think hard about is: can I actually do a better job than whatever is available in open source or on the market? Especially at Netflix, where we have a really strong culture of freedom and responsibility that also stretches to technology choices, a team is free to choose the technology they think fits the job best. That means that if they think they can do a better job using some open source tool instead of the tooling we provide, they're free to do so, and they will actually go out and do that. We have to make sure that the users see the benefits of using the tooling that we provide, and are actually doing a better job than they would otherwise, because they have these things available.

Something else that we put a lot of extra time and thought into is how we contribute back to the wider GraphQL community, and how we also learn from it. Concretely, we're collaborating very closely with Apollo. This is collaboration on the Federation spec, but also lots of in-depth discussions about how tooling should work and in which ways tooling can help make the developer experience better. This way, hopefully, others can learn from us, but we're also learning from how other companies are solving some problems that we might be facing. Definitely make sure that you give these things thought as well.

Srinivasan: We'd like to leave you with a famous quote by Winston Churchill, "A pessimist sees the difficulty in every opportunity, and an optimist sees the opportunity in every difficulty."




Recorded at:

May 28, 2021