In this podcast, Kavitha Srinivasan, a senior software engineer at Netflix, sat down with InfoQ podcast co-host Charles Humble. Topics discussed included: how the two main Netflix business units are migrating to GraphQL; how the schema is managed; performance considerations when working with GraphQL; the role of DevEx in a large migration.
Key Takeaways
- Since the introduction of the federation spec both the studio and streaming business units are adopting GraphQL. Streaming, which has been around for longer, started with REST and then adopted Netflix’s own Falcor JavaScript library for data fetching, but is now migrating to GraphQL.
- Federated GraphQL means that ownership of the graph can be split across teams. This in turn allows for a simplified gateway that merely acts as a router and requires minimal business logic. As with microservices, the way the schema is split tends to reflect organisational structure.
- GraphQL schema governance is challenging. Netflix has a schema working group that is part of the discussion when a team wants to add a new entity or field, and has experimental and deprecated tags that are used to help manage their large, complex schema.
- There are a number of performance considerations when working with a large GraphQL schema. In particular cross-region calls are not performant, and GraphQL also has an inherent N+1 problem.
- Focussing on DevEx - in particular providing good tooling and documentation - has really helped with the speed of adoption of federated GraphQL at Netflix.
Transcript
00:04 Charles Humble: Hello, and welcome to the InfoQ Podcast. I'm Charles Humble, one of the co-hosts of the show, and this week I'm speaking to Kavitha Srinivasan. Kavitha is a senior software engineer at Netflix. She has more than 15 years of experience in distributed systems and is currently working on developer experience to facilitate the growing adoption of GraphQL within the Netflix organization. GraphQL seems to be having a bit of a moment, offering a useful solution to the problem of re-aggregating information to solve business requirements. It's an interesting and powerful abstraction, but it does come with some trade-offs. So I was keen to talk to Kavitha, to find out more about Netflix's experiences. Kavitha, welcome to the InfoQ podcast.
00:47 Kavitha Srinivasan: Thanks for having me.
00:48 Migrating to GraphQL
00:48 Charles Humble: Absolute pleasure. Within Netflix you have the two top-line business units, basically streaming and studio. And studio, I think started with GraphQL and is moving towards federated GraphQL, whereas the streaming side of the business started with REST and an API gateway in front of its microservices, which I think is an architecture that's perhaps a little bit more familiar. So I thought maybe a good place to start would be to talk about those two different migration paths and how they compare.
01:18 Kavitha Srinivasan: As you know, the studio side of the business is more recent. It's come into existence since Netflix has started producing a lot of originals, Netflix originals, and shows and movies. And so it's really focused on developing apps for the pre-production, post-production and taking users through that life cycle. And there, since it's more recent, a lot of teams started doing GraphQL. They chose GraphQL to begin with. And we had a whole mix of different teams doing GraphQL their own way. It started off with having a thin API gateway layer that did GraphQL. And it would really actually just stitch the schemas from each of these individual teams together and expose a unified single API to various clients.
02:05 Kavitha Srinivasan: Now, this was a great place to start with. We've learned from many lessons in the past, and we started off with this. However, over time, very quickly because this side of the business was rapidly picking up, we found that that became the new monolith in that we had one gateway team and they were responsible for essentially stitching together the schemas and as the schemas evolved, each time we had to go and make a change in the gateway layer. Over time it just added on more and more business logic to the gateway, which is a huge anti-pattern. We just didn't want to do that. And that became a monolith in terms of just being able to scale the team. From a team perspective, we just didn't have enough folks who could keep up with the amount of change that was happening. And so that's where federated GraphQL came in.
02:52 What does federation bring to the table?
02:52 Charles Humble: So what does federation bring to the table?
02:56 Kavitha Srinivasan: It's still the same single graph that you expose to clients. However, the ownership of the graph is split across teams. And so now the gateway again becomes like a thin layer. And all it does is it routes parts of the request to the various backend services, which own a portion of the graph. So with federation, you can really share types. You can share entities and that's what it brings to the table. And so with that now, even the ownership has been delegated to individual teams and the gateway is just responsible for making sure we have the right logic in place to route. So when a request comes in, if the fields requested in that GraphQL query belong to two different services, it makes use of this registry where you register schemas and the corresponding services that own the schema.
03:44 Kavitha Srinivasan: And it can just route the query. It has query planning logic, which is the single most important responsibility of the gateway. And thus various teams can now deploy, own their services, own the data, and you have everything working as before. So that's one big migration that's currently in the works and it's been going on pretty successfully. We were able to adopt federated GraphQL. And like I said, this is very recent and we were able to learn from various past lessons.
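The routing step described above can be sketched roughly as follows. This is a minimal illustration, not Netflix's gateway: the registry contents, the service names, and the `route_fields` helper are all hypothetical, and a real federated gateway builds a full query plan rather than a simple grouping.

```python
from collections import defaultdict

# Hypothetical registry: maps "Type.field" to the service that owns it.
# Service and field names are illustrative, not Netflix's actual schema.
SCHEMA_REGISTRY = {
    "Movie.movieId": "movie-service",
    "Movie.title": "movie-service",
    "Movie.script": "script-service",
    "Movie.actors": "talent-service",
}


def route_fields(type_name, requested_fields, registry=SCHEMA_REGISTRY):
    """Group the requested fields by the service registered as their owner."""
    plan = defaultdict(list)
    for field in requested_fields:
        key = f"{type_name}.{field}"
        if key not in registry:
            raise KeyError(f"no service registered for {key}")
        plan[registry[key]].append(field)
    return dict(plan)


# A query touching three fields is split into one sub-request per owning service.
plan = route_fields("Movie", ["title", "script", "actors"])
print(plan)
```

The gateway itself holds no business logic here: it only consults the registry and forwards each group of fields to its owner.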
04:10 Charles Humble: So that's the studio side. What about the streaming side?
04:13 Kavitha Srinivasan: In the streaming side, we have a lot more history there. We've gone through multiple evolutions of the architecture. So as you mentioned, we started off with REST. REST was very chatty given the whole slew of devices that we support, many hundreds and hundreds of different device types. So that became quickly unmanageable. And then we introduced Falcor into the mix, which is kind of like GraphQL, but it predates GraphQL, or probably both evolved around the same time but it's more proprietary to Netflix, to solve this chattiness issue. So you can get all the data you want with a single query instead of making multiple REST calls. And with that, we had various patterns. The first one we started with was having the gateway host various client-specific adapters, so different UI teams could push their code that adapted this single exposed API to their needs because each device needs to adapt the API just slightly differently to customize it.
05:12 Kavitha Srinivasan: And so we started off with that model where the gateway itself would host these different pieces of server-side adapters and this allowed client teams to quickly customize their code and push their code and increase their velocity of development. But now this again became the new monolith. Everybody was pushing code to the same gateway and also the developer experience was not great here. It was increasingly difficult for teams to be able to test the small pieces of functionality they would push to this monolith. And so that became a big problem. And so we wanted to solve that. And so the architecture evolved into using BFF [Backend for Frontend]. So we split these adapters, server-side adapters, client facing adapters out of the gateway into separate Node BFFs. So that was the next version of this architecture. And so we've set about on that migration path and that's still actually ongoing; several device teams are still in the process, but it's almost there.
06:10 Kavitha Srinivasan: That's been taking a couple of years now, and all this still with Falcor. And so now if you take the two sides of the business, we look at studio, we see, "Oh, everyone's using GraphQL. There's a lot of community support, now it has federation and that's been working great." And so we're looking at the streaming side of the business and thinking, "Okay, we have Falcor, we don't have community support. Can we do GraphQL? Is that better? Because inherently it lends itself to better self documentation. There's much better tooling and all the goodness around GraphQL that comes with it."
06:45 Kavitha Srinivasan: And so now we're thinking, "Okay, we have this BFF pattern. So clients are already able to adapt the API, but now GraphQL is supposed to do that very thing. It's supposed to do the job of the BFF because with the query, you can now fill up the essential pieces of information that you want. It lends itself to highly customizable APIs." And so now we're thinking, "Okay, let's start experimenting with using this federated pattern, GraphQL pattern in streaming." So that's kind of where we're currently at. So it's interesting just seeing both sides of the business converging towards a similar architecture.
07:21 When you're thinking about splitting the graph, how do you reason about that?
07:21 Charles Humble: Right. Yeah. That's really interesting. So when you're thinking about splitting the graph, how do you reason about that? What level of granularity would you typically apply?
07:32 Kavitha Srinivasan: I would say it's no different from how you would split your microservices by functionality, mostly around business logic and basically what your team is responsible for. It mirrors the organization structure to some extent. And so each team would own the entities that correspond to the data that they would own. So that's kind of how you would split it. And because with federation we allow sharing of types. Actually, this is better explained with an example: a movie has several different fields and you have a movie ID, you have title, you have actors, the script and so on. So it's fairly huge. And you can imagine that this one type you'll have many different teams wanting to own the data for it, but probably not the entire type, just parts of the type. So each team would own just certain fields in the case of movie and they all contribute and it gets unified into this single movie type.
08:29 Kavitha Srinivasan: So you have one service that would own this parent movie type, adding a couple fields. And you have many other teams who would say, let's say one team is responsible for managing scripts. The other team is responsible for managing talent. So in this case, the information database around several actors and actresses and the talent profiles, they would add those specific fields and own just those fields from this big movie type. And so that's how splitting happens at a high level. You could have other entity types that are just exclusively owned by teams and by services and are not much shared. Those are much simpler to reason about but the whole point of having - the whole advantage of having - this federated graph is so different services don't have to redefine these types. They don't have to duplicate the data. You have a single source of truth for a single type of data.
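As a rough sketch of the movie example, the following shows several services each contributing a few fields to one unified type. The `compose_type` function, the service names, and the field types are hypothetical. The sketch also simplifies federation: in practice every contributing service would reference the key field (the movie ID), which is ignored here.

```python
def compose_type(contributions):
    """Merge per-service field contributions into one unified type.

    `contributions` maps a service name to the fields it contributes,
    each as {field_name: field_type}. Returns {field_name: (type, owner)}.
    A field claimed by two services is rejected, so each field has a
    single source of truth.
    """
    unified = {}
    for service, fields in contributions.items():
        for name, field_type in fields.items():
            if name in unified:
                other = unified[name][1]
                raise ValueError(
                    f"field '{name}' is claimed by both {other} and {service}"
                )
            unified[name] = (field_type, service)
    return unified


# Illustrative split: one parent service plus two teams adding their own fields.
movie = compose_type({
    "movie-service": {"movieId": "ID!", "title": "String"},
    "script-service": {"script": "Script"},
    "talent-service": {"actors": "[Talent]"},
})
for name, (field_type, owner) in movie.items():
    print(f"Movie.{name}: {field_type}  (owned by {owner})")
```

Clients see a single `Movie` type, while ownership of each field stays with exactly one team.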
09:22 Do you have any governance around the schema?
09:22 Charles Humble: I'm wondering how you avoid two separate teams adding identical or similar fields to a schema. Do you have any governance around the schema?
09:32 Kavitha Srinivasan: Yes. So that's been a huge discussion point actually. And at Netflix, we've invested a lot of effort and energy into trying to get that right. It's not easy, but we actually have a schema working group and it is a very people heavy process, I would say. So any team that wants to be part of this federated graph needs to participate in the schema working group. We have a data architect overseeing the evolution of this single unified graph. And so anytime somebody wants to add a new entity or add a new field to an existing type, they have to come to the schema working group and be part of the discussion. We make sure that it all makes sense and folks aren't just willy-nilly adding pieces of schema to the graph on their own.
10:16 Kavitha Srinivasan: So we're trying to get better with tooling to make that easier to visualize what the schema graph looks like, to automatically reject any sort of pieces of invalid schema. We have some tooling for that in place. So there's some safeguards but more importantly, it's this being part of the schema working group. That's been a huge factor for the success of maintaining this big federated graph.
10:39 How do you manage access control and access to data?
10:39 Charles Humble: And how do you manage access control and access to data? How do you prevent people from having access to data within the graph that they perhaps shouldn't be able to see?
10:48 Kavitha Srinivasan: Yeah. So because it's federated, we have some basic authorization. So within Netflix, there's a general authorization structure and authentication structure. So the gateway would make sure that all requests coming in are authenticated; it's from valid users. And the authorization is actually left to the individual back-end services. They basically ensure that the fields that are being requested using Netflix specific policies, they are able to ensure that whoever's requesting for data fit the profile that's been set by the back-end services. And so it is federated in some sense. There's some basic level of authorization but we try to make it easier through DevEx. So we did develop a framework on top of just basic GraphQL Java, and this framework adds all these Netflix specific authorization decisions and things like that.
11:43 Kavitha Srinivasan: It makes it easier for back end services to do authorization at that level. So they don't have to reinvent the wheel each time. So, it is somewhat all federated but we're able to scale that because with DevEx, we're able to provide a platform and the framework and the tooling required to make it easy for them to do that.
12:01 Do you have a deprecation process of some kind?
12:01 Charles Humble: GraphQL doesn't do any versioning as far as I understand it. So how do you manage something like removing a field from a graph? Do you have a deprecation process of some kind?
12:11 Kavitha Srinivasan: Yes. We do. We have a recommended lifecycle and we have some tooling to enforce that as well. But the recommendation is, because there is no versioning anything that you add is inherently non-breaking, it's backwards compatible. And so you can add new fields. When folks are in the experimentation phase. So they don't exactly know what the final schema should look like before going into production. So we have this @experimental directive that you can add in your schema, and that ensures that you don't have to go through this whole deprecation lifecycle. So when you're in this experimental phase, you use this directive, so you can easily add and remove fields without breaking schema for others.
12:49 Kavitha Srinivasan: Of course, there's some amount of responsibility. You need to be sure that no one else is using your field because it's new and you're still experimenting with it. And then once you've decided, yes, it's going to be part of your schema and you've pushed it out to production, it's there and you have several clients using it. We have client usage stats. So our tooling on the gateway, we keep track of who's using which field, we visualize that with good tooling. And also we have some schema validation checks in place. So whenever you're pushing changes to your schema, it will tell you whether you're removing a field or you've not deprecated something.
13:25 Kavitha Srinivasan: And so when you're pushing changes out to production, this is the process. And if you want to remove a field that's being used, you want to a) make sure that you mark it as deprecated using an @deprecated directive in your schema, and this instructs the tooling to say, "Okay, if you've marked something as deprecated and you don't have any users of that field by way of tracking the client usage stats, then you're allowed to remove this field from your schema." So we're kind of still evolving and trying to make that process better but this is roughly how we go about it.
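The removal rule described above can be expressed as a small check. This is a sketch under assumptions: the `can_remove_field` function and the shape of the field and usage data are hypothetical, and the transcript does not describe how Netflix's actual tooling is implemented.

```python
def can_remove_field(field, usage_counts):
    """Decide whether a schema check should allow a field to be removed.

    `field` is a dict with a "name" and a set of "directives".
    `usage_counts` maps field names to the number of recent client requests.
    """
    directives = field["directives"]
    if "experimental" in directives:
        # Experimental fields skip the deprecation lifecycle.
        return True
    if "deprecated" not in directives:
        # Production fields must be marked @deprecated before removal.
        return False
    # Deprecated fields may go only once no client is still using them.
    return usage_counts.get(field["name"], 0) == 0


usage = {"Movie.tagline": 0, "Movie.title": 1200}
print(can_remove_field({"name": "Movie.tagline", "directives": {"deprecated"}}, usage))
print(can_remove_field({"name": "Movie.title", "directives": {"deprecated"}}, usage))
```

The first call reports that removal is allowed and the second that it is not, because `Movie.title` still has client traffic.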
13:56 How much attention do you need to pay to where the data you're fetching resides?
13:56 Charles Humble: I would like to talk now about some of the performance considerations. So I'm imagining that for some queries, there will be a performance hit if you are dealing with network latency, for example. So perhaps if you're going across regions. How much attention do you need to pay, to where the data you're fetching resides?
14:17 Kavitha Srinivasan: Definitely cross-region calls we've seen are not performant. So it's recommended that you don't do cross-region calls. So that's been a common pattern we've seen. Over and above that GraphQL inherently does come with the N+1 problem. Let's say you want to fetch movies, you want to fetch certain fields from a set of movies. You have to get the movie IDs, and then you have to, one by one, go ahead and fetch the individual fields for each of the movies. And so in GraphQL, we already have this data loader pattern, which will reduce the chattiness to some extent. It lets you batch those calls.
14:52 Kavitha Srinivasan: So you end up reducing the total number of queries. So that is much better for your performance. And so that's something that we've been advocating for making sure a lot of folks who are writing these services use the data loader pattern to make sure that they don't run into this issue with performance. Beyond that, it's pretty much left to each of the teams to make sure that the queries that they're making are performant and optimal. You also want to sometimes limit the query depth. So if you have cyclical query references to the same data back in your query and that's how you've structured the schema, you can limit the depth of the query by enforcing a depth in your service. And so the query terminates at that point. So these are some techniques we've advocated for as part of best practices for performance.
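To illustrate the data loader pattern mentioned above, here is a minimal synchronous sketch. The `BatchLoader` class and `fetch_titles` function are hypothetical and are not the API of any particular data loader library; real implementations are usually asynchronous and dispatch per request.

```python
class BatchLoader:
    """Collects keys and resolves them with one batched call."""

    def __init__(self, batch_fn):
        self._batch_fn = batch_fn  # takes a list of keys, returns {key: value}
        self._pending = []
        self._cache = {}

    def load(self, key):
        """Queue a key; return a zero-argument function that yields its value."""
        if key not in self._cache and key not in self._pending:
            self._pending.append(key)
        return lambda: self._resolve(key)

    def _resolve(self, key):
        if self._pending:
            # Dispatch everything queued so far in a single call.
            keys, self._pending = self._pending, []
            self._cache.update(self._batch_fn(keys))
        return self._cache[key]


calls = []


def fetch_titles(movie_ids):
    # Stand-in for one batched backend query.
    calls.append(list(movie_ids))
    return {movie_id: f"Title {movie_id}" for movie_id in movie_ids}


loader = BatchLoader(fetch_titles)
thunks = [loader.load(movie_id) for movie_id in [1, 2, 3, 2]]
titles = [thunk() for thunk in thunks]
print(titles)
print(len(calls))
```

Four field lookups result in a single backend call for three distinct IDs, instead of one call per movie.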
15:40 Charles Humble: And do you run any parts of the queries in parallel as well?
15:43 Kavitha Srinivasan: That logic is mainly on the gateway. It's part of this query plan construction. So when there's an incoming query, there is logic on the gateway to identify which parts need to be executed serially, and which can be executed in parallel and directed to different services. And in some cases there's also queries that require serialization because certain fields are dependent on the fact that there are other pieces of data that are part of the query. So for example, you take this federated movie type that I was referring to earlier. Most other services would require a movie ID, which is a key in a federated type to identify that type. And so any data that they would need to provide each service would need to provide depends on that key and on that movie ID being present.
16:31 Kavitha Srinivasan: So in this case, the first query would go to a movie service to fetch the basic information. And then you would use that key to fetch data from all the other services. So this is essentially sequential, but from then on the gateway can construct a query plan such that the queries go in parallel. So that logic is completely dictated by what fields are being requested in the query and what that sequence should be.
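The sequential-then-parallel shape of such a query plan might look like the sketch below. The three `fetch_*` functions are hypothetical stand-ins for calls to backend services, and a thread pool is used only as one simple way to show the fan-out.

```python
from concurrent.futures import ThreadPoolExecutor


# Hypothetical stand-ins for calls to backend services.
def fetch_movie(title):
    return {"movieId": 42, "title": title}


def fetch_script(movie_id):
    return {"script": f"script-for-{movie_id}"}


def fetch_talent(movie_id):
    return {"actors": [f"actor-for-{movie_id}"]}


def execute_plan(title):
    # Step 1 (sequential): the key must be resolved first.
    movie = fetch_movie(title)
    movie_id = movie["movieId"]

    # Step 2 (parallel): services that only need the key run concurrently.
    with ThreadPoolExecutor() as pool:
        futures = [
            pool.submit(fetch_script, movie_id),
            pool.submit(fetch_talent, movie_id),
        ]
        for future in futures:
            movie.update(future.result())
    return movie


print(execute_plan("Example Movie"))
```

Only the first call blocks the others; once the movie ID is known, the remaining services can be queried at the same time.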
16:55 How is caching handled?
16:55 Charles Humble: How is caching handled? Do you cache at the local microservice level, or do you cache the whole graph within the gateway?
17:03 Kavitha Srinivasan: There's actually no caching happening here. So each request is independent. We don't cache any of the queries here, any sort of caching would have to happen on the client side. That said, we are looking into having some persistence in queries. Now GraphQL Java has persistent query support. And so if the gateway receives a query that it has essentially seen before using a hash ID, it can look up that query and return the result for it. But other than that, there is really no caching on the server side of things.
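The transcript does not detail the mechanism, but one common form of persisted queries keys the query text by a hash so that a client can send only the ID. The sketch below assumes that form; `PersistedQueryStore` is a hypothetical class and does not represent GraphQL Java's API or Netflix's gateway.

```python
import hashlib


class PersistedQueryStore:
    """Stores query text under a hash of that text."""

    def __init__(self):
        self._queries = {}

    def register(self, query_text):
        """Save the query and return the hash ID a client can send later."""
        query_id = hashlib.sha256(query_text.encode("utf-8")).hexdigest()
        self._queries[query_id] = query_text
        return query_id

    def lookup(self, query_id):
        """Return the stored query text, or None if the ID is unknown."""
        return self._queries.get(query_id)


store = PersistedQueryStore()
query = "{ movie(id: 42) { title } }"
query_id = store.register(query)
print(store.lookup(query_id) == query)
```

In this form the store holds query text rather than responses, so it is compatible with a gateway that keeps no cached results.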
17:34 Charles Humble: Okay. So the GraphQL gateway is not stateful.
17:37 Kavitha Srinivasan: No. It's not.
17:40 Is your gateway based on the open-source Apollo gateway?
17:40 Charles Humble: Is your gateway based on the open-source Apollo gateway or is it something that Netflix built themselves?
17:48 Kavitha Srinivasan: So it is based on the federation spec that Apollo came out with. It is not exactly Apollo gateway, and we have a few reasons we chose to go that route, but they are interoperable in that they both implement the same federation spec and we collaborate heavily with Apollo on that as well to make sure we're compliant and we're following the same path we don't deviate. But the reason really within Netflix, we had to implement the federation spec in-house is because we have a lot of custom tooling. We have metrics, and we have observability tools, distributed tracing. All of that is already part of the Netflix ecosystem, security and authorization mechanisms. All of that is another example as well. And so we needed all of this to co-exist. If we had to take the Apollo gateway, we would have had to build all of this on top of that, which it was much easier to just implement the gateway and have it integrate with the Netflix ecosystem, just out of the box.
18:51 Charles Humble: Have any plans to open-source your gateway at some point?
18:52 Kavitha Srinivasan: We actually don't. Not that I'm aware of. So currently we're not planning to, because like I said, the whole reason we have it is because of the tight integration with the Netflix ecosystem. However, we do have the DGS framework, we call it - the Domain Graph Service framework that is built on top of GraphQL Java, and that's what each GraphQL back end service is that owns a piece of this federated graph. So that framework, we are in the process of open-sourcing early next year. And it provides a lot of conveniences on top of GraphQL Java when it comes to being able to set up, get up and running, it takes barely a couple minutes to actually just start up a service using your Spring Boot initializer and write a simple "hello world" data fetcher, actually. And so we have built a lot of conveniences on top of GraphQL Java that we think would be valuable. And that's something we're planning to open-source early next year.
19:47 What are some of the challenges in terms of getting federated GraphQL adopted at Netflix?
19:47 Charles Humble: I wonder if we could perhaps step up a level and talk about what some of the challenges have been in terms of getting federated GraphQL adopted at Netflix?
19:58 Kavitha Srinivasan: As I mentioned earlier, I've been part of the developer experience team. So right off the bat, one of the biggest challenges was, okay, GraphQL is also fairly new, and on top of that federation is a whole new concept. And so since it's so new, we don't have a whole lot of existing knowledge to draw from, so education was a big aspect. Trying to get folks to get familiar with the concept. And so that's where developer experience really helped. We focused on good tooling. We provided a framework to make sure our users, our developers didn't have to learn GraphQL, let alone federated GraphQL from scratch. We try to abstract that put a nice abstraction layer, so they essentially are dealing with very simple concepts. So that was one big challenge, just education and making sure we're providing the necessary platform and tooling. So users don't get bogged down by the nitty-gritty details.
20:56 Kavitha Srinivasan: They can just focus on business logic, because here, like I mentioned before, we were talking about having a big migration. And if you have something working already, there's very little incentive to move over to federated GraphQL. If they already have something working. And the main push behind moving to federated GraphQL was so that we could slim down the gateway team's responsibilities. We could slim down the responsibilities of having to put in business logic, having to change the gateway each time something changed between the front-end and the back-end. And so trying to incentivize that was one of the big challenges I'd say.
21:33 Kavitha Srinivasan: And then just trying to make sure that the schema, I talked about the schema working group, I wouldn't say it's controversial but it's hard to convince everyone to get into the same... It slows developers down. It essentially comes down to developer velocity. At Netflix, we have a culture of loosely coupled with highly aligned teams, and so we make sure that no one's blocked on another team to get their job done. They're able to move independently, but converging towards a common goal.
22:06 Kavitha Srinivasan: And in the past also back end teams and front-end teams would like to move at their own pace. They don't generally want to be dependent on each other for any changes. So with the introduction of the federated graph in the schema working group, we were essentially saying, "Hey, slow down. You don't need to put in changes that rapidly. Let's make sure that the schema makes sense. Let's make sure it's schema-driven development." So we heavily discouraged generating schema from existing code. So if you already have some data available, there's plenty of tooling available to generate your schema based on what you already have. We heavily discouraged that. We said, "No. Let's make sure that your API makes sense.
22:45 Kavitha Srinivasan: Let's make sure that your schema makes sense. So go the other direction, given a schema, you can generate code and go from there." So that was another challenge. We had to convince a good number of users to make sure that we do schema-driven development first. And then of course an ongoing challenge is just making sure that the tooling that we provide, the framework that we provide, we had to keep up with the different types of use cases. When we started off, we had a very narrow use case. We said, "Okay, we have GraphQL Java, let's make sure users can quickly write a GraphQL service." And then that quickly evolved. You had different teams. We discovered new requirements and we had to make sure we kept up and we have this concept of a paved path.
23:26 Kavitha Srinivasan: And we wanted to make sure everybody was using the tooling and the framework that we provided. So they didn't have to go off on their own and solve the problems their own way and creating new tweaks of different types of problems. And so to make sure that they experienced the least amount of pain during this migration, we had to make sure that we kept up with evolving needs and all feature requests. So these were some of the challenges. It's been a learning process and it's been getting better and better over time.
23:55 Did you have any “early adopter” type challenges working with the federated GraphQL spec?
23:55 Charles Humble: I'm imagining as well, Netflix was one of the very early adopters of the federated GraphQL spec. And I'm guessing too, you've probably got one of the largest graphs. You're probably one of the largest users of the federated GraphQL spec at this time. So I'm wondering if you had any particular early adopter type challenges in terms of working with the federated GraphQL spec?
24:21 Kavitha Srinivasan: That's actually an interesting question. So I did mention that we collaborate heavily with Apollo. We meet with them almost on a biweekly or monthly basis. And so the spec, since we started off has definitely evolved as we've hit new problems. As we found, we needed to tweak the spec to clarify any confusion, to put better checks in place. We've been able to collaborate with them on the spec and both our implementations have evolved, taken the same path in the types of the evolution of the implementation itself.
24:56 Could you describe what the developer documentation looks like at Netflix?
24:56 Charles Humble: When we think about these challenges around adoption, one of the things that Paul and yourself spoke about on the "Architectures You've Always Wondered About" track at QCon Plus in November, was the emphasis that Netflix puts on developer documentation. And I was quite struck by that. And I wondered if perhaps you could describe what the documentation looks like at Netflix?
25:22 Kavitha Srinivasan: Definitely. So we have our internal Wiki site. We host everything up there. So it's more of a tutorial style I would say. Starting off, we've done a mix of different things. Some are video tutorials that we embed on the Wiki page and getting started from there, we have more advanced documentation that we try to keep up to date as much as possible. That's the approach we've taken. And so most users get directed to our internal documentation site and they're able to navigate, either go to tutorials or trying to find specific topics that they'd like and dig deeper into. But we try to make sure that for every feature that we've added, there is some form of documentation available. We try to make sure everything is as self discoverable as possible. We do have a support channel as well, where folks feel free to ask questions, but for the most part, we're able to scale because we're able to redirect them to the documentation site. And if we find any gaps, we immediately try to fix that. So they feel that they can self help as much as possible.
26:26 Charles Humble: And in terms of how the documentation is actually produced, what tooling are you using? Are you working with Javadoc, with AsciiDoc? Is it published in Markdown?
26:35 Kavitha Srinivasan: Oh, it's all Markdown. We put everything in Markdown format and we have code snippets in place. That's what we use. It's very similar to GitHub Pages.
26:45 When you think about DevEx at Netflix what are you optimizing for?
26:45 Charles Humble: When you think about DevEx, at Netflix, what are you optimizing for? Is it Developer Velocity or is it something else?
26:54 Kavitha Srinivasan: It's primarily Developer Velocity. We want to make sure that the experience is as seamless as possible. Initially, obviously every time we're starting off, we haven't addressed all the different use cases. There's a lot of rough edges. So we use the term incubation at Netflix. So we work with a small subset of users. We participate in their workflows, essentially. We help them adopt the tooling. We help them write their code, using the tooling that we are developing and thereby that's how we discover the rough edges and try to improve. So first and foremost is definitely Developer Velocity, any sort of bugs they experience, we try to fix immediately. We don't want to see them blocked. Obviously, if there's bigger feature requests, it's all prioritized depending on a sense of urgency, but anywhere we find that there's repetitive work being done, or work that can be abstracted away to a higher level. We try to get in there, provide any tools and make that process easier for them.
27:59 Are there particular metrics that you track to measure DevEx success at Netflix?
27:59 Charles Humble: Are there particular metrics that you track to measure DevEx success at Netflix?
28:05 Kavitha Srinivasan: We don't have any metrics. It's a hard concept to measure when it comes to developer experience. Like I said, we have support channels. We also know how many teams have adopted/are using our tooling. So for example, I talked about the Domain Graph Service framework. We know how many services are up and running in test and in production, we know how many teams are involved in this migration. And so it's more the number of teams that have adopted. And that's how we go about measuring how many folks are adopting the tooling successfully and are able to do their job essentially. With developer experience it's more likely that we hear when things are not going well, rather than when things are going well.
28:47 Kavitha Srinivasan: So it's true in that if everything's working well, most folks are satisfied and there's no complaints but when something is off or wrong, you'll immediately hear about it on a support channel or on Slack. And so that's when you know that, that's the real-time feedback that you get. And to answer your question, we don't have concrete metrics that we go by, but it's more in terms of the number of developers who are using the tools and the framework that we provide.
29:15 If people want to get started with federation, where would you suggest they start?
29:15 Charles Humble: So we're at just about the end of our time. One last question would be if people want to get started with federation, where would you suggest they start?
29:23 Kavitha Srinivasan: I would suggest starting with Apollo's website. They have great documentation on federation. They have the federation spec and they also have their gateway that you can just take and play with. That would be a great starting point to get familiar with the concepts as well as just how to run it, and work with it.
29:39 Charles Humble: Kavitha, thank you very much indeed.
29:41 Kavitha Srinivasan: Thank you very much. Thanks for having me.