
Not Sold Yet, GraphQL: A Humble Tale from Skeptic to Enthusiast



Garrett Heinlen talks about how Netflix builds and deploys GraphQL and how they are running it in production.


Garrett Heinlen works as a Software Engineer at Netflix. He is an enthusiastic polyglot engineer.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.


Heinlen: It all started about a year ago. I got a new job at Netflix, which was super amazing. I got to be on this brand-new team and work on some really cool projects, which is always fun. We evaluated what tools and languages we wanted to use as a part of this. A lot of teams at Netflix use Java, some are using Ruby, Node here and there. We read up on the pros and cons, but ultimately we chose Apollo, GraphQL, Node, and TypeScript, with very few entities. I'm very proud of that.

Over the year or so that I've been working on this project, our team would often eat lunch together, and they always wanted to know how the project was going because we had adopted a lot of these new tools, and there was a lot of interest in the GraphQL part in particular. As I learned more and more about this technology, I would rave about the cool new features, or how it was different from REST, or what I was confused about. I didn't realize I was doing this, but at every lunch I would say, "Oh, I'm not really sold yet," after every single conversation, so I inherited this catchphrase of not being sold yet, even though every day I was talking about how cool it is.

After a couple of months my friends convinced me to give a talk from a skeptic's point of view, so I did, and I've given this talk once before. The first thing you do when you get accepted to give a talk, or when Anna asked me to come talk on this, is buy all the books to try to figure out what GraphQL is, so I did that.

I hope you've heard of this thing called GraphQL; if not, you're probably in the wrong talk, but I really hope that I can shed some light on the power that I think GraphQL brings. If nothing else, I hope to give you some talking points, if you're already sold, to go tell your other teammates. The truth of it is, until recently I wasn't really that big a fan of GraphQL. I hadn't used it in any production capacity, I hadn't built any side projects with it, and honestly, I thought it was just this new front-end fad that would come and go. Once I started looking into it and really building things with it, I noticed that the things I liked about it weren't really the technical parts at all. I really enjoyed what it enables for an organization, and how it helps solve problems people face day-to-day within big teams. Since then, I'm a convert. I'm a GraphQL enthusiast.

I just moved from Australia, I like video games, I'm really excited to be here so please feel free to come talk to me after this. I'm happy to share all my experiences and thoughts around this topic, so don't be shy.

Working at Netflix

I work at Netflix and I've been there a little bit over a year, it's gone by amazingly fast. I feel really grateful to be a part of such a great organization and team within Netflix, and they have a very interesting culture, very unlike anywhere else I've ever worked. I think I've learned a lot and grown a lot professionally as well as technically.

Before we get started, I want to share some context around how Netflix actually operates and works. For the purposes of this presentation, Netflix is basically two main organizations. One is product, or streaming, which you are all hopefully familiar with: recommendations, downloads, things like this. The other, less well known, is the content engineering space, where we build and facilitate all the tools that go into creating the shows that we all watch. That's the one I work within.

What does that really mean? Let's say you want to start a band. You need to find people to be in your band, you've got to find a place to practice, you've got to figure out what genre of music you're going to play. That's just the start. You've got to get your first gig, you've got to record your first album. You start selling some CDs - where do you sell your CDs, what languages do you distribute this in? You've got to work out who gets paid. Does the lead singer get paid twice as much because they sing? That doesn't make sense. A whole lot goes into making these shows, and that's basically where our teams come in.

In short, we help create all the Netflix originals. We're building the largest internal studio in the world, and we have to do that in a way that will scale - not with people, but with technology. We often look at what technologies we can use to do this, and we really think that GraphQL is going to have a big impact on that. Netflix has heaps of teams, an ever-growing number of teams, and each team has the freedom and responsibility to operate however they see fit. There's no one dictated way of building systems; it's very freeform. That fosters innovation and creativity, and it empowers individuals to build things the best way possible. This talk focuses in particular on my team within my organization, but more and more teams are starting to use GraphQL across the entire company.

Netflix's culture is very different from anywhere else; they like to take really big bets. They started out shipping DVDs around the world, and now they're one of the largest streaming platforms in the world. They didn't do that by always playing the safe bet. We pride ourselves on making these big bets and [inaudible 00:05:27] when we make a mistake or when we fail, but we really believe that GraphQL is going to drive us forward in the next phase of our development cycle.

My team has been building a single-entity graph over the last year or so. We have a lot of different downstream services: some of them are gRPC integrations, some of them are Java, some are Ruby, some are REST. The thing is, we don't really care. We define a GraphQL schema that's a product-focused representation of our domain, and then we build UIs against it. How we feed data into that graph is irrelevant, which allows us to move very quickly. We've been showing this off to other people and they want it as well. It's definitely been spreading within our organization over the last many months. Many of these single-entity graphs are starting to emerge, and people want to keep adding their entities to ours, they want to get ours into their systems, and it's really becoming contagious within our organization.

We're still really early on in this journey. I think GraphQL is still a fairly new technology, and we've only recently been using it, so I'm excited to hear from people in this audience if you have more experience or things you can help us learn from, please come tell me those things.

GraphQL is Good for Teams

Why should we all care about GraphQL? Why are they hosting conferences about it? Why are you all here listening to me talk about it? At the end of the day, I think GraphQL is really good for teams. It fosters team communication in an unmatched way, it radically improves the way front-end and back-end engineers work together, and it acts as living documentation for the taxonomy of your system's API. It's a really powerful tool, so let's dig into why I think it's good for teams.

This is a really big rock, and so is this one. Any guess what this one is? These are all monoliths. This might be an unpopular opinion, but I think monoliths are really great. They have their own problems, I'm sure - maybe you've all experienced them in your past - but I would argue that a lot of successful companies started as a monolith, and there must be some reasons for this. If you look into them, they have some really interesting characteristics.

All the code lives in one source tree, a single system. Each commit is an atomic action. It's easier to debug because it's all in one stack trace. You generally don't version, because you're generally the only client of yourself. I guess you could summarize it by saying information is easy to find when it's in one place. It might be a big ball of mud, but you love that ball of mud and you know how to dig your way through it.

On the other hand, distributed systems are inherently complex. They're only as good as the network in between them. You have to figure out some way to pass messages between them, you have distributed transactions, you have eventual consistency - some of those are buzzwords I was able to find. Anyway, information is harder to find when it's spread across many systems, where everything is isolated and separated. There are definitely a lot of pros to this, but I would argue it's much more complex.

Why do we so often opt for the more complex of the two? I think most successful projects started as a monolith, but we almost always choose the harder option. I don't believe it's a coding problem at all; I think it's a people problem, because when you break it down, communication is very complex. It's inherently lossy, and it breaks down at scale. It's just an interesting topic in general: I have a thought in my mind and I want to put that same thought in your mind, and all I have to do that with is words. That's challenging one-to-one, let alone when you have 30 or 40 people on your team all trying to move in the same direction. Computers will do what we tell them to; people aren't always so easy.

I've been mixing these words on purpose here. There's a really good talk by Rich Hickey called "Simple Made Easy." He covers the difference between simple and easy, complex and hard. You should really go watch it - it's so much better than this one - but not right now. I'll try to summarize it very quickly. Simple is composed of a single element; it's very easy to see where one thing starts and where it ends. A straight rope: pretty simple. On the other hand, this is a complex system. It's one rope, but it's very hard to know where it starts and where it ends; it's all combined together. This is interesting, but let's see how it comes into play with our day-to-day interactions within a team.

If you look at this amazing diagram that I prepared earlier, each circle is a person and each line is a communication path between two people. If you have three people on your team, you have three lines, and they're very simple, not complex, very one-to-one. If you double the size of the team to six, now you have 15 lines. If you go to a team of just 12, it becomes a very complex matrix of interactions between the people on your team. As the number of lines grows, you usually need more process. You might have daily standups, or you might have Scrum management, or Agile coaches, whatever. It gets way more complicated as the system grows.
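That blow-up is just the pairwise-connection formula: n people have n(n-1)/2 possible communication paths. A quick sketch in TypeScript (the language the team chose) checking the numbers above:

```typescript
// Pairwise communication paths between n people: "n choose 2".
function communicationPaths(n: number): number {
  return (n * (n - 1)) / 2;
}

console.log(communicationPaths(3));  // 3 lines for 3 people
console.log(communicationPaths(6));  // 15 lines for 6 people
console.log(communicationPaths(12)); // 66 lines for 12 people
```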

How do we scale this? There's no way that 12 or 50 or 100 people all working on one system can really work efficiently and quickly, all moving in the same direction, because the cost of communication is too complicated and too high. The most logical thing, which I'm sure is no surprise to many of you, is to break into smaller teams. You get back to the dream team of three people, three lines - not a very complicated system. With the same number of engineers, you can have much more efficient, well-oiled machines, each doing their little bit.

With microservices, we optimize for independent teams. Each team can become their own dream team, moving very quickly in the same direction, and the cost of communication is quite low. This empowers single teams to move very quickly, but that's only one team. What happens when these teams need to work together? Data requirements very rarely live on just one system. You need to push and pull data between them. How do we do that today?

There are a number of ways, but a very common one is REST. I'm not here to bash REST; REST has gotten us a very long way and has a lot of good things about it. File uploads, for example - do that with REST; it's still hard in GraphQL. The main issue I have with REST is that I feel it's easy to make mistakes. Take the resource of a person: is the plural people, or persons? I'm sure the spec says exactly what to do, but would you know, or would your teammate know, when they go to make this endpoint? What about versioning?

Great, you want to make a breaking change, so you slide everything under a V2, write your new code, and deploy the resource you've added or changed. Do you bump all the existing endpoints to V2, or do you now have both a V2 and a V1 - and do all your clients even know which one to hit? I feel like you've added a complicated system on top of a complicated system. I'm sure the spec talks about the best way to do this, but it's been done differently at every company I've ever worked at.

The mesh is back, except now it's not with people, it's with many systems, and that's much more challenging over the internet. Each new client and each new resource has to be created manually, with all of its nuances. A client needs to know which API to talk to, what the auth requirements are, and what the API versions are. It just becomes a very complicated, and therefore costly, system.

GraphQL Allows for Optimizing for Organizations

I believe GraphQL goes a step further beyond REST: it helps an entire organization of teams communicate in a much more efficient way. It really does change the paradigm of how we build systems and interact with other teams, and that's where the power truly lies. Instead of the back end dictating, "Here are the APIs you receive and here's the shape and format you're going to get," it expresses what's possible to access. The clients have the power to pull in just the data they need. The schema is the API contract between all teams, and it's a living, evolving source of truth for your organization. Gone are the days of people throwing code over the wall saying, "Good luck, it's done." Instead, GraphQL promotes more of a uniform working experience between front end and back end, and I would go further to say even product managers and designers can be involved in the process, to understand the business domain that you're all working within.

The schema itself can be co-developed by anyone, because it's just SDL - a Schema Definition Language - with no implementation details; it's just some syntax that describes the entities in your domain. I'm pretty sure most people could write these if they're familiar with their domain. No more making funky APIs to meet the needs of your UI constraints, and no more back ends saying, "Here's what you have to use, because it works for us."
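As a sketch of what that looks like, here is a tiny, hypothetical SDL fragment - the type names are illustrative, not anything from Netflix's actual schema:

```graphql
# A product-focused slice of a domain, written in plain SDL.
# No implementation details - just what exists and how it relates.
type Show {
  id: ID!
  title: String!
  episodes: [Episode!]!
}

type Episode {
  id: ID!
  name: String!
  runtimeMinutes: Int
}

type Query {
  show(id: ID!): Show
}
```

Anyone who knows the domain can read or propose changes to a file like this, which is what makes it a workable contract between teams.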

Instead, you build a schema that's a representation of your business, and I would argue that such a schema is a superset of all use cases, because once you have the system defined in a very neat way, many teams can build against it without having to change much. Instead of exposing database tables over an API, which I feel is often what's done with REST, you can build a product-focused schema that really reflects your business and what is possible to do within the domain. I think that is what really empowers clients to build amazing UIs and prototype very quickly, whether that's on an Android device or an iOS device. They can all leverage the same product-focused system.

Have you ever been in a meeting where people are talking about slightly different things - users, guests, admins - and you're all trying to figure out what you each mean, and then you go back to your editor once you've aligned on what to build, and you've converted it into yet another word? I think it's very common within most engineering organizations, because coding is just challenging. People don't have a ubiquitous language to define what their system actually is and what it does, but I believe GraphQL promotes that by the nature of having one schema that reflects your app's entities, so you're all discussing the same words meaning the same things. Whether you use GraphQL or not, I think having that ubiquitous language is generally a good thing for the business, but if technology can drive it as a way of implementing it, that's only going to bring amazing results.

If you have this product schema, and you have your designers, your PMs, and your other engineers on board, it's going to change. I complained earlier about REST and versioning; I feel like this is a lot easier with GraphQL, because you never delete anything, you just always go forward. I'm just kidding - you can delete things - but it's much more of an evolution: you keep adding fields, you deprecate old fields, you do some analytics, and you say, "Look, no one's using that old field. It's now safe to remove."
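That evolution is supported directly in SDL via the built-in `@deprecated` directive. A hypothetical example of rolling a rename forward rather than breaking clients:

```graphql
type User {
  id: ID!
  # New field added alongside the old one; clients migrate at their own pace.
  displayName: String!
  # Old field marked for removal once usage analytics show nobody selects it.
  name: String @deprecated(reason: "Use displayName instead.")
}
```

Tooling surfaces the deprecation to every client, and once field-usage data shows the old field is idle, it can be deleted safely.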

There are amazing tools like Apollo Engine - and I'm sure there will be many others - that do auditing and client detection of what's actually being used and what's slow. You can see your graph for what it is and just move forward: change the shape of the message, deprecate the old things. It's a rolling-forward development cycle, as opposed to breaking changes and a big-bang V2.

One of the biggest fears I see when people are faced with GraphQL is that they think they have to rewrite the world. You've already built your castle, you have your amazing infrastructure, you have all these microservices - but in reality, none of that has to change. You can leverage your existing tools and platforms and just enhance them with GraphQL. You don't have to rewrite the whole world.

What this means for back-end engineers is amazing: they can keep operating in their same development cycle, they can keep their high SLAs, they can care for the gRPC endpoints they maintain and just wrap them with a GraphQL schema. If you have an old API that no one knows how it works, treat it as such: make it a black box and expose the endpoints you need to interact with over a GraphQL endpoint. Each team can continue to do what it's best at, while enabling everyone to move quickly and iterate on the UX of their product, or even allowing other back-end engineers to consume the graph if it serves their needs.
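A minimal sketch of that wrapping idea in TypeScript. The endpoint path and field names below are hypothetical, and the fetcher is injected so the legacy service stays a swappable black box:

```typescript
// Sketch: wrapping a legacy REST endpoint behind a GraphQL resolver.
// Only the response shape is remapped; the old service itself is untouched.

type FetchJson = (url: string) => Promise<any>;

interface User {
  id: string;
  displayName: string;
}

// The fetcher is injected so the legacy service can be stubbed in tests.
function makeUserResolver(fetchJson: FetchJson) {
  return {
    Query: {
      // Resolves `user(id:)` by calling the old REST API and mapping
      // its snake_case response onto the schema's shape.
      user: async (_parent: unknown, args: { id: string }): Promise<User> => {
        const raw = await fetchJson(`/legacy/api/v1/users/${args.id}`);
        return { id: raw.user_id, displayName: raw.full_name };
      },
    },
  };
}
```

A real server would hand a resolver map like this to something like Apollo Server alongside the SDL; nothing about the downstream service has to change.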

You have this product schema, things are moving quickly, and you can evolve it as new requirements come up. But what happens if you have a few of these entity graphs? I've been talking in the context of one entity graph. There are amazing tools, like Apollo Federation, coming out really soon, which let you deploy an API gateway that can merge all of these entities into one graph. It's really interesting - some people have gone as far as automating this entire process with custom tools - and all of that open-source work is becoming available to everyone. What would this look like in practice?

In this other amazing diagram I prepared earlier, the teal boxes are the microservices for users, products, accounts, and reports. Each of these can have its own domain entity graph - small dream teams working really quickly, moving very fast - and each defines its own schema. They push this up into a schema registry. This is a more specific implementation detail, but basically you want to merge these entity graphs in a way that lets you track changes, update, and effectively code-generate the federated graph automatically: one graph of all the entities, which any client can consume to drive its products forward.

As a consumer on your PS4 or your iPhone, you don't have to care. You say, "Here's one graph of all of my app's business domains, contexts, and entities, and how they work. I go there, I get my information, I build my systems." And as a microservice provider you say, "Here's the graph that I own, here are the entities within it, and here's its schema." Then you merge them together.
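In Apollo Federation terms, that division of ownership might be sketched like this. The types are illustrative; `@key` and `@external` are Federation's directives for marking entities that are shared across services:

```graphql
# In the users service (its own team, its own release cycle):
type User @key(fields: "id") {
  id: ID!
  name: String!
}

# In the reports service: it extends User with its own field
# without touching the users team's codebase.
extend type User @key(fields: "id") {
  id: ID! @external
  reports: [Report!]!
}

type Report {
  id: ID!
  title: String!
}
```

The gateway stitches both definitions into one `User` type, so a client can query a user and their reports in a single request without knowing two services are involved.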

Let's build our distributed monolith. We'll have a single graph and a single API, with teams developing rapidly, all speaking the same language, moving quickly, and understanding the actual needs of the business. This will empower many teams to move very quickly. I think GraphQL promotes a new type of service - a higher-order service. It's like a giant map function in the cloud. Countless clients can develop against it and build whatever workflow or tool they need to make an impact within their business. As I mentioned, my team has been building this single-entity graph over the last year.

Having this graph has really changed the way that we think about the information within our system. As I mentioned, we have many downstream services, and historically it's been, "This is the data we can get from here. Here's the data we can get from there," and we've been very siloed in how we actually implement systems. With the graph model, that doesn't really matter; we just say, "Here's the information. What do we need to solve for the user to build our products?"

We can really dig into the complexities not of where the data is stored, but of what we provide to the people who need our tools. And this doesn't just help within our team. Sure, we build the graph and know how it works, but other teams can go to our graph through a GraphiQL tool, explore our schema, and understand all of the data entities within our system without reading any of our documentation, and without a meeting or handover with us. They can fetch data from our system very trivially, so it's an optimization there as well.

It's a new way to answer questions that were previously impossible to answer. We see things as a unified model, but this is just the start for us; the prize is bigger. Once you're able to answer these questions, and you have these entities and how they relate to each other, you can really start to change how you develop systems, because you no longer need to care about persistence and things like that. Sure, those things do come up in terms of performance, but those are implementation details; the graph schema allows you to change the way you think. The problem is, once you have one entity, you really need them all, because, as I mentioned earlier with REST, one system really only has one entity within it. Systems need to push data between them; things naturally relate to each other, and they often live in different systems.

To have a full graph of entities, you want to traverse these relationships and understand them at a much bigger scale. Luckily for us, other teams have wanted to contribute to this system. People are excited to talk about what a thing should be named, what the proper meaning of a field is, how we change it moving forward, and how we migrate others to this model - and it's spreading. Over the last year we've built many dozens of apps, but they're against single-entity graphs, and we want to put those graphs together, so people are really excited to work on this initiative.

Over the last year, we've been doing exactly that. We have a team dedicated to building a multi-entity graph, which is a holistic view of all our content engineering entities, domains, and the operations we can do on them. They've been pulling them into this federated graph, and it's a pretty challenging effort. Each team operates independently; they have their own release cycles, their own product managers, their own needs. It's not coming for free, but the trick is that we've been using something like GraphQL for a while at Netflix - not GraphQL exactly, but a thing called Falcor, which we've also been using for quite a while. It's also a graph API, rather than a graph database.

If you haven't heard of it, it's basically the way that we fetch data for the needs of the product. It's very different, but it has a lot of similarities: you define the paths of the fields you want to select, and only those are returned to the client. There's a lot of caching, deduplication, referencing, and normalization. Some of those things are in GraphQL, some aren't, but the point is, we have a lot of experience building graph APIs to support our products.

We've been sharing notes; we can cheat on this test by learning from the people who have been doing it for the last 5 or 10 years. They work right next to us, and we've recently had a reorganization to move our two organizations to be part of the same team, so a single platform team is now going to be maintaining the graphs for both product and our content engineering space, allowing us to move very quickly and learn from all those experts who have been doing this for quite a while. I'm very interested to see where this goes in the next six to nine months; I'm excited.


With GraphQL there are still some challenges that we're facing. One thing I wish we had done was talk to the Falcor people much earlier, but we're there now, so that's ok. There are a few things that we still need to figure out. Schema management is a very challenging topic. I've alluded to how great it is if you can align and move in the same direction, but there are many ways to solve this problem, and I really do think it depends on your organization and the size of your team. I'll just go over some of the challenges, and some ways you might approach them within your teams.

Do you define the schema to reflect your UIs, or do you define it to reflect your products? As I mentioned before, there are users, products, payments, and reports. Do you take those exact entities, put them into a graph, and expose that to your UIs directly? Or do you instead ask what the UIs need exactly - like a gallery or a discount section - and reflect the product's use cases and the UI in the schema so it's easier to consume? There are definitely pros and cons to both approaches, so that's something you'll need to figure out for yourself.

Then, who owns the schema? Do you have a single team that mandates which changes are allowed in? Do they say, "No, that doesn't fit our use case," or, "We're actually going to hold off on that until more teams need it"? This can work very well whether your organization is small or very large, but it's definitely something to consider. Another approach is to instead have informed captains. An informed captain is a concept we use at Netflix to determine who's leading an initiative. Potentially, per entity, you have an informed captain who says, "Here's the schema for our entity, and we're going to be the source of truth for maintaining and extending it, so if you want to make a change, come talk to us." That's another approach you can take to defining your schemas.

Another challenge we've faced is distributed writes. Everyone talks about GraphQL for reading information, but that's only half of the internet; the other half is changing it. How did we solve this? Within our domain we have chosen to do only single-entity writes for now. If we do choose to go down a multi-entity, distributed-transaction route, we will have to face some very challenging topics at that point, but I believe we could solve it with something like a job model, maybe sending a finished payload through a subscription. We have yet to solve this case; it's something we have punted on until later.

Another thing that has been quite challenging to figure out how to do correctly is error handling. In the GraphQL spec, there is an errors part of the response you get back, but I would argue that is not what you should use for this - it's more for exceptions, things your server could not handle. Instead, I would encourage people to look at putting errors into the schema. For example, if you're loading a user's page and the user is not found, that is a known state your app can be in, so your schema should reflect all possible known states: you might get a user payload back, or a not-authorized payload, or a user-not-found payload, and your schema tells you the information you need. Your UIs can be built around that.
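One common way to model this (a sketch, not Netflix's exact schema) is a union result type, so every known outcome is part of the contract and clients handle each state explicitly with inline fragments:

```graphql
# Schema: every known outcome is a type the client can select on.
union UserResult = User | NotAuthorized | UserNotFound

type User {
  id: ID!
  name: String!
}

type NotAuthorized {
  message: String!
}

type UserNotFound {
  message: String!
}

type Query {
  user(id: ID!): UserResult!
}

# Client query: each known state is handled explicitly.
query GetUser {
  user(id: "123") {
    ... on User { id name }
    ... on NotAuthorized { message }
    ... on UserNotFound { message }
  }
}
```

With this shape, "user not found" is ordinary data the UI renders, while the spec's top-level `errors` array is reserved for genuine failures the server couldn't handle.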

There are very clever things you can do around client selection sets that allow for moving forward, especially on mobile, so that you don't break older clients, but that's an implementation detail. Errors, I feel, should be in the schema, and I'm sure I'll get questions about that in a moment, but that's one.

Those are a few of our challenges, and we have been working very closely with the product team to figure out what can work for us. On product and streaming, to the best of my knowledge, they've been using the ivory-tower approach to defining the schema: core teams maintain the schema entities and control what goes in and what changes. But they've encouraged us to try a new approach where each service owner, or domain entity owner - what we're calling domain service entities, or whatever - maintains their own subgraph, which gets merged into the federated graph. We're adopting this approach and we're very interested to see how it plays out.

As for whether you make the graph UI-centric or entity-centric: we are still evaluating the exact approach here, and I'm in a working group interviewing the whole organization to see what the exact use cases of each team are. What we are evaluating is potentially providing a managed experience in front of this federated graph - a BFF, or back end for front end, for each UI team - so they can do their own data transformations and maybe extend their graph to be more product-centric. But we're still evaluating the needs of the teams before we rush into solving this problem.

The way that we actually do this is through working groups. It's a very interesting way to solve the problem: instead of prescribing a solution, we talk to everyone and ask, "What do you actually need? What problems are you facing now, and how can we help you?" This has been very successful for us, and I think it's probably been one of the bigger factors in why GraphQL has succeeded at Netflix. There's been lots of interest from product as well as our content space in exploring this, and we've been sharing a lot of our learnings amongst the teams. I feel like that's driving it forward in a very nice way.

Having this unified graph is really going to allow us to answer questions for product people that we've never been able to answer before, because the data has been in siloed systems. Once you have this multi-entity graph, it really changes the way that you want to build applications. You don't have to think about talking to this team to get this extended, or joining with that team to get that extended. Instead, all the data that's possible in your organization is there, and you just build against it. Within my team in particular, back at a smaller scope, we want to see how we can start building applications this way. Is that possible? What would that look like?

We have a localized innovation lab where one or two people per quarter will basically hack on projects and see what we can do. We've had some really innovative things come out of this, which we hope to open-source. For example, what if you could write an entire app without writing any GraphQL by hand? We use TypeScript. What if you could just access the data you want, look at the types, map them to props, map that to your GraphQL query, and then send that to the server without writing any GraphQL? We have a library that does that. Whether or not we use it in production - maybe it's a terrible idea - it's just interesting. Could we go further and, given the schemas, have drag-and-drop GUIs to make UIs? It's possible if it's all typed. It's very interesting, so come help us solve these fun problems.

If you have any crazy new ideas on how we can build new apps or make the studio better for the world, please come speak to me; we're always hiring. I'll leave you with one final thought. I think GraphQL is amazing, not necessarily for the technology or the tools around it, like Apollo and whatever else. GraphQL wins my heart because I think it changes human behavior. It gets teams talking to each other about how they can evolve and enhance the schema, it's a typed, living source of truth of what the API taxonomy is for your entire organization, and it moves us back to the monolithic dream. Each team can be independent and move quickly, and your monolithic API allows your entire organization to be represented in one place. Together, we can do what we do best, all while bringing everyone together.

Questions and Answers

Participant 1: I have a question about performance and rate limits. For example, with a REST API you can say, "Ok, movies are quite fast and not time-consuming, but movies/IDs/reviews is a very heavy request and you can't send more than 100." With GraphQL we have queries and one endpoint. Did you encounter anything like this, and if yes, how did you try to solve this problem?

Heinlen: There are many questions in there; I'll try to answer all of them. If you want to do rate limiting and complexity analysis of the actual GraphQL query, you can definitely do this, basically as a pre-filter. As part of the validation of the actual query, you can say, "If this query is too deep or has too many field selections, reject the query altogether." You can go further and say, "Here are all the clients' queries that we know about, and we're going to basically whitelist them and everything else is blacklisted," so you know exactly which queries they're going to run in production and can optimize them. In terms of knowing what's being used and knowing the speed of these things, there are amazing tools like Apollo Engine, but you can also build internal tools.

In the opensource libraries there's a tracing spec that gives you timings per field, per resolver, to understand the app, basically. I would encourage you to read Principled GraphQL. It's a one-page doc that talks about how to build these things in production, and part of that is basically: optimize when you have a problem, because knowing every possible combination of queries you're going to receive is a super complicated end-to-end problem. I feel like you should solve your known use cases, whitelist your queries, persist your queries, especially if you need to do this in production, and then tweak your performance from there.

Yes, you can block the request if it's too complicated or has too many fields. You can do detection and timings with opensource tools like Apollo Engine, or in the request pipeline, and stuff like this. I would say optimize when you have a problem. Those are my suggestions.
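To make the depth check concrete, here is a minimal sketch of rejecting a query whose selection sets nest too deeply. The `Field` shape, names, and limit here are illustrative assumptions, not Netflix's implementation; real servers do this as a validation rule over the parsed GraphQL document.

```typescript
// Toy representation of a GraphQL selection set, just deep enough to
// illustrate a depth-based complexity check.
interface Field {
  name: string;
  selections?: Field[];
}

// Depth of the deepest selection path in the query.
function queryDepth(selections: Field[]): number {
  let max = 0;
  for (const field of selections) {
    const depth = 1 + (field.selections ? queryDepth(field.selections) : 0);
    if (depth > max) max = depth;
  }
  return max;
}

const MAX_DEPTH = 3; // illustrative limit

// Reject the query outright when it nests too deeply, as described above.
function validateDepth(query: Field[]): void {
  const depth = queryDepth(query);
  if (depth > MAX_DEPTH) {
    throw new Error(`Query depth ${depth} exceeds limit of ${MAX_DEPTH}`);
  }
}

// movies { reviews { author { name } } } is depth 4, so it gets rejected.
const deep: Field[] = [
  { name: "movies", selections: [
    { name: "reviews", selections: [
      { name: "author", selections: [{ name: "name" }] },
    ]},
  ]},
];
```

Whitelisting is the complementary strategy: instead of scoring arbitrary queries, the server only accepts query documents (or their hashes) that known clients have registered ahead of time.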

Participant 2: I have an additional question to the previous one. We have some federated graph with our entities and, for example, we query a film review and a user entity. What about concurrency and parallelism? Do you use them at Netflix, and how?

Heinlen: We have yet to tackle this problem, but it's coming soon with the work on the federated graph. The new tooling from Apollo, Apollo Federation, is opensource as of two weeks ago. It's very complicated; I would recommend that you go read their docs on how they implemented it. The way they do it is with something like a SQL query planner, only for GraphQL. They're able to detect, "This is the first ID we must have; now that we have that ID, we can make all these requests in parallel, and then we can make this last request once we have the ID from the previous request." They're trying to optimize the chaining of operations, which you must do in which order, and that is open-sourced. I think there's also a paid offering on the side of that, I suppose.

We're evaluating using that. We're building something similar internally, and that's what the infrastructure team is now dedicated to solving in the coming months, but it is possible, it's just not trivial. If you need it you build it, if not you buy it, I suppose, would be my advice, but we're not there yet with an actual solution. I would recommend looking at Apollo Federation, at least how they've gone about solving the problem, and maybe you can build something similar in your team. Another thing I'd mention in terms of performance is data loading, which is a very common way to speed up apps: you can cache a request, you can memoize a request.

For example, if your query is user, comments, user, you wouldn't naturally want to request user one twice. You can memoize it, basically, so if you've already fetched that data within a request, you'll get back the same entity in memory as opposed to hitting the API twice. You can also data-load multiple IDs at once. You might have a recursive structure that's, "Get the vendors. Get the vendors," and you might make 100 requests to that service; as I said, you can batch up the IDs, make one request, and it'll merge the results back into the graph. There are lots of strategies to try to optimize this and make it more performant, but I feel like that's implementation detail, and I wanted to get across, "Here are the nice things about GraphQL."
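The memoize-and-batch behavior described here is the DataLoader pattern. Below is a deliberately simplified sketch of it (my own toy version, not the opensource dataloader library or Netflix's code): duplicate loads within a request return the same cached promise, and all keys queued in the same tick are fetched in one batched backend call.

```typescript
// Minimal DataLoader-style helper: per-request memoization plus batching.
class SimpleLoader<K, V> {
  private cache = new Map<K, Promise<V>>();
  private queue: { key: K; resolve: (value: V) => void }[] = [];
  private scheduled = false;

  constructor(private batchFn: (keys: K[]) => Promise<V[]>) {}

  load(key: K): Promise<V> {
    const cached = this.cache.get(key);
    if (cached) return cached; // memoized: same entity, no second fetch
    const promise = new Promise<V>((resolve) => {
      this.queue.push({ key, resolve });
      if (!this.scheduled) {
        this.scheduled = true;
        // Flush after the current tick, so every resolver that runs now
        // gets its key into the same batch.
        queueMicrotask(() => this.flush());
      }
    });
    this.cache.set(key, promise);
    return promise;
  }

  private async flush(): Promise<void> {
    const batch = this.queue;
    this.queue = [];
    this.scheduled = false;
    // One batched request instead of one request per key.
    const values = await this.batchFn(batch.map((entry) => entry.key));
    batch.forEach((entry, i) => entry.resolve(values[i]));
  }
}
```

With this, a user, comments, user query triggers a single batched fetch for the unique user IDs, and the duplicate load just gets the memoized promise back.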

Participant 3: In terms of GraphQL, what do you use to resolve latency problems when you are using GraphQL to reach out to different sources to pull back information?

Heinlen: Do you mean if part of the request has failed, or in general?

Participant 3: Yes, if part of the request fails.

Heinlen: I didn't get into the specifics, but there's a really good talk about error handling by a woman who used to work at Medium and is now at Twitter. I would put errors into the schema. If something can fail and you know it can fail, it depends on the type of issue it is. For example, an authorization error is a different type of error than an intermittent one that will be fine if you retry it. We design our systems and our schema so that everything can be nullable, and the UI can handle something being null. If there are actually business rules that say, "This can never be blank," then sure, make that a required field and throw your exception in your server.

In terms of intermittent failures, you could do retry logic in the resolver. We do that in some scenarios, where we try a couple of times before we actually bail out of the resolver, but we don't have a uniform, "This is the best way to solve this problem generically." I feel like it really depends on your type of error: if you can retry, go for it; if it's something that can be null, let it be null in the schema; and if it's something that's actually going to cause a business logic error, then put that error into the schema so you can reflect back to your users what's actually gone wrong.
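A resolver-level retry for intermittent failures might look like the sketch below (the helper name and the fall-back-to-null behavior are illustrative assumptions, not Netflix's production code): try a few times, and if every attempt fails, return null so the nullable schema field absorbs the failure.

```typescript
// Retry an async fetch a few times; on exhaustion, fall back to null so a
// nullable schema field can tolerate the failure.
async function withRetry<T>(
  fn: () => Promise<T>,
  attempts = 3
): Promise<T | null> {
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch {
      // Intermittent error: swallow it and try again until attempts run out.
    }
  }
  return null; // the UI handles null, per the schema design described above
}
```

A resolver would then wrap its backend call, e.g. `reviews: () => withRetry(() => fetchReviews(movieId))`, where `fetchReviews` stands in for whatever service call the field needs.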

Participant 4: How do you resolve the situation where you have multiple teams using essentially the same business entity, but they have, let's say, a different view of the entity, and they want different data? I suppose you do not want to load the entity with all the fields, etc.

Heinlen: Yes, most definitely. This is obviously going to be challenging. Take user, for example: I would imagine most people are going to have a user entity in their system, and different teams may have different fields or different representations of it. With this Federation tooling from Apollo, basically each subgraph extends the base type, and only one team actually owns the main type. If there are any merge resolution conflicts, you'll get that in your editor with amazing tooling. You can get query timing, you can get per-field-level timing, and checks like, "This has an ID here, but a string there? That's invalid."

I mentioned the schema registry in my second diagram, and that's where it comes into play. Each team, locally and in production, in every environment, builds against this schema registry, and as a part of your deployments it basically validates that, yes, this won't break any clients, and yes, this marries with all the other entities fairly well. If it doesn't, you'll get a build error or a PR error, and you can go further and dig into it. To summarize, each team can extend these shared user types or entity types, and as part of your validation and deployment step, with the help of a schema registry, you can actually validate and check that it's not going to break the bigger system.

Participant 5: I was wondering if you could elaborate a little bit on how you enumerated business logic errors, specifically around the implementation. Are they first-class types, or are they fields on a certain type?

Heinlen: This is an ever-evolving topic. The GraphQL Paris conference happened last week, and there's a talk explicitly on this; look up GraphQL Paris errors and you'll find her talk. She's amazing; she used to work at Medium, now she works at Twitter. You define a union type. Instead of saying, "I want user one," as your query and getting a user object back, what happens if that user didn't come back? Maybe there isn't a user one in your system.

Instead, your query's return type would be a user payload, or a 404, or unauthorized, and those are other types, and they may have specific fields for the errors, or the validation rules, or whatever message you want to display in the UI. Then as a client developer, for example, if you use an Apollo query component, you switch on the data union and you say, "If it's this type, render this component. If it's that type, render that component."

You can actually present better errors to users, as opposed to a big banner on top saying, "Uh-oh." You can redirect them back to the flow you want them to go through if you know exactly what broke, as opposed to digging into the data.errors array with 100 field selections in there. If you can co-locate your data request with your errors, you're going to have a much better time as a UI developer and you can actually reason about why things break, as opposed to you as a consumer having to figure out why something might have broken.
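On the client, that union shows up as a discriminated union you can switch on. Here is a rough TypeScript sketch of the pattern (the type names, fields, and messages are invented for illustration, not a real schema): each branch of the GraphQL union arrives tagged with `__typename`, and the component switches on it to pick what to render.

```typescript
// Client-side shape of a union result type: the happy path plus two
// schema-modeled error cases, discriminated by __typename.
type UserResult =
  | { __typename: "User"; id: string; name: string }
  | { __typename: "UserNotFound"; message: string }
  | { __typename: "Unauthorized"; message: string };

// Exhaustive switch: the compiler flags any union branch left unhandled.
function renderUser(result: UserResult): string {
  switch (result.__typename) {
    case "User":
      return `Hello, ${result.name}`;
    case "UserNotFound":
      return `No such user: ${result.message}`; // could redirect to a recovery flow
    case "Unauthorized":
      return `Please sign in: ${result.message}`;
  }
}
```

The payoff is exactly what's described above: the error cases live next to the data request, typed, instead of being fished out of a generic data.errors array.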

Participant 6: As your teams are growing and you're moving into this federated graph, how do you deal with breaking changes that may be introduced by other teams changing types, even in small ways? Is there some system that you use in CI that gives you some log of type changes?

Heinlen: We are currently using Apollo Engine in production. For every request, all the timing and query information goes to their system and gets logged. Even at development time, in your editor, you can have your little GraphQL tag, and it'll give you timings, "This field costs this much time," using production data to give you that field by field. As well, locally you can run an Apollo check, or put this into CI, whatever you're using for CI, to basically say, "Here's the schema that we're going to publish. Does it meet all of our clients' specifications, and does it match production traffic over the last couple of months? Not only do the types match, but does it match production usage?"

It doesn't get into the rules of, "Is this correct for your business?" I feel like that's where PRs come in, but at least the timing and the usage of the graph can be checked with this tooling. That's where the registry comes into play as well; you wouldn't be able to actually publish a schema if it broke your known clients' usage. In terms of how we do version changes, again, I mention this because it's being developed actively at the moment, but if you're going to make breaking changes, in almost all scenarios I would roll forward with new fields, new messages, and new queries if it's possible.
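A toy version of that registry check might diff the proposed schema against the published one and fail the build on breaking changes. This is an illustration under a deliberately flattened schema representation of my own, not Apollo's check or Netflix's tooling:

```typescript
// type name -> field name -> field type: a flat stand-in for a real schema.
type Schema = Record<string, Record<string, string>>;

// A change is breaking if a published field disappears or changes type.
// Purely additive changes (new fields, new types) pass.
function breakingChanges(published: Schema, proposed: Schema): string[] {
  const problems: string[] = [];
  for (const [typeName, fields] of Object.entries(published)) {
    for (const [fieldName, fieldType] of Object.entries(fields)) {
      const next = proposed[typeName]?.[fieldName];
      if (next === undefined) {
        problems.push(`${typeName}.${fieldName} was removed`);
      } else if (next !== fieldType) {
        problems.push(`${typeName}.${fieldName} changed from ${fieldType} to ${next}`);
      }
    }
  }
  return problems;
}
```

A CI step would reject the deploy when `breakingChanges` returns anything, which captures the "roll forward with new fields" advice: additions sail through, removals and retypings get stopped.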

Participant 7: If you could elaborate on your investigation from a tech stack point of view, why you would choose GraphQL over another query language, say, OData?

Heinlen: I can't speak to that because I do not know the other one, sorry, but I have only had a nice experience with GraphQL so far. It allows your UIs to move very quickly; I feel like that is a huge advantage compared to not using GraphQL. I think it's client-centric, though that's not to say you can't use it from server to server. If you are going to do that, there's gRPC and other nice tooling with two-way data streaming and stuff like this, but I would say GraphQL is optimized for clients especially, and it's made by Facebook and it's all the rage. Maybe there is some hype to it, but I've only had good experiences with it so far. I can't speak to the other one.

Participant 8: Regarding the teams, is there a methodology or tool that you are using for communication between the teams?

Heinlen: Do you mean in terms of the schema design?

Participant: Yes.

Heinlen: We do have these working groups where we talk more generally about, "Should we do Relay-style pagination? Should we do this for errors?" We have this forum to discuss these things more publicly, but now with this new reorganization, with both product and content engineering merging, we do have a dedicated team building out these federated entity graphs, and they're making these decisions at the moment.

The ultimate goal is that these teams will maintain their own graphs themselves. Yes, that's something to consider: how we can enable these teams to collaborate more effectively, especially if they're dealing with multiple entities across multiple domains, extending them without conflicting. I feel like as soon as you go to this distributed model, you do need tooling and communication to do it well, so that's something to be mindful of, but I don't have a clear answer for the best way.

Participant 9: My question is, how do you determine when a graph is needed? Do you build a graph for every feature of your product? Particularly what I'm interested in is as your product becomes larger, how do you prevent a proliferation of all of these graphs, where there's some type of duplication of effort in places, in bits of your product?

Heinlen: That's definitely a fair concern. As of yet, our team, and the many teams that have been using this, are 100% on the GraphQL interface for dealing with the back ends. I can imagine a world where, if you're doing CQRS, or file uploads, or something else that doesn't fit the paradigm, or streaming (you could do subscriptions, but that's less used as of yet), you might want to go outside that scope. As for us, we've found no need not to use the graph for all of our use cases so far.

In terms of bloat and it getting too big over time, I feel that's just a matter of versioning and deprecating fields if you really want to remove them. In terms of adding new things, I think that's more of an organizational problem to solve: do you really need five user systems, or can it all be within one? I don't know; we haven't gotten to that stage of maturity to have a good answer for that.

Participant 10: My question's just about coupling. I think it's been covered by a few questions already, but I'm just curious about your long-term deprecation strategies. For instance, with your federated graphs, to stop this bloat that people have mentioned and having these older entities hang around for 5, 10 years because some other teams may be using it, do you have maybe a more aggressive deprecation strategy that you've been talking about internally?

Heinlen: At Netflix, we don't tell anyone what to do. Everyone has the freedom and responsibility to do what they want to do, and judgment comes into that: what things should we be focusing on, what things should we do? To the best of my knowledge, and I could be wrong, don't quote me on this, I believe Facebook has never deprecated a single API they've ever made, and they're doing fine. I don't know about getting rid of old things; I'm sure there are good strategies for that.

If you control all of your clients, you can just say, "You get to upgrade in the next quarter or something, and we're going to turn the old one off." Great, if you have that luxury, I think that's fine. We don't control every single client that uses our app, and we can't tell people what to do, so it's just a matter of saying, "Please move to the new ones, you get new features." Convince them with the new shiny, maybe? I don't know. We have yet to run into this problem, and I'd be really curious if it actually becomes one, personally.

Participant 11: I haven't really used GraphQL myself, but I've read a good amount of literature about it, and I'm concerned about one thing. From what I understand, for each query you still need to write a resolver, which, the way I see it from the outside, is more like writing a REST API endpoint: you still need to implement the resolver for it. Do you see the advantage of GraphQL in the power of expressing queries in a way which I don't understand yet, or do you see it more in the fact that because you have that uber taxonomy, you can make those queries more powerful? The simple fact that you have a taxonomy gives the entire organization a common understanding and enables those queries. Do you see it more in the intrinsic query power, or more in the advantages that the taxonomy gives you?

Heinlen: I think both, personally. As a UI engineer, you can go to this graph (I should have shown an example, but it's not a very specific talk) and, as a consumer, just hit CMD+TAB and see all the queries that are available. You go in there, CMD+TAB to see what fields are available, with the description and the type, and you just keep digging down as a UI engineer: "I want this and this," and that's the only data you get back. It's very nice as a UI engineer. You don't have to worry about, "I get this from this REST endpoint. Ok, in that big JSON body I get an ID, I go somewhere else, get some more, and maybe an included batch."

I feel like it definitely helps the consumer of the API get just what they need, when they need it, and the taxonomy, the documentation of your entire system, is also very appealing. To speak to the REST-versus-resolver implementation comparison, it's just a transport-layer way to define how to get data. It's the boundary of your system, so you've still got to implement it at some level, I think, but the bigger benefit is for the clients, I would imagine.




Recorded at:

Jul 30, 2019