Federated GraphQL to Solve Service Sprawl at Major League Baseball


Summary

Olessya Medvedeva and Matt Oliver discuss how they have begun to implement a Federated GraphQL architecture to solve the issue of service discovery, sprawl and ultimately getting the data needed.

Bio

Olessya Medvedeva is a Software Engineer at Major League Baseball. Matt Oliver is a Senior Engineering Manager at Major League Baseball.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.

Transcript

Oliver: Our talk is Federated GraphQL to solve service sprawl at Major League Baseball. We're going to go over, partly at a high level and partly in depth, the last 8 to 12 months of us implementing a Federated GraphQL architecture within the Web Platform team at Major League Baseball. My name is Matt Oliver. I'm the senior engineering manager on the Web Platform team.

Medvedeva: I'm Olessya Medvedeva. I'm a software engineer on the same team.

What Is Web Platform?

Oliver: When we're talking about Web Platform, what is that? Our team handles all the base architecture, infrastructure, and DevOps for MLB's web footprint, so mlb.com, milb.com, the USA Baseball Play Ball. There are a ton of different touch points within all these different clients that we have, in order to serve baseball experiences to our fans. We support multiple frontend teams, which make up around 30-plus developers. Really, what we're tasked to do on top of supporting all the stuff above is to be more forward thinking within the organization in terms of new web frameworks and new web architecture paradigms, and to lead that within our organization. Our biggest responsibilities include the server-side rendering architecture, which we are moving from an old Handlebars architecture to React and Next, and supporting services that power more niche frontend stuff, like personalization, which is a big cornerstone of our GraphQL architecture. This is very challenging because rendering the mlb.com that you see when you go to mlb.com takes many services. Historically, there's just been a lot of logistics around how that data is pulled, and how we eventually render our page, because of all the data that's involved.

The Problem

Just kicking off the problem here, we have different clients. To the right, you can see we have web, iOS and Android, on top of a bunch of other things like our connected devices. They all need data from many services within MLB that are maintained by different teams, different people. Coordinating all that has been a challenge.

Specifics

Going into the specifics, one of the reasons we started looking into GraphQL is because we have low visibility into who is making calls on our platform. We have a lot of clients that are managed by different teams. Sometimes we have services from within our organization or external clients that are pulling data from us. It's hard to really know who is pulling data from our platform, which makes making model changes and potentially breaking changes really hard, because it's hard to notify those clients, or it's really just hard to logistically coordinate all that stuff that's going on.

There's a lot of evidence of redundant calls within our architecture, so services calling services and duplicating those calls. There's just a lot of waste in terms of how all these services are communicating. Unfortunately, we're prone to DDoSing ourselves because of that. If there's an errant code push that goes out, we end up pausing, whether it's our CMS or our backend service. It's really these self-inflicted wounds that are making us reevaluate and make this architecture a little bit more sane. We've been burned by third-party integrations, and so a big thing is to be able to isolate our clients from our back-of-house services, so if we need to swap things out, whether the reason is a logistical one or an economic one, we could easily do that without affecting our clients.

A lot of our services have different caching structures. Most of our teams manage their own caches, and it becomes hard to figure out what's been in cache, where, for how long, and so trying to centralize everything to make things a little bit more nice and sane, and easy to reason about. Also handling upstream failures. Again, each team throws different errors or handles errors differently, and reacting to those errors is an issue. Again, trying to centralize things in order to have a cohesive strategy around a lot of these things.

Client/Server Communication Today

Medvedeva: The most common request-response pattern today is your typical RESTful service where the clients are doing the fetching of data and stitching things together. In this diagram, we have our diverse set of clients all making calls individually to one or more of our services. If we are lucky, we have a CDN and a cache of some kind fronting our services to give us some stability and breathing room. The burden on the clients is that they're responsible for knowing where and how to stitch together disparate service responses in order to successfully render their content. What's wrong with this picture? One of the big issues surrounding this architecture is that the clients can be very chatty. They are sometimes making multiple calls to services in an all-or-nothing fashion. This ends up tying our frontend implementation to the backend data model, which in turn requires multiple code changes when the models inevitably change. We also have issues around backend service exposure, allowing external clients to directly hit backend services, which is both a security and scalability problem.

Oliver: One way that teams look at simplifying this model is to create potentially mashup services that are going to decrease that churn and give a more simplified API service to our clients so they're not themselves calling all these disparate services, but we have a middle layer that is going to handle that for us. In this example here, we have our clients that are contacting these mashups. The mashups could be Lambdas. They could be CloudFlare Workers. They could be Cloud Functions. They could be your own proxy, whatever that may be. However, we're going to run into different and potentially more complex issues because of having this mashup here. Again, what's wrong with this? We've increased complexity. We now have this middle layer of potentially many mashup services, which could be calling themselves. It's really hard to keep track of what all these mashups are doing, and where they're calling.

Then there is the maintenance of these mashups, so who owns these mashups? What teams are relying on them? You have a discovery and maintenance issue. You could have duplication of data across these mashups, so we're back to square one where we're making all these redundant calls, because these mashups become overly broad, and now, other teams are pulling from them, and you create this tight coupling. There's no holistic view of your API, because, again, unless it's well documented, you have all these mashups that are living who knows where, or in multiple places. Developers who are implementing these clients might not have a good idea of where to get the information they need, so they end up making their own. We get into this vicious cycle.

What Is GraphQL?

Medvedeva: To solve the issues we've been describing, we decided to explore how GraphQL can play a role. What is GraphQL? GraphQL is a graph query language. It was developed by Facebook in 2012, and had its first public release in 2015. GraphQL uses the query language to request the data you want from the server. Queries sent to a GraphQL server are usually POST requests sent to a /graphql endpoint. In our example, we are asking the server to retrieve information about team 147, and only requesting its name. One of the most powerful features of GraphQL is the ability to selectively return partial data from a statically defined model. In the response, we get team 147's name, which is New York Yankees.
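As a rough illustration of the request described above, the sketch below builds the JSON body a client might POST to /graphql and mimics the server returning only the requested field. The `team(id: ...)` field name, the `abbreviation` field, and the in-memory `TEAMS` record are assumptions made for the example, not MLB's actual schema.

```python
import json

# Hypothetical full record held by the server; only "name" is requested below.
TEAMS = {147: {"id": 147, "name": "New York Yankees", "abbreviation": "NYY"}}

# The query text is an assumption about what the slide showed.
QUERY = """
query {
  team(id: 147) {
    name
  }
}
"""


def build_request_body(query: str) -> str:
    """Serialize the query the way a client would for a POST to /graphql."""
    return json.dumps({"query": query})


def select_fields(record: dict, requested: list) -> dict:
    """Return only the requested fields, mimicking GraphQL's partial selection."""
    return {field: record[field] for field in requested}


body = build_request_body(QUERY)
response = {"data": {"team": select_fields(TEAMS[147], ["name"])}}
print(json.dumps(response))
```

The point of the sketch is that the server holds the whole record, while the response carries only the field named in the query.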

Anatomy of a GQL Service

A typical GraphQL service is made up of three interconnected parts, the first being the well-defined statically typed models that are similar to other serialization formats, like protobuf. These models describe what the output of our graph will look like to implementing clients. In this example, we have a type Team that has an id of Int, and a name of String. These models can include any number of other fields. There are a variety of primitive GraphQL types, but users can easily implement their own object models. A special Query type defines how implementing clients can query the graph. In our example, we have one query, getTeams, that accepts an array of Ints and returns an array of Teams. The second part of a GraphQL service is the resolvers. The resolvers take an input from a query, call upstream services to resolve the desired data, and return the data in the defined output model. It is here that we've separated the frontend and backend data models to provide a flexible interop between the two. The third pillar is obviously the upstream services, of which there could be one or more that are required to resolve the output model.
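The three parts can be sketched in a few lines. This is a minimal illustration, not a real GraphQL server: the SDL string is only there for reference and is not parsed, and `fetch_team_from_upstream` with its `team_id`/`full_name` keys is a hypothetical stand-in for a backend service.

```python
from dataclasses import dataclass

# Part 1: the models, written in GraphQL SDL for reference only.
SCHEMA = """
type Team {
  id: Int
  name: String
}

type Query {
  getTeams(ids: [Int]): [Team]
}
"""


@dataclass
class Team:
    id: int
    name: str


# Part 3: the upstream service (hypothetical; returns its own backend shape).
def fetch_team_from_upstream(team_id: int) -> dict:
    upstream = {147: {"team_id": 147, "full_name": "New York Yankees"}}
    return upstream[team_id]


# Part 2: the resolver, which maps backend data onto the output model.
def resolve_get_teams(ids: list) -> list:
    teams = []
    for team_id in ids:
        raw = fetch_team_from_upstream(team_id)
        # This mapping is where frontend and backend models are decoupled.
        teams.append(Team(id=raw["team_id"], name=raw["full_name"]))
    return teams
```

If the backend renames `full_name`, only the resolver changes; clients that query `name` are unaffected.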

Going back to our high level architecture diagram, we now have our GraphQL server as the proxy, the middle layer between our clients and our upstream services. Our GraphQL server consumes our models and tells our clients how to query the graph. The resolvers take the request, call the services, and return the meshed result. Some pros of this architecture are that we have all of our code in one place. We have a single server that teams can work against and easily view our API surface. Deployments are easier because we're just dealing with a single server. However, as we start scaling the teams contributing to the graph, it starts to become logistically complex, and iterating on our monolith starts to create tight coupling between our models. We also have to deal with a single point of failure, which isn't ideal.

Apollo

Oliver: Enter Apollo, which is a third party company that extends the GraphQL spec created by Facebook. Specifically, it's around federation. Federation is a topology of services where you have multiple independent services united by a gateway, or a router that is able to take each piece of a subgraph, so each of these services essentially is a single responsibility within your graph, stitches them all together, and provides that super graph to the implementing clients.

Anatomy of Federated GraphQL

Going back to our diagram here, we have our clients who hit our gateway or the router, and that router interfaces with these subservices, which have their own models and their own resolvers. What comes with this is not only the ability to separate out pieces of your graph in a more modular way, but also the ability to do inter-service communication between your services. You now have these subgraph services that are responsible for their own thing, and they query their own data sources. If one subgraph needs information from another subgraph, you now have the ability to essentially query each other's subgraphs to create more complex queries.

The pros behind this are that each service is responsible for its own part of the graph, creating more separation, because as your graph starts to grow and get complex, it can be very hard to reason about within one single server. It's easier for larger teams to contribute because they're owning pieces of these graphs. Subgraphs usually map one to one to teams, or potentially multiple teams could service multiple pieces of the graph. You're now moving responsibility away from potentially a single GraphQL team within your organization, and each underlying team is now maintaining their own piece of the graph. These can also be independently versioned and deployed irrespective of other pieces, as long as you're not breaking the graph. We're going to go into mitigations against that. Teams can iterate as fast as they want to, and ship as frequently as they want to, to get things out, irrespective of what the other services are doing.

Some cons against this are that the CI/CD is much more complex. In terms of checking that the subgraph that you're pushing is not going to break other pieces, there are utilities available from Apollo in order to manage that, but it is a concern. We could potentially break the graph. Also, the connections between our subgraphs are not super clear. We're going to get into the semantics behind how this inter-service communication works. There are ways to view the complete super graph, but it is somewhat opaque in terms of which sub-service owns which pieces of the graph when you're looking at the full graph itself. You have to dive into things or have people who are knowledgeable about how the entire graph works in order to grok everything that's going on.

MLB's Federated Graph

Medvedeva: This is the representation of our federated graph. We still have more teams to onboard and bring their backend data into our federated graph. You can already see how many services are available for consumers to use with only one single entry point, the gateway.

Inter-service Communication

Oliver: Now we're going to get into a little bit more of the nuts and bolts of how a lot of these ideas work within federation.

Medvedeva: One of the most important aspects of Federated GraphQL is the ability for subgraph services to communicate with each other with a seamless developer experience, where the developer doesn't need to know where other models live. They just need to extend already established models and add their desired properties to them. Over the past six months, MLB has started to implement more personalized content within our Web Platform, and to do so we want to attach personalized information to our user for consumption. Here we have a User type that has an ID of String, and a favorite team of Team. However, the Team model doesn't live in the user service, but in the baseball service. When the client requests the user's favorite team, what is actually happening is that our query enters the gateway and resolves the user. The gateway sees that the favorite team has been requested, of type Team. After we get the favorite team ID from our user service, the gateway automatically sends that favorite team ID to the baseball service to resolve the entire team for consumption. In the center of the slide are some semantics around how to supply the connections between the services in Apollo's federation: returning a type that exists in another service, and that type defining a reference resolver, allows this cross-communication to happen.
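The hand-off between the two services can be simulated without any GraphQL library. In this sketch the user service returns only a reference to a Team (a type name plus an id), and the gateway passes that reference to the baseball service, which plays the role of the Team type's reference resolver. All function names, the `USERS`/`TEAMS` data, and the user id `u1` are hypothetical; a real gateway plans and executes these steps itself.

```python
# User subgraph data: knows the user and only the id of the favorite team.
USERS = {"u1": {"id": "u1", "favoriteTeamId": 147}}
# Baseball subgraph data: owns the full Team model.
TEAMS = {147: {"id": 147, "name": "New York Yankees"}}


def user_service_resolve_user(user_id: str) -> dict:
    user = USERS[user_id]
    # Return a reference to a Team entity rather than the team itself.
    return {
        "id": user["id"],
        "favoriteTeam": {"__typename": "Team", "id": user["favoriteTeamId"]},
    }


def baseball_service_resolve_reference(reference: dict) -> dict:
    """Turns a Team reference into the full Team owned by this service."""
    return TEAMS[reference["id"]]


def gateway_resolve_favorite_team(user_id: str) -> dict:
    # Step 1: resolve the user in the user service.
    user = user_service_resolve_user(user_id)
    # Step 2: hand the Team reference to the service that owns Team.
    reference = user["favoriteTeam"]
    if reference["__typename"] == "Team":
        user["favoriteTeam"] = baseball_service_resolve_reference(reference)
    return {"data": {"user": user}}


print(gateway_resolve_favorite_team("u1"))
```

The user service never needs to know how a team is stored or fetched; it only needs to know the team's key.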

Federated Caching (Whole)

Oliver: Again, when we're talking about redundancy and the safety of our graph and making it as efficient as possible, we want to talk about how different caching strategies work within a federated architecture. Looking at this query we have here, it's somewhat more complex, and under the hood, a lot of things are going on. On the whole, we want to get slashing, so mlb.com slashing. We want to get the title of the page. We want to get the components that make up that page. Specifically in this query, we're looking for videos. On those videos, we want to find the team name of the teams involved in that video. What happens is we go into the gateway and we send this query in. We're going to make calls out to each of our services that can supply that information, which includes our CMS, the DAPI, which is like the actual content itself, and the baseball service or StatsAPI, so our baseball stats data.

We're going to make those calls out to our external services. Each of those external services is going to respond with a Cache-Control header. We make the calls and we get a max-age of 10 from the CMS, 30 from the DAPI, and 3600 from StatsAPI, because that stuff isn't changing all that often. How do we end up caching this stuff? We're able to cache the entire query result at the gateway at the lowest max-age from all of the services that were called to make up the response for this query. What the gateway does is coalesce all those down, find the lowest one, and store the result in, in our example here, Redis, for 10 seconds. The process continues. We're caching it as close as possible to the request coming in, at the minimum max-age that was supplied.
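The coalescing step is just a minimum over the upstream max-age values. A minimal sketch, assuming each upstream answers with a `Cache-Control: max-age=N` header; the service keys and helper names are made up for the example, and storing the result in Redis is left out.

```python
import re

# Cache-Control values from the example: CMS 10, DAPI 30, StatsAPI 3600.
UPSTREAM_HEADERS = {
    "cms": "max-age=10",
    "dapi": "max-age=30",
    "statsapi": "max-age=3600",
}


def parse_max_age(cache_control: str) -> int:
    """Extract the max-age value, in seconds, from a Cache-Control header."""
    match = re.search(r"max-age=(\d+)", cache_control)
    if match is None:
        raise ValueError("no max-age in header: " + cache_control)
    return int(match.group(1))


def whole_query_ttl(headers_by_service: dict) -> int:
    """The whole response may only live as long as its shortest-lived part."""
    return min(parse_max_age(h) for h in headers_by_service.values())


print(whole_query_ttl(UPSTREAM_HEADERS))  # 10
```

Using the minimum guarantees that no part of the cached response outlives what its own upstream allowed.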

Federated Caching (Partial)

As we know, some services and their responses are not cacheable, whether it's dynamic data or sensitive data. What about partial federated caching? We still want our queries to respond quickly, but there are obviously times when a response is uncacheable. Let's talk about our favorite team example here. We want to get the favorite team for a given user ID, and that query goes into the gateway. We're going to go to each of our services to pull that data, which is the user and the baseball service. They're going to go out and call our profile service and the StatsAPI. The StatsAPI, as in our previous example, responds with a max-age of 3600. However, user data is not going to be cacheable because it's dynamic or it changes a lot, or whatever. What are we supposed to do in this instance? We can't cache it at the gateway, because one piece of the query cannot be cached, and so the whole query cannot be cached. However, we can cache the sub-query piece that involves the StatsAPI at the baseball service level, so that's returned quicker. Then we'll wait for the user service to eventually resolve. It's a way of still maintaining some speed in your graph resolution while making sure that you're not caching things that shouldn't be cached.
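A small sketch of the partial case, under the assumption that an uncacheable upstream is represented by `None` instead of a max-age. The gateway then refuses to cache the whole query, while the baseball service keeps its own cache of StatsAPI responses. The class and function names are hypothetical, and TTL expiry is deliberately not modeled.

```python
from typing import Dict, Optional


def gateway_cache_ttl(max_ages: Dict[str, Optional[int]]) -> Optional[int]:
    """Return the TTL for the whole query, or None if any part is uncacheable."""
    ages = [age for age in max_ages.values() if age is not None]
    if len(ages) < len(max_ages):
        return None  # one uncacheable piece makes the whole query uncacheable
    return min(ages)


class BaseballService:
    """Subgraph that caches StatsAPI responses locally (expiry not modeled)."""

    def __init__(self) -> None:
        self.cache: Dict[int, dict] = {}
        self.upstream_calls = 0

    def _call_stats_api(self, team_id: int) -> dict:
        # Hypothetical upstream call; counts how often we really go upstream.
        self.upstream_calls += 1
        return {"id": team_id, "name": "New York Yankees"}

    def team(self, team_id: int) -> dict:
        if team_id not in self.cache:
            self.cache[team_id] = self._call_stats_api(team_id)
        return self.cache[team_id]


service = BaseballService()
service.team(147)
service.team(147)
print(gateway_cache_ttl({"profile": None, "statsapi": 3600}), service.upstream_calls)
```

The second `team(147)` call is served from the subgraph's cache, so only the user lookup still has to go upstream on every request.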

Upstream Circuit Breaking

When we start to talk about redundancy, what happens if something happens to our upstream services? We still want to have integrity in our graph. We still want to return data back to our clients. Let's talk about some circuit breaking here. Again, same example, we want to get the favorite team for a given user. A query goes in, we make a successful request to our profile service, but something's happening with StatsAPI: it's taking too long to respond, it's throwing an error, things like that. What we do depends on the business rules around what happens in these failure modes. For our StatsAPI, we're going to say we're going to cache every successful response that we get from the StatsAPI for some long period of time. We're saying that if we need to serve stale data, we can. If there's an issue, we have that data already cached in a long-term cache, so we're going to be able to pull that out of cache and still successfully return a result for this query, while we wait for StatsAPI to become healthy.
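The fallback described here can be sketched as "remember every successful response, and serve the remembered one when the upstream fails". Note that this is only the stale-fallback half of the idea: a full circuit breaker would also track failures and stop calling the upstream for a while, which this sketch does not do. The class name and the injected `fetch` callable are assumptions for illustration.

```python
from typing import Callable, Dict


class BaseballService:
    """Wraps an upstream fetch with a long-term cache used only on failure."""

    def __init__(self, fetch: Callable[[int], dict]) -> None:
        self.fetch = fetch  # hypothetical call to StatsAPI
        self.long_term_cache: Dict[int, dict] = {}

    def team(self, team_id: int) -> dict:
        try:
            team = self.fetch(team_id)
        except Exception:
            # Upstream is slow or erroring: serve stale data if we have it.
            if team_id in self.long_term_cache:
                return self.long_term_cache[team_id]
            raise  # nothing cached, so the failure has to surface
        # Every successful response refreshes the long-term cache.
        self.long_term_cache[team_id] = team
        return team


def healthy_stats_api(team_id: int) -> dict:
    return {"id": team_id, "name": "New York Yankees"}


def failing_stats_api(team_id: int) -> dict:
    raise TimeoutError("StatsAPI took too long to respond")


service = BaseballService(healthy_stats_api)
service.team(147)                    # succeeds and populates the cache
service.fetch = failing_stats_api    # simulate the upstream going unhealthy
print(service.team(147))             # still answered, from the long-term cache
```

Whether serving stale data is acceptable is a business decision per upstream, as the talk notes; user data, for instance, would probably not be handled this way.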

Automatic Persisted Queries (APQs)

Medvedeva: The last optimization we are going to cover is automatic persisted queries, or APQs. APQs solve problems around payload size, since some queries can be large over the wire, as well as caching at the edge. Because GraphQL by default operates using POST, we are unable to cache our response at the edge. What APQs afford us is the ability to hash and cache the query body itself while using GET requests to cache at the edge for bigger savings. Here's our typical architecture diagram: we have our CDN, and our graph is making external calls to our services. Everything works the same, however, a series of events happens to start caching at the CDN. First, we send a GET request with a hashed query value. If the server hasn't seen this hash before, or the cache has expired, we send our traditional POST request with the full query and the hash value to be cached. On subsequent calls, successful GET requests are then cached at the CDN, using the previous cache semantics for query caching.
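The handshake can be simulated in memory. The sketch below uses a SHA-256 hash of the query text and a `PersistedQueryNotFound` error message, which is how Apollo's APQ protocol is commonly described; the `FakeGraphServer` class, its methods, and the canned result are assumptions for illustration, and no real HTTP or CDN is involved.

```python
import hashlib


def sha256_of(query: str) -> str:
    """Hash the query text; the hash is what travels in the GET request."""
    return hashlib.sha256(query.encode("utf-8")).hexdigest()


class FakeGraphServer:
    """In-memory stand-in for a GraphQL server that supports APQs."""

    def __init__(self) -> None:
        self.persisted = {}

    def handle_get(self, sha: str) -> dict:
        query = self.persisted.get(sha)
        if query is None:
            return {"errors": [{"message": "PersistedQueryNotFound"}]}
        return self._execute(query)

    def handle_post(self, query: str, sha: str) -> dict:
        if sha256_of(query) != sha:
            return {"errors": [{"message": "hash does not match query"}]}
        self.persisted[sha] = query  # register the query for future GETs
        return self._execute(query)

    def _execute(self, query: str) -> dict:
        # Canned result; a real server would run the query.
        return {"data": {"team": {"name": "New York Yankees"}}}


def fetch_with_apq(server: FakeGraphServer, query: str):
    """Try the small GET first; fall back to a full POST on a miss."""
    sha = sha256_of(query)
    methods = ["GET"]
    response = server.handle_get(sha)
    errors = response.get("errors", [])
    if any(e["message"] == "PersistedQueryNotFound" for e in errors):
        methods.append("POST")
        response = server.handle_post(query, sha)
    return response, methods


server = FakeGraphServer()
query = "{ team(id: 147) { name } }"
print(fetch_with_apq(server, query)[1])  # first call: GET, then POST
print(fetch_with_apq(server, query)[1])  # later calls: GET only
```

Once the hash is registered, every later request is a short GET keyed by the hash, which is the shape of request a CDN can cache.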

Insight, Tracing, and Metrics

Oliver: One of the things that you can really start to extract and use to enhance the graph at your organization is a lot of metrics around how your queries are doing: the health of them, the speed, the latency, and a bunch of other metrics. Some of these are from Apollo specifically. Apollo has a cloud product, which these first two images are part of, where we can see what the requests per minute are for all of the different types of queries that we have. What's the latency? What are the errors? They have built-in tracing so we can start to dive into, as queries are executed and they're being federated out, where the slowness is happening. What services are our slowest services? How do we make improvements to those services? Are there specific parameters within the queries that are throwing errors, things like that? Then to complement that, we have a whole Prometheus, Grafana, more infra monitoring setup, so we can see at a granular level, for each service, what the health of those is, in more detail.

Lessons Learned

Medvedeva: Lessons learned so far: we definitely have some wins. We went into production in May. We have a subset of total services. We're doing 60 requests per second at peak time. We decreased the number of calls between the services. No more stitching within clients, and payloads decreased. Teams leveraging federation no longer need to worry about data access. We also increased visibility upstream with tracing. Centralized data access reduced ambiguity and eased discovery.

Oliver: Of course, with everything, there are going to be caveats, and there are going to be challenges when implementing an architecture that has this scope within your organization. The first is the upfront cost within your organization to learn the GraphQL syntax and grammar and the particularities around federation. A lot of that is socialization within your organization about how these things work. There are a lot of different models by which not just Apollo themselves but the community recommends you set up teams to do this, but it's a challenge. Again, federation is not supposed to do service discovery, but figuring out which parts of the graph are owned by which services can be challenging without either documentation or, again, people knowledgeable about the entire graph.

With any, not only product, but just architecture, there has to be organizational buy-in, so challenges around that are the ease of use of onboarding new teams, scaffolding out projects so that they can get running really quickly. Things of that nature. Governance requires a lot of oversight. When I talk about governance, that could be of the architecture itself. That could be of the taxonomy of your graph. That could be, who owns which services. There's a lot that goes into really managing this layer and how it works in your broader organization. That takes time. Every organization is different, and they're going to come up with different patterns or different teams, or however to execute this. It really is an iterative process. Speaking for us specifically, we've gone through a couple of these as we've started scaling out this within our organization, and we're still working out what is going to work best for us and what's going to scale ultimately.

Resources

Medvedeva: There are some resources if you would like to go ahead and look closer at GraphQL. There is a GraphQL doc to give you more details about anatomy. There's also Apollo Docs if you're interested in Apollo version of GraphQL, or Apollo Federation. Also, please check out Apollo GraphQL summits, where Matt will be a part of the round table.

Questions and Answers

Breck: You highlighted the importance of graph governance in making this happen, is that just a social problem, organizational problem, or is that also a technical problem? Are there tools or techniques or suggestions you have around graph governance?

Oliver: A lot of stuff that we talked about is very technical, but half the battle also is the organization, the people, getting people excited about it, onboarding, and the logistics around operating these large systems. I think in any organization, when you have an architecture that's going to span a decent part of your organization, there's a lot of stuff that comes with that, a lot of coordination. It is a very human problem. I think as we've started going through this process, it's been very iterative, where we started with a small team. We started opening up to the larger organization. We started getting feedback. We started then playing around with different governance structures, whether that's a small governance team, whether that's an in-house graph platform team, or what have you. We're still trying to figure out what's going to work the best for us. There is just a lot of coordination and socialization of what the graph is supposed to do, the expectations of it. Then within your tech organization, how you're going to farm out and validate the changes. Whether it's technical changes, whether it's your taxonomy of your graph, there's a lot that goes into having everybody on the same page.

I think it's different from the normal, standard setup where everyone has their own REST service, or they have gRPC, or whatever, where the teams are siloed. Then if you need something, you go to that team. This is much more broad, where there has to be constant communication between all the teams, because at the end of the day, it's one graph. Even though within federation, like what I alluded to, each team manages their subgraph or subservice, it eventually is coalesced into one super graph. You're juggling a lot of things. I think we're moving toward having a centralized team that is the traffic cop of all this stuff, and just making sure that there's a high-level set of principles or policies that are being followed. I think at the end of the day, you need to relinquish control in order to keep the gears turning, because if there's too much process, it's going to really slow things down.

Medvedeva: You can see we're really passionate about governance. It's been a process.

Breck: It seems like a problem someone could overlook in trying to do this and then not have it be successful.

Oliver: How GraphQL started at MLB is we had the editorial team, who does a lot of the frontend work, and they ran with their own single, regular GraphQL server. Just to make the division between these two things really clear: GraphQL is the Facebook spec, and then Apollo, a third party, extended the spec to include federation and a couple of other things. They're separate, however, going toward a similar goal. The first group experimented with just GraphQL. Then we started taking that paradigm, and saw what Apollo was doing, and we're like, we can extend this out throughout the entire organization. It really was a bottom-up effort. I think it's different when, within your organization, you have a CTO or a director of engineering who's saying, we are going to become a graph-first company, and it comes from the top down, there's an edict. It's different when it's more of a grassroots-powered thing. It takes a lot more convincing, a lot more realizing the value prop of what it could be. Because it's going to be a multiyear process, two to five years, probably, depending on how big your infra is across the entire tech organization, to convert it over to where you see a lot of the benefits of adopting something like this.

Breck: Someone is asking about alternatives to Apollo. I'd be interested to hear, someone who doesn't want to use Apollo, are there other alternatives. Maybe also address in an environment where it's not just JSON over HTTP, where maybe some services have moved to gRPC. What are the alternatives there? Maybe in a completely gRPC world where someone wants to do federation, do you have any advice there?

Oliver: In terms of alternatives, there are a bunch of reference implementations of GraphQL amongst the languages themselves. Apollo is leading the JavaScript reference implementation. You have things like Sangria, which is in Scala, and across all the languages, they're not run by a company. They're just grassroots libraries. Apollo is driving the federation spec piece, and so you would have lock-in there. In terms of using GraphQL generally, you're free to do what you will.

In a lot of our work, obviously, our upstream services, our internal services, are all just RESTful services, really. We don't have a ton of gRPC within our organization. Within the resolvers, however you're getting your information, it's very open ended in terms of how you want to resolve your queries and where you're getting that data from, whether it's directly from a database, whether it's through gRPC, whether it's through a REST service. GraphQL is pretty agnostic in terms of how you fetch the data to resolve your query. In that sense, you could pull from any data source by whatever means are available to you in order to leverage GraphQL. Think about GraphQL as just a glorified proxy. However, you get a bunch of the query semantics and everything on top of it to make working with your data a lot easier. Then if you leverage federation, you get a lot of the fun interconnected bits so that sub-services within your organization don't need to go out redundantly to other pieces to get the data. You can use the graph in order to leverage your infrastructure as a whole.

Breck: It seems there are probably a lot of wins from doing caching, circuit breaking, these kinds of things centrally. Do you do rate limiting through this central service? Then maybe some of the drawbacks of that, too. Someone's asking about Conway's Law and having one monolithic API. It seems like there are a lot of wins from having this centrally, but also a tension. Maybe even contrast too, like what it would look like trying to use maybe a service mesh approach or something like that, that isn't as coordinated and opinionated, maybe.

Oliver: In terms of rate limiting, we don't have any rate limits yet. We're not anywhere near 100% in terms of converting our entire infra over. We're doing this very piecemeal. We have just a fraction of our traffic, or you can think about it as features that are being served out of the graph currently, but slowly migrating things over, so eventually we'll approach rate limiting. I think that would be divorced from the graph architecture as well. We'll use it at the CDN or whatever, and do that.

In terms of drawbacks, as with the first question that we were talking about, there's a lot that goes into organizing your organization around a common goal. To fully realize the power of the graph, you want to have a ton of your teams, if not all of them, maintain sub-services feeding up into your super graph and being able to be deployed out. That is a very noble goal. I think that for a lot of us who have been architects or seniors architecting out systems, it is pie in the sky to think that you're going to get everybody on board to do the same thing. Even when you do, at that scale, logistically, it's a burden to maintain everything, all the moving parts. I think it would be simpler to go a service mesh-esque route, however, you're not going to get a lot of the things that you're going to get with Apollo, and strictly federation, for free, like the inter-service communication bits. You get tracing out of the box for free, too. As your query resolves, and it goes through all these different subgraphs, you can see the time it takes, what data it resolved and everything, and really dive in to see what upstreams are taking forever and work with those teams and try to optimize that. I think it's really pros and cons. It really depends on where your organization is already and where you see it going forward, and what your specific problems are.

Breck: Do you have to interact much with the federated query planner, or is that just basically resolving based on the model and you don't have to do much?

Oliver: We haven't touched it at all. Granted, we haven't gotten super complex in our queries. It's pretty smart to know obviously what to resolve, when and how. Apollo is going to start opening up more utilities in order to give hints to the query planner on like, I want you to resolve this first, or this or that. No, for the most part, we haven't had to touch it at all.

Medvedeva: It's also a great tool for you to actually see how it's been resolved in a way you can optimize where your code is lagging actually, because the query is based on the code on your resolvers. It's a good picture of where your code is, and can you optimize it actually starting with your code? Then let GraphQL do the rest, the optimizations.

Breck: That includes like call paths as well as timing.

Medvedeva: Yes. Apollo Studio shows you actually how long it takes.

 


 

Recorded at:

Mar 24, 2022
