
Rebuilding Twitter’s Public API


Summary

Steve Cosenza discusses why Twitter's multi-tenant API platform was built with Scala and GraphQL, and how Twitter uses SLOs for monitoring and alerting in production.

Bio

Steve Cosenza is a Senior Staff Software Engineer at Twitter, where for the last 8 years he's built HTTP and stateful streaming systems, as well as the underlying open source Finatra framework that powers them. Steve has a BS in Computer Science from Cornell University and an MS in Information Systems from Johns Hopkins Carey Business School.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.

Transcript

Cosenza: How many of you have decomposed a monolith into microservices? How many of you have created a multi-tenant microservice to eliminate the need for other microservices? Over the years, Twitter's learned a lot in this space. We now have a new API platform that we hope will scale well into the future. Today I'm excited to run through a brief history of the Twitter APIs, followed by a discussion of the goals and design of our new public API platform. Then we'll finish with a real end-to-end example of how Twitter engineers build API endpoints using the OpenAPI spec, GraphQL, and Scala.

How We Got Here

First, let's start with some history. Our journey starts 14 years ago in 2006, when Twitter launched our first public HTTP API, served from a single Ruby on Rails monolith that later became known as the Monorail. By 2010, we had one of the largest Ruby on Rails systems in the world. Given its growing and complex single codebase, it was becoming increasingly difficult to make changes, and more challenging to parallelize work among our engineering teams. It was time for a migration. In 2012, we completed migrating our public API off of the Rails Monorail and onto numerous Thrift JVM microservices, fronted by a single HTTP JVM microservice named Woodstar. Woodstar was instrumental in getting almost all Twitter traffic onto the JVM. In time, it became its own smaller monolith, which was becoming increasingly difficult to change and operate. The drums started beating once again: it was time for another migration. In 2014, we completed the second migration, from a single Woodstar API service to a set of 14 HTTP microservices running on an internal JVM framework named Macaw. While the microservices approach enabled increased development speed at first, it also resulted in a scattered and disjointed Twitter API, as independent teams designed and built endpoints for their specific use cases with little coordination.

Fast forward to 2017, when the Twitter app started using some new internal-only REST and GraphQL APIs. Having internal-only REST and GraphQL APIs helped our Twitter app teams move fast and iterate quickly, without having to rely exclusively on the public API, which at this point was time-consuming to change. In early 2019, we started to plan for the next major version of the Twitter public API. We knew we needed a new architecture to address the public API's slow iteration speeds and continued fragmentation. That brings us to this year, when our new architecture started powering the public API platform, which now hosts our first API v2 endpoints.

Platform Goals

With that brief history behind us, let's now look at the goals and architecture behind our new public API platform. We knew we needed a new architecture that could more easily scale with the large number of API endpoints needed to serve current and future functionality. As part of this design process, we drafted the following goals. Our abstraction goal is to enable Twitter engineers building the Twitter API to focus on querying and mutating only the data they care about, without needing to worry about the infrastructure and operations of running a production HTTP service. Our ownership goal seeks to contain core and common API logic in a single place owned by a single API platform team, while also allowing non-common API components to be authored and owned by different teams. Our consistency goal is to provide a consistent experience for external developers by relying on our API design principles to reinforce uniformity. With these goals in mind, we've built a common platform to host all of our new Twitter API endpoints. To operate this multi-tenant platform at scale, we had to minimize any endpoint-specific business logic; otherwise, the system would quickly become yet another unmaintainable monolith. A powerful data access layer that emphasized declarative queries over imperative code was crucial to this strategy.

Platform Unified Data Access Layer

In 2017, our Twitter applications began using an internal-only GraphQL API. This momentum built throughout 2019, when migrations to GraphQL started to happen across the company. It was perfect timing. Our team followed suit as we realized that the data query needs of the public Twitter API are similar to those of our Twitter mobile and desktop clients. Put another way, Twitter clients query for data and render UIs, while the public Twitter APIs query for data and render JSON. A bonus from consolidating our data querying through a single interface is that the public Twitter API can now easily deliver new Twitter features by querying for GraphQL data already being used directly by our consumer apps. With the GraphQL-based platform approach decided, we needed a multi-tenant way for different teams to build and contribute to the overall Twitter API.

Platform Components

To facilitate this, we designed several pluggable platform components. We'll focus on resource fields and selections. Resource field components are used to create the core resources in our systems, for example, tweet resources and user resources. Selection components are used to define how to find resources, for example, tweet lookup by ID selection or tweet lookup by search selection. Using these components, teams can independently own and contribute different parts of the overall Twitter API, while still returning uniform representations in responses.

Platform Business Logic

At this point in the story, you may be curious where endpoint-specific business logic actually lives. We offer two options here. When an endpoint's business logic can be represented in StratoQL, the language used by Twitter's internal data catalog system, then we only need to write a function in a Strato-hosted managed column, without requiring a separate service. Otherwise, the business logic is contained in a Finatra Thrift microservice written in Scala and exposed by a Thrift Strato column. In either case, since the platform provides the common HTTP needs for API endpoints, new APIs can be released without spinning up a new HTTP service. If an endpoint can be constructed by querying for already existing GraphQL data, or if an endpoint's business logic can be implemented in StratoQL, then we can bypass almost all service-owning responsibilities for that API endpoint.

Example

Now that I've discussed the history of the Twitter APIs and the high-level goals and design of our new public API platform, let's walk through an end-to-end example that shows how to create two new Twitter API endpoints. The requirements from our product manager are as follows. We must handle an HTTP GET request to /2/tweets that contains a single tweet ID in the path. If the request is ill-formed, return a 400 Bad Request. Otherwise, return an HTTP 200 JSON response containing a single tweet with its default ID and text fields. We must also handle an HTTP GET request to /2/tweets/search/recent that accepts query, max_results, and next_token query params. Again, if the request is ill-formed, return a 400 Bad Request; otherwise, return an HTTP 200 JSON response containing a list of tweets and the next_token so that customers can paginate through the results. Similar to tweet by ID, we need to return the same standard tweet format that's used everywhere else in the Twitter API.
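
To make the two contracts concrete, here's a sketch of what the request and response pairs might look like; the tweet text, token, and query values are illustrative, not from the talk.

```
GET /2/tweets/20
200 OK
{"data": {"id": "20", "text": "just setting up my twttr"}}

GET /2/tweets/search/recent?query=twttr&max_results=10
200 OK
{
  "data": [
    {"id": "20", "text": "just setting up my twttr"}
  ],
  "meta": {"next_token": "b26v89c19zqg8o3f"}
}
```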

Tweet ID Field

Our requirements will need us to create a tweet resource that has an ID and a text field, both always returned by default. On-screen, you can see where both of these resource fields would be defined in our common config directory. First, let's define the tweet ID field. At the top of the screen you can see the contents of the fragment.graphql file. This fragment is used to retrieve all of the tweet data needed to render just this field. In this case, we query for the ID string field in the GraphQL schema. At the bottom of the screen, you can see the contents of the Field.scala file, which is used to configure various aspects of the field. The owningTeam is used to determine where we send automated pages when this field is experiencing problems. The alwaysInclude field is set to true, which makes this a default field. We then specify that this field should be returned in all major and minor versions of our API. Finally, we use a special path-based renderer, which lets us declaratively specify how to render the field without the need for imperative code. When imperative code is needed for rendering or error handling, it lives in this Scala file. The platform has a goal to limit the need for imperative code wherever possible.
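
The slides aren't reproduced in this transcript, but the two files plausibly look something like the following sketches; all identifiers (the schema field id_str, the Field DSL, the team and renderer names) are illustrative assumptions, not Twitter's actual code.

```graphql
# fragment.graphql (sketch): fetch just the data needed to render tweet.id
fragment TweetId on Tweet {
  id_str
}
```

```scala
// Field.scala (sketch): a hypothetical stand-in for the platform's internal DSL
case class Team(name: String)
case class PathRenderer(graphqlPath: String)
case class Field(
  owningTeam: Team,        // where automated pages are sent when the field fails
  alwaysInclude: Boolean,  // true makes this a default field
  versions: Seq[String],   // API versions the field is exposed in
  renderer: PathRenderer   // declarative, path-based rendering; no imperative code
)

val tweetIdField = Field(
  owningTeam = Team("api-platform"),
  alwaysInclude = true,
  versions = Seq("2"),
  renderer = PathRenderer("id_str")
)
```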

Tweet Text Field

Next, let's look at how we define the tweet.text field. Since the GraphQL and Scala files will look almost identical to the tweet.id field's, let's focus on the two additional files that are present. The first file, at the top, is a project file, which allows different directories in Twitter's monorepo to be owned by different teams. In this case, we include a project file indicating that the tweet text field is owned by the api-vnext-developers group in Phabricator, the system where Twitter performs its code reviews. By specifying this project-level ownership, Phabricator will then ensure that all changes to this directory have a "ship it" from at least one member of the owning team. The second file, at the bottom, is a slos.json file, where the field owner can define a service level objective for retrieving and rendering this individual field. For this example, we'll specify that this field should have a 30-day rolling success rate objective of 99%. This file is optional; if it's not present, a default API-wide SLO will be used.
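
A minimal sketch of what such an slos.json might contain, assuming a simple success-rate schema; the key names here are guesses.

```json
{
  "success_rate_objective": 0.99,
  "rolling_window_days": 30
}
```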

Compiled Platform Tweet Fragment

At this point, we've defined two fields on a tweet. At the top left, you can see the GraphQL fragment for retrieving the ID string field. Below it, you can see the GraphQL fragment for returning the full_text field. Part of the platform's build process is to read through the entire config directory and generate platform fragments that can be used whenever a tweet needs to be rendered. We can now issue a single GraphQL query with the platform tweet fragment shown on the right, instead of needing to issue two separate GraphQL queries containing the fragments on the left. We can now see the key role that GraphQL plays in our component-based architecture, as we utilize GraphQL fragments as our unit of rendering reads. If you're familiar with React, our setup is similar to how Relay combines individual UI fragments into larger fragments, depending on which UI components are included in a composite UI page.
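
Conceptually, the build step merges the per-field fragments into one platform fragment, roughly like this (field names as in the sketches above):

```graphql
# Generated by the platform build from the per-field fragments (sketch)
fragment PlatformTweet on Tweet {
  id_str      # from the tweet.id field's fragment
  full_text   # from the tweet.text field's fragment
}
```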

Selection: Single Tweet

Next, let's look at how we configure selections. Selections are how resources are found in our APIs. For this example, we'll implement two selections: tweet by ID, and tweet by search. First, let's define the selection for looking up a tweet by its ID. On the left of your screen, you can see the openapi.yaml file that defines where the selection will be exposed in our Twitter API. The version prefix variable is a placeholder that the platform fills in with whatever major version of our API is currently enabled. The ID variable is a path parameter, a required string that must conform to the specified regex. On the top right, you can see the contents of the selection.graphql file, which specifies a GraphQL query for retrieving tweets. The required string param in the GraphQL query corresponds to the required path parameter defined in the openapi.yaml file. We then call the tweet-by-rest_id field in Twitter's GraphQL schema, which returns a tweet type, and specify that we wish a platform tweet to be returned. Recall that the platform tweet is composed of all the data needed to render a tweet, depending on the customer's request.
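
Here's a sketch of what the two files might look like; the path template, the ID regex, and the schema field names are illustrative assumptions.

```yaml
# openapi.yaml (sketch): where the selection is exposed in the API
paths:
  /{version_prefix}/tweets/{id}:
    get:
      parameters:
        - name: id
          in: path
          required: true
          schema:
            type: string
            pattern: '^[0-9]{1,19}$'  # illustrative tweet-ID regex
```

```graphql
# selection.graphql (sketch): the query issued for this selection
query TweetByID($id: String!) {
  tweet_by_rest_id(rest_id: $id) {
    ...PlatformTweet
  }
}
```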

Finally, we have the selection.scala file, which is similar to route.scala. Putting all the pieces together, the top row on-screen shows an HTTP request for /2/tweets/20 coming into the API platform, resulting in a GraphQL query named TweetByID sent with a GraphQL variable that sets the ID to the 20 found in the incoming HTTP request. Each resource field then reads the GraphQL response and renders its field value, resulting in the rendered tweet seen at the bottom.
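
End to end, the flow might be sketched like this; tweet 20 really is the first tweet, but the exact response shape is illustrative.

```
GET /2/tweets/20
  -> GraphQL query TweetByID, variables: {"id": "20"}
  <- each resource field renders its value from the GraphQL response
200 OK
{"data": {"id": "20", "text": "just setting up my twttr"}}
```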

Selection: Search Tweet

Next up, we need to create a selection for retrieving tweets by search. Similar to tweet lookup by ID, we define an openapi.yaml file, this time containing additional typed and schema-validated query params. Selection.graphql is where we search for tweets using a search query. Note that wherever the platform tweet fragment is placed in the query is where we expect tweets to be returned. Our GraphQL schema is based on our Strato data catalog system. In this example, when we query for the matched tweets GraphQL field, it's actually implemented by a Thrift service, in this case named ID Hunter, which takes a search query string, a number of results, and a token as input, and returns matched tweet IDs and a cursor as output. Finally, we have a selection.scala file, which has an optional failWholeRequest method defined to produce more detailed HTTP 400 responses based on what our Thrift service returns to us. Putting all the pieces together, if an HTTP request for /2/tweets/search/recent comes into the API with a max_results value that doesn't validate against the OpenAPI-defined schema, then no GraphQL request is ever issued; we directly return a 400 response with a clear error message. Now let's look at a well-formed HTTP request. This time we issue a GraphQL request with the GraphQL variables populated from the HTTP request query params. Then at the bottom, we render tweets in the standard expected format with the default ID and text fields.
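
A sketch of what this selection's GraphQL query might look like, with the Thrift-backed field and the pagination inputs; all field and argument names here are guesses.

```graphql
# selection.graphql (sketch) for tweet search
query TweetsBySearchRecent($query: String!, $max_results: Int, $next_token: String) {
  # Backed by the ID Hunter Thrift service, exposed as a Strato column
  search_recent(query: $query, num_results: $max_results, token: $next_token) {
    tweets {
      ...PlatformTweet
    }
    next_token
  }
}
```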

Tweet Public Metrics Field

We've successfully built two new API endpoints capable of selecting a tweet by ID and by a search query. Now a new requirement comes in: our product managers want to start exposing tweet metrics through our API. A separate engineering team will work on this new field, and it will be returned only if customers optionally request it. This new field will be named public metrics. On-screen, you can see how this new field will be configured. As we've seen with the tweet ID and tweet text fields, we first define a GraphQL fragment to retrieve the data we need to render. Then we include a Field.scala file, notably this time setting alwaysInclude to false. With this new field defined, the API platform will now make it available to any API customer that specifies they want this optional field. Since this new field is optional, we'll only want to query for it when an API customer actually requests it.
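
Reusing the hypothetical Field DSL from the earlier sketch, the optional field's configuration would differ mainly in one flag:

```scala
// Field.scala for tweet.public_metrics (sketch, reusing the hypothetical types above)
val publicMetricsField = Field(
  owningTeam = Team("tweet-metrics"),        // the separate team that owns this field
  alwaysInclude = false,                     // optional: only returned when requested
  versions = Seq("2"),
  renderer = PathRenderer("public_metrics")
)
```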

Generated Platform Tweet Fragment

To do this, let's look at the platform tweet fragment that our platform now generates. As you can see on the right, our API platform adds a GraphQL include directive to every part of the query associated with an optional field. In this case, you can see that enabling the public_metrics_field variable will result in these public metric fields being included. Some of you may be wondering why we need to use an include directive instead of just including these optional fields in a query when a customer requests them. The reason is that Twitter's GraphQL system does not allow arbitrary queries to be run in production. Instead, we utilize persisted queries, which require our API platform to submit all queries at build time in exchange for an operation ID that is then used in production.
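
The regenerated platform fragment would then guard the optional parts with an @include directive, roughly like this; the individual metric field names are illustrative.

```graphql
# Generated platform fragment with an optional, guarded section (sketch)
fragment PlatformTweet on Tweet {
  id_str
  full_text
  public_metrics @include(if: $public_metrics_field) {
    retweet_count
    reply_count
    like_count
  }
}
```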

Tweet By ID with Public Metrics

Next, let's see what a customer request looks like for this new optional field. First up is tweet by ID. At the top of the screen you can see an HTTP request asking for our new public_metrics field. Because an optional field is now specified, we set an additional GraphQL variable named public_metrics_field to true, which will enable our GraphQL include directives to include the additional fields we need to query. We can skip over the GraphQL response and jump to the bottom, where we can see the individually requested tweet, now with the new public metrics data included.
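
A sketch of the request and response pair; the tweet.fields query-param style matches the public v2 API, while the metric values are made up.

```
GET /2/tweets/20?tweet.fields=public_metrics
  -> GraphQL variables: {"id": "20", "public_metrics_field": true}
200 OK
{
  "data": {
    "id": "20",
    "text": "just setting up my twttr",
    "public_metrics": {"retweet_count": 118713, "reply_count": 10279, "like_count": 165271}
  }
}
```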

Tweet Search with Public Metrics

Next, let's look at the tweet by search request. At the top of the screen, you can see an HTTP request with the new public_metrics field specified. Once again, the public_metrics_field GraphQL variable is set to true. Then below, we can see the search results with the two default ID and text fields, along with the optionally requested public metrics. These examples hopefully illustrate where our API platform really shines: letting individual developer teams focus on specific API components, in this case adding a new tweet field. The platform then combines all the components together to ensure a consistent overall API.

SLO Burn Rate Alerts

Now that we've satisfied all the product requirements, let's briefly look at how we monitor and alert on our new APIs. All components on our API platform have service level objectives defined. Our platform ensures that each of these components gets automated alerts and monitoring dashboards generated for it. On-screen, you can see what a PagerDuty alert looks like for an API component.

SLO Dashboards

From this alert, our customers can click through to a Grafana SLO dashboard showing the status of their long-term SLO and the various errors scoped down to individual API components, seen across the last one-hour, six-hour, and three-day burn windows, which is what our alerts are based on. First, since each component in our API is owned by a single team, we ensure that critical 24/7 pages are only dispatched for problems that a team controls. For example, if the tweet public_metrics field starts failing, which causes the entire tweet by ID and tweet by search routes to fail, then the field owner will be paged and woken up, but the route owners will never be woken up, as there would be nothing actionable for them to do. Second, for deciding when to alert on issues, we use the multiple burn rates alerting technique from Google's excellent book, "The Site Reliability Workbook." I'd highly recommend checking out this book to learn more about this powerful technique.
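
As a rough Scala sketch of the multiple-burn-rate idea; the window pair and threshold below follow the book's examples, not Twitter's actual configuration.

```scala
// Multi-window burn-rate alerting, per "The Site Reliability Workbook" (sketch).
// Burn rate = observed error rate / error budget implied by the SLO.
def burnRate(errorRate: Double, slo: Double): Double =
  errorRate / (1.0 - slo)

// Page only when both a long and a short window are burning fast, so alerts
// fire on sustained problems but reset quickly once the problem is fixed.
def shouldPage(longWindowErrorRate: Double,
               shortWindowErrorRate: Double,
               slo: Double,
               threshold: Double): Boolean =
  burnRate(longWindowErrorRate, slo) >= threshold &&
    burnRate(shortWindowErrorRate, slo) >= threshold

// Example from the book: with a 99.9% SLO, a burn rate of 14.4 sustained for
// 1 hour (confirmed over the last 5 minutes) consumes 2% of a 30-day budget.
val page = shouldPage(0.02, 0.02, slo = 0.999, threshold = 14.4)
```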

Conclusion

We walked through a brief history of Twitter's APIs, followed by why we built the API platform, and finished with a handful of real-world API examples. For Twitter, this new API platform is just the start of our journey, and our work is far from done. Notably, we have many more existing endpoints to migrate and entirely new public endpoints to build. And yes, if you're keeping track, migrating existing v1.1 endpoints to v2 does mean that Twitter is currently in the midst of our third major API migration, so here's hoping good things really do come in threes.

 

Recorded at: Jun 02, 2021
