Netflix Studio Search: Using Elasticsearch and Apache Flink to Index Federated GraphQL Data

Netflix engineers recently described how they built Studio Search using Apache Kafka streams, an Apache Flink-based Data Mesh process, and an Elasticsearch sink that manages the index. They designed the platform to take a portion of Netflix's federated GraphQL graph and make it searchable. Today, Studio Search powers a significant portion of the user experience for many applications within the organisation.

At Netflix Content Engineering, each team independently builds and operates its own Domain Graph Services (DGS) and, at the same time, connects its domain with other domains in a unified GraphQL schema exposed by a federated gateway. Given this structure, Netflix engineers Alex Hutter, Falguni Jhaveri and Senthil Sayeebaba explain the motivation for Studio Search:

"Once entities [...] are available in the graph, it's very common for folks to want to query for a particular entity based on attributes of related entities, e.g. give me all movies that are currently in photography with Ryan Reynolds as an actor."

An example of linked entities in the graph that users want to search through
Source: https://netflixtechblog.com/how-netflix-content-engineering-makes-a-federated-graph-searchable-5c0c1c7d7eaf
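
To make the example query concrete, here is a minimal sketch of how such a search might look once the related entities have been flattened into a single indexed document. It builds a standard Elasticsearch bool query as a plain Python dictionary. The field names `status` and `actors.name`, the value `PHOTOGRAPHY`, and the function name are illustrative assumptions and do not come from Netflix's schema.

```python
import json


def build_movie_search(status: str, actor_name: str) -> dict:
    """Build an Elasticsearch bool query combining a filter on the movie's
    own attribute with a filter on an attribute of a related entity.

    Assumes (hypothetically) that both fields are mapped as keyword fields
    and that actor data has been denormalised into the movie document.
    """
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"status": status}},            # attribute of the movie itself
                    {"term": {"actors.name": actor_name}},   # attribute of a related entity
                ]
            }
        }
    }


query = build_movie_search("PHOTOGRAPHY", "Ryan Reynolds")
print(json.dumps(query, indent=2))
```

Because the related data lives in the same document, the search engine can answer the question in one query, without any single service having to filter on data it does not own.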

According to the authors, in a federated graph architecture, each service "would need to provide an endpoint that accepts a query and filters that may apply to data the service does not own" and use those to identify the appropriate entities it should return. Worse, "every entity owning service could be required to do this work." This common problem of making a federated graph searchable led to the creation of Studio Search.

Studio Search indexing architecture
Source: https://netflixtechblog.com/how-netflix-content-engineering-makes-a-federated-graph-searchable-5c0c1c7d7eaf

The above diagram illustrates Studio Search's architecture and how it maintains an index for a portion of the federated graph. Application events and Change Data Capture (CDC) events are streamed to schematised Kafka streams. Data Mesh processes in Apache Flink consume these events and enrich the data using user-provided GraphQL queries against the federated gateway. The fetched documents are put onto another schematised Kafka topic and then processed by an Elasticsearch sink in Data Mesh, which indexes them into Elasticsearch.
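
The flow described above can be sketched in a few lines of plain Python. This is an in-memory illustration of the consume, enrich, and index steps only. It does not use Kafka, Flink, or Elasticsearch, and every name in it (`FAKE_GRAPH`, `fetch_document`, `enrich`, `sink`, the event fields) is a hypothetical stand-in and not part of Netflix's implementation.

```python
from typing import Callable, Iterable, Iterator, Optional

# Hypothetical stand-in for the federated gateway: maps an entity id to the
# document a user-provided GraphQL query would return for it.
FAKE_GRAPH = {
    "movie-1": {
        "id": "movie-1",
        "title": "Example Movie",
        "status": "PHOTOGRAPHY",
        "actors": [{"name": "Ryan Reynolds"}],
    }
}


def fetch_document(entity_id: str) -> Optional[dict]:
    """Pretend to run the enrichment query against the gateway."""
    return FAKE_GRAPH.get(entity_id)


def enrich(events: Iterable[dict],
           fetch: Callable[[str], Optional[dict]]) -> Iterator[dict]:
    """Enrichment step: turn each change event into a full document."""
    for event in events:
        document = fetch(event["entity_id"])
        if document is not None:  # skip entities the gateway cannot resolve
            yield document


def sink(documents: Iterable[dict], index: dict) -> None:
    """Sink step: upsert each document into the index, keyed by its id."""
    for document in documents:
        index[document["id"]] = document


# Change events as they might arrive from an application or CDC stream.
change_events = [
    {"entity_id": "movie-1", "op": "UPDATE"},
    {"entity_id": "movie-404", "op": "UPDATE"},  # unknown to the gateway
]

search_index: dict = {}
sink(enrich(change_events, fetch_document), search_index)
print(sorted(search_index))
```

Note that the event carries only an identifier. The document that ends up in the index is always re-fetched from the graph, so the index reflects what the gateway returns at fetch time and not the payload of the event.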

Early on, the team built each integration by hand, but this approach did not scale once requests to integrate with Studio Search started pouring in. As the authors put it: "We needed to build tools to help us automate as much of the provisioning of the pipelines as possible."

To enable automation, the team defined a single YAML configuration file in which users provide a high-level description of their pipeline. The team's tooling then uses this configuration to programmatically create the indexing pipeline in the Data Mesh.

Sample .yaml configuration
Source: https://netflixtechblog.com/how-netflix-content-engineering-makes-a-federated-graph-searchable-5c0c1c7d7eaf

From the GraphQL query template in the configuration file, the team generates an Apache Avro schema required for the schematised Kafka streams and an index template required for Elasticsearch. Finally, self-serve deployment is possible via a Python-based command-line interface (CLI) tool.
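
As a rough illustration of this generation step, the sketch below derives both artefacts from one description of the selected fields. It assumes the query template has already been resolved into a mapping from field name to GraphQL scalar type. That input format, the type mappings, the index pattern, and the function names are all assumptions made for the example, and nested selections are left out.

```python
import json

# Assumed mappings from GraphQL built-in scalars to Avro and Elasticsearch types.
AVRO_TYPES = {"ID": "string", "String": "string", "Int": "int",
              "Float": "double", "Boolean": "boolean"}
ES_TYPES = {"ID": "keyword", "String": "text", "Int": "integer",
            "Float": "double", "Boolean": "boolean"}


def to_avro_schema(record_name: str, fields: dict) -> dict:
    """Build an Avro record schema; every field is a nullable union."""
    return {
        "type": "record",
        "name": record_name,
        "fields": [
            {"name": name, "type": ["null", AVRO_TYPES[gql_type]], "default": None}
            for name, gql_type in fields.items()
        ],
    }


def to_index_template(record_name: str, fields: dict) -> dict:
    """Build an Elasticsearch composable index template body."""
    return {
        "index_patterns": [f"{record_name.lower()}-*"],  # hypothetical naming scheme
        "template": {
            "mappings": {
                "properties": {
                    name: {"type": ES_TYPES[gql_type]}
                    for name, gql_type in fields.items()
                }
            }
        },
    }


# Hypothetical fields selected by a query template for a Movie entity.
movie_fields = {"id": "ID", "title": "String", "releaseYear": "Int"}

avro_schema = to_avro_schema("Movie", movie_fields)
index_template = to_index_template("Movie", movie_fields)
print(json.dumps(avro_schema, indent=2))
print(json.dumps(index_template, indent=2))
```

Deriving both outputs from the same input keeps the Kafka stream schema and the index mapping in step with each other, which is the main benefit of generating them instead of writing them by hand.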

The challenges the team currently faces include bootstrapping a new index without overloading the system, making better use of reverse lookups, and improving index consistency and tolerance of out-of-date or missing data.

About the Author

Eran Stiller
