
Gutenberg – a Publish-Subscribe Service for Datasets Created by Netflix

To propagate datasets from a single producer to multiple consumers, Netflix has created Gutenberg, a publish-subscribe service for distributing versioned datasets between their microservices – a consumer subscribes to a dataset and is updated with the latest version when one becomes available. In a blog post, Ammar Khaku, senior software engineer at Netflix, gives an overview of the design and some use cases for Gutenberg.

In the data model for Gutenberg, the top-level construct is a topic. Publishing to a topic creates a new, monotonically increasing version, where each version contains metadata and a data pointer. Currently Gutenberg supports two types of data pointers: one where the data is encoded in the pointer itself, used when the dataset is smaller than 1MB, and one that points to data stored in AWS S3, used for larger datasets.
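The topic/version/pointer structure described above can be sketched roughly as follows; all names here are illustrative, not Netflix's actual API:

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class DataPointer:
    # Small payloads are embedded directly in the pointer;
    # larger datasets are referenced by an S3 location instead.
    inline_data: Optional[bytes] = None
    s3_bucket: Optional[str] = None
    s3_key: Optional[str] = None

@dataclass
class Version:
    number: int               # monotonically increasing per topic
    metadata: Dict[str, str]
    pointer: DataPointer

class Topic:
    """Top-level construct: each publish creates a new version."""

    INLINE_LIMIT = 1_000_000  # ~1MB threshold from the article

    def __init__(self, name: str):
        self.name = name
        self.versions: List[Version] = []

    def publish(self, data: bytes, metadata: Dict[str, str]) -> Version:
        number = len(self.versions) + 1
        if len(data) < self.INLINE_LIMIT:
            pointer = DataPointer(inline_data=data)
        else:
            # Hypothetical bucket/key layout, purely for illustration.
            pointer = DataPointer(s3_bucket="gutenberg-data",
                                  s3_key=f"{self.name}/{number}")
        version = Version(number=number, metadata=metadata, pointer=pointer)
        self.versions.append(version)
        return version
```

Publishing a small payload yields an inline pointer, while a multi-megabyte payload yields an S3 reference.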

One common use case for Gutenberg is propagating data of varying sizes from a single publisher to multiple consumers. These use cases often deal with configuration data that is kept in memory on the client and used at runtime. Examples include metadata for supported payment methods and A/B test configuration.

Another use case is as a versioned data store, commonly used for machine-learning applications. Teams build and train models based on historical data, run a model for some time to see how it performs, then change some parameters and run it again.

Khaku emphasizes that Gutenberg is not designed to be an eventing system. It is designed for publishing and consuming an entire immutable view of a dataset and is purely meant for data versioning and propagation. Rapidly publishing new data does not mean a client will read all versions. When the client asks for an update, it will only be provided with the latest version of the data, not any previous versions.
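These latest-version-only semantics can be illustrated with a minimal sketch (hypothetical names, not Gutenberg's actual interface):

```python
class TopicStore:
    """Sketch: consumers only ever see the latest published version,
    so rapid publishes between polls are effectively collapsed."""

    def __init__(self):
        self._versions = []  # full history is still kept server-side

    def publish(self, data):
        self._versions.append(data)
        return len(self._versions)  # version number

    def latest(self, after_version=0):
        """Return (version, data) only if something newer exists."""
        if len(self._versions) <= after_version:
            return None
        return len(self._versions), self._versions[-1]

store = TopicStore()
for payload in ("v1", "v2", "v3"):  # three rapid publishes
    store.publish(payload)

# A consumer that last saw version 0 receives only the newest snapshot;
# the intermediate versions are never delivered as events.
version, data = store.latest(after_version=0)
```

This is the key difference from an eventing system: the consumer asks "what is the latest?", not "what happened since?".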

Gutenberg consists of a service providing gRPC and REST APIs, and a Java client library that uses the gRPC API. The server is backed by a globally replicated Cassandra cluster for persistence, and is made up of separate instances for handling consumer requests and publish requests respectively. This allows each side to scale independently, since consumer requests far outnumber publish requests, and prevents the two request types from affecting each other.

In order to handle a large volume of requests from consumers, each instance handling consumer requests keeps an in-memory cache of what was last published, with the cache refreshed every few seconds. To prevent misbehaving applications from disturbing the system, Gutenberg uses an adaptive concurrency limiter to detect and throttle them. When S3 buckets in multiple regions are used, the server optimizes downloads by pointing a client to the bucket in the region closest to it.
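The periodic cache refresh on a consumer-facing instance might look roughly like this sketch; the names are assumptions, and the real system refreshes from Cassandra rather than a callback:

```python
import threading
import time

class LatestPublishCache:
    """Sketch: serve reads from memory and refresh from the backing
    store only after a fixed interval has elapsed (illustrative only)."""

    def __init__(self, fetch_latest, refresh_seconds=5.0):
        self._fetch = fetch_latest  # stand-in for a Cassandra query
        self._refresh_seconds = refresh_seconds
        self._cached = fetch_latest()
        self._fetched_at = time.monotonic()
        self._lock = threading.Lock()

    def get(self):
        with self._lock:
            if time.monotonic() - self._fetched_at >= self._refresh_seconds:
                self._cached = self._fetch()
                self._fetched_at = time.monotonic()
            return self._cached

# Usage: repeated reads within the refresh window hit memory, not storage.
calls = []
def fetch():
    calls.append(1)
    return len(calls)

cache = LatestPublishCache(fetch, refresh_seconds=1000)
first = cache.get()
second = cache.get()  # served from cache; no extra fetch
```

Serving slightly stale data from memory is what makes the read path cheap enough to absorb the consumer request volume.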

Before data is returned to a consumer, the Gutenberg service first runs a consistency check on it. If the check fails, the service instead checks the history for the requested topic and returns the latest data that is consistent. This prevents incomplete data from being returned to the consumer due to replication delays that can occur in the Cassandra layer.
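The fallback behaviour amounts to walking backwards through the version history until a version passes the check; the names and data shapes here are assumptions for illustration:

```python
def latest_consistent(history, is_consistent):
    """Return the newest version in `history` (ordered oldest-to-newest)
    that passes the consistency check, or None if none does."""
    for version in reversed(history):
        if is_consistent(version):
            return version
    return None

# Example: version 3 is incomplete (e.g. its write has not fully
# replicated yet), so the consumer is served version 2 instead.
history = [
    {"version": 1, "complete": True},
    {"version": 2, "complete": True},
    {"version": 3, "complete": False},
]
result = latest_consistent(history, lambda v: v["complete"])
```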

The Gutenberg client library communicates with the Gutenberg service using gRPC, and uses Eureka for service discovery. The client's main responsibilities are subscription management and S3 uploads and downloads. When a user creates a subscription to start a download, the library begins polling for data every 30 seconds and hands new data over to a listener provided by the user. Different types of configurable retry logic are included to enable a client to deal with download problems as it sees fit. Normally a client asks for data newer than the latest version it knows about and will only consume data from that version onwards. One exception is a bad deployment; to quickly mitigate such a problem, a client can be pinned to a specific version of data that is known to be correct.
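A rough sketch of the client-side polling and version pinning, with hypothetical names standing in for the real gRPC client:

```python
class Subscription:
    """Sketch of a client subscription. `fetch` stands in for the gRPC
    call: fetch(None) returns the latest (version, data), while
    fetch(n) returns version n specifically (illustrative only)."""

    def __init__(self, fetch, listener, pinned_version=None):
        self._fetch = fetch
        self._listener = listener
        self._last_seen = 0
        self._pinned = pinned_version

    def poll_once(self):
        # In the real client this runs on a timer every 30 seconds.
        version, data = self._fetch(self._pinned)
        if self._pinned is not None or version > self._last_seen:
            self._last_seen = version
            self._listener(version, data)

# Usage: version 2 is a bad deploy, so a pinned client stays on version 1.
store = {1: "good", 2: "bad deploy"}
def fetch(version=None):
    v = version if version is not None else max(store)
    return v, store[v]

received = []
Subscription(fetch, lambda v, d: received.append((v, d))).poll_once()
Subscription(fetch, lambda v, d: received.append((v, d)),
             pinned_version=1).poll_once()
```

Pinning trades freshness for safety: the pinned client keeps serving the known-good version until the bad publish is fixed.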

To create visibility for auditing and usage tracking, the Gutenberg service intercepts requests from both publishers and consumers and indexes them in Elasticsearch. This gives a view of how different topics are used, and reveals topics that are no longer in use.

The work on Gutenberg is ongoing with new features planned, including client support for Node.js and Python, encryption of sensitive data, and improved incremental rollout.
