
QCon New York 2017: Scaling Event Sourcing for Netflix Downloads

Phillipa Avery and Robert Reta, both senior software engineers at Netflix, presented their Cassandra-backed event sourcing architecture at QCon New York 2017. It currently powers the download feature in Netflix, and they summarised it as having improved the flexibility, reliability, scalability and debuggability of their services.

Avery explained that the main challenge of the new download feature was state. Unlike online content, which has a stateless licensing process, downloaded content must be tracked in order to enforce various business rules, such as limiting how long a license can last or how many devices a download can live on.

In order to be able to reason about state, Avery explained how Netflix decided to adopt an event sourcing architecture:

"Event sourcing gives an immutable transaction history of every single state change that has happened over the history of a customer"

With event sourcing, Reta explained that the domain model is split into aggregates, and each of those aggregates can be decomposed into the events that produce its current state. What’s key is that the aggregates themselves aren’t persisted, just the events.
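This split can be sketched in a few lines. The event and aggregate names below are invented for illustration, not Netflix's actual domain model; the point is that only the event list is ever persisted, and the aggregate's state is a fold over those events:

```java
import java.util.List;

// Hypothetical events for a download aggregate; only these events are
// persisted, never the aggregate itself.
interface DownloadEvent {}
record LicenseGranted(String titleId) implements DownloadEvent {}
record LicenseReleased(String titleId) implements DownloadEvent {}

// The aggregate's current state is derived by applying events in order.
class DownloadAggregate {
    private final java.util.Set<String> activeLicenses = new java.util.HashSet<>();

    void apply(DownloadEvent event) {
        if (event instanceof LicenseGranted g) activeLicenses.add(g.titleId());
        else if (event instanceof LicenseReleased r) activeLicenses.remove(r.titleId());
    }

    int activeLicenseCount() { return activeLicenses.size(); }

    // Rebuild the aggregate by folding its stored events, oldest first.
    static DownloadAggregate replay(List<DownloadEvent> events) {
        DownloadAggregate agg = new DownloadAggregate();
        events.forEach(agg::apply);
        return agg;
    }
}
```

Because state is derived rather than stored, replaying the same event list always yields the same aggregate, which is what makes the transaction history authoritative.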

Reta outlined the three main components of an event sourcing architecture to be:

Event Store: Typically a database which takes a row ID and returns a list of events. These events can be replayed in order to get an aggregate into its current state. 
Aggregate Repository: Something which takes a query and translates it into a statement which retrieves events from the event store. A corresponding empty aggregate is then created, and then the events are applied to it in sequence in order to get it into the correct state.
Aggregate Service: An abstraction which only knows about the domain model. It takes requests from clients, enforces business rules, and then calls the underlying repository.
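The first two components can be sketched as follows. This is a minimal in-memory illustration with invented names, standing in for the actual Cassandra-backed store; the repository's job is exactly as described above: fetch the events, create an empty aggregate, and apply the events in sequence:

```java
import java.util.*;

// Event Store: keyed by an aggregate ID, returns the ordered event list.
class EventStore {
    private final Map<String, List<String>> rows = new HashMap<>();

    void append(String id, String event) {
        rows.computeIfAbsent(id, k -> new ArrayList<>()).add(event);
    }

    List<String> eventsFor(String id) {
        return rows.getOrDefault(id, List.of());
    }
}

// Aggregate: state derived purely from events (here, a download counter).
class DownloadCount {
    private int count = 0;
    void apply(String event) { if (event.equals("DOWNLOAD_STARTED")) count++; }
    int value() { return count; }
}

// Aggregate Repository: creates an empty aggregate and applies the stored
// events in sequence to bring it to its current state.
class DownloadCountRepository {
    private final EventStore store;
    DownloadCountRepository(EventStore store) { this.store = store; }

    DownloadCount load(String id) {
        DownloadCount agg = new DownloadCount();
        store.eventsFor(id).forEach(agg::apply);
        return agg;
    }
}
```

The aggregate service would then sit on top of the repository, exposing only domain operations to clients.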

For their implementation, Netflix chose Cassandra as their event store, mainly due to its strong scalability and performance characteristics. Snapshotting was also implemented, meaning that instead of having to replay every event for an aggregate, they could load the latest snapshot of its state and only replay the events which came after it. They also used Kryo, an open source serialization framework written in Java, which provided useful features like event versioning.
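The snapshot optimisation can be illustrated with a toy aggregate whose state is a running sum (the record and method names are invented for this sketch): rebuilding starts from the snapshot's state, if one exists, and replays only the events recorded after the snapshot's version.

```java
import java.util.*;

// A snapshot captures the aggregate state as of a given event version.
record Snapshot(int state, int version) {}

class SnapshottedCounter {
    // Rebuild state from an optional snapshot plus the full event list;
    // only events after the snapshot's version are actually replayed.
    static int rebuild(Optional<Snapshot> snapshot, List<Integer> allEvents) {
        int state = snapshot.map(Snapshot::state).orElse(0);
        int from = snapshot.map(Snapshot::version).orElse(0);
        for (int i = from; i < allEvents.size(); i++) {
            state += allEvents.get(i);  // applying one event to the state
        }
        return state;
    }
}
```

Both paths produce the same state, which is why events from before a snapshot can be treated as cold data.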

Reta also walked through how they calculate a new download permission end-to-end. First, a client passes a customer ID and title ID to the download service. From this, empty download aggregates are initialized and all the corresponding events are replayed on top of them to bring them to the correct state. Finally, the service applies business rules to determine whether the user has hit the download limit.
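The end-to-end flow above can be condensed into a small sketch. All names and the limit of two downloads are assumptions made for illustration; the essential shape is replaying a customer's events into an aggregate, then checking the limit rule against the derived state:

```java
import java.util.*;

// Hypothetical service sketch: stores download events per customer and
// enforces a maximum-downloads business rule on top of the replayed state.
class DownloadService {
    static final int MAX_DOWNLOADS = 2;  // assumed limit for illustration

    private final Map<String, List<String>> eventStore = new HashMap<>();

    // Persist a "download started" event for this customer and title.
    void recordDownload(String customerId, String titleId) {
        eventStore.computeIfAbsent(customerId, k -> new ArrayList<>()).add(titleId);
    }

    // Replay the customer's events into an aggregate (here just a count),
    // then apply the business rule before permitting a new download.
    boolean mayDownload(String customerId, String titleId) {
        int activeDownloads = eventStore.getOrDefault(customerId, List.of()).size();
        return activeDownloads < MAX_DOWNLOADS;
    }
}
```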

Avery summarised her experience with the event sourcing architecture, focusing on four key points:

Flexibility: When a new business requirement required a new domain model, implementing it end-to-end was straightforward. There was also the added benefit of fully abstracting away the database.
Debugging: The events provided full auditing information, giving a view of everything that had happened in the system and making diagnosing issues simple.
Reliability: The architecture's two points of failure were Cassandra and the services. User experience was always the priority, so if something went down then sensible defaults would be returned. An example of this is always allowing a user to download content if the download service ceases to function.
Scalability: Cassandra allowed the application to scale easily. The only problem was disk space, as event data accumulates quickly. However, once snapshots were implemented, events preceding a snapshot could be moved to larger, slower disks, as they no longer needed to be loaded when serving user requests.

The full talk can be watched online, providing additional information such as more specific implementation details.
