Netflix has developed the Rapid Event Notification System (RENO) to create a consistent user experience across various platforms and devices. RENO reacts more quickly and consistently than the traditional request/response model to user-generated actions, ranging from watching a title to changing profile information.
Netflix currently serves 222 million paying subscribers, according to its latest shareholder letter, and supports a wide range of devices, from smartphones and laptops to at-home electronics such as smart TVs and modern game consoles. The combination of a large subscriber base and a diverse device fleet creates challenges in scaling, compatibility, and resilience.
Netflix RENO architecture diagram - source: Netflix technology blog
Netflix made a couple of notable design decisions to address the scaling challenges it faces. RENO segments incoming events by priority and routes them to priority-specific AWS SQS queues and the corresponding compute instance clusters. This helps deliver more important updates, such as “change a profile’s maturity”, to Netflix devices faster. Events are also passed through a staleness filter and are not processed if their age exceeds the configured threshold, as many events have little to no value unless they are sent almost immediately.
In contrast to traditional push-only notification systems, such as AWS SNS, RENO implements a push-and-pull delivery model that pushes notifications to online devices on a best-effort basis and has devices pull periodically throughout the application lifecycle. This ensures that devices consistently receive updates for user-generated events and addresses Netflix's compatibility challenges across a variety of platforms and device types, especially legacy devices that don’t support push notifications. This model leverages the Netflix Customer Messaging Platform to send notifications to mobile devices, uses Zuul Push for TVs and other streaming devices, and uses a Cassandra database to store event histories for long polling.
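A minimal sketch of the push-and-pull idea, assuming a simple sequence-numbered event history, might look like the following. The `Device` and `HybridNotifier` classes are hypothetical, and a Python list stands in for the Cassandra-backed event history; none of this reflects Netflix's actual APIs.

```python
from dataclasses import dataclass, field


@dataclass
class Device:
    device_id: str
    online: bool
    supports_push: bool = True
    seen: set = field(default_factory=set)       # sequence numbers already delivered
    received: list = field(default_factory=list)

    def deliver(self, seq: int, payload: str) -> None:
        # Ignore duplicates so push and pull can overlap safely.
        if seq in self.seen:
            return
        self.seen.add(seq)
        self.received.append(payload)


class HybridNotifier:
    """Best-effort push to online devices, plus pull from a stored history."""

    def __init__(self):
        self.history = []  # stand-in for a persistent event history store

    def publish(self, payload: str, devices: list) -> int:
        seq = len(self.history)
        self.history.append(payload)
        for device in devices:
            # Push only reaches devices that are online and support push.
            if device.online and device.supports_push:
                device.deliver(seq, payload)
        return seq

    def pull(self, device: Device) -> list:
        # A device asks for everything it has not seen yet.
        missed = [(seq, p) for seq, p in enumerate(self.history)
                  if seq not in device.seen]
        for seq, payload in missed:
            device.deliver(seq, payload)
        return [payload for _, payload in missed]


tv = Device("tv", online=True)
legacy = Device("legacy-player", online=True, supports_push=False)
notifier = HybridNotifier()
notifier.publish("profile_updated", [tv, legacy])
pulled = notifier.pull(legacy)  # the legacy device catches up by pulling
```

The pull path gives devices without push support, or devices that were offline during the push, a way to catch up on events they missed.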
The latest InfoQ architecture and design trends report placed design for resilience in the early adopter category. Resilience can be observed in many layers of RENO. The priority-specific queues and processing clusters mentioned above are segmented from one another: if one or more of the queues or processing clusters fail, their siblings are not affected and the overall system should remain available. Similarly, the outbound messaging system uses a fanout pattern that delivers notifications by device and platform type, such that if “a downstream service or platform fails to deliver the notification, the other devices are not blocked from receiving push notifications.”
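The failure isolation in the fanout pattern can be illustrated with a short Python sketch. The `fan_out` function and the sender callables below are hypothetical; the point is only that an error in one platform's delivery path is caught and recorded rather than allowed to stop delivery to the others.

```python
def fan_out(notification: str, senders: dict) -> dict:
    """Send a notification through every platform-specific sender.

    `senders` maps a platform name to a callable; both are assumptions
    made for this sketch. A failure in one sender is isolated so the
    remaining platforms still receive the notification.
    """
    results = {}
    for platform, send in senders.items():
        try:
            send(notification)
            results[platform] = "ok"
        except Exception as exc:
            # Record the failure and continue with the other platforms.
            results[platform] = f"failed: {exc}"
    return results


delivered = []


def send_mobile(notification: str) -> None:
    delivered.append(("mobile", notification))


def send_tv(notification: str) -> None:
    raise RuntimeError("tv gateway down")


def send_web(notification: str) -> None:
    delivered.append(("web", notification))


results = fan_out("profile_updated",
                  {"mobile": send_mobile, "tv": send_tv, "web": send_web})
```

This sequential loop only isolates errors; a production system would more likely give each platform its own queue or worker so that a slow sender cannot delay the others either.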
RENO was successful at Netflix and was quickly positioned as the centralized rapid notification service for all product areas at the company. RENO is not currently open source, although some of its supporting tools are, such as Zuul, used to keep device connections alive in RENO, and Mantis, used for observability in RENO.