Distributed Materialized Views: How Airbnb’s Riverbed Processes 2.4 Billion Daily Events

Airbnb created Riverbed, a Lambda-like data framework for producing and managing distributed materialized views. The framework supports over 50 read-heavy use cases where data is sourced from multiple data sources within the company’s service-oriented architecture (SOA) platform. It uses Apache Kafka and Apache Spark for online and offline components, respectively.

Airbnb observed that some complex queries spanning multiple distinct data stores were responsible for contributing to the latency profile of some popular features of the platform. The team couldn’t use a standard materialized view approach, provided out-of-the-box by many databases, as the data required to compute the materialized view was not located in a single database.

The team considered a technique to create a distributed materialized view using a combination of change data capture (CDC), stream processing, and a dedicated database to store final results. They weighed up data processing architectures:

Lambda and Kappa are two real-time data processing architectures. Lambda combines batch and real-time processing for efficient handling of large data volumes, while Kappa focuses solely on streaming processing. Kappa’s simplicity offers better maintainability, but it poses challenges for implementing backfill mechanisms and ensuring data consistency, especially with out-of-order events.

Riverbed framework adopted Lambda architecture and provided a declarative way to define data queries and computation logic using GraphQL for both the online (real-time events) and offline (data backfill) components. The framework takes care of concurrency, versioning, and data correctness guarantees, as well as integration with infrastructure components.

Stream Processing in Riverbed (Source: Airbnb Engineering Blog)

For real-time processing, Riverbed consumes change data capture (CDC) events emitted by the data sources using Apache Kafka for messaging. Events from the CDC trigger updating the materialized view by executing the aggregation logic defined with GraphQL, and the resulting document is stored in the materialized view database. The processing is highly parallelized and batched for efficiency.

The streaming pipeline avoids race conditions since the CDC events are repartitioned in Apache Kafka based on the identifier of the materialized view document so that updates to the materialized views are done sequentially. Additionally, optimistic concurrency control is used between online (real-time) and offline (batch) processing to avoid concurrent writes and potential data inconsistencies.

Batch Processing in Riverbed (Source: Airbnb Engineering Blog)

Riverbed supports data backfilling and reconciliation in case of issues with real-time processing due to missing CDC events perhaps. This part of the solution uses Apache Spark over the data from the data warehouse storing daily snapshots. The framework generates Spark SQL based on GraphQL definitions configured in Riverbed.

Riverbed currently processes 2.4 billion events and writes 350 million documents daily, powering over 50 materialized views across features such as payments, search, reviews, itineraries, and internal products in Airbnb.

About the Author

Rafał Gancarz

Show moreShow less

InfoQ Software Architects' Newsletter

Follow us on

About the Author

Rafał Gancarz

Rate this Article

This content is in the Apache Kafka topic

Related Topics:

Related Editorial

Related Sponsors

Popular across InfoQ

The InfoQ Newsletter