
Uber Implements Disaster Recovery for Multi-Region Kafka


In a recent blog post, Uber engineers highlight how they use a home-grown replication platform to implement disaster recovery at scale with a multi-region Kafka deployment.

Uber has a large deployment of Apache Kafka, processing trillions of messages and multiple petabytes of data per day. To operate Kafka at this scale, the engineers had to ensure business resilience and continuity in the face of natural and human-made disasters. To that end, they built uReplicator, Uber's open-source solution for replicating Kafka data. uReplicator is based on Kafka's MirrorMaker, with improvements in reliability, a zero-data-loss guarantee, and ease of operation.

Uber engineers Yupeng Fu and Mingmin Chen summarize their insights:

A key insight from the practices is that offering reliable and multi-regional available infrastructure services like Kafka can greatly simplify the development of the business continuity plan for the applications. The application can store its state in the infrastructure layer and thus become stateless, leaving the complexity of state management, like synchronization and replication across regions, to the infrastructure services.

Using uReplicator, Uber engineers built the following Kafka topology for disaster recovery:


Each producer writes to a Kafka cluster in its local region, which is the most performant option. If the local Kafka cluster fails, the producer fails over to a Kafka cluster in another region. uReplicator then replicates the regional clusters into aggregate Kafka clusters present in every region, so that each aggregate cluster contains the combined data from all regions.
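The producer-side failover described above can be sketched as follows. This is a minimal illustration of the failover logic, not Uber's implementation; the class and cluster names are hypothetical, and a stand-in object simulates each regional Kafka cluster so the example is self-contained.

```python
# Sketch of producer failover across regional Kafka clusters.
# All names here are hypothetical illustrations, not Uber's code.

class RegionalCluster:
    """Stands in for a Kafka producer bound to one regional cluster."""
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.messages = []

    def send(self, topic, value):
        if not self.healthy:
            raise ConnectionError(f"cluster {self.name} unavailable")
        self.messages.append((topic, value))
        return self.name


class FailoverProducer:
    """Prefers the local regional cluster; falls back to remote regions."""
    def __init__(self, local, remotes):
        self.clusters = [local] + list(remotes)

    def send(self, topic, value):
        for cluster in self.clusters:
            try:
                return cluster.send(topic, value)
            except ConnectionError:
                continue  # try the next regional cluster
        raise RuntimeError("all regional clusters unavailable")


local = RegionalCluster("us-west", healthy=False)  # simulate a local outage
remote = RegionalCluster("us-east")
producer = FailoverProducer(local, [remote])
region = producer.send("rider-events", b"trip-started")
print(region)  # -> us-east: the send failed over to the healthy remote
```

In a real deployment the `send` calls would go through a Kafka client library, and health would be determined by delivery errors and timeouts rather than a flag.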

Message consumption always happens from the aggregate Kafka cluster in each region, using one of two consumer topologies: active-active or active-passive. Active-active is preferred when performance and faster recovery are more critical; active-passive is favored when consistency matters more.

In active-active mode, consumers in all regions read the data from the aggregate clusters, but only a designated primary service writes the processed results to an active-active database replicated across all regions. The figure below demonstrates this concept with a Flink job that calculates Uber's surge pricing data.
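The active-active pattern can be sketched in a few lines: every region consumes and computes the same result, but only the elected primary persists it. The names and the trivial "computation" below are illustrative assumptions; the source post does not describe how the primary is elected.

```python
# Sketch of active-active consumption: all regions compute results,
# but only the primary region writes to the shared store.
# Names and logic are illustrative, not Uber's implementation.

shared_store = {}

def consume_and_update(region, primary_region, records, store):
    """Each region processes records; only the primary persists results."""
    result = sum(records)        # stand-in for e.g. a surge-pricing job
    if region == primary_region:
        store["surge"] = result  # only the primary's write takes effect
    return result

# Both regions read the same aggregated data and compute the same result.
for region in ("us-west", "us-east"):
    consume_and_update(region, "us-west", [1, 2, 3], shared_store)

print(shared_store)  # -> {'surge': 6}, written once by the primary
```

The benefit is fast failover: if the primary region dies, a standby region already holds up-to-date computed state and can be promoted immediately.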


In active-passive mode, only one consumer at a time is allowed to consume from the aggregate cluster, in one of the regions. Kafka replicates the consumption offset to the other regions. When the primary region fails, the consumer fails over to another region and resumes consumption there. Uber had to handle a caveat in this approach: "messages from the aggregate clusters may become out of order after aggregating from the regional clusters." Uber therefore introduced an offset manager to resolve these discrepancies and achieve zero data loss during region failover, as depicted in the following diagram.
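One way to reason about the ordering caveat is that an offset into one region's aggregate cluster is meaningless in another region, because each region interleaves the source streams differently, whereas per-source-region offsets are valid everywhere. The sketch below illustrates that idea with a checkpoint keyed by source region; it is an assumption-laden illustration of the problem, not uReplicator's actual offset manager, and all names are hypothetical.

```python
# Sketch of offset translation for active-passive failover.
# Messages are (source_region, source_offset, value) tuples.
# Illustrative only; not uReplicator's actual offset manager.

def checkpoint(consumed):
    """Record the highest source offset seen per source region."""
    progress = {}
    for source_region, source_offset, _value in consumed:
        progress[source_region] = max(progress.get(source_region, -1),
                                      source_offset)
    return progress

def resume_position(aggregate_log, progress):
    """In the new region, skip messages the checkpoint already covers."""
    return [m for m in aggregate_log if m[1] > progress.get(m[0], -1)]

# Region A's aggregate cluster interleaved the sources one way:
consumed_in_a = [("r1", 0, "a"), ("r2", 0, "x"), ("r1", 1, "b")]
progress = checkpoint(consumed_in_a)

# Region B aggregated the same messages in a different order, plus a new one:
aggregate_in_b = [("r2", 0, "x"), ("r1", 0, "a"),
                  ("r1", 1, "b"), ("r2", 1, "y")]
remaining = resume_position(aggregate_in_b, progress)
print(remaining)  # -> [('r2', 1, 'y')]: no loss, no reprocessing
```

Resuming from a raw aggregate offset instead would either skip or replay messages after failover, because the two regions' aggregate logs are ordered differently.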

