During the second day of QCon San Francisco 2023, Michelle Brush, an engineering director of SRE at Google, discussed challenges, patterns, and practices for disaster recovery in massively distributed systems. The session was part of the "Designing for Resilience" track.
Brush started with data management: addressing data corruption is essential. In the past, stateful systems relied on a single massive database, with backup policies contingent on usage frequency and a defined recovery time objective (RTO), the targeted duration between a failure and the point where operations resume. In distributed systems, however, services interact with one another and data is spread across multiple data stores, so restoring data in such complex systems may lead to consistency issues.
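To illustrate the consistency problem, here is a minimal sketch (the services, keys, and stores are hypothetical, not from the talk): when one store is restored from an older backup than its neighbor, cross-store references can dangle.

```python
# Hypothetical example: an orders store and a payments store, each
# restored from backups taken at different times.
orders = {"order-1": {"status": "paid", "payment_id": "pay-9"}}
payments = {}  # restored from an older backup: "pay-9" no longer exists

# After the restore, the order references a payment that is gone,
# leaving the overall system in an inconsistent state.
dangling = [
    order_id for order_id, order in orders.items()
    if order["payment_id"] not in payments
]
print(dangling)  # ['order-1']
```

The point is not the code itself but that each store is individually "correct" after its restore, while the system as a whole is not.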
She continued with approaches for performing restores in complex distributed systems:
- Accept the inconsistency. For many systems, however, that is not an option.
- Coordinate restoration to a globally consistent state. Brush mentioned that at Google, systems like Spanner with its TrueTime capability come into play here, emphasizing that the goal is a consistent state rather than any particular point in time. Achieving this, however, often requires downtime: a freeze period during which all writes must conclude before taking a backup.
- Rebuild the world. This involves rebuilding the entire system behind a clear "front door," designating specific sources of truth and pushing from those sources to restore the system effectively.
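The coordinated-restoration option can be sketched as follows. This is a toy model, not Google's implementation: every store is frozen for a brief period, all backups are tagged with one snapshot id, and a restore then uses only backups sharing that id.

```python
class Store:
    """Toy key-value store that supports coordinated snapshots."""

    def __init__(self):
        self.data = {}
        self.frozen = False
        self.backups = {}  # snapshot_id -> copy of data at that snapshot

    def write(self, key, value):
        if self.frozen:
            raise RuntimeError("store is frozen for a coordinated backup")
        self.data[key] = value

    def backup(self, snapshot_id):
        self.backups[snapshot_id] = dict(self.data)

    def restore(self, snapshot_id):
        self.data = dict(self.backups[snapshot_id])


def coordinated_backup(stores, snapshot_id):
    # Freeze period: writes stop everywhere, so every store is captured
    # at the same logical point, then all backups share one snapshot id.
    for s in stores:
        s.frozen = True
    try:
        for s in stores:
            s.backup(snapshot_id)
    finally:
        for s in stores:
            s.frozen = False


orders, payments = Store(), Store()
orders.write("order-1", "paid")
payments.write("pay-9", 100)
coordinated_backup([orders, payments], "snap-1")
orders.write("order-2", "pending")  # diverges after the snapshot

# Restoring both stores to the same snapshot id yields a globally
# consistent state; the post-snapshot write is lost, but nothing dangles.
orders.restore("snap-1")
payments.restore("snap-1")
print(sorted(orders.data))  # ['order-1']
```

The downtime cost is explicit in the model: while `frozen` is set, every write fails, which is exactly the freeze period Brush described.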
What if there's no clear front door for the "rebuild the world" option? Brush brought up another option: "reconcile the world," which builds on insights from the previously mentioned options.
Brush explained that this option is about bidirectional reconciliation: data flows in both directions to determine the most accurate representation of reality. To facilitate this complex process, Google uses tools such as Hadoop, Spark, Kafka, and Dataproc, enabling it to navigate and harmonize data so that systems reflect the most up-to-date and accurate information. However, according to Brush, with microservices this option still doesn't get around the Backup Availability and Consistency (BAC) theorem.
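One common shape such a reconciliation can take is a last-writer-wins merge; this sketch is an assumption for illustration, not the specific algorithm Brush described. Two stores exchange records in both directions and each keeps, per key, the version with the latest timestamp.

```python
def reconcile(store_a, store_b):
    """Bidirectional reconciliation: merge two stores of
    key -> (value, timestamp), keeping the newest version of each key,
    then push the merged view back to both stores."""
    merged = {}
    for store in (store_a, store_b):
        for key, (value, ts) in store.items():
            if key not in merged or ts > merged[key][1]:
                merged[key] = (value, ts)
    # Data flows in both directions: each store adopts the merged view.
    store_a.clear(); store_a.update(merged)
    store_b.clear(); store_b.update(merged)
    return merged


a = {"user-1": ("alice@old.example", 100)}
b = {"user-1": ("alice@new.example", 250), "user-2": ("bob@example", 90)}
reconcile(a, b)
print(a == b)  # True: both stores converge on the same state
```

Last-writer-wins is only one policy; real reconciliation may need per-record business rules to decide which side is "reality," which is part of why Brush called the process complex.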
With disaster recovery, it's essential to recognize that most systems are a blend of these approaches. When designing a system, it's crucial to consider this complexity and think about ways to mitigate potential headaches. Brush recommended that every system design be examined from four perspectives (derived from the "4+1" architectural view model): the logical view, development view, process view, and physical view, including hardware considerations. Furthermore, it's essential to account for various failure scenarios and plan accordingly.
Lastly, Brush ended the talk with advice that leveraging the logical, development, process, and physical views of a system, plus the scenario ("plus one") view, is crucial for robust disaster recovery planning. Practicing recovery scenarios is also essential when determining the Recovery Point Objective (RPO), the maximum acceptable amount of data loss measured in time, i.e., how far behind the moment of failure the restored state is allowed to lag. Without regular testing and practice, engineers won't truly know how their system will perform in the face of a disaster.
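A recovery drill makes the RPO concrete. As a simple sketch (the timestamps and one-hour target are hypothetical), the actual data-loss window is the gap between the last good backup and the failure, and a drill checks it against the target:

```python
from datetime import datetime, timedelta

# Hypothetical drill: compare the actual window of lost writes
# against a target RPO of one hour.
rpo_target = timedelta(hours=1)
last_backup = datetime(2023, 10, 3, 14, 0)
failure_time = datetime(2023, 10, 3, 14, 45)

data_loss_window = failure_time - last_backup
print(data_loss_window <= rpo_target)  # True: 45 minutes, within the RPO
```

Running such checks regularly, against real restores rather than paper estimates, is the kind of practice Brush advocated.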