Disaster Recovery Content on InfoQ
-
Disaster Recovery Across a Million Pieces: Michelle Brush at QCon San Francisco
During the second day of QCon San Francisco 2023, Michelle Brush, an engineering director for SRE at Google, discussed challenges, patterns, and practices for disaster recovery in massively distributed systems. Her session was part of the "Designing for Resilience" track.
-
Google Introduces Cloud Backup and Disaster Recovery
Google recently introduced Cloud Backup and Disaster Recovery (DR), allowing customers to enable centralized backup management directly from the Google Cloud console. The new backup and recovery service is designed to work with cloud storage repositories, databases, and applications.
-
How to Prepare for the Unexpected: an InfluxData Outage Story Told at KubeCon EU 22
Cloud applications promise high availability and accessibility to their users, but achieving that requires a disaster recovery plan. At KubeCon EU 22, the team behind InfluxDB shared the lessons they learned from battle-testing their disaster recovery strategy on the day they deleted production.
-
Amazon Introduces S3 Batch Replication to Replicate Existing Objects
Amazon recently introduced Batch Replication for S3, an option to replicate existing objects and synchronize buckets. The new feature is designed for use cases such as setting up disaster recovery, reducing latency, or transferring ownership of existing data.
-
Amazon Announces Elastic File System Replication for Multi-Region Deployments
Amazon recently announced Elastic File System Replication to keep an up-to-date copy of a network file system in a second AWS region or within the same region.
-
AWS Announced General Availability of Elastic Disaster Recovery
Recently AWS announced the general availability (GA) of AWS Elastic Disaster Recovery (AWS DRS). With this new service, organizations can minimize downtime and data loss through the fast, reliable recovery of on-premises and cloud-based applications.
-
Amazon Introduces AWS Resilience Hub to Monitor and Improve RPO and RTO
Amazon recently announced the availability of AWS Resilience Hub, a service designed to help customers define, measure, and manage the resilience of their applications on the cloud.
-
AWS Releases Amazon Route 53 Application Recovery Controller into General Availability
Recently, AWS announced the general availability (GA) of Amazon Route 53 Application Recovery Controller, a new set of capabilities in Amazon Route 53. These capabilities make it easier for customers to continuously monitor their applications’ ability to recover from failures and to control recovery across AWS Regions, Availability Zones, and on-premises infrastructure.
-
Microsoft Announces the Public Preview of Disk Pool for Azure VMware Solution
Microsoft recently announced the public preview of disk pool, which enables Azure Disk Storage as a persistent storage option for Azure VMware Solution, a vSAN hyper-converged vSphere cluster. With this persistent storage option, customers have another choice for running VMware workloads on Azure.
-
Uber Implements Disaster Recovery for Multi-Region Kafka
In a recent blog post, Uber engineers highlight how they use a replication platform to implement disaster recovery at scale with a multi-region Kafka deployment. Uber has a large deployment of Apache Kafka, processing trillions of messages and multiple petabytes of data per day. Uber's engineers provided business resilience and continuity in the face of natural and human-made disasters.
-
Amazon Introduces a New Feature for ElastiCache for Redis: Global Datastore
Recently Amazon announced Global Datastore, a new feature of Amazon ElastiCache for Redis that provides fully managed, fast, reliable and secure cross-region replication.
-
Summary of Chaos Community Day v4.0: Resilience, Observability, and Gamedays
Earlier in the year, the fourth edition of “Chaos Community Day” was held at Work-Bench in New York City. Key takeaways from the day included: the topic of chaos engineering draws heavily from other domains, which software engineers can also learn from; understanding systems, and communicating and exchanging the related mental models, is vital for establishing resilience.
-
Building Production-Ready Applications: Michael Kehoe Shares Lessons Learned from LinkedIn
At QCon San Francisco, Michael Kehoe presented “Building Production-Ready Applications”. Drawing on his experience with site reliability engineering (SRE), he introduced the tenets of “production-readiness” that all engineers across the organisation should focus on: stability and reliability; scalability and performance; fault tolerance and disaster recovery; monitoring; and documentation.
-
Why the World Needs More Resilient Systems: Tammy Butow Discusses Chaos Engineering at QCon London
At QCon London, Tammy Butow explained why the world needs more resilient systems, and how this can be achieved with the practice of chaos engineering. She outlined three primary prerequisites for chaos engineering: high severity “SEV” incident management, monitoring, and measuring the impact. She also presented a series of guidelines, tools, and practices.
-
Microsoft Introduces Azure Availability Zones, Completes MAREA Transatlantic Connection
In a recent blog post, Microsoft announced the expansion of High Availability (HA) and resiliency options for customers. The update comes in the form of Azure Availability Zones, which increase the availability of certain Azure services within a specific region by providing complete redundancy and isolation of the infrastructure. Azure Availability Zones include a financially backed SLA of 99.99%.