Resilience Content on InfoQ
-
Cloudflare Global Outage Traced to Internal Database Change
Cloudflare’s recent global outage, linked to a database update, caused widespread disruption and highlighted the risks of single-vendor reliance. While service was restored, the incident sparked discussions on the importance of multi-vendor strategies in tech. Cloudflare's CEO vowed to enhance system resilience, emphasizing that outages can impact even the largest providers.
-
EU's Cloud Sovereignty SEAL Ranking Forces Governance and Resilience Trade-offs
The EU's new Cloud Sovereignty Framework establishes a standardized assessment for cloud services, enhancing digital autonomy and reducing dependence on non-EU giants. It introduces a scorecard system based on eight Sovereignty Objectives that influences public sector procurement decisions.
-
From Outages to Order: Netflix’s Approach to Database Resilience with WAL
Netflix uses a Write-Ahead Log (WAL) system to improve data platform resilience, addressing data loss, replication entropy, multi-partition failures, and corruption. WAL decouples producers and consumers, leverages SQS/Kafka with dead-letter queues, and supports delay queues, cross-region replication, and multi-table mutations for high-throughput, consistent, and recoverable database operations.
-
AWS Launches EBS Volume Clones for Instant, Crash-Consistent Data Copies
AWS has unveiled Volume Clones for Amazon EBS, enabling instant, point-in-time copies of storage volumes with a simple API call. This feature provides rapid access with single-digit millisecond latency, ideal for quick test setups and development. Although it integrates with the EBS CSI driver, users should understand its limitations, especially around encryption and management.
-
AWS Simplifies Multi-Region Failover with ARC Region Switch
AWS's Amazon Application Recovery Controller Region Switch offers a fully managed, centralized solution for multi-region failover. To simplify disaster recovery, it automates and coordinates essential tasks across AWS services. With proactive validation and a global dashboard, it turns complex processes into push-button drills that teams can run with confidence, enhancing reliability and cost efficiency.
-
Amazon SQS Fair Queues: a New Approach to Multi-Tenant Resiliency
AWS's new Fair Queues for Amazon SQS revolutionize message handling in multi-tenant systems by mitigating the "noisy neighbor" issue. This feature ensures low message dwell times for quieter tenants without requiring code changes, enhancing both performance and fairness. Developers can effortlessly implement this capability and maintain consistent service quality across applications.
-
Grab Switches from SQS and Redis to Temporal for Its Subscription Platform
Grab based the new architecture for GrabUnlimited on Temporal. The company enhanced user experience and reduced production incidents by 80% for its subscription platform, which serves millions of users. The new architecture significantly improved robustness and scalability, addressing a range of issues with the previous solution.
-
Temporal on AWS Aims to Ease Building Resilient Distributed Systems
Temporal Technologies, the company that created Temporal, an open-source microservices orchestration platform focused on durable execution, has made Temporal Cloud available on the AWS marketplace. By offering their services via AWS, the company aims to simplify the development of resilient distributed systems for large-scale applications.
-
Cultivating a Culture of Resilience in Software Organizations
Resilience helps individuals and organizations respond to challenges. Personal resilience is built through adapting, technical resilience by mastering a variety of tools, and organizational resilience through flexibility and strong networks. In fast-changing software industries, recognizing tech shifts and fostering learning, flexibility, and collaboration enhances resilience.
-
QCon London 2025: Insights from 20+ Years in Mission-Critical Infrastructure
Matthew Liste, head of infrastructure at American Express, shared insights at QCon London 2025 on building robust cloud platforms in financial services. With 20+ years of experience, he emphasized stability, security, scalability, the value of interchangeable components, and long-term sustainability, urging professionals to maintain focus and foster a strong team culture for platform engineering.
-
Resilience Best Practices: How Amazon Builds Well-Behaved Clients and Well-Protected Services
Using the analogy of addressing the lunch rush in restaurants, Michael Haken, senior principal solutions architect at AWS, describes how Amazon builds both well-behaved clients and well-protected services through operational and architectural strategies.
-
Most Companies Experience Weekly Outages: The State of Resilience 2025 Report
According to The State of Resilience 2025 Report, published by Cockroach Labs, outages are commonplace in most organizations, with 55% of companies reporting weekly and 14% reporting daily outages. A staggering 100% of survey participants experienced revenue losses due to outages, with some companies (8%) reporting losses of USD 1 million or higher over the last 12 months.
-
How Locking, Saturation and CDN Network Issues Brought down Canva
The Canva engineering team recently published their post-mortem on the outage they experienced last November, detailing the API Gateway failure and the lessons learned during the incident.
-
Netflix Rolls Out Service-Level Prioritized Load Shedding to Improve Resiliency
Netflix extended its prioritized load-shedding implementation to the individual service level to further improve system resilience. The approach uses cloud capacity more efficiently by shedding low-priority requests only when necessary instead of maintaining separate clusters for failure isolation.
-
Microsoft's Customer Managed Planned Failover Type for Azure Storage Available in Public Preview
Microsoft’s new customer-managed planned failover for Azure Storage enhances disaster recovery by enabling geo-redundancy without data loss or reconfiguration. This proactive solution supports business continuity during outages and large-scale disasters, aligning with competitive offerings from AWS and Google Cloud.