Resilience Content on InfoQ
-
Resilience Best Practices: How Amazon Builds Well-Behaved Clients and Well-Protected Services
Using the analogy of addressing the lunch rush in restaurants, Michael Haken, senior principal solutions architect at AWS, describes how Amazon builds both well-behaved clients and well-protected services through operational and architectural strategies.
-
Most Companies Experience Weekly Outages: The State of Resilience 2025 Report
According to The State of Resilience 2025 Report, published by Cockroach Labs, outages are commonplace in most organizations, with 55% of companies reporting weekly outages and 14% reporting daily outages. A staggering 100% of survey participants experienced revenue losses due to outages, with some companies (8%) reporting losses of US$1 million or more over the last 12 months.
-
How Locking, Saturation and CDN Network Issues Brought down Canva
The Canva engineering team recently published their post-mortem on the outage they experienced last November, detailing the API Gateway failure and the lessons learned during the incident.
-
Netflix Rolls Out Service-Level Prioritized Load Shedding to Improve Resiliency
Netflix extended its prioritized load-shedding implementation to the individual service level to further improve system resilience. The approach uses cloud capacity more efficiently by shedding low-priority requests only when necessary instead of maintaining separate clusters for failure isolation.
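As a rough illustration of the idea of shedding low-priority requests only when necessary, the sketch below rejects requests by priority tier as utilization rises. It is not Netflix's implementation: the tier names, the thresholds, and the `should_shed` function are all hypothetical, chosen only to show the shape of the technique.

```python
# Hypothetical priority tiers (assumption): lower number = more important.
CRITICAL, DEGRADED, BEST_EFFORT, BULK = 0, 1, 2, 3

# Hypothetical utilization thresholds (assumption): a tier is shed once
# utilization rises above its threshold. CRITICAL has no entry.
SHED_ABOVE = {BULK: 0.60, BEST_EFFORT: 0.75, DEGRADED: 0.90}


def should_shed(priority: int, utilization: float) -> bool:
    """Return True if a request of this priority should be rejected."""
    threshold = SHED_ABOVE.get(priority)
    # Tiers without a threshold (here, CRITICAL) are never shed.
    return threshold is not None and utilization > threshold


if __name__ == "__main__":
    for tier in (CRITICAL, DEGRADED, BEST_EFFORT, BULK):
        print(tier, should_shed(tier, utilization=0.80))
```

Because every tier shares the same capacity, nothing is rejected while utilization stays below the lowest threshold, which is the efficiency argument made for this approach over separate clusters.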
-
Microsoft's Customer Managed Planned Failover Type for Azure Storage Available in Public Preview
Microsoft’s new customer-managed planned failover for Azure Storage enhances disaster recovery by enabling geo-redundancy without data loss or reconfiguration. This proactive solution supports business continuity during outages and large-scale disasters, aligning with competitive offerings from AWS and Google Cloud.
-
Google Cloud Enhances Spanner with Dual-Region Configuration
Google Cloud has introduced a significant update to its fully managed distributed SQL database service, Spanner, which now offers a dual-region configuration option. With this enhancement, the company aims to help enterprises comply with data residency norms in countries with limited cloud support while ensuring high availability.
-
Modern Data Architecture, ML, and Resilience Topics Announced for QCon San Francisco 2024
QCon San Francisco returns November 18-22, focusing on innovations and emerging trends you should pay attention to in 2024. With technical talks from international software practitioners, QCon will provide actionable insights and skills you can take back to your teams.
-
QCon London: Scaling Microservices Architecture and Technology Organization at Trainline
During the recent QCon London conference, Trainline’s CTO spoke about the evolution of the company’s system architecture and organizational structure over the last five years. The company had to adapt to market changes and growing customer expectations by improving the performance and reliability of its technology platform.
-
InfoQ & QCon Events: Level up on Generative AI, Security, Platform Engineering, and More Upcoming
As we navigate through these transformative times, the upcoming InfoQ events stand as a platform to help you stay ahead, gain valuable insights, and find practical solutions to your development challenges in 2024 and beyond. The events are carefully curated for senior software engineers, architects, and team leaders, offering practitioner insights into emerging trends, patterns, and practices.
-
Uber Improves Resiliency of Microservices with Adaptive Load Shedding
Uber created a new load-shedding library for its microservice platform, which serves over 130 million customers and handles aggregated peaks of millions of requests per second (RPS). The company replaced its QALM-based solution with the Cinnamon library, which, in addition to graceful degradation, can dynamically and continuously adjust the capacity of the service and the amount of load shedding.
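To make "dynamically and continuously adjusting capacity" concrete, here is a minimal sketch of one generic way to do it: an additive-increase, multiplicative-decrease (AIMD) concurrency limit driven by observed latency. This is not the Cinnamon algorithm; the `AdaptiveLimiter` class, its parameters, and the latency target are assumptions made for illustration.

```python
class AdaptiveLimiter:
    """Hypothetical AIMD concurrency limiter (illustrative only)."""

    def __init__(self, limit=10, min_limit=1, max_limit=100,
                 target_latency_ms=50.0):
        self.limit = limit                  # current allowed in-flight requests
        self.min_limit = min_limit
        self.max_limit = max_limit
        self.target_latency_ms = target_latency_ms
        self.in_flight = 0

    def try_acquire(self) -> bool:
        """Admit the request if under the current limit, else shed it."""
        if self.in_flight >= self.limit:
            return False
        self.in_flight += 1
        return True

    def release(self, latency_ms: float) -> None:
        """Record a finished request and adapt the limit to its latency."""
        self.in_flight -= 1
        if latency_ms > self.target_latency_ms:
            # Too slow: halve the limit (multiplicative decrease).
            self.limit = max(self.min_limit, int(self.limit * 0.5))
        else:
            # Healthy: probe for more capacity (additive increase).
            self.limit = min(self.max_limit, self.limit + 1)
```

A caller would wrap each request in `try_acquire()` and `release(latency_ms)`, returning an overload error when `try_acquire()` is false, so the amount of shedding follows the measured health of the service instead of a fixed threshold.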
-
Zonal Autoshift on AWS: Optimizing Infrastructure Reliability
Zonal autoshift, a new capability of Amazon Route 53 Application Recovery Controller, automatically shifts traffic away from an Availability Zone (AZ) when a potential failure is identified by the cloud provider. The service redirects the traffic back once the AZ failure is resolved.
-
Slack Migrates to Cell-Based Architecture on AWS to Mitigate Gray Failures
Slack migrated most of the critical user-facing services from a monolithic to a cell-based architecture over the last 1.5 years. The move was triggered by the impact of networking outages affecting a single availability zone, causing user-impacting service degradation. The new architecture allows incrementally draining all the traffic away from the affected availability zone within 5 minutes.
-
Roblox Builds New Cellular Infrastructure to Improve Gaming Experience
The online game platform and creation system Roblox has detailed how it made its infrastructure more efficient and resilient to support the demands of more than 70 million daily active users engaged in immersive 3D experiences.
-
Chaos Engineering Service Azure Chaos Studio Now Generally Available
Two years after entering public preview, reliability experimentation service Azure Chaos Studio is now generally available. Among its most recent features are experiment templates, dynamic targets, load testing faults, and more.
-
Polly v8 .NET Resilience Library: Resilience Pipelines, Built-in Telemetry, and More
Polly v8 is officially released. This version brings enhancements such as resilience pipelines, built-in telemetry support, and some changes within the configuration for individual resilience strategies.