Resilience Content on InfoQ

News

Culture & Methods

Building Software Organisations Where People Can Thrive

Continuous learning, adaptability, and strong support networks are the foundations for thriving teams, Matthew Card mentioned. Trust is built through consistent, fair leadership and addressing toxic behaviour, bias, and microaggressions early. By fostering growth, psychological safety, and accountability, people-first leadership drives resilience, collaboration, and performance.

Ben Linders
on Jan 29, 2026
DevOps

Cloudflare Launches "Code Orange: Fail Small" Resilience Plan after Multiple Global Outages

Cloudflare recently published a detailed resilience initiative called Code Orange: Fail Small, outlining a comprehensive plan to prevent large-scale service disruptions after two major network outages in the past six weeks.

Craig Risi
on Jan 16, 2026
Cloud

Cloudflare Global Outage Traced to Internal Database Change

Cloudflare’s recent global outage, linked to a database update, caused widespread disruption and highlighted the risks of single-vendor reliance. While service was restored, the incident sparked discussions on the importance of multi-vendor strategies in tech. Cloudflare's CEO vowed to enhance system resilience, emphasizing that outages can impact even the largest providers.

Steef-Jan Wiggers
on Nov 22, 2025
Cloud

EU's Cloud Sovereignty SEAL Ranking Forces Governance and Resilience Trade-offs

The EU's new Cloud Sovereignty Framework establishes a standardized assessment for cloud services, enhancing digital autonomy and reducing dependence on non-EU giants. It introduces a scorecard system based on eight Sovereignty Objectives that influences public sector procurement decisions.

Steef-Jan Wiggers
on Nov 03, 2025
Architecture & Design

From Outages to Order: Netflix’s Approach to Database Resilience with WAL

Netflix uses a Write-Ahead Log (WAL) system to improve data platform resilience, addressing data loss, replication entropy, multi-partition failures, and corruption. WAL decouples producers and consumers, leverages SQS/Kafka with dead-letter queues, and supports delay queues, cross-region replication, and multi-table mutations for high-throughput, consistent, and recoverable database operations.

Leela Kumili
on Oct 31, 2025
Cloud

AWS Launches EBS Volume Clones for Instant, Crash-Consistent Data Copies

AWS has unveiled Volume Clones for Amazon EBS, enabling instant, point-in-time copies of storage volumes with a simple API call. The feature provides rapid access with single-digit millisecond latency, which suits quick test setups and development. Although it integrates with the EBS CSI driver, users should understand its limitations, especially around encryption and management.

Steef-Jan Wiggers
on Oct 24, 2025
Cloud

AWS Simplifies Multi-Region Failover with ARC Region Switch

AWS's Amazon Application Recovery Controller Region Switch revolutionizes multi-region failover with a fully-managed, centralized solution. Simplifying disaster recovery, it automates and coordinates essential tasks across AWS services. With proactive validation and a global dashboard, it transforms complex processes into confident, push-button drills, enhancing reliability and cost efficiency.

Steef-Jan Wiggers
on Aug 14, 2025
Cloud

Amazon SQS Fair Queues: a New Approach to Multi-Tenant Resiliency

AWS's new Fair Queues for Amazon SQS revolutionize message handling in multi-tenant systems by mitigating the "noisy neighbor" issue. This feature ensures low message dwell times for quieter tenants without requiring code changes, enhancing both performance and fairness. Developers can effortlessly implement this capability and maintain consistent service quality across applications.

Steef-Jan Wiggers
on Jul 31, 2025
Architecture & Design

Grab Switches from SQS and Redis to Temporal for Its Subscription Platform

Grab based the new architecture for GrabUnlimited on Temporal. The company enhanced user experience and reduced production incidents by 80% for its subscription platform, which serves millions of users. The new architecture significantly improved robustness and scalability, addressing a range of issues with the previous solution.

Rafal Gancarz
on Jul 21, 2025
Cloud

Temporal on AWS Aims to Ease Building Resilient Distributed Systems

Temporal Technologies, the company that created Temporal, an open-source microservices orchestration platform focused on durable execution, has made Temporal Cloud available on the AWS marketplace. By offering its services via AWS, the company aims to simplify the development of resilient distributed systems for large-scale applications.

Steef-Jan Wiggers
on May 09, 2025
Culture & Methods

Cultivating a Culture of Resilience in Software Organizations

Resilience helps individuals and organizations respond to challenges. Personal resilience is built through adapting, technical resilience by mastering a variety of tools, and organizational resilience through flexibility and strong networks. In fast-changing software industries, recognizing tech shifts and fostering learning, flexibility, and collaboration enhance resilience.

Ben Linders
on May 01, 2025
Cloud

QCon London 2025: Insights from 20+ Years in Mission-Critical Infrastructure

Matthew Liste, head of infrastructure at American Express, shared insights at QCon London 2025 on building robust cloud platforms in financial services. With 20+ years of experience, he emphasized stability, security, scalability, the value of interchangeable components, and long-term sustainability, urging professionals to maintain focus and foster a strong team culture for platform engineering.

Steef-Jan Wiggers
on Apr 10, 2025
Cloud

Resilience Best Practices: How Amazon Builds Well-Behaved Clients and Well-Protected Services

Using the analogy of addressing the lunch rush in restaurants, Michael Haken, senior principal solutions architect at AWS, describes how Amazon builds both well-behaved clients and well-protected services through operational and architectural strategies.

Renato Losio
on Mar 08, 2025
Architecture & Design

Most Companies Experience Weekly Outages: The State of Resilience 2025 Report

According to The State of Resilience 2025 Report, published by Cockroach Labs, outages are commonplace in most organizations, with 55% of companies reporting weekly outages and 14% reporting daily outages. A staggering 100% of survey participants experienced revenue losses due to outages, with some companies (8%) reporting losses of USD 1 million or more over the last 12 months.

Rafal Gancarz
on Feb 16, 2025
DevOps

How Locking, Saturation and CDN Network Issues Brought down Canva

The Canva engineering team recently published their post-mortem on the outage they experienced last November, detailing the API Gateway failure and the lessons learned during the incident.

Renato Losio
on Feb 08, 2025
