Resilience Content on InfoQ

News

DevOps

How Locking, Saturation and CDN Network Issues Brought down Canva

The Canva engineering team recently published their post-mortem on the outage they experienced last November, detailing the API Gateway failure and the lessons learned during the incident.

Renato Losio
on Feb 08, 2025

Architecture & Design

Netflix Rolls Out Service-Level Prioritized Load Shedding to Improve Resiliency

Netflix extended its prioritized load-shedding implementation to the individual service level to further improve system resilience. The approach uses cloud capacity more efficiently by shedding low-priority requests only when necessary instead of maintaining separate clusters for failure isolation.

Rafal Gancarz
on Nov 23, 2024

Cloud

Microsoft's Customer Managed Planned Failover Type for Azure Storage Available in Public Preview

Microsoft’s new customer-managed planned failover for Azure Storage enhances disaster recovery by enabling geo-redundancy without data loss or reconfiguration. This proactive solution supports business continuity during outages and large-scale disasters, aligning with competitive offerings from AWS and Google Cloud.

Steef-Jan Wiggers
on Sep 19, 2024

Cloud

Google Cloud Enhances Spanner with Dual-Region Configuration

Google Cloud has introduced a significant update to Spanner, its fully managed distributed SQL database service, which now offers a dual-region configuration option. With this enhancement, the company aims to help enterprises comply with data residency norms in countries with limited cloud support while ensuring high availability.

Steef-Jan Wiggers
on Aug 01, 2024

AI, ML & Data Engineering

Modern Data Architecture, ML, and Resilience Topics Announced for QCon San Francisco 2024

QCon San Francisco returns November 18-22, focusing on innovations and emerging trends you should pay attention to in 2024. With technical talks from international software practitioners, QCon will provide actionable insights and skills you can take back to your teams.

Artenisa Chatziou
on May 10, 2024

Architecture & Design

QCon London: Scaling Microservices Architecture and Technology Organization at Trainline

During the recent QCon London conference, Trainline’s CTO spoke about the evolution of the company’s system architecture and organizational structure over the last five years. The company had to adapt to market changes and growing customer expectations by improving the performance and reliability of its technology platform.

Rafal Gancarz
on Apr 17, 2024

Development

InfoQ & QCon Events: Level up on Generative AI, Security, Platform Engineering, and More Upcoming

As we navigate through these transformative times, the upcoming InfoQ events stand as a platform to help you stay ahead, learn valuable insights, and find practical solutions to your development challenges in 2024 and beyond. The events are carefully curated for senior software engineers, architects, and team leaders, offering practitioner insights into emerging trends, patterns, and practices.

Artenisa Chatziou
on Feb 09, 2024

Architecture & Design

Uber Improves Resiliency of Microservices with Adaptive Load Shedding

Uber created a new load-shedding library for its microservice platform, which serves over 130 million customers and handles aggregated peaks of millions of requests per second (RPS). The company replaced its QALM-based solution with the Cinnamon library, which, in addition to graceful degradation, can dynamically and continuously adjust the capacity of the service and the amount of load shedding.

Rafal Gancarz
on Feb 06, 2024

Cloud

Zonal Autoshift on AWS: Optimizing Infrastructure Reliability

Zonal autoshift, a new capability of Amazon Route 53 Application Recovery Controller, automatically shifts traffic away from an Availability Zone (AZ) when a potential failure is identified by the cloud provider. The service redirects the traffic back once the AZ failure is resolved.

Renato Losio
on Jan 30, 2024

Architecture & Design

Slack Migrates to Cell-Based Architecture on AWS to Mitigate Gray Failures

Slack migrated most of the critical user-facing services from a monolithic to a cell-based architecture over the last 1.5 years. The move was triggered by the impact of networking outages affecting a single availability zone, causing user-impacting service degradation. The new architecture allows incrementally draining all the traffic away from the affected availability zone within 5 minutes.

Rafal Gancarz
on Jan 17, 2024

DevOps

Roblox Builds New Cellular Infrastructure to Improve Gaming Experience

The online game platform and creation system Roblox has detailed how it made its infrastructure more efficient and resilient to support the demands of more than 70 million daily active users engaged in immersive 3D experiences.

Matt Saunders
on Jan 03, 2024

.NET

Polly v8 .NET Resilience Library: Resilience Pipelines, Built-in Telemetry, and More

Polly v8 is officially released. This version brings enhancements such as resilience pipelines, built-in telemetry support, and some changes within the configuration for individual resilience strategies.

Robert Krzaczyński
on Nov 07, 2023

Architecture & Design

Monzo Employs Targeted Traffic Shedding against Stampeding Herd Effect from the Mobile App

Monzo developed a solution for shedding traffic in case its platform comes under intense and unexpected load that could lead to an outage. Traffic spikes can be generated by the mobile app and triggered by push notifications or other bursts in user activity. The solution can reduce the read traffic by almost 50% with 90% overall accuracy without noticeable customer impact.

Rafal Gancarz
on Oct 23, 2023

Architecture & Design

How Amazon Prime Video Delivers 99.999% Availability While Reducing Costs

Amazon Prime Video created a highly available live video streaming architecture by combining redundant components to achieve the five nines of availability that its platform requires. The company optimized the deployment topology and video encoding to reduce costs while ensuring optimal video quality for users.

Rafal Gancarz
on Oct 09, 2023

Java

QCon San Francisco 2023 Day 3: Architecting the Cloud, Deep Tech, Frontend Trends, Org Resilience

The 17th annual QCon San Francisco conference was held at the Hyatt Regency San Francisco in California. This five-day event, organized by C4Media, consisted of three days of presentations and two days of workshops. Day Three, held on October 4th, 2023, included a keynote address by Will Larson and presentations from four conference tracks and one sponsored track.

Michael Redlich
on Oct 06, 2023
