Site Reliability Engineering Content on InfoQ

News

Architecture & Design

GitHub Reworks Layered Defenses after Legacy Protections Block Legitimate Traffic

GitHub engineers recently traced user reports of unexpected “Too Many Requests” errors to abuse-mitigation rules that had accidentally remained active long after the incidents that prompted them.

Matt Foster
on Feb 04, 2026
DevOps

Human-Centred AI for SRE: Multi-Agent Incident Response without Losing Control

A growing body of recent research and industry commentary suggests that a shift in how organisations approach site reliability engineering is underway. Rather than handing the pager to a machine, teams are designing multi-agent AI systems that work alongside on-call engineers, narrowing the search space and automating the tedious steps while leaving judgment calls to humans.

Matt Saunders
on Jan 18, 2026
DevOps

How Authress Designed for Resilience and Survived a Major AWS Outage

Identity and authentication services company Authress shared its strategy to stay operational during major cloud infrastructure outages like the massive October 2025 AWS outage that disrupted many major services. According to Authress CTO Warren Parad, the company's resilience architecture relies on strategies like multi-region deployment and minimizing reliance on AWS control plane services.

Sergio De Simone
on Dec 28, 2025
Cloud

Cloudflare Global Outage Traced to Internal Database Change

Cloudflare’s recent global outage, linked to a database update, caused widespread disruption and highlighted the risks of single-vendor reliance. While service was restored, the incident sparked discussions on the importance of multi-vendor strategies in tech. Cloudflare's CEO vowed to enhance system resilience, emphasizing that outages can impact even the largest providers.

Steef-Jan Wiggers
on Nov 22, 2025
DevOps

Enhancing Reliability Using Service-Level Prioritized Load Shedding: Netflix at QCon SF 2025

At QCon San Francisco, Netflix engineers presented their service-level prioritized load-shedding strategy for maintaining reliability during traffic spikes. By prioritizing high-value requests and automating management across microservices, they protect user experience and system stability. The key takeaways stress prioritization, automation, and structured load shedding.

Steef-Jan Wiggers
on Nov 20, 2025
Cloud

Race Condition in DynamoDB DNS System: Analyzing the AWS US-EAST-1 Outage

On October 19th and 20th, AWS experienced an extended outage triggered by a failure in Amazon DynamoDB that affected most services in its most popular region, Northern Virginia. The cloud provider released an analysis of the incident, sparking discussions in the community about redundancy on AWS, moving out of public cloud, and multi-region approaches.

Renato Losio
on Nov 15, 2025
DevOps

Report Finds LLMs Not Yet Ready to Replace SREs in Incident Management

A study by ClickHouse found that large language models (LLMs) can't yet replace Site Reliability Engineers (SREs) for tasks such as finding the root causes of incidents. The study tested five leading models against real-world observability data to determine whether AI could autonomously identify production issues.

Matt Saunders
on Sep 27, 2025
Cloud

Azure Advisor Well-Architected Assessment in Public Preview to Optimize Cloud Infrastructure

Microsoft Azure recently announced the public preview of the Advisor Well-Architected assessment. This self-guided questionnaire aims to provide tailored, actionable recommendations to optimize Azure resources while aligning with the Azure Well-Architected Framework (WAF) principles.

Steef-Jan Wiggers
on Aug 29, 2024
DevOps

Advancing System Reliability: Meta's AI-Driven Approach to Root Cause Analysis

Meta recently shared how they are enhancing their system reliability through advanced investigation tools, including the AI-assisted Hawkeye, which aids in debugging machine learning workflows. By integrating Artificial Intelligence, Meta has developed a new investigation system that combines heuristic-based retrieval with large language model (LLM) ranking to assist in root cause analysis.

Claudio Masolo
on Aug 22, 2024
Cloud

AWS Introduces Amazon CloudWatch Internet Weather Map

AWS recently announced the availability of the Internet Weather Map, a new feature of CloudWatch that displays a 24-hour global snapshot of internet latency and availability outages. This new map offers a worldwide perspective on Internet conditions, allowing users to zoom in and analyze performance and availability problems in specific cities or with particular service providers.

Renato Losio
on May 01, 2024
Cloud

Google Delivers Comprehensive Cloud Infrastructure Reliability Guide

Google recently published a cloud infrastructure reliability guide for its customers that combines best practices and expertise from its engineers.

Steef-Jan Wiggers
on Jan 24, 2023
Cloud

Log Analytics Feature in Cloud Logging Now Generally Available

Google recently made its Cloud Logging Log Analytics feature generally available (GA), allowing users to search, aggregate, and transform all log data types, including application, network, and audit logs.

Steef-Jan Wiggers
on Jan 24, 2023
DevOps

Google Production Excellence Program "ProdEx": Christof Leng at DOES 2022

Christof Leng, SRE lead at Google, presented ProdEx, Google's production excellence review program that helps manage operational risks and promote best practices. ProdEx is a community that builds platforms together, establishes standards, and promotes best practices, so people learn from each other and grow. Today more than 100 SRE teams have signed up, and the program has performed more than 1000 reviews.

Shaaron A Alvares
on Oct 25, 2022
DevOps

Disney SRE "Proximity Powered Engineering" Culture: Jason Cox at DOES 2022

Jason Cox, SRE director at Disney, shares how he developed a world-class centralized shared services SRE organization based on “proximity-powered empathy engineering” and three core values. Listen: know the business, know the mission, know the team. Empathize: shared mission, shared struggles, shared wins. Actually help: build community, build trust, build magic together.

Shaaron A Alvares
on Oct 24, 2022
DevOps

Platform Engineering, DevOps, and Cognitive Load: a Summary of Community Discussions

Operations engineering is moving in the direction of platform engineering, according to Charity Majors, CTO at Honeycomb. Majors sees platform teams tending to work higher up the stack than operations, DevOps, and SRE teams do. This shift enables organizations to focus their limited development resources on their core product to drive maximum business value.

Matt Campbell
on Oct 09, 2022
