Site Reliability Engineering Content on InfoQ
-
How a Culture of Data-Driven Conversations Can Support Platform Engineering
To provide SRE as a service, a team built a center of excellence, introducing Federated SREs and roles like production manager and technical tribe lead. They created a culture of data-driven conversations where SLOs and SLAs were democratised. Surviving growing cognitive load meant continuously simplifying architecture and embedding sovereignty and resilience into platform design decisions.
-
Google Cloud Suspends Railway's Production Account, Causing Eight-Hour Platform-Wide Outage
Google Cloud's automated systems suspended Railway's production account without notice, triggering an eight-hour platform-wide outage affecting 3 million users. The cascade took down workloads across all providers including AWS and bare metal because Railway's control plane was hosted on GCP. Railway is demoting GCP to backup-only status.
-
AWS Announces General Availability of DevOps Agent for Automated Incident Investigation
AWS has announced the general availability of DevOps Agent, a generative AI–powered assistant designed to help developers and operators troubleshoot issues, analyze deployments, and automate operational tasks across AWS environments.
-
QCon London 2026: Wrangling Telemetry at Scale, a Guide to Self-Hosted Observability
At QCon London 2026, Colin Douch discussed building and operating self-hosted monitoring stacks, surveyed the current tooling landscape, and explained how to build a coherent observability setup rather than treating logs, metrics, and traces as separate pillars.
-
War in Iran Damages Multiple AWS Data Centers, Challenging Multi-AZ Assumptions
Earlier this month, Iranian drone strikes damaged three AWS data centers in the UAE and Bahrain, causing outages and disruptions to multiple services. The events, which affected multiple facilities within the same AWS region, sparked discussion in the community about how geopolitical conflict can directly impact global cloud infrastructure and multi-AZ deployments.
-
From Paging to Postmortem: Google Cloud SREs on Using Gemini CLI for Outage Response
A recent article by Google Cloud SREs describes how they use the AI-powered Gemini CLI internally to resolve real-world outages. According to the article, the approach improves reliability in critical infrastructure operations and reduces incident response time by integrating intelligent reasoning directly into terminal-based operational tools.
-
GitHub Reworks Layered Defenses after Legacy Protections Block Legitimate Traffic
GitHub engineers recently traced user reports of unexpected “Too Many Requests” errors to abuse-mitigation rules that had accidentally remained active long after the incidents that prompted them.
-
Human‑Centred AI for SRE: Multi‑Agent Incident Response without Losing Control
A growing body of recent research and industry commentary suggests that a shift in how organisations approach site reliability engineering is underway. Rather than handing the pager to a machine, teams are designing multi-agent AI systems that work alongside on-call engineers, narrowing the search space and automating the tedious steps while leaving judgment calls to humans.
-
How Authress Designed for Resilience and Survived a Major AWS Outage
Identity and authentication services company Authress shared its strategy to stay operational during major cloud infrastructure outages like the massive October 2025 AWS outage that disrupted many major services. According to Authress CTO Warren Parad, the company's resilience architecture relies on strategies like multi-region deployment and minimizing reliance on AWS control plane services.
-
Cloudflare Global Outage Traced to Internal Database Change
Cloudflare's recent global outage, traced to an internal database change, caused widespread disruption and highlighted the risks of relying on a single vendor. Although service was restored, the incident sparked discussion about the importance of multi-vendor strategies. Cloudflare's CEO vowed to improve system resilience, noting that outages can affect even the largest providers.
-
Enhancing Reliability Using Service-Level Prioritized Load Shedding: Netflix at QCon SF 2025
At QCon San Francisco, Netflix engineers presented their service-level prioritized load-shedding strategy for maintaining reliability during traffic spikes. By prioritizing high-value requests and automating management across microservices, the approach protects user experience and system stability. The key takeaways were prioritization, automation, and structured load shedding.
-
Race Condition in DynamoDB DNS System: Analyzing the AWS US-EAST-1 Outage
On October 19th and 20th, AWS experienced an extended outage triggered by a failure in Amazon DynamoDB that affected most services in its most popular region, Northern Virginia. The cloud provider released an analysis of the incident, sparking discussions in the community about redundancy on AWS, moving out of public cloud, and multi-region approaches.
-
Report Finds LLMs Not Yet Ready to Replace SREs in Incident Management
A study by ClickHouse found that large language models (LLMs) can't yet replace Site Reliability Engineers (SREs) for tasks such as finding the root causes of incidents. The study tested five leading models against real-world observability data to determine whether AI could autonomously identify production issues.
-
Azure Advisor Well-Architected Assessment in Public Preview to Optimize Cloud Infrastructure
Microsoft Azure recently announced the public preview of the Advisor Well-Architected assessment. This self-guided questionnaire aims to provide tailored, actionable recommendations to optimize Azure resources while aligning with the Azure Well-Architected Framework (WAF) principles.
-
Advancing System Reliability: Meta's AI-Driven Approach to Root Cause Analysis
Meta recently shared how it is improving system reliability through investigation tools, including the AI-assisted Hawkeye, which aids in debugging machine learning workflows. Meta has also developed a new investigation system that combines heuristic-based retrieval with large language model (LLM) ranking to assist in root cause analysis.