Site Reliability Engineering Content on InfoQ

  • Cloudflare Global Outage Traced to Internal Database Change

    Cloudflare’s recent global outage, linked to a database update, caused widespread disruption and highlighted the risks of single-vendor reliance. While service was restored, the incident sparked discussions on the importance of multi-vendor strategies in tech. Cloudflare's CEO vowed to enhance system resilience, emphasizing that outages can impact even the largest providers.

  • Enhancing Reliability Using Service-Level Prioritized Load Shedding: Netflix at QCon SF 2025

    At QCon San Francisco, Netflix engineers presented their service-level prioritized load-shedding strategy for maintaining reliability during traffic spikes. By prioritizing high-value requests and automating load-shedding management across microservices, they protect user experience and system stability. The key takeaways were prioritization, automation, and structured load shedding.

  • Race Condition in DynamoDB DNS System: Analyzing the AWS US-EAST-1 Outage

    On October 19th and 20th, AWS experienced an extended outage triggered by a failure in Amazon DynamoDB that affected most services in its most popular region, Northern Virginia. The cloud provider released an analysis of the incident, sparking discussions in the community about redundancy on AWS, moving out of public cloud, and multi-region approaches.

  • Report Finds LLMs Not Yet Ready to Replace SREs in Incident Management

    A study by ClickHouse found that large language models (LLMs) can't yet replace Site Reliability Engineers (SREs) for tasks such as finding the root causes of incidents. The study tested five leading models against real-world observability data to determine whether AI could autonomously identify production issues.

  • Azure Advisor Well-Architected Assessment in Public Preview to Optimize Cloud Infrastructure

    Microsoft Azure recently announced the public preview of the Advisor Well-Architected assessment. This self-guided questionnaire aims to provide tailored, actionable recommendations to optimize Azure resources while aligning with the Azure Well-Architected Framework (WAF) principles.

  • Advancing System Reliability: Meta's AI-Driven Approach to Root Cause Analysis

    Meta recently shared how they are enhancing their system reliability through advanced investigation tools, including the AI-assisted Hawkeye, which aids in debugging machine learning workflows. By integrating Artificial Intelligence, Meta has developed a new investigation system that combines heuristic-based retrieval with large language model (LLM) ranking to assist in root cause analysis.

  • AWS Introduces Amazon CloudWatch Internet Weather Map

    AWS recently announced the availability of the Internet Weather Map, a new feature of CloudWatch that displays a 24-hour global snapshot of internet latency and availability outages. This new map offers a worldwide perspective on Internet conditions, allowing users to zoom in and analyze performance and availability problems in specific cities or with particular service providers.

  • Google Delivers Comprehensive Cloud Infrastructure Reliability Guide

    Google recently published a cloud infrastructure reliability guide for its customers that combines best practices and expertise from its engineers.

  • Log Analytics Feature in Cloud Logging Now Generally Available

    Google recently made its Cloud Logging Log Analytics feature generally available (GA), allowing users to search, aggregate, and transform all log data types, including application, network, and audit logs.

  • Google Production Excellence Program "ProdEx": Christof Leng at DOES 2022

    Christof Leng, SRE lead at Google, presented ProdEx, their production excellence review program that helps manage operational risks and promote best practices. ProdEx is a community that builds platforms together, establishes standards, and promotes best practices, so people learn from each other and grow. Today they have more than 100 SRE teams signed up and have performed more than 1000 reviews.

  • Disney SRE "Proximity Powered Engineering" Culture: Jason Cox at DOES 2022

    Jason Cox, SRE director at Disney, shared how he developed a world-class centralized shared services SRE organization based on “proximity-powered empathy engineering” and three core values: Listen (Know the Business, Know the Mission, Know the Team); Empathize (Shared Mission, Shared Struggles, Shared Wins); and Actually Help (Build Community, Build Trust, Build Magic Together).

  • Platform Engineering, DevOps, and Cognitive Load: a Summary of Community Discussions

    Operations engineering is moving in the direction of platform engineering, according to Charity Majors, CTO at Honeycomb. Majors sees platform teams tending to work higher up the stack than operations, DevOps, and SRE teams do. This shift enables organizations to focus their limited development resources on their core product to drive maximum business value.

  • Dropbox Unplugs Data Center to Test Resilience

    Dropbox has published a detailed account of why and how they unplugged an entire data center to test their disaster readiness. The disaster readiness team began building tools to make performing frequent failovers possible, and ran their first formalized failover in 2019. Eventually, with new tooling and procedures, the data center was unplugged. This resulted in a significantly reduced recovery time objective (RTO).

  • Building an SLO-Driven Culture at Salesforce

    Salesforce built a platform to monitor Service Level Objectives (SLOs). The platform gave service owners deep, actionable insights into how to improve or maintain the health of their services, helped them find dips in Service Level Indicators (SLIs) and dependent services that weren’t meeting their own SLOs, and provided a better overall understanding of customers’ experience with their services.

  • Site Reliability Engineers and the Specialist Mindset

    A site reliability engineer (SRE) can be a generalist or a specialist. Recently, the team at Blameless elaborated on the advantages of a specialized SRE team. The specialist nature of the SRE role can be emphasized starting with the recruitment process. Depending on an individual's skill set, organizations can engage an SRE in a number of specialist roles.
