Site Reliability Engineering Content on InfoQ

Articles

DevOps

Configuration as a Control Plane: Designing for Safety and Reliability at Scale

Configuration has evolved from static deployment files into a live control plane that directly shapes system behavior. The evolution of configuration management highlights why misconfigurations can trigger large outages and how hyperscalers deploy changes safely using staged rollouts, validation, blast radius limits, and automated rollback at scale.

Karthiek Maralla
on Mar 20, 2026
DevOps

Change as Metrics: Measuring System Reliability through Change Delivery Signals

System changes are the primary driver of production incidents, making change-related metrics essential reliability signals. A minimal metric set of Change Lead Time, Change Success Rate, and Incident Leakage Rate assesses delivery efficiency and reliability, supported by actionable technical metrics and an event-centric data warehouse for unified change observability.

Peihao Yuan
on Mar 09, 2026
DevOps

Proactive Autoscaling for Edge Applications in Kubernetes

Kubernetes often reacts too late when traffic suddenly increases at the edge. A proactive scaling approach that considers response time, spare CPU capacity, and container startup delays can add or remove instances more smoothly, prevent sudden spikes, and keep performance stable on systems with limited resources.

Rajeev Kallayil Ravi
on Feb 17, 2026
DevOps

From Alert Fatigue to Agent-Assisted Intelligent Observability

As systems grow, observability becomes harder to maintain and incidents harder to diagnose. Agentic observability layers AI on existing tools, starting in read-only mode to detect anomalies and summarize issues. Over time, agents add context, correlate signals, and automate low-risk tasks. This approach frees engineers to focus on analysis and judgment.

Rohit Dhawan
on Feb 04, 2026
DevOps

Overload Protection: the Missing Pillar of Platform Engineering

Overload protection is often overlooked in platform engineering, leaving teams to create inconsistent, fragile fixes. Centralized rate limits, quotas, adaptive controls, and clear visibility give services predictable ways to handle traffic spikes, reduce reliability debt, and prevent cascading failures across systems.

Gaurav Nanda, Tapan Manaktala
on Dec 09, 2025
Architecture & Design

Building Resilient Platforms: Insights from over Twenty Years in Mission-Critical Infrastructure

Building resilient platforms requires understanding the art and science of creating infrastructure that others depend on for critical applications. This perspective applies to anyone who builds software consumed by others at scale, whether infrastructure platforms, software development platforms, or messaging systems.

Matthew Liste
on Nov 10, 2025
Cloud

Designing Resilient Event-Driven Systems at Scale

Learn how to design resilient event-driven systems that scale. Explore key patterns like shuffle sharding and decoupling queues to handle load spikes and failures. Understand common pitfalls like over-relying on retries and neglecting observability for robust, scalable architectures.

Rajesh Kumar Pandey
on May 30, 2025
DevOps

Checklist for Kubernetes in Production: Best Practices for SREs

This article provides SREs with a checklist for managing Kubernetes in production environments. It identifies common challenges including resource management, workload placement, high availability, health probes, storage, monitoring, and cost optimization. By implementing consistent GitOps automation across these areas, teams can significantly reduce complexity and prevent downtime.

Utku Darilmaz
on Mar 10, 2025
DevOps

Mastering Impact Analysis and Optimizing Change Release Processes

Dynamic IT professional with a proven track record in optimizing production processes and analyzing outages in complex systems handling millions of TPS. The recent CrowdStrike outage highlights the importance of continuous improvement and adherence to best practices. Passionate about elevating operational excellence through strategic reviews and effective process enhancements.

Tejas Ghadge
on Sep 11, 2024
Culture & Methods

How Platform and Site Reliability Engineering Are Evolving DevOps

Companies are now looking to grow and more effectively manage DevOps with platform engineering and site reliability engineering roles. No one has these roles perfectly carved out right now — there’s just too much to do and not enough people to do it — but knowing where these three disciplines do and don’t overlap will help organizations evolve and take advantage when they are ready.

Narayanan Raghavan
on Feb 06, 2024
Culture & Methods

Data-Driven Decision Making - Software Delivery Performance Indicators at Different Granularities

Optimizing a software delivery organization is not a straightforward process standardized in the software industry. Getting the organization to analyze the data and act on it is a difficult undertaking. This article presents insights into how a socio-technical framework for optimizing a software delivery organization has been set up and brought to the point of regular use.

Vladyslav Ukis
on May 23, 2023
DevOps

AIOps: Site Reliability Engineering at Scale

AIOps can simplify and streamline processes, which can reduce the mental burden on employees while improving communication and collaboration between departments.

Dominick Blue
on May 02, 2023
