InfoQ Homepage Site Reliability Engineering Content on InfoQ
-
Configuration as a Control Plane: Designing for Safety and Reliability at Scale
Configuration has evolved from static deployment files into a live control plane that directly shapes system behavior. The evolution of configuration management highlights why misconfigurations can trigger large outages and how hyperscalers deploy changes safely using staged rollouts, validation, blast radius limits, and automated rollback at scale.
-
Change as Metrics: Measuring System Reliability through Change Delivery Signals
System changes are the primary driver of production incidents, making change-related metrics essential reliability signals. A minimal metric set of Change Lead Time, Change Success Rate, and Incident Leakage Rate assesses delivery efficiency and reliability, supported by actionable technical metrics and an event-centric data warehouse for unified change observability.
-
Proactive Autoscaling for Edge Applications in Kubernetes
Kubernetes often reacts too late when traffic suddenly increases at the edge. A proactive scaling approach that considers response time, spare CPU capacity, and container startup delays can add or remove instances more smoothly, prevent sudden spikes, and keep performance stable on systems with limited resources.
-
From Alert Fatigue to Agent-Assisted Intelligent Observability
As systems grow, observability becomes harder to maintain and incidents harder to diagnose. Agentic observability layers AI on existing tools, starting in read-only mode to detect anomalies and summarize issues. Over time, agents add context, correlate signals, and automate low-risk tasks. This approach frees engineers to focus on analysis and judgment.
-
Overload Protection: the Missing Pillar of Platform Engineering
Overload protection is often overlooked in platform engineering, leaving teams to create inconsistent, fragile fixes. Centralized rate limits, quotas, adaptive controls, and clear visibility give services predictable ways to handle traffic spikes, reduce reliability debt, and prevent cascading failures across systems.
-
Building Resilient Platforms: Insights from over Twenty Years in Mission-Critical Infrastructure
Building resilient platforms requires understanding the art and science of creating infrastructure that others depend on for critical applications. This perspective applies to anyone who builds software consumed by others at scale. Whether developing infrastructure platforms, software development platforms, or messaging systems, principles address how to build software that others consume at scale
-
Designing Resilient Event-Driven Systems at Scale
Learn how to design resilient event-driven systems that scale. Explore key patterns like shuffle sharding and decoupling queues to handle load spikes and failures. Understand common pitfalls like over-relying on retries and neglecting observability for robust, scalable architectures.
-
Checklist for Kubernetes in Production: Best Practices for SREs
This article provides SREs with a checklist for managing Kubernetes in production environments. It identifies common challenges including resource management, workload placement, high availability, health probes, storage, monitoring, and cost optimization. By implementing consistent GitOps automation across these areas, teams can significantly reduce complexity, and prevent downtime.
-
Mastering Impact Analysis and Optimizing Change Release Processes
Dynamic IT professional with a proven track record in optimizing production processes and analyzing outages in complex systems handling millions of TPS. The recent CrowdStrike outage highlights the importance of continuous improvement and adherence to best practices. Passionate about elevating operational excellence through strategic reviews and effective process enhancements.
-
How Platform and Site Reliability Engineering Are Evolving DevOps
Companies are now looking to grow and more effectively manage DevOps with platform engineering and site reliability engineering roles. No one has these roles perfectly carved out right now — there’s just too much to do and not enough people to do it — but knowing where these three disciplines do and don’t overlap will help organizations evolve and take advantage when they are ready.
-
Data-Driven Decision Making - Software Delivery Performance Indicators at Different Granularities
Optimizing a software delivery organization is not a straightforward process standardized in the software industry. Getting the organization to analyze the data and act on it is a difficult undertaking. This article presents insights into how a socio-technical framework for optimizing a software delivery organization has been set up and brought to the point of regular use.
-
AIOps: Site Reliability Engineering at Scale
AIOps can simplify and streamline processes which can reduce the mental burden on employees while improving communication and collaboration between departments.