Agile Content on InfoQ
-
A Better Alternative to Reducing CI Regression Test Suite Sizes
How can you focus in a sea of results from a large regression test suite? This article describes a stochastic approach that relies on some degree of redundancy in your CI regression test suite. The approach does not guarantee that you will catch every bug every time, but it gives you the best chance of not missing the subtle signatures of the bugs that your CI regression test runs uncover.
-
Event-Driven Patterns for Cloud-Native Banking: Lessons from What Works and What Hurts
Event-driven architecture helps banks decouple systems, scale services, and create clear activity trails. But it also introduces complexity, new failure modes, and operational challenges. Chris Tacey-Green explains where it adds value in banking systems and the practical patterns, such as inbox/outbox and stable event contracts, needed to make it reliable.
-
Configuration as a Control Plane: Designing for Safety and Reliability at Scale
Configuration has evolved from static deployment files into a live control plane that directly shapes system behavior. The evolution of configuration management highlights why misconfigurations can trigger large outages and how hyperscalers deploy changes safely using staged rollouts, validation, blast radius limits, and automated rollback at scale.
-
Change as Metrics: Measuring System Reliability through Change Delivery Signals
System changes are the primary driver of production incidents, making change-related metrics essential reliability signals. A minimal metric set of Change Lead Time, Change Success Rate, and Incident Leakage Rate assesses delivery efficiency and reliability, supported by actionable technical metrics and an event-centric data warehouse for unified change observability.
-
Virtual Panel - Culture, Code, and Platform: Building High-Performing Teams
In this virtual panel, we'll focus on performance improvement through platform engineering and fostering developer experience, to increase productivity, quality, developer well-being, and more. We'll also explore the role that tech leadership can play in culture change and performance improvement for software development organizations.
-
Building a Least-Privilege AI Agent Gateway for Infrastructure Automation with MCP, OPA, and Ephemeral Runners
This article presents a least-privilege AI Agent Gateway that places clear controls between AI agents and infrastructure. Agents do not access infrastructure APIs directly. Instead, every request is validated, authorized using policy as code with Open Policy Agent (OPA), and executed in short-lived, isolated environments, with built-in observability using OpenTelemetry.
-
Proactive Autoscaling for Edge Applications in Kubernetes
Kubernetes often reacts too late when traffic suddenly increases at the edge. A proactive scaling approach that considers response time, spare CPU capacity, and container startup delays can add or remove instances more smoothly, prevent sudden spikes, and keep performance stable on systems with limited resources.
-
The Friction Fix: Change What Matters
Friction is the invisible current that sinks every transformation. Friction isn’t one thing – it’s systemic. It arises in relationships: between people, teams, and technology. The fix isn’t Kubernetes, the Cloud, or AI. The fix is changing our patterns of thinking, communicating, and organizing.
-
Platform-as-a-Product: Declarative Infrastructure for Developer Velocity
Declarative infrastructure config hides complexity, enabling developers to focus on application code. Unified YAML per service allows early cost validation, while independent CI with centralized CD balances team autonomy and deployment consistency. This standardized approach scales across organizations, making infrastructure invisible and operations automatic.
-
Stop Guessing, Start Improving: Using DORA Metrics and Process Behavior Charts
Delivery performance rarely changes in a straight line. Small degradations caused by tooling, environment instability, or team changes can accumulate quietly, while real improvements take time to emerge. This article shows how combining DORA metrics with Process Behavior Charts helps teams zoom out, detect meaningful shifts early, and validate improvement hypotheses.
-
Overload Protection: the Missing Pillar of Platform Engineering
Overload protection is often overlooked in platform engineering, leaving teams to create inconsistent, fragile fixes. Centralized rate limits, quotas, adaptive controls, and clear visibility give services predictable ways to handle traffic spikes, reduce reliability debt, and prevent cascading failures across systems.
-
Holistic Engineering: Organic Problem Solving for Complex Evolving Systems
Late projects. Architectures that drift from their original design. Code that mysteriously evolves into something nobody planned. These persistent problems in software development often stem not from technical failures, but from forces we pretend don't exist—reward systems that incentivize the wrong behaviors, organizational structures that ignore domain boundaries, and human dynamics.