DevOps Content on InfoQ

Articles

Cloud

When a Cloud Region Fails: Rethinking High Availability in a Geopolitically Unstable World

Sovereign fault domains are failure boundaries defined by legal, political, or physical jurisdiction rather than hardware topology. The article maps geopolitical events to known distributed-systems failure modes, argues that multi-region should replace multi-AZ as the HA baseline for systems that cross jurisdictions, and outlines design patterns, chaos experiments, and an annualized loss expectancy (ALE) model to justify the spend.

Rohan Vardhan
on Apr 22, 2026
Cloud

Using AWS Lambda Extensions to Run Post-Response Telemetry Flush

At Lead Bank, synchronous telemetry flushing turned intermittent exporter stalls into user-facing 504 gateway timeouts. Moving flush work off the response path with AWS Lambda's Extensions API and goroutine chaining in Go lets the function return responses immediately while preserving full observability without telemetry loss.

Melvin Philips
on Apr 15, 2026
DevOps

Beyond One-Click: Designing an Enterprise-Grade Observability Extension for Docker

Docker Extensions boost developer speed but create a "visibility gap" by isolating telemetry. To meet enterprise needs, extensions must act as bridges to centralized platforms. This article details how to use OpenTelemetry, policy-as-code, and encryption to build secure pipelines. Learn to balance developer productivity with the governance required for scalable, compliant observability.

Pragya Keshap
on Apr 14, 2026
Architecture & Design

Event-Driven Patterns for Cloud-Native Banking: Lessons from What Works and What Hurts

Event-driven architecture helps banks decouple systems, scale services, and create clear activity trails. But it also introduces complexity, new failure modes, and operational challenges. Chris Tacey-Green explains where it adds value in banking systems and the practical patterns, such as inbox/outbox and stable event contracts, needed to make it reliable.

Chris Tacey-Green
on Mar 31, 2026
DevOps

Configuration as a Control Plane: Designing for Safety and Reliability at Scale

Configuration has evolved from static deployment files into a live control plane that directly shapes system behavior. The evolution of configuration management highlights why misconfigurations can trigger large outages and how hyperscalers deploy changes safely using staged rollouts, validation, blast radius limits, and automated rollback at scale.

Karthiek Maralla
on Mar 20, 2026
DevOps

Change as Metrics: Measuring System Reliability through Change Delivery Signals

System changes are the primary driver of production incidents, making change-related metrics essential reliability signals. A minimal metric set of Change Lead Time, Change Success Rate, and Incident Leakage Rate assesses delivery efficiency and reliability, supported by actionable technical metrics and an event-centric data warehouse for unified change observability.

Peihao Yuan
on Mar 09, 2026
DevOps

Proactive Autoscaling for Edge Applications in Kubernetes

Kubernetes often reacts too late when traffic suddenly increases at the edge. A proactive scaling approach that considers response time, spare CPU capacity, and container startup delays can add or remove instances more smoothly, prevent sudden spikes, and keep performance stable on systems with limited resources.

Rajeev Kallayil Ravi
on Feb 17, 2026
AI, ML & Data Engineering

Architecting Agentic MLOps: a Layered Protocol Strategy with A2A and MCP

In this article, the authors outline protocols for building extensible multi-agent MLOps systems. The core architecture deliberately decouples orchestration from execution, allowing teams to incrementally add capabilities via discovery and evolve operations from static pipelines toward intelligent, adaptive coordination.

Shashank Kapoor, Sanjay Surendranath Girija, Lakshit Arora
on Feb 16, 2026
DevOps

From Alert Fatigue to Agent-Assisted Intelligent Observability

As systems grow, observability becomes harder to maintain and incidents harder to diagnose. Agentic observability layers AI on existing tools, starting in read-only mode to detect anomalies and summarize issues. Over time, agents add context, correlate signals, and automate low-risk tasks. This approach frees engineers to focus on analysis and judgment.

Rohit Dhawan
on Feb 04, 2026
Development

One Cache to Rule Them All: Handling Responses and In-Flight Requests with Durable Objects

Traditional caching fails to stop "thundering herds", in which multiple clients trigger the same work during a cache miss. This article proposes using Cloudflare Durable Objects to treat in-flight work and finished results as two states of one cache entry. By routing requests to a single owner, systems eliminate redundant work. The pattern replaces complex locks with simple promises, which simplifies the system design.

Gabor Koos
on Jan 28, 2026
DevOps

Preventing Data Exfiltration: a Practical Implementation of VPC Service Controls at Enterprise Scale in Google Cloud Platform

Implementing VPC Service Controls is more about people and process than technology. Organizations must conduct extensive upfront discovery, use phased rollouts to avoid breaking production systems, and design VPC Service Controls that enable rather than block work. Success requires automation, clear exception processes, tracking both security and business metrics, and continuous improvement.

Shijin Nair
on Jan 19, 2026
DevOps

Platform-as-a-Product: Declarative Infrastructure for Developer Velocity

Declarative infrastructure config hides complexity, enabling developers to focus on application code. Unified YAML per service allows early cost validation, while independent CI with centralized CD balances team autonomy and deployment consistency. This standardized approach scales across organizations, making infrastructure invisible and operations automatic.

Avinash Sabat
on Jan 14, 2026
