DevOps Content on InfoQ
-
The Mathematics of Backlogs: Capacity Planning for Queue Recovery
Backlogs in distributed systems are arithmetic problems, not mysteries. This article provides practical formulas for calculating backlog drain time, sizing consumer headroom, and setting auto-scaling triggers. It covers key failure modes — retry amplification, metastable states, and cascading pipeline bottlenecks — plus when to shed load instead of draining.
-
Kernel-Level Ground Truth: Why eBPF is Replacing User-Space Agents for Security Observability
eBPF is emerging as a preferred method for security observability over traditional user-space agents. By attaching probes directly to the Linux kernel's syscall interface, it provides consistent visibility even during container-level compromises. eBPF reduces security-related CPU consumption and limits data volume by performing filtering at the kernel level, enhancing operational efficiency.
-
Local-First AI Inference: a Cloud Architecture Pattern for Cost-Effective Document Processing
The Local-First AI Inference pattern routes 70–80% of documents to deterministic local extraction at zero API cost, reserving Azure OpenAI calls for edge cases and flagging low-confidence results for human review. Deployed on 4,700 engineering drawing PDFs, it cut API costs by 75% and processing time by 55%, while bounding errors through a human review tier.
-
Three Pillars of Platform Engineering: a Virtuous Cycle
Platform engineering succeeds when reliability and ergonomics reinforce each other rather than compete. This article explores three foundational pillars: automated reliability, developer ergonomics, and operator ergonomics. Together, they establish a virtuous cycle that strengthens system stability, reduces operational burden, and empowers teams to scale infrastructure with confidence.
-
Securing Autonomous AI Agents on Kubernetes: Trust Boundaries, Secrets, and Observability for a New Category of Cloud Workload
Autonomous AI agents break Kubernetes security assumptions with dynamic dependencies, multi-domain credentials, and unpredictable resource use. This article covers production-tested patterns: Job-based isolation, Vault for scoped short-lived credentials, a four-phase trust model from shadow mode to autonomous operation, and observability for non-deterministic reasoning cycles.
-
When a Cloud Region Fails: Rethinking High Availability in a Geopolitically Unstable World
Sovereign fault domains are failure boundaries defined by legal, political, or physical jurisdiction rather than hardware topology. The article maps geopolitical events to known distributed-systems failure modes, argues that multi-region should replace multi-AZ as the HA baseline for systems crossing jurisdictions, and outlines design patterns, chaos experiments, and an annualized loss expectancy (ALE) model to justify the spend.
-
Using AWS Lambda Extensions to Run Post-Response Telemetry Flush
At Lead Bank, synchronous telemetry flushing turned intermittent exporter stalls into user-facing 504 gateway timeouts. Using AWS Lambda's Extensions API and goroutine chaining in Go, the team moved flush work off the response path, so responses return immediately while full observability is preserved without telemetry loss.
-
Beyond One-Click: Designing an Enterprise-Grade Observability Extension for Docker
Docker Extensions boost developer speed but create a "visibility gap" by isolating telemetry. To meet enterprise needs, extensions must act as bridges to centralized platforms. This article details how to use OpenTelemetry, policy-as-code, and encryption to build secure pipelines. Learn to balance developer productivity with the governance required for scalable, compliant observability.
-
Event-Driven Patterns for Cloud-Native Banking: Lessons from What Works and What Hurts
Event-driven architecture helps banks decouple systems, scale services, and create clear activity trails. But it also introduces complexity, new failure modes, and operational challenges. Chris Tacey-Green explains where it adds value in banking systems and the practical patterns, such as inbox/outbox and stable event contracts, needed to make it reliable.
-
Configuration as a Control Plane: Designing for Safety and Reliability at Scale
Configuration has evolved from static deployment files into a live control plane that directly shapes system behavior. The evolution of configuration management highlights why misconfigurations can trigger large outages and how hyperscalers deploy changes safely using staged rollouts, validation, blast radius limits, and automated rollback at scale.
-
Change as Metrics: Measuring System Reliability through Change Delivery Signals
System changes are the primary driver of production incidents, making change-related metrics essential reliability signals. A minimal metric set of Change Lead Time, Change Success Rate, and Incident Leakage Rate assesses delivery efficiency and reliability, supported by actionable technical metrics and an event-centric data warehouse for unified change observability.
-
Proactive Autoscaling for Edge Applications in Kubernetes
Kubernetes often reacts too late when traffic suddenly increases at the edge. A proactive scaling approach that considers response time, spare CPU capacity, and container startup delays can add or remove instances more smoothly, prevent sudden spikes, and keep performance stable on systems with limited resources.