InfoQ Homepage DevOps Content on InfoQ

Articles

RSS Feed

Newer Older

Cloud

Trade-Offs in Multi-Region Architectures: Latency vs. Cost

Adding cloud regions changes latency and cost in ways simple math can't capture. This article presents a framework from multiple launches: decompose your latency budget before committing to infrastructure, choose deployment patterns by consistency and traffic profile, and optimize before expanding. A phased approach cut latency 35% through routing alone, before a new region brought it under 60ms.

Uttara Asthana
on Jul 10, 2026
Cloud

Designing Continuous Authorization for Sensitive Cloud Systems

Most cloud systems make one authorization decision at login. Everything after runs on trust established at authentication time. For systems handling regulated data, that gap is where breaches happen. This article presents a continuous authorization architecture covering risk-tiered evaluation, behavioral baselines, privacy-preserving audit trails, and a phased and incremental rollout.

Venkata Nedunoori
on Jun 19, 2026
Development

The Technology Adoption Curve, Twenty Years On

Today, June 8th, InfoQ celebrates 20 years. This is not a comprehensive history, but a deliberately selective look at the technologies and practices InfoQ identified early, where they sit on the adoption curve in 2026, and how that curve may evolve over the next five to ten years.

Renato Losio Dio Synodinos
on Jun 08, 2026
DevOps

Article Series: Securing the AI Stack: from Model to Production

This series provides your roadmap for the machine age, exploring how to move from vulnerable prototypes to resilient systems through layered defense, robust MLOps, and integrated governance.

Claudio Masolo
on Jun 05, 2026
Cloud

Two Misconfigurations That Caused Spark OOM Failures on Kubernetes

After migrating Spark pipelines to Azure Kubernetes Service, two infrastructure settings interacted destructively: spark.kubernetes.local.dirs.tmpfs=true backed shuffle spill with RAM instead of disk, and a hard podAffinity rule forced all executors onto one node. Together, they caused repeated OOM kills invisible to standard diagnostics.

Pranav Bhasker
on Jun 03, 2026
Cloud

Stragglers, Not Failures: How Adaptive Hedged Requests Reduce p99 Latency by 74 Percent

In fan-out microservice architectures, slow-but-completing requests accumulate across services and drive p99 latency far higher than per-service metrics suggest. This article presents an adaptive hedging mechanism that uses DDSketch for real-time quantile estimation, windowed rotation to handle distribution drift, and a token-bucket budget to prevent load amplification.

Prathamesh Bhope
on May 28, 2026
DevOps

The Mathematics of Backlogs: Capacity Planning for Queue Recovery

Backlogs in distributed systems are arithmetic problems, not mysteries. This article provides practical formulas for calculating backlog drain time, sizing consumer headroom, and setting auto-scaling triggers. It covers key failure modes — retry amplification, metastable states, and cascading pipeline bottlenecks — plus when to shed load instead of draining.

Rajesh Kumar Pandey
on May 21, 2026
DevOps

Kernel-Level Ground Truth: Why eBPF is Replacing User-Space Agents for Security Observability

eBPF is emerging as a preferred method for security observability over traditional user-space agents. By attaching probes directly to the Linux kernel's syscall interface, it provides consistent visibility even during container-level compromises. eBPF reduces security-related CPU consumption and limits data volume by performing filtering at the kernel level, enhancing operational efficiency.

Niranjan Sharma
on May 19, 2026
Cloud

Local-First AI Inference: a Cloud Architecture Pattern for Cost-Effective Document Processing

The Local-First AI Inference pattern routes 70–80% of documents to deterministic local extraction at zero API cost, reserving Azure OpenAI calls for edge cases and flagging low-confidence results for human review. Deployed on 4,700 engineering drawing PDFs, it cut API costs by 75% and processing time by 55%, while bounding errors through a human review tier.

Obinna Iheanachor
on May 11, 2026
DevOps

Three Pillars of Platform Engineering: a Virtuous Cycle

Platform engineering succeeds when reliability and ergonomics reinforce each other rather than compete. This article explores three foundational pillars: automated reliability, developer ergonomics, and operator ergonomics. Together, they establish a virtuous cycle that strengthens system stability, reduces operational burden, and empowers teams to scale infrastructure with confidence.

Pratik Agarwal
on May 05, 2026
Cloud

Securing Autonomous AI Agents on Kubernetes: Trust Boundaries, Secrets, and Observability for a New Category of Cloud Workload

Autonomous AI agents break Kubernetes security assumptions with dynamic dependencies, multi-domain credentials, and unpredictable resource use. This article covers production-tested patterns: Job-based isolation, Vault for scoped short-lived credentials, a four-phase trust model from shadow mode to autonomous operation, and observability for non-deterministic reasoning cycles.

Nik Kale
on May 01, 2026
Cloud

When a Cloud Region Fails: Rethinking High Availability in a Geopolitically Unstable World

Sovereign fault domains are failure boundaries defined by legal, political, or physical jurisdiction rather than hardware topology. The article maps geopolitical events to known distributed-systems failure modes, argues multi-region should replace multi-AZ as the HA baseline for systems crossing jurisdictions, and outlines design patterns, chaos experiments, and an ALE model to justify the spend.

Rohan Vardhan
on Apr 22, 2026

Newer Articles

Older Articles

InfoQ Software Architects' Newsletter

Articles