Cloud Computing Content on InfoQ

Articles

Cloud

Removing a Hidden Round Trip from a Multi-Region AWS API

When a series of regional outages forced a rethink of a multi-region AWS API, the team discovered that an obstacle to global failover was hiding in plain sight: a pre-flight discovery call, baked into every client session years earlier when it was the only available option. This article describes what it took to remove it, and what the rollout actually cost.

Suresh Gururajan
on Jul 13, 2026
Cloud

Designing Continuous Authorization for Sensitive Cloud Systems

Most cloud systems make one authorization decision at login. Everything after runs on trust established at authentication time. For systems handling regulated data, that gap is where breaches happen. This article presents a continuous authorization architecture covering risk-tiered evaluation, behavioral baselines, privacy-preserving audit trails, and a phased rollout.

Venkata Nedunoori
on Jun 19, 2026
AI, ML & Data Engineering

Governing AI in the Cloud: a Practical Guide for Architects

In this article, the author outlines a practical approach to AI governance in the cloud, covering discovery of shadow AI, data classification at creation, IAM-based enforcement, policy-as-code, and operational controls. The article shows how organizations can embed governance into delivery pipelines, balancing security, compliance, and developer productivity without relying on manual processes.

Dave Ward
on Jun 15, 2026
Development

The Technology Adoption Curve, Twenty Years On

Today, June 8th, InfoQ celebrates 20 years. This is not a comprehensive history, but a deliberately selective look at the technologies and practices InfoQ identified early, where they sit on the adoption curve in 2026, and how that curve may evolve over the next five to ten years.

Renato Losio, Dio Synodinos
on Jun 08, 2026
Cloud

Two Misconfigurations That Caused Spark OOM Failures on Kubernetes

After migrating Spark pipelines to Azure Kubernetes Service, two infrastructure settings interacted destructively: spark.kubernetes.local.dirs.tmpfs=true backed shuffle spill with RAM instead of disk, and a hard podAffinity rule forced all executors onto one node. Together, they caused repeated OOM kills invisible to standard diagnostics.

Pranav Bhasker
on Jun 03, 2026
AI, ML & Data Engineering

Architecting Cloud-Native Kafka: from Tiered Storage towards a Diskless Future

This article explores Kafka's transition toward a cloud-native architecture, examining how tiered storage, FinOps telemetry, elastic consumer scaling, virtual clusters, and Share Groups reshape the operational and economic model of event streaming platforms. It also analyzes emerging diskless-storage proposals and their architectural trade-offs.

Viquar Khan
on May 26, 2026
AI, ML & Data Engineering

Building a Secure MCP Server on AWS for a Million-Company B2B Platform

We wanted to expose a B2B intelligence platform built on more than one million company profiles to an LLM client through an MCP server so a user can ask “find SaaS companies in Germany with 50-200 employees” and receive results through the LLM client. The engineering problem was: how do you make that workflow useful without creating an unsafe bridge between an LLM and production data?

Shadi Elyafi
on May 18, 2026
Cloud

Local-First AI Inference: a Cloud Architecture Pattern for Cost-Effective Document Processing

The Local-First AI Inference pattern routes 70–80% of documents to deterministic local extraction at zero API cost, reserving Azure OpenAI calls for edge cases and flagging low-confidence results for human review. Deployed on 4,700 engineering drawing PDFs, it cut API costs by 75% and processing time by 55%, while bounding errors through a human review tier.

Obinna Iheanachor
on May 11, 2026
Cloud

Securing Autonomous AI Agents on Kubernetes: Trust Boundaries, Secrets, and Observability for a New Category of Cloud Workload

Autonomous AI agents break Kubernetes security assumptions with dynamic dependencies, multi-domain credentials, and unpredictable resource use. This article covers production-tested patterns: Job-based isolation, Vault for scoped short-lived credentials, a four-phase trust model from shadow mode to autonomous operation, and observability for non-deterministic reasoning cycles.

Nik Kale
on May 01, 2026
Web Development

The DPoP Storage Paradox: Why Browser-Based Proof-of-Possession Remains an Unsolved Problem

DPoP closes a real gap in OAuth 2.0. Sender-constrained tokens are a meaningful upgrade over bearer tokens for any client that can implement them. But RFC 9449's silence on browser key storage leaves each team with an architectural decision it must confront deliberately: there is no safe default that works everywhere.

Dhruv Agnihotri
on Apr 30, 2026
Cloud

Using AWS Lambda Extensions to Run Post-Response Telemetry Flush

At Lead Bank, synchronous telemetry flushing turned intermittent exporter stalls into user-facing 504 gateway timeouts. Using AWS Lambda's Extensions API and goroutine chaining in Go, the team moved flush work off the response path, so responses return immediately while full observability is preserved without telemetry loss.

Melvin Philips
on Apr 15, 2026
DevOps

Beyond One-Click: Designing an Enterprise-Grade Observability Extension for Docker

Docker Extensions boost developer speed but create a "visibility gap" by isolating telemetry. To meet enterprise needs, extensions must act as bridges to centralized platforms. This article details how to use OpenTelemetry, policy-as-code, and encryption to build secure pipelines. Learn to balance developer productivity with the governance required for scalable, compliant observability.

Pragya Keshap
on Apr 14, 2026
