Articles

Cloud

Two Misconfigurations That Caused Spark OOM Failures on Kubernetes

After migrating Spark pipelines to Azure Kubernetes Service, two infrastructure settings interacted destructively: spark.kubernetes.local.dirs.tmpfs=true backed shuffle spill with RAM instead of disk, and a hard podAffinity rule forced all executors onto one node. Together, they caused repeated OOM kills invisible to standard diagnostics.

Pranav Bhasker
on Jun 03, 2026
AI, ML & Data Engineering

Why Vector Search Alone Isn't Enough: Hybrid Retrieval for RAG

In this article, author Aaditya Chauhan discusses the limitations of RAG pipelines based purely on vector search and shows how an internal omni-search application that uses Reciprocal Rank Fusion (RRF) to combine BM25 and vector results can improve retrieval.

Aaditya Chauhan
on Jun 02, 2026
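
To illustrate the fusion step named in the summary above, here is a minimal sketch of Reciprocal Rank Fusion in Python. It is not taken from the article: the function name, the document IDs, and the smoothing constant k=60 (a commonly used default) are assumptions for illustration only.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores 1 / (k + rank) for every list it appears in,
    with rank starting at 1. k=60 is an assumed default, not a value
    from the article.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical result lists from a keyword (BM25) and a vector retriever.
bm25_hits = ["doc_a", "doc_b", "doc_c"]
vector_hits = ["doc_c", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
print(fused)
```

Because RRF uses only ranks, not raw scores, the BM25 and vector similarity scales never need to be normalized against each other.
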
Web Development

The AI Productivity Paradox in Test Automation: Moving beyond Structural Validation to Perception and Intent

The AI productivity paradox states that AI scales whatever abstraction it is built on. If that abstraction is structurally brittle, it scales structural brittleness. This article shows that to build a future of reliable, AI-driven test automation, we must stop scaling DOM-centric abstractions and build a new testing paradigm grounded in perception and intent.

Amanul Chowdhury, Vinay Gummadavelli
on Jun 01, 2026
Cloud

Stragglers, Not Failures: How Adaptive Hedged Requests Reduce p99 Latency by 74 Percent

In fan-out microservice architectures, slow-but-completing requests accumulate across services and drive p99 latency far higher than per-service metrics suggest. This article presents an adaptive hedging mechanism that uses DDSketch for real-time quantile estimation, windowed rotation to handle distribution drift, and a token-bucket budget to prevent load amplification.

Prathamesh Bhope
on May 28, 2026
AI, ML & Data Engineering

Architecting Cloud-Native Kafka: from Tiered Storage towards a Diskless Future

This article explores Kafka's transition toward a cloud-native architecture, examining how tiered storage, FinOps telemetry, elastic consumer scaling, virtual clusters, and Share Groups reshape the operational and economic model of event streaming platforms. It also analyzes emerging diskless-storage proposals and their architectural trade-offs.

Viquar Khan
on May 26, 2026
Java

The Schema Proliferation Problem in Kafka and Flink Pipelines: How to Solve It

Schema proliferation builds slowly and gets expensive fast. One schema per event type feels right until there are ten tables, union queries spanning all of them, and a single field rename touching every schema. Discriminator-based schema consolidation collapses that to two tables, turning multi-table unions into a single query, while new variants are additive and don't break existing consumers.

Spoorthi Basu
on May 25, 2026
DevOps

The Mathematics of Backlogs: Capacity Planning for Queue Recovery

Backlogs in distributed systems are arithmetic problems, not mysteries. This article provides practical formulas for calculating backlog drain time, sizing consumer headroom, and setting auto-scaling triggers. It covers key failure modes — retry amplification, metastable states, and cascading pipeline bottlenecks — plus when to shed load instead of draining.

Rajesh Kumar Pandey
on May 21, 2026
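
As a rough illustration of the drain-time arithmetic the summary above refers to, the sketch below assumes the simplest model: a backlog shrinks at the difference between the consume rate and the arrival rate. The function name and the numbers are hypothetical, and the article's own formulas may differ.

```python
import math


def drain_time_seconds(backlog, consume_rate, arrival_rate):
    """Estimate the seconds needed to drain a backlog.

    Assumes constant rates (messages per second). If consumers are not
    faster than producers, the backlog never drains.
    """
    net_rate = consume_rate - arrival_rate
    if net_rate <= 0:
        return math.inf
    return backlog / net_rate


# Hypothetical numbers: 120,000 queued messages, 500 msg/s consumed,
# 300 msg/s still arriving, so the backlog shrinks by 200 msg/s.
print(drain_time_seconds(120_000, 500, 300))
```

Under this model, the headroom (consume rate minus arrival rate) rather than raw consumer throughput determines recovery time, which is why sizing consumers exactly to the steady-state arrival rate leaves no capacity to recover.
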
DevOps

Kernel-Level Ground Truth: Why eBPF is Replacing User-Space Agents for Security Observability

eBPF is emerging as a preferred method for security observability over traditional user-space agents. By attaching probes directly to the Linux kernel's syscall interface, it provides consistent visibility even during container-level compromises. eBPF reduces security-related CPU consumption and limits data volume by performing filtering at the kernel level, enhancing operational efficiency.

Niranjan Sharma
on May 19, 2026
AI, ML & Data Engineering

Building a Secure MCP Server on AWS for a Million-Company B2B Platform

We wanted to expose a B2B intelligence platform built on more than one million company profiles to an LLM client through an MCP server so a user can ask “find SaaS companies in Germany with 50-200 employees” and receive results through the LLM client. The engineering problem was: how do you make that workflow useful without creating an unsafe bridge between an LLM and production data?

Shadi Elyafi
on May 18, 2026
AI, ML & Data Engineering

Time-Series Storage: Design Choices That Shape Cost and Performance

Every time-series database makes a set of storage design decisions: how to lay out rows, when to compress, what to partition on. These decisions determine cost and query performance more than the choice of database itself. This article works through those fundamentals from first principles, using widely available tools like PostgreSQL and Apache Parquet to make each trade-off measurable.

Nirmesh Khandelwal
on May 12, 2026
Cloud

Local-First AI Inference: a Cloud Architecture Pattern for Cost-Effective Document Processing

The Local-First AI Inference pattern routes 70–80% of documents to deterministic local extraction at zero API cost, reserving Azure OpenAI calls for edge cases and flagging low-confidence results for human review. Deployed on 4,700 engineering drawing PDFs, it cut API costs by 75% and processing time by 55%, while bounding errors through a human review tier.

Obinna Iheanachor
on May 11, 2026
.NET

Implementing the Sidecar Pattern in Microservices-Based ASP.NET Core Applications

Today's applications require monitoring, logging, configuration, etc. Each of these concerns can be implemented as a component or a service. These cross-cutting concerns can be tightly integrated into the application. While this tight coupling ensures effective use of shared resources, an outage in any of these components can take your application down. Enter the sidecar design pattern.

Joydip Kanjilal
on May 08, 2026
