Infrastructure Content on InfoQ

Articles

AI, ML & Data Engineering

Autonomous Big Data Optimization: Multi-Agent Reinforcement Learning to Achieve Self-Tuning Apache Spark

This article introduces a reinforcement learning (RL) approach grounded in Apache Spark that enables distributed computing systems to learn optimal configurations autonomously, much like an apprentice engineer who learns by doing. The author also implements a lightweight agent as a driver-side component that uses RL to choose configuration settings before a job runs.

Hina Gandhi
on Jan 30, 2026

Development

One Cache to Rule Them All: Handling Responses and In-Flight Requests with Durable Objects

Traditional caching fails to stop "thundering herds" where multiple clients trigger the same work during a miss. This article proposes using Cloudflare Durable Objects to treat in-flight work and finished results as two states of one cache entry. By routing to a single owner, systems eliminate redundant tasks. This pattern replaces complex locks with simple promises, simplifying the system design.

Gabor Koos
on Jan 28, 2026

AI, ML & Data Engineering

Reducing False Positives in Retrieval-Augmented Generation (RAG) Semantic Caching: a Banking Case Study

In this article, author Elakkiya Daivam discusses why Retrieval-Augmented Generation (RAG) and semantic caching techniques are powerful levers for reducing false positives in AI-powered applications. She shares insights from a production-grade evaluation of 1,000 query variations tested across seven bi-encoder models.

Elakkiya Daivam
on Nov 14, 2025

Architecture & Design

Building Resilient Platforms: Insights from over Twenty Years in Mission-Critical Infrastructure

Building resilient platforms requires understanding the art and science of creating infrastructure that others depend on for critical applications. This perspective applies to anyone who builds software consumed by others at scale, whether that means infrastructure platforms, software development platforms, or messaging systems.

Matthew Liste
on Nov 10, 2025

AI, ML & Data Engineering

Disaggregation in Large Language Models: the Next Evolution in AI Infrastructure

Large Language Model (LLM) inference faces a fundamental challenge: the same hardware that excels at processing input prompts struggles with generating responses, and vice versa. Disaggregated serving architectures solve this by separating these distinct computational phases, delivering throughput improvements and better resource utilization while reducing costs.

Anat Heilper
on Sep 29, 2025

Cloud

Ransomware-Resilient Storage: the New Frontline Defense in a High-Stakes Cyber Battle

Cybersecurity has evolved, with ransomware now primarily targeting data storage and backups. To combat this, modern defense strategies focus on making storage systems more resilient. Key tactics include using immutable storage that prevents data from being altered or deleted, employing AI-powered detection, and implementing air-gapping to create isolated, tamper-proof recovery points.

Arjun Mullick
on Aug 25, 2025

Cloud

Zero-Downtime Critical Cloud Infrastructure Upgrades at Scale

Engineers can avoid common pitfalls in large-scale infrastructure upgrades by studying others' experiences. The article provides lessons learned from big firms like eBay and Snowflake, offering solutions for legacy systems, performance validation, and rollback planning. It emphasizes systematic preparation and clear communication to handle challenges and ensure zero-downtime upgrades at scale.

Kiran Bhat
on Aug 18, 2025

Architecture & Design

One Network: Cloud-Agnostic Service and Policy-Oriented Network Architecture

Unifying software infrastructure shortens development time and makes large, distributed systems easier to control through clear policies. In this QCon SF 2024 presentation, Anna Berenberg shares what she learned and achieved while building One Network, addressing complex infrastructure layers, open-source integration, and uniform policy enforcement for improved reliability and security.

Anna Berenberg
on Aug 12, 2025

DevOps

Ceph RBD Turns 15: a Story of Open Source Creation

Fifteen years ago, Ceph RBD began as a community-driven idea that grew into essential infrastructure powering today's cloud platforms. This insider story from Yehuda Sadeh-Weinraub reveals how two developers started a distributed storage project that now supports OpenStack and Kubernetes through transparent, collaborative development.

Yehuda Sadeh-Weinraub
on Jul 07, 2025

DevOps

Analyzing Apache Kafka Stretch Clusters: WAN Disruptions, Failure Scenarios, and DR Strategies

This article analyzes the dynamics of Apache Kafka stretch clusters, assesses the impact of WAN disruptions, and outlines Disaster Recovery (DR) strategies. The authors focus on maintaining high availability and data integrity across multi-region deployments, and on improving operational resilience so that vital services avoid service level agreement violations.

Srikanth Daggumalli, Nishchai Jayanna Manjula
on Jun 20, 2025

Cloud

Designing Resilient Event-Driven Systems at Scale

Learn how to design resilient event-driven systems that scale. Explore key patterns like shuffle sharding and decoupling queues to handle load spikes and failures. Understand common pitfalls like over-relying on retries and neglecting observability for robust, scalable architectures.

Rajesh Kumar Pandey
on May 30, 2025

Development

Binary Size Matters: the Challenges of Fitting Complex Applications in Storage-Constrained Devices

This article explores developing software for microcontrollers in C or C++, where the main constraints are the limited amount of volatile memory and the embedded hardware platform on which the software runs. It shows how to adopt a language like C++ while optimizing for binary size under stringent hardware constraints, and how to trade off runtime efficiency against binary size in architecture decisions.

Paulo Martinez
on May 16, 2025
