Distributed Systems Content on InfoQ

News

Architecture & Design

Lessons on How to Get Timeouts, Retries and Idempotency Right from Sam Newman at QCon London

At QCon London, Sam Newman, the architect who has been credited with coining the term microservices, went back to basics to underline the three critical things to get right when working with distributed systems: timeouts, retries and idempotency. Throughout the talk, he presented mechanisms that make distributed systems more robust.

Olimpiu Pop
on Apr 09, 2025
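
How the three mechanisms named in the item above fit together can be illustrated with a minimal sketch. It is not taken from the talk: `PaymentService`, `TransientError`, `call_with_retries` and all parameter values are hypothetical names chosen for illustration, and the "timeout" is modelled as an overall deadline across attempts rather than a per-request socket timeout.

```python
import time


class TransientError(Exception):
    """Hypothetical error standing in for a timeout or dropped connection."""


class PaymentService:
    """Toy in-memory server that deduplicates requests by idempotency key."""

    def __init__(self, lost_responses=2):
        self.processed = {}  # idempotency key -> stored result
        self.charges = []    # side effects that were actually applied
        self._lost_responses = lost_responses

    def charge(self, key, amount):
        if key not in self.processed:
            # First time this key is seen: apply the side effect exactly once.
            self.charges.append(amount)
            self.processed[key] = {"status": "charged", "amount": amount}
        if self._lost_responses > 0:
            # Simulate the response being lost after the work was done.
            self._lost_responses -= 1
            raise TransientError("response lost")
        return self.processed[key]


def call_with_retries(fn, attempts=4, base_delay=0.01, deadline=1.0):
    """Retry fn with exponential backoff, bounded by attempts and a deadline."""
    start = time.monotonic()
    for attempt in range(attempts):
        try:
            return fn()
        except TransientError:
            out_of_time = time.monotonic() - start >= deadline
            if attempt == attempts - 1 or out_of_time:
                raise
            time.sleep(base_delay * 2 ** attempt)


service = PaymentService()
result = call_with_retries(lambda: service.charge("order-42", 100))
```

Because every retry reuses the same idempotency key, the simulated lost responses lead to repeated requests but only one recorded charge; without the key, each retry would risk applying the side effect again.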
Architecture & Design

Dapr Agents: Scalable AI Workflows with LLMs, Kubernetes & Multi-Agent Coordination

Introducing Dapr Agents, a framework for creating scalable AI agents using Large Language Models (LLMs). With robust workflows, multi-agent coordination, and cloud-neutral architecture, it enables enterprises to deploy thousands of resilient agents. Built on Dapr’s proven infrastructure, Dapr Agents ensures reliability and observability in AI-driven applications.

Eran Stiller
on Mar 20, 2025
Architecture & Design

How Monzo Bank Built a Cost-Effective, Unorthodox Backup System to Ensure Resilient Banking

Monzo Bank recently revealed Stand-in, an independent backup system on GCP that ensures essential banking services remain operational during application and AWS infrastructure outages. Unlike traditional backups, it's a minimal stand-alone system that exclusively supports key operations and features a cost-effective design, resulting in 1% of the operational costs of the primary deployment.

Eran Stiller
on Feb 24, 2025
AI, ML & Data Engineering

Distributed Multi-Modal Database Aerospike 8 Brings Support for Real-Time ACID Transactions

Aerospike has announced version 8.0 of its distributed multi-modal database, bringing support for distributed ACID transactions. This enables large-scale online transaction processing (OLTP) applications like banking, e-commerce, inventory management, health care, order processing, and more, says the company.

Sergio De Simone
on Feb 16, 2025
Architecture & Design

Inside Netflix’s Distributed Counter: Scalable, Accurate, and Real-Time Counting at Global Scale

Netflix engineers recently published a deep dive into their Distributed Counter Abstraction, a scalable service designed to track user interactions, feature usage, and business performance metrics with low latency. The system balances performance, accuracy, and cost through configurable counting modes, resilient data aggregation, and a globally distributed architecture.

Eran Stiller
on Dec 10, 2024
Cloud

Improving Distributed System Data Integrity with Amazon S3 Conditional Writes

AWS recently announced support for conditional writes in Amazon S3, allowing users to check for the existence of an object before creating it. This feature helps prevent existing objects from being overwritten when uploading data, making it easier for applications to manage data.

Steef-Jan Wiggers
on Aug 28, 2024
Architecture & Design

How Amazon Aurora Serverless Manages Resources and Scaling for Fleets of 10K+ Instances

AWS engineers published a paper describing the evolution and latest design of resource management and scaling for the Amazon Aurora Serverless platform. Aurora Serverless uses a combination of components at different levels to create a holistic approach for dynamically scaling and adjusting resources to satisfy the needs of customer workloads.

Rafal Gancarz
on Aug 23, 2024
DevOps

Uber Drives Apache Kafka's Tiered Storage Feature; Sparks Efficiency Debate

Apache Kafka, the popular distributed event streaming platform, has introduced a new tiered storage feature in version 3.6.0, initially proposed by Uber engineers. This feature, currently in early access, aims to address the scalability and efficiency challenges faced by organizations running large Kafka clusters.

Matt Saunders
on Aug 14, 2024
DevOps

Apache SkyWalking v10: Application Performance Monitoring Tool for Distributed Systems

The Apache Software Foundation has released version 10 of Apache SkyWalking, an open-source observability platform designed to provide comprehensive monitoring, tracing, and analytics for distributed systems. It includes many new features and enhancements...

Andrea Messetti
on Jun 12, 2024
Architecture & Design

ClickHouse Keeper: Efficient Apache ZooKeeper Alternative Created with C++ and Raft

The ClickHouse project team created an in-house replacement for Apache ZooKeeper because it needed a more efficient implementation that would also address some of ZooKeeper's shortcomings. ClickHouse Keeper is now an essential part of the ClickHouse project and a cornerstone of this open-source analytical database, but it can also be used independently for many distributed coordination use cases.

Rafal Gancarz
on Dec 01, 2023
Architecture & Design

How DoorDash Rearchitected its Cache to Improve Scalability and Performance

DoorDash rearchitected the heterogeneous caching systems used across its microservices into a common, multi-layered cache that provides a generic mechanism and solves a number of issues caused by fragmented caching.

Sergio De Simone
on Oct 28, 2023
DevOps

Disaster Recovery Across a Million Pieces: Michelle Brush at QCon San Francisco

During the second day of QCon San Francisco 2023, Michelle Brush, engineering director for SRE at Google, discussed challenges, patterns, and practices for disaster recovery in massively distributed systems. The session was part of the "Designing for Resilience" track.

Steef-Jan Wiggers
on Oct 04, 2023
Architecture & Design

LinkedIn's Open-Source "iris-message-processor" Achieves 86.6x Faster Escalation Management Speeds

LinkedIn developed a new open-source service called "iris-message-processor" to enhance the performance and reliability of its existing Iris escalation management system. "iris-message-processor" significantly improves processing speeds, being ~4.6x faster under average loads and ~86.6x faster under high loads than its predecessor.

Eran Stiller
on Sep 11, 2023
Architecture & Design

Pinterest Revamps Its Asynchronous Computing Platform with Kubernetes and Apache Helix

Pinterest created Pacer, its next-generation asynchronous computing platform, to replace Pinlater, the older solution that the company had outgrown and that was causing scalability and reliability challenges. The new architecture leverages Kubernetes for scheduling job-execution workers and Apache Helix for cluster management.

Rafal Gancarz
on Aug 21, 2023
Architecture & Design

Cadence 1.0: Uber Releases Its Scalable Workflow Orchestration Platform

Uber released a major version of its workflow orchestration platform named Cadence after six years in development. Uber and other companies use Cadence to build stateful services at scale using native programming languages.

Rafal Gancarz
on Aug 07, 2023
