InfoQ Homepage Distributed Systems Content on InfoQ

News

RSS Feed

Newer Older

Architecture & Design

Pinterest Reduces Spark OOM Failures by 96% through Auto Memory Retries

Pinterest Engineering cut Apache Spark out-of-memory failures by 96% using improved observability, configuration tuning, and automatic memory retries. Staged rollout, dashboards, and proactive memory adjustments stabilized data pipelines, reduced manual intervention, and lowered operational overhead across tens of thousands of daily jobs.

Leela Kumili
on Apr 06, 2026
Development

Discord Engineers Add Distributed Tracing to Elixir's Actor Model without Performance Penalty

Discord engineering detailed how they added distributed tracing to Elixir's actor model. Their custom Transport library wraps messages with trace context and uses dynamic sampling to handle million-user fanouts. CPU optimizations included skipping unsampled traces and filtering context before deserialization, recovering 10+ percentage points of overhead.

Steef-Jan Wiggers
on Mar 28, 2026
Architecture & Design

Inside Agoda’s Storefront: a Latency-Aware Reverse Proxy for Improving DNS Based Load Distribution

Agoda engineers developed Storefront, a Rust-based S3-compatible reverse proxy that improves load balancing, request routing, and observability across large-scale object storage systems. The proxy addresses DNS-based distribution limitations, implements latency-aware routing, cross-data-center optimizations, IO safeguards, credential-less authentication, and exposes telemetry via OpenTelemetry.

Leela Kumili
on Mar 27, 2026
Architecture & Design

Inside Netflix’s Graph Abstraction: Handling 650TB of Graph Data in Milliseconds Globally

Netflix engineers built Graph Abstraction, a high-throughput platform managing 650 TB of graph data with millisecond latency. Supporting services from Netflix Gaming’s social graphs to operational topology graphs, it maintains global availability via asynchronous replication. This article covers its architecture, caching, and traversal design for high-scale performance.

Leela Kumili
on Mar 23, 2026
Architecture & Design

From Minutes to Seconds: Uber Boosts MySQL Cluster Uptime with Consensus Architecture

Uber redesigned its MySQL fleet using a consensus-driven architecture based on MySQL Group Replication, reducing cluster failover time from minutes to seconds. By moving leader election and failure detection into the database layer, Uber improved availability, simplified external orchestration, and strengthened consistency across thousands of production clusters.

Leela Kumili
on Mar 11, 2026
Architecture & Design

Hybrid Cloud Data at Uber: How Engineers Solved Extreme-Scale Replication Challenges

Uber’s HiveSync team optimized Hadoop Distcp to handle multi-petabyte replication across hybrid cloud and on-premise data lakes. Enhancements include task parallelization, Uber jobs for small transfers, and improved observability, enabling 5x replication capacity and seamless on-premise-to-cloud migration.

Leela Kumili
on Mar 02, 2026
Architecture & Design

Uforwarder: Uber’s Scalable Kafka Consumer Proxy for Efficient Event-Driven Microservices

Uber has open-sourced uForwarder, a push-based Kafka consumer proxy built to handle trillions of messages and multiple petabytes of data daily. The system introduces context-aware routing, head-of-line blocking mitigation, adaptive auto-rebalancing, and partition-level delay processing to improve scalability, workload isolation, and hardware efficiency in large-scale event-driven microservices.

Leela Kumili
on Feb 23, 2026
Architecture & Design

How Dropbox Built a Scalable Context Engine for Enterprise Knowledge Search

Dropbox engineers have detailed how the company built the context engine behind Dropbox Dash, revealing a shift toward index-based retrieval, knowledge graph-derived context, and continuous evaluation to support enterprise AI at scale.

Matt Foster
on Feb 18, 2026
Architecture & Design

Uber and OpenAI Retool Rate Limiting Systems

Uber and OpenAI are replacing static rate limits with adaptive, infrastructure-level platforms. Uber’s Global Rate Limiter utilizes probabilistic shedding to manage 80M RPS, while OpenAI’s Access Engine implements a credit waterfall to prevent user interruptions. Both architectures utilize distributed enforcement and soft controls to maintain system stability and service continuity at scale.

Patrick Farry
on Feb 17, 2026
Architecture & Design

GitHub Reworks Layered Defenses after Legacy Protections Block Legitimate Traffic

GitHub engineers recently traced user reports of unexpected “Too Many Requests” errors to abuse-mitigation rules that had accidentally remained active long after the incidents that prompted them.

Matt Foster
on Feb 04, 2026
Development

Cloudflare Open Sources tokio‑quiche, Promising Easier QUIC and HTTP/3 in Rust

Cloudflare has open-sourced tokio-quiche, an asynchronous QUIC and HTTP/3 Rust library that wraps its battle-tested quiche implementation with the Tokio runtime to simplify the development of high-performance QUIC applications. The library was used internally to back the edge services, the Oxy HTTP proxies or MASQUE-based tunnels replacing the Wireguard-based tunnels in the WARP client.

Olimpiu Pop
on Dec 27, 2025
Architecture & Design

Benchmarking beyond the Application Layer: How Uber Evaluates Infrastructure Changes and Cloud Skus

Uber’s Ceilometer framework automates infrastructure performance benchmarking beyond applications. It standardizes testing across servers, workloads, and cloud SKUs, helping teams validate changes, identify regressions, and optimize resources. Future plans include AI integration, anomaly detection, and continuous validation.

Leela Kumili
on Dec 26, 2025
Architecture & Design

From On-Demand to Live : Netflix Streaming to 100 Million Devices in under 1 Minute

Netflix’s global live streaming platform powers millions of viewers with cloud-based ingest, custom live origin, Open Connect delivery, and real-time recommendations. This article explores the architecture, low-latency pipelines, adaptive bitrate streaming, and operational monitoring that ensure reliable, scalable, and synchronized live event experiences worldwide.

Leela Kumili
on Dec 05, 2025
Architecture & Design

Stripe's Zero-Downtime Data Movement Platform Migrates Petabytes with Millisecond Traffic Switches

At QCon SF, a Stripe engineer presented the company's Zero-Downtime Data Movement Platform, a system enabling petabyte-scale database migrations with traffic switches that typically complete in milliseconds. The platform supports Stripe's infrastructure, handling 5 million database queries per second while maintaining 99.9995% reliability for $1.4 trillion in annual transactions.

Eran Stiller
on Nov 24, 2025
Architecture & Design

Netflix Tackles Data Deletion at Scale with Centralized Platform Architecture

Netflix engineers presented their architecture for a centralized data-deletion platform at QCon San Francisco, addressing a critical yet rarely discussed system design challenge. The platform manages deletion across heterogeneous data stores while balancing durability, availability, and correctness, processing 76.8 billion row deletions across 1,300 datasets with zero data loss incidents.

Eran Stiller
on Nov 21, 2025

Newer News

Older News

InfoQ Software Architects' Newsletter

News