Distributed Systems Content on InfoQ
-
Inside Agoda’s Storefront: A Latency-Aware Reverse Proxy for Improving DNS Based Load Distribution
Agoda engineers developed Storefront, a Rust-based S3-compatible reverse proxy that improves load balancing, request routing, and observability across large-scale object storage systems. The proxy addresses DNS-based distribution limitations, implements latency-aware routing, cross-data-center optimizations, IO safeguards, credential-less authentication, and exposes telemetry via OpenTelemetry.
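
As an illustration of what latency-aware routing can look like, here is a minimal Python sketch. It is not Storefront's implementation, which the summary does not describe: the class name `LatencyAwarePicker`, the exponentially weighted moving average, and the smoothing factor `alpha` are all assumptions made for this example.

```python
class LatencyAwarePicker:
    """Hypothetical helper: route to the backend with the lowest smoothed latency."""

    def __init__(self, backends, alpha=0.2):
        self.alpha = alpha  # assumed smoothing factor; higher reacts faster
        self.ewma = {b: None for b in backends}  # None means "no sample yet"

    def record(self, backend, latency_ms):
        """Fold one observed latency into the backend's moving average."""
        prev = self.ewma[backend]
        if prev is None:
            self.ewma[backend] = latency_ms
        else:
            self.ewma[backend] = self.alpha * latency_ms + (1 - self.alpha) * prev

    def pick(self):
        """Prefer backends with no samples, then the lowest average latency."""
        for backend, value in self.ewma.items():
            if value is None:
                return backend
        return min(self.ewma, key=self.ewma.get)


picker = LatencyAwarePicker(["storage-a", "storage-b"])
picker.record("storage-a", 100.0)
picker.record("storage-b", 20.0)
chosen = picker.pick()  # storage-b currently has the lower average
```

Unlike DNS round-robin, a picker of this kind reacts to per-request feedback, so a backend that slows down gradually receives less traffic.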
-
Inside Netflix’s Graph Abstraction: Handling 650TB of Graph Data in Milliseconds Globally
Netflix engineers built Graph Abstraction, a high-throughput platform managing 650 TB of graph data with millisecond latency. Supporting services from Netflix Gaming’s social graphs to operational topology graphs, it maintains global availability via asynchronous replication. This article covers its architecture, caching, and traversal design for high-scale performance.
-
From Minutes to Seconds: Uber Boosts MySQL Cluster Uptime with Consensus Architecture
Uber redesigned its MySQL fleet using a consensus-driven architecture based on MySQL Group Replication, reducing cluster failover time from minutes to seconds. By moving leader election and failure detection into the database layer, Uber improved availability, simplified external orchestration, and strengthened consistency across thousands of production clusters.
-
Hybrid Cloud Data at Uber: How Engineers Solved Extreme-Scale Replication Challenges
Uber’s HiveSync team optimized Hadoop DistCp to handle multi-petabyte replication across hybrid cloud and on-premises data lakes. Enhancements include task parallelization, Uber jobs for small transfers, and improved observability, enabling 5x replication capacity and seamless on-premises-to-cloud migration.
-
uForwarder: Uber’s Scalable Kafka Consumer Proxy for Efficient Event-Driven Microservices
Uber has open-sourced uForwarder, a push-based Kafka consumer proxy built to handle trillions of messages and multiple petabytes of data daily. The system introduces context-aware routing, head-of-line blocking mitigation, adaptive auto-rebalancing, and partition-level delay processing to improve scalability, workload isolation, and hardware efficiency in large-scale event-driven microservices.
-
How Dropbox Built a Scalable Context Engine for Enterprise Knowledge Search
Dropbox engineers have detailed how the company built the context engine behind Dropbox Dash, revealing a shift toward index-based retrieval, knowledge graph-derived context, and continuous evaluation to support enterprise AI at scale.
-
Uber and OpenAI Retool Rate Limiting Systems
Uber and OpenAI are replacing static rate limits with adaptive, infrastructure-level platforms. Uber’s Global Rate Limiter uses probabilistic shedding to manage 80M RPS, while OpenAI’s Access Engine implements a credit waterfall to prevent user interruptions. Both architectures rely on distributed enforcement and soft controls to maintain system stability and service continuity at scale.
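
To make the idea of probabilistic shedding concrete, the following Python sketch drops a fraction of requests proportional to how far the observed load exceeds capacity. It is a generic illustration under stated assumptions, not Uber's Global Rate Limiter: the function names, the formula `1 - capacity / load`, and the injectable `rng` parameter are hypothetical.

```python
import random


def drop_probability(observed_rps: float, capacity_rps: float) -> float:
    """Fraction of requests to shed so the admitted rate is about capacity."""
    if observed_rps <= 0 or observed_rps <= capacity_rps:
        return 0.0  # at or under capacity: admit everything
    return 1.0 - capacity_rps / observed_rps


def should_admit(observed_rps: float, capacity_rps: float, rng=random.random) -> bool:
    """Admit a request unless a random draw falls inside the shed fraction."""
    return rng() >= drop_probability(observed_rps, capacity_rps)


# At twice the capacity, roughly half of all requests are shed.
shed_fraction = drop_probability(observed_rps=200.0, capacity_rps=100.0)
```

Compared with a hard cutoff, shedding by probability degrades service gradually as load rises instead of rejecting every request past a fixed threshold.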
-
GitHub Reworks Layered Defenses after Legacy Protections Block Legitimate Traffic
GitHub engineers recently traced user reports of unexpected “Too Many Requests” errors to abuse-mitigation rules that had accidentally remained active long after the incidents that prompted them.
-
Cloudflare Open Sources tokio-quiche, Promising Easier QUIC and HTTP/3 in Rust
Cloudflare has open-sourced tokio-quiche, an asynchronous QUIC and HTTP/3 Rust library that wraps its battle-tested quiche implementation with the Tokio runtime to simplify the development of high-performance QUIC applications. The library has been used internally to back edge services, including the Oxy HTTP proxies and the MASQUE-based tunnels that replace the WireGuard-based tunnels in the WARP client.
-
Benchmarking beyond the Application Layer: How Uber Evaluates Infrastructure Changes and Cloud SKUs
Uber’s Ceilometer framework automates infrastructure performance benchmarking beyond applications. It standardizes testing across servers, workloads, and cloud SKUs, helping teams validate changes, identify regressions, and optimize resources. Future plans include AI integration, anomaly detection, and continuous validation.
-
From On-Demand to Live: Netflix Streaming to 100 Million Devices in under 1 Minute
Netflix’s global live streaming platform powers millions of viewers with cloud-based ingest, custom live origin, Open Connect delivery, and real-time recommendations. This article explores the architecture, low-latency pipelines, adaptive bitrate streaming, and operational monitoring that ensure reliable, scalable, and synchronized live event experiences worldwide.
-
Stripe's Zero-Downtime Data Movement Platform Migrates Petabytes with Millisecond Traffic Switches
At QCon SF, a Stripe engineer presented the company's Zero-Downtime Data Movement Platform, a system enabling petabyte-scale database migrations with traffic switches that typically complete in milliseconds. The platform supports Stripe's infrastructure, handling 5 million database queries per second while maintaining 99.9995% reliability for $1.4 trillion in annual transactions.
-
Netflix Tackles Data Deletion at Scale with Centralized Platform Architecture
Netflix engineers presented their architecture for a centralized data-deletion platform at QCon San Francisco, addressing a critical yet rarely discussed system design challenge. The platform manages deletion across heterogeneous data stores while balancing durability, availability, and correctness, processing 76.8 billion row deletions across 1,300 datasets with zero data loss incidents.
-
QCon SF: Database-Backed Workflow Orchestration Challenges Traditional Architecture
During QCon SF, Jeremy Edberg and Qian Li from DBOS presented an unconventional architectural approach to workflow orchestration: treating PostgreSQL not just as a data store, but as the orchestration layer itself. Their talk addressed a persistent problem in distributed systems: workflows frequently fail, recovery mechanisms are complex, and visibility into workflow state remains challenging.
-
QCon London 2026 Announces Tracks: AI Engineering, Building Teams, Tech of Finance, and More
The QCon London 2026 tracks are live: 15 practitioner-curated deep dives on AI adoption, resilient architectures, distributed systems, performance, modern languages, data, security, and Staff+ leadership, rooted in real production lessons.