Infrastructure Content on InfoQ
-
Swiggy Improves Search Autocomplete Using Real Time Machine Learning Ranking
Swiggy detailed a real-time machine-learning ranking system for autocomplete built on OpenSearch. The architecture separates candidate generation from ranking, uses feature stores for real-time signals, and applies learning-to-rank models for improved relevance. It replaces heuristic ranking while maintaining strict latency constraints and enabling continuous model updates from user behavior signals.
-
Cloudflare Optimizes Edge Stack for High-Core CPUs instead of Large Cache
Cloudflare recently introduced its Gen 13 servers, marking a shift in how its network handles traffic. Instead of relying on large CPU caches for speed, the company redesigned its software to leverage many more processor cores working in parallel in its latest AMD-based servers.
-
Dropbox Collaborates with GitHub to Reduce Monorepo Size from 87GB to 20GB
Dropbox reduced its backend monorepo from 87GB to 20GB by optimizing Git delta compression in collaboration with GitHub. The changes improved clone times, CI performance, and developer velocity, highlighting how repository storage inefficiencies can impact large-scale engineering workflows.
-
From Minutes to Seconds: Uber Boosts MySQL Cluster Uptime with Consensus Architecture
Uber redesigned its MySQL fleet using a consensus-driven architecture based on MySQL Group Replication, reducing cluster failover time from minutes to seconds. By moving leader election and failure detection into the database layer, Uber improved availability, simplified external orchestration, and strengthened consistency across thousands of production clusters.
-
How CNAME Ordering in RFC Specs Caused Cloudflare 1.1.1.1 Outage
In a recent article titled "What came first: the CNAME or the A record?", Cloudflare explains how an ambiguous RFC specification caused its popular 1.1.1.1 service to break. After identifying the breakage and the ambiguity in older DNS standards regarding record order, Cloudflare proposes a clarified specification.
-
GitHub Reworks Layered Defenses after Legacy Protections Block Legitimate Traffic
GitHub engineers recently traced user reports of unexpected “Too Many Requests” errors to abuse-mitigation rules that had accidentally remained active long after the incidents that prompted them.
-
Cloudflare Scales Infrastructure as Code with Shift-Left Security Practices
Cloudflare has eliminated manual configuration errors across hundreds of production accounts by implementing Infrastructure as Code with automated policy enforcement, processing approximately 30 merge requests daily while catching security violations before deployment rather than after incidents occur.
-
Benchmarking beyond the Application Layer: How Uber Evaluates Infrastructure Changes and Cloud SKUs
Uber’s Ceilometer framework automates infrastructure performance benchmarking beyond applications. It standardizes testing across servers, workloads, and cloud SKUs, helping teams validate changes, identify regressions, and optimize resources. Future plans include AI integration, anomaly detection, and continuous validation.
-
NVIDIA Dynamo Addresses Multi-Node LLM Inference Challenges
Serving Large Language Models (LLMs) at scale is complex. Modern LLMs now exceed the memory and compute capacity of a single GPU or even a single multi-GPU node. As a result, inference workloads for models with 70B+ or 120B+ parameters, or for pipelines with large context windows, require multi-node, distributed GPU deployments.
-
Azure API Management Premium v2 GA: Simplified Private Networking and VNet Injection
Microsoft has launched API Management Premium v2, redefining security and ease-of-use in cloud API gateways. This new architecture enhances private networking by eliminating management traffic from customer VNets. With features like Inbound Private Link, availability zone support, and custom CA certificates, users gain unmatched networking flexibility, resilience, and significant cost savings.
-
Airbnb Adds Adaptive Traffic Control to Manage Key Value Store Spikes
Airbnb upgraded Mussel, its multi-tenant key-value store, replacing static per-client rate limits with an adaptive, resource-aware traffic control system. The redesign ensures resilience during traffic spikes, protects critical workflows, and maintains fair usage across thousands of tenants while scaling efficiently.
-
KubeCon NA 2025 - Salesforce’s Approach to Self-Healing Using AIOps and Agentic AI
AIOps and Agentic AI technologies can help build solutions that intelligently analyze Kubernetes cluster health, automatically diagnose problems, and orchestrate issue resolution with minimal human intervention. Vikram Venkataraman and Srikanth Rajan spoke at the KubeCon + CloudNativeCon NA 2025 conference about Salesforce’s approach to self-healing systems using AIOps and AI agents.
-
Airbnb’s Mussel V2: Next-Gen Key Value Storage to Unify Streaming and Bulk Ingestion
Airbnb’s engineering team re-architected its internal key-value storage system, Mussel, to unify streaming and bulk ingestion and to simplify operations. The new version achieves over 100,000 writes per second and sub-25ms read latencies on 100-terabyte tables, and it uses Kubernetes, Kafka, and a NewSQL backend to improve scalability, reliability, and operational efficiency across Airbnb’s internal services.
-
Anthropic Reveals Three Infrastructure Bugs behind Claude Performance Issues
Anthropic recently published a postmortem revealing that three distinct infrastructure bugs intermittently degraded the output quality of its Claude models in recent weeks. While the company states it has now resolved those issues and is modifying its internal processes to prevent similar disruptions, the community highlights the challenges of running the service across three hardware platforms.
-
Microsoft Tests Microfluidic Cooling for Next-Generation AI Chips
Microsoft has announced progress on a new chip cooling approach that could help address one of the biggest bottlenecks in scaling AI infrastructure: heat. The company’s researchers have successfully demonstrated in-chip microfluidic cooling, a system that channels liquid coolant directly into etched grooves on the back of silicon chips.