Infrastructure Content on InfoQ

News

Development

How CNAME Ordering in RFC Specs Caused Cloudflare 1.1.1.1 Outage

In a recent article titled "What came first- the CNAME or the A record?", Cloudflare explains how an ambiguous RFC specification caused its popular 1.1.1.1 service to break. After identifying the breakage and the ambiguity in older DNS standards regarding record order, Cloudflare proposes a clarified specification.

Renato Losio
on Feb 07, 2026
Architecture & Design

GitHub Reworks Layered Defenses after Legacy Protections Block Legitimate Traffic

GitHub engineers recently traced user reports of unexpected “Too Many Requests” errors to abuse-mitigation rules that had accidentally remained active long after the incidents that prompted them.

Matt Foster
on Feb 04, 2026
DevOps

Cloudflare Scales Infrastructure as Code with Shift-Left Security Practices

Cloudflare has eliminated manual configuration errors across hundreds of production accounts by implementing Infrastructure as Code with automated policy enforcement, processing approximately 30 merge requests daily while catching security violations before deployment rather than after incidents occur.

Claudio Masolo
on Jan 12, 2026
Architecture & Design

Benchmarking beyond the Application Layer: How Uber Evaluates Infrastructure Changes and Cloud SKUs

Uber’s Ceilometer framework automates infrastructure performance benchmarking beyond applications. It standardizes testing across servers, workloads, and cloud SKUs, helping teams validate changes, identify regressions, and optimize resources. Future plans include AI integration, anomaly detection, and continuous validation.

Leela Kumili
on Dec 26, 2025
DevOps

NVIDIA Dynamo Addresses Multi-Node LLM Inference Challenges

Serving Large Language Models (LLMs) at scale is complex. Modern LLMs now exceed the memory and compute capacity of a single GPU or even a single multi-GPU node. As a result, inference workloads for models with 70B+ or 120B+ parameters, or for pipelines with large context windows, require multi-node, distributed GPU deployments.

Claudio Masolo
on Dec 04, 2025
Cloud

Azure API Management Premium v2 GA: Simplified Private Networking and VNet Injection

Microsoft has launched API Management Premium v2, redefining security and ease-of-use in cloud API gateways. This new architecture enhances private networking by eliminating management traffic from customer VNets. With features like Inbound Private Link, availability zone support, and custom CA certificates, users gain unmatched networking flexibility, resilience, and significant cost savings.

Steef-Jan Wiggers
on Dec 03, 2025
Architecture & Design

Airbnb Adds Adaptive Traffic Control to Manage Key Value Store Spikes

Airbnb upgraded Mussel, its multi-tenant key-value store, replacing static per-client rate limits with an adaptive, resource-aware traffic control system. The redesign ensures resilience during traffic spikes, protects critical workflows, and maintains fair usage across thousands of tenants while scaling efficiently.

Leela Kumili
on Nov 21, 2025
AI, ML & Data Engineering

KubeCon NA 2025 - Salesforce’s Approach to Self-Healing Using AIOps and Agentic AI

AIOps and Agentic AI technologies can help in developing solutions to intelligently analyze Kubernetes cluster health, automatically diagnose problems, and orchestrate issue resolutions with minimal human intervention. Vikram Venkataraman and Srikanth Rajan spoke at KubeCon + CloudNativeCon NA 2025 Conference about Salesforce’s approach to self-healing systems using AIOps and AI Agents.

Srini Penchikala
on Nov 12, 2025
Architecture & Design

Airbnb’s Mussel V2: Next-Gen Key Value Storage to Unify Streaming and Bulk Ingestion

Airbnb’s engineering team re-architected its internal key-value storage system, Mussel, to unify streaming and bulk ingestion while simplifying operations. The new version achieves over 100,000 writes per second and sub-25ms read latencies on 100-terabyte tables, and it uses Kubernetes, Kafka, and a NewSQL backend to improve scalability, reliability, and operational efficiency across Airbnb’s internal services.

Leela Kumili
on Oct 24, 2025
AI, ML & Data Engineering

Anthropic Reveals Three Infrastructure Bugs behind Claude Performance Issues

Anthropic recently published a postmortem revealing that three distinct infrastructure bugs intermittently degraded the output quality of its Claude models in recent weeks. While the company states it has now resolved those issues and is modifying its internal processes to prevent similar disruptions, the community highlights the challenges of running the service across three hardware platforms.

Renato Losio
on Oct 03, 2025
AI, ML & Data Engineering

Microsoft Tests Microfluidic Cooling for Next-Generation AI Chips

Microsoft has announced progress on a new chip cooling approach that could help address one of the biggest bottlenecks in scaling AI infrastructure: heat. The company’s researchers have successfully demonstrated in-chip microfluidic cooling, a system that channels liquid coolant directly into etched grooves on the back of silicon chips.

Robert Krzaczyński
on Oct 01, 2025
DevOps

Imagine Learning Highlights Linkerd’s Role in Cloud-Native Scale and Cost Savings

Innovative education technology provider Imagine Learning relies on Linkerd as the backbone of its cloud-native infrastructure, enabling rapid growth and ensuring reliability, scalability, and security. With over 80% reduction in compute needs and a 40% cut in networking costs, Linkerd offers a proven solution that enhances efficiency across diverse sectors.

Mark Silvester
on Sep 28, 2025
DevOps

System Initiative Launches “AI Native” Platform to Simplify Infrastructure Automation

System Initiative recently released its AI Native Infrastructure Automation platform, aiming to offer DevOps teams a new way to manage infrastructure through natural language.

Craig Risi
on Sep 09, 2025
Cloud

AWS Launches Memory-Optimized EC2 R8i and R8i-flex Instances with Custom Intel Xeon 6 Processors

AWS has launched its eighth-generation Amazon EC2 R8i and R8i-flex instances, powered by custom Intel Xeon 6 processors. Designed for memory-intensive workloads, these instances offer up to 15% better price performance and enhanced memory throughput, making them ideal for real-time data processing and AI applications.

Steef-Jan Wiggers
on Aug 29, 2025
Cloud

AWS CCAPI MCP Server: Natural Language Infra

AWS has introduced the Cloud Control API (CCAPI) MCP Server, which lets developers manage infrastructure resources through natural language commands. The tool provides automated security checks, IaC template generation, and cost estimation, bridging the gap between intent and cloud deployment.

Steef-Jan Wiggers
on Aug 22, 2025
