Architecture & Design Content on InfoQ
-
Cloudflare Global Outage Traced to Internal Database Change
Cloudflare’s recent global outage, linked to a database update, caused widespread disruption and highlighted the risks of single-vendor reliance. While service was restored, the incident sparked discussions on the importance of multi-vendor strategies in tech. Cloudflare's CEO vowed to enhance system resilience, emphasizing that outages can impact even the largest providers.
-
QConSF 2025: Humans in the Loop: Engineering Leadership in a Chaotic Industry
At QCon SF 2025, Michelle Brush of Google explored the evolving landscape of software engineering in her keynote “Humans in the Loop: Engineering Leadership in a Chaotic Industry.” She highlighted the complexities engineers face amid automation and AI, stressing the importance of conscious competence, higher-level problem-solving, and effective leadership in navigating today's challenges.
-
Airbnb Adds Adaptive Traffic Control to Manage Key Value Store Spikes
Airbnb upgraded Mussel, its multi-tenant key-value store, replacing static per-client rate limits with an adaptive, resource-aware traffic control system. The redesign ensures resilience during traffic spikes, protects critical workflows, and maintains fair usage across thousands of tenants while scaling efficiently.
-
Netflix Tackles Data Deletion at Scale with Centralized Platform Architecture
Netflix engineers presented their architecture for a centralized data-deletion platform at QCon San Francisco, addressing a critical yet rarely discussed system design challenge. The platform manages deletion across heterogeneous data stores while balancing durability, availability, and correctness, processing 76.8 billion row deletions across 1,300 datasets with zero data loss incidents.
-
AWS Lambda Rust Support Reaches General Availability
AWS has moved Rust support in Lambda from experimental to generally available, so developers can build high-performance, memory-safe serverless applications backed by AWS support and an SLA. While it offers speed comparable to C++, lengthy SDK compile times and increased binary sizes remain key considerations.
-
Enhancing Reliability Using Service-Level Prioritized Load Shedding: Netflix at QCon SF 2025
At QCon San Francisco, Netflix engineers presented their service-level prioritized load-shedding strategy for improving reliability during traffic spikes. By prioritizing high-value requests and automating management across microservices, they protect user experience and system stability. The key takeaways are prioritization, automation, and structured load shedding.
-
The Decisions You Don't Know You're Making: QCon Keynote Explores Hidden Choices in Engineering
Engineering teams make their most consequential decisions not in architecture reviews or sprint planning, but through invisible choices embedded in metrics, defaults, and everyday behaviors. In their QCon San Francisco 2025 keynote, Shawna Martell and Dan Fike challenged the industry's focus on documented decision-making while the decisions that truly shape systems and culture go unrecognized.
-
Parting the Clouds: the Rise of Disaggregated Systems by Murat Demirbas at QCon SF 2025
Cloud computing is evolving through disaggregation, addressing inefficiencies of traditional architectures by decoupling compute and storage. This shift enhances scalability, fault isolation, and operational simplicity, driven by advancements in networking. As seen in cloud databases such as Amazon Aurora, embracing these principles enables true economic optimization and innovative design.
-
Cloudflare Workflows Adds Python Support for Durable AI Pipelines
Cloudflare Workflows now supports both TypeScript and Python for orchestrating complex applications. With durable execution and state persistence, it simplifies the development of robust data pipelines and AI/ML models. The Python support also emphasizes concurrency and an intuitive design.
-
QCon SF: Database-Backed Workflow Orchestration Challenges Traditional Architecture
During QCon SF, Jeremy Edberg and Qian Li from DBOS presented an unconventional architectural approach to workflow orchestration: treating PostgreSQL not just as a data store, but as the orchestration layer itself. Their talk addressed a persistent problem in distributed systems: workflows frequently fail, recovery mechanisms are complex, and visibility into workflow state remains challenging.
-
AI-Generated Code Creates New Wave of Technical Debt, Report Finds
AI-generated code is “highly functional but systematically lacking in architectural judgment”, according to a new report from AI application security (AppSec) company Ox Security. Released in late October and titled Army of Juniors: The AI Code Security Crisis, the report outlines 10 architecture and security anti-patterns that are commonly found in AI-generated code.
-
Java News Roundup: Spring Framework 7.0, Spring Data, Spring AI, Payara Platform, OpenJDK, JobRunr
This week's Java roundup for November 10th, 2025, features news highlighting: OpenJDK JEPs targeted for JDK 26; the GA release of Spring Framework 7.0; point releases of Spring Data, Spring AI, JobRunr and Jox; the November 2025 edition of Payara Platform; the fifth release candidate of Maven 4.0; and a maintenance release of Micronaut.
-
Race Condition in DynamoDB DNS System: Analyzing the AWS US-EAST-1 Outage
On October 19th and 20th, AWS experienced an extended outage triggered by a failure in Amazon DynamoDB that affected most services in its most popular region, Northern Virginia. The cloud provider released an analysis of the incident, sparking discussions in the community about redundancy on AWS, moving out of the public cloud, and multi-region approaches.
-
Microsoft Addresses Data Residency with Private Cloud Expansion
Microsoft has strengthened its Sovereign Cloud offering to meet stringent global data-residency and control regulations, particularly in Europe. New capabilities include a commitment to the EU Data Boundary, expanded in-country data processing, and enhanced Sovereign Private Cloud features.
-
Monzo’s Real-Time Fraud Detection Architecture with BigQuery and Microservices
Monzo has redesigned its fraud prevention platform to detect scams in real time, handle growing payment volumes, and deploy new controls rapidly. The article covers the bank’s modular control architecture, feature computation pipeline, and observability using BigQuery for accurate, low-latency fraud detection.