Reliability Content on InfoQ

News

DevOps

Netflix Scales "Human Infrastructure" to Manage Global Live Operations

Netflix has introduced a "human infrastructure" layer to manage live broadcasts at scale. Using a low-latency "telemetry hot path" and a Live Operations Centre, the company now balances automated scaling with human oversight. This shift, which mirrors strategies at AWS and Disney+, focuses on maintaining reliability through expert intervention during high-concurrency global events.

Mark Silvester
on Apr 30, 2026
Architecture & Design

From Minutes to Seconds: Uber Boosts MySQL Cluster Uptime with Consensus Architecture

Uber redesigned its MySQL fleet using a consensus-driven architecture based on MySQL Group Replication, reducing cluster failover time from minutes to seconds. By moving leader election and failure detection into the database layer, Uber improved availability, simplified external orchestration, and strengthened consistency across thousands of production clusters.

Leela Kumili
on Mar 11, 2026
Architecture & Design

Airbnb Expands Global Checkout with “Pay as a Local,” Scaling to 220 Markets in 14 Months

Airbnb expands its global checkout with the “Pay as a Local” initiative, supporting over 20 locally preferred payment methods across 220 markets. The company replatformed its payments system with domain-oriented services, reusable flow archetypes, and a centralized configuration, enhancing integration speed, reliability, testing, and observability for diverse payment methods worldwide.

Leela Kumili
on Feb 02, 2026
AI, ML & Data Engineering

QConAI NY 2025 - Designing AI Platforms for Reliability: Tools for Certainty, Agents for Discovery

Aaron Erickson at QCon AI NYC 2025 emphasized treating agentic AI as an engineering challenge, focusing on reliability through the blend of probabilistic and deterministic systems. He argued for clear operational structures to minimize risks and optimize performance, highlighting the importance of specialized agents and deterministic paths to enhance accuracy and control in AI workflows.

Andrew Hoblitzell
on Dec 21, 2025
DevOps

AWS Debuts “DevOps Agent” to Automate Incident Response and Improve System Reliability

AWS recently announced the public preview of AWS DevOps Agent, a new "frontier agent" that aims to help organizations react more quickly to production incidents, identify root causes, and proactively strengthen system reliability.

Craig Risi
on Dec 17, 2025
Architecture & Design

From On-Demand to Live: Netflix Streaming to 100 Million Devices in under 1 Minute

Netflix’s global live streaming platform powers millions of viewers with cloud-based ingest, custom live origin, Open Connect delivery, and real-time recommendations. This article explores the architecture, low-latency pipelines, adaptive bitrate streaming, and operational monitoring that ensure reliable, scalable, and synchronized live event experiences worldwide.

Leela Kumili
on Dec 05, 2025
Architecture & Design

Reddit Migrates Comment Backend from Python to Go Microservice to Halve Latency

Reddit has rebuilt its core backend, migrating Comments, Accounts, Posts, and Subreddits from a legacy Python monolith to Go microservices. The migration improves performance, halves critical write latency, and modernizes the platform for future scalability while preserving correctness across multiple datastores.

Leela Kumili
on Nov 28, 2025
DevOps

Enhancing Reliability Using Service-Level Prioritized Load Shedding: Netflix at QCon SF 2025

At QCon San Francisco, Netflix engineers unveiled their advanced Service-Level-Prioritized Load-Shedding strategy, enhancing reliability during traffic spikes. By prioritizing high-value requests and automating management across microservices, they safeguard user experience and system stability. Key insights stress prioritization, automation, and structured load shedding for optimal resilience.

Steef-Jan Wiggers
on Nov 20, 2025
AI, ML & Data Engineering

QCon AI New York 2025 Schedule Published, Highlights Practical Enterprise AI

The QCon AI New York 2025 schedule is now live for its Dec 16-17 event. Focused on moving AI from PoC to production, the program offers a practical roadmap for senior engineers & tech leaders. It addresses the real-world challenges of building, scaling, and deploying reliable, enterprise-grade AI systems, helping organizations overcome the hurdles of productionizing their AI initiatives.

Artenisa Chatziou
on Oct 10, 2025
Culture & Methods

How NASA Tests Their Software for the Space Shuttle and the Orion MPCV

NASA uses multiple testing levels, independent validation, standards, safety communities, and tools to ensure safety. Darrel Raines gave a talk about software development and testing for the Space Shuttle and the Orion MPCV. He explained how they learn from failures and near misses and continually improve their process.

Ben Linders
on Aug 14, 2025
Cloud

Azure Event Hubs Geo-Replication Reaches General Availability

Microsoft has launched the General Availability of Geo-replication for Azure Event Hubs, enhancing data availability and redundancy. This feature allows seamless cross-region data replication, ensuring business continuity during outages. With synchronous and asynchronous options, users can choose their preferred data consistency, backed by increased health metrics for better monitoring.

Steef-Jan Wiggers
on Jul 24, 2025
Cloud

Google Cloud Introduces Non-Disruptive Cloud Storage Bucket Relocation

Google Cloud's Cloud Storage bucket relocation feature migrates data across regions without disruption while preserving metadata and minimizing application downtime. Governance and lifecycle management remain in place, and access paths are not altered.

Steef-Jan Wiggers
on Jul 19, 2025
Architecture & Design

AWS Promotes Responsible AI in the Well-Architected Generative AI Lens

AWS announced the availability of the new Well-Architected Generative AI Lens, focused on providing best practices for designing and operating generative AI workloads. The lens is aimed at organizations delivering robust and cost-effective generative AI solutions on AWS. The document offers cloud-agnostic best practices, implementation guidance and links to additional resources.

Rafal Gancarz
on Apr 27, 2025
Architecture & Design

Stripe Rearchitects Its Observability Platform with Managed Prometheus and Grafana on AWS

Stripe replaced its observability platform, which used a third-party vendor solution, with a new architecture utilizing managed services on AWS. The company made the move due to scalability limits, reliability issues, and increasing costs while transitioning to microservices. The migration involved dual-writing metrics, translating assets, validation, and user training.

Rafal Gancarz
on Nov 27, 2024
Architecture & Design

Netflix Rolls Out Service-Level Prioritized Load Shedding to Improve Resiliency

Netflix extended its prioritized load-shedding implementation to the individual service level to further improve system resilience. The approach uses cloud capacity more efficiently by shedding low-priority requests only when necessary instead of maintaining separate clusters for failure isolation.

Rafal Gancarz
on Nov 23, 2024
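
The summary above describes shedding low-priority requests only when necessary. A minimal sketch of that idea follows, in Python. It is not Netflix's implementation: the `Priority` buckets, the `LoadShedder` class, the use of a single utilization figure as the overload signal, and both threshold values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from enum import IntEnum


class Priority(IntEnum):
    # Hypothetical priority buckets; a lower value means more important.
    CRITICAL = 0
    DEGRADED = 1
    BEST_EFFORT = 2
    BULK = 3


@dataclass
class LoadShedder:
    # Utilization (0.0 to 1.0) at which shedding starts, and at which only
    # CRITICAL traffic is still admitted. Both values are illustrative.
    shed_start: float = 0.6
    shed_all: float = 0.9

    def admit(self, priority: Priority, utilization: float) -> bool:
        """Return True if a request should be served at this utilization."""
        # Below the first threshold nothing is shed; CRITICAL is never shed.
        if priority == Priority.CRITICAL or utilization < self.shed_start:
            return True
        if utilization >= self.shed_all:
            return False
        # Between the thresholds, shed progressively, lowest priority first.
        span = self.shed_all - self.shed_start
        fraction = (utilization - self.shed_start) / span
        non_critical = len(Priority) - 1
        dropped = int(fraction * non_critical) + 1
        lowest_admitted = len(Priority) - 1 - dropped
        return priority <= lowest_admitted
```

With these assumed thresholds, all traffic is admitted while utilization stays under 0.6, bulk requests are dropped first as load rises, and only critical requests are served from 0.9 upward, so one cluster absorbs the spike rather than relying on separate clusters for isolation.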
