Reliability Content on InfoQ
-
QCon AI NY 2025 - Designing AI Platforms for Reliability: Tools for Certainty, Agents for Discovery
Aaron Erickson at QCon AI NYC 2025 emphasized treating agentic AI as an engineering challenge, focusing on reliability through the blend of probabilistic and deterministic systems. He argued for clear operational structures to minimize risks and optimize performance, highlighting the importance of specialized agents and deterministic paths to enhance accuracy and control in AI workflows.
-
AWS Debuts “DevOps Agent” to Automate Incident Response and Improve System Reliability
AWS recently announced the public preview of AWS DevOps Agent, a new "frontier agent" that aims to help organizations react more quickly to production incidents, identify root causes, and proactively strengthen system reliability.
-
From On-Demand to Live: Netflix Streaming to 100 Million Devices in under 1 Minute
Netflix’s global live streaming platform powers millions of viewers with cloud-based ingest, custom live origin, Open Connect delivery, and real-time recommendations. This article explores the architecture, low-latency pipelines, adaptive bitrate streaming, and operational monitoring that ensure reliable, scalable, and synchronized live event experiences worldwide.
-
Reddit Migrates Comment Backend from Python to Go Microservice to Halve Latency
Reddit has rebuilt its core backend, migrating Comments, Accounts, Posts, and Subreddits from a legacy Python monolith to Go microservices. The migration improves performance, halves critical write latency, and modernizes the platform for future scalability while preserving correctness across multiple datastores.
-
Enhancing Reliability Using Service-Level Prioritized Load Shedding: Netflix at QCon SF 2025
At QCon San Francisco, Netflix engineers unveiled their advanced Service-Level-Prioritized Load-Shedding strategy, enhancing reliability during traffic spikes. By prioritizing high-value requests and automating management across microservices, they safeguard user experience and system stability. Key insights stress prioritization, automation, and structured load shedding for optimal resilience.
-
QCon AI New York 2025 Schedule Published, Highlights Practical Enterprise AI
The QCon AI New York 2025 schedule is now live for its Dec 16-17 event. Focused on moving AI from PoC to production, the program offers a practical roadmap for senior engineers & tech leaders. It addresses the real-world challenges of building, scaling, and deploying reliable, enterprise-grade AI systems, helping organizations overcome the hurdles of productionizing their AI initiatives.
-
How NASA Tests Their Software for the Space Shuttle and the Orion MPCV
NASA uses multiple testing levels, independent validation, standards, safety communities, and tools to ensure safety. Darrel Raines gave a talk about software development and testing for the Space Shuttle and the Orion MPCV. He explained how they learn from failures and near misses and continually improve their process.
-
Azure Event Hubs Geo-Replication Reaches General Availability
Microsoft has made geo-replication for Azure Event Hubs generally available, improving data availability and redundancy. The feature replicates data across regions to support business continuity during outages. Users can choose between synchronous and asynchronous replication depending on the data consistency they need, and additional health metrics support monitoring.
-
Google Cloud Introduces Non-Disruptive Cloud Storage Bucket Relocation
Google Cloud's Cloud Storage bucket relocation feature enables non-disruptive data migration across regions while preserving metadata and minimizing application downtime. Users can maintain governance, keep lifecycle management in place, and use storage insights to optimize storage, all without altering access paths.
-
AWS Promotes Responsible AI in the Well-Architected Generative AI Lens
AWS announced the availability of the new Well-Architected Generative AI Lens, focused on providing best practices for designing and operating generative AI workloads. The lens is aimed at organizations delivering robust and cost-effective generative AI solutions on AWS. The document offers cloud-agnostic best practices, implementation guidance and links to additional resources.
-
Stripe Rearchitects Its Observability Platform with Managed Prometheus and Grafana on AWS
Stripe replaced its observability platform, which used a third-party vendor solution, with a new architecture utilizing managed services on AWS. The company made the move due to scalability limits, reliability issues, and increasing costs while transitioning to microservices. The migration involved dual-writing metrics, translating assets, validation, and user training.
-
Netflix Rolls Out Service-Level Prioritized Load Shedding to Improve Resiliency
Netflix extended its prioritized load-shedding implementation to the individual service level to further improve system resilience. The approach uses cloud capacity more efficiently by shedding low-priority requests only when necessary instead of maintaining separate clusters for failure isolation.
-
Netflix’s Pushy: Evolution of Scalable WebSocket Platform That Handles Hundreds of Millions of Concurrent Connections
Netflix shared details on the evolution of Pushy, a WebSocket messaging platform that supports push notifications and inter-device communication across many different devices for the company’s products. Netflix’s engineers implemented many improvements across the Pushy ecosystem to ensure the platform's scalability and reliability and support new capabilities.
-
How Google Does Chaos Testing to Improve Spanner's Reliability
To keep the Spanner database working reliably, Google engineers use chaos testing: they inject faults into production-like instances to stress the system's ability to behave correctly in the face of unexpected failures.
-
QCon London: Scaling Microservices Architecture and Technology Organization at Trainline
During the recent QCon London conference, Trainline’s CTO spoke about the evolution of the company’s system architecture and organizational structure over the last five years. The company had to adapt to market changes and growing customer expectations by improving the performance and reliability of its technology platform.