DevOps Content on InfoQ
-
Enhancing Reliability Using Service-Level Prioritized Load Shedding: Netflix at QCon SF 2025
At QCon San Francisco, Netflix engineers presented their service-level prioritized load-shedding strategy, which improves reliability during traffic spikes. By prioritizing high-value requests and automating load-shedding management across microservices, they protect user experience and system stability. The key takeaways were prioritization, automation, and structured load shedding.
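The core idea, admitting high-value requests while shedding lower-priority ones as load rises, can be sketched roughly as follows. This is a minimal illustration and not Netflix's implementation: the priority tiers, the utilization thresholds, and the `LoadShedder` name are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Priority(IntEnum):
    # Hypothetical tiers; a lower value means a more important request.
    CRITICAL = 0
    DEGRADED = 1
    BEST_EFFORT = 2
    BULK = 3


def _default_thresholds():
    # (utilization threshold, least important priority still admitted).
    # The numbers are illustrative assumptions, checked from highest to lowest.
    return [
        (0.90, Priority.CRITICAL),
        (0.80, Priority.DEGRADED),
        (0.70, Priority.BEST_EFFORT),
    ]


@dataclass
class LoadShedder:
    thresholds: list = field(default_factory=_default_thresholds)

    def cutoff(self, utilization: float) -> Priority:
        """Return the least important priority admitted at this utilization."""
        for threshold, priority in self.thresholds:
            if utilization >= threshold:
                return priority
        return Priority.BULK  # below every threshold: admit everything

    def admit(self, priority: Priority, utilization: float) -> bool:
        """Admit the request only if it is at least as important as the cutoff."""
        return priority <= self.cutoff(utilization)


shedder = LoadShedder()
for load in (0.50, 0.75, 0.85, 0.95):
    admitted = [p.name for p in Priority if shedder.admit(p, load)]
    print(f"utilization={load:.2f} admits {admitted}")
```

Keeping the thresholds in a single table means the shedding policy can be tuned, or generated automatically per service, without touching the admission logic.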
-
AWS Disruption Exposes Fragility in Critical Cloud Infrastructure
On October 20, 2025, Amazon Web Services (AWS) experienced a major outage that disrupted global internet services, affecting millions of users and thousands of companies across more than 60 countries. The incident originated in the US-EAST-1 region and was traced to a DNS resolution failure affecting the DynamoDB endpoint, which cascaded into outages across multiple dependent services.
-
Parting the Clouds: the Rise of Disaggregated Systems by Murat Demirbas at QCon SF 2025
Cloud computing is evolving through disaggregation, addressing inefficiencies of traditional architectures by decoupling compute and storage. This shift enhances scalability, fault isolation, and operational simplicity, driven by advancements in networking. As seen in cloud databases such as Amazon Aurora, embracing these principles enables true economic optimization and innovative design.
-
Cloudflare Workflows Adds Python Support for Durable AI Pipelines
Cloudflare Workflows now supports Python in addition to TypeScript, letting developers orchestrate multi-step applications in either language. Durable execution and state persistence simplify building robust data pipelines and AI/ML workflows. The release also emphasizes improved concurrency and a design intended to feel natural to Python developers.
-
AWS Introduces Remote Build Cache in ECR to Accelerate Docker Image Builds
Amazon Web Services has announced enhancements to its CodeBuild service, allowing teams to use Amazon ECR as a remote Docker layer cache, significantly reducing image build times in CI/CD pipelines. By leveraging ECR repositories to persist and reuse build layers across runs, organisations can skip rebuilding unchanged parts of containers and accelerate delivery.
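As a rough sketch of what a registry-backed layer cache looks like in practice, the helper below assembles a `docker buildx build` invocation that reads from and writes to a cache reference in a registry. It uses BuildKit's `type=registry` cache options. The account ID, region, repository names, and the `buildx_command` helper are placeholders, and whether a given CodeBuild project needs exactly these options is an assumption.

```python
def buildx_command(image_ref: str, cache_ref: str, context: str = ".") -> list[str]:
    """Build an argv list for a buildx build that uses a registry-backed cache."""
    return [
        "docker", "buildx", "build",
        "--tag", image_ref,
        # Reuse layers exported by earlier builds, if the cache reference exists.
        "--cache-from", f"type=registry,ref={cache_ref}",
        # Export all intermediate layers (mode=max) as an OCI image manifest.
        "--cache-to",
        f"type=registry,ref={cache_ref},mode=max,image-manifest=true,oci-mediatypes=true",
        "--push",
        context,
    ]


# Placeholder registry URI; substitute a real account, region, and repository.
REGISTRY = "123456789012.dkr.ecr.us-east-1.amazonaws.com"

command = buildx_command(
    image_ref=f"{REGISTRY}/my-app:latest",
    cache_ref=f"{REGISTRY}/my-app:buildcache",
)
print(" ".join(command))
```

The printed command can be pasted into a build step, or the list can be passed to `subprocess.run` once Docker is authenticated against the registry.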
-
Race Condition in DynamoDB DNS System: Analyzing the AWS US-EAST-1 Outage
On October 19th and 20th, AWS experienced an extended outage triggered by a failure in Amazon DynamoDB that affected most services in its most popular region, Northern Virginia. The cloud provider released an analysis of the incident, sparking discussions in the community about redundancy on AWS, moving out of public cloud, and multi-region approaches.
-
Microsoft Addresses Data Residency with Private Cloud Expansion
Microsoft has strengthened its Sovereign Cloud offering to meet stringent global data-residency and control regulations, particularly in Europe. New capabilities include a commitment to the EU Data Boundary, expanded in-country data processing, and enhanced Sovereign Private Cloud features.
-
GitHub Rolls out Post-Quantum SSH Security to Protect Code from Future Threats
GitHub has deployed a hybrid post-quantum key-exchange algorithm for SSH access, strengthening protection against future quantum decryption threats. The rollout, now live across most regions, pairs classical and quantum-resistant methods to counter “store now, decrypt later” attacks and marks a major step toward quantum-safe software development.
-
Crossplane Reaches Production Maturity by Graduating CNCF
The Cloud Native Computing Foundation (CNCF) has graduated Crossplane, marking a major milestone for the open-source project that turns Kubernetes into a universal control plane for cloud infrastructure. For practitioners, it signals that Crossplane is no longer an experimental idea but a production-hardened foundation for building internal platforms.
-
HashiCorp’s New Guide Offers Practical Advice on Writing and Rightsizing Terraform Modules
In a blog post titled "How to write and rightsize Terraform modules", HashiCorp shares a comprehensive framework for creating maintainable, scalable modules in the Terraform ecosystem. Author Mitch Pronschinske draws on insights from consultant Rene Schach's HashiDays 2025 session to focus on four key pillars: module scope, code strategy, security, and testing.
-
Google Cloud Introduces Chaos Engineering Framework and Recipes for Distributed Systems
Google Cloud's Expert Services Team has released a detailed guide on chaos engineering for cloud-based distributed systems. It argues that deliberately introducing failures is essential for developing resilient architectures. The initiative provides open-source recipes and guidance for applying controlled disruption testing in Google Cloud environments.
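To illustrate the general idea of controlled failure injection, here is a minimal, generic sketch. It is not taken from Google Cloud's recipes: the `inject_faults` decorator, the `InjectedFault` exception, and the `call_with_retries` helper are hypothetical names invented for this example.

```python
import random
from functools import wraps


class InjectedFault(RuntimeError):
    """Raised when the experiment deliberately fails a call."""


def inject_faults(failure_rate: float, rng: random.Random):
    """Wrap a function so that a fraction of its calls fail on purpose."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            # rng.random() is in [0, 1), so 0.0 never fails and 1.0 always fails.
            if rng.random() < failure_rate:
                raise InjectedFault(f"injected failure in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator


def call_with_retries(func, attempts: int):
    """Call func until it succeeds or the attempts are exhausted."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return func()
        except InjectedFault as error:
            last_error = error
    raise last_error


rng = random.Random(42)  # seeded so an experiment run is repeatable


@inject_faults(failure_rate=0.3, rng=rng)
def fetch_profile():
    return {"user": "demo"}


# Measure how often a caller with three attempts still sees a failure.
failures = 0
for _ in range(1000):
    try:
        call_with_retries(fetch_profile, attempts=3)
    except InjectedFault:
        failures += 1
print(f"caller-visible failures: {failures} of 1000")
```

Seeding the random generator keeps each experiment repeatable, which makes it easier to compare resilience before and after a change such as adjusting the retry count.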
-
AWS Launches Capabilities by Region Tool
AWS has launched "AWS Capabilities by Region," a tool that gives architects and developers a clearer view of service availability. Instead of checking manually, users can compare AWS services across regions interactively and plan deployments accordingly. The added transparency and automated capability checks are intended to simplify global projects and reduce delays.
-
Microsoft Moves Azure DevOps MCP Server from Preview to General Availability
Microsoft announced in October 2025 that its Azure DevOps MCP Server, a local Model Context Protocol (MCP) server designed to bring richer context to AI assistants like GitHub Copilot, has exited public preview and become generally available.
-
Grafana and GitLab Introduce Serverless CI/CD Observability Integration
In a move to streamline development workflows, Daniel Fritzgerald of Grafana Labs has published a new open-source solution that feeds GitLab CI/CD events into Grafana's observability stack via a serverless architecture.
-
Azure APIM Simplifies Event-Driven Architecture with Native Service Bus Policy
Microsoft's new feature in API Management (APIM) enables seamless messaging to Azure Service Bus, simplifying API connections in event-driven architectures. By using the send-service-bus-message policy, developers can easily route HTTP requests to Service Bus for asynchronous processing, enhancing integration, security, and control without additional components.