DevOps Content on InfoQ
-
Railway Highlights the Importance of Logs, Metrics, Traces, and Alerts for Diagnosing System Failure
Railway’s engineering team published a comprehensive guide to observability, explaining how developers and SRE teams can use logs, metrics, traces, and alerts together to understand and diagnose production system failures.
-
Google BigQuery Adds SQL-Native Managed Inference for Hugging Face Models
Google has launched SQL-native managed inference for 180,000+ Hugging Face models in BigQuery. The preview release collapses the ML lifecycle into a unified SQL interface, eliminating the need for separate Kubernetes or Vertex AI management. Key features include automated resource governance via endpoint_idle_ttl and secure identity-based execution using existing data warehouse permissions.
-
Cedar Joins CNCF as a Sandbox Project
Cedar, an open-source policy language architected by AWS, has joined the CNCF as a Sandbox project. Designed for fine-grained application permissions, it decouples access control from code using a verifiable, high-performance policy engine. Cedar supports RBAC, ABAC, and ReBAC, offering a secure, analyzable alternative to general-purpose tools like OPA.
-
Two Missing Characters: How a Regex Flaw Exposed AWS GitHub Repos to Supply-Chain Risk
AWS recently published a security bulletin acknowledging a configuration issue affecting some popular AWS-managed open-source GitHub repositories. Dubbed CodeBreach, the critical vulnerability could have resulted in the introduction of malicious code and hijacking of the repositories leveraging AWS CodeBuild.
-
OpenCost Looks Back on 2025 Milestones and Charts a Roadmap for 2026
The OpenCost project, an open-source cost and resource management tool hosted by the Cloud Native Computing Foundation (CNCF), has published a year-in-review reflecting on its progress in 2025 and outlining priorities for 2026.
-
AWS Launches European Sovereign Cloud amid Questions about U.S. Legal Jurisdiction
AWS has launched its European Sovereign Cloud with a €7.8 billion investment, designed to meet EU regulatory demands and address data privacy concerns amid geopolitical tensions. Despite its operational separation from global regions, questions linger about legal protections against U.S. data access. Competitors like Microsoft and local providers may present stronger sovereignty options.
-
Salesforce Migrates 1,000+ EKS Clusters to Karpenter to Improve Scaling Speed and Efficiency
Salesforce has completed a phased migration of more than 1,000 Amazon Elastic Kubernetes Service (EKS) clusters from the Kubernetes Cluster Autoscaler to Karpenter, AWS’s open-source node-provisioning and autoscaling solution.
-
Microsoft Releases Azure Functions Support for Model Context Protocol Servers
Microsoft has released Model Context Protocol (MCP) server support for Azure Functions, aimed at providing secure, standardized workflows for AI agents. Built-in on-behalf-of (OBO) authentication and streamable HTTP transport address key security concerns. The feature now supports multiple languages and self-hosting, letting developers deploy MCP servers while protecting sensitive data.
-
GitLab 18.8 Marks General Availability of the Duo Agent Platform
GitLab 18.8 brings a number of new features, including the GitLab Duo Planner Agent, the GitLab Duo Security Analyst Agent, automatic dismissal of irrelevant vulnerabilities, and more. With this release, the GitLab Duo Agent Platform, which enables organizations to orchestrate AI agents, reaches general availability.
-
Microsoft Open Sources XAML Studio, Reviving a Longstanding Prototyping Tool
Microsoft has officially open-sourced XAML Studio, a lightweight rapid prototyping tool for XAML-based UI development, under the .NET Foundation. The tool, originally released through the Microsoft Store as part of the Microsoft Garage initiative, now welcomes community contributions and collaboration via its GitHub repository.
-
Pinterest's Moka: How Kubernetes Is Rewriting the Rules of Big Data Processing
Digital pinboard provider Pinterest has published an article explaining its blueprint for the future of large-scale data processing with its new platform Moka. The company is moving core workloads from ageing Hadoop infrastructure to a Kubernetes-based system on Amazon EKS, with Apache Spark as the main engine and support for other frameworks on the way.
-
Human‑Centred AI for SRE: Multi‑Agent Incident Response without Losing Control
A growing body of recent research and industry commentary suggests that a shift in how organisations approach site reliability engineering is underway. Rather than handing the pager to a machine, teams are designing multi-agent AI systems that work alongside on-call engineers, narrowing the search space and automating the tedious steps while leaving judgment calls to humans.
-
Cloudflare Automates Salt Configuration Management Debugging, Reducing Release Delays
Cloudflare recently shared how it manages its large global fleet with SaltStack (Salt) and described the engineering work behind the "grain of sand" problem: finding a single configuration error among millions of state applications. To address it, Cloudflare’s Site Reliability Engineering (SRE) team redesigned its configuration observability.
-
Pulumi Adds Native Support for Terraform and HCL
Pulumi now natively supports Terraform and HCL, enabling direct HCL execution and state management within Pulumi Cloud. Currently in private beta with a release targeted for Q1 2026, the update aids migration from legacy tools. A new financial "escape hatch" offers credits for existing HashiCorp contracts, targeting teams affected by recent licensing shifts.
-
Cloudflare Launches "Code Orange: Fail Small" Resilience Plan after Multiple Global Outages
Cloudflare recently published a detailed resilience initiative called Code Orange: Fail Small, outlining a comprehensive plan to prevent large-scale service disruptions after two major network outages in the past six weeks.