DevOps Content on InfoQ
-
Holistic Engineering: Organic Problem Solving for Complex Evolving Systems
Late projects. Architectures that drift from their original design. Code that mysteriously evolves into something nobody planned. These persistent problems in software development often stem not from technical failures, but from forces we pretend don't exist—reward systems that incentivize the wrong behaviors, organizational structures that ignore domain boundaries, and human dynamics.
-
When Reverse Proxies Surprise You: Hard Lessons from Operating at Scale
Operating massive reverse proxy fleets reveals hard lessons: optimizations that work on smaller systems fail at scale; mundane oversights like missing commas cause major outages; and abstractions meant to simplify become hidden fragility points. Success requires profiling on target hardware, relentlessly monitoring boring details, keeping hot paths lean, and trusting instrumentation over theory.
-
Building Resilient Platforms: Insights from over Twenty Years in Mission-Critical Infrastructure
Building resilient platforms requires understanding the art and science of creating infrastructure that others depend on for critical applications. This perspective applies to anyone who builds software consumed by others at scale, whether that is an infrastructure platform, a software development platform, or a messaging system.
-
InfoQ Cloud and DevOps Trends Report - 2025
This InfoQ Trends Report offers readers a comprehensive overview of emerging trends and technologies in the areas of Cloud and DevOps. It summarizes the views of the InfoQ editorial team and external guests on current trends in Cloud and DevOps technologies and what to look out for in the next 12 months.
-
Beyond the Padlock: Why Certificate Transparency is Reshaping Internet Trust
Certificate Transparency (CT) creates public, append-only logs of every TLS certificate issued, enabling detection of rogue or mistaken certificates. This article explores how CT has transformed internet PKI by moving from reliance on certificate authority trustworthiness to providing verifiable transparency that major browsers now require.
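
As a rough illustration of why an append-only log is verifiable, the sketch below computes a Merkle tree root over log entries using the hashing scheme described in RFC 6962 (a 0x00 prefix for leaves, 0x01 for interior nodes). The entry values are hypothetical placeholders; real CT logs store structured certificate entries and publish signed tree heads, which this sketch omits.

```python
import hashlib


def leaf_hash(entry: bytes) -> bytes:
    # Leaves are hashed with a 0x00 prefix so they cannot collide with nodes.
    return hashlib.sha256(b"\x00" + entry).digest()


def node_hash(left: bytes, right: bytes) -> bytes:
    # Interior nodes are hashed with a 0x01 prefix.
    return hashlib.sha256(b"\x01" + left + right).digest()


def merkle_root(entries: list[bytes]) -> bytes:
    n = len(entries)
    if n == 0:
        return hashlib.sha256(b"").digest()
    if n == 1:
        return leaf_hash(entries[0])
    # Split at the largest power of two strictly less than n.
    k = 1
    while k * 2 < n:
        k *= 2
    return node_hash(merkle_root(entries[:k]), merkle_root(entries[k:]))


# Hypothetical log contents, for illustration only.
log = [b"cert-a", b"cert-b"]
root_before = merkle_root(log)
log.append(b"cert-c")
root_after = merkle_root(log)
print(root_before.hex() != root_after.hex())
```

Every appended entry changes the root, so anyone holding an earlier root can detect a log that rewrites or drops entries.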
-
How Causal Reasoning Addresses the Limitations of LLMs in Observability
Large language models excel at converting observability telemetry into clear summaries but struggle with accurate root cause analysis in distributed systems. LLMs often hallucinate explanations and confuse symptoms with causes. This article suggests how causal reasoning models with Bayesian inference offer more reliable incident diagnosis.
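
To make the idea of Bayesian diagnosis concrete, here is a minimal sketch that applies Bayes' rule to rank candidate root causes given one observed symptom. It is not the article's model: the cause names, priors, and likelihoods are invented for illustration, and a real causal model would also encode dependencies between services.

```python
def posterior(priors: dict[str, float], likelihoods: dict[str, float]) -> dict[str, float]:
    """Return P(cause | symptom) for each cause via Bayes' rule.

    priors[c]      = P(cause c) before seeing the symptom
    likelihoods[c] = P(symptom | cause c)
    """
    unnormalized = {c: priors[c] * likelihoods[c] for c in priors}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("symptom is impossible under every candidate cause")
    return {c: v / total for c, v in unnormalized.items()}


# Hypothetical numbers for a "checkout latency spike" symptom.
priors = {"db_saturation": 0.2, "bad_deploy": 0.3, "network_partition": 0.5}
likelihoods = {"db_saturation": 0.9, "bad_deploy": 0.5, "network_partition": 0.1}

ranked = posterior(priors, likelihoods)
most_likely = max(ranked, key=ranked.get)
print(most_likely)
```

The least likely cause a priori ends up ranked first once the evidence is weighed, which is the kind of explicit, checkable reasoning a free-form summary does not provide.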
-
Ransomware-Resilient Storage: the New Frontline Defense in a High-Stakes Cyber Battle
Cybersecurity has evolved, with ransomware now primarily targeting data storage and backups. To combat this, modern defense strategies focus on making storage systems more resilient. Key tactics include using immutable storage that prevents data from being altered or deleted, employing AI-powered detection, and implementing air-gapping to create isolated, tamper-proof recovery points.
-
Zero-Downtime Critical Cloud Infrastructure Upgrades at Scale
Engineers can avoid common pitfalls in large-scale infrastructure upgrades by studying others' experiences. The article provides lessons learned from big firms like eBay and Snowflake, offering solutions for legacy systems, performance validation, and rollback planning. It emphasizes systematic preparation and clear communication to handle challenges and ensure zero-downtime upgrades at scale.
-
One Network: Cloud-Agnostic Service and Policy-Oriented Network Architecture
Unifying software infrastructure shortens development time and makes large, distributed systems easier to control through clear policies. In this QCon SF 2024 presentation, Anna Berenberg shared learnings and achievements from building One Network, addressing complex infrastructure layers, open-source integration, and uniform policy enforcement for improved reliability and security.
-
Sandbox as a Service: Building an Automated AWS Sandbox Framework
This article outlines an automated AWS Sandbox Framework to provide secure, cost-controlled environments for innovation. It leverages AWS services like Control Tower and open-source tools to automate provisioning, enforce security policies, manage resource lifecycles, and optimize costs through automated cleanup and governance.
-
Backend FinOps: Engineering Cost-Efficient Microservices in the Cloud
Backend FinOps integrates financial discipline into microservices, crucial for cutting cloud costs. Challenges such as resource fragmentation and cold starts underscore the need for intelligent design, effective language choice, robust tagging, and automation. Implementing FinOps via IaC, CI/CD checks, and dynamic autoscaling (e.g., Karpenter) ensures sustained efficiency.
-
Ceph RBD Turns 15: a Story of Open Source Creation
Fifteen years ago, Ceph RBD began as a community-driven idea that grew into essential infrastructure powering today's cloud platforms. This insider story from Yehuda Sadeh-Weinraub reveals how two developers started a distributed storage project that now supports OpenStack and Kubernetes through transparent, collaborative development.