Monitoring Content on InfoQ

Articles

DevOps

Change as Metrics: Measuring System Reliability through Change Delivery Signals

System changes are the primary driver of production incidents, making change-related metrics essential reliability signals. A minimal metric set of Change Lead Time, Change Success Rate, and Incident Leakage Rate assesses delivery efficiency and reliability, supported by actionable technical metrics and an event-centric data warehouse for unified change observability.

Peihao Yuan
on Mar 09, 2026
DevOps

Proactive Autoscaling for Edge Applications in Kubernetes

Kubernetes often reacts too late when traffic suddenly increases at the edge. A proactive scaling approach that considers response time, spare CPU capacity, and container startup delays can add or remove instances more smoothly, prevent sudden spikes, and keep performance stable on systems with limited resources.

Rajeev Kallayil Ravi
on Feb 17, 2026
DevOps

From Alert Fatigue to Agent-Assisted Intelligent Observability

As systems grow, observability becomes harder to maintain and incidents harder to diagnose. Agentic observability layers AI on existing tools, starting in read-only mode to detect anomalies and summarize issues. Over time, agents add context, correlate signals, and automate low-risk tasks. This approach frees engineers to focus on analysis and judgment.

Rohit Dhawan
on Feb 04, 2026
DevOps

How Causal Reasoning Addresses the Limitations of LLMs in Observability

Large language models excel at converting observability telemetry into clear summaries but struggle with accurate root cause analysis in distributed systems. LLMs often hallucinate explanations and confuse symptoms with causes. This article suggests how causal reasoning models with Bayesian inference offer more reliable incident diagnosis.

Dhairya Dalal
on Sep 02, 2025
Cloud

Backend FinOps: Engineering Cost-Efficient Microservices in the Cloud

Backend FinOps integrates financial discipline into microservices, crucial for cutting cloud costs. Challenges such as resource fragmentation and cold starts underscore the need for intelligent design, effective language choice, robust tagging, and automation. Implementing FinOps via IaC, CI/CD checks, and dynamic autoscaling (e.g., Karpenter) ensures sustained efficiency.

Vivek Arora
on Aug 06, 2025
Cloud

Engineering Principles for Building a Successful Cloud-Prem Solution

Cloud-Prem solutions combine cloud efficiency with on-premises control, meeting data sovereignty and compliance demands while optimizing operational costs and enhancing customer security.

Satyam Dhar
on Jun 26, 2025
DevOps

Analyzing Apache Kafka Stretch Clusters: WAN Disruptions, Failure Scenarios, and DR Strategies

This article analyzes the dynamics of Apache Kafka stretch clusters, assessing WAN disruptions and failure scenarios and outlining Disaster Recovery (DR) strategies. It focuses on maintaining high availability and data integrity across multi-region deployments and on safeguarding vital services against service level agreement violations.

Srikanth Daggumalli Nishchai Jayanna Manjula
on Jun 20, 2025
Cloud

We Took Developers out of the Portal: How APIOps and IaC Reshaped Our API Strategy

This article describes transforming legacy API management into an APIOps framework built on Infrastructure as Code (IaC). It covers automating API lifecycles, enhancing security, and improving developer productivity through CI/CD integration, with the aim of consistency across environments and faster deployment.

Balakrishna Sudabathula
on Jun 12, 2025
Culture & Methods

InfoQ Culture and Methods Trends Report - 2025

This report summarizes how the InfoQ Culture and Methods editorial team sees the ongoing and emergent trends in the culture and methods space.

Shane Hastie, Charity Majors, Ben Linders, Rafiq Gemmail, Craig Smith
on May 09, 2025
Architecture & Design

Applying Flow Metrics to Design Resilient Microservices

Designing software for resilience acknowledges the reality that everything fails. We put metrics in place to help us detect and resolve problems and failures. Flow metrics, commonly used to measure how well teams deliver software, can also be used to measure and improve system resilience.

Mourjo Sen
on Mar 26, 2025
AI, ML & Data Engineering

Beyond Notebook: Building Observable Machine Learning Systems

In this article, the author discusses a machine learning pipeline with built-in observability for a credit card fraud detection use case, using tools like MLflow, FastAPI, Streamlit, Apache Kafka, Prometheus, Grafana, and Evidently AI.

Lakshmithejaswi Narasannagari
on Mar 14, 2025
AI, ML & Data Engineering

Secure AI-Powered Early Detection System for Medical Data Analysis & Diagnosis

In this article, the author discusses techniques for securing AI applications in healthcare, using an early detection system for medical data analysis and diagnosis as the use case. The proposed layered architecture includes application components to support secure computation, AI modeling, governance and compliance, and monitoring and auditing.

Mahesh Vaijainthymala Krishnamoorthy
on Mar 03, 2025
