Performance Content on InfoQ
-
Change as Metrics: Measuring System Reliability through Change Delivery Signals
System changes are the primary driver of production incidents, making change-related metrics essential reliability signals. A minimal metric set of Change Lead Time, Change Success Rate, and Incident Leakage Rate assesses delivery efficiency and reliability, supported by actionable technical metrics and an event-centric data warehouse for unified change observability.
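As a rough illustration of how the three metrics could be derived from change events, here is a minimal Python sketch. The summary does not define the metrics, so the definitions used here are assumptions: lead time is the median number of hours from creation to deployment, success rate is the share of changes that deployed cleanly, and incident leakage rate is the share of changes linked to a production incident. The `ChangeEvent` record and its field names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median
from typing import Dict, List


@dataclass
class ChangeEvent:
    """Hypothetical record for one production change (field names are assumptions)."""
    created_at: datetime
    deployed_at: datetime
    succeeded: bool        # deployed without rollback or failed rollout
    caused_incident: bool  # later linked to a production incident


def change_metrics(events: List[ChangeEvent]) -> Dict[str, float]:
    """Compute the three change metrics under the assumed definitions above."""
    if not events:
        raise ValueError("at least one change event is required")
    lead_times_h = [
        (e.deployed_at - e.created_at).total_seconds() / 3600 for e in events
    ]
    total = len(events)
    return {
        "change_lead_time_hours_median": median(lead_times_h),
        "change_success_rate": sum(e.succeeded for e in events) / total,
        "incident_leakage_rate": sum(e.caused_incident for e in events) / total,
    }
```

In practice the events would come from the kind of event-centric warehouse the article mentions; the in-memory list here only keeps the sketch self-contained.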
-
Read-Copy-Update (RCU): the Secret to Lock-Free Performance
Innovative software engineer with expertise in optimizing concurrency through advanced techniques like Read-Copy-Update (RCU). Proven track record of boosting read performance by over 110% in read-heavy workloads. Skilled in leveraging RCU principles across production systems, enhancing architecture efficiency, and streamlining data handling to maximize scalability and minimize overhead.
-
Proactive Autoscaling for Edge Applications in Kubernetes
Kubernetes often reacts too late when traffic suddenly increases at the edge. A proactive scaling approach that considers response time, spare CPU capacity, and container startup delays can add or remove instances more smoothly, prevent sudden spikes, and keep performance stable on systems with limited resources.
-
From Alert Fatigue to Agent-Assisted Intelligent Observability
As systems grow, observability becomes harder to maintain and incidents harder to diagnose. Agentic observability layers AI on existing tools, starting in read-only mode to detect anomalies and summarize issues. Over time, agents add context, correlate signals, and automate low-risk tasks. This approach frees engineers to focus on analysis and judgment.
-
Engineering Speed at Scale — Architectural Lessons from Sub-100-ms APIs
Sub‑100-ms APIs emerge from disciplined architecture using latency budgets, minimized hops, async fan‑out, layered caching, circuit breakers, and strong observability. But long‑term speed depends on culture, with teams owning p99, monitoring drift, managing thread pools, and treating performance as a shared, continuous responsibility.
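Two of the listed techniques, latency budgets and async fan-out, can be sketched together in a few lines of Python. This is an illustrative sketch only, not code from the article: the helper names `call_with_budget` and `fan_out`, the fallback values, and the single shared budget are all assumptions.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, Tuple


async def call_with_budget(call: Awaitable[Any], budget_s: float, fallback: Any) -> Any:
    """Await one downstream call, returning a fallback if it exceeds its budget."""
    try:
        return await asyncio.wait_for(call, timeout=budget_s)
    except asyncio.TimeoutError:
        # The slow call is cancelled by wait_for; serve the degraded value instead.
        return fallback


async def fan_out(
    calls: Dict[str, Tuple[Callable[[], Awaitable[Any]], Any]],
    budget_s: float,
) -> Dict[str, Any]:
    """Run all downstream calls concurrently under one shared latency budget.

    `calls` maps a name to (coroutine factory, fallback value).
    """
    names = list(calls)
    guarded = [
        call_with_budget(calls[name][0](), budget_s, calls[name][1]) for name in names
    ]
    results = await asyncio.gather(*guarded)
    return dict(zip(names, results))
```

Because the calls run concurrently, the overall response time is bounded by roughly the budget rather than by the sum of the downstream latencies.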
-
One Cache to Rule Them All: Handling Responses and In-Flight Requests with Durable Objects
Traditional caching fails to stop "thundering herds" where multiple clients trigger the same work during a miss. This article proposes using Cloudflare Durable Objects to treat in-flight work and finished results as two states of one cache entry. By routing to a single owner, systems eliminate redundant tasks. This pattern replaces complex locks with simple promises, simplifying the system design.
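The core idea, one cache entry that is either an in-flight promise or a finished result, can be sketched in-process with Python's asyncio. This is an analogy under stated assumptions, not the Cloudflare Durable Objects API: the `SingleFlightCache` class and its `get` method are hypothetical, and a single event loop stands in for the "single owner" the article routes requests to.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Tuple


class SingleFlightCache:
    """One entry per key: a pending task while work is in flight, a finished task after."""

    def __init__(self) -> None:
        self._entries: Dict[str, "asyncio.Task[str]"] = {}

    async def get(self, key: str, compute: Callable[[], Awaitable[str]]) -> str:
        task = self._entries.get(key)
        if task is None:
            # The first caller starts the work; later callers share the same task.
            task = asyncio.ensure_future(compute())
            self._entries[key] = task
        try:
            return await task
        except Exception:
            # Failures are not cached: drop the entry so the next caller retries.
            if self._entries.get(key) is task:
                del self._entries[key]
            raise


async def demo() -> Tuple[int, List[str]]:
    calls = 0

    async def slow_fetch() -> str:
        nonlocal calls
        calls += 1
        await asyncio.sleep(0.01)  # stand-in for an expensive origin request
        return "payload"

    cache = SingleFlightCache()
    # Five concurrent misses on the same key trigger only one fetch.
    results = await asyncio.gather(*(cache.get("k", slow_fetch) for _ in range(5)))
    return calls, list(results)
```

No explicit lock is needed because the check-and-insert in `get` happens without an intervening `await`, so it cannot interleave with another caller on the same event loop. Expiry and eviction are left out of this sketch.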
-
Reducing False Positives in Retrieval-Augmented Generation (RAG) Semantic Caching: a Banking Case Study
In this article, author Elakkiya Daivam discusses why Retrieval-Augmented Generation (RAG) and semantic caching techniques are powerful levers for reducing false positives in AI-powered applications. She shares insights from a production-grade evaluation of 1,000 query variations tested across seven bi-encoder models.
-
When Reverse Proxies Surprise You: Hard Lessons from Operating at Scale
Operating massive reverse proxy fleets reveals hard lessons: optimizations that work on smaller systems fail at scale; mundane oversights like missing commas cause major outages; and abstractions meant to simplify become hidden fragility points. Success requires profiling on target hardware, relentlessly monitoring boring details, keeping hot paths lean, and trusting instrumentation over theory.
-
How Causal Reasoning Addresses the Limitations of LLMs in Observability
Large language models excel at converting observability telemetry into clear summaries but struggle with accurate root cause analysis in distributed systems. LLMs often hallucinate explanations and confuse symptoms with causes. This article suggests how causal reasoning models with Bayesian inference offer more reliable incident diagnosis.
-
Zero-Downtime Critical Cloud Infrastructure Upgrades at Scale
Engineers can avoid common pitfalls in large-scale infrastructure upgrades by studying others' experiences. The article provides lessons learned from big firms like eBay and Snowflake, offering solutions for legacy systems, performance validation, and rollback planning. It emphasizes systematic preparation and clear communication to handle challenges and ensure zero-downtime upgrades at scale.
-
Backend FinOps: Engineering Cost-Efficient Microservices in the Cloud
Backend FinOps integrates financial discipline into microservices, crucial for cutting cloud costs. Challenges such as resource fragmentation and cold starts underscore the need for intelligent design, effective language choice, robust tagging, and automation. Implementing FinOps via IaC, CI/CD checks, and dynamic autoscaling (e.g., Karpenter) ensures sustained efficiency.
-
Why Is My Docker Image So Big? A Deep Dive with ‘dive’ to Find the Bloat
AI images typically bloat from massive library installations and base OS components, with large Docker images slowing AI development and increasing costs. Chirag Agrawal demonstrates how to diagnose bloat using Docker's history and the interactive 'dive' tool to examine each layer in detail. The article shows how effective diagnosis leads to targeted optimizations.