InfoQ Homepage Observability Content on InfoQ
-
Cloudflare Introduces Workflows V2 with Deterministic Execution and 50K Concurrent Workflows
Cloudflare introduces Workflows V2, a redesigned distributed workflow orchestration system with deterministic replayable execution, improved observability, and major scaling upgrades, including 50,000 concurrent instances and 2M queued workflows. It supports AI agents, data pipelines, and background processing with improved reliability across distributed systems.
-
Pinterest Engineers Eliminate CPU Zombies to Resolve Production Bottlenecks
Pinterest identified and resolved CPU starvation issues that affected machine learning training jobs on its Kubernetes-based platform, PinCompute. The engineers traced the problem to an unused Amazon ECS agent, which caused memory cgroup leaks. By disabling the agent, they stabilised performance. This case illustrates the importance of understanding system defaults for effective troubleshooting.
-
Grafana's Pyroscope 2.0 Makes Continuous Profiling Practical at Scale
Grafana Labs has launched Pyroscope 2.0, a rearchitected open-source continuous profiling database. This version improves storage costs, query performance, and operational complexity. Key changes include single write paths for profiles, stateless query processing, and enhanced capabilities for profiling data. It supports the OpenTelemetry Protocol, aligning with current trends in observability.
-
Netflix Serves 84% of Query Results from Cache with Interval-Aware Caching in Apache Druid
Netflix improves Apache Druid performance with interval aware caching, serving 84% of analytics results from cache and reducing query load by 33%. The system decomposes rolling window queries into reusable time segments, enabling partial cache reuse and recomputation only for recent data. At scale, it reduces scan volume, improves P90 latency, and optimizes real time analytics workloads.
-
How GitHub Is Securing Agentic Workflows in Modern CI CD Systems
GitHub detailed a defense-in-depth security architecture for agentic workflows in CI/CD pipelines, focusing on isolation, constrained execution, and auditability. The design aims to safely integrate autonomous AI agents while mitigating risks like prompt injection, privilege escalation, and unintended actions, using sandboxed environments, restricted permissions, and full execution traceability.
-
Cloudflare Processes 10M+ Daily Insights with New Security Overview Dashboard
Cloudflare has launched a Security Overview dashboard that consolidates security signals into prioritized action items. It surfaces millions of daily insights, helping teams identify and remediate critical risks faster. Built on distributed checkers and real-time event processing, it integrates analytics workflows to reduce investigation overhead and improve response efficiency.
-
Amazon CloudWatch Introduces OpenTelemetry Metrics Support in Preview
AWS has introduced the public preview of OpenTelemetry metrics support in Amazon CloudWatch. This update allows developers to send metrics directly to CloudWatch using the OpenTelemetry protocol and view them alongside existing AWS service metrics.
-
Grafana Rearchitects Loki with Kafka and Ships a CLI to Bring Observability into Coding Agent
At GrafanaCON 2026 in Barcelona, Grafana Labs announced Grafana 13 with the new Loki Kafka-backed architecture at the ingestion layer and the AI Observability in Grafana Cloud to monitor and evaluate AI systems in real time. In particular, the new CLI called GCX was announced, designed to surface Grafana Cloud data inside agentic development environments.
-
How Observability and Telemetry Can Enhance the Practice of Software Engineering
Observability must evolve with serverless, event-driven architectures. OpenTelemetry can decouple telemetry from vendors, letting developers emit consistent, high-quality data that explains real system behavior. Shared vocabularies and good telemetry make debugging faster and improve reliability, speed, and developer productivity.
-
OpenTelemetry Declarative Configuration Reaches Stability Milestone
The OpenTelemetry project has announced that key portions of its declarative configuration specification have reached stable status. The observability framework is a vendor-neutral and language-agnostic way to configure telemetry collection.
-
Airbnb Migrates High-Volume Metrics Pipeline to OpenTelemetry
Airbnb's observability engineering team has published details of a large-scale migration away from StatsD and a proprietary Veneur-based aggregation pipeline toward a modern, open-source metrics stack built on OpenTelemetry Protocol (OTLP), the OpenTelemetry Collector, and VictoriaMetrics' vmagent. The resulting system now ingests over 100 million samples per second in production.
-
Pinterest Reduces Spark OOM Failures by 96% through Auto Memory Retries
Pinterest Engineering cut Apache Spark out-of-memory failures by 96% using improved observability, configuration tuning, and automatic memory retries. Staged rollout, dashboards, and proactive memory adjustments stabilized data pipelines, reduced manual intervention, and lowered operational overhead across tens of thousands of daily jobs.
-
Kubernetes Autoscaling Demands New Observability Focus beyond Vendor Tooling
As adoption of Kubernetes autoscalers like Karpenter accelerates, a new set of platform-agnostic observability practices is emerging, shifting focus from traditional infrastructure metrics to deeper insights into provisioning behavior, scheduling latency, and cost efficiency.
-
Discord Engineers Add Distributed Tracing to Elixir's Actor Model without Performance Penalty
Discord engineering detailed how they added distributed tracing to Elixir's actor model. Their custom Transport library wraps messages with trace context and uses dynamic sampling to handle million-user fanouts. CPU optimizations included skipping unsampled traces and filtering context before deserialization, recovering 10+ percentage points of overhead.
-
QCon London 2026: Wrangling Telemetry at Scale, a Guide to Self-Hosted Observability
At QCon London 2026, Colin Douch discussed building and operating self-hosted monitoring stacks, surveyed the current tooling landscape, and explained how to build a coherent observability setup rather than treating logs, metrics, and traces as separate pillars.