Observability Content on InfoQ

News

Architecture & Design

Netflix Serves 84% of Query Results from Cache with Interval-Aware Caching in Apache Druid

Netflix improves Apache Druid performance with interval-aware caching, serving 84% of analytics results from cache and reducing query load by 33%. The system decomposes rolling-window queries into reusable time segments, enabling partial cache reuse and recomputation only for recent data. At scale, it reduces scan volume, improves P90 latency, and optimizes real-time analytics workloads.

Leela Kumili
on May 11, 2026
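
The decomposition described in the summary above can be illustrated with a minimal sketch. It is not Netflix's or Druid's implementation: the hourly bucket width, the function names (`split_into_buckets`, `query_with_cache`), the dictionary cache, and the assumption that the aggregate is additive (so per-segment results can simply be summed) are all hypothetical choices made for illustration.

```python
from typing import Callable, Dict, List, Tuple

Interval = Tuple[int, int]
BUCKET = 3600  # assumed segment width in seconds (one hour); hypothetical


def split_into_buckets(start: int, end: int, bucket: int = BUCKET) -> List[Interval]:
    """Split [start, end) into sub-intervals that break on bucket boundaries."""
    intervals: List[Interval] = []
    cursor = start
    while cursor < end:
        boundary = min((cursor // bucket + 1) * bucket, end)
        intervals.append((cursor, boundary))
        cursor = boundary
    return intervals


def query_with_cache(
    start: int,
    end: int,
    now: int,
    compute: Callable[[int, int], int],
    cache: Dict[Interval, int],
    bucket: int = BUCKET,
) -> int:
    """Answer an additive aggregate over [start, end), reusing cached segments.

    Only full, aligned segments that closed before the current bucket are
    cached; partial or still-open segments are recomputed on every query.
    """
    current_bucket_start = (now // bucket) * bucket
    total = 0
    for lo, hi in split_into_buckets(start, end, bucket):
        complete = (
            lo % bucket == 0 and hi - lo == bucket and hi <= current_bucket_start
        )
        if complete and (lo, hi) in cache:
            total += cache[(lo, hi)]  # cache hit: no scan needed
            continue
        value = compute(lo, hi)  # cache miss or recent data: scan this segment
        if complete:
            cache[(lo, hi)] = value
        total += value
    return total
```

When the window slides forward, a second call with the same `cache` recomputes only the newly closed segment and the still-open tail, which is the partial-reuse behavior the summary describes.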
Architecture & Design

How GitHub Is Securing Agentic Workflows in Modern CI/CD Systems

GitHub detailed a defense-in-depth security architecture for agentic workflows in CI/CD pipelines, focusing on isolation, constrained execution, and auditability. The design aims to safely integrate autonomous AI agents while mitigating risks like prompt injection, privilege escalation, and unintended actions, using sandboxed environments, restricted permissions, and full execution traceability.

Leela Kumili
on May 08, 2026
Architecture & Design

Cloudflare Processes 10M+ Daily Insights with New Security Overview Dashboard

Cloudflare has launched a Security Overview dashboard that consolidates security signals into prioritized action items. It surfaces millions of daily insights, helping teams identify and remediate critical risks faster. Built on distributed checkers and real-time event processing, it integrates analytics workflows to reduce investigation overhead and improve response efficiency.

Leela Kumili
on May 04, 2026
DevOps

Amazon CloudWatch Introduces OpenTelemetry Metrics Support in Preview

AWS has introduced the public preview of OpenTelemetry metrics support in Amazon CloudWatch. This update allows developers to send metrics directly to CloudWatch using the OpenTelemetry protocol and view them alongside existing AWS service metrics.

Renato Losio
on Apr 29, 2026
DevOps

Grafana Rearchitects Loki with Kafka and Ships a CLI to Bring Observability into Coding Agents

At GrafanaCON 2026 in Barcelona, Grafana Labs announced Grafana 13, along with a new Kafka-backed architecture for Loki at the ingestion layer and AI Observability in Grafana Cloud, which monitors and evaluates AI systems in real time. The company also announced GCX, a new CLI designed to surface Grafana Cloud data inside agentic development environments.

Claudio Masolo
on Apr 23, 2026
Culture & Methods

How Observability and Telemetry Can Enhance the Practice of Software Engineering

Observability must evolve with serverless, event-driven architectures. OpenTelemetry can decouple telemetry from vendors, letting developers emit consistent, high-quality data that explains real system behavior. Shared vocabularies and good telemetry make debugging faster and improve reliability, speed, and developer productivity.

Ben Linders
on Apr 23, 2026
DevOps

OpenTelemetry Declarative Configuration Reaches Stability Milestone

The OpenTelemetry project has announced that key portions of its declarative configuration specification have reached stable status. The specification gives the observability framework a vendor-neutral, language-agnostic way to configure telemetry collection.

Matt Saunders
on Apr 15, 2026
DevOps

Airbnb Migrates High-Volume Metrics Pipeline to OpenTelemetry

Airbnb's observability engineering team has published details of a large-scale migration away from StatsD and a proprietary Veneur-based aggregation pipeline toward a modern, open-source metrics stack built on OpenTelemetry Protocol (OTLP), the OpenTelemetry Collector, and VictoriaMetrics' vmagent. The resulting system now ingests over 100 million samples per second in production.

Claudio Masolo
on Apr 14, 2026
Architecture & Design

Pinterest Reduces Spark OOM Failures by 96% through Auto Memory Retries

Pinterest Engineering cut Apache Spark out-of-memory failures by 96% using improved observability, configuration tuning, and automatic memory retries. Staged rollout, dashboards, and proactive memory adjustments stabilized data pipelines, reduced manual intervention, and lowered operational overhead across tens of thousands of daily jobs.

Leela Kumili
on Apr 06, 2026
DevOps

Kubernetes Autoscaling Demands New Observability Focus beyond Vendor Tooling

As adoption of Kubernetes autoscalers like Karpenter accelerates, a new set of platform-agnostic observability practices is emerging, shifting focus from traditional infrastructure metrics to deeper insights into provisioning behavior, scheduling latency, and cost efficiency.

Craig Risi
on Mar 31, 2026
Development

Discord Engineers Add Distributed Tracing to Elixir's Actor Model without Performance Penalty

Discord engineering detailed how they added distributed tracing to Elixir's actor model. Their custom Transport library wraps messages with trace context and uses dynamic sampling to handle million-user fanouts. CPU optimizations included skipping unsampled traces and filtering context before deserialization, recovering 10+ percentage points of overhead.

Steef-Jan Wiggers
on Mar 28, 2026
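
The ideas in the summary above (wrapping messages with trace context, lowering the sampling rate for large fanouts, and skipping tracing work for unsampled messages) can be sketched roughly as follows. Discord's library is written in Elixir and its API is not shown in the summary, so everything here is a hypothetical Python illustration: the `Envelope` type, the `sample_rate` formula, the `budget` parameter, and the `start_span` callback are all assumptions.

```python
import random
from dataclasses import dataclass
from typing import Any, Callable, ContextManager, Optional


@dataclass(frozen=True)
class Envelope:
    """A message plus optional trace context (None means not sampled)."""
    payload: Any
    trace_context: Optional[dict]


def sample_rate(fanout: int, base_rate: float = 1.0, budget: int = 100) -> float:
    """Shrink the sampling rate as fanout grows, so that on average about
    `budget` recipients of a single fanout carry trace context."""
    if fanout <= budget:
        return base_rate
    return base_rate * budget / fanout


def wrap(
    payload: Any,
    trace_context: Optional[dict],
    fanout: int,
    rng: Callable[[], float] = random.random,
) -> Envelope:
    """Attach trace context only when this message is sampled."""
    if trace_context is not None and rng() < sample_rate(fanout):
        return Envelope(payload, trace_context)
    return Envelope(payload, None)


def handle(
    envelope: Envelope,
    on_message: Callable[[Any], Any],
    start_span: Callable[[dict], ContextManager[Any]],
) -> Any:
    """Process a message, creating a span only for sampled envelopes."""
    if envelope.trace_context is None:
        return on_message(envelope.payload)  # unsampled: no tracing overhead
    with start_span(envelope.trace_context):
        return on_message(envelope.payload)
```

Deciding at send time whether to attach context means the receiver can take the cheap path with a single `None` check, which loosely mirrors the "skip unsampled traces" optimization mentioned in the summary.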
DevOps

QCon London 2026: Wrangling Telemetry at Scale, a Guide to Self-Hosted Observability

At QCon London 2026, Colin Douch discussed building and operating self-hosted monitoring stacks, surveyed the current tooling landscape, and explained how to build a coherent observability setup rather than treating logs, metrics, and traces as separate pillars.

Renato Losio
on Mar 19, 2026
DevOps

QCon London 2026: Uncorking Queueing Bottlenecks with OpenTelemetry

At QCon London 2026, Julian Wreford and Oli Lane from Gearset showcased how distributed tracing and SLOs solve asynchronous observability gaps. By shifting from queue-size metrics to latency-based alerts, the team improved incident response. Key technical takeaways included using OpenTelemetry trace state for async duration tracking and wide events to uncover hidden architectural waste.

Mark Silvester
on Mar 18, 2026
DevOps

Elastic Releases Version 9.3.0 with Enhanced AI Tools and OTel Support

Elastic 9.3.0 is now available, featuring enhanced vector search indexing for RAG applications and significant upgrades to the ES|QL query language. The release deepens OpenTelemetry integration for vendor-neutral observability and updates the AI Assistant with better contextual analysis. Security visibility is also expanded across Kubernetes and serverless architectures.

Mark Silvester
on Mar 15, 2026
Architecture & Design

Hybrid Cloud Data at Uber: How Engineers Solved Extreme-Scale Replication Challenges

Uber’s HiveSync team optimized Hadoop DistCp to handle multi-petabyte replication across hybrid cloud and on-premises data lakes. Enhancements include task parallelization, Uber jobs for small transfers, and improved observability, enabling 5x replication capacity and seamless on-premises-to-cloud migration.

Leela Kumili
on Mar 02, 2026
