Observability Content on InfoQ
-
Kubernetes Autoscaling Demands New Observability Focus beyond Vendor Tooling
As adoption of Kubernetes autoscalers like Karpenter accelerates, a new set of platform-agnostic observability practices is emerging, shifting focus from traditional infrastructure metrics to deeper insights into provisioning behavior, scheduling latency, and cost efficiency.
-
Discord Engineers Add Distributed Tracing to Elixir's Actor Model without Performance Penalty
Discord engineering detailed how they added distributed tracing to Elixir's actor model. Their custom Transport library wraps messages with trace context and uses dynamic sampling to handle million-user fanouts. CPU optimizations included skipping unsampled traces and filtering context before deserialization, recovering 10+ percentage points of overhead.
-
QCon London 2026: Wrangling Telemetry at Scale, a Guide to Self-Hosted Observability
At QCon London 2026, Colin Douch discussed building and operating self-hosted monitoring stacks, surveyed the current tooling landscape, and explained how to build a coherent observability setup rather than treating logs, metrics, and traces as separate pillars.
-
QCon London 2026: Uncorking Queueing Bottlenecks with OpenTelemetry
At QCon London 2026, Julian Wreford and Oli Lane from Gearset showcased how distributed tracing and SLOs solve asynchronous observability gaps. By shifting from queue-size metrics to latency-based alerts, the team improved incident response. Key technical takeaways included using OpenTelemetry trace state for async duration tracking and wide events to uncover hidden architectural waste.
-
Elastic Releases Version 9.3.0 with Enhanced AI Tools and OTel Support
Elastic 9.3.0 is now available, featuring enhanced vector search indexing for RAG applications and significant upgrades to the ES|QL query language. The release deepens OpenTelemetry integration for vendor-neutral observability and updates the AI Assistant with better contextual analysis. Security visibility is also expanded across Kubernetes and serverless architectures.
-
Hybrid Cloud Data at Uber: How Engineers Solved Extreme-Scale Replication Challenges
Uber’s HiveSync team optimized Hadoop DistCp to handle multi-petabyte replication across hybrid cloud and on-premise data lakes. Enhancements include task parallelization, Uber jobs for small transfers, and improved observability, enabling 5x replication capacity and seamless on-premise-to-cloud migration.
-
OpenAI Introduces Harness Engineering: Codex Agents Power Large‑Scale Software Development
OpenAI introduces Harness Engineering, an AI-driven methodology where Codex agents generate, test, and deploy a million-line production system. The platform integrates observability, architectural constraints, and structured documentation to automate key software development workflows.
-
OpenTelemetry Project Publishes “Demystifying OpenTelemetry” Guide to Broaden Observability Adoption
The OpenTelemetry open-source observability project recently published a comprehensive guide titled "Demystifying OpenTelemetry" aimed at helping organizations understand, adopt, and scale observability using the OpenTelemetry standard.
-
LinkedIn Leverages GitHub Actions, CodeQL, and Semgrep for Code Scanning
LinkedIn has rebuilt its static application security testing (SAST) pipeline using GitHub Actions and custom workflows, enabling consistent, enforceable code scanning across thousands of repositories. The redesign improves security coverage, developer workflow, and observability while supporting the company’s shift-left strategy.
-
Datadog Integrates Google Agent Development Kit into LLM Observability Tools
Datadog recently announced that its LLM Observability platform now provides automatic instrumentation for applications built with Google's Agent Development Kit (ADK), offering deeper visibility into the behavior, performance, cost, and safety of AI-driven agentic systems.
-
Airbnb Expands Global Checkout with “Pay as a Local,” Scaling to 220 Markets in 14 Months
Airbnb expands its global checkout with the “Pay as a Local” initiative, supporting over 20 locally preferred payment methods across 220 markets. The company replatformed its payments system with domain-oriented services, reusable flow archetypes, and a centralized configuration, enhancing integration speed, reliability, testing, and observability for diverse payment methods worldwide.
-
Uber Gets Ready for AI in Network Observability with Cloud Native Overhaul
Transportation company Uber has published a detailed account of its new observability platform on its blog, highlighting that, for the company, network visibility is now a strategic capability rather than a set of discrete monitoring tools.
-
Railway Highlights the Importance of Logs, Metrics, Traces, and Alerts for Diagnosing System Failure
Railway’s engineering team published a comprehensive guide to observability, explaining how developers and SRE teams can use logs, metrics, traces, and alerts together to understand and diagnose production system failures.
-
HL is a Fast, Rust-Based JSON Log Viewer Offering up to 2GiB/s Parsing Speed
Open-source log viewer hl is designed for efficient processing of structured logs in JSON or logfmt format. Built in Rust, it provides fast indexing and parsing, enabling users to scan very large log files quickly, whether they are uncompressed or compressed.
-
Grab Adds Real-Time Data Quality Monitoring to Its Platform
Grab updated its internal platform to monitor Apache Kafka data quality in real time. The system uses FlinkSQL and an LLM to detect syntactic and semantic errors. It currently tracks 100+ topics, preventing invalid data from reaching downstream users. This proactive strategy aligns with industry trends to treat data streams as reliable products.