Observability Content on InfoQ
-
Grab Adds Real-Time Data Quality Monitoring to Its Platform
Grab updated its internal platform to monitor Apache Kafka data quality in real time. The system uses FlinkSQL and an LLM to detect syntactic and semantic errors. It currently tracks 100+ topics, preventing invalid data from reaching downstream users. This proactive strategy aligns with the industry trend of treating data streams as reliable products.
-
AWS Distributed Tracing Service X-Ray Transitions to OpenTelemetry
AWS recently announced that AWS X-Ray is transitioning to OpenTelemetry as its primary instrumentation standard for application tracing, with the AWS X-Ray SDKs and Daemon moving to maintenance mode.
-
Groundcover Takes Aim at Datadog with Observability Migration Tool
Observability platform company Groundcover has launched a new migration tool to help organisations move their observability stacks from other vendors (such as Datadog) to its own platform. The company claims that organisations can migrate metrics, dashboards and monitors with full automation, and without downtime or consultants.
-
Grafana Unveils Smarter Logs, an MCP Server, and TraceQL Upgrades in Latest Releases
Grafana Labs has published major updates to two of its core observability products: Grafana 12.3 and Grafana Tempo 2.9. The two releases bring distinct improvements to monitoring, logs, and tracing for Grafana users.
-
Grafana Labs Releases Mimir 3.0 with Redesigned Architecture for Enhanced Performance
Grafana Labs has released Grafana Mimir 3.0. This is a significant advancement for the open-source, horizontally scalable time series database. The release features a new design that separates read and write operations. This change greatly boosts performance, reliability, and cost efficiency for organizations handling metrics at scale.
-
Inside Uber’s Query Architecture: Simplifying Layers and Improving Observability
Uber rebuilt its Apache Pinot query architecture, replacing the Presto-based Neutrino system with a lightweight proxy called Cellar and Pinot’s Multi-Stage Engine Lite Mode. The redesign simplifies SQL execution, improves resource management, and ensures predictable performance for large-scale analytics workloads.
-
QCon London 2026 Announces Tracks: AI Engineering, Building Teams, Tech of Finance, and More
The QCon London 2026 tracks are live: 15 practitioner-curated deep dives on AI adoption, resilient architectures, distributed systems, performance, modern languages, data, security, and Staff+ leadership, rooted in real production lessons.
-
Flipkart Scales Prometheus to 80 Million Metrics Using Hierarchical Federation
Flipkart engineers recently published a detailed case study describing how they overcame severe scalability limits in monitoring by adopting a hierarchical federation design in Prometheus.
-
Vercel Introduces Drains for Unified Data Export
Vercel has released Vercel Drains, a system for exporting observability data from its platform into external services. The feature unifies logs, distributed traces, web analytics events, and performance metrics into a single streaming mechanism.
-
Google Cloud Observability Adopts OpenTelemetry Protocol for Native Trace Ingestion
Google Cloud has announced native support for the OpenTelemetry Protocol (OTLP) in its Cloud Trace service, marking a significant step toward vendor-neutral observability infrastructure. The new capability allows developers to send trace data directly using OTLP through the telemetry.googleapis.com endpoint, eliminating the need for vendor-specific exporters and custom data transformations.
-
Datadog Launches Monocle, a Unified Rust-Powered Real-Time Metrics Engine
Datadog has launched Monocle, a new real-time time series storage engine written in Rust. The system unifies the company’s metrics storage infrastructure, delivering higher ingestion throughput and lower query latency while reducing operational complexity. Monocle replaces several generations of storage backends, addressing concurrency challenges and scaling limits that accumulated over time.
-
PagerDuty's Kafka Outage Silences Alerts for Thousands of Companies
PagerDuty, the incident management platform used by thousands of organisations to alert them to problems on their systems, suffered a major outage itself on 28 August 2025. In a comprehensive outage report, the company detailed the scope of the problem, the customer impact, and how it is working to prevent a recurrence.
-
Azure Service Groups Enter Public Preview Offering New Abstraction Layer for Resource Management
Microsoft has launched Azure Service Groups in public preview, a new feature designed to simplify resource management and administration. Acting as a flexible, tenant-level container, Service Groups allow users to organize Azure resources from anywhere within their tenant without affecting RBAC or policy inheritance.
-
Honeycomb Hosted MCP Brings Observability Data into the IDE
Honeycomb has launched its hosted Model Context Protocol (MCP) server, giving developers real-time access to observability data inside IDEs and AI tools like GitHub Copilot. Available as a managed service on AWS Marketplace, it removes the need for self-hosting and streamlines debugging by surfacing traces, metrics, and logs without context-switching.
-
Grafana 12.1 Brings Built-in Diagnostics and Enhanced Alerting
Grafana 12.1 is here, elevating system reliability and alert management with features like Grafana Advisor for health checks, a revamped alerting interface, and trendline transformations for smarter data visualization. Enhanced dashboard interactivity and improved variable handling empower teams to scale efficiently. Experience the new era of Grafana on Cloud or self-hosted!