Internet infrastructure and security company Cloudflare has documented how it significantly upgraded its logging pipeline by moving from syslog-ng to OpenTelemetry Collector.
The logging pipeline is one of Cloudflare's biggest data pipelines and is critical infrastructure, as it collects and processes millions of log events per second from every server in its network. This pipeline previously relied on syslog-ng, a widely used open-source logging solution, and moving to OpenTelemetry Collector was a significant shift in how Cloudflare handles its vast quantities of log data.
According to engineers Colin Douch and Jayson Cena, who detailed the migration in a company blog post, there were several motivations for this change:
- Language compatibility: OpenTelemetry Collector is written in Go, which is more familiar to Cloudflare's engineering team than the C language used by syslog-ng. This change allows more of Cloudflare's engineers to contribute to and improve the logging system.
- Easier integration with internal libraries: Building syslog-ng with Cloudflare's internal Post-Quantum cryptography libraries was a challenge. The Go-based OpenTelemetry Collector simplifies this process.
- Enhanced metrics: OpenTelemetry Collector supports Prometheus metrics, allowing the team to gather more detailed telemetry data about the collectors' performance.
- Unified telemetry infrastructure: Cloudflare already uses OpenTelemetry Collectors for some of its tracing infrastructure. Consolidating different types of telemetry into a single system reduces complexity for the engineering team.
As part of the migration, the engineers developed several custom components to maintain compatibility with existing systems and meet specific needs. These components included a custom exporter for Cloudflare's own log format, a modified file exporter that supports additional output formats, processors that enrich log entries with JSON data from external sources, and per-service rate limiters that prevent any single service from overwhelming the logging pipeline.
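To illustrate the per-service rate limiting idea, here is a minimal token-bucket sketch in Python. It is not Cloudflare's implementation (the real component is a collector processor written in Go), and the names `TokenBucket` and `PerServiceLimiter`, as well as the `rate` and `capacity` parameters, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    """Token bucket for one service (hypothetical helper, not a collector API)."""
    rate: float      # tokens added per second
    capacity: float  # maximum burst size
    tokens: float    # tokens currently available
    updated: float   # time of the last refill, in seconds

    def allow(self, now: float) -> bool:
        # Refill in proportion to the elapsed time, capped at the burst size.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class PerServiceLimiter:
    """Keeps one bucket per service so a noisy service only throttles itself."""

    def __init__(self, rate: float, capacity: float) -> None:
        self.rate = rate
        self.capacity = capacity
        self.buckets = {}  # service name -> TokenBucket

    def allow(self, service: str, now: float) -> bool:
        bucket = self.buckets.get(service)
        if bucket is None:
            # A new service starts with a full bucket.
            bucket = TokenBucket(self.rate, self.capacity, self.capacity, now)
            self.buckets[service] = bucket
        return bucket.allow(now)
```

Because each service has its own bucket, a log record from a service that has exhausted its budget is dropped while records from other services continue to pass.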
Cloudflare used two rollout strategies. Its core data centres, with their custom configurations and diverse workloads, needed a careful hands-on approach, while its edge data centres, with simpler and more homogeneous configurations, allowed a gradual rollout with careful monitoring.
The migration surfaced several problems. In one failover case, the new exporter initially failed to detect connectivity problems with the primary logging server, which caused a backlog of logs and affected some services during chaos testing. The gap between stopping syslog-ng and starting the OpenTelemetry Collector also caused brief interruptions in log collection, affecting some services that write logs in blocking mode.
Cloudflare addressed these issues by implementing tighter timeouts in its custom exporter, modifying the failover behaviour, and adjusting its deployment process to minimise downtime during the transition. Future plans include implementing more sophisticated log sampling techniques, such as tail sampling, and contributing some of the custom components back to the open-source community.
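The combination of a tight timeout and failover can be sketched as follows. This is an illustrative Python model only, assuming an injected `send` callable that raises `TimeoutError` or `ConnectionError` when an endpoint is unreachable. The class name `FailoverExporter`, the `cooldown` parameter, and the default values are hypothetical and do not describe Cloudflare's exporter.

```python
class FailoverExporter:
    """Tries endpoints in priority order and skips ones that failed recently."""

    def __init__(self, endpoints, send, timeout=0.5, cooldown=30.0):
        self.endpoints = list(endpoints)  # highest priority first
        self.send = send                  # callable(endpoint, batch, timeout)
        self.timeout = timeout            # seconds to wait before giving up
        self.cooldown = cooldown          # seconds to avoid a failed endpoint
        self.retry_at = {}                # endpoint -> earliest retry time

    def export(self, batch, now):
        last_error = None
        for endpoint in self.endpoints:
            # Skip endpoints that failed recently instead of blocking on them.
            if now < self.retry_at.get(endpoint, 0.0):
                continue
            try:
                self.send(endpoint, batch, self.timeout)
                return endpoint
            except (TimeoutError, ConnectionError) as exc:
                # A short timeout turns a silent hang into a fast failure.
                self.retry_at[endpoint] = now + self.cooldown
                last_error = exc
        raise RuntimeError("no logging endpoint accepted the batch") from last_error
```

With a short timeout, an unresponsive primary produces an error quickly rather than letting batches queue behind it, and the cooldown keeps later batches flowing to the secondary until the primary is retried.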
Cloudflare is far from alone in moving to OpenTelemetry, with other big companies such as Shopify, Splunk, Google and GitHub also adopting the technology. In a Google Cloud webinar, some of these organisations detailed their OpenTelemetry usage.
- Google is using OpenTelemetry in several products: it uses the collector in Google Kubernetes Engine and Google Compute Engine, and is replacing OpenCensus SDKs in Cloud Monitoring and Cloud Trace.
- Splunk is adopting OpenTelemetry internally and contributing extensively to the project: it uses the collector for infrastructure monitoring, is moving to OpenTelemetry client libraries, and contributes to collector and auto-instrumentation development.
- Shopify is migrating its trace collection infrastructure to OpenTelemetry, and implementing PII redaction, sampling, and span renaming in the collector.
Similarly, GitHub is adopting OpenTelemetry to standardise its telemetry practices. In a blog post, GitHub engineers detailed how the company used tools such as statsd for metrics, syslog for text logs, and OpenTracing for request traces, which led to interoperability challenges and repeated solutions for each new system.
GitHub is implementing OpenTelemetry in several ways. It uses OTLP (OpenTelemetry Protocol) as a standard, vendor-neutral format for telemetry signals, and leverages auto-instrumentation for Ruby and Postgres to add distributed tracing automatically. The open standards also let GitHub automatically correlate different signals, using OpenTelemetry tracing as the root.
GitHub believes this approach will allow it to derive additional signals automatically once tracing is in place, such as calculating metrics from traces and converting tracing events into detailed logs. The company is also contributing to the OpenTelemetry project to benefit the wider community.
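As a rough illustration of deriving metrics from traces, the following Python sketch aggregates finished spans into per-operation counts, error counts, and total duration. The span shape (a dict with `name`, `start_ms`, `end_ms`, and an optional `error` flag) and the function name `metrics_from_spans` are assumptions made for this example, not GitHub's data model or an OpenTelemetry API.

```python
from collections import defaultdict


def metrics_from_spans(spans):
    """Aggregate spans into simple per-operation metrics (hypothetical helper)."""
    metrics = defaultdict(lambda: {"count": 0, "errors": 0, "total_ms": 0.0})
    for span in spans:
        entry = metrics[span["name"]]
        entry["count"] += 1
        if span.get("error"):
            entry["errors"] += 1
        # Duration is the difference between the span's end and start times.
        entry["total_ms"] += span["end_ms"] - span["start_ms"]
    return dict(metrics)
```

Since every span already carries a name, timing, and status, request counts, error rates, and latency totals fall out of the trace data without separate metric instrumentation.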