Cloudflare Identifies Query Planning Bottleneck in ClickHouse

Cloudflare recently described how a slowdown in its billing pipeline was traced to contention inside the query planning stage of ClickHouse. The team profiled the bottleneck and patched ClickHouse to replace an exclusive lock with a shared lock, drop the per-query copy of the parts list, and improve part filtering.

Cloudflare's pipeline supports billing and fraud systems, so delays can affect multiple downstream services. After a migration that significantly increased the number of data parts, the daily aggregation jobs in ClickHouse became much slower, even though standard performance metrics such as I/O, memory usage, and scanned rows appeared normal.

Average SELECT query durations. Source: Cloudflare blog

Cloudflare handles hundreds of petabytes of data and adopted ClickHouse before the database supported built-in data expiration. It therefore built its own retention system, splitting data in the Ready-Analytics table into daily partitions and deleting anything older than 31 days. James Morrison and Christian Endres, senior distributed systems engineers at Cloudflare, explain:

This system is popular, with hundreds of applications using it. It had already grown to more than 2PiB of data by December 2024, and an ingestion rate of millions of rows per second. But it had one critical flaw: its retention policy.
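As a rough illustration of the original scheme, with daily partitions and one 31-day window applied to all data, the expiry decision might look like the following Python sketch. The function and variable names are hypothetical and are not taken from Cloudflare's system; only the 31-day figure comes from the article.

```python
from datetime import date, timedelta

# Retention window described in the article; everything else is illustrative.
RETENTION_DAYS = 31


def expired_partitions(partitions, today, retention_days=RETENTION_DAYS):
    """Return the daily partitions that fall outside the retention window.

    `partitions` is an iterable of dates, one per daily partition.
    A partition dated exactly `retention_days` ago is still kept.
    """
    cutoff = today - timedelta(days=retention_days)
    return sorted(day for day in partitions if day < cutoff)


if __name__ == "__main__":
    days = [date(2024, 11, 29), date(2024, 11, 30), date(2024, 12, 30)]
    for day in expired_partitions(days, today=date(2024, 12, 31)):
        print("would drop partition", day.isoformat())
```

Because the partition key here is only the day, every tenant shares the same window, which is the limitation the migration set out to remove.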

ClickHouse is an open-source analytical database designed for fast analysis of large data volumes, commonly used for logs, metrics, analytics, and real-time reporting. During the migration, Cloudflare changed its ClickHouse partitioning scheme to include customer namespaces, allowing data retention to be managed separately per tenant. The number of parts accessed per query was expected to remain the same under the new partitioning scheme, yet queries became much slower. Morrison and Endres summarize the issue and its root cause:

A huge amount of time was being spent in query planning. This is the phase before execution when ClickHouse decides which parts to read (...) 45% of the sampled CPU time was being spent in a single function called filterPartsByPartition (...) The problem wasn't CPU-bound work; it was massive lock contention. More than half of our query duration was spent waiting to acquire a single mutex (MergeTreeData) that protects the table's list of parts.
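The difference between an exclusive mutex and the shared lock that Cloudflare moved to can be sketched in a few lines of Python. This is a generic readers-writer lock, not ClickHouse's C++ implementation: the class and method names are hypothetical, and the sketch makes no attempt to prevent writer starvation.

```python
import threading


class SharedLock:
    """Minimal readers-writer lock: many concurrent readers or one writer."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0      # number of threads currently holding the lock in shared mode
        self._writer = False   # True while one thread holds it exclusively

    def acquire_shared(self):
        # Readers only wait for an active writer, never for each other.
        with self._cond:
            while self._writer:
                self._cond.wait()
            self._readers += 1

    def release_shared(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()

    def try_acquire_exclusive(self, timeout):
        # A writer needs the lock entirely to itself; give up after `timeout` seconds.
        with self._cond:
            free = self._cond.wait_for(
                lambda: not self._writer and self._readers == 0, timeout
            )
            if free:
                self._writer = True
            return free

    def release_exclusive(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()
```

With a plain mutex, every query that only reads the parts list would queue behind every other query. With a shared lock, readers proceed together and only modifications to the list need exclusive access.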

The team patched ClickHouse with three changes to reduce the slowdown: it replaced the exclusive lock with a shared lock, removed the per-query copy of the full list of data parts, and improved part filtering so that it no longer scans the entire list each time. Together, the changes significantly reduced query latency and stabilized performance as part counts continued to grow. Morrison and Endres conclude:

After deploying this patch in March 2026, query durations dropped by 50%. More importantly, this finally breaks correlation of query durations with the number of parts.

Source: Cloudflare blog
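The article does not describe how the patched filtering avoids the full scan. One generic way to get that property is to group parts by partition so that a lookup touches only the requested partitions. The Python sketch below contrasts the two approaches under that assumption; `PartsIndex`, `filter_by_scan`, and the partition and part names are all hypothetical.

```python
from collections import defaultdict


def filter_by_scan(all_parts, wanted_partitions):
    """Baseline: walk every (partition_id, part_name) pair in the table."""
    wanted = set(wanted_partitions)
    return [name for partition_id, name in all_parts if partition_id in wanted]


class PartsIndex:
    """Parts grouped by partition, so lookups skip unrelated partitions."""

    def __init__(self):
        self._by_partition = defaultdict(list)

    def add_part(self, partition_id, part_name):
        self._by_partition[partition_id].append(part_name)

    def parts_for(self, wanted_partitions):
        # Cost depends on the partitions requested, not on the total part count.
        result = []
        for partition_id in wanted_partitions:
            result.extend(self._by_partition.get(partition_id, ()))
        return result


if __name__ == "__main__":
    parts = [
        ("tenant_a/2024-12-01", "part_1"),
        ("tenant_b/2024-12-01", "part_2"),
        ("tenant_a/2024-12-02", "part_3"),
    ]
    index = PartsIndex()
    for partition_id, name in parts:
        index.add_part(partition_id, name)
    print(index.parts_for(["tenant_a/2024-12-01"]))
```

Both functions return the same parts for the same request. The indexed version is what lets query planning time stop tracking the total number of parts, which is the effect the quote above describes.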

While most practitioners on Reddit focus on the recent layoffs and the "single massive table" design, Edydh Marquez Avila, field engineer at Park Place Technologies, comments on LinkedIn:

Cloudflare’s ClickHouse investigation is a good reminder that modern infrastructure failures increasingly happen in coordination layers, not obvious resource limits (...) The interesting signal is broader than ClickHouse itself (...) High-level telemetry alone is no longer enough for diagnosing large-scale systems under concurrency. Low-level execution visibility still matters.

While the fixes stabilized query performance and resolved the immediate billing slowdown, the underlying partitioning design may still create operational problems as the number of data parts continues to grow. According to the authors, the increasing metadata load has also affected ZooKeeper, which manages ClickHouse cluster coordination. Describing the current state as an "uneasy truce", the team questions whether the architecture will remain sustainable in the long term.

Cloudflare contributed the changes to the ClickHouse project, where they were merged upstream and became available starting with version 25.11.

About the Author

Renato Losio
