This article details Cloudflare's journey in optimizing their petabyte-scale ClickHouse analytics platform to support per-tenant data retention. It highlights a critical performance bottleneck discovered after a partitioning key change, caused by lock contention and inefficient part list processing during query planning. The engineering team's systematic approach to diagnosis using flame graphs and their subsequent open-source contributions to ClickHouse offer valuable lessons in debugging and optimizing complex distributed database systems.
Read original on Cloudflare BlogCloudflare faced a significant challenge with their Ready-Analytics platform, a petabyte-scale ClickHouse deployment serving hundreds of internal teams. The initial design used a single, massive table partitioned by day, leading to a "one-size-fits-all" 31-day data retention policy. This limitation prevented many use cases requiring longer retention periods from utilizing the streamlined Ready-Analytics system.
To enable per-tenant retention, Cloudflare considered two main approaches: a table-per-namespace model (which would require extensive automation) or modifying the existing table's partitioning key. They chose the latter, changing the partitioning key from `(day)` to `(namespace, day)`. This allowed their existing retention system to manage partitions with per-namespace granularity, albeit with the expected increase in the total number of data parts. A key assumption was that performance would be unaffected for individual queries, as they are always filtered by `namespace`.
Despite the initial assumption, billing jobs began to slow down progressively after the migration. Traditional ClickHouse performance metrics (I/O, memory, rows scanned) showed no issues. The root cause was discovered through detailed investigation using ClickHouse's `trace_log` and flame graphs, which revealed massive lock contention during the query planning phase. Every concurrent query thread was acquiring an exclusive lock on the `MergeTreeData` mutex to copy the entire list of parts before filtering, creating a single-file line bottleneck.
Lesson Learned: Beyond Traditional Metrics
This case highlights that traditional performance metrics (CPU, I/O, memory) might not always pinpoint subtle bottlenecks like lock contention. Advanced profiling tools like flame graphs, especially with 'Real' traces that capture waiting threads, are crucial for diagnosing such issues in complex distributed systems.
These optimizations not only resolved Cloudflare's immediate crisis but were also contributed back to the ClickHouse open-source project, benefiting the wider community. This demonstrates the value of deep internal system debugging and collaboration in improving core infrastructure components.