Cloudflare Blog·May 14, 2026

Optimizing ClickHouse Partitioning for Multi-Tenant Analytics at Petabyte Scale

This article details Cloudflare's journey in optimizing their petabyte-scale ClickHouse analytics platform to support per-tenant data retention. It highlights a critical performance bottleneck discovered after a partitioning key change, caused by lock contention and inefficient part list processing during query planning. The engineering team's systematic approach to diagnosis using flame graphs and their subsequent open-source contributions to ClickHouse offer valuable lessons in debugging and optimizing complex distributed database systems.

Databases & Storage Distributed Systems Performance & Scaling

Read original on Cloudflare Blog

Cloudflare faced a significant challenge with their Ready-Analytics platform, a petabyte-scale ClickHouse deployment serving hundreds of internal teams. The initial design used a single, massive table partitioned by day, leading to a "one-size-fits-all" 31-day data retention policy. This limitation prevented many use cases requiring longer retention periods from utilizing the streamlined Ready-Analytics system.

Architectural Shift: Per-Tenant Retention

To enable per-tenant retention, Cloudflare considered two main approaches: a table-per-namespace model (which would require extensive automation) or modifying the existing table's partitioning key. They chose the latter, changing the partitioning key from `(day)` to `(namespace, day)`. This allowed their existing retention system to manage partitions with per-namespace granularity, albeit with the expected increase in the total number of data parts. A key assumption was that performance would be unaffected for individual queries, as they are always filtered by `namespace`.

The Hidden Bottleneck: Query Planning Lock Contention

Despite the initial assumption, billing jobs began to slow down progressively after the migration. Traditional ClickHouse performance metrics (I/O, memory, rows scanned) showed no issues. The root cause was discovered through detailed investigation using ClickHouse's `trace_log` and flame graphs, which revealed massive lock contention during the query planning phase. Every concurrent query thread was acquiring an exclusive lock on the `MergeTreeData` mutex to copy the entire list of parts before filtering, creating a single-file line bottleneck.

💡

Lesson Learned: Beyond Traditional Metrics

This case highlights that traditional performance metrics (CPU, I/O, memory) might not always pinpoint subtle bottlenecks like lock contention. Advanced profiling tools like flame graphs, especially with 'Real' traces that capture waiting threads, are crucial for diagnosing such issues in complex distributed systems.

Systematic Optimization and Open-Source Contributions

Optimization 1: Shared Lock for Read Operations: The query planner only reads the parts list, so changing the exclusive lock to a `std::shared_lock` allowed concurrent access, significantly reducing contention and query duration.
Optimization 2: Deferred Vector Copy: The next bottleneck was the continuous copying of the massive parts vector. Cloudflare introduced a "shared copy" mechanism, where read-only operations access a cached copy, and the cache is regenerated only when the set of parts changes. Planners now only copy the filtered, relevant parts.
Optimization 3: Binary Search for Part Pruning: Even after the first two optimizations, performance still degraded with increasing part counts. Recognizing that the parts list is sorted by the partitioning key (starting with `namespace`), a binary search was implemented to efficiently prune parts based on the namespace filter, drastically reducing the search space and breaking the correlation between query duration and total part count.

These optimizations not only resolved Cloudflare's immediate crisis but were also contributed back to the ClickHouse open-source project, benefiting the wider community. This demonstrates the value of deep internal system debugging and collaboration in improving core infrastructure components.

ClickHouseAnalytics DatabasePartitioningPerformance OptimizationLock ContentionFlame GraphsMulti-tenancyDistributed Systems

Comments

Loading comments...

Architecture Design

Design this yourself

Design a petabyte-scale, multi-tenant analytics platform using a distributed columnar database like ClickHouse or Apache Druid. Focus on how you would architect the data partitioning strategy to support dynamic, per-tenant data retention policies while maintaining query performance and avoiding bottlenecks related to metadata access or query planning in the presence of billions of data parts and high concurrency. Detail your approach to monitoring and debugging performance issues.

Practice Interview

Other design angles

· Design an analytics platform that supports per-tenant retention using a table-per-tenant model. Discuss the challenges of managing thousands of tables and the automation required.· Design a real-time analytics system for network observability, similar to Cloudflare's use case, ensuring low-latency queries and efficient data ingestion and retention at scale.· Detail the architectural considerations for migrating a large-scale data platform from a single retention policy to a flexible, per-tenant retention model, covering data migration, schema changes, and potential performance pitfalls.