This article details how Cloudflare scaled its Security Insights system by over 10x, addressing critical performance bottlenecks in Kafka processing, database writes, API latency, and scheduler efficiency. It serves as a practical case study for optimizing distributed systems under heavy load, highlighting common challenges and their solutions in real-world scenarios.
Read original on Cloudflare BlogCloudflare's Security Insights platform provides automated security recommendations by scanning customer accounts. Facing issues with infrequent scans and an inability to scan all free accounts, the team needed to increase scanning throughput by 10x, from 10 to 100 scans per second. The existing system was struggling with high Kafka backlog, API timeouts, and process crashes, necessitating a comprehensive architectural review and optimization effort.
The original system utilized Apache Kafka as a message bus. A scheduler triggered scans by publishing messages to Kafka, which were then consumed by specialized Go microservices called 'checkers'. These checkers performed specific scans and sent results to an internal API, which persisted the insights into a PostgreSQL database.
Kafka's partitioned event stream processes messages in order within a partition, limiting consumers to one active per partition. To overcome this without increasing Kafka partitions (due to shared broker resources), Cloudflare introduced parallel processing within checkers by consuming messages in batches and processing each in a separate goroutine. To address head-of-line blocking caused by slow-processing messages, they implemented a 'slow lane' and 'fast lane' consumer group split, allowing quick identification and skipping of slow messages by fast lane checkers.
The initial database write strategy involved individual `INSERT ... ON CONFLICT DO UPDATE` statements for each insight, leading to half a million round trips for large batches. The solution adopted a hybrid approach: using PostgreSQL's `UNNEST` for smaller sets of insights (faster for small batches) and the `COPY` command into a temporary table for larger sets (efficient for bulk inserts), balancing performance and avoiding system table bloat.
Significant API timeouts and performance degradation were traced to an active-active API deployment with instances in Portland (primary database location) and Amsterdam. The long round-trip latency between Amsterdam API instances and the Portland database led to slow queries, connection pool exhaustion, and uneven Kafka partition processing. The fix was a simple but effective architectural change: switching the API to an active-passive configuration, ensuring the active API instance co-located with the primary database.
The original scheduler caused spiky loads due to accounts having similar `last_scheduled_at` times and large accounts triggering cascades of zone scans. Solutions included scheduling zones independently of accounts, randomizing `last_scheduled_at` for existing entries to smooth distribution, and implementing adaptive rate limiting. The adaptive rate limiter dynamically recalculates the scan rate based on total accounts/zones and desired frequency, preventing spikes when increasing scan frequencies or onboarding new customers, and ensuring uniform load distribution over time.
Key Takeaways for System Design
This case study demonstrates the importance of holistic system optimization. Bottlenecks often appear at different layers (messaging, database, API, scheduling) and require targeted solutions. Co-locating services with their primary data stores, understanding message queue semantics, and designing adaptive scheduling mechanisms are crucial for scalable distributed systems.