Menu
Cloudflare Blog·June 12, 2026

Scaling Cloudflare's Security Insights: A Case Study in Distributed System Optimization

This article details how Cloudflare scaled its Security Insights system by over 10x, addressing critical performance bottlenecks in Kafka processing, database writes, API latency, and scheduler efficiency. It serves as a practical case study for optimizing distributed systems under heavy load, highlighting common challenges and their solutions in real-world scenarios.

Read original on Cloudflare Blog

Cloudflare's Security Insights platform provides automated security recommendations by scanning customer accounts. Facing issues with infrequent scans and an inability to scan all free accounts, the team needed to increase scanning throughput by 10x, from 10 to 100 scans per second. The existing system was struggling with high Kafka backlog, API timeouts, and process crashes, necessitating a comprehensive architectural review and optimization effort.

Initial Architecture Overview

The original system utilized Apache Kafka as a message bus. A scheduler triggered scans by publishing messages to Kafka, which were then consumed by specialized Go microservices called 'checkers'. These checkers performed specific scans and sent results to an internal API, which persisted the insights into a PostgreSQL database.

Key Scaling Challenges and Solutions

Kafka Throughput and Head-of-Line Blocking

Kafka's partitioned event stream processes messages in order within a partition, limiting consumers to one active per partition. To overcome this without increasing Kafka partitions (due to shared broker resources), Cloudflare introduced parallel processing within checkers by consuming messages in batches and processing each in a separate goroutine. To address head-of-line blocking caused by slow-processing messages, they implemented a 'slow lane' and 'fast lane' consumer group split, allowing quick identification and skipping of slow messages by fast lane checkers.

Database Write Performance

The initial database write strategy involved individual `INSERT ... ON CONFLICT DO UPDATE` statements for each insight, leading to half a million round trips for large batches. The solution adopted a hybrid approach: using PostgreSQL's `UNNEST` for smaller sets of insights (faster for small batches) and the `COPY` command into a temporary table for larger sets (efficient for bulk inserts), balancing performance and avoiding system table bloat.

API Latency and Connection Exhaustion

Significant API timeouts and performance degradation were traced to an active-active API deployment with instances in Portland (primary database location) and Amsterdam. The long round-trip latency between Amsterdam API instances and the Portland database led to slow queries, connection pool exhaustion, and uneven Kafka partition processing. The fix was a simple but effective architectural change: switching the API to an active-passive configuration, ensuring the active API instance co-located with the primary database.

Scheduler Uniformity and Spiky Loads

The original scheduler caused spiky loads due to accounts having similar `last_scheduled_at` times and large accounts triggering cascades of zone scans. Solutions included scheduling zones independently of accounts, randomizing `last_scheduled_at` for existing entries to smooth distribution, and implementing adaptive rate limiting. The adaptive rate limiter dynamically recalculates the scan rate based on total accounts/zones and desired frequency, preventing spikes when increasing scan frequencies or onboarding new customers, and ensuring uniform load distribution over time.

💡

Key Takeaways for System Design

This case study demonstrates the importance of holistic system optimization. Bottlenecks often appear at different layers (messaging, database, API, scheduling) and require targeted solutions. Co-locating services with their primary data stores, understanding message queue semantics, and designing adaptive scheduling mechanisms are crucial for scalable distributed systems.

KafkaPostgreSQLMicroservicesScalabilityPerformance OptimizationDistributed MessagingAPI DesignScheduler

Comments

Loading comments...