Cloudflare Blog·June 24, 2026

Scaling Cloudflare's OAuth System: A Multi-Stage Upgrade Case Study

This article details Cloudflare's journey in scaling its OAuth infrastructure to support a broader developer ecosystem, transitioning from a limited, manually managed system to a self-managed, robust platform. It highlights the architectural challenges and migration strategies involved in upgrading their underlying OAuth engine, Hydra, through a complex blue-green deployment process to ensure minimal downtime and data integrity. The case study offers valuable insights into managing critical infrastructure upgrades in a live production environment.

Read original on Cloudflare Blog

Cloudflare undertook a significant architectural upgrade of its OAuth system, moving from a solution adequate for a small number of partners to a self-managed, scalable platform accessible to all developers. The shift was driven by the growth of its developer platform and by increasing demand for delegated access from SaaS integrations, internal developer tools, and agentic workflows. The upgrade focused on the permissions model, the consent experience, revocation mechanisms, and security features needed to support a rapidly expanding ecosystem.

Challenges of Upgrading the OAuth Engine

The existing OAuth engine, an older deployment of open-source Hydra, could no longer meet these demands. Upgrading it raised several significant challenges, particularly around database schema migrations and keeping the service available during the transition. The initial plan for a single large upgrade was replaced by two sequential phases (1.x, then 2.x) to mitigate risk.

  • Exclusive Database Locks: Schema migrations for the 1.x upgrade required exclusive locks on critical tables, which would prevent active users from performing OAuth operations.
  • SQL Query Incompatibility: The older Hydra SDK performed `SELECT *` queries, so columns introduced by the new schema caused deserialization issues in code that did not expect them.
  • Major Schema Changes in 2.x: The 2.x upgrade introduced extensive schema changes, making an in-place upgrade impossible without prolonged downtime.
  • Data Loss Risk During Blue-Green: Traditional blue-green deployments often involve disabling writes or accepting potential data loss during the cutover window, which was unacceptable for critical OAuth operations like token issuance and revocation.
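The `SELECT *` problem in the list above can be reproduced in miniature. The following is a minimal sketch, not Hydra's code: it uses Python with an in-memory SQLite database as a stand-in, and the `clients` table, the `Client` record, and the added `extra` column are all hypothetical names chosen for illustration. It shows why code that maps every returned column onto a fixed record breaks once a migration adds a column, while a query with an explicit column list keeps working.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class Client:
    # Fields known to the "old" application code (hypothetical schema).
    id: str
    name: str


def load_with_star(conn):
    # Maps every returned column positionally onto the old record type.
    rows = conn.execute("SELECT * FROM clients").fetchall()
    return [Client(*row) for row in rows]


def load_with_explicit_columns(conn):
    # Only asks for the columns the old code understands.
    rows = conn.execute("SELECT id, name FROM clients").fetchall()
    return [Client(*row) for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clients (id TEXT PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO clients VALUES ('c1', 'example-app')")

# Before the migration, both query styles return the same records.
assert load_with_star(conn) == load_with_explicit_columns(conn)

# Simulated schema migration: the new version adds a column.
conn.execute("ALTER TABLE clients ADD COLUMN extra INTEGER DEFAULT 0")

try:
    load_with_star(conn)
    star_ok = True
except TypeError:
    # The extra column no longer fits the old two-field record.
    star_ok = False

explicit = load_with_explicit_columns(conn)
```

In this sketch the old reader fails only because of the unexpected column, which is why selecting explicit columns lets old and new code share a schema during a migration.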

Migration Strategy: Blue-Green with Write Preservation

To address these challenges, Cloudflare devised a blue-green deployment strategy that allowed writes to continue throughout the multi-hour upgrade window. Key tactics included:

  • Custom SQL Migrations: Rewrote the SQL migrations to use `CREATE INDEX CONCURRENTLY`, avoiding exclusive locks, and shipped custom Hydra builds that select explicit columns instead of using `SELECT *`, avoiding the deserialization issues.
  • Increased Token Expiry: Temporarily increased token expiry times so that fewer new tokens had to be written during the upgrade window, minimizing potential losses.
  • Revocation Queue System: Used Cloudflare Queues to capture every revocation event during the upgrade. After the cutover to the new (green) database, the queue was drained and the missed revocations were replayed, so that applications whose access had been revoked did not retain it.
  • Refresh Token Coalescing: Introduced a Worker that caches refresh token requests, so that duplicate requests do not invalidate an entire access/refresh token chain under the stricter behavior of the new Hydra version. This mattered most for high-volume clients.
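The refresh token coalescing idea can be sketched as follows. This is an illustrative Python `asyncio` version, not Cloudflare's Worker: the `RefreshCoalescer` class, the `ttl_seconds` parameter, and the `fake_upstream` token endpoint are hypothetical, and the in-memory dictionary assumes a single process, whereas a real deployment would need shared state. Concurrent requests carrying the same refresh token share one upstream call, and the response stays cached briefly so that late duplicates get the same answer instead of presenting an already-used token.

```python
import asyncio


class RefreshCoalescer:
    """Collapse duplicate refreshes of one refresh token into a single upstream call."""

    def __init__(self, upstream, ttl_seconds=5.0):
        self._upstream = upstream  # async callable: refresh_token -> token response
        self._ttl = ttl_seconds    # how long a response is reused (assumed value)
        self._cache = {}           # refresh_token -> asyncio.Task

    async def refresh(self, refresh_token):
        task = self._cache.get(refresh_token)
        if task is None:
            # First request for this token: start the one real upstream call.
            task = asyncio.ensure_future(self._upstream(refresh_token))
            self._cache[refresh_token] = task
            # Evict the entry after the TTL so the cache stays bounded.
            loop = asyncio.get_running_loop()
            loop.call_later(self._ttl, self._cache.pop, refresh_token, None)
        # Duplicates await the same task and receive the same result.
        return await task


async def demo():
    calls = []

    async def fake_upstream(refresh_token):
        # Stand-in for the token endpoint; it rotates the token on each call.
        calls.append(refresh_token)
        await asyncio.sleep(0.01)
        n = len(calls)
        return {"access_token": f"at-{n}", "refresh_token": f"rt-{n}"}

    coalescer = RefreshCoalescer(fake_upstream, ttl_seconds=1.0)
    # Five concurrent requests present the same refresh token.
    results = await asyncio.gather(*(coalescer.refresh("rt-0") for _ in range(5)))
    return calls, results


calls, results = asyncio.run(demo())
```

One design note on this sketch: a failed upstream call is cached and re-raised to every waiting caller until the entry expires, which a production version would likely want to handle differently.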

Performance Improvements Post-Upgrade

The upgrade also improved performance significantly: API P95 latency fell by 45%, RSS memory usage by 14%, Go heap allocation by 40%, and CPU utilization by 37%.

OAuth · Hydra · Blue-Green Deployment · Database Migration · Cloudflare · API Gateway · Scalability · Distributed Systems
