This article details Coupang's journey to replace legacy database sequences with a highly available, low-latency distributed ID generation system without breaking over 100 existing services. The solution leverages local application caching, server-side caching, and DynamoDB as the source of truth, optimizing for performance and availability over strict global ordering and gap-free IDs. It highlights practical design principles for large-scale migrations, emphasizing simplicity and backward compatibility.
Read original on InfoQ
Migrating a large-scale platform from a relational database to NoSQL often surfaces hidden dependencies, especially around database-native features like sequences. Coupang faced this challenge when deprecating a legacy database that provided unique, monotonically increasing IDs. With over one hundred teams and ten thousand distinct counters relying on these sequences, a drop-in replacement was essential to avoid massive application rewrites and service disruptions. The core problem was to design a system that generates unique IDs at scale with high availability and low latency, stays compatible with the assumptions existing services already make, and avoids unnecessary complexity.
The team deliberately chose simplicity over sophisticated distributed coordination protocols (such as consensus or vector clocks) after validating actual requirements. Many teams did not need strict global ordering or gap-free sequences, which simplified the problem significantly. This led to several guiding principles:
The final architecture comprises three layers: DynamoDB as the persistent source of truth, a server-side caching layer, and thick clients with local in-memory caches. This multi-layered caching strategy ensures that most ID requests are served with zero network calls, drastically improving latency and availability.
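The three layers can be sketched as follows. This is a minimal illustration, not Coupang's implementation: it assumes each layer caches by reserving a contiguous block of IDs from the layer behind it, and it replaces DynamoDB with an in-memory stand-in so the example runs on its own. The class names (CounterStore, IdServer, IdClient) and the block sizes are hypothetical.

```python
import threading


class CounterStore:
    """In-memory stand-in for the persistent source of truth (DynamoDB in the article)."""

    def __init__(self):
        self._next = {}  # counter name -> next unreserved ID
        self._lock = threading.Lock()
        self.calls = 0   # how many times this layer was reached

    def reserve(self, name, size):
        """Atomically reserve the range [start, start + size) and return start."""
        with self._lock:
            self.calls += 1
            start = self._next.get(name, 1)
            self._next[name] = start + size
            return start


class IdServer:
    """Server-side cache: takes large blocks from the store, hands small ones to clients."""

    def __init__(self, store, store_block=1000):
        self._store = store
        self._store_block = store_block
        self._ranges = {}  # counter name -> [next_free, end_exclusive]
        self._lock = threading.Lock()
        self.calls = 0

    def reserve(self, name, size):
        with self._lock:
            self.calls += 1
            rng = self._ranges.get(name)
            if rng is None or rng[1] - rng[0] < size:
                # Any leftover IDs in the old range are abandoned (a gap).
                want = max(size, self._store_block)
                start = self._store.reserve(name, want)
                rng = [start, start + want]
                self._ranges[name] = rng
            start = rng[0]
            rng[0] += size
            return start


class IdClient:
    """Thick client: serves IDs from a local in-memory block, no network call needed."""

    def __init__(self, server, name, block=100):
        self._server = server
        self._name = name
        self._block = block
        self._next = 0
        self._end = 0  # empty block, so the first call triggers a refill
        self._lock = threading.Lock()

    def next_id(self):
        with self._lock:
            if self._next >= self._end:
                # Only a refill reaches the server; everything else is local.
                self._next = self._server.reserve(self._name, self._block)
                self._end = self._next + self._block
            value = self._next
            self._next += 1
            return value
```

Under these assumed block sizes, a client that issues 250 IDs reaches the server three times and the store once; every other request is answered from local memory.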
Availability over Strict Ordering
The design prioritizes availability and low latency over strict global monotonicity across all consumers or gap-free sequences. While IDs are unique and monotonically increasing within a single service instance, they may not be globally ordered across different instances. This trade-off significantly simplifies the distributed system and caters to the actual needs of most consuming services.
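A small simulation makes the trade-off concrete. It is a sketch under the assumption that each service instance draws IDs from its own reserved block; BlockAllocator, Instance, and the block size of 100 are hypothetical names and values chosen for illustration, and the code is single-threaded for brevity.

```python
class BlockAllocator:
    """Hands out non-overlapping ID blocks (stand-in for the shared backend)."""

    def __init__(self):
        self._next = 1

    def reserve(self, size):
        start = self._next
        self._next += size
        return start


class Instance:
    """One service instance issuing IDs from its locally cached block."""

    def __init__(self, allocator, block=100):
        self._allocator = allocator
        self._block = block
        self._next = 0
        self._end = 0  # empty block, so the first call reserves one

    def next_id(self):
        if self._next >= self._end:
            self._next = self._allocator.reserve(self._block)
            self._end = self._next + self._block
        value = self._next
        self._next += 1
        return value


def interleaved_ids():
    """Two instances take turns issuing IDs; returns them in issue order."""
    allocator = BlockAllocator()
    a = Instance(allocator)
    b = Instance(allocator)
    return [a.next_id(), b.next_id(), a.next_id(), b.next_id()]


def id_after_restart():
    """An instance issues two IDs, then restarts and loses its cached block."""
    allocator = BlockAllocator()
    a = Instance(allocator)
    a.next_id()
    a.next_id()
    a = Instance(allocator)  # restart: the rest of the old block is never issued
    return a.next_id()
```

Here interleaved_ids() returns [1, 101, 2, 102]: every ID is unique and each instance's own IDs increase, but the combined sequence is not sorted by issue time. id_after_restart() returns 101, so IDs 3 through 100 are never used, which is the kind of gap the design accepts.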