This article details how CockroachDB integrated vector indexing directly into its distributed SQL database by developing C-SPANN. It highlights the architectural constraints faced by a distributed, transactional database when adding a new feature like vector search, emphasizing the need for no central coordinator, real-time updates, sharding compatibility, and hot spot avoidance. The solution treats the vector index as ordinary table data, leveraging CockroachDB's existing distributed mechanisms for scalability and reliability.
Read original on ByteByteGoVector embeddings are crucial for semantic search and AI applications, representing data (images, text) as multi-dimensional numbers. Finding Approximate Nearest Neighbors (ANN) in billions of vectors quickly is the core problem. Unlike traditional indexes (B-trees) which rely on inherent data order, vectors lack this, making brute-force search impractical for large datasets. Vector indexes solve this by sacrificing exact accuracy for significant performance gains, a trade-off fundamental to their design. Integrating such an index into a distributed transactional database like CockroachDB presents unique architectural hurdles.
CockroachDB's nature as a distributed SQL database with transactional consistency and real-time updates imposed strict requirements on any new feature. The team identified six critical architectural constraints that ruled out most existing vector index algorithms (like HNSW, IVF, or standalone vector databases):
C-SPANN, the custom solution, adapts ideas from SPANN, SPFresh, and ScaNN, but integrates them directly into CockroachDB's distributed SQL architecture. Its core is a hierarchical K-means tree that groups vectors into partitions based on similarity. This results in a wide, shallow tree (e.g., 3 levels for 1 million vectors, 5 for 10 billion), allowing parallel processing at each level during search, keeping latency low.
C-SPANN's Architectural Advantage
The key design decision was storing each vector index partition as a self-contained set of key-value rows within CockroachDB ranges. This means the vector index isn't a separate system, but ordinary table data. This allows C-SPANN to automatically leverage CockroachDB's existing distributed features for range splitting, rebalancing, and caching without new infrastructure code, providing immediate readiness after node restarts.
C-SPANN handles index maintenance incrementally. Partitions split when too large and merge when too small, with vectors reassigned using a 'nearest partition assignment' strategy to maintain accuracy. This prevents the need for full index rebuilds. For size optimization, vectors are compressed using RaBitQ, a lossy quantization technique reducing vector size by 94%. A reranking step compensates for this loss of precision, ensuring accurate results from the compressed candidates.