ByteByteGo·May 25, 2026

Building Vector Indexing in a Distributed SQL Database: CockroachDB's C-SPANN

This article details how CockroachDB integrated vector indexing directly into its distributed SQL database by developing C-SPANN. It highlights the architectural constraints faced by a distributed, transactional database when adding a new feature like vector search, emphasizing the need for no central coordinator, real-time updates, sharding compatibility, and hot spot avoidance. The solution treats the vector index as ordinary table data, leveraging CockroachDB's existing distributed mechanisms for scalability and reliability.

Databases & Storage Distributed Systems AI & ML Infrastructure

Read original on ByteByteGo

The Challenge of Vector Search in Distributed Systems

Vector embeddings are crucial for semantic search and AI applications, representing data (images, text) as multi-dimensional numbers. Finding Approximate Nearest Neighbors (ANN) in billions of vectors quickly is the core problem. Unlike traditional indexes (B-trees) which rely on inherent data order, vectors lack this, making brute-force search impractical for large datasets. Vector indexes solve this by sacrificing exact accuracy for significant performance gains, a trade-off fundamental to their design. Integrating such an index into a distributed transactional database like CockroachDB presents unique architectural hurdles.

Key Architectural Constraints for C-SPANN

CockroachDB's nature as a distributed SQL database with transactional consistency and real-time updates imposed strict requirements on any new feature. The team identified six critical architectural constraints that ruled out most existing vector index algorithms (like HNSW, IVF, or standalone vector databases):

No Central Coordinator: To avoid bottlenecks and single points of failure, any node should handle reads/writes.
No Large In-Memory Structures: Index state must be persistent, avoiding large RAM caches and enabling fast warm-up for serverless deployments.
Minimal Network Hops: Reduce unpredictable latency from inter-node communication.
Sharding-Compatible Data Layout: Index data must map naturally to CockroachDB's key-value storage for splitting, merging, and rebalancing.
Avoid Hot Spots: Load must distribute evenly across the cluster as traffic scales.
Incremental Updates: Support real-time inserts/deletes without blocking queries, batch rebuilds, or degrading search quality.

C-SPANN: A Distributed-Native Vector Index

C-SPANN, the custom solution, adapts ideas from SPANN, SPFresh, and ScaNN, but integrates them directly into CockroachDB's distributed SQL architecture. Its core is a hierarchical K-means tree that groups vectors into partitions based on similarity. This results in a wide, shallow tree (e.g., 3 levels for 1 million vectors, 5 for 10 billion), allowing parallel processing at each level during search, keeping latency low.

ℹ️

C-SPANN's Architectural Advantage

The key design decision was storing each vector index partition as a self-contained set of key-value rows within CockroachDB ranges. This means the vector index isn't a separate system, but ordinary table data. This allows C-SPANN to automatically leverage CockroachDB's existing distributed features for range splitting, rebalancing, and caching without new infrastructure code, providing immediate readiness after node restarts.

Maintenance and Optimization

C-SPANN handles index maintenance incrementally. Partitions split when too large and merge when too small, with vectors reassigned using a 'nearest partition assignment' strategy to maintain accuracy. This prevents the need for full index rebuilds. For size optimization, vectors are compressed using RaBitQ, a lossy quantization technique reducing vector size by 94%. A reranking step compensates for this loss of precision, ensuring accurate results from the compressed candidates.

vector searchANNCockroachDBdistributed databaseindexingscalabilityreal-timeSQL database