This article details the challenges and solutions encountered while scaling an in-memory metadata layer for the Veltrix feature store, highlighting critical performance bottlenecks related to garbage collection and disk I/O with RocksDB. It presents a successful architectural pivot to a custom mmap-based sharded hash map, showcasing specific optimizations for latency, memory management, and NUMA awareness to achieve high throughput and low latency.
The Veltrix feature store experienced severe performance degradation at Black-Friday scale, stalling after 1.2 million requests despite ample CPU and network resources. The core issue was an in-memory metadata cache (vx-meta) exceeding 42 GB RSS on a single pod. This led to significant Go garbage collection (GC) pauses (up to 290 ms) and a drastic increase in p99 latency (from 3 ms to 320 ms), highlighting a common pitfall: assuming in-memory components are always fast without considering their underlying runtime characteristics and memory pressure.
The team initially implemented an off-heap RocksDB tier (v6.27) behind gRPC, expecting a ristretto (v0.12) in-memory cache to handle 99% of reads. Synthetic tests at 5 million keys achieved 1.8M TPS with 1.9 ms p99 latency, but scaling to 25 million keys exposed RocksDB's limitations. Read amplification spiked because small SSTables containing feature metadata hashes were compacted repeatedly. Even with block-cache adjustments and level-style compaction, write stalls lasted up to 47 minutes, making the 10 ms p99 SLO unattainable.
Lesson Learned: Beware of Hidden I/O and Compaction Costs
Even with an effective caching layer, underlying persistent storage can become a bottleneck if its I/O patterns and compaction strategies are not suitable for the workload. High read amplification and prolonged write stalls can severely impact overall system performance.
The team re-architected vx-meta into a custom mmap-based sharded hash map, with key design decisions focused on latency, memory management, and NUMA awareness.
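The sharding idea itself can be illustrated with a minimal Go sketch. It is not the vx-meta implementation: the type and function names (`ShardedMap`, `NewShardedMap`, `shardFor`), the FNV-1a hash, and the shard count are all assumptions made for illustration. The mmap backing is deliberately omitted, so values here still live on the Go heap. The sketch only shows how hashing a key to one of several independently locked shards keeps unrelated reads and writes from contending on a single lock.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sync"
)

// shard is one independently locked partition of the map.
type shard struct {
	mu   sync.RWMutex
	data map[string][]byte
}

// ShardedMap spreads keys across a fixed number of shards so that
// operations on different shards never wait on the same lock.
// (Hypothetical type; an mmap-backed variant would keep values in a
// mapped region instead of on the Go heap.)
type ShardedMap struct {
	shards []*shard
}

// NewShardedMap creates a map with n shards; n must be greater than zero.
func NewShardedMap(n int) *ShardedMap {
	m := &ShardedMap{shards: make([]*shard, n)}
	for i := range m.shards {
		m.shards[i] = &shard{data: make(map[string][]byte)}
	}
	return m
}

// shardFor hashes the key with FNV-1a and selects a shard by modulo.
func (m *ShardedMap) shardFor(key string) *shard {
	h := fnv.New64a()
	h.Write([]byte(key))
	return m.shards[h.Sum64()%uint64(len(m.shards))]
}

// Put stores value under key, taking only the owning shard's write lock.
func (m *ShardedMap) Put(key string, value []byte) {
	s := m.shardFor(key)
	s.mu.Lock()
	s.data[key] = value
	s.mu.Unlock()
}

// Get looks up key, taking only the owning shard's read lock.
func (m *ShardedMap) Get(key string) ([]byte, bool) {
	s := m.shardFor(key)
	s.mu.RLock()
	v, ok := s.data[key]
	s.mu.RUnlock()
	return v, ok
}

func main() {
	m := NewShardedMap(64) // shard count chosen arbitrarily for the example
	m.Put("feature:user_age", []byte("int32"))
	v, ok := m.Get("feature:user_age")
	fmt.Println(string(v), ok)
}
```

Because each shard owns its own lock and storage, a design like this also gives a natural unit for placing data per NUMA node or per mapped file, although how vx-meta actually lays out its shards is not described in the source.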
This re-design significantly improved performance, enabling the cluster to handle 6.5M TPS at 95% CPU usage with p99 latency consistently at 2.8 ms, even with a 380 GB dataset. GC pauses on feature-store pods dropped from 290 ms to 12 ms as the RSS now fit within transparent huge pages. Monitoring `vx_meta_shard_gc_pause_ms` and `vx_meta_dirty_ratio` became critical canary metrics.
Further improvements identified include capping mlock memory at 90% of physical RAM to allow the kernel to swap less-critical dirty pages, and moving NUMA pinning into the vx-meta process itself using the cgroups v2 cpuset controller for more robust thread management. This reflects the continuous need for careful resource management and fine-tuning in high-performance distributed systems, balancing direct memory control with kernel-managed efficiencies.
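The 90% cap comes down to simple arithmetic, sketched below in Go. The function names (`mlockBudget`, `lockableShards`), the 4 KiB page size, the 512 GiB of physical RAM, and the 1 GiB shard size are all illustrative assumptions, not figures from the source. The sketch only computes the budget and makes no locking system call.

```go
package main

import "fmt"

// pageSize is an assumed base page size in bytes.
const pageSize = 4096

// mlockBudget returns the number of bytes that may be locked when the
// cap is capPercent of physical RAM, rounded down to a page boundary.
func mlockBudget(physBytes, capPercent uint64) uint64 {
	budget := physBytes / 100 * capPercent
	return budget &^ (pageSize - 1)
}

// lockableShards returns how many whole shards of shardBytes each fit
// inside the budget; the remaining shards would stay unlocked.
func lockableShards(budget, shardBytes uint64) uint64 {
	if shardBytes == 0 {
		return 0
	}
	return budget / shardBytes
}

func main() {
	phys := uint64(512) << 30     // hypothetical host with 512 GiB of RAM
	shardBytes := uint64(1) << 30 // hypothetical 1 GiB per shard
	budget := mlockBudget(phys, 90)
	fmt.Printf("lock budget: %d bytes, shards that fit: %d\n",
		budget, lockableShards(budget, shardBytes))
}
```

Shards beyond the budget would remain ordinary mapped memory, which leaves the kernel free to reclaim their pages under pressure while the locked shards stay resident.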