Dev.to #architecture·May 25, 2026

Scaling an In-Memory Metadata Layer: Lessons from Veltrix Feature Store

This article details the challenges and solutions encountered while scaling an in-memory metadata layer for the Veltrix feature store, highlighting critical performance bottlenecks related to garbage collection and disk I/O with RocksDB. It presents a successful architectural pivot to a custom mmap-based sharded hash map, showcasing specific optimizations for latency, memory management, and NUMA awareness to achieve high throughput and low latency.

The Challenge: In-Memory Metadata at Scale

The Veltrix feature store experienced severe performance degradation at Black-Friday scale, stalling after 1.2 million requests despite ample CPU and network resources. The core issue was an in-memory metadata cache (vx-meta) exceeding 42 GB RSS on a single pod. This led to significant Go garbage collection (GC) pauses (up to 290 ms) and a drastic increase in p99 latency (from 3 ms to 320 ms), highlighting a common pitfall: assuming in-memory components are always fast without considering their underlying runtime characteristics and memory pressure.
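Pause figures like these can be read directly from the Go runtime. The following is a minimal sketch, not the Veltrix instrumentation: the map of byte arrays is a stand-in for a pointer-heavy metadata cache, and `longestPauseNs` is a hypothetical helper. Note that `runtime.MemStats.PauseNs` records stop-the-world pauses only, so it will not capture every way the collector can add latency.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// longestPauseNs returns the largest value in a slice of GC pause
// durations expressed in nanoseconds.
func longestPauseNs(pauses []uint64) uint64 {
	var longest uint64
	for _, p := range pauses {
		if p > longest {
			longest = p
		}
	}
	return longest
}

func main() {
	// Build a pointer-heavy map so the collector has a live heap to scan.
	// This is a stand-in for a metadata cache, not the real vx-meta layout.
	cache := make(map[int]*[64]byte)
	for i := 0; i < 200000; i++ {
		cache[i] = new([64]byte)
	}
	runtime.GC() // force at least one full collection cycle

	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)

	// PauseNs is a circular buffer holding the most recent 256 pause times.
	longest := time.Duration(longestPauseNs(ms.PauseNs[:]))
	fmt.Printf("cycles=%d heap=%d MiB longest recent pause=%v\n",
		ms.NumGC, ms.HeapAlloc>>20, longest)

	runtime.KeepAlive(cache)
}
```

Exporting a value like this per pod is one way to catch heap growth before it shows up in p99 latency.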

Initial Approach: RocksDB with Ristretto Cache

The team initially implemented an off-heap RocksDB tier (v6.27) behind gRPC, expecting a ristretto (v0.12) in-memory cache to serve 99% of reads. Synthetic tests at 5 million keys achieved 1.8M TPS with 1.9 ms p99 latency, but scaling to 25 million keys exposed RocksDB's limitations. Read amplification spiked because small SSTables containing feature metadata hashes were compacted repeatedly. Even with block-cache adjustments and level-style compaction, write stalls lasted up to 47 minutes, putting the 10 ms p99 SLO out of reach.

⚠️ Lesson Learned: Beware of Hidden I/O and Compaction Costs

Even with an effective caching layer, underlying persistent storage can become a bottleneck if its I/O patterns and compaction strategies are not suitable for the workload. High read amplification and prolonged write stalls can severely impact overall system performance.

The Solution: Custom Sharded mmap-based Hash Map

The team re-architected vx-meta into a custom sharded hash map. Key design decisions included:

  • Memory Mapping (mmap): Each shard was mmap'd, leveraging the kernel's page cache for most reads and reducing disk I/O.
  • Memory Locking (mlock): Dirty pages were pinned with `mlock` so the kernel could not evict them, keeping critical metadata resident in physical RAM. With no LSM tree underneath, compaction-related disk access disappeared as well.
  • Sharding: 128 shards were chosen to fit within the L3 cache of the AMD EPYC 7763 CPUs, ensuring optimal access patterns.
  • Client-Side Locality: An `X-Vx-Meta-Shard` gRPC header was introduced to allow clients to pin requests to specific shards, minimizing cross-node traffic and improving cache locality.
  • NUMA Awareness: The shard rebalancer used `sched_setaffinity` to avoid NUMA migrations, which previously caused significant increases in shard lock latency (from 18 µs to 900 µs).
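The mapping, locking, and shard-routing decisions above can be sketched in a few lines of Go. This is an illustration under stated assumptions rather than the vx-meta implementation: the FNV-1a hash, the key format, the backing-file name, and the 1 MiB shard size are all hypothetical, and the helpers `shardFor` and `mapShard` are invented for the example. It uses the standard `syscall` package and assumes a Linux host.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"os"
	"path/filepath"
	"syscall"
)

// numShards matches the shard count given in the article.
const numShards = 128

// shardFor maps a key to a shard index. FNV-1a is an assumed hash choice.
func shardFor(key string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(key)) // Write on a hash.Hash never returns an error
	return h.Sum32() % numShards
}

// mapShard sizes a backing file, maps it shared and read-write, and tries
// to pin it in RAM. The returned bool reports whether mlock succeeded.
func mapShard(path string, size int) ([]byte, bool, error) {
	f, err := os.OpenFile(path, os.O_RDWR|os.O_CREATE, 0o600)
	if err != nil {
		return nil, false, err
	}
	defer f.Close() // the mapping stays valid after the descriptor closes
	if err := f.Truncate(int64(size)); err != nil {
		return nil, false, err
	}
	data, err := syscall.Mmap(int(f.Fd()), 0, size,
		syscall.PROT_READ|syscall.PROT_WRITE, syscall.MAP_SHARED)
	if err != nil {
		return nil, false, err
	}
	// mlock can fail when RLIMIT_MEMLOCK is low; treat that as non-fatal.
	locked := syscall.Mlock(data) == nil
	return data, locked, nil
}

func main() {
	key := "user:42:ctr_7d" // hypothetical feature key
	shard := shardFor(key)
	// Value a client could send in the X-Vx-Meta-Shard header.
	fmt.Printf("X-Vx-Meta-Shard: %d\n", shard)

	path := filepath.Join(os.TempDir(), fmt.Sprintf("vx-meta-shard-%03d.bin", shard))
	data, locked, err := mapShard(path, 1<<20) // 1 MiB demo shard
	if err != nil {
		fmt.Println("map failed:", err)
		return
	}
	defer os.Remove(path)
	defer syscall.Munmap(data)

	copy(data, key) // a store is a plain memory write into the mapping
	fmt.Printf("locked=%v first bytes=%q\n", locked, data[:len(key)])
}
```

Because the mapped bytes live outside the Go heap, the collector never scans them, which is consistent with the GC improvement the article reports. A real shard would also need a slot layout, per-shard locking, and explicit flushing, none of which are shown here.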

The redesign enabled the cluster to handle 6.5M TPS at 95% CPU usage with p99 latency holding at 2.8 ms, even with a 380 GB dataset. GC pauses on feature-store pods dropped from 290 ms to 12 ms as the RSS now fit within transparent huge pages. `vx_meta_shard_gc_pause_ms` and `vx_meta_dirty_ratio` became the critical canary metrics to monitor.
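A canary check over those two metrics might look like the sketch below. Only the metric names come from the article; the `canaryLimits` type, the `breaches` helper, and the threshold values are placeholders chosen for illustration, and the sample readings reuse the article's 12 ms and 290 ms pause figures.

```go
package main

import "fmt"

// canaryLimits holds alert thresholds. The values used in main are
// placeholders, not figures from the article.
type canaryLimits struct {
	MaxGCPauseMs  float64
	MaxDirtyRatio float64
}

// breaches returns the names of the canary metrics that exceed their limit.
func breaches(gcPauseMs, dirtyRatio float64, lim canaryLimits) []string {
	var out []string
	if gcPauseMs > lim.MaxGCPauseMs {
		out = append(out, "vx_meta_shard_gc_pause_ms")
	}
	if dirtyRatio > lim.MaxDirtyRatio {
		out = append(out, "vx_meta_dirty_ratio")
	}
	return out
}

func main() {
	lim := canaryLimits{MaxGCPauseMs: 50, MaxDirtyRatio: 0.5} // hypothetical limits

	fmt.Println(breaches(12, 0.10, lim))  // healthy sample: no metric flagged
	fmt.Println(breaches(290, 0.10, lim)) // pause regression: GC metric flagged
}
```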

Future Optimizations and Trade-offs

Further improvements identified include capping mlock'd memory at 90% of physical RAM so the kernel can page out less-critical dirty pages, and moving NUMA pinning into the vx-meta process itself using the cgroups v2 cpuset controller for more robust thread management. Both changes trade some direct memory control for kernel-managed efficiency, a balance that high-performance distributed systems have to keep tuning.
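The 90% cap reduces to simple arithmetic over shard sizes. The sketch below assumes a Linux host (it reads total RAM via `syscall.Sysinfo`) and uses hypothetical 64 MiB shards locked in priority order; `lockBudget` and `planLocks` are illustrative helpers, not part of vx-meta.

```go
package main

import (
	"fmt"
	"syscall"
)

// lockBudget returns capPercent of totalRAM, in bytes.
func lockBudget(totalRAM, capPercent uint64) uint64 {
	return totalRAM / 100 * capPercent
}

// planLocks walks shards in priority order and returns how many leading
// shards fit inside the budget, plus the bytes they would pin.
func planLocks(shardSizes []uint64, budget uint64) (count int, used uint64) {
	for _, size := range shardSizes {
		if used+size > budget {
			break
		}
		used += size
		count++
	}
	return count, used
}

func main() {
	var si syscall.Sysinfo_t
	if err := syscall.Sysinfo(&si); err != nil {
		fmt.Println("sysinfo failed:", err)
		return
	}
	// Totalram is expressed in multiples of Unit bytes.
	total := uint64(si.Totalram) * uint64(si.Unit)
	budget := lockBudget(total, 90) // 90% cap from the article

	sizes := make([]uint64, 128)
	for i := range sizes {
		sizes[i] = 64 << 20 // hypothetical 64 MiB per shard
	}
	n, used := planLocks(sizes, budget)
	fmt.Printf("budget=%d MiB, lock %d of %d shards (%d MiB)\n",
		budget>>20, n, len(sizes), used>>20)
}
```

Shards beyond the returned count would stay mapped but unlocked, leaving the kernel free to write back and reclaim their pages under memory pressure.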

Tags: in-memory cache, RocksDB, Go GC, mmap, mlock, sharding, NUMA, latency optimization