Pinterest engineered a "Feature Trimmer" to overcome network bandwidth bottlenecks in their online ML serving system's root-leaf architecture. This involved moving from sending all features to only transmitting those explicitly required by a model, leveraging model signatures and integrating with existing deployment pipelines. The solution significantly reduced network usage, enabling infrastructure cost savings and better GPU utilization.
Pinterest's online ML serving system uses a root-leaf architecture designed to efficiently score Pins for user recommendations. The Root component handles initial feature processing, fetching features from a store, preprocessing them, and fanning out requests. The Leaf components, typically running on GPUs, perform the actual model inference. This separation allows for optimized resource utilization, dedicating CPUs to feature processing and GPUs to model inference, and centralizes feature caching to reduce QPS to the feature store. However, this architectural choice introduced a critical challenge: network bandwidth between root and leaf became a bottleneck, limiting GPU utilization and forcing the use of expensive, network-optimized instances for the root.
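The scatter/gather shape of this architecture can be sketched as follows. This is a minimal illustration, not Pinterest's actual code: `fetch_features`, `leaf_score`, and the sharding scheme are all hypothetical stand-ins for the root's feature fetch, the GPU leaves, and the fan-out.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the root's feature-store fetch plus
# CPU-side preprocessing.
def fetch_features(pin_ids):
    return {pin: {"embedding": [0.1, 0.2], "ctr": 0.05} for pin in pin_ids}

# Stub for a GPU leaf: runs model inference on its shard of Pins.
def leaf_score(shard):
    return {pin: sum(f["embedding"]) + f["ctr"] for pin, f in shard.items()}

def root_scatter_gather(pin_ids, num_leaves=2):
    """Root: fetch features once, shard Pins, fan out to leaves, merge."""
    features = fetch_features(pin_ids)
    items = list(features.items())
    shards = [dict(items[i::num_leaves]) for i in range(num_leaves)]
    with ThreadPoolExecutor(max_workers=num_leaves) as pool:
        partials = pool.map(leaf_score, shards)  # one RPC per leaf
    scores = {}
    for partial in partials:
        scores.update(partial)
    return scores

scores = root_scatter_gather(["pin1", "pin2", "pin3"])
print(scores)
```

Note that in this baseline every leaf request carries the full feature payload for its shard, which is exactly the traffic the later trimming work targets.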
The initial approach to mitigate network pressure involved enabling LZ4 compression on the RPC traffic between root and leaf. This provided a 20% reduction in network usage at the cost of a 5% CPU increase and a 10% p90 latency increase. While an early win, it didn't solve the fundamental problem of shipping unused data.
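The trade-off behind that first mitigation is easy to demonstrate. The sketch below uses `zlib` as a stand-in, since LZ4 is not in Python's standard library; the shape of the bargain is the same — a smaller wire payload bought with extra CPU on every request.

```python
import json
import zlib

# A mock root→leaf RPC payload: feature maps for a batch of Pins.
payload = json.dumps(
    {f"pin_{i}": {"embedding": [0.1] * 64, "ctr": 0.05} for i in range(100)}
).encode()

# level=1 biases toward speed, roughly the niche LZ4 occupies.
compressed = zlib.compress(payload, level=1)

assert zlib.decompress(compressed) == payload  # lossless round trip
print(f"compressed to {len(compressed) / len(payload):.0%} of original size")
```

Compression shrinks what crosses the wire but still ships every byte of unused feature data in compressed form, which is why it could only ever be a partial fix.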
The "Send What You Use" Principle
This principle, akin to "include what you use" in C++ development, advocates for only transmitting or processing the data that is strictly necessary. In distributed systems, this often translates to significant savings in network bandwidth, CPU cycles, and memory, though it may introduce complexity in data synchronization and schema management.
The core solution was the Feature Trimmer, a component designed to implement a "Send What You Use" strategy. Instead of the root sending the union of all features to every leaf, the Trimmer ensures only the features required by a specific leaf model are transmitted. This required the root to accurately know each model's feature requirements, which are derived from the model signature (input/output definitions) exported with the model artifact. A crucial convention is that model signatures remain immutable for a given model version; changes necessitate forking a new model.
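The trimming step itself can be sketched roughly as below. The names (`ModelSignature`, `trim_features`) are illustrative, not Pinterest's API; a frozen dataclass mirrors the convention that a signature is immutable for a given model version.

```python
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen: signatures never change for a model version
class ModelSignature:
    """Hypothetical stand-in for the signature exported with a model artifact."""
    model_version: str
    required_features: frozenset

def trim_features(all_features: dict, signature: ModelSignature) -> dict:
    """Send What You Use: keep only the features the leaf model declares."""
    missing = signature.required_features - all_features.keys()
    if missing:
        raise KeyError(f"{signature.model_version} missing features: {missing}")
    return {name: all_features[name] for name in signature.required_features}

# The root holds the union of features for a Pin...
all_features = {"ctr_7d": 0.04, "embedding": [0.1, 0.2], "freshness": 3, "lang": "en"}
# ...but this leaf model's signature only declares two of them.
sig = ModelSignature("ranker_v12", frozenset({"ctr_7d", "embedding"}))
trimmed = trim_features(all_features, sig)
print(trimmed)  # only the declared features cross the root→leaf wire
```

Raising on missing features rather than silently dropping them matters here: because the signature is the contract, a mismatch indicates a deployment error, not an optional input.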
This approach enabled substantial network bandwidth reduction, allowing Pinterest to utilize GPU resources more effectively and switch root instances to cheaper, standard types, yielding significant infrastructure cost savings.