Pinterest engineered a "Feature Trimmer" to overcome network bandwidth bottlenecks in their online ML serving system's root-leaf architecture. This involved moving from sending all features to only transmitting those explicitly required by a model, leveraging model signatures and integrating with existing deployment pipelines. The solution significantly reduced network usage, enabling infrastructure cost savings and better GPU utilization.
Pinterest's online ML serving system uses a root-leaf architecture designed to efficiently score Pins for user recommendations. The Root component handles initial feature processing, fetching features from a store, preprocessing them, and fanning out requests. The Leaf components, typically running on GPUs, perform the actual model inference. This separation allows for optimized resource utilization, dedicating CPUs to feature processing and GPUs to model inference, and centralizes feature caching to reduce QPS to the feature store. However, this architectural choice introduced a critical challenge: network bandwidth between root and leaf became a bottleneck, limiting GPU utilization and forcing the use of expensive, network-optimized instances for the root.
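The scatter/gather shape of this architecture can be sketched as follows. This is a minimal illustration, not Pinterest's actual code: `fetch_features`, `leaf_score`, and the sharding scheme are all hypothetical stand-ins for the root's feature fetch, the GPU leaves, and the fan-out.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the root's feature-store fetch plus
# CPU-side preprocessing.
def fetch_features(pin_ids):
    return {pin: {"embedding": [0.1, 0.2], "ctr": 0.05} for pin in pin_ids}

# Stub for a GPU leaf: runs model inference on its shard of Pins.
def leaf_score(shard):
    return {pin: sum(f["embedding"]) + f["ctr"] for pin, f in shard.items()}

def root_scatter_gather(pin_ids, num_leaves=2):
    """Root: fetch features once, shard Pins, fan out to leaves, merge."""
    features = fetch_features(pin_ids)
    items = list(features.items())
    shards = [dict(items[i::num_leaves]) for i in range(num_leaves)]
    with ThreadPoolExecutor(max_workers=num_leaves) as pool:
        partials = pool.map(leaf_score, shards)  # one RPC per leaf
    scores = {}
    for partial in partials:
        scores.update(partial)
    return scores

scores = root_scatter_gather(["pin1", "pin2", "pin3"])
print(scores)
```

Note that in this baseline every leaf request carries the full feature payload for its shard, which is exactly the traffic the later trimming work targets.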
The initial approach to mitigate network pressure involved enabling LZ4 compression on the RPC traffic between root and leaf. This provided a 20% reduction in network usage at the cost of a 5% CPU increase and a 10% p90 latency increase. While an early win, it didn't solve the fundamental problem of shipping unused data.
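The trade-off behind that first mitigation is easy to demonstrate. The sketch below uses `zlib` as a stand-in, since LZ4 is not in Python's standard library; the shape of the bargain is the same — a smaller wire payload bought with extra CPU on every request.

```python
import json
import zlib

# A mock root→leaf RPC payload: feature maps for a batch of Pins.
payload = json.dumps(
    {f"pin_{i}": {"embedding": [0.1] * 64, "ctr": 0.05} for i in range(100)}
).encode()

# level=1 biases toward speed, roughly the niche LZ4 occupies.
compressed = zlib.compress(payload, level=1)

assert zlib.decompress(compressed) == payload  # lossless round trip
print(f"compressed to {len(compressed) / len(payload):.0%} of original size")
```

Compression shrinks what crosses the wire but still ships every byte of unused feature data in compressed form, which is why it could only ever be a partial fix.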
The "Send What You Use" Principle
This principle, akin to "include what you use" in C++ development, advocates for only transmitting or processing the data that is strictly necessary. In distributed systems, this often translates to significant savings in network bandwidth, CPU cycles, and memory, though it may introduce complexity in data synchronization and schema management.
The core solution was the Feature Trimmer, a component designed to implement a "Send What You Use" strategy. Instead of the root sending the union of all features to every leaf, the Trimmer ensures only the features required by a specific leaf model are transmitted. This required the root to accurately know each model's feature requirements, which are derived from the model signature (input/output definitions) exported with the model artifact. A crucial convention is that model signatures remain immutable for a given model version; changes necessitate forking a new model.
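The trimming step itself can be sketched roughly as below. The names (`ModelSignature`, `trim_features`) are illustrative, not Pinterest's API; a frozen dataclass mirrors the convention that a signature is immutable for a given model version.

```python
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen: signatures never change for a model version
class ModelSignature:
    """Hypothetical stand-in for the signature exported with a model artifact."""
    model_version: str
    required_features: frozenset

def trim_features(all_features: dict, signature: ModelSignature) -> dict:
    """Send What You Use: keep only the features the leaf model declares."""
    missing = signature.required_features - all_features.keys()
    if missing:
        raise KeyError(f"{signature.model_version} missing features: {missing}")
    return {name: all_features[name] for name in signature.required_features}

# The root holds the union of features for a Pin...
all_features = {"ctr_7d": 0.04, "embedding": [0.1, 0.2], "freshness": 3, "lang": "en"}
# ...but this leaf model's signature only declares two of them.
sig = ModelSignature("ranker_v12", frozenset({"ctr_7d", "embedding"}))
trimmed = trim_features(all_features, sig)
print(trimmed)  # only the declared features cross the root→leaf wire
```

Raising on missing features rather than silently dropping them matters here: because the signature is the contract, a mismatch indicates a deployment error, not an optional input.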
This approach enabled substantial network bandwidth reduction, allowing Pinterest to utilize GPU resources more effectively and switch root instances to cheaper, standard types, yielding significant infrastructure cost savings.