Menu
Pinterest Engineering·June 25, 2026

Scaling Foundation Model Training: Pinterest's Approach to Near-Linear Scalability

This article details Pinterest's journey to achieve near-linear scalability for their embedding-heavy foundation models, crucial for powering recommendation systems for over 600 million users. It focuses on identifying and resolving communication bottlenecks in multi-node distributed GPU training, leveraging network optimizations, data quantization, balanced sharding, and a novel 2D parallel communication topology.

Read original on Pinterest Engineering

Pinterest faced significant challenges in scaling the training of their large foundation models, essential for powering recommendation systems like Home feed and Related Pins. Initial attempts at multi-node distributed training yielded poor scalability (e.g., 0.2x throughput at 2 nodes), despite using powerful hardware. The core problem was identified as inefficient inter-node communication for embedding lookups.

Initial Bottlenecks and Diagnosis

The primary bottleneck was distributed embedding collective communication. Due to embedding tables exceeding single-GPU memory, they were sharded across GPUs, requiring extensive NCCL all-to-all operations during the forward pass. This communication cost dominated everything else, especially across nodes. The use of AWS Elastic Fabric Adapter (EFA) helped by providing OS-bypass networking, improving a dire 0.2x scaling to 1.13x at 2 nodes, but still far from ideal. PyTorch Profiler and NCCL traces revealed that GPUs were largely idle, waiting for network data, indicating an SM efficiency of only 54.54% despite high GPU utilization.

Optimization Strategies for Scalability

  1. Quantized Communications (QComms): Reduced data transfer by compressing embedding tensors from FP32 to FP8 for wire format during NCCL collectives. This alone significantly cut the largest NCCL operation by over 75%, boosting 2-node scaling from 1.13x to 1.57x.
  2. Balanced Sharding: Ensured even distribution of embedding tables across GPUs by matching hash partitions to the number of GPUs. This minimized bottlenecks caused by uneven workload distribution, particularly beneficial at 4 nodes (+15.2%).
  3. Bandwidth-Aware Embedding Optimization: Reshaped embeddings by halving the dimension and doubling row count. This maintained total capacity while reducing the bytes transferred per All-to-All operation, further improving scalability to 1.78x (2N) and 2.8x (4N). This insight highlighted that the bottleneck was bytes on the wire, not parameters in memory.
  4. 2D Parallel (All-to-All Optimized): The most impactful change. Instead of standard model parallelism where all GPUs participate in cross-cluster communication, they flipped the topology. Each node now runs its own complete set of sharded tables, keeping expensive All-to-All communication *intra-node* where it benefits from fast interconnects. Inter-node synchronization uses the cheaper AllReduce operation for model replicas. This design reduced All-to-All latency by 83%, achieving 2.0x scaling at 2 nodes and 3.9x at 4 nodes (97.5% of ideal).
💡

Key Takeaway for Distributed Training

When designing distributed systems, especially for ML training, communication patterns and data locality are paramount. Minimizing inter-node communication, especially for high-volume operations, and optimizing intra-node data flow can unlock significant performance gains. Profiling tools are indispensable for accurately diagnosing bottlenecks rather than guessing.

Collectively, these optimizations transformed Pinterest's multi-node training, achieving 7.5x scaling at 8 nodes (64 GPUs) with 93.75% of theoretical ideal. The final solution is hardware-agnostic, focusing on structural reduction of cross-node traffic rather than relying solely on faster interconnects.

Machine LearningDistributed TrainingGPUScalabilityPerformance OptimizationAWS EFANCCLEmbedding Tables

Comments

Loading comments...