This article details Pinterest's journey to achieve near-linear scalability for their embedding-heavy foundation models, crucial for powering recommendation systems for over 600 million users. It focuses on identifying and resolving communication bottlenecks in multi-node distributed GPU training, leveraging network optimizations, data quantization, balanced sharding, and a novel 2D parallel communication topology.
Read original on Pinterest EngineeringPinterest faced significant challenges in scaling the training of their large foundation models, essential for powering recommendation systems like Home feed and Related Pins. Initial attempts at multi-node distributed training yielded poor scalability (e.g., 0.2x throughput at 2 nodes), despite using powerful hardware. The core problem was identified as inefficient inter-node communication for embedding lookups.
The primary bottleneck was distributed embedding collective communication. Due to embedding tables exceeding single-GPU memory, they were sharded across GPUs, requiring extensive NCCL all-to-all operations during the forward pass. This communication cost dominated everything else, especially across nodes. The use of AWS Elastic Fabric Adapter (EFA) helped by providing OS-bypass networking, improving a dire 0.2x scaling to 1.13x at 2 nodes, but still far from ideal. PyTorch Profiler and NCCL traces revealed that GPUs were largely idle, waiting for network data, indicating an SM efficiency of only 54.54% despite high GPU utilization.
Key Takeaway for Distributed Training
When designing distributed systems, especially for ML training, communication patterns and data locality are paramount. Minimizing inter-node communication, especially for high-volume operations, and optimizing intra-node data flow can unlock significant performance gains. Profiling tools are indispensable for accurately diagnosing bottlenecks rather than guessing.
Collectively, these optimizations transformed Pinterest's multi-node training, achieving 7.5x scaling at 8 nodes (64 GPUs) with 93.75% of theoretical ideal. The final solution is hardware-agnostic, focusing on structural reduction of cross-node traffic rather than relying solely on faster interconnects.