Meta's RCCLX project, an open-source enhancement of RCCL, focuses on optimizing GPU communication for AI models on AMD platforms. It introduces features like Direct Data Access (DDA) and Low Precision Collectives to reduce latency and increase throughput, addressing critical bottlenecks in large language model inference and training. The article details architectural innovations for efficient inter-GPU communication.
The article introduces RCCLX, Meta's open-source library designed to enhance GPU communications on AMD platforms, specifically for AI workloads. It builds upon RCCL and integrates with Torchcomms, providing a unified API for distributed communication across different hardware backends. The core motivation is to rapidly iterate on collective communication patterns, transports, and features to keep pace with evolving AI model requirements and hardware capabilities.
Large language model inference involves two stages: prefill (compute-bound) and decoding (memory-bound). AllReduce operations, crucial for tensor parallelism, can contribute significantly to end-to-end latency. To mitigate this, RCCLX introduces two DDA algorithms.
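To make the role of AllReduce concrete, the sketch below shows its reference semantics in tensor parallelism: after each rank computes a partial result from its shard of the weights, every rank must end up with the elementwise sum of all partials. This is a minimal numpy simulation of what the collective computes, not RCCLX's or RCCL's implementation; the function name `allreduce_sum` is illustrative.

```python
import numpy as np

def allreduce_sum(rank_tensors):
    """Reference AllReduce semantics: every rank receives the
    elementwise sum of all ranks' input tensors. In tensor
    parallelism this combines partial outputs after sharded matmuls,
    which is why its latency sits on the critical path of each layer."""
    total = np.sum(np.stack(rank_tensors), axis=0)
    return [total.copy() for _ in rank_tensors]

# Illustrative example: 4 "GPUs", each holding a partial output.
partials = [np.full(3, float(r), dtype=np.float32) for r in range(4)]
outputs = allreduce_sum(partials)
# After the collective, every rank holds the same summed tensor.
```

During decoding, message sizes are small, so per-operation latency (which DDA targets) dominates over bandwidth.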
Performance Impact of DDA
DDA significantly improves performance, especially on AMD MI300X GPUs, yielding a 10-50% speedup for decode and a 10-30% speedup for prefill. This translates to approximately a 10% reduction in time-to-incremental-token (TTIT), enhancing user experience during decoding.
Low-precision (LP) collectives are optimized distributed communication algorithms (AllReduce, AllGather, AlltoAll, ReduceScatter) for AMD Instinct MI300/MI350 GPUs. They support FP32 and BF16 data types, leveraging FP8 quantization for up to 4:1 compression, which drastically reduces communication overhead for large message sizes (>= 16 MB).
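The compression ratio follows directly from element width: FP32 is 4 bytes per element and FP8 is 1, giving 4:1 (BF16 gives 2:1). The sketch below simulates this with a per-tensor absmax scale and int8 storage standing in for FP8, since numpy has no native FP8 dtype; the byte count per element, which is what drives the wire savings, is the same. All function names here are illustrative, and this is not RCCLX's quantization scheme.

```python
import numpy as np

def quantize_i8(x):
    """Per-tensor absmax quantization to 8 bits (int8 stands in for
    FP8: both are 1 byte/element on the wire)."""
    scale = float(np.abs(x).max()) / 127.0
    if scale == 0.0:
        scale = 1.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Reconstruct approximate values after the collective moved 8-bit data.
    return q.astype(np.float32) * scale

# A large FP32 message, the regime (>= 16 MB) where LP collectives apply.
x = np.random.default_rng(0).standard_normal(1 << 20).astype(np.float32)
q, s = quantize_i8(x)
ratio = x.nbytes / q.nbytes  # 4.0: four bytes shrink to one per element
```

The lossy step is the rounding to 8 bits, which is why the article notes LP collectives are enabled selectively rather than universally.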
Internal evaluations showed significant speedups for FP32 and BF16, with a ~9-10% decrease in latency and a ~7% increase in throughput in end-to-end inference workloads when LP collectives were selectively enabled.