Meta Engineering · February 24, 2026

Optimizing GPU Communications for AI Workloads with RCCLX

Meta's RCCLX project, an open-source enhancement of RCCL, focuses on optimizing GPU communication for AI models on AMD platforms. It introduces features like Direct Data Access (DDA) and Low Precision Collectives to reduce latency and increase throughput, addressing critical bottlenecks in large language model inference and training. The article details architectural innovations for efficient inter-GPU communication.


The article introduces RCCLX, Meta's open-source library designed to enhance GPU communications on AMD platforms, specifically for AI workloads. It builds upon RCCL and integrates with Torchcomms, providing a unified API for distributed communication across different hardware backends. The core motivation is to rapidly iterate on collective communication patterns, transports, and features to keep pace with evolving AI model requirements and hardware capabilities.

Direct Data Access (DDA) for Intra-node Collectives

Large language model inference involves two stages: prefill (compute-bound) and decoding (memory-bound). AllReduce operations, crucial for tensor parallelism, can contribute significantly to end-to-end latency. To mitigate this, RCCLX introduces two DDA algorithms:

  • DDA Flat: Improves allreduce for small message sizes by letting each rank load peers' buffers directly and perform the reduction locally. This cuts the number of latency-bound communication steps from O(N) to O(1), at the cost of increasing total data exchanged from O(N) to O(N^2).
  • DDA Tree: Breaks allreduce into a reduce-scatter phase followed by an all-gather phase, using direct data access in each step. It moves the same total amount of data as the ring algorithm but achieves a constant-factor latency reduction for somewhat larger messages.
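The two schemes above can be sketched in plain Python. This is an illustrative simulation, not RCCLX code: the function names (`dda_flat`, `dda_tree`) are invented for the example, and "direct load" is modeled as each rank simply reading the other ranks' buffers.

```python
# Sketch of the two DDA allreduce schemes, simulated with Python lists.
# In hardware, "reading a peer's buffer" is a direct load over the
# GPU interconnect; here it is just list indexing.

def dda_flat(buffers):
    """Flat DDA: every rank directly loads every peer's buffer and reduces
    locally -- one communication step (O(1) latency), O(N^2) total data moved."""
    n = len(buffers)
    size = len(buffers[0])
    reduced = [sum(buffers[r][i] for r in range(n)) for i in range(size)]
    return [list(reduced) for _ in range(n)]  # every rank ends with the full sum

def dda_tree(buffers):
    """Tree DDA: reduce-scatter (each rank reduces one chunk via direct loads),
    then all-gather -- same total data volume as a ring, but fewer steps."""
    n = len(buffers)
    chunk = len(buffers[0]) // n
    # Phase 1: reduce-scatter -- rank r owns and reduces chunk r.
    owned = [[sum(buffers[src][r * chunk + i] for src in range(n))
              for i in range(chunk)] for r in range(n)]
    # Phase 2: all-gather -- every rank directly loads every reduced chunk.
    full = [v for part in owned for v in part]
    return [list(full) for _ in range(n)]

# 4 ranks, 8 elements each: both schemes produce the same allreduce result.
bufs = [[float(r + i) for i in range(8)] for r in range(4)]
assert dda_flat(bufs) == dda_tree(bufs)
```

The trade-off the article describes is visible in the structure: `dda_flat` finishes in a single step but every rank touches all N buffers in full, while `dda_tree` splits the work into two phases that each move only the ring-equivalent volume.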

Performance Impact of DDA

DDA significantly improves performance, especially on AMD MI300X GPUs, yielding 10-50% speedup for decode and 10-30% for prefill. This translates to approximately a 10% reduction in time-to-incremental-token (TTIT), enhancing user experience during decoding.

Low Precision Collectives for Communication Efficiency

Low-precision (LP) collectives are optimized distributed communication algorithms (AllReduce, AllGather, AlltoAll, ReduceScatter) for AMD Instinct MI300/MI350 GPUs. They support FP32 and BF16 data types, leveraging FP8 quantization for up to 4:1 compression, which drastically reduces communication overhead for large message sizes (≥ 16 MB).

  • Communication is done using parallel peer-to-peer (P2P) mesh communication, fully utilizing AMD's Infinity Fabric.
  • Compute steps maintain numerical stability by operating in high precision (FP32).
  • Precision loss is managed by minimizing quantization operations and ensuring data can be represented in FP8 range. Users can dynamically enable LP collectives to maximize throughput while maintaining acceptable numerical accuracy.

Internal evaluations showed significant speedup for FP32 and BF16, with ~9-10% decrease in latency and ~7% increase in throughput in end-to-end inference workloads when LP collectives were selectively enabled.

Tags: GPU · AMD · Deep Learning · Interconnect · Communication Collectives · AllReduce · Low Precision · Hardware Acceleration
