Meta Engineering · February 24, 2026

Optimizing GPU Communications for AI Workloads with RCCLX

Meta's RCCLX project, an open-source enhancement of RCCL, focuses on optimizing GPU communication for AI models on AMD platforms. It introduces features like Direct Data Access (DDA) and Low Precision Collectives to reduce latency and increase throughput, addressing critical bottlenecks in large language model inference and training. The article details architectural innovations for efficient inter-GPU communication.


The article introduces RCCLX, Meta's open-source library designed to enhance GPU communications on AMD platforms, specifically for AI workloads. It builds upon RCCL and integrates with Torchcomms, providing a unified API for distributed communication across different hardware backends. The core motivation is to rapidly iterate on collective communication patterns, transports, and features to keep pace with evolving AI model requirements and hardware capabilities.

Direct Data Access (DDA) for Intra-node Collectives

Large language model inference involves two stages: prefill (compute-bound) and decoding (memory-bound). AllReduce operations, crucial for tensor parallelism, can contribute significantly to end-to-end latency. To mitigate this, RCCLX introduces two DDA algorithms:

  • DDA Flat: Improves allreduce for small message sizes by letting each rank load peers' buffers directly and perform the reduction locally. This cuts the number of latency-bound communication steps from O(N) to O(1), at the cost of increasing total data exchanged from O(N) to O(N^2).
  • DDA Tree: Breaks allreduce into a reduce-scatter phase followed by an all-gather phase, using direct data access in each step. It moves the same total amount of data as the ring algorithm but achieves a constant-factor latency reduction for somewhat larger messages.
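The two schemes above can be sketched in plain Python. This is an illustrative simulation, not RCCLX code: the function names (`dda_flat`, `dda_tree`) are invented for the example, and "direct load" is modeled as each rank simply reading the other ranks' buffers.

```python
# Sketch of the two DDA allreduce schemes, simulated with Python lists.
# In hardware, "reading a peer's buffer" is a direct load over the
# GPU interconnect; here it is just list indexing.

def dda_flat(buffers):
    """Flat DDA: every rank directly loads every peer's buffer and reduces
    locally -- one communication step (O(1) latency), O(N^2) total data moved."""
    n = len(buffers)
    size = len(buffers[0])
    reduced = [sum(buffers[r][i] for r in range(n)) for i in range(size)]
    return [list(reduced) for _ in range(n)]  # every rank ends with the full sum

def dda_tree(buffers):
    """Tree DDA: reduce-scatter (each rank reduces one chunk via direct loads),
    then all-gather -- same total data volume as a ring, but fewer steps."""
    n = len(buffers)
    chunk = len(buffers[0]) // n
    # Phase 1: reduce-scatter -- rank r owns and reduces chunk r.
    owned = [[sum(buffers[src][r * chunk + i] for src in range(n))
              for i in range(chunk)] for r in range(n)]
    # Phase 2: all-gather -- every rank directly loads every reduced chunk.
    full = [v for part in owned for v in part]
    return [list(full) for _ in range(n)]

# 4 ranks, 8 elements each: both schemes produce the same allreduce result.
bufs = [[float(r + i) for i in range(8)] for r in range(4)]
assert dda_flat(bufs) == dda_tree(bufs)
```

The trade-off the article describes is visible in the structure: `dda_flat` finishes in a single step but every rank touches all N buffers in full, while `dda_tree` splits the work into two phases that each move only the ring-equivalent volume.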

Performance Impact of DDA

DDA significantly improves performance, especially on AMD MI300X GPUs, yielding 10-50% speedup for decode and 10-30% for prefill. This translates to approximately a 10% reduction in time-to-incremental-token (TTIT), enhancing user experience during decoding.

Low Precision Collectives for Communication Efficiency

Low-precision (LP) collectives are optimized distributed communication algorithms (AllReduce, AllGather, AlltoAll, ReduceScatter) for AMD Instinct MI300/MI350 GPUs. They support FP32 and BF16 data types, leveraging FP8 quantization for up to 4:1 compression, which drastically reduces communication overhead for large message sizes (≥ 16 MB).

  • Communication is done using parallel peer-to-peer (P2P) mesh communication, fully utilizing AMD's Infinity Fabric.
  • Compute steps maintain numerical stability by operating in high precision (FP32).
  • Precision loss is managed by minimizing quantization operations and ensuring data can be represented in FP8 range. Users can dynamically enable LP collectives to maximize throughput while maintaining acceptable numerical accuracy.

Internal evaluations showed significant speedup for FP32 and BF16, with ~9-10% decrease in latency and ~7% increase in throughput in end-to-end inference workloads when LP collectives were selectively enabled.

Tags: GPU · AMD · Deep Learning · Interconnect · Communication Collectives · AllReduce · Low Precision · Hardware Acceleration
