Meta's Adaptive Ranking Model tackles the "inference trilemma" for LLM-scale models in real-time ad recommendations, balancing computational complexity, low latency, and cost efficiency. It achieves this through a request-centric architecture, deep model-system co-design, and reimagined serving infrastructure, effectively bending the inference scaling curve.
Serving LLM-scale models for real-time ad recommendations at Meta's scale presents a significant challenge: the "inference trilemma." This involves simultaneously achieving high model complexity for better personalization, sub-second latency for user experience, and cost efficiency to remain economically viable. Traditional "one-size-fits-all" inference approaches are unsustainable, leading Meta to develop the Adaptive Ranking Model (ARM) to dynamically align model complexity with user context and intent.
Request-Oriented Optimization
Instead of processing each user-ad pair independently, ARM computes high-density user signals once per request and shares them across ad candidates. This is achieved through Request-Oriented Computation Sharing and In-Kernel Broadcast optimization, drastically reducing computational redundancy and memory bandwidth pressure.
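The sharing idea can be illustrated with a minimal sketch (the two-tower shapes, dimensions, and weight names below are hypothetical, not ARM's actual architecture): the expensive user-side computation runs once per request, and its result is broadcast across all ad candidates in that request.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, for illustration only.
USER_DIM, AD_DIM, HIDDEN = 64, 32, 16
W_user = rng.normal(size=(USER_DIM, HIDDEN))
W_ad = rng.normal(size=(AD_DIM, HIDDEN))

def user_tower(user_feats):
    """Expensive user-side computation, run once per request."""
    return np.tanh(user_feats @ W_user)            # (HIDDEN,)

def score_request(user_feats, ad_candidates):
    """Share one user embedding across every candidate in the request."""
    u = user_tower(user_feats)                     # computed once, not per pair
    a = np.tanh(ad_candidates @ W_ad)              # (N, HIDDEN)
    return a @ u                                   # broadcast scoring, (N,)

user = rng.normal(size=USER_DIM)
ads = rng.normal(size=(100, AD_DIM))
scores = score_request(user, ads)                  # user tower ran 1x, not 100x
```

Scoring 100 candidates this way invokes the user-side computation once instead of 100 times, which is the redundancy reduction the request-oriented design targets.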
To maximize structural throughput, ARM employs Wukong Turbo, an optimized runtime evolution of Meta Ads' internal architecture. It applies a "No-Bias" approach to remove unstable terms, and uses small-parameter delegation to reduce network and memory overhead. For latency, preprocessing is offloaded from client CPUs to remote GPU hosts, utilizing compact formats and GPU-native kernels to prevent data starvation and improve end-to-end execution.
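The "No-Bias" simplification can be sketched as follows (the layer size and weights are hypothetical): dropping the bias term eliminates one parameter vector and one add per layer, at the cost of the additive offset.

```python
import numpy as np

rng = np.random.default_rng(1)

def linear(x, W, b=None):
    """A linear layer with an optional bias term."""
    y = x @ W
    return y if b is None else y + b

D = 256                                   # hypothetical layer width
W = rng.normal(size=(D, D)) / np.sqrt(D)
b = rng.normal(size=D)
x = rng.normal(size=D)

with_bias = linear(x, W, b)
no_bias = linear(x, W)                    # "No-Bias": drop the unstable term

params_with_bias = W.size + b.size
params_no_bias = W.size                   # D fewer parameters per layer
```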
ARM’s deep model-system co-design is critical for maximizing computational ROI. It uses Selective FP8 Quantization, applying lower precision only in layers with high precision-loss tolerance to maintain model quality while boosting throughput. Hardware-Aware Graph and Kernel Specialization fuses operators and consolidates small operations into compute-dense kernels, minimizing memory access and increasing effective hardware utilization on modern GPUs.
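Selective quantization can be sketched with a crude FP8 simulation (the E4M3 rounding below, the layer names, and the tolerance set are all illustrative assumptions, not ARM's actual scheme): only layers flagged as tolerant of precision loss are quantized, while sensitive layers stay in full precision.

```python
import numpy as np

rng = np.random.default_rng(2)

def fake_fp8_e4m3(x):
    """Crude FP8 (E4M3) simulation: clamp to the format's range and
    round the mantissa to 3 bits. Illustration only."""
    x = np.clip(x, -448.0, 448.0)            # E4M3 max normal value is 448
    m, e = np.frexp(x)                       # mantissa in [0.5, 1), exponent
    m = np.round(m * 16.0) / 16.0            # keep 3 mantissa bits
    return np.ldexp(m, e)

# Hypothetical model: layer names and the tolerance set are assumptions.
layers = {
    "dense_interaction": rng.normal(size=(64, 64)),   # tolerant -> FP8
    "final_logit": rng.normal(size=(64, 1)),          # sensitive -> keep FP32
}
fp8_tolerant = {"dense_interaction"}

quantized = {
    name: fake_fp8_e4m3(W) if name in fp8_tolerant else W
    for name, W in layers.items()
}
```

The tolerant layer incurs a small, bounded rounding error (at most about 6% relative, from the 3-bit mantissa), while the sensitive layer is passed through untouched.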
Recommendation models heavily rely on sparse, categorical features mapped to embedding tables. ARM enables O(1T) parameter scale by optimizing embedding hash sizes based on feature sparsity, pruning unused embeddings, and using unified embeddings to share tables across multiple features. When embeddings exceed single GPU memory, a multi-card sharding mechanism distributes them across an optimized hardware cluster.
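A minimal sketch of the unified-embedding idea (table size, dimension, GPU count, and the modulo placement policy are all illustrative assumptions): multiple sparse features hash into one shared row space sized by overall sparsity, and rows map to shards when the table exceeds a single GPU's memory.

```python
import numpy as np

rng = np.random.default_rng(3)

NUM_ROWS, EMB_DIM, NUM_GPUS = 10_000, 8, 4     # hypothetical sizes

# Unified embedding: one table shared across features. Each (feature, value)
# pair hashes into the shared row space, so capacity is sized by total
# sparsity rather than one table per feature.
table = rng.normal(size=(NUM_ROWS, EMB_DIM)).astype(np.float32)

def row_id(feature, value):
    """Hash a (feature, value) pair into the shared row space."""
    return hash((feature, value)) % NUM_ROWS

def shard_of(row):
    """Row-wise sharding across a multi-GPU cluster (modulo placement)."""
    return row % NUM_GPUS

def lookup(feature, value):
    r = row_id(feature, value)
    # In a real system this would be a remote fetch from GPU shard_of(r).
    return table[r]

v1 = lookup("ad_category", "autos")
v2 = lookup("page_id", 12345)
```

Note that Python's built-in `hash` is randomized per process; a production system would use a stable hash so row assignments survive restarts.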