Meta's Adaptive Ranking Model tackles the "inference trilemma" for LLM-scale models in real-time ad recommendations, balancing computational complexity, low latency, and cost efficiency. It achieves this through a request-centric architecture, deep model-system co-design, and reimagined serving infrastructure, effectively bending the inference scaling curve.
Serving LLM-scale models for real-time ad recommendations at Meta's scale presents a significant challenge: the "inference trilemma." This involves simultaneously achieving high model complexity for better personalization, sub-second latency for user experience, and cost efficiency to remain economically viable. Traditional "one-size-fits-all" inference approaches are unsustainable, leading Meta to develop the Adaptive Ranking Model (ARM) to dynamically align model complexity with user context and intent.
Request-Oriented Optimization
Instead of processing each user-ad pair independently, ARM computes high-density user signals once per request and shares them across ad candidates. This is achieved through Request-Oriented Computation Sharing and In-Kernel Broadcast optimization, drastically reducing computational redundancy and memory bandwidth pressure.
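The sharing idea can be illustrated with a minimal sketch (the two-tower shapes, dimensions, and weight names below are hypothetical, not ARM's actual architecture): the expensive user-side computation runs once per request, and its result is broadcast across all ad candidates in that request.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, for illustration only.
USER_DIM, AD_DIM, HIDDEN = 64, 32, 16
W_user = rng.normal(size=(USER_DIM, HIDDEN))
W_ad = rng.normal(size=(AD_DIM, HIDDEN))

def user_tower(user_feats):
    """Expensive user-side computation, run once per request."""
    return np.tanh(user_feats @ W_user)            # (HIDDEN,)

def score_request(user_feats, ad_candidates):
    """Share one user embedding across every candidate in the request."""
    u = user_tower(user_feats)                     # computed once, not per pair
    a = np.tanh(ad_candidates @ W_ad)              # (N, HIDDEN)
    return a @ u                                   # broadcast scoring, (N,)

user = rng.normal(size=USER_DIM)
ads = rng.normal(size=(100, AD_DIM))
scores = score_request(user, ads)                  # user tower ran 1x, not 100x
```

Scoring 100 candidates this way invokes the user-side computation once instead of 100 times, which is the redundancy reduction the request-oriented design targets.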
To maximize structural throughput, ARM employs Wukong Turbo, an optimized runtime evolution of Meta Ads' internal architecture. It applies a "No-Bias" approach to remove unstable terms, and uses small-parameter delegation to reduce network and memory overhead. For latency, preprocessing is offloaded from client CPUs to remote GPU hosts, utilizing compact formats and GPU-native kernels to prevent data starvation and improve end-to-end execution.
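The "No-Bias" simplification can be sketched as follows (the layer size and weights are hypothetical): dropping the bias term eliminates one parameter vector and one add per layer, at the cost of the additive offset.

```python
import numpy as np

rng = np.random.default_rng(1)

def linear(x, W, b=None):
    """A linear layer with an optional bias term."""
    y = x @ W
    return y if b is None else y + b

D = 256                                   # hypothetical layer width
W = rng.normal(size=(D, D)) / np.sqrt(D)
b = rng.normal(size=D)
x = rng.normal(size=D)

with_bias = linear(x, W, b)
no_bias = linear(x, W)                    # "No-Bias": drop the unstable term

params_with_bias = W.size + b.size
params_no_bias = W.size                   # D fewer parameters per layer
```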
ARM’s deep model-system co-design is critical for maximizing computational ROI. It uses Selective FP8 Quantization, applying lower precision only in layers with high precision-loss tolerance to maintain model quality while boosting throughput. Hardware-Aware Graph and Kernel Specialization fuses operators and consolidates small operations into compute-dense kernels, minimizing memory access and increasing effective hardware utilization on modern GPUs.
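Selective quantization can be sketched with a crude FP8 simulation (the E4M3 rounding below, the layer names, and the tolerance set are all illustrative assumptions, not ARM's actual scheme): only layers flagged as tolerant of precision loss are quantized, while sensitive layers stay in full precision.

```python
import numpy as np

rng = np.random.default_rng(2)

def fake_fp8_e4m3(x):
    """Crude FP8 (E4M3) simulation: clamp to the format's range and
    round the mantissa to 3 bits. Illustration only."""
    x = np.clip(x, -448.0, 448.0)            # E4M3 max normal value is 448
    m, e = np.frexp(x)                       # mantissa in [0.5, 1), exponent
    m = np.round(m * 16.0) / 16.0            # keep 3 mantissa bits
    return np.ldexp(m, e)

# Hypothetical model: layer names and the tolerance set are assumptions.
layers = {
    "dense_interaction": rng.normal(size=(64, 64)),   # tolerant -> FP8
    "final_logit": rng.normal(size=(64, 1)),          # sensitive -> keep FP32
}
fp8_tolerant = {"dense_interaction"}

quantized = {
    name: fake_fp8_e4m3(W) if name in fp8_tolerant else W
    for name, W in layers.items()
}
```

The tolerant layer incurs a small, bounded rounding error (at most about 6% relative, from the 3-bit mantissa), while the sensitive layer is passed through untouched.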
Recommendation models heavily rely on sparse, categorical features mapped to embedding tables. ARM enables O(1T) parameter scale by optimizing embedding hash sizes based on feature sparsity, pruning unused embeddings, and using unified embeddings to share tables across multiple features. When embeddings exceed single GPU memory, a multi-card sharding mechanism distributes them across an optimized hardware cluster.
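A minimal sketch of the unified-embedding idea (table size, dimension, GPU count, and the modulo placement policy are all illustrative assumptions): multiple sparse features hash into one shared row space sized by overall sparsity, and rows map to shards when the table exceeds a single GPU's memory.

```python
import numpy as np

rng = np.random.default_rng(3)

NUM_ROWS, EMB_DIM, NUM_GPUS = 10_000, 8, 4     # hypothetical sizes

# Unified embedding: one table shared across features. Each (feature, value)
# pair hashes into the shared row space, so capacity is sized by total
# sparsity rather than one table per feature.
table = rng.normal(size=(NUM_ROWS, EMB_DIM)).astype(np.float32)

def row_id(feature, value):
    """Hash a (feature, value) pair into the shared row space."""
    return hash((feature, value)) % NUM_ROWS

def shard_of(row):
    """Row-wise sharding across a multi-GPU cluster (modulo placement)."""
    return row % NUM_GPUS

def lookup(feature, value):
    r = row_id(feature, value)
    # In a real system this would be a remote fetch from GPU shard_of(r).
    return table[r]

v1 = lookup("ad_category", "autos")
v2 = lookup("page_id", 12345)
```

Note that Python's built-in `hash` is randomized per process; a production system would use a stable hash so row assignments survive restarts.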