Roblox engineered a real-time, unified AI translation system for its platform, handling 16 languages and 5,000+ chats per second with 100ms latency. The system leverages a Mixture of Experts (MoE) model for language flexibility, combined with aggressive model optimization techniques like distillation and quantization. Crucially, it incorporates robust serving infrastructure including multi-level caching and dynamic batching to meet stringent performance requirements.
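Of the model-side optimizations named above, quantization is the most mechanical to illustrate. The toy sketch below shows symmetric int8 post-training quantization of a weight vector; the function names and the 8-bit range are illustrative assumptions, not Roblox's actual scheme:

```python
def quantize_int8(weights):
    """Symmetric int8 quantization: map floats in [-max|w|, max|w|]
    onto integers in [-127, 127]. Storing int8 instead of float32
    cuts weight memory 4x and enables faster integer kernels."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate float weights from the int8 codes."""
    return [qi * scale for qi in q]

weights = [0.82, -0.31, 0.05, -1.27, 0.44]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)
max_err = max(abs(w - r) for w, r in zip(weights, restored))
assert all(-127 <= qi <= 127 for qi in q)
assert max_err <= scale / 2  # error bounded by half a quantization step
```

Rounding error is bounded by half a quantization step, which is why quantization trades a small, controlled accuracy loss for large memory and latency savings.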
Roblox, with over 70 million daily users across 180 countries, faced the complex problem of providing real-time chat translation across 16 languages. The naive approach of building 256 separate translation models (one per language pair) grows quadratically with the number of languages, and the training data, infrastructure, and maintenance overhead grow with it, making the system unmanageable. This pointed to the need for an efficient, unified architectural solution.
To avoid this quadratic model explosion, Roblox opted for a single, unified transformer-based translation model. The core innovation is the Mixture of Experts (MoE) architecture: instead of running every request through the entire model, a routing mechanism directs each input to a specialized subset of 'expert' subnetworks. These experts specialize in groups of similar languages, so the overall model can be vast (1 billion parameters) yet efficient at inference, since only a fraction of its parameters are activated for any single translation.
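The sparse-activation idea behind MoE can be sketched with a toy gating layer in plain Python. Everything here (the class name, the tiny dimensions, the dot-product gate) is an illustrative assumption; a real MoE transformer routes per token inside the network with learned gates, but the top-k selection principle is the same:

```python
import math
import random

random.seed(0)  # deterministic toy weights

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

class ToyMoELayer:
    """Toy Mixture of Experts layer: a gate scores every expert, but
    only the top-k experts actually run, so compute per input stays
    small even though total parameter count is large."""

    def __init__(self, num_experts=8, dim=4, top_k=2):
        self.top_k = top_k
        # Gating weights: one score vector per expert (random toy values).
        self.gate = [[random.uniform(-1, 1) for _ in range(dim)]
                     for _ in range(num_experts)]
        # Each "expert" here is just an elementwise weight vector.
        self.experts = [[random.uniform(-1, 1) for _ in range(dim)]
                        for _ in range(num_experts)]

    def forward(self, x):
        # Score all experts, then keep only the top-k (sparse activation).
        scores = [sum(w * xi for w, xi in zip(g, x)) for g in self.gate]
        top = sorted(range(len(scores)), key=lambda i: scores[i],
                     reverse=True)[:self.top_k]
        weights = softmax([scores[i] for i in top])
        # Mix only the chosen experts' outputs, weighted by the gate.
        out = [0.0] * len(x)
        for w, i in zip(weights, top):
            expert_out = [e * xi for e, xi in zip(self.experts[i], x)]
            out = [o + w * eo for o, eo in zip(out, expert_out)]
        return out, top

layer = ToyMoELayer()
out, chosen = layer.forward([1.0, 0.5, -0.5, 0.2])
print(chosen)  # indices of the 2 experts that actually ran
```

Note that only 2 of the 8 experts execute per input: this is how a 1-billion-parameter model can still serve requests cheaply.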
Achieving 100ms latency for a 1-billion-parameter model at 5,000+ chats per second required a multi-pronged approach beyond the MoE architecture itself: distillation and quantization to shrink per-request compute, plus serving-side techniques such as multi-level caching and dynamic batching.
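The serving-side pieces mentioned in the overview, multi-level caching in front of dynamic batching, can be sketched as below. All names (`TwoLevelCache`, `MicroBatcher`) and parameters are hypothetical; a production system would back L2 with a shared store such as Redis and flush batches from background timers rather than an explicit clock:

```python
from collections import OrderedDict

class TwoLevelCache:
    """Toy two-level translation cache: a small in-process LRU (L1)
    in front of a larger shared store (L2, stood in for by a dict)."""

    def __init__(self, l1_capacity=2):
        self.l1 = OrderedDict()
        self.l1_capacity = l1_capacity
        self.l2 = {}  # stand-in for a shared/remote cache

    def get(self, key):
        if key in self.l1:
            self.l1.move_to_end(key)        # refresh LRU order
            return self.l1[key]
        if key in self.l2:                   # L2 hit: promote into L1
            self.put(key, self.l2[key])
            return self.l2[key]
        return None

    def put(self, key, value):
        self.l2[key] = value
        self.l1[key] = value
        self.l1.move_to_end(key)
        if len(self.l1) > self.l1_capacity:
            self.l1.popitem(last=False)      # evict least recently used

class MicroBatcher:
    """Toy dynamic batcher: requests accumulate until the batch is
    full or a deadline passes, then one model call serves the whole
    batch, amortizing per-request overhead."""

    def __init__(self, translate_batch, max_batch=4, max_wait_ms=10):
        self.translate_batch = translate_batch
        self.max_batch = max_batch
        self.max_wait_ms = max_wait_ms
        self.pending = []
        self.first_arrival = None

    def submit(self, text, now_ms):
        if not self.pending:
            self.first_arrival = now_ms
        self.pending.append(text)
        if len(self.pending) >= self.max_batch:
            return self.flush()              # full batch: run immediately
        return None

    def maybe_flush(self, now_ms):
        # Deadline-based flush keeps tail latency bounded under low load.
        if self.pending and now_ms - self.first_arrival >= self.max_wait_ms:
            return self.flush()
        return None

    def flush(self):
        batch, self.pending = self.pending, []
        return self.translate_batch(batch)

# Wire the two together with a fake "model" that just uppercases text.
cache = TwoLevelCache()
batcher = MicroBatcher(lambda batch: [t.upper() for t in batch])

assert batcher.submit("hola", now_ms=0) is None  # batch not full yet
results = batcher.maybe_flush(now_ms=15)         # deadline passed: flush
cache.put(("es-en", "hola"), results[0])
print(cache.get(("es-en", "hola")))              # repeat chats skip the model
```

The deadline-or-full-batch rule is the key trade: batching raises throughput, while the deadline caps how long any single chat waits, which is what keeps the 100ms budget intact.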
Ensuring translation quality across 256 directions, especially for low-resource language pairs, was another significant engineering challenge.
Key Trade-offs
The unified model inherently trades some quality for breadth and manageability. Distillation, while crucial for latency, can reduce accuracy. Low-resource language pairs still lag behind high-resource ones. And the 100ms latency ceiling strictly caps model size, limiting further quality improvements.