Roblox engineered a real-time, unified AI translation system for its platform, handling 16 languages and 5,000+ chats per second with 100ms latency. The system leverages a Mixture of Experts (MoE) model for language flexibility, combined with aggressive model optimization techniques like distillation and quantization. Crucially, it incorporates robust serving infrastructure including multi-level caching and dynamic batching to meet stringent performance requirements.
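Of the model-side optimizations named above, quantization is the most mechanical to illustrate. The toy sketch below shows symmetric int8 post-training quantization of a weight vector; the function names and the 8-bit range are illustrative assumptions, not Roblox's actual scheme:

```python
def quantize_int8(weights):
    """Symmetric int8 quantization: map floats in [-max|w|, max|w|]
    onto integers in [-127, 127]. Storing int8 instead of float32
    cuts weight memory 4x and enables faster integer kernels."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate float weights from the int8 codes."""
    return [qi * scale for qi in q]

weights = [0.82, -0.31, 0.05, -1.27, 0.44]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)
max_err = max(abs(w - r) for w, r in zip(weights, restored))
assert all(-127 <= qi <= 127 for qi in q)
assert max_err <= scale / 2  # error bounded by half a quantization step
```

Rounding error is bounded by half a quantization step, which is why quantization trades a small, controlled accuracy loss for large memory and latency savings.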
Roblox, with over 70 million daily users across 180 countries, faced the complex problem of providing real-time chat translation across 16 languages. The naive approach of building 256 separate translation models (one per language pair) grows quadratically with the number of languages, and the training data, infrastructure, and maintenance overhead grow with it, making the system unmanageable. This pointed to the need for an efficient, unified architectural solution.
To avoid this quadratic model explosion, Roblox opted for a single, unified transformer-based translation model. The core innovation is the Mixture of Experts (MoE) architecture: instead of running every request through the entire model, a routing mechanism directs each input to a specialized subset of 'expert' subnetworks. These experts specialize in groups of similar languages, so the overall model can be vast (1 billion parameters) yet efficient at inference, since only a fraction of its parameters are activated for any single translation.
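The sparse-activation idea behind MoE can be sketched with a toy gating layer in plain Python. Everything here (the class name, the tiny dimensions, the dot-product gate) is an illustrative assumption; a real MoE transformer routes per token inside the network with learned gates, but the top-k selection principle is the same:

```python
import math
import random

random.seed(0)  # deterministic toy weights

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

class ToyMoELayer:
    """Toy Mixture of Experts layer: a gate scores every expert, but
    only the top-k experts actually run, so compute per input stays
    small even though total parameter count is large."""

    def __init__(self, num_experts=8, dim=4, top_k=2):
        self.top_k = top_k
        # Gating weights: one score vector per expert (random toy values).
        self.gate = [[random.uniform(-1, 1) for _ in range(dim)]
                     for _ in range(num_experts)]
        # Each "expert" here is just an elementwise weight vector.
        self.experts = [[random.uniform(-1, 1) for _ in range(dim)]
                        for _ in range(num_experts)]

    def forward(self, x):
        # Score all experts, then keep only the top-k (sparse activation).
        scores = [sum(w * xi for w, xi in zip(g, x)) for g in self.gate]
        top = sorted(range(len(scores)), key=lambda i: scores[i],
                     reverse=True)[:self.top_k]
        weights = softmax([scores[i] for i in top])
        # Mix only the chosen experts' outputs, weighted by the gate.
        out = [0.0] * len(x)
        for w, i in zip(weights, top):
            expert_out = [e * xi for e, xi in zip(self.experts[i], x)]
            out = [o + w * eo for o, eo in zip(out, expert_out)]
        return out, top

layer = ToyMoELayer()
out, chosen = layer.forward([1.0, 0.5, -0.5, 0.2])
print(chosen)  # indices of the 2 experts that actually ran
```

Note that only 2 of the 8 experts execute per input: this is how a 1-billion-parameter model can still serve requests cheaply.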
Achieving 100ms latency for a 1-billion-parameter model at 5,000+ chats per second required a multi-pronged approach beyond the MoE architecture itself: distillation and quantization to shrink per-request compute, plus serving-side techniques such as multi-level caching and dynamic batching.
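The serving-side pieces mentioned in the overview, multi-level caching in front of dynamic batching, can be sketched as below. All names (`TwoLevelCache`, `MicroBatcher`) and parameters are hypothetical; a production system would back L2 with a shared store such as Redis and flush batches from background timers rather than an explicit clock:

```python
from collections import OrderedDict

class TwoLevelCache:
    """Toy two-level translation cache: a small in-process LRU (L1)
    in front of a larger shared store (L2, stood in for by a dict)."""

    def __init__(self, l1_capacity=2):
        self.l1 = OrderedDict()
        self.l1_capacity = l1_capacity
        self.l2 = {}  # stand-in for a shared/remote cache

    def get(self, key):
        if key in self.l1:
            self.l1.move_to_end(key)        # refresh LRU order
            return self.l1[key]
        if key in self.l2:                   # L2 hit: promote into L1
            self.put(key, self.l2[key])
            return self.l2[key]
        return None

    def put(self, key, value):
        self.l2[key] = value
        self.l1[key] = value
        self.l1.move_to_end(key)
        if len(self.l1) > self.l1_capacity:
            self.l1.popitem(last=False)      # evict least recently used

class MicroBatcher:
    """Toy dynamic batcher: requests accumulate until the batch is
    full or a deadline passes, then one model call serves the whole
    batch, amortizing per-request overhead."""

    def __init__(self, translate_batch, max_batch=4, max_wait_ms=10):
        self.translate_batch = translate_batch
        self.max_batch = max_batch
        self.max_wait_ms = max_wait_ms
        self.pending = []
        self.first_arrival = None

    def submit(self, text, now_ms):
        if not self.pending:
            self.first_arrival = now_ms
        self.pending.append(text)
        if len(self.pending) >= self.max_batch:
            return self.flush()              # full batch: run immediately
        return None

    def maybe_flush(self, now_ms):
        # Deadline-based flush keeps tail latency bounded under low load.
        if self.pending and now_ms - self.first_arrival >= self.max_wait_ms:
            return self.flush()
        return None

    def flush(self):
        batch, self.pending = self.pending, []
        return self.translate_batch(batch)

# Wire the two together with a fake "model" that just uppercases text.
cache = TwoLevelCache()
batcher = MicroBatcher(lambda batch: [t.upper() for t in batch])

assert batcher.submit("hola", now_ms=0) is None  # batch not full yet
results = batcher.maybe_flush(now_ms=15)         # deadline passed: flush
cache.put(("es-en", "hola"), results[0])
print(cache.get(("es-en", "hola")))              # repeat chats skip the model
```

The deadline-or-full-batch rule is the key trade: batching raises throughput, while the deadline caps how long any single chat waits, which is what keeps the 100ms budget intact.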
Ensuring translation quality across 256 directions, especially for low-resource language pairs, was another significant engineering challenge.
Key Trade-offs
The unified model inherently trades some quality for breadth and manageability. Distillation, while crucial for latency, can reduce accuracy. Low-resource language pairs still lag behind high-resource ones. And the 100ms latency ceiling strictly caps model size, limiting further quality improvements.