This article details Snapchat's Bento platform, a machine learning platform designed to serve a billion predictions per second for ranking and recommendation tasks. It highlights the architectural decisions and engineering challenges behind a low-latency, high-throughput ML serving infrastructure that also keeps models and features fresh. Key aspects include handling asymmetric ranking workloads, a specialized feature store, and optimized inference engines.
Read original on ByteByteGo
Snapchat's Bento platform is a machine learning infrastructure responsible for delivering over a billion predictions per second, powering features like Discover, Spotlight feeds, ad ranking, and friend suggestions. The core challenge lies in making these decisions in roughly 100 milliseconds at the scale of 477 million daily active users, which involves retrieving candidates, fetching features, running deep learning models, and ranking results.
Unlike a typical web request, which maps one request to one response, a ranking request is highly asymmetric. A single user request expands into hundreds or thousands of (user, candidate) pairs that each need scoring, creating a massive "fanout" problem. Bento addresses this by splitting the work into two stages: retrieval, where cheap models narrow millions of items down to hundreds or thousands of candidates, and ranking, where expensive models carefully score and order those candidates.
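The two-stage split can be illustrated with a minimal sketch. This is not Bento's code: the function `two_stage_rank`, the toy scorers, and the values of `retrieve_k` and `top_n` are all hypothetical, chosen only to show that the expensive model runs once per retrieved candidate rather than once per item in the corpus.

```python
import heapq
from typing import Callable, List, Sequence

Scorer = Callable[[int, int], float]


def two_stage_rank(
    user: int,
    corpus: Sequence[int],
    cheap_score: Scorer,
    expensive_score: Scorer,
    retrieve_k: int,
    top_n: int,
) -> List[int]:
    """Hypothetical two-stage pipeline: retrieval, then ranking."""
    # Stage 1 (retrieval): a cheap scorer touches every item in the corpus
    # and keeps only the best retrieve_k candidates.
    candidates = heapq.nlargest(
        retrieve_k, corpus, key=lambda c: cheap_score(user, c)
    )
    # Stage 2 (ranking): the expensive scorer runs once per surviving
    # (user, candidate) pair. This is the fanout the article describes.
    ranked = sorted(
        candidates, key=lambda c: expensive_score(user, c), reverse=True
    )
    return ranked[:top_n]


expensive_calls = 0  # counts how often the costly model is invoked


def toy_cheap_score(user: int, candidate: int) -> float:
    # Stand-in for a lightweight retrieval model (assumption).
    return -abs(candidate - user)


def toy_expensive_score(user: int, candidate: int) -> float:
    # Stand-in for a deep ranking model (assumption).
    global expensive_calls
    expensive_calls += 1
    bonus = 0.5 if candidate % 2 == 0 else 0.0
    return -abs(candidate - user) + bonus


feed = two_stage_rank(
    user=500,
    corpus=range(100_000),
    cheap_score=toy_cheap_score,
    expensive_score=toy_expensive_score,
    retrieve_k=200,
    top_n=5,
)
print(feed, "expensive calls:", expensive_calls)
```

In this toy run one request produces 200 expensive scoring calls even though the corpus holds 100,000 items, which is why the cost of the ranking stage scales with `retrieve_k` rather than with corpus size.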
Key Pressures on the Bento Platform
The design of Bento is driven by four primary pressures that often conflict:
- Latency Pressure: Users abandon feeds that load slowly.
- Scale Pressure: Over a billion predictions per second and 1 TB/sec of feature reads.
- Freshness Pressure: Real-time response to user signals (e.g., a recent like).
- Iteration Pressure: ML engineers need to ship hundreds of experiments monthly.
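To get a feel for the scale figures, here is a rough back-of-envelope calculation using only the numbers quoted in this article. It rests on two assumptions that are not in the source: the quoted rates are treated as sustained averages (they may well be peaks, in which case the per-user figure is an upper bound), and 1 TB is taken as 10^12 bytes.

```python
# Figures quoted in the article.
PREDICTIONS_PER_SEC = 1_000_000_000
DAILY_ACTIVE_USERS = 477_000_000

# Assumption: 1 TB/sec means 10**12 bytes per second (decimal terabyte).
FEATURE_READ_BYTES_PER_SEC = 10**12

SECONDS_PER_DAY = 86_400

# Average feature payload read per prediction.
bytes_per_prediction = FEATURE_READ_BYTES_PER_SEC / PREDICTIONS_PER_SEC

# Assumption: the prediction rate holds all day, so this is a rough ceiling.
predictions_per_user_per_day = (
    PREDICTIONS_PER_SEC * SECONDS_PER_DAY / DAILY_ACTIVE_USERS
)

print(f"{bytes_per_prediction:.0f} bytes of features per prediction")
print(f"{predictions_per_user_per_day:,.0f} predictions per user per day")
```

Under these assumptions each prediction reads on the order of a kilobyte of features, and each active user accounts for roughly 180,000 predictions a day, which fits the fanout of hundreds or thousands of candidates per request.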
The training half of Bento uses Kubeflow for orchestration and standardizes model development with a layered approach: a shared Core framework (TensorFlow/Keras), individual User model code, and Training configuration (YAML). This layering supports hundreds of experiments by letting engineers change data, features, or model specifics quickly. A notable aspect is the model export step, which splits the compute graph so that each part runs on the inference hardware that suits it (dense layers on GPU, embeddings and feature parsing on CPU), maximizing resource utilization. Bento also automates incremental training, so models in production are continuously updated.
The serving half of Bento tackles the hard problems of low-latency feature fetching and model inference. The feature store, called Robusta, is critical: it processes 10 trillion events/day and keeps offline data (Apache Iceberg) consistent with online data (a fast key-value store) to prevent "train-serve skew," where a model sees different feature values in production than it saw during training. The online store alone holds 800 TB of data and serves 1 TB/second of reads.
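Train-serve skew is easiest to see with a parity check between an offline and an online computation of the same feature. The sketch below is a generic illustration, not Robusta's mechanism, which this article does not describe. The feature (likes in the last hour), the names `offline_like_count` and `OnlineLikeCounter`, and the event format are all hypothetical. It also assumes events arrive in timestamp order.

```python
from collections import deque
from typing import Deque, Iterable, Tuple

Event = Tuple[int, str]  # (unix timestamp in seconds, event kind)

WINDOW_SECONDS = 3600  # hypothetical feature: likes in the last hour


def offline_like_count(events: Iterable[Event], as_of: int) -> int:
    """Batch path: recompute the feature from the full event log."""
    return sum(
        1
        for ts, kind in events
        if kind == "like" and as_of - WINDOW_SECONDS < ts <= as_of
    )


class OnlineLikeCounter:
    """Streaming path: maintain the same feature incrementally."""

    def __init__(self) -> None:
        self._like_timestamps: Deque[int] = deque()

    def observe(self, ts: int, kind: str) -> None:
        # Assumes events are observed in non-decreasing timestamp order.
        if kind == "like":
            self._like_timestamps.append(ts)

    def value(self, as_of: int) -> int:
        # Evict likes that fell out of the window, using the same
        # boundary rule as the offline path.
        while (
            self._like_timestamps
            and self._like_timestamps[0] <= as_of - WINDOW_SECONDS
        ):
            self._like_timestamps.popleft()
        return len(self._like_timestamps)


log = [(0, "like"), (100, "view"), (1800, "like"), (3700, "like")]

counter = OnlineLikeCounter()
for ts, kind in log:
    counter.observe(ts, kind)

as_of = 3700
offline_value = offline_like_count(log, as_of)
online_value = counter.value(as_of)
print("offline:", offline_value, "online:", online_value)
```

If the two paths disagreed, for example because one used an inclusive window boundary and the other an exclusive one, the model would be trained on one distribution and served another. A check like this catches that class of bug before it reaches production.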