ByteByteGo·May 19, 2026

Snapchat's Bento: Scaling AI Prediction to a Billion QPS

This article details Snapchat's Bento platform, an ML-powered system designed to serve a billion predictions per second for various ranking and recommendation tasks. It highlights the architectural decisions and engineering challenges in building a low-latency, high-throughput, and fresh ML serving infrastructure. Key aspects include handling asymmetric ranking workloads, a specialized feature store, and optimized inference engines.



Introduction to Snapchat's Bento Platform

Snapchat's Bento platform is a machine learning infrastructure that delivers over a billion predictions per second, powering features like the Discover and Spotlight feeds, ad ranking, and friend suggestions. The core challenge is making each of these decisions in roughly 100 milliseconds while serving 477 million daily active users. Within that budget, the system must retrieve candidates, fetch features, run deep learning models, and rank the results.

The Asymmetric Ranking Workload

Unlike typical one-to-one web requests, a ranking request is highly asymmetric. A single user request expands into hundreds or thousands of (user, candidate) pairs that need scoring, creating a massive "fanout" problem. Bento addresses this by splitting the work into two stages: retrieval (cheap models filter millions to hundreds/thousands of candidates) and ranking (expensive models carefully score and order these candidates).
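The two-stage split can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `cheap_score` and `expensive_score` functions are toy stand-ins for the retrieval and ranking models, and the data layout is hypothetical, not Bento's.

```python
import heapq

def cheap_score(user, item):
    # Toy retrieval model: a single topic-match check per item.
    return 1.0 if item["topic"] in user["topics"] else 0.0

def expensive_score(user, item):
    # Toy stand-in for the deep ranking model; only run on survivors.
    return cheap_score(user, item) + item["quality"]

def rank_feed(user, corpus, retrieve_k, final_k):
    # Stage 1 (retrieval): cheaply narrow the corpus to retrieve_k candidates.
    candidates = heapq.nlargest(
        retrieve_k, corpus, key=lambda item: cheap_score(user, item)
    )
    # Stage 2 (ranking): score each (user, candidate) pair with the costly model.
    scored = [(expensive_score(user, item), item["id"]) for item in candidates]
    scored.sort(reverse=True)
    return [item_id for _, item_id in scored[:final_k]], len(scored)

topics = ["sports", "music", "news", "food"]
corpus = [
    {"id": i, "topic": topics[i % 4], "quality": (i % 10) / 10}
    for i in range(1000)
]
user = {"topics": {"music"}}

top_ids, pairs_scored = rank_feed(user, corpus, retrieve_k=100, final_k=5)
print(top_ids, pairs_scored)
```

The point of the sketch is the cost shape: the expensive model runs on 100 pairs rather than 1,000, and the same fanout still happens once per user request.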

Key Pressures on the Bento Platform

The design of Bento is driven by four primary pressures that often conflict:
- Latency pressure: users abandon feeds that load slowly.
- Scale pressure: over a billion predictions per second and 1 TB/sec of feature reads.
- Freshness pressure: rankings must respond in real time to user signals (e.g., a recent like).
- Iteration pressure: ML engineers need to ship hundreds of experiments monthly.
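A rough back-of-envelope calculation helps put the scale figures in proportion. It uses only numbers quoted in this article and assumes decimal terabytes and a sustained rate of one billion predictions per second, which likely overstates the daily average.

```python
# Figures quoted in the article; the sustained-rate assumption is ours.
PREDICTIONS_PER_SEC = 1_000_000_000
FEATURE_READ_BYTES_PER_SEC = 10**12      # 1 TB/sec, decimal TB assumed
DAILY_ACTIVE_USERS = 477_000_000
SECONDS_PER_DAY = 86_400

# Average feature bytes read for each prediction served.
bytes_per_prediction = FEATURE_READ_BYTES_PER_SEC / PREDICTIONS_PER_SEC

# Predictions per daily active user, if the rate held all day.
predictions_per_user_per_day = (
    PREDICTIONS_PER_SEC * SECONDS_PER_DAY / DAILY_ACTIVE_USERS
)

print(f"{bytes_per_prediction:.0f} bytes of features per prediction")
print(f"{predictions_per_user_per_day:,.0f} predictions per user per day")
```

Under these assumptions each prediction reads about 1 KB of features on average, and the platform makes on the order of 180,000 predictions per active user per day.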

Training Pipeline: Enabling Rapid Experimentation

The training half of Bento uses Kubeflow for orchestration and standardizes model development with a layered approach: a shared Core framework (TensorFlow/Keras), individual User model code, and Training configuration (YAML). This layering allows for hundreds of experiments daily by enabling quick changes to data, features, or model specifics. A unique aspect is the model export step, which splits the compute graph to optimize for different inference hardware (dense layers on GPU, embeddings/feature parsing on CPU) to maximize resource utilization. Bento also automates incremental training, ensuring models in production are continuously updated.
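The idea behind the export-time graph split can be sketched in plain Python. This is not Bento's export API or a TensorFlow call. The embedding table, weights, and function names are all hypothetical, and both stages run on the CPU here. The sketch only shows where the boundary between the two halves sits.

```python
import math

# Hypothetical embedding table and dense-layer parameters.
EMBEDDINGS = {"sports": [0.2, 0.1], "music": [0.9, 0.4]}
DENSE_WEIGHTS = [0.5, -0.25, 1.0]
DENSE_BIAS = 0.1

def cpu_stage(raw):
    """Feature parsing and embedding lookup: memory-bound work kept on CPU."""
    embedding = EMBEDDINGS.get(raw["topic"], [0.0, 0.0])
    return embedding + [raw["watch_seconds"] / 60.0]

def dense_stage(batch):
    """Dense layers: compute-bound work that the split would place on GPU."""
    scores = []
    for features in batch:
        z = sum(w * x for w, x in zip(DENSE_WEIGHTS, features)) + DENSE_BIAS
        scores.append(1.0 / (1.0 + math.exp(-z)))  # sigmoid output
    return scores

def predict(raw_batch):
    # The only data crossing the boundary is a batch of dense float vectors.
    return dense_stage([cpu_stage(raw) for raw in raw_batch])

scores = predict([
    {"topic": "music", "watch_seconds": 30},
    {"topic": "unknown", "watch_seconds": 0},
])
print(scores)
```

Splitting at this boundary means each half can be scaled on the hardware that suits it: lookups are limited by memory capacity and bandwidth, while the dense layers are limited by arithmetic throughput.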

Serving Path: Feature Store and Inference Strategies

The serving half of Bento tackles the hard problems of low-latency feature fetching and model inference. The feature store, called Robusta, is critical, processing 10 trillion events/day and maintaining consistency between offline (Apache Iceberg) and online (fast key-value store) data to prevent "train-serve skew." The online store alone handles 800 TB of data and 1 TB/second of reads.
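One common way to avoid train-serve skew is to compute each feature from a single definition and write the result to both stores. The sketch below assumes that approach. The article does not describe Robusta's internals, so the feature function is hypothetical, and a list and a dict stand in for the Iceberg table and the key-value store.

```python
from collections import defaultdict

def watch_time_feature(events):
    """Single feature definition shared by training and serving."""
    totals = defaultdict(float)
    for event in events:
        totals[event["user_id"]] += event["watch_seconds"]
    return dict(totals)

events = [
    {"user_id": "u1", "watch_seconds": 12.0},
    {"user_id": "u2", "watch_seconds": 3.5},
    {"user_id": "u1", "watch_seconds": 8.0},
]

features = watch_time_feature(events)

# Offline store stand-in: rows later read to build training examples.
offline_table = [
    {"user_id": user_id, "watch_seconds_total": total}
    for user_id, total in sorted(features.items())
]

# Online store stand-in: key-value lookups at inference time.
online_store = {
    user_id: {"watch_seconds_total": total}
    for user_id, total in features.items()
}

# Skew check: the model sees the same value in training and in serving.
skewed = [
    row["user_id"]
    for row in offline_table
    if online_store[row["user_id"]]["watch_seconds_total"]
    != row["watch_seconds_total"]
]
print(skewed)
```

Because both stores are populated from the same computed values, the skew check comes back empty. Skew appears when the two paths are implemented separately and drift apart.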

Strategy 1: Collocated Document Features. For many workloads, document features are stored directly on the inference engine instances. This eliminates the network fanout for candidate features and so reduces latency. It is cost-effective at Snap's scale but might be wasteful for smaller systems.
Strategy 2: Dedicated Retrieval Service. For very large document corpora, a separate Retrieval service performs Approximate Nearest Neighbor (ANN) search and index lookups, then returns a pre-hydrated candidate set, with features attached, to the inference engine. This offloads feature fetching from the inference engine and reduces data transfer.
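Both strategies can be contrasted in a small sketch. The `RemoteStore` class, the scoring function, and the data are hypothetical, and a brute-force dot product stands in for a real ANN index. The call counter is there to show how many network round trips each approach would need.

```python
import heapq

class RemoteStore:
    """Hypothetical remote key-value store that counts round trips."""
    def __init__(self, rows):
        self.rows = rows
        self.calls = 0

    def get(self, key):
        self.calls += 1
        return self.rows[key]

def score(user_f, doc_f):
    # Toy stand-in for the ranking model.
    return user_f["affinity"] * doc_f["quality"]

def rank_with_remote_docs(user_id, cand_ids, user_store, doc_store):
    # Baseline: one remote fetch per candidate document.
    user_f = user_store.get(user_id)
    return sorted(cand_ids, key=lambda d: score(user_f, doc_store.get(d)),
                  reverse=True)

def rank_with_collocated_docs(user_id, cand_ids, user_store, local_docs):
    # Strategy 1: document features live in the inference process's memory.
    user_f = user_store.get(user_id)
    return sorted(cand_ids, key=lambda d: score(user_f, local_docs[d]),
                  reverse=True)

def retrieve_hydrated(query_vec, index, k):
    # Strategy 2: nearest-neighbor search returns candidates with features.
    def similarity(entry):
        _, row = entry
        return sum(q * e for q, e in zip(query_vec, row["embedding"]))
    top = heapq.nlargest(k, index.items(), key=similarity)
    return [{"doc_id": doc_id, "quality": row["quality"]} for doc_id, row in top]

docs = {d: {"quality": (d % 7) / 7, "embedding": [d % 3, d % 5]}
        for d in range(500)}
users = {"u1": {"affinity": 2.0}}
cands = list(docs)

remote_users, remote_docs = RemoteStore(users), RemoteStore(docs)
baseline = rank_with_remote_docs("u1", cands, remote_users, remote_docs)

colloc_users = RemoteStore(users)
collocated = rank_with_collocated_docs("u1", cands, colloc_users, docs)

hydrated = retrieve_hydrated([1.0, 1.0], docs, k=10)
print(remote_docs.calls, colloc_users.calls, len(hydrated))
```

In this sketch the baseline makes 500 document fetches for one request, the collocated version makes none, and the hydrated retrieval result already carries the features the ranker needs.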


Architecture Design

Design this yourself

Design a high-scale, low-latency machine learning prediction platform like Snapchat's Bento, capable of serving billions of predictions per second for personalized content feeds and ad ranking. Your design should specifically address the asymmetric ranking workload, implement a consistent online/offline feature store (Robusta-like), and detail strategies for efficient feature retrieval and model inference (e.g., collocated features, dedicated retrieval services, optimized model export for heterogeneous hardware).

Other design angles

- Design a feature store component (like Robusta) for a real-time ML platform, focusing on consistency between training and serving, data freshness, and high-throughput reads.
- Design the inference engine for a recommendation system that handles a high fanout workload, considering optimizations for latency, compute, and memory efficiency across various hardware.
- Design an ML model training pipeline that supports rapid experimentation and continuous, incremental training for a production system, using tools like Kubeflow.