Dev.to #architecture·June 16, 2026

Uber RAMEN: Real-time Notification System for Ride-Hailing Apps

This article details Uber's RAMEN system, a robust push messaging infrastructure designed to deliver real-time notifications to millions of rider and driver devices. It explores the architectural choices and engineering challenges in maintaining persistent connections, ensuring at-least-once delivery, and scaling a stateful system for low-latency, reliable communication in unreliable mobile network environments.

The Challenge: Real-time, Reliable Notifications at Scale

Delivering instant ride offers and real-time location updates to millions of active users in a ride-hailing application presents significant system design challenges. The core requirements include extremely low latency (under 100ms), guaranteed delivery even on weak mobile networks, and efficient resource utilization to avoid overwhelming backend systems. Traditional polling mechanisms are highly inefficient for this scale, leading to excessive requests, wasted server resources, high latency, and significant battery drain on client devices.

Polling vs. Push - Why Push Wins

Polling requires clients to repeatedly ask the server for updates, leading to a high volume of empty responses and inherent latency. A push-based system, like RAMEN, maintains open connections with clients, allowing the server to proactively send data instantly when available. This drastically reduces latency, conserves battery life, and optimizes server resources by only transmitting data when necessary.
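
To make the comparison concrete, here is a rough back-of-the-envelope sketch. The numbers (a one-hour window, a 5-second poll interval, 3 real updates) and the function names are hypothetical, and the push side ignores any keepalive traffic needed to hold the connection open.

```python
def polling_cost(window_s: int, poll_interval_s: int, events: int) -> dict:
    """Count the requests a polling client makes in a time window."""
    requests = window_s // poll_interval_s
    # At most one poll per event returns data; the rest come back empty.
    useful = min(events, requests)
    return {
        "requests": requests,
        "empty_responses": requests - useful,
        # An event that lands just after a poll waits a full interval.
        "worst_case_delay_s": poll_interval_s,
    }


def push_cost(events: int) -> dict:
    """With push, the server sends one message per event."""
    return {"messages": events}


# Hypothetical workload: one hour, polling every 5 seconds, 3 real updates.
polling = polling_cost(window_s=3600, poll_interval_s=5, events=3)
push = push_cost(events=3)
print(polling)
print(push)
```

Under these assumptions almost every poll is wasted, and shortening the interval to cut latency only increases the number of empty responses.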

RAMEN's Three-Tier Architecture

Uber's RAMEN (Real-time Asynchronous Messaging Network) is a custom-built push messaging infrastructure. It employs a three-tier architecture to manage the complexity of decision-making, payload construction, and message delivery:

Fireball Service (Decision Engine): Consumes Kafka events, evaluates business rules, and determines *when* a message needs to be pushed, handling priority and localization.
API Gateway (Payload Builder): Aggregates data from various microservices, builds the actual message payloads, and serializes them (e.g., using Protobuf). This layer determines *what* to push.
RAMEN Server (Delivery Layer): Manages millions of concurrent persistent connections, routes messages to the correct device, and guarantees at-least-once delivery. This layer handles *how* to push.
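
The split between when, what, and how can be sketched as three small functions. This is an illustration of the separation of concerns only, not Uber's code: every class, function, event kind, and priority value below is hypothetical, JSON stands in for the real serialization, and a plain list stands in for an open device connection.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class Event:
    user_id: str
    kind: str   # e.g. "ride_offer"; the event kinds here are made up
    data: dict


@dataclass
class PushDecision:
    user_id: str
    priority: int   # lower number = more urgent (an assumption)


def decide(event: Event) -> Optional[PushDecision]:
    """Tier 1 (decision engine): decide *when* a push is needed."""
    priorities = {"ride_offer": 0, "location_update": 1}
    if event.kind not in priorities:
        return None  # not every event results in a push
    return PushDecision(event.user_id, priorities[event.kind])


def build_payload(event: Event, decision: PushDecision) -> bytes:
    """Tier 2 (payload builder): decide *what* to push."""
    body = {"kind": event.kind, "priority": decision.priority, "data": event.data}
    return json.dumps(body, sort_keys=True).encode("utf-8")


class DeliveryLayer:
    """Tier 3 (delivery): decide *how* to push."""

    def __init__(self) -> None:
        # user_id -> list of payloads; each list stands in for an open stream.
        self.connections = {}

    def connect(self, user_id: str) -> None:
        self.connections.setdefault(user_id, [])

    def push(self, user_id: str, payload: bytes) -> bool:
        stream = self.connections.get(user_id)
        if stream is None:
            return False  # device is not connected to this server
        stream.append(payload)
        return True


def handle(event: Event, delivery: DeliveryLayer) -> bool:
    """Run one event through all three tiers."""
    decision = decide(event)
    if decision is None:
        return False
    return delivery.push(decision.user_id, build_payload(event, decision))
```

Keeping the tiers separate means the delivery layer never needs to understand business rules, and the decision logic never needs to know which server holds a device's connection.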

Protocol Evolution: From SSE to gRPC over QUIC/HTTP/3

The choice of transport protocol is critical for real-time systems. Initially, RAMEN utilized Server-Sent Events (SSE) over HTTP/1.1. While simple, SSE is unidirectional (server-to-client only), requiring separate HTTP POST requests for client acknowledgments (ACKs). It also suffered from head-of-line blocking and heavy JSON payloads. Uber transitioned to gRPC bidirectional streams over QUIC/HTTP/3 for significant improvements:

Full-duplex Communication: Allows both server and client to send data simultaneously on a single connection.
Binary Framing (Protobuf): Enables smaller, more efficient payloads and lower CPU usage.
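Multiplexing: Multiple logical streams share one connection. Because QUIC delivers each stream independently, a lost packet stalls only the stream it belongs to, which removes the transport-level head-of-line blocking that TCP imposes.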
QUIC (UDP-based): Offers improved resilience on unstable mobile networks, faster connection establishment, and connection migration (e.g., switching from Wi-Fi to 4G without dropping the stream).
In-stream ACKs: Acknowledgments are sent directly on the gRPC stream, streamlining communication.
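
The payload-size argument for binary framing can be illustrated with the standard library alone. Note the substitution: this sketch uses a fixed `struct` layout as a stand-in for Protobuf, which encodes fields differently (tags and variable-length integers), so real sizes will differ. The message fields are hypothetical.

```python
import json
import struct

# Hypothetical ride-offer message; the field names are illustrative only.
offer = {"seq": 42, "offer_id": 123456789, "lat": 37.7749, "lng": -122.4194}

# Text encoding, as a JSON payload would carry it.
json_bytes = json.dumps(offer).encode("utf-8")

# Fixed binary layout: little-endian, two unsigned 32-bit ints, two doubles.
LAYOUT = struct.Struct("<IIdd")
binary_bytes = LAYOUT.pack(offer["seq"], offer["offer_id"], offer["lat"], offer["lng"])

# Decoding is a single unpack with no text parsing.
seq, offer_id, lat, lng = LAYOUT.unpack(binary_bytes)

print(len(json_bytes), "bytes as JSON")
print(len(binary_bytes), "bytes in the binary layout")
```

The binary form drops the repeated field names and the decimal text representation of numbers, which is where most of the saving comes from.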

Scalability and Reliability for Stateful Connections

RAMEN servers are stateful: each server holds the open connections and gRPC streams for a particular set of users, so a message for a given user must reach the server that owns that user's connection. This poses unique challenges for scalability and high availability. To manage millions of connections across a cluster of hundreds of servers, Uber employs Apache Helix and ZooKeeper for sharding and automatic rebalancing. ZooKeeper stores the cluster topology, while Helix detects server failures and redistributes shards (groups of user connections) to healthy servers, so affected clients can reconnect to the new owner of their shard.
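
The idea of shards that move between servers can be sketched in plain Python. This does not use the Helix or ZooKeeper APIs; it only illustrates the bookkeeping, and the shard count, server names, and placement policy (round-robin, then least-loaded survivor) are all assumptions.

```python
import hashlib

NUM_SHARDS = 8  # hypothetical; a real cluster would use many more


def shard_for(user_id: str) -> int:
    """Map a user to a shard with a stable hash (Python's hash() is salted per process)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


def assign(servers: list) -> dict:
    """Initial placement: spread shards round-robin over the servers."""
    return {shard: servers[shard % len(servers)] for shard in range(NUM_SHARDS)}


def rebalance(assignment: dict, failed: str) -> dict:
    """Move only the failed server's shards, each to the least-loaded survivor."""
    survivors = sorted({s for s in assignment.values() if s != failed})
    if not survivors:
        raise RuntimeError("no healthy servers left")
    load = {s: 0 for s in survivors}
    for owner in assignment.values():
        if owner != failed:
            load[owner] += 1
    new_assignment = dict(assignment)
    for shard, owner in sorted(assignment.items()):
        if owner == failed:
            target = min(survivors, key=lambda s: (load[s], s))
            new_assignment[shard] = target
            load[target] += 1
    return new_assignment


before = assign(["ramen-a", "ramen-b", "ramen-c"])
after = rebalance(before, failed="ramen-b")
print("user-123 is in shard", shard_for("user-123"))
print("owner before:", before[shard_for("user-123")])
print("owner after: ", after[shard_for("user-123")])
```

Because a user maps to a shard rather than directly to a server, a failure only changes the shard-to-server table; users whose shards lived on healthy servers keep their existing connections.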

At-Least-Once Delivery: Achieved using a persistence layer (Cassandra for durable storage, Redis for fast caching). Messages are written to Cassandra, cached in Redis, then pushed. If no ACK is received from the client within a timeout, the message is retried from Cassandra.
Sequence Numbers: Each message carries an incrementing sequence number. Upon reconnection, the client tells the server the last sequence number it received, so the server resends only the missed messages. Because at-least-once delivery can repeat a message, the same numbers let the client recognize and discard duplicates.
Graceful Connection Draining: To prevent a "thundering herd" problem during deployments or scaling down, RAMEN servers do not abruptly terminate connections. Instead, they stop accepting new connections and send a "Graceful Disconnect" message to existing clients with a randomized backoff hint, staggering client reconnects over a period.
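
The three mechanisms above can be sketched together in a few dozen lines. This is a minimal in-memory illustration, not RAMEN's implementation: a dict stands in for the durable store, time is passed in explicitly, the client assumes in-order delivery on a single stream, and all class names, method names, and timeout values are hypothetical.

```python
import random


class Outbox:
    """Per-user outbox. A dict stands in for the durable store."""

    def __init__(self, ack_timeout_s: float = 5.0) -> None:
        self.ack_timeout_s = ack_timeout_s
        self.next_seq = 1
        self.store = {}     # seq -> payload, kept until the client ACKs it
        self.sent_at = {}   # seq -> time of the last push attempt

    def send(self, payload: bytes, now: float) -> int:
        """Persist first, then push, so a crash cannot lose the message."""
        seq = self.next_seq
        self.next_seq += 1
        self.store[seq] = payload
        self.sent_at[seq] = now
        return seq

    def ack(self, seq: int) -> None:
        """Client confirmed receipt; the message no longer needs retrying."""
        self.store.pop(seq, None)
        self.sent_at.pop(seq, None)

    def due_for_retry(self, now: float) -> list:
        """Return (seq, payload) pairs whose ACK timeout has expired."""
        due = [s for s, t in self.sent_at.items() if now - t >= self.ack_timeout_s]
        for s in due:
            self.sent_at[s] = now  # restart the timer for the resend
        return [(s, self.store[s]) for s in sorted(due)]

    def replay_after(self, last_seen_seq: int) -> list:
        """On reconnect, resend unACKed messages newer than the client's last seq."""
        return [(s, p) for s, p in sorted(self.store.items()) if s > last_seen_seq]


class Client:
    """Client side: drop anything at or below the highest sequence already seen."""

    def __init__(self) -> None:
        self.last_seq = 0
        self.delivered = []

    def receive(self, seq: int, payload: bytes) -> bool:
        if seq <= self.last_seq:
            return False  # duplicate produced by a retry
        self.last_seq = seq
        self.delivered.append(payload)
        return True


def drain_hint(max_delay_s: float, rng: random.Random) -> float:
    """Randomized reconnect delay to send along with a graceful disconnect."""
    return rng.uniform(0.0, max_delay_s)
```

A typical flow under these assumptions: the server calls `send`, the client calls `receive` and the server records the `ack`; a periodic job calls `due_for_retry` to resend anything still unacknowledged; and on reconnect the server calls `replay_after(client.last_seq)`. During a deployment, each connected client would receive its own `drain_hint` value, so reconnects spread out over the window instead of arriving at once.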
