This article delves into the architecture of Netflix's Live Origin, a custom-built server designed to manage and deliver live video segments to millions of devices. It highlights key architectural decisions such as redundant regional pipelines, manifest design with segment templates, and intelligent segment selection for defect handling. The piece also explores the evolution of its storage architecture from AWS S3 to a custom Cassandra-based solution, optimized for the unique demands of high-throughput, low-latency live streaming.
Netflix Live Origin is a critical intermediary component between live streaming pipelines and Open Connect, Netflix's global Content Delivery Network (CDN). Unlike Video on Demand (VOD), live streaming demands real-time processing and delivery of video segments, typically within seconds. The Live Origin acts as a quality control gateway, ensuring that only valid video segments reach viewers worldwide. Its design addresses the challenges of time constraints, defect handling, and massive scale inherent in live video distribution.
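The "quality control gateway" role, combined with the redundant regional pipelines mentioned above, can be sketched as picking the first defect-free copy of a segment from whichever pipeline produced one. This is an illustrative model only: the class, field names, and validation rules below are assumptions, since Live Origin's actual checks are not public.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Segment:
    """One copy of a live segment from a regional pipeline (hypothetical model)."""
    pipeline_id: str
    sequence: int       # position in the live stream
    payload: bytes
    checksum_ok: bool   # did the payload pass integrity validation?
    duration_ms: int    # encoder-reported duration

def select_segment(candidates: list[Segment],
                   expected_duration_ms: int = 2000,
                   tolerance_ms: int = 100) -> Optional[Segment]:
    """Return the first valid copy, falling back across redundant pipelines."""
    for seg in candidates:
        if not seg.checksum_ok:
            continue  # corrupt payload: try the redundant pipeline's copy
        if abs(seg.duration_ms - expected_duration_ms) > tolerance_ms:
            continue  # suspicious timing: likely an encoder defect
        return seg
    return None  # no valid copy reached the origin in time

# Usage: two pipelines produced sequence 42; one copy failed validation.
bad = Segment("pipeline-a", 42, b"...", checksum_ok=False, duration_ms=2001)
good = Segment("pipeline-b", 42, b"...", checksum_ok=True, duration_ms=2001)
assert select_segment([bad, good]) is good
```

The key design point is that defect handling happens before the CDN: a bad segment is replaced by its redundant twin rather than propagated to viewers.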
Open Connect, originally optimized for VOD, required extensions for live streaming; Netflix extended nginx's proxy-caching functionality with several key features to support it.
Netflix initially used AWS S3 for Live Origin storage, but found its performance inadequate for high-scale, low-latency live streaming. The stringent 2-second retry budget and critical, time-sensitive writes demanded a more robust solution. They identified five key requirements: extremely high write availability within a region with low-latency cross-region replication, high write throughput (hundreds of MB/s), efficient handling of large writes with thousands of keys per partition, strong intra-region consistency for sub-second read latency, and gigabytes-per-second read throughput that does not affect writes during 'Origin Storms'.
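The 2-second retry budget means a write can be retried, but only until a hard deadline; after that the segment is simply too old to serve live. A minimal sketch of that pattern, with `write_fn`, the backoff schedule, and the exception type all being illustrative assumptions:

```python
import time

RETRY_BUDGET_S = 2.0  # total time across all attempts (budget from the article)

def write_with_budget(write_fn, payload: bytes,
                      budget_s: float = RETRY_BUDGET_S) -> bool:
    """Retry a segment write until success or until the time budget is spent.

    `write_fn` is a placeholder for the storage client's write call.
    """
    deadline = time.monotonic() + budget_s
    backoff = 0.05
    while time.monotonic() < deadline:
        try:
            write_fn(payload)
            return True
        except IOError:
            # Sleep with exponential backoff, but never past the deadline.
            time.sleep(min(backoff, max(0.0, deadline - time.monotonic())))
            backoff *= 2
    return False  # budget exhausted; for live video this segment is lost
```

This is why write availability matters more here than in typical object storage: there is no "try again later" for a live segment.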
Live Streaming vs. VOD Storage Needs
The article highlights a crucial distinction: live streaming storage requirements are closer to a global, low-latency, highly available database than traditional object storage, primarily due to the criticality of every write and the strict time budgets.
Their solution leveraged an existing Key-Value Storage Abstraction built on Apache Cassandra. By chunking large payloads and using Cassandra's local-quorum consistency with a write-optimized Log-Structured Merge Tree engine, they met the stringent write availability, throughput, and consistency requirements. Median write latency dropped significantly from 113ms to 25ms. To handle 'Origin Storms' (high read throughput impacting writes), they introduced write-through caching using EVCache (their distributed Memcached-based system), offloading most reads to a highly scalable cache and enabling 200Gbps+ throughput without affecting write performance.
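The two storage techniques above, chunking large payloads into many keys per partition and write-through caching in front of the durable store, can be sketched together. Everything here is a simplified stand-in: the chunk size, key scheme, and in-memory dicts (standing in for EVCache and the Cassandra-backed key-value abstraction) are assumptions, not Netflix's actual implementation.

```python
CHUNK_SIZE = 1 << 20  # 1 MiB per chunk (assumed; real chunk size not public)

def chunk_keys(segment_id: str, payload: bytes) -> list[tuple[str, bytes]]:
    """Split one large segment into per-chunk keys in a single logical partition."""
    return [(f"{segment_id}/chunk-{i}", payload[off:off + CHUNK_SIZE])
            for i, off in enumerate(range(0, len(payload), CHUNK_SIZE))]

class WriteThroughStore:
    """Write-through caching: every write lands in both the cache and the
    backing store, so 'Origin Storm' reads are absorbed by the cache and
    never compete with the write path."""

    def __init__(self):
        self.cache = {}    # stand-in for EVCache (Memcached-based)
        self.backing = {}  # stand-in for the Cassandra key-value store

    def put(self, segment_id: str, payload: bytes) -> None:
        for key, chunk in chunk_keys(segment_id, payload):
            self.backing[key] = chunk  # durable write (local-quorum in reality)
            self.cache[key] = chunk    # write-through: cache filled on the write path

    def get(self, segment_id: str) -> bytes:
        chunks, i = [], 0
        while True:
            key = f"{segment_id}/chunk-{i}"
            chunk = self.cache.get(key)        # cache first: offloads most reads
            if chunk is None:
                chunk = self.backing.get(key)  # fall back to the backing store
            if chunk is None:
                break  # no more chunks for this segment
            chunks.append(chunk)
            i += 1
        return b"".join(chunks)
```

Because the cache is populated during the write rather than on a read miss, a storm of concurrent readers for a just-published segment never stampedes the backing store.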