This article details the architectural evolution of Joyn, a German streaming platform, from a fragile single-node setup to a resilient, scalable, multi-region serverless architecture on AWS. It highlights key patterns like Hub and Spoke for data consistency and cell-based isolation for resiliency, along with strategic trade-offs in data replication and cost optimization for high-availability systems.
Read original on InfoQ Architecture
The article chronicles the transformation of a streaming application's backend, initially plagued by single points of failure, inconsistent data, and poor scalability due to technical debt and a lack of architectural foresight. The original system featured Kafka workers processing data into a single-node database, behind a GraphQL API that frequently crashed under load. The core challenges revolved around data consistency across multiple unstandardized services and achieving high availability and scalability for millions of users at an affordable cost.
The team adopted a serverless-first approach using AWS managed services to offload operational complexity and inherently gain scalability and availability. This shift significantly reduced deployment times from hours to minutes. The new architecture supports both active-active and active-passive multi-region setups, depending on service criticality, demonstrating a pragmatic approach to resilience and cost. The presentation emphasizes that achieving high availability and scalability at a low cost often involves trade-offs that management needs to understand.
A major issue was inconsistent data, caused by multiple services subscribing to the same Kafka topics and each performing its own divergent validations and transformations. The solution was the Hub and Spoke pattern, using Amazon EventBridge as a local bus for each service and EventBridge Pipes as a middleman between Kafka (the company bus and event store) and EventBridge. This pattern establishes clear boundaries and a single interface for inter-service communication, which simplifies routing and enforces data consistency.
Hub and Spoke Pattern for Event-Driven Systems
Each microservice communicates only with its local EventBridge bus (the "spoke"). EventBridge Pipes connect the main Kafka bus (the "hub") to these local spokes, allowing for centralized validation and transformation before fanning out events to consumers. This decouples services and establishes a single source of truth for events.
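The flow described above can be sketched in plain Python, with no AWS calls. This is a minimal in-memory illustration, not Joyn's implementation: `LocalBus`, `Pipe`, `publish`, the `media.updated` topic, and the validation rule are all hypothetical names chosen for the example. The point it shows is that every spoke receives events that passed through the same shared validation and transformation step.

```python
from dataclasses import dataclass, field
from typing import Optional

Event = dict


@dataclass
class LocalBus:
    """Stand-in for one service's local event bus (a spoke)."""
    name: str
    received: list = field(default_factory=list)

    def put_event(self, event: Event) -> None:
        self.received.append(event)


def validate_and_normalize(event: Event) -> Optional[Event]:
    """Shared rules (hypothetical): require an id and normalize the title."""
    if "id" not in event:
        return None  # invalid events never reach any spoke
    title = str(event.get("title", "")).strip().lower()
    return {**event, "title": title}


@dataclass
class Pipe:
    """Stand-in for a pipe between one hub topic and one local bus."""
    topic: str
    target: LocalBus

    def forward(self, topic: str, event: Event) -> bool:
        if topic != self.topic:
            return False
        prepared = validate_and_normalize(event)
        if prepared is None:
            return False
        self.target.put_event(prepared)
        return True


def publish(pipes: list, topic: str, event: Event) -> int:
    """Simulate an event arriving on the hub; return the number of deliveries."""
    return sum(pipe.forward(topic, event) for pipe in pipes)


catalog = LocalBus("catalog")
search = LocalBus("search")
pipes = [Pipe("media.updated", catalog), Pipe("media.updated", search)]

delivered = publish(pipes, "media.updated", {"id": "m1", "title": "  Some Show "})
rejected = publish(pipes, "media.updated", {"title": "missing id"})
```

Because both pipes apply the same function, the two services cannot drift apart in how they interpret an event, which is the consistency property the pattern is meant to provide.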
Given the 256 KB message size limit for EventBridge, handling large media-related events (up to 30-40 MB in Kafka) necessitated the Claim Check pattern. The enrichment step of EventBridge Pipes intercepts each message, stores the large payload in S3, and passes only the S3 key through EventBridge. Consumers then retrieve the full data from S3, which scales without custom build effort. The trade-off is an extra network round trip to S3 in exchange for small messages on the bus.
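A minimal sketch of the Claim Check pattern follows, again in plain Python. The `ObjectStore` class is a hypothetical in-memory stand-in for an S3 bucket, and `enrich`, `resolve`, and the `claim_check` field name are assumptions made for illustration. As a simplification, the size check covers only the payload, not the whole event envelope.

```python
import json
import uuid

MAX_EVENT_BYTES = 256 * 1024  # the bus size limit cited in the text


class ObjectStore:
    """In-memory stand-in for an S3 bucket (hypothetical helper)."""

    def __init__(self) -> None:
        self._objects = {}

    def put(self, key: str, body: bytes) -> None:
        self._objects[key] = body

    def get(self, key: str) -> bytes:
        return self._objects[key]


def enrich(event: dict, store: ObjectStore) -> dict:
    """Enrichment step: replace an oversized payload with a claim check."""
    body = json.dumps(event["payload"]).encode("utf-8")
    if len(body) <= MAX_EVENT_BYTES:
        return event  # small events pass through unchanged
    key = f"claim-checks/{uuid.uuid4()}.json"
    store.put(key, body)
    return {"id": event["id"], "claim_check": {"key": key, "size": len(body)}}


def resolve(event: dict, store: ObjectStore) -> dict:
    """Consumer side: load the full payload if the event carries a claim check."""
    if "claim_check" not in event:
        return event["payload"]
    return json.loads(store.get(event["claim_check"]["key"]))


store = ObjectStore()
large = {"id": "m1", "payload": {"frames": "x" * (300 * 1024)}}
small = {"id": "m2", "payload": {"title": "short"}}

slim = enrich(large, store)      # payload moved to the store
unchanged = enrich(small, store)  # below the limit, passed through
restored = resolve(slim, store)   # consumer fetches the full payload
```

The consumer pays one extra read against the store for each large event, while the event on the bus stays far below the size limit.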
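| Feature | Event-Driven Architecture | Data Replication (e.g., PostgreSQL pglogical) |
|---|---|---|
| Coupling | Strong decoupling between services | Tighter coupling; schema changes can cause issues |
| Data store choice | Each service can choose its own data store | All services must be able to use the same database |
| Operational complexity | Chosen by the Joyn team for operational simplicity | Significant, due to network isolation and database management |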
The presentation also contrasted event-driven architecture with traditional data replication (e.g., PostgreSQL pglogical). An event-driven architecture offers strong decoupling and lets each service choose its own data store, whereas data replication can be simpler for partial data needs if all services can use the same database. However, data replication introduces tighter coupling, potential schema change issues, and significant operational complexity around network isolation and database management. The Joyn team prioritized decoupling and operational simplicity, and therefore leaned towards the event-driven approach.