AWS has replaced traditional fat-tree data center network topologies with a new flat architecture called Resilient Network Graphs (RNG), based on quasi-random graph theory. The design, already deployed in production, significantly reduces the number of networking devices and the power they consume while improving throughput and resilience. It uses passive optical ShuffleBoxes for physical connectivity and a custom distributed protocol, Spraypoint, for routing.
Read original on InfoQ Architecture
Traditional data center networks rely on a fat-tree topology, characterized by a hierarchical structure of top-of-rack (ToR), aggregation, and spine switches. This design funnels traffic through shared spine links, which can become bottlenecks under heavy load, leading to reduced throughput. Scaling such a network often requires adding entire switch tiers, a costly and power-intensive approach. AWS's move to Resilient Network Graphs (RNG) represents a fundamental shift away from this hierarchy.
The Core Problem with Fat-Tree Topologies
In a fat-tree, traffic between servers on different racks must traverse up and down the switch hierarchy. Congestion at higher-level spine switches can severely degrade network performance for many racks simultaneously, even if other parts of the network have ample bandwidth. This introduces single points of failure and bottlenecks that are expensive to mitigate.
RNG implements a flat network architecture in which the aggregation and spine tiers above the ToR switches are eliminated. Instead, each ToR switch is connected directly to a quasi-random set of other ToR switches. This mesh-like connectivity uses expander network fabrics derived from random graph theory, a class of topology that mathematicians in the early 1990s theorized to be the most efficient and resilient.
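To make the idea of a flat, quasi-randomly wired fabric concrete, here is a minimal Python sketch. It is not AWS's construction: the source does not describe how RNG chooses its links, so this example assumes a simple stand-in, a random regular graph built as a union of random Hamiltonian cycles. The function names (build_flat_fabric, hop_counts, diameter) and the sizes used (32 ToRs, 4 links each) are hypothetical and chosen only for illustration.

```python
import random
from collections import deque


def build_flat_fabric(num_tors, degree, seed=0, max_attempts=1000):
    """Return {tor: set(peer tors)} for a random regular graph.

    Assumption for illustration: the graph is a union of degree/2 random
    Hamiltonian cycles, so every ToR gets exactly `degree` links and the
    fabric is connected by construction.
    """
    if degree < 2 or degree % 2 or num_tors <= degree:
        raise ValueError("need an even degree >= 2 and num_tors > degree")
    rng = random.Random(seed)
    links = {tor: set() for tor in range(num_tors)}
    cycles_placed = 0
    for _ in range(max_attempts):
        if cycles_placed == degree // 2:
            return links
        order = list(range(num_tors))
        rng.shuffle(order)
        cycle = list(zip(order, order[1:] + order[:1]))
        # Reject a cycle that would duplicate an already placed link.
        if any(b in links[a] for a, b in cycle):
            continue
        for a, b in cycle:
            links[a].add(b)
            links[b].add(a)
        cycles_placed += 1
    if cycles_placed == degree // 2:
        return links
    raise RuntimeError("could not place enough edge-disjoint cycles")


def hop_counts(links, source):
    """Breadth-first search: hops from `source` to every reachable ToR."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for peer in links[node]:
            if peer not in dist:
                dist[peer] = dist[node] + 1
                queue.append(peer)
    return dist


def diameter(links):
    """Largest shortest-path hop count between any two ToRs."""
    return max(max(hop_counts(links, tor).values()) for tor in links)


fabric = build_flat_fabric(num_tors=32, degree=4, seed=1)
print("links per ToR:", {len(peers) for peers in fabric.values()})
print("worst-case ToR-to-ToR hops:", diameter(fabric))
```

Every path in this toy fabric runs ToR to ToR with no tier above to funnel through, and removing any single link leaves all ToRs reachable because the other cycle still spans every node. A production design would also need to account for port counts, cabling, and routing, which this sketch ignores.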