InfoQ Architecture · June 4, 2026

AWS Replaces Fat-Tree Data Center Networks with Random Graph Theory for Enhanced Efficiency

AWS has replaced traditional fat-tree data center network topologies with a new flat architecture called Resilient Network Graphs (RNG), based on quasi-random graph theory. This innovation, deployed in production, significantly reduces networking devices and power consumption while improving throughput and resilience. The design leverages passive optical ShuffleBoxes for physical connectivity and a custom distributed protocol, Spraypoint, for routing.


Evolution from Fat-Tree to Random Graph Networks

Traditional data center networks rely on a fat-tree topology, characterized by a hierarchical structure of top-of-rack (ToR), aggregation, and spine switches. This design funnels traffic through shared spine links, which can become bottlenecks under heavy load, leading to reduced throughput. Scaling such a network often requires adding entire switch tiers, a costly and power-intensive approach. AWS's move to Resilient Network Graphs (RNG) represents a fundamental shift away from this hierarchy.


The Core Problem with Fat-Tree Topologies

In a fat-tree, traffic between servers on different racks must traverse up and down the switch hierarchy. Congestion at higher-level spine switches can severely degrade network performance for many racks simultaneously, even if other parts of the network have ample bandwidth. The hierarchy therefore concentrates congestion and failure impact in a small number of shared devices, which is expensive to mitigate.
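
To make the cost of the hierarchy concrete, the sketch below counts devices in a textbook k-ary fat-tree, that is, three tiers built from identical k-port switches. The model and the function name `fat_tree_counts` are assumptions for illustration; the article does not say which fat-tree variant or port counts AWS used.

```python
def fat_tree_counts(k: int) -> dict:
    """Device counts for a textbook k-ary fat-tree built from k-port switches."""
    if k < 2 or k % 2 != 0:
        raise ValueError("k must be an even integer >= 2")
    edge = k * (k // 2)         # k pods, each with k/2 edge (ToR) switches
    aggregation = k * (k // 2)  # k pods, each with k/2 aggregation switches
    core = (k // 2) ** 2        # (k/2)^2 core (spine) switches
    hosts = k ** 3 // 4         # each edge switch serves k/2 hosts
    total = edge + aggregation + core
    return {
        "edge": edge,
        "aggregation": aggregation,
        "core": core,
        "hosts": hosts,
        "total_switches": total,
        # Share of switches that sit above the ToR tier.
        "non_tor_share": (aggregation + core) / total,
    }


if __name__ == "__main__":
    for ports in (16, 32, 48):  # hypothetical switch port counts
        print(ports, fat_tree_counts(ports))
```

In this idealized model, three out of every five switches sit above the ToR tier regardless of k, which gives a sense of how much equipment the hierarchy itself requires.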

Resilient Network Graphs (RNG) Architecture

RNG implements a flat network architecture in which the aggregation and spine layers are eliminated. Instead, each ToR switch is directly connected to a quasi-random set of other ToR switches. This mesh-like connectivity is based on expander-based network fabrics derived from random graph theory, which mathematicians theorized in the early 1990s as the most efficient and resilient network topology.
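
The flat topology can be explored with a small simulation. The sketch below wires hypothetical ToR switches into a random regular graph (every switch gets the same number of fabric links) and measures hop counts with breadth-first search. The generator, the function names, and the sizes used are assumptions chosen for illustration, not AWS's construction method.

```python
import random
from collections import deque


def random_regular_graph(n, d, rng):
    """Randomly wire n switches so each has exactly d distinct neighbors.

    Pairs up link "stubs" at random and restarts on a dead end. The result
    is random but not uniformly sampled, which is fine for illustration.
    """
    if d >= n or (n * d) % 2 != 0:
        raise ValueError("need d < n and n * d even")
    while True:
        stubs = [v for v in range(n) for _ in range(d)]
        rng.shuffle(stubs)
        adj = {v: set() for v in range(n)}
        stuck = False
        while stubs:
            u = stubs.pop()
            # Valid partners: not u itself and not already linked to u.
            options = [i for i, w in enumerate(stubs)
                       if w != u and w not in adj[u]]
            if not options:
                stuck = True
                break
            w = stubs.pop(rng.choice(options))
            adj[u].add(w)
            adj[w].add(u)
        if not stuck:
            return adj


def hops_from(adj, src):
    """Breadth-first search: hop count from src to every reachable switch."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def hop_stats(adj):
    """Return (max hops, mean hops) over all reachable switch pairs."""
    hops = [h for s in adj for t, h in hops_from(adj, s).items() if t != s]
    return max(hops), sum(hops) / len(hops)


if __name__ == "__main__":
    fabric = random_regular_graph(n=64, d=8, rng=random.Random(42))
    worst, mean = hop_stats(fabric)
    print(f"max hops: {worst}, mean hops: {mean:.2f}")
```

Varying `n` and `d` shows how slowly the hop count grows with fabric size, which is the expander property the design relies on.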

Key Architectural Components

  • ShuffleBox: A passive optical device that provides the physical quasi-random interconnections between racks. It shuffles fiber wiring internally, allowing for logical randomness while maintaining straightforward physical cabling. Being passive, it adds no latency, consumes no power, and introduces no new failure modes.
  • Spraypoint: A custom distributed routing protocol designed for the flat RNG topology. With no hierarchy to route through, Spraypoint sprays traffic simultaneously across neighboring routers and uses designated waypoints to guide packets to their destinations. Sending duplicate packets may seem inefficient, but the approach exploits multi-path redundancy for increased resilience and full utilization of available bandwidth.
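
The idea of spraying traffic through waypoints can be illustrated with a generic randomized-waypoint routing sketch. This is not Spraypoint itself, whose details are not given here. It assumes a small circulant graph as a stand-in fabric, picks a random intermediate switch per packet, and follows shortest paths to and from it. All function names are hypothetical, and the sketch models only path selection, not packet duplication.

```python
import random
from collections import deque


def circulant_graph(n, offsets):
    """Stand-in fabric: switch v links to v +/- each offset (mod n)."""
    adj = {v: set() for v in range(n)}
    for v in range(n):
        for o in offsets:
            adj[v].add((v + o) % n)
            adj[(v + o) % n].add(v)
    return adj


def shortest_path(adj, src, dst):
    """Breadth-first search returning one shortest path from src to dst."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            break
        for w in sorted(adj[u]):  # sorted for reproducible tie-breaking
            if w not in parent:
                parent[w] = u
                queue.append(w)
    if dst not in parent:
        raise ValueError("destination unreachable")
    path, node = [], dst
    while node is not None:
        path.append(node)
        node = parent[node]
    return path[::-1]


def route_via_waypoint(adj, src, dst, waypoint):
    """Concatenate src -> waypoint and waypoint -> dst shortest paths."""
    first_leg = shortest_path(adj, src, waypoint)
    second_leg = shortest_path(adj, waypoint, dst)
    return first_leg + second_leg[1:]


def spray(adj, src, dst, packets, rng):
    """Give each packet its own randomly chosen waypoint."""
    waypoints = [v for v in adj if v not in (src, dst)]
    return [route_via_waypoint(adj, src, dst, rng.choice(waypoints))
            for _ in range(packets)]


if __name__ == "__main__":
    fabric = circulant_graph(16, (1, 3, 5))
    paths = spray(fabric, src=0, dst=8, packets=20, rng=random.Random(7))
    first_hops = {p[1] for p in paths}
    print(f"distinct first-hop neighbors used: {len(first_hops)}")
```

Because each packet may leave through a different neighbor, load is spread over many links instead of a single shortest path, at the cost of somewhat longer routes.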

Benefits and Trade-offs

  • Efficiency: 69% fewer networking devices, up to 33% higher throughput, and a projected 40% reduction in network equipment power consumption compared to fat-tree designs.
  • Resilience: A major advantage of RNG is its graceful degradation. Losing 1% of routers results in roughly a 1% loss of capacity, unlike the catastrophic bottlenecks seen in fat-tree topologies when a spine switch fails. Random graphs distribute connectivity evenly, so no single device becomes a bottleneck.
  • Cost Savings: The arXiv paper reports cost savings between 9% and 45% with equal or better performance.
  • Scope Limitation: RNG is optimized for general-purpose compute with random traffic patterns. It is not suitable for AI training workloads, which generate coordinated, centralized traffic; for those, AWS continues to use its UltraServer architecture.
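
The graceful-degradation claim above can be checked in miniature with a simulation. The sketch below is a rough model under stated assumptions: the fabric is approximated as a union of random rings (so each switch has up to 8 neighbors), and "capacity" is approximated as the fraction of each surviving switch's links that still lead to a live neighbor. The function names and sizes are hypothetical.

```python
import random


def random_ring_union(n, rounds, rng):
    """Stand-in fabric: overlay `rounds` random rings over n switches.

    Each ring adds two links per switch; duplicate links collapse, so the
    degree is at most 2 * rounds.
    """
    if n < 3:
        raise ValueError("need at least 3 switches")
    adj = {v: set() for v in range(n)}
    for _ in range(rounds):
        order = list(range(n))
        rng.shuffle(order)
        for a, b in zip(order, order[1:] + order[:1]):
            adj[a].add(b)
            adj[b].add(a)
    return adj


def surviving_link_fraction(adj, failed):
    """Mean share of links still usable at each surviving switch."""
    survivors = [v for v in adj if v not in failed]
    kept = [len(adj[v] - failed) / len(adj[v]) for v in survivors]
    return sum(kept) / len(kept)


if __name__ == "__main__":
    rng = random.Random(2026)
    fabric = random_ring_union(n=400, rounds=4, rng=rng)
    failed = set(rng.sample(range(400), 4))  # 1% of switches fail
    kept = surviving_link_fraction(fabric, failed)
    print(f"links kept per surviving switch: {kept:.3%}")
```

Because failures land on randomly placed neighbors, each surviving switch loses about the same small share of its links, rather than one tier losing a large share at once.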