This article details Pinterest's complex journey to identify and resolve intermittent network connectivity issues in their Ray-based ML training jobs running on Kubernetes clusters backed by AWS EC2. The investigation uncovered CPU starvation affecting AWS ENA network drivers, leading to device resets and job crashes. The process highlights systematic debugging, profiling techniques, and the challenges of diagnosing transient performance bottlenecks in large-scale distributed systems.
Pinterest's ML platform experienced frequent crashes in its Ray-based training jobs, which are critical for scaling ML workloads. These crashes were attributed to intermittent network connectivity loss within the highly network-active Ray clusters, leading to issues such as hanging jobs, object fetch timeouts, and actor deaths. The non-deterministic nature of the failures made initial debugging efforts challenging and forced a deeper dive into the underlying infrastructure.
Why CPU Starvation Matters for Network Drivers
Network drivers rely on CPU time to process packets. If the threads responsible for servicing the network queues are starved of CPU for too long (on the order of five seconds for the AWS ENA driver), the keep-alive watchdog trips, the driver resets the device, and in-flight packets are lost. In latency-sensitive, high-throughput workloads like distributed ML training, those resets surface directly as the connectivity failures described above.
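When the watchdog does fire, the driver records it in the kernel ring buffer, so the quickest way to confirm that a connectivity blip was actually a device reset is to check `dmesg`. The exact message wording varies across ena driver versions, so the grep pattern below is an assumption to be tuned, not something quoted from Pinterest's write-up:

# Look for ENA keep-alive/watchdog/reset events with human-readable timestamps,
# so reset times can later be lined up against profiling data.
# NOTE: message wording differs between ena driver versions; adjust the pattern.
sudo dmesg --ctime | grep -iE 'ena.*(keep.?alive|watchdog|reset)'

# Or stream new events live while trying to reproduce the issue.
sudo dmesg --follow --ctime | grep -iE 'ena.*(keep.?alive|watchdog|reset)'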
Machine-wide `perf` snapshots initially showed only unremarkable aggregate CPU usage, masking the root cause. The team realized that with so many vCPU cores (96 on the GPU machines), a single core sitting at 100% system CPU for several seconds barely moves the aggregate metrics, yet is enough to starve the critical network threads scheduled on that core. They turned to `mpstat` for per-core, per-second utilization monitoring, which revealed sporadic bursts of 100% system CPU on individual cores that correlated with the network resets.
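A minimal version of that per-core monitoring can be reproduced with `mpstat` plus a small `awk` filter. This is a sketch rather than Pinterest's actual tooling: the 90% `%sys` threshold is an illustrative cutoff, `S_TIME_FORMAT=ISO` is set to pin down sysstat's timestamp format, and the column positions assume a recent sysstat release:

# Print per-core utilization once per second and flag any core whose %sys
# exceeds 90%, keeping the timestamp so spikes can be matched to reset events.
# Field layout with S_TIME_FORMAT=ISO: $1=time, $2=CPU id, $5=%sys.
S_TIME_FORMAT=ISO mpstat -P ALL 1 |
  awk '$2 ~ /^[0-9]+$/ && $5 + 0 > 90 { print $1, "core", $2, "%sys", $5; fflush() }'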
To capture the sporadic nature of the bottleneck, a temporal profiling setup was devised: `perf` ran continuously in short, timed increments on dedicated test machines, with each profile saved under a timestamped filename. This enabled 'time travel' debugging, correlating each network reset event with the CPU profile covering that exact window to identify the offending process, which the team dubbed a "zombie" process.

# Bash program to generate CPU stack snapshots on a machine.
# Record the start time in the filename for 'time traveling' later!
for i in {1..360}
do
  sudo perf record -F 97 -g -a -o perf-$(hostname)-$(date +"%Y%m%d-%H-%M-%S")-120s.data -- sleep 120
done

# Generate perf stacks from each recording.
for datafile in perf-*.data
do
  perf script --header -i "$datafile" > "$datafile.stacks"
done
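Once a reset timestamp has been pulled from the kernel log, the 120-second capture covering it can be located from the timestamp embedded in each filename. The helper below is a sketch of that lookup under the filename convention from the loop above, not Pinterest's actual script; `RESET_TS` is a hypothetical input, and the `perf report` invocation is just one way to summarize the offending stacks:

#!/usr/bin/env bash
# 'Time travel': given the wall-clock time of a network reset, find the perf
# capture whose 120-second window contains it and summarize the hottest stacks.
RESET_TS="2023-08-01 14:07:32"   # hypothetical reset time taken from dmesg
reset_epoch=$(date -d "$RESET_TS" +%s)

for f in perf-*-120s.data
do
  # Filenames look like perf-<host>-YYYYmmdd-HH-MM-SS-120s.data (see the loop above).
  stamp=$(echo "$f" | grep -oE '[0-9]{8}-[0-9]{2}-[0-9]{2}-[0-9]{2}' | tail -1)
  start_epoch=$(date -d "${stamp:0:4}-${stamp:4:2}-${stamp:6:2} ${stamp:9:2}:${stamp:12:2}:${stamp:15:2}" +%s)
  if (( reset_epoch >= start_epoch && reset_epoch < start_epoch + 120 ))
  then
    echo "Reset at $RESET_TS falls inside $f"
    # Show which processes dominated CPU during that window.
    sudo perf report --stdio --sort comm,dso -i "$f" | head -40
  fi
done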