Pinterest Engineering · April 15, 2026

Diagnosing CPU Bottlenecks and Network Driver Resets in Kubernetes on AWS

This article details Pinterest's journey to identify and resolve intermittent network connectivity failures in its Ray-based ML training jobs running on Kubernetes clusters backed by AWS EC2. The investigation traced the failures to CPU starvation of the AWS ENA network driver, which triggered device resets and job crashes, and it highlights systematic debugging, profiling techniques, and the difficulty of diagnosing transient performance bottlenecks in large-scale distributed systems.


The Challenge: Intermittent Network Instability in Ray ML Jobs

Pinterest's ML platform experienced frequent crashes in its Ray-based training jobs, which are critical for scaling ML workloads. The crashes were attributed to intermittent loss of network connectivity within the highly network-active Ray clusters, surfacing as job hangs, object fetch timeouts, and actor deaths. The non-deterministic nature of the failures frustrated initial debugging efforts and forced a deeper dive into the underlying infrastructure.

Initial Symptoms and Hypotheses

  • Symptom 1: ENA Network Driver Resets: System logs correlated job failures with 'ENA device resets', a self-healing mechanism the driver triggers on unexpected device behavior, such as Tx completion timeouts or an unresponsive device. The driver documentation pointed to CPU starvation as a common cause (see the log-scan sketch after this list).
  • Symptom 2: High System CPU Utilization: Impacted machines showed high system CPU usage, consistent with the CPU starvation theory. Initial mitigation attempts (huge pages, jemalloc, and CPU affinity) did not stop the network resets.
  • Symptom 3: Availability Zone Differences: The issue was isolated to a single AWS Availability Zone even though cluster configurations appeared identical across zones, pointing to an environmental difference at the AWS level rather than a problem in the cluster configuration.
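
Kernel logs are the quickest way to confirm the correlation in Symptom 1. Below is a minimal sketch for pulling timestamped ENA reset events out of the logs; the exact message strings vary by ENA driver version, so the grep patterns here are assumptions to be tuned against what your hosts actually emit:

```bash
# Sketch: extract timestamped ENA reset/timeout events from the kernel log
# so they can be lined up against Ray job failure times. The message
# strings vary by ENA driver version; adjust the pattern to what you see.
sudo dmesg -T | grep -iE 'ena.*(reset|tx.*timeout|keep.?alive)'

# On systemd hosts, journalctl retains kernel messages with timestamps:
sudo journalctl -k --since "-24h" | grep -iE 'ena.*(reset|tx.*timeout)'
```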

Why CPU Starvation Matters for Network Drivers

Network drivers rely on CPU time to process packets. If the threads responsible for managing network queues are starved of CPU cycles for too long (e.g., 5 seconds in ENA drivers), the driver can become unresponsive, leading to device resets and packet loss. This is critical in latency-sensitive, high-throughput applications like distributed ML training.
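
A driver-independent way to observe this starvation is the kernel's softnet statistics: the `time_squeeze` counter increments whenever the NAPI softirq exhausts its budget with packets still pending. A rough sketch (field meanings per the kernel's networking docs; `strtonum` requires gawk):

```bash
# Sketch: per-CPU packet-processing counters from /proc/net/softnet_stat.
# Fields are hex, one row per CPU: $1 = packets processed, $2 = dropped,
# $3 = time_squeeze (softirq budget exhausted with work still pending).
# Steadily climbing time_squeeze or dropped counts indicate starvation.
gawk '{ printf "cpu%-3d processed=%d dropped=%d time_squeeze=%d\n",
        NR-1, strtonum("0x" $1), strtonum("0x" $2), strtonum("0x" $3) }' \
    /proc/net/softnet_stat
```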

Advanced Profiling Techniques

Traditional `perf` snapshots initially showed only generic CPU usage, masking the root cause. The team realized that, given the high vCPU count (96 cores on the GPU machines), a single core pegged at 100% system CPU for an extended period could starve critical network threads while barely moving aggregate utilization. They turned to `mpstat` for per-core, per-second utilization monitoring, which revealed sporadic 100% system CPU usage on individual cores that correlated with the network resets, as in the sketch below.
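
A sketch of that per-core sampling (`mpstat` ships with the sysstat package; the field positions in the awk filter are an assumption and can shift across sysstat versions and locales):

```bash
# Sketch: per-core, per-second utilization, logged for later correlation.
# A single core pinned near 100% %sys is the signature described above.
nohup mpstat -P ALL 1 > "mpstat-$(hostname).log" 2>&1 &

# Optional live filter: flag samples where any core's %sys exceeds 95.
# Adjust field numbers ($3 = core, $6 = %sys with an AM/PM time column).
mpstat -P ALL 1 | awk '$3 ~ /^[0-9]+$/ && $6 > 95 { print $1, "core", $3, "%sys", $6 }'
```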

```bash
# Bash program to generate CPU stack snapshots on a machine.
# Record the start time in the filename for 'time traveling' later!
# 360 iterations x 120 s = 12 hours of back-to-back profiling windows.
for i in {1..360}
do
  sudo perf record -F 97 -g -a \
    -o "perf-$(hostname)-$(date +"%Y%m%d-%H-%M-%S")-120s.data" -- sleep 120
done

# Generate perf stacks from each capture.
for datafile in perf-*-120s.data
do
  sudo perf script --header -i "$datafile" > "$datafile.stacks"
done
```

To capture the sporadic nature of the bottleneck, the team devised the temporal profiling setup shown above: `perf` runs in short, timed increments on dedicated test machines, saving each profile with its start time in the filename. This enabled 'time travel' debugging, correlating a specific network reset event with the detailed CPU profile covering that moment, which ultimately identified the offending process, dubbed a "zombie" process.
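
As an illustration of the 'time travel' step, given a reset timestamp taken from `dmesg -T`, a small script can select the capture whose 120-second window contains it (the timestamp below is a made-up placeholder):

```bash
#!/usr/bin/env bash
# Sketch: find the perf capture whose 120 s window covers a given reset time.
# Filenames follow the recording script: perf-<host>-YYYYmmdd-HH-MM-SS-120s.data
RESET="2026-04-15 03:41:10"            # hypothetical reset time from dmesg -T
reset_epoch=$(date -d "$RESET" +%s)

for f in perf-*-120s.data; do
  # Take the last timestamp-shaped token in case the hostname contains digits.
  stamp=$(grep -oE '[0-9]{8}-[0-9]{2}-[0-9]{2}-[0-9]{2}' <<< "$f" | tail -n1)
  start_epoch=$(date -d "${stamp:0:4}-${stamp:4:2}-${stamp:6:2} ${stamp:9:2}:${stamp:12:2}:${stamp:15:2}" +%s)
  if (( reset_epoch >= start_epoch && reset_epoch < start_epoch + 120 )); then
    echo "Reset at $RESET falls inside $f -- inspect $f.stacks"
  fi
done
```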

Tags: Kubernetes, AWS, ENA, Network, CPU Bottleneck, Profiling, Troubleshooting, Machine Learning Infrastructure
