Pinterest engineers diagnosed and resolved intermittent CPU starvation on their Kubernetes-based ML platform, PinCompute. The problem was traced to 'zombie' memory cgroups leaked by a crashlooping, unused agent in the base AMI, which led to network device resets and job failures. The fix was to disable the offending agent, and the investigation highlights how critical deep system-level understanding and robust observability are in large-scale distributed environments.
Pinterest's PinCompute platform, which orchestrates tens of thousands of Ray clusters for machine learning workloads, experienced significant instability. It manifested as intermittent network failures, Elastic Network Adapter (ENA) device resets, dropped packets, and training job success rates falling by more than 25%. The initial challenge was that aggregate CPU utilization appeared healthy, masking the underlying per-core saturation that was causing the problems.
Engineers moved beyond high-level dashboards to per-core analysis with `mpstat`, which revealed individual cores hitting 100% system CPU for seconds at a time. Critically, if a core responsible for handling ENA network interrupts became saturated, the network driver's NAPI poll thread was starved of cycles. This starvation triggered ENA device resets, a self-healing mechanism that paradoxically caused the very network connectivity loss and Ray job crashes under investigation.
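For readers who want to reproduce this kind of per-core view, below is a minimal sketch (an illustration, not Pinterest's tooling) that samples the same kernel counters `mpstat -P ALL` reads from `/proc/stat` on a Linux host and flags cores pinned in system time. The 90% threshold is an arbitrary assumption:

```python
import time

def read_per_core():
    """Parse /proc/stat into {cpu_name: (system_jiffies, total_jiffies)}."""
    stats = {}
    with open("/proc/stat") as f:
        for line in f:
            fields = line.split()
            # Per-core lines look like: cpu0 user nice system idle iowait irq softirq steal ...
            if fields[0].startswith("cpu") and fields[0] != "cpu":
                vals = [int(v) for v in fields[1:]]
                system = vals[2] + vals[5] + vals[6]  # system + irq + softirq
                total = sum(vals[:8])  # exclude guest columns, already folded into user/nice
                stats[fields[0]] = (system, total)
    return stats

def sample(interval=1.0, sys_threshold=90.0):
    """Print any core whose %sys over the interval exceeds the threshold."""
    before = read_per_core()
    time.sleep(interval)
    after = read_per_core()
    for cpu, (sys_b, tot_b) in before.items():
        sys_a, tot_a = after[cpu]
        total_delta = tot_a - tot_b
        if total_delta == 0:
            continue
        sys_pct = 100.0 * (sys_a - sys_b) / total_delta
        if sys_pct >= sys_threshold:
            print(f"{cpu}: {sys_pct:.1f}% system time -- possible kernel/interrupt saturation")

if __name__ == "__main__":
    while True:
        sample()
```

Aggregate dashboards average these numbers across dozens of cores, which is exactly why the saturation stayed invisible at first.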
System Design Lesson
This case highlights that high-level metrics can be misleading in distributed systems. Deeper, per-component or per-core observability is crucial for identifying localized bottlenecks that can have cascading effects across the system. Network interrupt handling is a critical path for performance; its starvation can cripple a node.
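A practical corollary (a hedged sketch, not drawn from the original write-up) is to map which cores actually service the NIC's interrupts, since those are the cores whose saturation matters most. The IRQ name patterns below, `ena` and `eth0`, are assumptions; actual labels vary by driver version and interface name:

```python
def nic_irq_cpus(patterns=("ena", "eth0")):
    """Report per-CPU interrupt counts for IRQs whose description matches the NIC.

    The patterns are assumptions: ENA queue IRQs are typically named after the
    device ("ena-mgmnt@pci:...") or the interface ("eth0-Tx-Rx-0"); adjust for
    the host's actual naming.
    """
    with open("/proc/interrupts") as f:
        cpus = f.readline().split()  # header row: CPU0 CPU1 ...
        for line in f:
            fields = line.split()
            name = " ".join(fields[len(cpus) + 1:])  # trailing IRQ description
            if not any(p in name for p in patterns):
                continue
            counts = [int(c) for c in fields[1:len(cpus) + 1]]
            hot = max(range(len(counts)), key=counts.__getitem__)
            print(f"{fields[0]} {name}: busiest CPU is {cpus[hot]} ({counts[hot]} interrupts)")

if __name__ == "__main__":
    nic_irq_cpus()
```

If a queue's interrupts all land on a core that per-core sampling also shows pinned at 100% system time, the NAPI poll thread on that core is the likely victim, matching the failure mode described above.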
Using rolling `perf` captures visualized with Netflix's Flamescope, the team pinpointed moments of network resets. They observed the `kubelet` process, usually low-CPU, spiking to 6.5% of a core, spending most of its time in the `mem_cgroup_nr_lru_pages` kernel function. The root cause was eventually traced to an unused, default-enabled Amazon ECS agent within their AWS Deep Learning AMI. This agent was crashlooping and, upon each restart, leaking memory cgroups (memcgs). With nearly 70,000 'zombie' memcgs accumulated, the `kubelet` had to traverse this inflated list during cgroup stats synchronization, monopolizing a single core.
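Operators who want to check their own nodes for this class of leak can do so with a small sketch under assumptions (standard cgroup mount paths; this is illustrative, not the team's tooling). On a cgroup v2 host, the root `cgroup.stat` file reports dying descendants directly; on v1, counting memory-controller directories gives a rough lower bound:

```python
import os

def memcg_counts(root="/sys/fs/cgroup"):
    """Estimate the memory-cgroup population; paths assume standard mounts."""
    stat_path = os.path.join(root, "cgroup.stat")
    if os.path.exists(stat_path):
        # cgroup v2: the kernel reports zombie ("dying") cgroups directly.
        with open(stat_path) as f:
            stats = dict(line.split() for line in f)
        print(f"live descendants:  {stats.get('nr_descendants')}")
        print(f"dying descendants: {stats.get('nr_dying_descendants')}")
    else:
        # cgroup v1: count memory-controller directories; zombies pinned only
        # by lingering page references are invisible here, so this undercounts.
        memory_root = os.path.join(root, "memory")
        n = sum(len(dirs) for _, dirs, _ in os.walk(memory_root))
        print(f"memory cgroup directories: {n}")

if __name__ == "__main__":
    memcg_counts()
```

Tens of thousands of dying or stale memcgs on a node running only a handful of containers would be the same signature the team found.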
The fix was simple: disable the ECS agent's `systemd` unit in the base image and reboot affected machines to purge the accumulated cgroups, which restored stability. The experience underscores that layers of abstraction can obscure true root causes, which in this case lived in a redundant userspace daemon leaking kernel state. It also emphasizes the importance of understanding base image configurations and the value of continuous, temporally indexed profiling tools (such as gProfiler or eBPF-based platforms) for proactive observability in production.