Pinterest engineers diagnosed and resolved intermittent CPU starvation on their Kubernetes-based ML platform, PinCompute. The problem was traced to 'zombie' memory cgroups leaked by a crashlooping, unused agent in the base AMI, which led to network device resets and job failures. The fix was to disable the offending agent, and the investigation highlights how critical deep system-level understanding and robust observability are in large-scale distributed environments.
Pinterest's PinCompute platform, which orchestrates tens of thousands of Ray clusters for machine learning workloads, experienced significant instability. It manifested as intermittent network failures, Elastic Network Adapter (ENA) device resets, dropped packets, and training job success rates falling by more than 25%. The initial challenge was that aggregate CPU utilization appeared healthy, masking the underlying per-core saturation that was causing the problems.
Engineers moved beyond high-level dashboards to per-core analysis with `mpstat`, which revealed individual cores hitting 100% system CPU for seconds at a time. Critically, if a core responsible for handling ENA network interrupts became saturated, the network driver's NAPI poll thread was starved of cycles. This starvation triggered ENA device resets, a self-healing mechanism that paradoxically caused the very network connectivity loss and Ray job crashes under investigation.
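For readers who want to reproduce this kind of per-core view, below is a minimal sketch (an illustration, not Pinterest's tooling) that samples the same kernel counters `mpstat -P ALL` reads from `/proc/stat` on a Linux host and flags cores pinned in system time. The 90% threshold is an arbitrary assumption:

```python
import time

def read_per_core():
    """Parse /proc/stat into {cpu_name: (system_jiffies, total_jiffies)}."""
    stats = {}
    with open("/proc/stat") as f:
        for line in f:
            fields = line.split()
            # Per-core lines look like: cpu0 user nice system idle iowait irq softirq steal ...
            if fields[0].startswith("cpu") and fields[0] != "cpu":
                vals = [int(v) for v in fields[1:]]
                system = vals[2] + vals[5] + vals[6]  # system + irq + softirq
                total = sum(vals[:8])  # exclude guest columns, already folded into user/nice
                stats[fields[0]] = (system, total)
    return stats

def sample(interval=1.0, sys_threshold=90.0):
    """Print any core whose %sys over the interval exceeds the threshold."""
    before = read_per_core()
    time.sleep(interval)
    after = read_per_core()
    for cpu, (sys_b, tot_b) in before.items():
        sys_a, tot_a = after[cpu]
        total_delta = tot_a - tot_b
        if total_delta == 0:
            continue
        sys_pct = 100.0 * (sys_a - sys_b) / total_delta
        if sys_pct >= sys_threshold:
            print(f"{cpu}: {sys_pct:.1f}% system time -- possible kernel/interrupt saturation")

if __name__ == "__main__":
    while True:
        sample()
```

Aggregate dashboards average these numbers across dozens of cores, which is exactly why the saturation stayed invisible at first.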
System Design Lesson
This case highlights that high-level metrics can be misleading in distributed systems. Deeper, per-component or per-core observability is crucial for identifying localized bottlenecks that can have cascading effects across the system. Network interrupt handling is a critical path for performance; its starvation can cripple a node.
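A practical corollary (a hedged sketch, not drawn from the original write-up) is to map which cores actually service the NIC's interrupts, since those are the cores whose saturation matters most. The IRQ name patterns below, `ena` and `eth0`, are assumptions; actual labels vary by driver version and interface name:

```python
def nic_irq_cpus(patterns=("ena", "eth0")):
    """Report per-CPU interrupt counts for IRQs whose description matches the NIC.

    The patterns are assumptions: ENA queue IRQs are typically named after the
    device ("ena-mgmnt@pci:...") or the interface ("eth0-Tx-Rx-0"); adjust for
    the host's actual naming.
    """
    with open("/proc/interrupts") as f:
        cpus = f.readline().split()  # header row: CPU0 CPU1 ...
        for line in f:
            fields = line.split()
            name = " ".join(fields[len(cpus) + 1:])  # trailing IRQ description
            if not any(p in name for p in patterns):
                continue
            counts = [int(c) for c in fields[1:len(cpus) + 1]]
            hot = max(range(len(counts)), key=counts.__getitem__)
            print(f"{fields[0]} {name}: busiest CPU is {cpus[hot]} ({counts[hot]} interrupts)")

if __name__ == "__main__":
    nic_irq_cpus()
```

If a queue's interrupts all land on a core that per-core sampling also shows pinned at 100% system time, the NAPI poll thread on that core is the likely victim, matching the failure mode described above.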
Using rolling `perf` captures visualized with Netflix's Flamescope, the team pinpointed moments of network resets. They observed the `kubelet` process, usually low-CPU, spiking to 6.5% of a core, spending most of its time in the `mem_cgroup_nr_lru_pages` kernel function. The root cause was eventually traced to an unused, default-enabled Amazon ECS agent within their AWS Deep Learning AMI. This agent was crashlooping and, upon each restart, leaking memory cgroups (memcgs). With nearly 70,000 'zombie' memcgs accumulated, the `kubelet` had to traverse this inflated list during cgroup stats synchronization, monopolizing a single core.
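Operators who want to check their own nodes for this class of leak can do so with a small sketch under assumptions (standard cgroup mount paths; this is illustrative, not the team's tooling). On a cgroup v2 host, the root `cgroup.stat` file reports dying descendants directly; on v1, counting memory-controller directories gives a rough lower bound:

```python
import os

def memcg_counts(root="/sys/fs/cgroup"):
    """Estimate the memory-cgroup population; paths assume standard mounts."""
    stat_path = os.path.join(root, "cgroup.stat")
    if os.path.exists(stat_path):
        # cgroup v2: the kernel reports zombie ("dying") cgroups directly.
        with open(stat_path) as f:
            stats = dict(line.split() for line in f)
        print(f"live descendants:  {stats.get('nr_descendants')}")
        print(f"dying descendants: {stats.get('nr_dying_descendants')}")
    else:
        # cgroup v1: count memory-controller directories; zombies pinned only
        # by lingering page references are invisible here, so this undercounts.
        memory_root = os.path.join(root, "memory")
        n = sum(len(dirs) for _, dirs, _ in os.walk(memory_root))
        print(f"memory cgroup directories: {n}")

if __name__ == "__main__":
    memcg_counts()
```

Tens of thousands of dying or stale memcgs on a node running only a handful of containers would be the same signature the team found.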
The fix was simple: disable the ECS agent's `systemd` unit in the base image and reboot affected machines to purge the accumulated cgroups, which restored stability. The experience underscores that layers of abstraction can obscure true root causes, which in this case lived in a redundant userspace daemon leaking kernel state. It also emphasizes the importance of understanding base image configurations and the value of continuous, temporally indexed profiling tools (such as gProfiler or eBPF-based platforms) for proactive observability in production.