Pinterest Engineers Eliminate CPU Zombies to Resolve Production Bottlenecks

Pinterest has published a detailed technical account of how its engineers tracked down intermittent CPU starvation that was crashing machine learning training jobs. By identifying what the team termed "zombies" (leaked memory cgroups left behind by a crashlooping default agent), the engineers restored stability to their distributed computing platform.

The issue manifested as intermittent network failures and job crashes on PinCompute, the Kubernetes-based platform where Pinterest runs more than half of its offline machine learning workload. Tens of thousands of Ray clusters are provisioned monthly for these tasks, and some use cases saw training job success rates drop by more than 25% due to Elastic Network Adapter (ENA) device resets and dropped packets. Initial investigations were hampered because aggregate CPU utilisation looked healthy, masking the failures underneath.

Forced off high-level dashboards, the infrastructure team dropped to per-core analysis using mpstat. That investigation revealed individual cores hitting 100% system CPU for seconds at a time. This mattered because a saturated core that also handles ENA network interrupts can starve the driver's NAPI poll thread of cycles. When Tx completions stall for more than five seconds, the ENA driver's self-healing mechanism resets the device, and the resulting loss of connectivity is what crashed the Ray jobs.
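The per-core view can be approximated without mpstat by sampling /proc/stat twice and comparing the counters. The following is a minimal sketch of that idea, not Pinterest's tooling: the function names and the 95% threshold are assumptions chosen for illustration.

```python
from typing import Dict, List, Tuple


def parse_per_core(stat_text: str) -> Dict[str, Tuple[int, int]]:
    """Map each 'cpuN' line of /proc/stat text to (system_jiffies, total_jiffies)."""
    cores: Dict[str, Tuple[int, int]] = {}
    for line in stat_text.splitlines():
        fields = line.split()
        # Skip the aggregate "cpu" line and anything that is not a per-core line.
        if not fields or fields[0] == "cpu" or not fields[0].startswith("cpu"):
            continue
        # Columns: user nice system idle iowait irq softirq steal (guest time is
        # already folded into user, so it is left out of the total).
        values = [int(v) for v in fields[1:9]]
        cores[fields[0]] = (values[2], sum(values))
    return cores


def system_pct(before: Dict[str, Tuple[int, int]],
               after: Dict[str, Tuple[int, int]]) -> Dict[str, float]:
    """Percentage of each core's elapsed time spent in the 'system' column."""
    result: Dict[str, float] = {}
    for core, (sys_after, total_after) in after.items():
        if core not in before:
            continue
        sys_before, total_before = before[core]
        delta_total = total_after - total_before
        if delta_total > 0:
            result[core] = 100.0 * (sys_after - sys_before) / delta_total
    return result


def saturated_cores(before_text: str, after_text: str,
                    threshold: float = 95.0) -> List[str]:
    """Cores whose system time met the (assumed) threshold between two samples."""
    pct = system_pct(parse_per_core(before_text), parse_per_core(after_text))
    hot = [core for core, value in pct.items() if value >= threshold]
    return sorted(hot, key=lambda name: int(name[3:]))
```

In practice the two inputs would be the contents of /proc/stat read about a second apart. An aggregate figure averaged over dozens of cores would hide a single saturated core that this per-core comparison reports.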

To pinpoint the source of this core saturation, the team ran rolling two-minute perf captures over a 12-hour reproduction window. Visualised in Netflix's Flamescope, the captures let the engineers zoom into the exact moments when the network resets fired. They discovered that the kubelet process, which typically consumes less than 1 per cent of CPU, was spiking to approximately 6.5 per cent. Most of this time was spent in the kernel function mem_cgroup_nr_lru_pages.
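A rolling capture of this kind can be sketched as a loop around perf record and perf script. The article does not give Pinterest's exact commands, so everything below is an assumption: the 49 Hz sampling rate and the `perf script --header` export follow the pattern in Flamescope's own documentation, and the file names and helper functions are hypothetical.

```python
import subprocess
from pathlib import Path
from typing import List


def record_cmd(out: Path, seconds: int = 120, freq_hz: int = 49) -> List[str]:
    """System-wide, call-graph sampling for `seconds` seconds into `out`."""
    return ["perf", "record", "-F", str(freq_hz), "-a", "-g",
            "-o", str(out), "--", "sleep", str(seconds)]


def script_cmd(data: Path) -> List[str]:
    """Convert a perf.data file to the text profile format Flamescope reads."""
    return ["perf", "script", "--header", "-i", str(data)]


def captures_in_window(window_hours: float, seconds: int = 120) -> int:
    """How many back-to-back captures of `seconds` fit in the window."""
    return int(window_hours * 3600 // seconds)


def run_rolling(out_dir: Path, window_hours: float = 12,
                seconds: int = 120) -> None:
    """Record one capture after another, keeping only the text profiles."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for index in range(captures_in_window(window_hours, seconds)):
        data = out_dir / f"perf.{index:04d}.data"
        subprocess.run(record_cmd(data, seconds), check=True)
        with open(out_dir / f"stacks.{index:04d}", "w") as handle:
            subprocess.run(script_cmd(data), stdout=handle, check=True)
        data.unlink()  # drop the binary capture once it has been exported
```

A 12-hour window of two-minute captures comes to 360 files. Because each export runs before the next recording starts, this sketch leaves a short gap between captures. Keeping each capture short also means the file covering a given reset is small enough to open and inspect on its own.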

The investigation eventually traced the problem to the AWS Deep Learning AMI used for their nodes. This base image included an Amazon ECS agent that was enabled by default but unused by Pinterest. The agent was crashlooping and leaking memory cgroups (memcgs) on every restart. With nearly 70,000 "zombie" memcgs accumulated against only 240 in active use, the kubelet had to walk this inflated list on every cgroup stats sync, monopolising a single core for seconds at a time.
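One way to spot this kind of leak is to compare the number of memory cgroups the kernel is tracking with the number still visible in the cgroup filesystem. The sketch below assumes a kernel whose /proc/cgroups `num_cgroups` column counts cgroups that have been removed but not yet freed, and a cgroup v1 memory hierarchy mounted at a path such as /sys/fs/cgroup/memory. Neither detail is confirmed by the article, and the function names are hypothetical.

```python
import os


def memory_cgroup_count(proc_cgroups_text: str) -> int:
    """Read num_cgroups for the memory controller from /proc/cgroups text."""
    for line in proc_cgroups_text.splitlines():
        if line.startswith("#"):
            continue  # header: subsys_name, hierarchy, num_cgroups, enabled
        fields = line.split()
        if len(fields) >= 4 and fields[0] == "memory":
            return int(fields[2])
    raise ValueError("no memory controller line found")


def count_visible_cgroups(root: str) -> int:
    """Count directories under a cgroup mount, including the root itself."""
    return sum(1 for _ in os.walk(root))


def zombie_estimate(kernel_count: int, visible_count: int) -> int:
    """Cgroups the kernel still tracks that no longer have a directory."""
    return max(kernel_count - visible_count, 0)
```

On a node, the inputs would be the contents of /proc/cgroups and the memory cgroup mount point. With the article's approximate figures, a kernel count near 70,240 against 240 visible cgroups gives an estimate of about 70,000 leaked memcgs. A large and growing gap between the two counts is the signal to look for.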

The resolution was relatively simple but required a deep understanding of the system stack. Pinterest resolved the bottleneck by disabling the ECS agent systemd unit in their base image and rebooting affected machines to purge the accumulated cgroups. Since this change, memory cgroup counts have remained stable, and the network resets have ceased. This experience underscores that abstractions between the application, the orchestrator, and the kernel can often obscure the true root cause: in this case, a redundant userspace daemon leaking kernel state.

While Pinterest used manual profiling to solve this instance, the team noted the value of continuous, temporally indexed profiling for production observability. Tools such as gProfiler, which Pinterest is currently rolling out in collaboration with Intel, and eBPF-based platforms like Parca and Grafana Pyroscope, provide the fleet-wide visibility that can shorten the path from symptom to root cause. These tools allow engineers to identify problematic patterns in real time rather than relying on manual captures after a failure occurs.

By sharing its findings, the Pinterest engineering team highlights that performance at scale is often dictated by the default configurations of a base image as much as by application code. The account is a reminder for software engineers to treat system defaults with scepticism and to stay fluent in low-level diagnostic tools.
