This article details a Cloudflare case study where slow Kubernetes StatefulSet restarts, caused by recursive file ownership changes on large Persistent Volumes, led to significant engineering downtime. It explores the debugging process to identify the root cause, a default Kubernetes `fsGroupChangePolicy`, and the simple one-line fix that dramatically reduced restart times and improved operational efficiency.
Cloudflare encountered a recurring issue where restarting their Atlantis Kubernetes StatefulSet, responsible for managing Terraform changes, took approximately 30 minutes. With around 100 restarts per month for credential rotations and onboarding, this amounted to over 50 hours of blocked engineering time monthly. The problem stemmed from a Kubernetes default behavior interacting inefficiently with a PersistentVolume containing millions of files.
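As a quick sanity check on those figures, the arithmetic can be sketched as below. The constants are the rounded numbers quoted in the text, so the result is an estimate rather than a measured total, and the function name is only illustrative.

```python
# Back-of-the-envelope estimate using the rounded figures from the text.
RESTARTS_PER_MONTH = 100   # "around 100 restarts per month"
MINUTES_PER_RESTART = 30   # "approximately 30 minutes" per restart


def blocked_hours_per_month(restarts: int, minutes_each: float) -> float:
    """Total time spent waiting on restarts in a month, in hours."""
    return restarts * minutes_each / 60


monthly = blocked_hours_per_month(RESTARTS_PER_MONTH, MINUTES_PER_RESTART)
print(f"{monthly:.0f} hours/month, {monthly * 12:.0f} hours/year")
# prints "50 hours/month, 600 hours/year"
```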
Initial investigations using `kubectl events` provided limited insight, only showing the pod waiting for an init container. To uncover the true bottleneck, the team analyzed `kubelet` logs on the affected node. This revealed a significant delay between the Persistent Volume being mounted and the pod actually starting, accompanied by `Error syncing pod` messages related to unmounted volumes and context deadlines.
Debugging Kubernetes Deep Dives
When Kubernetes events and basic pod descriptions don't reveal the problem, checking the `kubelet` logs on the node where the pod is scheduled can provide crucial low-level insights into volume mounting, container runtime issues, and other host-level interactions.
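One way to look for this kind of gap is to filter kubelet output for the volume name and measure the time between the first and last matching line. The sketch below assumes the standard klog text header (severity letter, `mmdd`, then `hh:mm:ss.uuuuuu`). The sample lines, the volume name `pvc-example`, and the helper `volume_log_span` are hypothetical placeholders, not real kubelet output.

```python
import re
from datetime import datetime

# klog text header: severity letter, mmdd, then hh:mm:ss.uuuuuu
KLOG_TS = re.compile(r"[IWEF](\d{4}) (\d{2}:\d{2}:\d{2}\.\d{6})")


def volume_log_span(lines, volume_name):
    """Return (matching lines, seconds between first and last timestamped match)."""
    matches = [ln for ln in lines if volume_name in ln]
    stamps = []
    for ln in matches:
        m = KLOG_TS.search(ln)
        if m:
            # klog omits the year, so a fixed one is assumed here;
            # spans that cross New Year would come out wrong.
            stamps.append(
                datetime.strptime(
                    "2000" + m.group(1) + " " + m.group(2),
                    "%Y%m%d %H:%M:%S.%f",
                )
            )
    span = (stamps[-1] - stamps[0]).total_seconds() if len(stamps) >= 2 else 0.0
    return matches, span


# Hypothetical sample lines that only mimic the klog header layout.
sample = [
    "I0101 10:00:00.000000 1 example.go:1] hypothetical: volume pvc-example mounted",
    "I0101 10:05:00.000000 1 example.go:2] hypothetical: unrelated message",
    "I0101 10:30:00.000000 1 example.go:3] hypothetical: pod using pvc-example started",
]
matches, span = volume_log_span(sample, "pvc-example")
print(len(matches), span / 60)  # 2 matching lines, 30.0 minutes apart
```

On a node where the kubelet runs as a systemd unit (an assumption, since setups vary), the `lines` argument could be fed from the output of `journalctl -u kubelet`.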
Further log analysis, specifically filtering for the Persistent Volume name, exposed a critical log message: `Setting volume ownership for ... and fsGroup set. If the volume has a lot of files then setting volume ownership could be slow`. This immediately pointed to `fsGroupChangePolicy` as the issue. The default `fsGroupChangePolicy: Always` recursively changes the group ownership for every file and directory on the mounted volume to match the `fsGroup` specified in the pod's `securityContext`.
The fix was to change `fsGroupChangePolicy` from its default `Always` to `OnRootMismatch` in the pod's `securityContext`. With this setting, available by default since Kubernetes v1.20 and stable since v1.23, the kubelet changes ownership and permissions only when the root directory of the volume does not already match the expected `fsGroup`, which avoids a recursive traversal of millions of files. This one-line change cut Atlantis restart times from about 30 minutes to about 30 seconds, saving Cloudflare roughly 600 engineering hours annually.
spec:
  template:
    spec:
      securityContext:
        fsGroupChangePolicy: OnRootMismatch
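To make the difference between the two policies concrete, here is a simplified model of which paths would have their group ownership rewritten at mount time. It is a sketch, not the kubelet's implementation: the helper `paths_to_chown` is hypothetical, it compares only the group ID of the root directory (the real check also looks at permission bits), and it counts paths instead of calling `chown`.

```python
import os
import tempfile


def paths_to_chown(root: str, fs_group: int, policy: str = "Always") -> list:
    """Return the paths whose group ownership would be rewritten.

    Simplified model: only the root directory's gid is compared.
    """
    if policy not in ("Always", "OnRootMismatch"):
        raise ValueError(f"unknown policy: {policy}")
    if policy == "OnRootMismatch" and os.stat(root).st_gid == fs_group:
        return []  # root already matches, so the recursive walk is skipped
    paths = [root]
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            paths.append(os.path.join(dirpath, name))
    return paths


with tempfile.TemporaryDirectory() as volume:
    # Stand-in for a volume with many files (1,000 here, millions in the case study).
    for i in range(1000):
        open(os.path.join(volume, f"file-{i}"), "w").close()
    current_gid = os.stat(volume).st_gid
    always = len(paths_to_chown(volume, current_gid, "Always"))
    on_mismatch = len(paths_to_chown(volume, current_gid, "OnRootMismatch"))
    print(always, on_mismatch)  # 1001 0
```

With `Always`, the work grows with the number of files on the volume even when nothing needs to change. With `OnRootMismatch`, a volume whose root already has the right group costs a single `stat` call.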