InfoQ Cloud·June 3, 2026

Avoiding Spark OOM Failures on Kubernetes: A Migration Case Study

This article details a post-migration incident where Spark jobs on Azure Kubernetes Service (AKS) experienced repeated Out-Of-Memory (OOM) failures. It highlights how two infrastructure misconfigurations — RAM-backed local scratch directories and forced executor co-location — interacted under production load to exhaust node memory, offering crucial lessons in validating infrastructure behavior during cloud migrations.

Read original on InfoQ Cloud

This case study examines a common pitfall in cloud migrations: the assumption that lift-and-shift will preserve runtime behavior. In this instance, a Spark batch pipeline, stable for years on-premises, failed repeatedly after migrating to Azure Kubernetes Service (AKS). The root cause wasn't Spark application tuning, but rather subtle changes in infrastructure configuration that fundamentally altered resource handling.

Key Misconfigurations and Their Impact

  • RAM-backed Local Scratch Directories (`spark.kubernetes.local.dirs.tmpfs=true`): This setting caused Spark to use `tmpfs` (memory-backed filesystem) for shuffle spill instead of disk. For shuffle-intensive jobs, this led to rapid consumption of node RAM.
  • Hard Pod Affinity Rule (`podAffinity` with `requiredDuringSchedulingIgnoredDuringExecution`): Instead of distributing Spark executors across multiple nodes, a misconfigured `podAffinity` rule forced all executors onto a single Kubernetes node. This concentrated all shuffle-time memory pressure and I/O on one machine.
  • Insufficient Volume Limits: The RAM-backed scratch volumes (`tmp-volume`, `workdir`) were sized at only 1Gi, which was far too small for the actual shuffle data generated by the multi-pass processing job.
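The three settings above can be checked statically before a job is submitted. The following is a minimal sketch, not the article's tooling: the function name `find_risky_settings` is hypothetical, the pod spec is assumed to be an executor pod template already parsed into a dictionary, and the scratch volumes are assumed to be `emptyDir` volumes with `medium: Memory`, which the source does not state explicitly.

```python
from typing import Any, Dict, List


def find_risky_settings(spark_conf: Dict[str, str], pod_spec: Dict[str, Any]) -> List[str]:
    """Return warnings for the settings discussed above (hypothetical helper)."""
    warnings: List[str] = []

    # 1. RAM-backed scratch space requested through the Spark property.
    if spark_conf.get("spark.kubernetes.local.dirs.tmpfs", "false").lower() == "true":
        warnings.append("local dirs are tmpfs: shuffle spill consumes RAM")

    # 2. Hard pod affinity that can pin every executor to the same node.
    pod_affinity = (pod_spec.get("affinity") or {}).get("podAffinity") or {}
    if pod_affinity.get("requiredDuringSchedulingIgnoredDuringExecution"):
        warnings.append("required podAffinity: executors may co-locate on one node")

    # 3. Memory-medium emptyDir volumes; report the size limit for review.
    for volume in pod_spec.get("volumes") or []:
        empty_dir = volume.get("emptyDir") or {}
        if empty_dir.get("medium") == "Memory":
            warnings.append(
                f"volume {volume.get('name')} is RAM-backed "
                f"(sizeLimit={empty_dir.get('sizeLimit')})"
            )
    return warnings


# Example input modeled on the incident; the exact manifests are assumptions.
example_conf = {"spark.kubernetes.local.dirs.tmpfs": "true"}
example_pod_spec = {
    "affinity": {
        "podAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": [
                {"topologyKey": "kubernetes.io/hostname"}
            ]
        }
    },
    "volumes": [
        {"name": "tmp-volume", "emptyDir": {"medium": "Memory", "sizeLimit": "1Gi"}},
        {"name": "workdir", "emptyDir": {"medium": "Memory", "sizeLimit": "1Gi"}},
    ],
}

example_warnings = find_risky_settings(example_conf, example_pod_spec)
for warning in example_warnings:
    print(warning)
```

A check like this only reports each setting in isolation; deciding whether the combination is dangerous still depends on the job's shuffle volume and the node size.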

Each setting was survivable on its own; together they were not. Every executor spilled its shuffle data into the RAM of a *single* node, which quickly exceeded that node's capacity and triggered Kubernetes OOM kills.
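A rough back-of-the-envelope model shows why the interaction matters more than either setting alone. This is a simplified sketch under stated assumptions: all numbers are illustrative and not taken from the incident, the function name is hypothetical, and the model ignores cgroup limits, page cache, and system daemons.

```python
def node_memory_demand_gib(
    executors_on_node: int,
    executor_memory_gib: float,
    overhead_gib: float,
    spill_in_ram_gib: float,
) -> float:
    """Worst-case RAM touched on one node by the executors placed on it."""
    per_executor = executor_memory_gib + overhead_gib + spill_in_ram_gib
    return executors_on_node * per_executor


NODE_RAM_GIB = 64.0   # hypothetical node size
EXECUTORS = 8         # hypothetical executor count
HEAP_GIB = 6.0        # hypothetical executor memory
OVERHEAD_GIB = 1.0    # hypothetical memory overhead
SPILL_GIB = 4.0       # hypothetical shuffle spill per executor

# All executors on one node, spill held in tmpfs (the failing combination).
colocated_tmpfs = node_memory_demand_gib(EXECUTORS, HEAP_GIB, OVERHEAD_GIB, SPILL_GIB)

# All executors on one node, spill written to disk instead of RAM.
colocated_disk = node_memory_demand_gib(EXECUTORS, HEAP_GIB, OVERHEAD_GIB, 0.0)

# Executors spread evenly over four nodes, spill still in tmpfs.
spread_tmpfs = node_memory_demand_gib(EXECUTORS // 4, HEAP_GIB, OVERHEAD_GIB, SPILL_GIB)

for label, demand in [
    ("co-located + tmpfs", colocated_tmpfs),
    ("co-located + disk", colocated_disk),
    ("spread + tmpfs", spread_tmpfs),
]:
    status = "exceeds" if demand > NODE_RAM_GIB else "fits within"
    print(f"{label}: {demand:.0f} GiB {status} {NODE_RAM_GIB:.0f} GiB node")
```

With these illustrative numbers, only the case that combines co-location and RAM-backed spill exceeds the node, which mirrors the compound failure described above.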

Resolution and Lessons Learned

Parameter | Before (broken) | After (fixed)

System Design Takeaways

Cloud migrations are not just about lifting and shifting; they require explicit validation of underlying infrastructure contracts. Storage semantics, scheduling behavior, and resource isolation models can differ significantly between environments. Thoroughly test under production-scale load to uncover compound misconfigurations that manifest only at scale, and consider potential interactions between seemingly isolated settings.
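One way to validate scheduling behavior during a load test is to check where executors actually landed. The sketch below is an assumption about how such a check might look, not the article's method: it parses JSON in the shape produced by `kubectl get pods -o json`, relies on the `spark-role=executor` label that Spark applies to executor pods, and uses hypothetical function names and a hypothetical 50% concentration threshold.

```python
import json
from collections import Counter


def executors_per_node(pod_list_json: str) -> Counter:
    """Count Spark executor pods per node from a kubectl-style pod list."""
    counts: Counter = Counter()
    for pod in json.loads(pod_list_json).get("items", []):
        labels = pod.get("metadata", {}).get("labels", {})
        if labels.get("spark-role") != "executor":
            continue  # skip drivers and unrelated pods
        node = pod.get("spec", {}).get("nodeName")
        if node:  # pods not yet scheduled have no nodeName
            counts[node] += 1
    return counts


def is_concentrated(counts: Counter, max_share: float = 0.5) -> bool:
    """True if one node hosts more than max_share of all executors."""
    total = sum(counts.values())
    if total < 2:
        return False
    return max(counts.values()) / total > max_share


# Hypothetical sample resembling the incident: every executor on one node.
sample = json.dumps({
    "items": [
        {"metadata": {"name": "exec-1", "labels": {"spark-role": "executor"}},
         "spec": {"nodeName": "node-a"}},
        {"metadata": {"name": "exec-2", "labels": {"spark-role": "executor"}},
         "spec": {"nodeName": "node-a"}},
        {"metadata": {"name": "exec-3", "labels": {"spark-role": "executor"}},
         "spec": {"nodeName": "node-a"}},
        {"metadata": {"name": "driver", "labels": {"spark-role": "driver"}},
         "spec": {"nodeName": "node-b"}},
    ]
})

placement = executors_per_node(sample)
print(dict(placement), "concentrated:", is_concentrated(placement))
```

Running a check like this against a production-scale test run, rather than a small smoke test, is what would surface a placement problem that only matters once shuffle volume is large.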

Tags: Spark, Kubernetes, OOM, Cloud Migration, Troubleshooting, Performance Tuning, Distributed Computing, Azure AKS
