This article details a case study where migrating Spark batch pipelines to Azure Kubernetes Service (AKS) led to repeated Out-Of-Memory (OOM) failures. The root cause was identified as a combination of infrastructure misconfigurations: RAM-backed local scratch directories and forced executor co-location on a single node. It emphasizes the importance of validating infrastructure contracts during cloud migrations to avoid subtle, load-dependent issues.
Read original on InfoQ Architecture
Migrating existing workloads to a new infrastructure, especially from on-premises to cloud-native platforms like Kubernetes, often surfaces unexpected interactions between application configurations and the underlying environment. This case study highlights two subtle but critical infrastructure misconfigurations that caused severe Spark OOM failures after a lift-and-shift migration to Azure Kubernetes Service (AKS).
After migrating a shuffle-intensive Spark batch job to AKS, executors began failing with OOM errors during shuffle stages. Initial diagnostics focused on Spark-level memory tuning (increasing `spark.executor.memory`, adjusting executor counts), but these proved ineffective. The job had been stable for years on-premises, suggesting an infrastructure-level change was at play rather than a Spark application bug.
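When the evidence points at the environment rather than the job, one check worth making is whether the executor's scratch directory is RAM-backed, which is one of the two root causes identified in this case. The following is a minimal sketch, not the procedure used in the case study: it parses the text of `/proc/mounts` (Linux) and reports the filesystem type of the mount containing a given path. The function names, the sample mount table, and the paths in it are hypothetical.

```python
import os


def filesystem_type(path, mounts_text):
    """Return (mount_point, fs_type) for the mount that contains `path`.

    `mounts_text` is the content of /proc/mounts: one mount per line with
    whitespace-separated fields (device, mount point, fs type, options, ...).
    Returns None if no mount matches.
    """
    path = os.path.normpath(path)
    best = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        mount_point, fs_type = fields[1], fields[2]
        if mount_point == "/":
            contains = True
        else:
            # Match the mount point itself or anything below it,
            # but not a sibling that merely shares the prefix.
            prefix = mount_point.rstrip("/") + "/"
            contains = path == mount_point or path.startswith(prefix)
        # Longest mount point wins; on ties the later (over-mounted) entry wins.
        if contains and (best is None or len(mount_point) >= len(best[0])):
            best = (mount_point, fs_type)
    return best


def is_ram_backed(path, mounts_text):
    """True if `path` lives on a tmpfs or ramfs mount."""
    found = filesystem_type(path, mounts_text)
    return found is not None and found[1] in ("tmpfs", "ramfs")


# Hypothetical mount table for illustration only.
SAMPLE_MOUNTS = """\
overlay / overlay rw,relatime 0 0
tmpfs /var/data/spark-local tmpfs rw,relatime 0 0
/dev/sdb1 /mnt/scratch ext4 rw,relatime 0 0
"""
```

Inside an executor pod, the same functions could be fed the real table with `open("/proc/mounts").read()` and the directory that Spark actually uses for local scratch space.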
Compounding Effects
Individually, each misconfiguration might have been manageable. Shuffle spill to RAM could be mitigated by distributing executors across nodes. Executor co-location might be tolerable with disk-backed spill. However, their combination created a catastrophic scenario: all shuffle memory pressure, backed by RAM, concentrated on a single node, leading to rapid node memory exhaustion and OOM kills by the kernel.
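The compounding effect can be illustrated with a rough back-of-the-envelope model. This is a sketch under simplifying assumptions, not a measurement from the case study: it ignores memory overhead, page cache, and other pods on the node, and every number below (8 executors, 4 GiB heap, 6 GiB of spill per executor, 4 nodes with 64 GiB each) is hypothetical.

```python
def executors_on_busiest_node(total_executors, nodes_used):
    """Executors on the most loaded node when spread evenly (ceiling division)."""
    return -(-total_executors // nodes_used)


def peak_node_memory_gib(executors_on_node, executor_memory_gib,
                         spill_per_executor_gib, scratch_is_ram_backed):
    """Rough peak RAM demand on one node during a shuffle stage."""
    heap = executors_on_node * executor_memory_gib
    # Spill written to a RAM-backed scratch directory consumes node memory;
    # spill written to disk does not (in this simplified model).
    spill_in_ram = (executors_on_node * spill_per_executor_gib
                    if scratch_is_ram_backed else 0)
    return heap + spill_in_ram


# Hypothetical workload and cluster, for illustration only.
EXECUTORS, HEAP_GIB, SPILL_GIB, NODES, NODE_RAM_GIB = 8, 4, 6, 4, 64

scenarios = {
    "co-located + RAM-backed spill": peak_node_memory_gib(
        executors_on_busiest_node(EXECUTORS, 1), HEAP_GIB, SPILL_GIB, True),
    "spread + RAM-backed spill": peak_node_memory_gib(
        executors_on_busiest_node(EXECUTORS, NODES), HEAP_GIB, SPILL_GIB, True),
    "co-located + disk-backed spill": peak_node_memory_gib(
        executors_on_busiest_node(EXECUTORS, 1), HEAP_GIB, SPILL_GIB, False),
    "spread + disk-backed spill": peak_node_memory_gib(
        executors_on_busiest_node(EXECUTORS, NODES), HEAP_GIB, SPILL_GIB, False),
}

for name, demand in scenarios.items():
    status = "exceeds" if demand > NODE_RAM_GIB else "fits within"
    print(f"{name}: {demand} GiB {status} {NODE_RAM_GIB} GiB node RAM")
```

With these made-up numbers, only the combination of co-location and RAM-backed spill pushes demand past the node's capacity; removing either factor brings it back under the limit, which mirrors the argument above.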
The fix involved three changes: setting `spark.kubernetes.local.dirs.tmpfs=false` so that local scratch directories use disk-backed storage, increasing the `tmp-volume` and `workdir` sizes to 10Gi, and replacing the hard `podAffinity` rule with a softer `podAntiAffinity` rule (`preferredDuringSchedulingIgnoredDuringExecution`) that encourages executors to spread across nodes. These changes immediately resolved the OOM failures, highlighting critical lessons for cloud migrations:
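The changes described above might be expressed roughly as follows. This is a hedged sketch rather than the configuration from the case study: the source names the Spark setting, the two volumes, the 10Gi size, and the soft anti-affinity rule, but the `emptyDir` volume type, the `spark-role: executor` label selector, the `kubernetes.io/hostname` topology key, and the weight of 100 are assumptions made here for illustration. The function name `executor_pod_template` is hypothetical.

```python
import json

# Spark setting named in the case study: keep local scratch dirs off tmpfs.
spark_conf = {"spark.kubernetes.local.dirs.tmpfs": "false"}


def executor_pod_template(size_limit="10Gi", weight=100):
    """Build an executor pod template as a plain dict.

    Assumptions (not confirmed by the source): both volumes are emptyDir
    volumes, and executors are selected by the `spark-role: executor` label.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "spec": {
            "affinity": {
                "podAntiAffinity": {
                    # Soft rule: the scheduler prefers, but does not require,
                    # placing executors on different nodes.
                    "preferredDuringSchedulingIgnoredDuringExecution": [
                        {
                            "weight": weight,  # valid range is 1-100
                            "podAffinityTerm": {
                                "labelSelector": {
                                    "matchLabels": {"spark-role": "executor"}
                                },
                                "topologyKey": "kubernetes.io/hostname",
                            },
                        }
                    ]
                }
            },
            "volumes": [
                # No "medium: Memory", so these are disk-backed emptyDirs.
                {"name": "tmp-volume", "emptyDir": {"sizeLimit": size_limit}},
                {"name": "workdir", "emptyDir": {"sizeLimit": size_limit}},
            ],
        },
    }


# JSON is also valid YAML, so this text could be saved as a pod template file.
template_text = json.dumps(executor_pod_template(), indent=2)
```

Because the anti-affinity rule is a preference rather than a requirement, executors can still be scheduled together when the cluster has no other capacity, so disk-backed scratch space remains the primary safeguard.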