InfoQ Architecture · June 3, 2026

Preventing Spark OOM Failures on Kubernetes: Lessons from Cloud Migration Misconfigurations

This article details a case study in which migrating Spark batch pipelines to Azure Kubernetes Service (AKS) led to repeated out-of-memory (OOM) failures. The root cause was a combination of two infrastructure misconfigurations: RAM-backed local scratch directories and forced executor co-location on a single node. The case underscores the importance of validating infrastructure contracts during cloud migrations to avoid subtle, load-dependent issues.

Read original on InfoQ Architecture

Migrating existing workloads to new infrastructure, especially from on-premises systems to cloud-native platforms like Kubernetes, often surfaces unexpected interactions between application configurations and the underlying environment. This case study highlights two subtle but critical infrastructure misconfigurations that together caused severe Spark OOM failures after a lift-and-shift migration to Azure Kubernetes Service (AKS).

The Problem: Persistent Spark Executor OOMs

After migrating a shuffle-intensive Spark batch job to AKS, executors began failing with OOM errors during shuffle stages. Initial diagnostics focused on Spark-level memory tuning (increasing `spark.executor.memory`, adjusting executor counts), but these proved ineffective. The job had been stable for years on-premises, suggesting an infrastructure-level change was at play rather than a Spark application bug.

Key Misconfigurations Identified

  1. RAM-backed local scratch directories (`spark.kubernetes.local.dirs.tmpfs=true`): This setting made Spark back its local directories with `tmpfs` (a memory-backed filesystem), so shuffle spill was written to RAM rather than to node-local disk. Combined with a small `emptyDir` `sizeLimit` (1Gi), this quickly exhausted node RAM during shuffle-heavy operations.
  2. Hard `podAffinity` rule forcing executor co-location: A `podAffinity` rule of type `requiredDuringSchedulingIgnoredDuringExecution` unintentionally constrained all Spark executor pods to the same Kubernetes node. This concentrated all memory and I/O pressure, including the RAM-backed shuffle spill, onto a single machine.
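
Both patterns are visible in a pod specification, so they can be flagged before a job runs. The following is a minimal sketch, not the tooling used in the case study: the function name `find_risky_settings` and the sample spec are hypothetical, and it assumes the pod spec has already been loaded into a plain Python dictionary that follows the Kubernetes field names.

```python
def find_risky_settings(pod_spec: dict) -> list[str]:
    """Return warnings for the two patterns described above (illustrative only)."""
    warnings = []
    # An emptyDir volume with medium "Memory" is a tmpfs mount backed by RAM.
    for vol in pod_spec.get("volumes", []):
        empty_dir = vol.get("emptyDir")
        if empty_dir is not None and empty_dir.get("medium") == "Memory":
            warnings.append(f"volume {vol.get('name')!r} is RAM-backed (emptyDir medium=Memory)")
    # A required podAffinity term is a hard scheduling constraint.
    pod_affinity = pod_spec.get("affinity", {}).get("podAffinity", {})
    if pod_affinity.get("requiredDuringSchedulingIgnoredDuringExecution"):
        warnings.append("hard podAffinity rule: pods may be forced onto the same node")
    return warnings


# Hypothetical executor pod spec that exhibits both problems.
risky_spec = {
    "volumes": [{"name": "spark-local-dir-1", "emptyDir": {"medium": "Memory", "sizeLimit": "1Gi"}}],
    "affinity": {
        "podAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": [
                {"topologyKey": "kubernetes.io/hostname"}
            ]
        }
    },
}

for warning in find_risky_settings(risky_spec):
    print(warning)
```

A check like this only catches settings that appear in the rendered pod spec; it says nothing about whether the node has enough memory for the workload.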

Compounding Effects

Individually, each misconfiguration might have been manageable. Shuffle spill to RAM could be mitigated by distributing executors across nodes. Executor co-location might be tolerable with disk-backed spill. However, their combination created a catastrophic scenario: all shuffle memory pressure, backed by RAM, concentrated on a single node, leading to rapid node memory exhaustion and OOM kills by the kernel.
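
The compounding effect can be illustrated with rough arithmetic. The sketch below is a deliberately simplified model, and every number in it (8 executors, 4 GiB of heap each, 6 GiB of shuffle spill each, 4 available nodes) is a hypothetical value chosen for illustration, not a figure from the case study. It ignores memory overhead and page cache.

```python
def peak_node_ram_gib(executors, heap_gib, spill_gib, nodes_used, tmpfs):
    """Rough RAM demand on the busiest node, in GiB (simplified model)."""
    # Executors on the busiest node, assuming an even spread (ceiling division).
    per_node = -(-executors // nodes_used)
    # Spill files consume RAM only when the scratch directory is tmpfs.
    ram_per_executor = heap_gib + (spill_gib if tmpfs else 0)
    return per_node * ram_per_executor


# Hypothetical workload: 8 executors, 4 GiB heap and 6 GiB spill each.
for tmpfs in (True, False):
    for nodes_used in (1, 4):
        demand = peak_node_ram_gib(8, 4, 6, nodes_used, tmpfs)
        print(f"tmpfs={tmpfs}, nodes_used={nodes_used}: {demand} GiB on the busiest node")
```

Under these assumed numbers, either fix alone cuts the busiest node's demand substantially, while the combination of RAM-backed spill and a single node produces by far the highest figure.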

Resolution and Architectural Lessons

The fix involved setting `spark.kubernetes.local.dirs.tmpfs=false` (to use disk-backed storage), raising the size limits of the `tmp-volume` and `workdir` volumes to 10Gi, and replacing the hard `podAffinity` rule with a soft `podAntiAffinity` rule (`preferredDuringSchedulingIgnoredDuringExecution`) to encourage executor distribution. These changes immediately resolved the OOM failures and point to several lessons for cloud migrations:
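
As a sketch of what these changes might look like, the snippet below builds the Spark setting and an executor pod template as Python dictionaries and prints the template as JSON. Several details are assumptions rather than facts from the case study: that `tmp-volume` and `workdir` are `emptyDir` volumes, the weight of 100, the `spark-role: executor` label selector, and the `kubernetes.io/hostname` topology key. The helper name `executor_pod_template` is hypothetical.

```python
import json

# Disk-backed local directories instead of tmpfs.
spark_conf = {"spark.kubernetes.local.dirs.tmpfs": "false"}


def executor_pod_template(size_limit="10Gi"):
    """Build an illustrative executor pod template (field values are assumptions)."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "spec": {
            "affinity": {
                "podAntiAffinity": {
                    # Soft rule: the scheduler prefers, but does not require,
                    # placing executors on different nodes.
                    "preferredDuringSchedulingIgnoredDuringExecution": [
                        {
                            "weight": 100,  # assumed weight
                            "podAffinityTerm": {
                                # Assumed selector for executor pods.
                                "labelSelector": {"matchLabels": {"spark-role": "executor"}},
                                "topologyKey": "kubernetes.io/hostname",
                            },
                        }
                    ]
                }
            },
            "volumes": [
                # No "medium" field, so these emptyDir volumes use node storage.
                {"name": "tmp-volume", "emptyDir": {"sizeLimit": size_limit}},
                {"name": "workdir", "emptyDir": {"sizeLimit": size_limit}},
            ],
        },
    }


template = executor_pod_template()
print(json.dumps(template, indent=2))
```

Because the anti-affinity rule is a preference, the scheduler can still place several executors on one node when no alternative exists, which keeps jobs schedulable on small clusters.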

  • Explicitly validate infrastructure contracts: Assumptions about storage semantics and scheduling behavior from on-premises environments may not hold in cloud-native platforms. Thoroughly review and validate configuration changes.
  • Test under production-scale load: Subtle infrastructure misconfigurations often only manifest under significant load and realistic data profiles.
  • Monitor at the node level: Node-level memory utilization, including memory consumed by RAM-backed ephemeral storage such as `tmpfs`, can be a more accurate indicator of resource exhaustion than application-level metrics, particularly when diagnosing OOMKilled events.
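
One node-level signal for the last point is the `Shmem` field of `/proc/meminfo` on Linux, which includes memory held by `tmpfs` files. The following is a minimal sketch under that assumption; the function name `shmem_fraction` and the sample values are hypothetical, and a real deployment would more likely rely on an existing node metrics exporter.

```python
def shmem_fraction(meminfo_text: str) -> float:
    """Fraction of total RAM held by shared memory, which includes tmpfs files."""
    fields = {}
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[name.strip()] = int(parts[0])  # first token is the value (kB for sizes)
    return fields["Shmem"] / fields["MemTotal"]


# Hypothetical excerpt in the format of /proc/meminfo.
sample = """MemTotal:       65536000 kB
MemFree:         2048000 kB
Shmem:          49152000 kB"""

fraction = shmem_fraction(sample)
print(f"Shmem uses {fraction:.0%} of node RAM")
```

On a real node, the input could be read with `open("/proc/meminfo").read()`. A high and growing fraction during shuffle stages would suggest that spill is landing in memory rather than on disk.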
Spark · Kubernetes · OOM · Cloud Migration · Shuffle Spill · Pod Affinity · tmpfs · Distributed Systems
