InfoQ Cloud · June 3, 2026

Avoiding Spark OOM Failures on Kubernetes: A Migration Case Study

This article details a post-migration incident where Spark jobs on Azure Kubernetes Service (AKS) experienced repeated Out-Of-Memory (OOM) failures. It highlights how two infrastructure misconfigurations — RAM-backed local scratch directories and forced executor co-location — interacted under production load to exhaust node memory, offering crucial lessons in validating infrastructure behavior during cloud migrations.

This case study examines a common pitfall in cloud migrations: the assumption that lift-and-shift will preserve runtime behavior. In this instance, a Spark batch pipeline, stable for years on-premises, failed repeatedly after migrating to Azure Kubernetes Service (AKS). The root cause wasn't Spark application tuning, but rather subtle changes in infrastructure configuration that fundamentally altered resource handling.

Key Misconfigurations and Their Impact

· RAM-backed Local Scratch Directories (`spark.kubernetes.local.dirs.tmpfs=true`): This setting mounts Spark's local scratch directories as `tmpfs` (a memory-backed filesystem), so shuffle files and spill are written to RAM instead of disk. For shuffle-intensive jobs, this led to rapid consumption of node RAM.
· Hard Pod Affinity Rule (`podAffinity` with `requiredDuringSchedulingIgnoredDuringExecution`): Instead of distributing Spark executors across multiple nodes, a misconfigured `podAffinity` rule forced all executors onto a single Kubernetes node. This concentrated all shuffle-time memory pressure and I/O on one machine.
· Insufficient Volume Limits: The RAM-backed scratch volumes (`tmp-volume`, `workdir`) were sized at only 1Gi, which was far too small for the shuffle data generated by the multi-pass processing job.

The combination of these factors created a perfect storm: all shuffle data from multiple executors was being spilled to the RAM of a *single* node, rapidly exceeding its capacity and triggering Kubernetes OOM kills.
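A rough back-of-the-envelope model makes the interaction easier to see. The sketch below is only illustrative: the node size, executor count, per-executor memory, and spill volume are assumed numbers, not figures from the incident, and the helper `node_memory_demand_gib` is a hypothetical name. It also ignores page cache, OS overhead, and other pods on the node.

```python
def node_memory_demand_gib(executors_on_node, executor_memory_gib,
                           spill_gib_per_executor, spill_in_ram):
    """Rough per-node RAM demand in GiB.

    Executor memory is always resident; spill only counts against RAM
    when the scratch directories are tmpfs-backed.
    """
    spill = spill_gib_per_executor if spill_in_ram else 0
    return executors_on_node * (executor_memory_gib + spill)


# Illustrative assumptions, not measurements from the incident.
NODE_RAM_GIB = 64
EXECUTORS = 8             # total executors in the job
EXECUTOR_MEMORY_GIB = 6   # heap plus overhead per executor
SPILL_GIB = 4             # shuffle spill per executor
NODES_WHEN_SPREAD = 4     # executors divided evenly across this many nodes

per_node_spread = EXECUTORS // NODES_WHEN_SPREAD

scenarios = {
    "co-located, tmpfs spill": node_memory_demand_gib(
        EXECUTORS, EXECUTOR_MEMORY_GIB, SPILL_GIB, spill_in_ram=True),
    "co-located, disk spill": node_memory_demand_gib(
        EXECUTORS, EXECUTOR_MEMORY_GIB, SPILL_GIB, spill_in_ram=False),
    "spread, tmpfs spill": node_memory_demand_gib(
        per_node_spread, EXECUTOR_MEMORY_GIB, SPILL_GIB, spill_in_ram=True),
    "spread, disk spill": node_memory_demand_gib(
        per_node_spread, EXECUTOR_MEMORY_GIB, SPILL_GIB, spill_in_ram=False),
}

for name, demand in scenarios.items():
    verdict = "exceeds" if demand > NODE_RAM_GIB else "fits within"
    print(f"{name}: {demand} GiB {verdict} {NODE_RAM_GIB} GiB node RAM")
```

Under these assumed numbers, only the co-located, tmpfs-backed case (80 GiB) exceeds the 64 GiB node. Either setting on its own stays within budget, which is one way a compound misconfiguration can go unnoticed until production-scale load.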

Resolution and Lessons Learned

Parameter	Before (broken)	After (fixed)

System Design Takeaways

Cloud migrations are not just about lifting and shifting; they require explicit validation of underlying infrastructure contracts. Storage semantics, scheduling behavior, and resource isolation models can differ significantly between environments. Thoroughly test under production-scale load to uncover compound misconfigurations that manifest only at scale, and consider potential interactions between seemingly isolated settings.
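One way to make such infrastructure contracts explicit is a pre-deployment check that inspects the Spark configuration and executor pod spec for the patterns described in this case. The following is a minimal sketch, not the team's actual tooling: `find_risky_settings`, `parse_gib`, and the `min_scratch_gib` threshold are hypothetical, the pod spec is assumed to be available as a plain dictionary, and the scratch volumes are assumed to be `emptyDir` volumes with `medium: Memory`.

```python
def parse_gib(quantity):
    """Convert a Kubernetes quantity such as '1Gi' or '512Mi' to GiB.

    Only the Gi and Mi suffixes are handled, a simplification for this sketch.
    """
    if quantity.endswith("Gi"):
        return float(quantity[:-2])
    if quantity.endswith("Mi"):
        return float(quantity[:-2]) / 1024
    raise ValueError(f"unsupported quantity: {quantity}")


def find_risky_settings(spark_conf, pod_spec, min_scratch_gib=10.0):
    """Return human-readable findings for the three patterns in this case study."""
    findings = []

    # RAM-backed local scratch directories.
    tmpfs = str(spark_conf.get("spark.kubernetes.local.dirs.tmpfs", "false"))
    if tmpfs.lower() == "true":
        findings.append(
            "spark.kubernetes.local.dirs.tmpfs=true: shuffle spill goes to RAM")

    # Hard (required) pod affinity that can pin executors to one node.
    pod_affinity = pod_spec.get("affinity", {}).get("podAffinity", {})
    if pod_affinity.get("requiredDuringSchedulingIgnoredDuringExecution"):
        findings.append(
            "hard podAffinity rule: executors may be forced onto one node")

    # Memory-backed emptyDir volumes with a small size limit.
    for volume in pod_spec.get("volumes", []):
        empty_dir = volume.get("emptyDir")
        if empty_dir is None or empty_dir.get("medium") != "Memory":
            continue
        limit = empty_dir.get("sizeLimit")
        if limit is not None and parse_gib(limit) < min_scratch_gib:
            findings.append(
                f"volume {volume['name']}: RAM-backed and limited to {limit}")

    return findings


# Example input mirroring the broken setup; label values are illustrative.
broken_conf = {"spark.kubernetes.local.dirs.tmpfs": "true"}
broken_pod = {
    "affinity": {"podAffinity": {
        "requiredDuringSchedulingIgnoredDuringExecution": [{
            "labelSelector": {"matchLabels": {"spark-role": "executor"}},
            "topologyKey": "kubernetes.io/hostname",
        }],
    }},
    "volumes": [
        {"name": "tmp-volume", "emptyDir": {"medium": "Memory", "sizeLimit": "1Gi"}},
        {"name": "workdir", "emptyDir": {"medium": "Memory", "sizeLimit": "1Gi"}},
    ],
}

for finding in find_risky_settings(broken_conf, broken_pod):
    print(finding)
```

A static check like this only catches known patterns. It complements load testing at production scale rather than replacing it, since the failure here depended on how the settings interacted under real shuffle volume.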

Architecture Design

Design this yourself

Design a fault-tolerant, scalable data processing platform on Kubernetes for large-scale batch analytics jobs using Apache Spark. The platform must prevent Out-Of-Memory (OOM) failures during shuffle-heavy stages by ensuring robust memory management for shuffle spill and intelligent executor placement across nodes, while also accommodating diverse data workloads.

Other design angles

· Design a Spark-on-Kubernetes architecture focusing specifically on optimizing shuffle performance and preventing OOMs for ETL jobs with varying data sizes and processing complexities.
· Propose a robust CI/CD and testing strategy for Spark applications on Kubernetes that includes infrastructure-level validation to prevent misconfigurations like RAM-backed scratch directories and forced co-location from reaching production.
· Architect a multi-tenant Spark platform on Kubernetes where resource allocation and isolation are critical to prevent noisy neighbor issues and OOM failures due to misconfigured individual jobs.
