InfoQ Architecture · June 3, 2026

Preventing Spark OOM Failures on Kubernetes: Lessons from Cloud Migration Misconfigurations

This article details a case study where migrating Spark batch pipelines to Azure Kubernetes Service (AKS) led to repeated Out-Of-Memory (OOM) failures. The root cause was identified as a combination of infrastructure misconfigurations: RAM-backed local scratch directories and forced executor co-location on a single node. It emphasizes the importance of validating infrastructure contracts during cloud migrations to avoid subtle, load-dependent issues.

Migrating existing workloads to new infrastructure, especially from on-premises to cloud-native platforms like Kubernetes, often surfaces unexpected interactions between application configuration and the underlying environment. This case study highlights two subtle but critical infrastructure misconfigurations that caused severe Spark OOM failures after a lift-and-shift migration to Azure Kubernetes Service (AKS).

The Problem: Persistent Spark Executor OOMs

After migrating a shuffle-intensive Spark batch job to AKS, executors began failing with OOM errors during shuffle stages. Initial diagnostics focused on Spark-level memory tuning (increasing `spark.executor.memory`, adjusting executor counts), but these proved ineffective. The job had been stable for years on-premises, suggesting an infrastructure-level change was at play rather than a Spark application bug.

Key Misconfigurations Identified

RAM-backed local scratch directories (`spark.kubernetes.local.dirs.tmpfs=true`): This setting caused Spark to use `tmpfs` (a memory-backed filesystem) for shuffle spill rather than disk-backed storage. Combined with a small `emptyDir` `sizeLimit` (1Gi), this quickly exhausted node RAM during shuffle-heavy operations.
Hard `podAffinity` rule forcing executor co-location: A `podAffinity` rule using `requiredDuringSchedulingIgnoredDuringExecution` unintentionally constrained all Spark executor pods to the same Kubernetes node. This concentrated all memory and I/O pressure, including the RAM-backed shuffle spill, onto a single machine.
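
Both patterns above can be detected mechanically before a job is submitted. The following is a minimal sketch, not the tooling used in the case study: it assumes the Spark configuration is available as a plain dictionary and the executor pod spec as a dictionary shaped like a Kubernetes PodSpec. The function name, the finding messages, and the `spark-role: executor` label in the sample input are illustrative assumptions.

```python
from typing import Any, Dict, List

HOSTNAME_TOPOLOGY_KEY = "kubernetes.io/hostname"


def find_misconfigurations(spark_conf: Dict[str, str],
                           executor_pod_spec: Dict[str, Any]) -> List[str]:
    """Return human-readable findings for the two patterns described above."""
    findings: List[str] = []

    # Pattern 1: Spark local directories backed by RAM (tmpfs).
    tmpfs_flag = str(spark_conf.get("spark.kubernetes.local.dirs.tmpfs", "false"))
    if tmpfs_flag.strip().lower() == "true":
        findings.append("local dirs are tmpfs-backed: shuffle spill consumes RAM")

    # Pattern 2: a hard podAffinity term keyed on the node hostname, which
    # requires matching pods to be scheduled onto the same node.
    affinity = executor_pod_spec.get("affinity") or {}
    pod_affinity = affinity.get("podAffinity") or {}
    hard_terms = pod_affinity.get("requiredDuringSchedulingIgnoredDuringExecution") or []
    for term in hard_terms:
        if term.get("topologyKey") == HOSTNAME_TOPOLOGY_KEY:
            findings.append("hard podAffinity on hostname: executors forced onto one node")
            break

    return findings


# Hypothetical inputs mirroring the failing setup; the label is an assumption.
BROKEN_CONF = {"spark.kubernetes.local.dirs.tmpfs": "true"}
BROKEN_POD_SPEC = {
    "affinity": {
        "podAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": [
                {
                    "labelSelector": {"matchLabels": {"spark-role": "executor"}},
                    "topologyKey": HOSTNAME_TOPOLOGY_KEY,
                }
            ]
        }
    }
}

if __name__ == "__main__":
    for finding in find_misconfigurations(BROKEN_CONF, BROKEN_POD_SPEC):
        print("WARNING:", finding)
```

A check like this only covers the two patterns named here; it is a starting point for a pre-submission review, not a complete validation of a pod spec.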

Compounding Effects

Individually, each misconfiguration might have been manageable. Shuffle spill to RAM could be mitigated by distributing executors across nodes. Executor co-location might be tolerable with disk-backed spill. However, their combination created a catastrophic scenario: all shuffle memory pressure, backed by RAM, concentrated on a single node, leading to rapid node memory exhaustion and OOM kills by the kernel.
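
The compounding effect can be illustrated with back-of-the-envelope arithmetic. The sketch below uses entirely hypothetical numbers (a 64 GiB node, 9 executors at 6.5 GiB each including overhead, 1 GiB of spill per executor, and 3 nodes when executors are spread); none of these figures come from the case study. It models RAM demand on the busiest node as executors on that node multiplied by per-executor memory, plus spill only when spill lives in RAM.

```python
from dataclasses import dataclass

# All figures are hypothetical and chosen only to illustrate the interaction.
NODE_ALLOCATABLE_GIB = 64.0
EXECUTOR_MEMORY_GIB = 6.5      # heap plus overhead per executor pod
SPILL_PER_EXECUTOR_GIB = 1.0   # shuffle spill written per executor


@dataclass(frozen=True)
class Scenario:
    name: str
    executors_on_busiest_node: int
    spill_in_ram: bool


def node_ram_demand_gib(scenario: Scenario) -> float:
    """RAM demanded on the busiest node under this simplified model."""
    per_executor = EXECUTOR_MEMORY_GIB
    if scenario.spill_in_ram:
        # tmpfs pages live in RAM, so spill adds to memory demand.
        per_executor += SPILL_PER_EXECUTOR_GIB
    return scenario.executors_on_busiest_node * per_executor


def fits_on_node(scenario: Scenario) -> bool:
    return node_ram_demand_gib(scenario) <= NODE_ALLOCATABLE_GIB


SCENARIOS = [
    Scenario("co-located, tmpfs spill", executors_on_busiest_node=9, spill_in_ram=True),
    Scenario("co-located, disk spill", executors_on_busiest_node=9, spill_in_ram=False),
    Scenario("spread over 3 nodes, tmpfs spill", executors_on_busiest_node=3, spill_in_ram=True),
    Scenario("spread over 3 nodes, disk spill", executors_on_busiest_node=3, spill_in_ram=False),
]

if __name__ == "__main__":
    for s in SCENARIOS:
        status = "fits" if fits_on_node(s) else "EXCEEDS NODE RAM"
        print(f"{s.name}: {node_ram_demand_gib(s):.1f} GiB -> {status}")
```

With these made-up numbers, only the combined case exceeds the node's capacity, which mirrors the argument above: either misconfiguration alone leaves headroom, while both together do not. Real behavior also depends on page cache, kubelet eviction thresholds, and other pods on the node, which this model ignores.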

Resolution and Architectural Lessons

The fix involved setting `spark.kubernetes.local.dirs.tmpfs=false` (to use disk-backed storage), increasing `tmp-volume` and `workdir` sizes to 10Gi, and replacing the hard `podAffinity` rule with a softer `podAntiAffinity` rule (`preferredDuringSchedulingIgnoredDuringExecution`) to encourage executor distribution. This immediately resolved the OOM failures, highlighting critical lessons for cloud migrations:

Explicitly validate infrastructure contracts: Assumptions about storage semantics and scheduling behavior from on-premises environments may not hold in cloud-native platforms. Thoroughly review and validate configuration changes.
Test under production-scale load: Subtle infrastructure misconfigurations often only manifest under significant load and realistic data profiles.
Monitor at the node level: Node-level memory utilization, including memory consumed by RAM-backed ephemeral storage such as `tmpfs`, can be a more accurate indicator of resource exhaustion than application-level metrics, particularly when diagnosing OOMKilled events.
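
As one way to act on the node-level monitoring lesson, the sketch below parses Linux `/proc/meminfo` text and reports how much of a node's RAM is held by shared memory and `tmpfs`, which the kernel reports under the `Shmem` field. The function names, the 25% warning threshold, and the sample values are assumptions made for illustration; a production setup would more likely read the same fields through an existing metrics agent.

```python
from typing import Dict

# Hypothetical threshold: warn when tmpfs/shared memory holds 25% of node RAM.
DEFAULT_WARN_FRACTION = 0.25


def parse_meminfo(text: str) -> Dict[str, int]:
    """Parse /proc/meminfo-style text into a mapping of field name to KiB."""
    values: Dict[str, int] = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def tmpfs_pressure(meminfo_kib: Dict[str, int],
                   warn_fraction: float = DEFAULT_WARN_FRACTION) -> Dict[str, object]:
    """Summarize how much node RAM is held by Shmem (which includes tmpfs)."""
    total = meminfo_kib["MemTotal"]
    shmem_fraction = meminfo_kib.get("Shmem", 0) / total
    available_fraction = meminfo_kib.get("MemAvailable", 0) / total
    return {
        "shmem_fraction": shmem_fraction,
        "available_fraction": available_fraction,
        "warn": shmem_fraction >= warn_fraction,
    }


# Made-up sample resembling a node where tmpfs spill has consumed half of RAM.
SAMPLE_MEMINFO = """\
MemTotal:       65536000 kB
MemAvailable:    4096000 kB
Shmem:          32768000 kB
"""

if __name__ == "__main__":
    print(tmpfs_pressure(parse_meminfo(SAMPLE_MEMINFO)))
```

On a real node the input would come from reading `/proc/meminfo` directly. Tracking `Shmem` alongside available memory makes RAM-backed spill visible even when Spark's own heap metrics look healthy.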
