This article details a post-migration incident where Spark jobs on Azure Kubernetes Service (AKS) experienced repeated out-of-memory (OOM) failures. It highlights how two infrastructure misconfigurations (RAM-backed local scratch directories and forced executor co-location) interacted under production load to exhaust node memory, offering crucial lessons in validating infrastructure behavior during cloud migrations.
Read original on InfoQ Cloud

This case study examines a common pitfall in cloud migrations: the assumption that lift-and-shift will preserve runtime behavior. In this instance, a Spark batch pipeline, stable for years on-premises, failed repeatedly after migrating to Azure Kubernetes Service (AKS). The root cause wasn't Spark application tuning, but rather subtle changes in infrastructure configuration that fundamentally altered resource handling.
Combined, the two misconfigurations created a perfect storm: with local scratch directories backed by RAM and executors forced onto the same node, all shuffle data from multiple executors was being spilled to the RAM of a *single* node, rapidly exceeding its capacity and triggering Kubernetes OOM kills.
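To make the interaction concrete, here is a rough back-of-the-envelope sketch in Python. Every number in it (heap, overhead, scratch size, node RAM, four executors per node) is a hypothetical assumption, not a figure from the incident. The only point is that RAM-backed scratch is charged to memory rather than disk, so co-location multiplies that charge on one node. In Spark on Kubernetes, one setting that produces RAM-backed local directories is `spark.kubernetes.local.dirs.tmpfs`, though this summary does not say which setting was involved.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass(frozen=True)
class ExecutorProfile:
    """Hypothetical per-executor figures; none of these come from the article."""
    heap_bytes: int      # JVM heap requested for the executor
    overhead_bytes: int  # off-heap and container overhead
    scratch_bytes: int   # peak shuffle and spill data written to local dirs


def node_ram_demand(profile, executors_on_node, scratch_in_ram):
    """Peak RAM demanded on one node by the executors scheduled there."""
    per_executor = profile.heap_bytes + profile.overhead_bytes
    if scratch_in_ram:
        # RAM-backed scratch (tmpfs) consumes memory instead of disk.
        per_executor += profile.scratch_bytes
    return per_executor * executors_on_node


# Assumed values, chosen only to illustrate the compounding effect.
profile = ExecutorProfile(
    heap_bytes=8 * GIB, overhead_bytes=1 * GIB, scratch_bytes=12 * GIB
)
node_ram = 64 * GIB  # hypothetical allocatable memory per node

scenarios = {
    "disk scratch, spread (1 per node)": node_ram_demand(profile, 1, False),
    "tmpfs scratch, spread (1 per node)": node_ram_demand(profile, 1, True),
    "disk scratch, co-located (4 per node)": node_ram_demand(profile, 4, False),
    "tmpfs scratch, co-located (4 per node)": node_ram_demand(profile, 4, True),
}

for name, demand in scenarios.items():
    verdict = "fits" if demand <= node_ram else "exceeds node RAM"
    print(f"{name}: {demand / GIB:.0f} GiB -> {verdict}")
```

With these assumed figures, either misconfiguration on its own still fits within the node, and only the combination exceeds it, which is why such a problem can stay hidden until production-scale load.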
| Parameter | Before (broken) | After (fixed) |
|---|---|---|
System Design Takeaways
Cloud migrations are not just about lifting and shifting; they require explicit validation of underlying infrastructure contracts. Storage semantics, scheduling behavior, and resource isolation models can differ significantly between environments. Thoroughly test under production-scale load to uncover compound misconfigurations that manifest only at scale, and consider potential interactions between seemingly isolated settings.
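One way to validate a storage contract explicitly is to check what filesystem actually backs a scratch directory inside the running container. The sketch below parses `/proc/mounts`-style text and reports whether a path sits on a RAM-backed filesystem. The helper names, the sample mount table, and the paths such as `/var/data/spark-local` are all hypothetical illustrations, not details from the incident.

```python
RAM_BACKED_TYPES = {"tmpfs", "ramfs"}


def parse_mounts(mounts_text):
    """Parse /proc/mounts-style text into (mount_point, fs_type) pairs.

    Simplification: escaped characters in mount points (e.g. \\040 for a
    space) are not decoded.
    """
    entries = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3:
            entries.append((fields[1], fields[2]))
    return entries


def fs_type_for(path, entries):
    """Return the filesystem type of the longest mount point containing path."""
    best_mount, best_type = "", None
    for mount_point, fs_type in entries:
        prefix = mount_point.rstrip("/") + "/"
        contains = path == mount_point or path.startswith(prefix)
        # ">=" lets a later mount on the same mount point override an earlier one.
        if contains and len(mount_point) >= len(best_mount):
            best_mount, best_type = mount_point, fs_type
    return best_type


def is_ram_backed(path, entries):
    """True if the path lives on tmpfs or ramfs according to the mount table."""
    return fs_type_for(path, entries) in RAM_BACKED_TYPES


# Hypothetical mount table; on a real Linux pod, read open("/proc/mounts").read().
SAMPLE_MOUNTS = """\
overlay / overlay rw,relatime 0 0
tmpfs /dev/shm tmpfs rw,nosuid,nodev 0 0
tmpfs /var/data/spark-local tmpfs rw,relatime 0 0
/dev/sdb1 /mnt/scratch ext4 rw,relatime 0 0
"""

entries = parse_mounts(SAMPLE_MOUNTS)
for scratch_dir in ("/var/data/spark-local/blockmgr-1", "/mnt/scratch/spill"):
    print(scratch_dir, "RAM-backed:", is_ram_backed(scratch_dir, entries))
```

A check like this could run as a pre-flight step in the job's container, failing fast when scratch space turns out to be memory-backed instead of surfacing later as OOM kills under load.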