Slack Engineering·May 5, 2026

Modernizing Data Pipelines: Migrating from SSH to REST at Slack

This article details Slack's large-scale migration of over 700 SSH-based EMR data pipelines to a REST-based architecture. It highlights the security, operational, and architectural limitations of relying on SSH for job orchestration and presents a solution built on YARN Distributed Shell and a custom gateway service called Quarry. The migration demonstrates a critical shift toward more secure, reliable, and scalable data processing infrastructure.


Slack faced significant technical debt and security risk with over 700 production data pipelines relying on direct SSH access to AWS EMR clusters. This legacy approach, born from early simplicity, became a major impediment to infrastructure modernization: it introduced security vulnerabilities and operational complexity, and it blocked advancements like migrating to EMR on EKS.

The Problems with SSH-based Job Execution

  • Security Risks: Direct SSH exposed a large attack surface, complicated key management, lacked fine-grained auditing, and made permission management unwieldy.
  • Operational Pain Points: Jobs running directly on master nodes caused resource contention. Kubernetes pod restarts broke SSH connections, leading to job failures and "zombie" processes. Determining job success or failure after connection drops was unreliable.
  • Architectural Blockers: The SSH dependency prevented migration to modern platforms like Spark on Kubernetes/EMR on EKS and inhibited internal initiatives for improved security and network isolation (Whitecastle).

The Solution: REST-based Architecture with YARN Distributed Shell and Quarry

The core of the modernization was to shift from stateful SSH connections to stateless REST-based job submissions. Modern compute engines offer HTTP APIs for this, allowing clients to submit a job, get an ID, and then query its status independently. This decouples the client from the job's lifecycle, improving reliability.
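The submit-then-poll pattern described above can be sketched as a small client. Here the `submit` and `get_status` callables stand in for a compute engine's HTTP API; the status values are illustrative, not any specific engine's:

```python
import time

# Terminal states a status endpoint might report (illustrative values).
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "KILLED"}

def run_job(submit, get_status, poll_interval=10.0, timeout=3600.0):
    """Submit a job, then poll its status until it reaches a terminal state.

    `submit()` returns a job ID; `get_status(job_id)` returns a state string.
    Because job state lives on the server, the client can crash, restart,
    and resume polling with the same job ID -- unlike a long-lived SSH
    session, whose death takes the job's fate with it.
    """
    job_id = submit()
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        state = get_status(job_id)
        if state in TERMINAL_STATES:
            return job_id, state
        time.sleep(poll_interval)
    raise TimeoutError(f"job {job_id} did not finish within {timeout}s")
```

The key design point is that the two calls are independent HTTP requests: a dropped connection between them loses nothing.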


YARN Distributed Shell: The Game Changer

For jobs without native REST APIs (like arbitrary shell commands), YARN Distributed Shell was crucial. This little-known YARN feature allows any shell script to run within a proper YARN container, leveraging YARN's resource management, isolation, fault tolerance, and logging. This enabled migration of *all* SSH-based jobs, not just native Hadoop workloads.
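Distributed Shell is typically invoked through the client class bundled with Hadoop. A minimal sketch of assembling that invocation from Python follows; flags like `-shell_command` and `-num_containers` are real Distributed Shell client options, but the jar path is an assumption and varies by distribution:

```python
def distributed_shell_command(
    shell_command,
    num_containers=1,
    container_memory_mb=1024,
    ds_jar="/usr/lib/hadoop-yarn/hadoop-yarn-applications-distributedshell.jar",
):
    """Build the `yarn` CLI invocation that runs an arbitrary shell command
    inside a YARN container via the Distributed Shell application.

    The jar path above is an assumption; it differs across EMR and vanilla
    Hadoop installs.
    """
    return [
        "yarn", "org.apache.hadoop.yarn.applications.distributedshell.Client",
        "-jar", ds_jar,                    # the Distributed Shell app jar
        "-shell_command", shell_command,   # command to run in the container
        "-num_containers", str(num_containers),
        "-container_memory", str(container_memory_mb),
    ]

# e.g. subprocess.run(distributed_shell_command("python etl_step.py"), check=True)
```

Running the command this way means the script inherits YARN's resource limits, retries, and log aggregation instead of competing for resources on the master node.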

Quarry: Slack's Universal Job Submission Gateway

Quarry acts as an abstraction layer, sitting between orchestrators (like Airflow) and various compute engines (YARN, Trino, Snowflake). It handles authentication, job submission, state tracking, lifecycle management, and observability through a unified REST API. This enabled Airflow operators to make simple HTTP requests to Quarry, which then managed the job submission and status polling with YARN, completely eliminating direct SSH interactions and their associated problems.
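A gateway like this implies an engine-agnostic submission schema. The sketch below is hypothetical: the field names and the `/v1/jobs` endpoint are invented for illustration, since Quarry's actual API is not public:

```python
def build_submission(engine, command, queue="default", labels=None):
    """Assemble a hypothetical engine-agnostic job submission payload.

    Field names here are illustrative, not Quarry's real schema.
    """
    return {
        "engine": engine,        # e.g. "yarn", "trino", "snowflake"
        "command": command,      # what to run; interpretation is per-engine
        "queue": queue,          # scheduling queue or warehouse
        "labels": labels or {},  # metadata for auditing and observability
    }

# An Airflow operator would POST this to the gateway and get a job ID back:
#   resp = requests.post("https://quarry.example.internal/v1/jobs", json=payload)
#   job_id = resp.json()["job_id"]
# and then poll a status endpoint with that ID instead of holding an SSH session.
```

Because every engine hides behind the same payload shape, adding a new backend changes Quarry, not the hundreds of pipelines that call it.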

Tags: data pipeline, ETL, AWS EMR, YARN, REST API, SSH, orchestration, modernization
