InfoQ Architecture · June 12, 2026

Slack's Migration from SSH to REST-Based Architecture for EMR Pipelines

Slack modernized its data platform by replacing SSH-based job execution with a REST-driven orchestration layer for its Amazon EMR pipelines. This large-scale migration of over 700 Airflow jobs significantly improved security, reliability, and observability by eliminating direct SSH access and centralizing job submission and lifecycle management. The new architecture decouples job execution from client connectivity, addressing scalability and operational challenges inherent in the previous SSH approach.

Read original on InfoQ Architecture

Slack undertook a significant architectural shift in its data platform, moving away from a legacy SSH-based job execution model on Amazon EMR to a more robust REST-based architecture. This modernization effort primarily aimed to enhance security, reliability, and observability for its extensive data processing pipelines, which included critical workloads like search indexing and analytics.

Challenges with SSH-Based Execution

The previous system relied on Apache Airflow operators directly establishing SSH connections to EMR master nodes to execute jobs. While simple for initial setups, this approach faced severe limitations as the number of production workflows grew into the hundreds. Key problems included:

  • Expanded Attack Surface: Direct SSH access to production clusters created significant security vulnerabilities.
  • Operational Overhead: Managing and rotating SSH keys for hundreds of workflows was complex and resource-intensive.
  • Auditing Difficulties: Correlating logs across multiple systems for execution auditing was challenging.
  • Reliability Issues: Jobs could silently fail or continue running after SSH connection drops, leading to inconsistent state and difficult debugging.

The New REST-Based Architecture with Quarry

Slack implemented a new internal orchestration layer called Quarry to facilitate a REST-based job submission model. Instead of persistent SSH sessions, Airflow now interacts with Quarry via HTTP APIs. This model introduces a server-side job lifecycle, allowing jobs to be submitted, tracked via unique IDs, and canceled in a controlled manner, effectively decoupling execution from client connectivity.
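The submit, track, and cancel lifecycle described above can be sketched as a small state machine. This is a minimal illustration, not Quarry's actual design: the article does not describe Quarry's API, so the `JobRegistry` class, the state names, and the method names below are all assumptions, and an in-memory dictionary stands in for whatever store a real service would use.

```python
import uuid
from dataclasses import dataclass
from enum import Enum


class JobState(Enum):
    # Hypothetical state names; the source does not list Quarry's states.
    SUBMITTED = "SUBMITTED"
    RUNNING = "RUNNING"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"
    CANCELED = "CANCELED"


# States from which no further transition is allowed.
TERMINAL = {JobState.SUCCEEDED, JobState.FAILED, JobState.CANCELED}


@dataclass
class Job:
    job_id: str
    command: str
    state: JobState = JobState.SUBMITTED


class JobRegistry:
    """Hypothetical in-memory stand-in for a server-side job store."""

    def __init__(self) -> None:
        self._jobs: dict[str, Job] = {}

    def submit(self, command: str) -> str:
        """Record a new job and return the unique ID the client keeps."""
        job_id = uuid.uuid4().hex
        self._jobs[job_id] = Job(job_id, command)
        return job_id

    def status(self, job_id: str) -> JobState:
        """Look up a job's state; the caller needs nothing but the ID."""
        return self._jobs[job_id].state

    def transition(self, job_id: str, new_state: JobState) -> None:
        """Move a job to a new state; finished jobs cannot change."""
        job = self._jobs[job_id]
        if job.state in TERMINAL:
            raise ValueError(f"job {job_id} already finished: {job.state.value}")
        job.state = new_state

    def cancel(self, job_id: str) -> bool:
        """Cancel a job if it is still active; report whether it was canceled."""
        job = self._jobs[job_id]
        if job.state in TERMINAL:
            return False
        job.state = JobState.CANCELED
        return True


if __name__ == "__main__":
    registry = JobRegistry()
    job_id = registry.submit("spark-submit index_job.py")  # illustrative command
    registry.transition(job_id, JobState.RUNNING)
    print(job_id, registry.status(job_id).value)
    print("canceled:", registry.cancel(job_id))
```

Because the state lives with the server rather than in a client session, a dropped connection does not change what the registry knows about a job. In a REST service, each method would plausibly map to one HTTP endpoint, though the actual routes are not given in the source.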

Architectural Principle: Decoupling

Decoupling job execution from client connectivity improves system robustness. If the client (e.g., the Airflow scheduler) goes down after submitting a job, the job can still run to completion on the server. This design also centralizes control and observability, which simplifies monitoring and management.
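On the client side, this principle reduces to holding a job ID and polling for a terminal state. The sketch below is one way such a loop might look, under stated assumptions: `wait_for_job`, the state strings, and the `get_status` callable are hypothetical, and in practice `get_status` would wrap an HTTP call to the orchestration layer. The `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting.

```python
import time
from typing import Callable

# Hypothetical terminal state names; the source does not list the real ones.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELED"}


def wait_for_job(
    get_status: Callable[[str], str],
    job_id: str,
    poll_interval: float = 5.0,
    timeout: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll until the job reaches a terminal state, then return that state.

    The only thing carried between calls is job_id, so a restarted client
    can call this again with the same ID and pick up where it left off.
    """
    deadline = clock() + timeout
    while True:
        state = get_status(job_id)
        if state in TERMINAL_STATES:
            return state
        if clock() >= deadline:
            raise TimeoutError(f"job {job_id} still {state} after {timeout}s")
        sleep(poll_interval)


if __name__ == "__main__":
    # Simulated server responses standing in for HTTP status lookups.
    responses = iter(["SUBMITTED", "RUNNING", "SUCCEEDED"])
    final = wait_for_job(lambda _id: next(responses), "job-123", poll_interval=0.0)
    print(final)
```

Unlike an SSH session, nothing here depends on a long-lived connection: each status check is an independent request, so a network drop between polls costs at most one retry rather than the job itself.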

Tags: EMR, AWS, Airflow, REST API, SSH, Data Pipelines, Modernization, Distributed Computing
