Slack modernized its data platform by replacing SSH-based job execution with a REST-driven orchestration layer for its Amazon EMR pipelines. This large-scale migration of over 700 Airflow jobs significantly improved security, reliability, and observability by eliminating direct SSH access and centralizing job submission and lifecycle management. The new architecture decouples job execution from client connectivity, addressing scalability and operational challenges inherent in the previous SSH approach.
Read original on InfoQ Architecture
Slack undertook a significant architectural shift in its data platform, moving away from a legacy SSH-based job execution model on Amazon EMR to a more robust REST-based architecture. This modernization effort primarily aimed to enhance security, reliability, and observability for its extensive data processing pipelines, which included critical workloads like search indexing and analytics.
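The previous system relied on Apache Airflow operators that opened SSH connections directly to EMR master nodes to execute jobs. This was simple for initial setups, but it became severely limiting as the number of production workflows grew into the hundreds. The key problems were the ones the migration set out to fix: direct SSH access to cluster nodes, job execution tied to a live client connection, and no central point for job submission and lifecycle management.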
Slack implemented a new internal orchestration layer called Quarry to facilitate a REST-based job submission model. Instead of persistent SSH sessions, Airflow now interacts with Quarry via HTTP APIs. This model introduces a server-side job lifecycle, allowing jobs to be submitted, tracked via unique IDs, and canceled in a controlled manner, effectively decoupling execution from client connectivity.
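The submit, track, and cancel lifecycle described above can be illustrated with a small in-memory model. This is only a sketch of the general pattern, not Quarry's actual API: the class JobService, its method names, the state names, and the HTTP routes mentioned in the comments are all hypothetical and chosen for illustration.

```python
import uuid
from dataclasses import dataclass
from enum import Enum


class JobState(Enum):
    SUBMITTED = "SUBMITTED"
    RUNNING = "RUNNING"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"
    CANCELLED = "CANCELLED"


# Once a job reaches one of these states it never changes again.
TERMINAL_STATES = {JobState.SUCCEEDED, JobState.FAILED, JobState.CANCELLED}


@dataclass
class Job:
    job_id: str
    command: str
    state: JobState = JobState.SUBMITTED


class JobService:
    """Hypothetical stand-in for the server side of a REST job API."""

    def __init__(self) -> None:
        self._jobs: dict[str, Job] = {}

    def submit(self, command: str) -> str:
        # Roughly what a "POST /jobs" handler might do (route is an assumption):
        # record the job on the server and hand back a unique ID.
        job_id = str(uuid.uuid4())
        self._jobs[job_id] = Job(job_id, command)
        return job_id

    def status(self, job_id: str) -> JobState:
        # Roughly "GET /jobs/{id}": the client needs only the ID to track a job.
        return self._jobs[job_id].state

    def cancel(self, job_id: str) -> JobState:
        # Roughly "DELETE /jobs/{id}": cancel only jobs that are not finished.
        job = self._jobs[job_id]
        if job.state not in TERMINAL_STATES:
            job.state = JobState.CANCELLED
        return job.state

    def advance(self, job_id: str, new_state: JobState) -> None:
        # Called by the server-side executor, never by the client.
        job = self._jobs[job_id]
        if job.state in TERMINAL_STATES:
            raise ValueError(f"job {job_id} is already {job.state.value}")
        job.state = new_state
```

The client in this model holds nothing but a job ID, so no long-lived session is needed between submission and completion. Cancellation is a request the server evaluates against its own record of the job, which is what makes it controlled rather than a side effect of a dropped connection.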
Architectural Principle: Decoupling
Decoupling job execution from client connectivity improves system robustness. If the client (e.g., the Airflow scheduler) goes down after submitting a job, the job can still run to completion on the server. This design also centralizes control and observability, which simplifies monitoring and management.
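A toy simulation can make the robustness claim concrete. In the sketch below, a hypothetical Server class runs each job on its own background thread, and a helper wait_for polls by job ID. All names, the string states, and the use of a sleep as stand-in work are assumptions for illustration. Nothing here reflects how Slack's system is implemented.

```python
import threading
import time
import uuid


class Server:
    """Hypothetical server that owns job execution, independent of any client."""

    def __init__(self) -> None:
        self._states: dict[str, str] = {}
        self._lock = threading.Lock()

    def submit(self, work_seconds: float) -> str:
        job_id = str(uuid.uuid4())
        with self._lock:
            self._states[job_id] = "RUNNING"
        # The job runs on a server-side thread; the caller is free to go away.
        threading.Thread(target=self._run, args=(job_id, work_seconds)).start()
        return job_id

    def _run(self, job_id: str, work_seconds: float) -> None:
        time.sleep(work_seconds)  # stand-in for the real workload
        with self._lock:
            self._states[job_id] = "SUCCEEDED"

    def status(self, job_id: str) -> str:
        with self._lock:
            return self._states[job_id]


def wait_for(server: Server, job_id: str,
             poll_interval: float = 0.01, timeout: float = 5.0) -> str:
    """Poll the server until the job leaves RUNNING, or raise on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        state = server.status(job_id)
        if state != "RUNNING":
            return state
        time.sleep(poll_interval)
    raise TimeoutError(f"job {job_id} still running after {timeout}s")


def demo() -> str:
    server = Server()
    # A first client submits the job and then "crashes": only the ID survives.
    job_id = server.submit(work_seconds=0.05)
    # A second client, knowing only the ID, picks up where the first left off.
    return wait_for(server, job_id)
```

The only state that must survive a client failure is the job ID. By contrast, a job launched inside an SSH session is typically tied to that session, so losing the connection can mean losing the job or losing track of it.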