Slack has completed a large-scale modernization of its data platform by replacing SSH-based job execution with a REST-driven orchestration layer across its Amazon EMR pipelines. The migration removed direct SSH access to production clusters and shifted more than 700 Airflow operators to a centralized job submission system, aiming to improve security, reliability, and observability across eight data regions.
Slack’s data platform previously relied on Airflow operators that executed jobs by opening SSH connections directly to Amazon EMR master nodes. Although simple at first, this approach became harder to scale as hundreds of production workflows, including search indexing and analytics pipelines, came to depend on it. By 2024, SSH-based execution was widely used across production clusters, introducing operational and security concerns.
The primary challenge was an expanded attack surface due to direct production access. SSH key distribution and rotation increased operational overhead, while execution auditing required correlating logs across multiple systems. Reliability also suffered, with jobs sometimes continuing after connection drops or failing silently under infrastructure instability.
To address these issues, Slack introduced a REST-based job submission model built on an internal orchestration layer called Quarry. Instead of persistent SSH sessions, Airflow now submits jobs through HTTP APIs. Each job follows a server-side lifecycle with submission, tracking via job IDs, and controlled cancellation, decoupling execution from client connectivity and improving centralized observability and control.
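The submit, track, and cancel lifecycle can be sketched roughly as follows. Quarry is internal to Slack and its API is not public, so every name here is a hypothetical illustration rather than Slack's implementation: the JobClient class, the /jobs paths, the state strings, and the in-memory FakeTransport that stands in for an HTTP server.

```python
import itertools
from typing import Callable, Dict, Optional

# Hypothetical transport signature: (method, path, body) -> JSON-like dict.
Transport = Callable[[str, str, Optional[dict]], dict]


class FakeTransport:
    """In-memory stand-in for an HTTP job service (assumption, for illustration)."""

    def __init__(self) -> None:
        self._ids = itertools.count(1)
        self.jobs: Dict[str, str] = {}  # job ID -> state, owned by the "server"

    def __call__(self, method: str, path: str, body: Optional[dict]) -> dict:
        if method == "POST" and path == "/jobs":
            job_id = f"job-{next(self._ids)}"
            self.jobs[job_id] = "RUNNING"
            return {"id": job_id, "state": "RUNNING"}
        job_id = path.rsplit("/", 1)[-1]
        if job_id not in self.jobs:
            raise KeyError(f"unknown job: {job_id}")
        if method == "GET":
            return {"id": job_id, "state": self.jobs[job_id]}
        if method == "DELETE":
            self.jobs[job_id] = "CANCELLED"
            return {"id": job_id, "state": "CANCELLED"}
        raise ValueError(f"unsupported request: {method} {path}")


class JobClient:
    """Thin client: it holds no job state, only the job ID returned on submit."""

    def __init__(self, send: Transport) -> None:
        self._send = send

    def submit(self, command: str) -> str:
        return self._send("POST", "/jobs", {"command": command})["id"]

    def state(self, job_id: str) -> str:
        return self._send("GET", f"/jobs/{job_id}", None)["state"]

    def cancel(self, job_id: str) -> str:
        return self._send("DELETE", f"/jobs/{job_id}", None)["state"]


server = FakeTransport()
job_id = JobClient(server).submit("spark-submit etl.py")

# The job lives on the server, so a brand-new client can still find it by ID.
reconnected = JobClient(server)
print(reconnected.state(job_id))
print(reconnected.cancel(job_id))
```

The point of the sketch is that the client keeps nothing but the job ID, so a dropped connection does not orphan or silently kill the job the way a broken SSH session can.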

Before and after architecture comparison (Source: Slack Blog Post)
The migration required additional engineering to support different workload types. While Spark and Hive workloads were transitioned using existing REST interfaces such as Livy and HiveServer2, a significant portion of workloads consisted of arbitrary shell commands. To support these cases, Slack used Apache Hadoop YARN’s Distributed Shell capability, which enables execution of shell commands inside managed containers with resource isolation and fault tolerance.
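For the Spark case, a REST submission through Livy might look like the minimal sketch below. It builds a request for Livy's batch endpoint (POST /batches) without sending it. The host name, jar path, class name, and arguments are placeholders, and how Slack's operators actually call Livy is not described in the source.

```python
import json
import urllib.request
from typing import List, Optional


def build_livy_batch_request(
    base_url: str,
    file: str,
    class_name: Optional[str] = None,
    args: Optional[List[str]] = None,
) -> urllib.request.Request:
    """Build (but do not send) a POST /batches request for Apache Livy."""
    payload = {"file": file}  # "file" is the application jar or script to run
    if class_name:
        payload["className"] = class_name
    if args:
        payload["args"] = list(args)
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/batches",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder values, for illustration only.
req = build_livy_batch_request(
    "http://livy.example.internal:8998",
    "s3://example-bucket/jobs/etl.jar",
    class_name="com.example.Etl",
    args=["--date", "2024-01-01"],
)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
# Sending it would be urllib.request.urlopen(req), which needs a reachable Livy server.
```

Livy returns a batch ID in the response. The state can then be polled with GET /batches/{id}/state and the job stopped with DELETE /batches/{id}, which matches the server-side lifecycle described earlier.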
The migration was executed incrementally across development, staging, and production environments spanning eight data regions. Each region introduced additional complexity due to network segmentation and compliance constraints. During the transition, Slack identified several issues, including virtual memory enforcement behavior in YARN that had previously been obscured by SSH-based execution, as well as cross-account network connectivity gaps that revealed previously hidden dependencies between services.
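The source does not say which YARN settings were involved. As an assumption, the sketch below models the standard NodeManager virtual memory check, where a container may be killed once its virtual memory exceeds its allocated physical memory multiplied by yarn.nodemanager.vmem-pmem-ratio (2.1 by default in Hadoop). The memory figures are illustrative only. A process launched over SSH runs outside any YARN container, so it would never be subject to this check.

```python
# Hadoop's default for yarn.nodemanager.vmem-pmem-ratio.
DEFAULT_VMEM_PMEM_RATIO = 2.1


def vmem_limit_mb(container_memory_mb: int, ratio: float = DEFAULT_VMEM_PMEM_RATIO) -> float:
    """Virtual memory ceiling for a container with the given physical allocation."""
    return container_memory_mb * ratio


def exceeds_vmem_limit(
    used_vmem_mb: float,
    container_memory_mb: int,
    ratio: float = DEFAULT_VMEM_PMEM_RATIO,
) -> bool:
    """True if the NodeManager's vmem check would flag this container."""
    return used_vmem_mb > vmem_limit_mb(container_memory_mb, ratio)


# Illustrative numbers: a 1024 MB container has a ceiling of roughly 2150 MB.
print(exceeds_vmem_limit(3000, 1024))  # above the ceiling
print(exceeds_vmem_limit(2000, 1024))  # below the ceiling
```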
Sudip Ghosh, Senior Software Engineer @ Walmart, says:
"This isn't just a security win; it's a massive operational debt payoff. SSH is easy to start with, impossible to scale securely or audit consistently across a large organization."
Slack completed the migration over three quarters without downtime for critical workloads. The company eliminated SSH access across production EMR clusters, improved job reliability through server-side execution tracking in Quarry, and enhanced observability via structured logging and centralized metrics. The REST-based approach reduced coupling between Airflow and EMR and standardized job submission across teams, while also enabling downstream efforts such as Spark on Kubernetes preparation.
The rollout relied on phased operator deprecations and staged validations across environments. Airflow metadata dashboards tracked remaining SSH-dependent workflows, and cross-team coordination helped reduce migration risk. Key lessons included discovering network topology early, validating resource limits across execution models, and communicating clearly with teams as operators were restricted.