This article details Slack's architectural journey in building and scaling its AI platform for Large Language Models (LLMs) across multi-cloud environments. It highlights critical decisions and trade-offs made in phases, from initial AWS SageMaker deployments to Amazon Bedrock for managed services, and ultimately towards a hybrid multi-cloud strategy to address challenges like GPU scarcity, operational overhead, and model freshness. The narrative focuses on ensuring security, reliability, and performance while optimizing cost and engineering efficiency.
Read original on Slack Engineering
Slack faced the complex task of serving LLMs at enterprise scale, demanding high security, reliability, and performance. The goal wasn't just to integrate new models, but to build a resilient system capable of mitigating regional outages and GPU scarcity. This led to an evolutionary path across distinct architectural phases, shifting from reactive infrastructure management to proactive, multi-vendor orchestration.
Initially, AWS SageMaker was chosen for its managed ML serving capabilities, offering essential security, FedRAMP compliance, and model control. A key architectural decision was the implementation of an escrow Virtual Private Cloud (VPC) strategy to ensure a strict zero-knowledge environment, maintaining data privacy while enabling access to provider models. Deploying across multiple AWS regions provided uptime for a global user base, but introduced significant operational overhead related to IAM roles, load balancing, capacity planning, and auto-scaling.
SageMaker Operational Taxes
While SageMaker offered security, it incurred significant operational costs: scaling latency due to slow initialization, hardware scarcity for high-end GPUs like A100/H100, and over-provisioning to meet peak SLAs, leading to wasted engineering cycles.
The migration to Amazon Bedrock was a strategic pivot driven by the need for operational simplicity, immediate access to the latest LLMs, and infrastructure efficiency. Bedrock abstracted away GPU instance management, allowing Slack to plan capacity in terms of throughput (Model Units, or MUs) rather than instances. It offered both Provisioned Throughput (PT) for predictable, latency-sensitive workloads and On-Demand (OD) for bursty, scheduled tasks. The migration itself was a "zero-incident" process that emphasized compliance, extensive load testing for capacity mapping, A/B testing for quality and latency parity, and gradual, feature-flag-controlled rollouts.
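A gradual, feature-flag-controlled rollout of the kind described above can be sketched as deterministic percentage bucketing. This is only an illustration under stated assumptions: the flag name `bedrock_migration`, the use of a workspace ID as the bucketing key, and the function names are all hypothetical and do not come from Slack's implementation.

```python
import hashlib


def rollout_bucket(key: str, flag: str) -> int:
    """Map a stable key (assumed here: a workspace ID) to a bucket in [0, 100)."""
    # Hashing flag + key keeps a key's bucket stable across requests,
    # while different flags get independent bucket assignments.
    digest = hashlib.sha256(f"{flag}:{key}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def choose_backend(key: str, rollout_percent: int,
                   flag: str = "bedrock_migration") -> str:
    """Return 'bedrock' for keys inside the rollout percentage, else 'sagemaker'."""
    return "bedrock" if rollout_bucket(key, flag) < rollout_percent else "sagemaker"


if __name__ == "__main__":
    for pct in (0, 10, 50, 100):
        print(pct, choose_backend("workspace-123", pct))
```

Because the bucket depends only on the key and the flag, raising the percentage only ever moves more keys to the new backend, and setting it back to 0 returns every key to the old one.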
Despite the benefits of PT, persistent over-provisioning during off-peak hours and commitment lock-in (slowing model upgrades) remained challenges. Transitioning to Bedrock On-Demand resolved these by aligning compute costs with variable usage patterns and enabling rapid model switching. A Hybrid Routing strategy was adopted: high-volume, latency-sensitive features remained on PT for consistent performance, while asynchronous, bursty workloads moved to OD. A crucial Spillover Pattern was engineered to automatically route excess requests to on-demand endpoints during surges, preventing request drops.
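The hybrid routing and spillover behavior described above might look roughly like the following sketch. It assumes, purely for illustration, that PT capacity can be approximated by a cap on concurrent requests; real Bedrock capacity is expressed in model units and throttling signals, and the class and endpoint labels here are hypothetical rather than Slack's.

```python
from dataclasses import dataclass


@dataclass
class SpilloverRouter:
    """Route latency-sensitive traffic to PT, spilling to On-Demand when full."""
    pt_capacity: int       # assumption: max concurrent requests PT can absorb
    pt_in_flight: int = 0  # requests currently being served by PT

    def acquire(self, latency_sensitive: bool) -> str:
        """Pick an endpoint for one request and reserve PT capacity if used."""
        if not latency_sensitive:
            # Asynchronous, bursty workloads go straight to On-Demand.
            return "on_demand"
        if self.pt_in_flight < self.pt_capacity:
            self.pt_in_flight += 1
            return "provisioned"
        # Spillover: PT is saturated, so route the excess request to
        # On-Demand instead of dropping it.
        return "on_demand"

    def release(self, endpoint: str) -> None:
        """Free PT capacity once a provisioned request completes."""
        if endpoint == "provisioned" and self.pt_in_flight > 0:
            self.pt_in_flight -= 1


if __name__ == "__main__":
    router = SpilloverRouter(pt_capacity=2)
    print([router.acquire(latency_sensitive=True) for _ in range(3)])
```

A production router would also need thread-safe accounting and a real saturation signal (for example, throttling responses from the provisioned endpoint); the counter here only stands in for that signal.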
To mitigate risks associated with On-Demand (service level variability, regional capacity orchestration, concentration risk), Slack developed an intelligent AI Platform abstraction with a model hierarchy. This system automatically falls back to different models or reroutes requests to healthy endpoints in other regions if a primary model degrades, ensuring a seamless user experience. This resilience strategy sets the stage for a more robust multi-cloud architecture, further reducing reliance on a single provider and enhancing fault tolerance.
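One way to picture the model hierarchy is as an ordered list of (model, region) targets that is walked until a call succeeds. The sketch below is a minimal illustration, not Slack's platform code: the `invoke` and `is_healthy` callables, the model names, and the regions in the usage example are placeholders supplied by the caller.

```python
from typing import Callable, Iterable, List, Tuple

Target = Tuple[str, str]  # (model_name, region), both hypothetical labels


class AllTargetsFailed(Exception):
    """Raised when every target in the hierarchy was skipped or failed."""


def invoke_with_fallback(
    prompt: str,
    hierarchy: Iterable[Target],
    invoke: Callable[[str, str, str], str],
    is_healthy: Callable[[str, str], bool] = lambda model, region: True,
) -> Tuple[Target, str]:
    """Try each (model, region) in priority order; return the first success."""
    errors: List[Tuple[Target, str]] = []
    for model, region in hierarchy:
        if not is_healthy(model, region):
            # Skip targets already marked degraded by health checks.
            errors.append(((model, region), "unhealthy"))
            continue
        try:
            return (model, region), invoke(model, region, prompt)
        except Exception as exc:  # any failure triggers the next fallback
            errors.append(((model, region), repr(exc)))
    raise AllTargetsFailed(errors)


if __name__ == "__main__":
    def fake_invoke(model: str, region: str, prompt: str) -> str:
        # Simulate an outage in one region for demonstration purposes.
        if region == "region-a":
            raise TimeoutError("simulated regional degradation")
        return f"{model}@{region}: {prompt}"

    hierarchy = [("primary-model", "region-a"),
                 ("primary-model", "region-b"),
                 ("fallback-model", "region-a")]
    print(invoke_with_fallback("hello", hierarchy, fake_invoke))
```

Ordering the hierarchy as "same model, other region" before "different model" keeps output quality stable during a regional incident; the opposite ordering would favor locality over model consistency.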