InfoQ Architecture · June 25, 2026

Slack's Multi-Cloud AI Serving Platform Evolution

Slack moved its AI serving platform through four phases, from self-managed Amazon SageMaker to a multi-cloud architecture built on Amazon Bedrock and Google Cloud Vertex AI. Along the way it addressed operational overhead, single-provider dependency, and traffic variability, and it reports better model quality, lower latency, and improved resilience.

Slack's evolution of its AI serving platform provides a compelling case study in managing distributed AI workloads, highlighting critical system design considerations for organizations adopting large language models (LLMs). The architectural journey moved from a tightly coupled, single-cloud setup to a more flexible and resilient multi-cloud environment, driven by needs for reduced operational burden, improved performance, and strategic independence from a single vendor.

Phase 1: Self-Managed SageMaker

Initially, Slack ran its AI serving platform on Amazon SageMaker within an escrow VPC. This offered strong isolation but carried significant operational overhead: system architects had to manually forecast capacity, schedule cluster expansions, and plan ahead for scarce GPU resources (A100/H100). The approach was prone to capacity shortfalls and infrastructure issues, which directly affected customer experience because millions of users rely on AI-powered features. This phase underscores the trade-off between control and operational complexity.

Phase 2 & 3: Migrating to Amazon Bedrock

To alleviate the operational burden, Slack migrated to Amazon Bedrock, a managed service. This move effectively eliminated infrastructure management overhead, allowed faster access to newer Anthropic models, and reduced feature lag. Engineers could now focus on model performance and product quality rather than GPU reservations. The migration involved compliance reviews, extensive load testing, and feature-flag-driven rollouts, achieving the transition without customer-facing incidents. This highlights the value of managed services for reducing operational toil and accelerating feature delivery.
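A feature-flag-driven rollout of this kind is often implemented as a deterministic percentage ramp. The sketch below is a minimal illustration, not Slack's implementation: the flag name `bedrock-migration`, the backend labels, and the choice of a workspace-style `unit_id` as the bucketing key are all assumptions made for the example.

```python
import hashlib


def rollout_bucket(flag: str, unit_id: str) -> int:
    """Map a (flag, unit) pair to a stable bucket in the range [0, 100)."""
    digest = hashlib.sha256(f"{flag}:{unit_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(flag: str, unit_id: str, percent: int) -> bool:
    """Return True when the unit falls inside the current rollout percentage."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return rollout_bucket(flag, unit_id) < percent


def choose_backend(unit_id: str, percent: int) -> str:
    """Pick the serving backend for one unit (hypothetical flag and labels)."""
    if is_enabled("bedrock-migration", unit_id, percent):
        return "bedrock"
    return "sagemaker"
```

Because the bucket depends only on the flag name and the unit identifier, a unit that is moved to the new backend at 10% stays there as the percentage is raised, and setting the percentage back to 0 acts as an immediate rollback.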

Addressing Traffic Variability

AI workloads often exhibit high variability, with Slack reporting 10x fluctuations between peak and off-peak times. To handle this, they combined Bedrock's Provisioned Throughput (PT) for lower-latency interactive traffic and On-Demand offerings for bursty background workloads. This hybrid capacity model is a critical pattern for cost-effectively scaling AI inference services that experience unpredictable load.
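The hybrid capacity model can be sketched as a small admission policy. This is an illustrative simplification under stated assumptions: `CapacityRouter`, `provisioned_slots`, and the spill-over of excess interactive traffic to on-demand are hypothetical, and real provisioned capacity is not sized in concurrent request slots.

```python
from dataclasses import dataclass
from enum import Enum


class TrafficClass(Enum):
    INTERACTIVE = "interactive"
    BACKGROUND = "background"


class Pool(Enum):
    PROVISIONED = "provisioned"
    ON_DEMAND = "on_demand"


@dataclass
class CapacityRouter:
    # Hypothetical stand-in for reserved capacity: concurrent requests
    # the provisioned pool is assumed to absorb.
    provisioned_slots: int
    in_flight: int = 0

    def acquire(self, traffic: TrafficClass) -> Pool:
        """Choose a pool for one request and reserve a slot if provisioned."""
        if (
            traffic is TrafficClass.INTERACTIVE
            and self.in_flight < self.provisioned_slots
        ):
            self.in_flight += 1
            return Pool.PROVISIONED
        # Background work, and interactive overflow (an assumption of this
        # sketch), goes to the on-demand pool.
        return Pool.ON_DEMAND

    def release(self, pool: Pool) -> None:
        """Free a provisioned slot once the request has completed."""
        if pool is Pool.PROVISIONED and self.in_flight > 0:
            self.in_flight -= 1
```

A caller would pair each `acquire` with a `release` once the request completes, so the reserved capacity stays dedicated to latency-sensitive requests while bursts land on the on-demand pool.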

Phase 4: Multi-Cloud Strategy with Google Cloud Vertex AI

Despite the benefits of Bedrock, relying on a single cloud provider remained a concern, both for resilience and for access to a wider range of models. Slack therefore adopted a multi-cloud strategy and integrated Google Cloud Vertex AI. The key architectural decision was to build a provider-agnostic serving layer that provides secretless authentication, API normalization, unified observability, and intelligent routing. Endpoints are continuously monitored for metrics such as time-to-first-token and error rate, so traffic can be redirected dynamically away from degraded services. The layer also supports A/B testing and controlled model rollouts. This phase shows how a vendor-agnostic design can deliver higher resilience, broader model access, and reduced lock-in.
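Health-based routing on time-to-first-token and error rate might look roughly like the following. It is a minimal sketch rather than Slack's routing logic: the `EndpointHealth` class, the sliding-window size, and the thresholds `max_error_rate` and `max_ttft_ms` are assumptions chosen for illustration.

```python
from collections import deque
from statistics import mean
from typing import List


class EndpointHealth:
    """Sliding window of recent request outcomes for one provider endpoint."""

    def __init__(self, name: str, window: int = 50) -> None:
        self.name = name
        # Each sample is (time_to_first_token_ms, succeeded).
        self.samples = deque(maxlen=window)

    def record(self, ttft_ms: float, ok: bool) -> None:
        self.samples.append((ttft_ms, ok))

    def error_rate(self) -> float:
        if not self.samples:
            return 0.0
        failures = sum(1 for _, ok in self.samples if not ok)
        return failures / len(self.samples)

    def mean_ttft_ms(self) -> float:
        # Only successful requests contribute a meaningful latency sample.
        latencies = [ttft for ttft, ok in self.samples if ok]
        return mean(latencies) if latencies else 0.0


def pick_endpoint(
    endpoints: List[EndpointHealth],
    max_error_rate: float = 0.05,   # hypothetical threshold
    max_ttft_ms: float = 1500.0,    # hypothetical threshold
) -> EndpointHealth:
    """Prefer healthy endpoints; fall back to the least degraded one."""
    if not endpoints:
        raise ValueError("no endpoints configured")
    healthy = [
        e for e in endpoints
        if e.error_rate() <= max_error_rate and e.mean_ttft_ms() <= max_ttft_ms
    ]
    # Graceful degradation: if nothing is healthy, still serve from the
    # endpoint with the lowest error rate, then the lowest latency.
    candidates = healthy or endpoints
    return min(candidates, key=lambda e: (e.error_rate(), e.mean_ttft_ms()))
```

In this sketch the router never refuses to pick an endpoint. A production layer would more likely add hysteresis or gradual traffic shifting so that a brief latency spike does not cause traffic to flap between providers.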

Improved Resilience: Geographic failover capabilities and reduced dependence on a single platform.
Enhanced Performance: A 10% quality improvement on complex workloads and a 67% latency reduction for short prompts.
Flexibility and Model Access: Broader access to foundation models across different cloud ecosystems.
Operational Efficiency: Engineers can focus on higher-value tasks like model performance and product innovation.


Architecture Design

Design this yourself

Design a highly available and scalable multi-cloud AI serving platform capable of routing interactive and batch inference traffic across multiple LLM providers (e.g., AWS Bedrock, Google Cloud Vertex AI). Focus on the core components required for provider abstraction, intelligent routing, unified observability, secretless authentication, and graceful degradation.

Other design angles

· Design a single-cloud AI inference platform that optimizes for cost and performance while handling variable workloads using a hybrid managed/provisioned capacity model.
· Design an API gateway specifically for LLM services, focusing on features like rate limiting, caching, and dynamic routing to different model endpoints and providers.
· Architect a system to perform A/B testing and canary deployments for LLM models across multiple cloud providers, including metrics collection and rollback mechanisms.