This article details ProGlove's journey in scaling a multi-account, serverless SaaS platform to over a million AWS Lambda functions. It highlights critical architectural decisions, challenges, and lessons learned across various growth phases, focusing on multi-tenancy, cost optimization, and deployment strategies in a serverless environment.
Read original on AWS Architecture Blog
ProGlove's Insight platform, a serverless SaaS solution, scaled from a few dozen to over a million AWS Lambda functions across thousands of AWS accounts. This growth exposed significant architectural and operational challenges and led to refined strategies for multi-tenancy, cost management, and deployment automation. The core architecture follows a one-AWS-account-per-tenant model, which offers strong security boundaries, clear ownership, and transparent cost attribution. Crucially, it also lets dedicated tenant resources scale to zero when a tenant is idle.
Each microservice follows a consistent structure: 5-15 Lambda functions orchestrated by AWS Step Functions, with Amazon EventBridge for event routing and Amazon DynamoDB as the primary data store. These components are bundled into dedicated AWS CloudFormation stacks. Initially, AWS CloudFormation StackSets were used to roll out infrastructure updates in parallel across tenant accounts. StackSets worked well at smaller scales but hit performance ceilings as the platform approached a million functions. The team considered building a custom deployment engine before the AWS CloudFormation service teams addressed the bottlenecks.
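To make the StackSets rollout more concrete, the following is a minimal sketch of the request parameters for a parallel stack-instance update, assuming the boto3 CloudFormation `update_stack_instances` call. The function name `build_stack_instance_update`, the stack set name, the account IDs, and the concurrency percentages are hypothetical and do not come from the original article. The sketch only builds the request dictionary and makes no AWS call.

```python
def build_stack_instance_update(stack_set_name, account_ids, regions,
                                max_concurrent_pct=10, failure_tolerance_pct=5):
    """Build keyword arguments for CloudFormation's update_stack_instances.

    The percentages are illustrative assumptions, not values from the article.
    """
    if not account_ids:
        raise ValueError("at least one tenant account ID is required")
    return {
        "StackSetName": stack_set_name,
        "Accounts": list(account_ids),
        "Regions": list(regions),
        "OperationPreferences": {
            # Share of target accounts updated at the same time.
            "MaxConcurrentPercentage": max_concurrent_pct,
            # Share of accounts allowed to fail before the operation stops.
            "FailureTolerancePercentage": failure_tolerance_pct,
            # Deploy to all listed regions in parallel rather than in sequence.
            "RegionConcurrencyType": "PARALLEL",
        },
    }


# Hypothetical microservice stack set and tenant accounts.
request = build_stack_instance_update(
    "orders-service",
    ["111111111111", "222222222222"],
    ["eu-central-1"],
)
```

With boto3 installed and credentials configured, the dictionary could be passed as `boto3.client("cloudformation").update_stack_instances(**request)`.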
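Traditional serverless best practices, such as placing an SQS queue between EventBridge and Lambda for resilience, proved costly at extreme scale. Lambda polls the queue continuously, and each poll is a billed SQS request, so even an idle tenant incurs charges. Multiplied across thousands of tenant accounts, those idle charges add up. To achieve true scale-to-zero, SQS was removed from this path. Without a buffering queue, safety depends on monitoring the Lambda metrics `AsyncEventsDropped` and `ConcurrentExecutions`.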
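The monitoring described above could be expressed as CloudWatch alarms. The following is a minimal sketch, assuming the parameter shape of the boto3 CloudWatch `put_metric_alarm` call. The function names, the alarm naming scheme, the 60-second period, and the concurrency threshold are hypothetical choices and are not taken from the original article. The sketch only builds the alarm definitions and makes no AWS call.

```python
def build_alarm(function_name, metric_name, statistic, threshold):
    """Build keyword arguments for CloudWatch's put_metric_alarm."""
    return {
        "AlarmName": f"{function_name}-{metric_name}",
        "Namespace": "AWS/Lambda",
        "MetricName": metric_name,
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": statistic,
        "Period": 60,              # seconds; an illustrative choice
        "EvaluationPeriods": 1,
        "Threshold": float(threshold),
        "ComparisonOperator": "GreaterThanThreshold",
        # An idle, scaled-to-zero function emits no data points; treat that as healthy.
        "TreatMissingData": "notBreaching",
    }


def build_safety_alarms(function_name, concurrency_threshold):
    """Alarm definitions for the two metrics named in the text."""
    return [
        # Any dropped asynchronous event is a lost message, so alarm above zero.
        build_alarm(function_name, "AsyncEventsDropped", "Sum", 0),
        # Alarm before concurrency nears a limit (threshold is an assumption).
        build_alarm(function_name, "ConcurrentExecutions", "Maximum",
                    concurrency_threshold),
    ]


# Hypothetical function name and threshold.
alarms = build_safety_alarms("ingest-handler", 800)
```

Each dictionary could be passed to `boto3.client("cloudwatch").put_metric_alarm(**alarm)`. Treating missing data as not breaching keeps idle tenants from raising false alarms.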
Optimizing Dead Letter Queues (DLQs) for Cost
The "centralized DLQ" pattern emerged as a cost-effective alternative to maintaining an individual DLQ per queue: failures from multiple tenants are routed to a single Dead Letter Queue for recovery. Because events from different tenants converge in one queue, this pattern requires strict discipline to preserve data isolation. The AWS account ID carried in each event is treated as the tenant ID, so every recovered message can be attributed to exactly one tenant.
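The isolation rule for a centralized DLQ can be illustrated with a minimal sketch that partitions failed messages by tenant before any recovery step runs. It assumes each message body is a JSON event in the EventBridge envelope format, whose top-level `account` field holds the 12-digit AWS account ID. The function name `group_by_tenant` and the sample events are hypothetical and not taken from the original article.

```python
import json
from collections import defaultdict


def group_by_tenant(message_bodies):
    """Partition DLQ message bodies by the AWS account ID in the event envelope.

    Returns (per_tenant, unattributed). Messages that cannot be tied to an
    account are set aside instead of being guessed into a tenant bucket.
    """
    per_tenant = defaultdict(list)
    unattributed = []
    for body in message_bodies:
        try:
            event = json.loads(body)
        except json.JSONDecodeError:
            unattributed.append(body)
            continue
        account = event.get("account") if isinstance(event, dict) else None
        # AWS account IDs are 12-digit strings.
        if (isinstance(account, str) and len(account) == 12
                and account.isascii() and account.isdigit()):
            per_tenant[account].append(event)
        else:
            unattributed.append(body)
    return dict(per_tenant), unattributed


# Hypothetical failed events from two tenants, plus two unusable messages.
bodies = [
    json.dumps({"account": "111111111111", "detail": {"n": 1}}),
    json.dumps({"account": "222222222222", "detail": {"n": 2}}),
    json.dumps({"account": "111111111111", "detail": {"n": 3}}),
    "not json",
    json.dumps({"detail": {}}),
]
per_tenant, unattributed = group_by_tenant(bodies)
```

Setting unattributable messages aside, rather than assigning them a default tenant, is a deliberate choice here: a replay step that only processes `per_tenant` buckets cannot leak one tenant's event into another tenant's account.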