This article details Form3's experience in implementing multi-cloud architectures for a high-volume payments platform, driven by regulatory pressure and customer demands. It highlights the technical challenges and architectural decisions made to achieve active/active/active across AWS, GCP, and Azure, including cloud-agnostic technology choices and custom operators for distributed consistency. The article also presents a cautionary tale, illustrating how specific market requirements and latency constraints led them to a simpler active-standby model for the US market, underscoring that multi-cloud is not a universal solution.
Read original on InfoQ Architecture

Form3, a UK payments platform processing billions of pounds annually, embarked on a multi-cloud journey in response to regulatory concerns about cloud concentration risk and customer mandates. Their initial architecture was tightly coupled to AWS, but new requirements forced a re-evaluation and a shift towards a more resilient, distributed setup.
For the UK market, Form3 engineered a V2 platform designed for active/active/active operation across AWS, Google Cloud, and Azure. Key architectural decisions included:
Achieving true multi-cloud consistency presented significant engineering hurdles:
Key Takeaways for Active/Active Multi-Cloud
Form3's success in the UK relied on three pillars: using cloud-agnostic technology, maintaining single logical data stores across clouds, and treating each cloud provider as an availability zone. This approach enabled them to continue processing payments seamlessly during a major Google Cloud outage.
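The idea of treating each cloud provider as an availability zone can be illustrated with a majority-quorum check. The sketch below is a simplified model, not Form3's implementation: it assumes one replica per cloud and a strict-majority rule, and the names `has_quorum` and `layout` are hypothetical.

```python
def has_quorum(replicas_per_cloud: dict[str, int], failed_clouds: set[str]) -> bool:
    """Return True if the surviving replicas still form a strict majority."""
    total = sum(replicas_per_cloud.values())
    alive = sum(
        count
        for cloud, count in replicas_per_cloud.items()
        if cloud not in failed_clouds
    )
    return alive > total // 2


# Assumed layout: one replica in each of the three clouds.
layout = {"aws": 1, "gcp": 1, "azure": 1}

print(has_quorum(layout, {"gcp"}))           # True: two of three replicas remain
print(has_quorum(layout, {"gcp", "azure"}))  # False: one of three is not a majority
```

Under this model, losing any single provider leaves two of three replicas, which is why a single-cloud outage need not interrupt processing.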
When expanding to the US market, Form3 discovered that their sophisticated triple-active setup was not suitable. US customers prioritized geographical resilience (East Coast primary, West Coast DR) and found the multi-cloud concept unfamiliar. Latency was the critical constraint: spreading a CockroachDB quorum across the continent would add enough write latency to violate SLAs. This led to a pragmatic shift towards a simpler active-standby architecture using AWS (East Coast) and GCP (West Coast), relying on backup-and-restore for disaster recovery rather than real-time replication. They are actively working to enhance this with logical replication for CockroachDB and NATS event streams to improve recovery time objectives (RTOs).
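The latency argument can be made concrete with a rough model of a quorum write. The sketch below assumes a single consensus round in which the leader counts its own vote and waits for the fastest followers until a majority has acknowledged. The round-trip times are illustrative assumptions, not measurements from Form3, and the function name is hypothetical. Real CockroachDB latency also depends on factors such as leaseholder placement, which this model ignores.

```python
def quorum_commit_latency_ms(follower_rtts_ms: list[float]) -> float:
    """Estimate commit latency for one majority-quorum write.

    The leader's own vote counts toward the majority, so it needs
    acknowledgements from only the fastest (n // 2) followers.
    """
    n = len(follower_rtts_ms) + 1   # followers plus the leader
    acks_needed = n // 2            # follower acks required for a majority
    if acks_needed == 0:
        return 0.0                  # single-node case: no network round trip
    return sorted(follower_rtts_ms)[acks_needed - 1]


# Assumed RTTs: three clouds close together within one country.
nearby = [2.0, 3.0]
# Assumed RTTs: leader on the East Coast, followers mid-continent and West Coast.
continental = [35.0, 65.0]

print(quorum_commit_latency_ms(nearby))       # 2.0
print(quorum_commit_latency_ms(continental))  # 35.0
```

With these assumed numbers, every write in the continental layout pays tens of milliseconds for the nearest remote acknowledgement, which is the kind of overhead that makes an active-standby design with asynchronous recovery more attractive.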