InfoQ Architecture · March 16, 2026

Multi-Cloud Strategy for Payments: Form3's Journey and Trade-offs

This article details Form3's experience implementing a multi-cloud architecture for a high-volume payments platform, driven by regulatory pressure and customer demands. It highlights the technical challenges and architectural decisions behind achieving active/active/active operation across AWS, GCP, and Azure, including cloud-agnostic technology choices and custom operators for distributed consistency. The article also presents a cautionary tale: specific market requirements and latency constraints led Form3 to a simpler active-standby model for the US market, underscoring that multi-cloud is not a universal solution.


Form3, a UK payments platform processing billions of pounds annually, embarked on a multi-cloud journey in response to regulatory concerns about cloud concentration risk and customer mandates. Their initial architecture was tightly coupled to AWS, but new requirements forced a re-evaluation and a shift towards a more resilient, distributed setup.

Active/Active/Active Multi-Cloud in the UK

For the UK market, Form3 engineered a V2 platform designed for active/active/active operation across AWS, Google Cloud, and Azure. Key architectural decisions included:

  • Kubernetes: Independent clusters in each cloud.
  • NATS JetStream: Chosen as a cross-cloud message broker due to its ability to form a single logical cluster spanning multiple environments.
  • CockroachDB: Selected for distributed data storage, also capable of operating as a single logical cluster across all three clouds.
  • Go Microservices: Migration from Java to Go for smaller footprints and improved readability.
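The article does not show Form3's actual configuration, but the "single logical cluster spanning multiple environments" property of NATS comes from its gateway mechanism, which federates independently running clusters into a supercluster. A minimal sketch of a gateway block for the AWS-side servers (names and hostnames are hypothetical, not Form3's):

```
# Gateway section of an AWS-side NATS server config (illustrative).
# Each cloud's cluster declares its own gateway name and the remote
# gateways it connects to, forming one logical supercluster.
gateway {
  name: "aws"
  port: 7222
  gateways: [
    { name: "gcp",   url: "nats://nats.gcp.example.com:7222" },
    { name: "azure", url: "nats://nats.azure.example.com:7222" }
  ]
}
```

The GCP and Azure clusters would carry mirror-image blocks, so any cloud can reach streams and consumers homed in the others.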

Engineering Challenges and Solutions

Achieving true multi-cloud consistency presented significant engineering hurdles:

  • Cross-Cluster DNS: To bootstrap CockroachDB across independent Kubernetes clusters, they developed a custom DNS hack with pseudo-suffixes to route queries between clusters.
  • Distributed Quorum Protection: A custom operator called XPDB (cross-cluster pod disruption budget) was built to enforce disruption limits across all three clouds, crucial for maintaining CockroachDB quorum during maintenance.
  • Node Pool Updates: The Cluster Lifecycle Operator was created to consolidate and streamline node pool updates across multiple clouds, environments, and geographies, addressing a painful day-two operational problem.
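The core decision an operator like XPDB has to make is whether one more pod eviction would break CockroachDB's quorum, which requires a strict majority of each range's replicas. A minimal Go sketch of that quorum check, under the assumption that the operator can count currently unavailable replicas across all three clouds (this is an illustration of the invariant, not Form3's implementation):

```go
package main

import "fmt"

// allowEviction reports whether evicting one more replica-hosting pod
// still leaves a strict majority of replicas available. A standard
// Kubernetes PodDisruptionBudget enforces this per cluster; the point
// of a cross-cluster budget is that currentlyUnavailable is summed
// over all three clouds before deciding.
func allowEviction(replicationFactor, currentlyUnavailable int) bool {
	quorum := replicationFactor/2 + 1
	availableAfter := replicationFactor - currentlyUnavailable - 1
	return availableAfter >= quorum
}

func main() {
	// With 3-way replication, one replica may already be down
	// (e.g. node maintenance in one cloud)...
	fmt.Println(allowEviction(3, 0)) // evicting the first is allowed
	// ...but a concurrent eviction in another cloud must be refused,
	// which a per-cluster budget alone cannot see.
	fmt.Println(allowEviction(3, 1)) // refused: quorum would be lost
}
```

A per-cloud PodDisruptionBudget would happily allow one eviction in each cluster simultaneously; summing unavailability across clouds is what makes the budget safe for a store whose quorum spans them.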

Key Takeaways for Active/Active Multi-Cloud

Form3's success in the UK relied on three pillars: using cloud-agnostic technology, maintaining single logical data stores across clouds, and treating each cloud provider as an availability zone. This approach enabled them to continue processing payments seamlessly during a major Google Cloud outage.

When Multi-Cloud Doesn't Fit: The US Market Experience

When expanding to the US market, Form3 discovered that their sophisticated triple-active setup was not suitable. US customers prioritized geographical resilience (East Coast primary, West Coast DR) and found the multi-cloud concept unfamiliar. Latency was a critical constraint; spreading CockroachDB quorum across the continent would violate SLAs due to increased write latency. This led to a pragmatic shift towards a simpler active-standby architecture using AWS (East Coast) and GCP (West Coast), relying on backup-and-restore for disaster recovery rather than real-time replication. They are actively working to enhance this with logical replication for CockroachDB and NATS event streams to improve RTOs.
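The article does not detail the backup-and-restore mechanics, but in CockroachDB this style of DR is typically built on `BACKUP` and `RESTORE` statements against object storage. A sketch under that assumption (database name and bucket URL are illustrative, not Form3's):

```sql
-- Periodic full backup of the payments database from the East Coast
-- primary to cross-region object storage (URL is hypothetical).
-- AS OF SYSTEM TIME slightly in the past reduces contention with
-- foreground traffic.
BACKUP DATABASE payments
  INTO 's3://dr-backups/payments?AUTH=implicit'
  AS OF SYSTEM TIME '-10s';

-- In a DR event, the West Coast standby restores the most recent
-- backup; RPO is bounded by the backup interval, and RTO by restore
-- time -- the gap logical replication is meant to close.
RESTORE DATABASE payments
  FROM LATEST IN 's3://dr-backups/payments?AUTH=implicit';
```

This makes the trade-off concrete: backup-and-restore keeps write latency local to the East Coast, at the cost of recovery objectives measured in backup intervals rather than seconds.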

multi-cloud · active-active · active-standby · kubernetes · cockroachdb · nats-jetstream · payments-platform · disaster-recovery
