This article explores the cell-based architecture pattern employed by American Express for its core payments ecosystem. It details how this approach enhances resiliency, reduces latency, and improves scalability by isolating failures and localizing data and processing. Key principles include independent deployable units, deterministic routing, and strict boundary enforcement.
Read original on Hacker NewsCell-based architecture is a powerful pattern in cloud-native distributed systems, especially for mission-critical applications like payment processing. The core idea involves grouping related microservices, databases, and other components into independent, self-contained instances called 'cells'. Each cell is designed to function autonomously, minimizing dependencies on other cells. This design significantly reduces the 'blast radius' of failures; if one cell experiences issues, the impact is isolated, preventing a system-wide outage. While it introduces some architectural complexity and management overhead, the benefits for high-availability systems often outweigh these costs.
Cell Analogy
Think of a cell as a miniature, self-sufficient instance of your entire application stack, capable of handling a subset of traffic independently. This isolation is key to resilience.
Effective data management is crucial for cell independence. Static and semi-static reference data (e.g., currency rates) is replicated to each cell asynchronously, ensuring local availability and avoiding synchronous lookups during transaction processing. For dynamic data that changes frequently, deterministic routing is employed. A Global Transaction Router directs transactions to the specific cell that holds the authoritative, most up-to-date state for that transaction, ensuring strong consistency without requiring complex, real-time cross-cell synchronization in the critical path. Asynchronous replication handles data synchronization for failover purposes outside of the transaction processing flow.
The Global Transaction Router is central to enforcing cell boundaries. All ingress and egress traffic between cells must pass through this router, preventing direct peer-to-peer communication. This strict control avoids strong cross-cell dependencies, even if it sometimes leads to duplicated services. Observability data (logs, metrics, traces) is also initially localized within each cell before asynchronous aggregation for global dashboards, ensuring that a partial observability stack failure only impacts a single cell.
In a cell-based architecture, when a cell fails, transactions in progress are not resumed across cells. Instead, they are rerouted to a healthy cell and restarted from the beginning with the original transaction data. This strategy is viable as long as the transaction has not reached a 'point of no return' (e.g., being sent to an external system). To manage potential duplicates from retries and reroutes, unique transaction identifiers are used. Downstream systems leverage these identifiers to achieve idempotency, detecting and suppressing duplicate requests, thus maintaining data consistency and allowing for safe restarts and retries without introducing inconsistencies.