This article outlines an architectural approach for achieving cyber resilience on AWS, specifically focusing on recovery from ransomware and destructive events. It details a multi-account strategy that isolates recovery environments and backups from production, ensuring that compromised credentials or infrastructure do not jeopardize the ability to restore. Key elements include logically air-gapped backup vaults, a robust validation pipeline for restored data, and a framework for selecting safe recovery points.
Read original on AWS Architecture Blog
Cyber resilience goes beyond prevention and detection, focusing on the ability to recover workloads to a known-good state even when production environments, credentials, or backups are compromised. This involves architectural isolation and stringent controls to ensure recovery capabilities remain intact during a breach.
A core principle of cyber resilience on AWS is isolating recovery resources from the production environment to prevent a breach in one from affecting the other. This is typically achieved using separate AWS accounts within an AWS Organization, each defining a distinct trust boundary.
Logical Air Gap Explained
The AWS Backup logically air-gapped vault ensures deletion protection by storing recovery points in AWS service-owned accounts. The vault object in your Recovery Account acts as the governance and access boundary, where sharing and restore authorization (including Multi-party approval) are configured. This separation makes the air-gap *logical* rather than purely network-based, enforcing immutability through service-level controls rather than physical disconnect.
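As a rough illustration of how such a vault might be provisioned in the Recovery Account, the sketch below builds the request for the boto3 `create_logically_air_gapped_backup_vault` call. The vault name and retention values are hypothetical placeholders, and the helper names `build_vault_request` and `create_vault` are assumptions made for this example rather than anything from the original article.

```python
from typing import Any


def build_vault_request(name: str, min_days: int, max_days: int) -> dict[str, Any]:
    """Build keyword arguments for creating a logically air-gapped vault.

    Only a basic sanity check is done here; the service enforces its own
    bounds on the retention values.
    """
    if min_days <= 0 or max_days < min_days:
        raise ValueError("require 0 < min_days <= max_days")
    return {
        "BackupVaultName": name,
        "MinRetentionDays": min_days,
        "MaxRetentionDays": max_days,
    }


def create_vault(request: dict[str, Any]) -> str:
    """Create the vault using the active credentials and return its ARN.

    Assumes the credentials belong to the Recovery Account, not production.
    """
    import boto3  # imported lazily so the builder works without the AWS SDK

    client = boto3.client("backup")
    response = client.create_logically_air_gapped_backup_vault(**request)
    return response["BackupVaultArn"]


# Hypothetical values, shown only to illustrate the request shape.
request = build_vault_request("recovery-lag-vault", 14, 90)
```

Keeping the request builder separate from the API call makes it possible to review or test the retention settings without touching an AWS account.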
A restore confirms readability, but validation confirms usability and safety. A multi-layered validation pipeline operates within the isolated recovery environment (IRE) to detect threats in restored data before it reaches production. This includes:
| Layer | Capability | What it provides |
|---|---|---|
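One way to picture a layered validation pipeline is as an ordered list of checks where any failure blocks promotion to production. The sketch below is a minimal illustration under that assumption; the two example layers (`integrity_check`, `malware_scan`) and the dictionary fields they read are hypothetical stand-ins, not the layers used in the original article.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class CheckResult:
    layer: str
    passed: bool
    detail: str = ""


# A check receives a description of the restored data and reports pass/fail.
Check = Callable[[dict], CheckResult]


def run_validation(restored: dict, checks: list[Check]) -> tuple[bool, list[CheckResult]]:
    """Run checks in order and stop at the first failure."""
    results: list[CheckResult] = []
    for check in checks:
        result = check(restored)
        results.append(result)
        if not result.passed:
            return False, results  # do not promote; later layers are skipped
    return True, results


# Hypothetical layer: compare a recorded checksum with the expected one.
def integrity_check(restored: dict) -> CheckResult:
    ok = restored.get("checksum") == restored.get("expected_checksum")
    return CheckResult("integrity", ok, "" if ok else "checksum mismatch")


# Hypothetical layer: fail if a scanner reported any findings.
def malware_scan(restored: dict) -> CheckResult:
    findings = restored.get("malware_findings", [])
    return CheckResult("malware", not findings, f"{len(findings)} finding(s)")
```

Stopping at the first failed layer keeps the default outcome conservative: restored data is promoted only when every layer has passed.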
Selecting a safe recovery point is crucial. While the most recent backup is typical for operational recovery, cyber events require identifying the most recent *safe* copy. This means evaluating recovery point candidates against a 'compromise boundary' to avoid restoring data that might still contain the threat, often involving log and audit review across the backup window.
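The selection rule can be sketched as a filter over candidate recovery points: keep those created before the compromise boundary, then take the newest. The `RecoveryPoint` type, its fields, and the sample identifiers below are hypothetical; in practice the boundary would come from the log and audit review, and the chosen point would still go through validation.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class RecoveryPoint:
    point_id: str
    created_at: datetime


def latest_safe_point(
    points: list[RecoveryPoint], compromise_boundary: datetime
) -> Optional[RecoveryPoint]:
    """Return the newest point created strictly before the boundary, if any."""
    safe = [p for p in points if p.created_at < compromise_boundary]
    return max(safe, key=lambda p: p.created_at, default=None)


# Hypothetical candidates and boundary, for illustration only.
candidates = [
    RecoveryPoint("rp-1", datetime(2024, 5, 1, tzinfo=timezone.utc)),
    RecoveryPoint("rp-2", datetime(2024, 5, 8, tzinfo=timezone.utc)),
    RecoveryPoint("rp-3", datetime(2024, 5, 15, tzinfo=timezone.utc)),
]
boundary = datetime(2024, 5, 10, tzinfo=timezone.utc)
chosen = latest_safe_point(candidates, boundary)
```

Here `rp-3` is the most recent backup but postdates the boundary, so `rp-2` is chosen. Predating the boundary is a necessary condition rather than proof of safety, which is why the result is a candidate for validation and not a final answer.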