AWS Architecture Blog · May 20, 2026

Designing Cyber-Resilient Architectures on AWS for Ransomware Recovery

This article outlines an architectural approach for achieving cyber resilience on AWS, specifically focusing on recovery from ransomware and destructive events. It details a multi-account strategy that isolates recovery environments and backups from production, ensuring that compromised credentials or infrastructure do not jeopardize the ability to restore. Key elements include logically air-gapped backup vaults, a robust validation pipeline for restored data, and a framework for selecting safe recovery points.

Cyber resilience goes beyond prevention and detection, focusing on the ability to recover workloads to a known-good state even when production environments, credentials, or backups are compromised. This involves architectural isolation and stringent controls to ensure recovery capabilities remain intact during a breach.

Architectural Isolation: The Three-Account Model

A core principle of cyber resilience on AWS is isolating recovery resources from the production environment so that a breach in one cannot spread to the other. This is typically achieved with separate AWS accounts in the same organization in AWS Organizations, each defining a distinct trust boundary:

Production Accounts: Where active workloads run. During a cyber event these accounts are isolated, and no recovery operations take place in them, which avoids re-infection and reliance on an environment that can no longer be trusted.
Recovery Account: Owns the logically air-gapped AWS Backup vault. This account is dedicated to securing backups and controlling restore authorization. Service Control Policies (SCPs) restrict it to backup operations, so compromised production identities cannot alter backup configurations or delete recovery points.
Isolated Recovery Environment (IRE): A separate environment, treated as untrusted, where backups are restored and validated and a new production environment is rebuilt. It has no network or trust relationship with the Production Accounts, which contains any residual threats in restored data. Infrastructure deployed in the IRE reaches AWS APIs through VPC endpoints (AWS PrivateLink), with no internet connectivity and no VPC peering to production.
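
As an illustration of the SCP guardrail described for the Recovery Account, the following is a minimal sketch that builds one possible policy document in Python. It is an assumption about the policy's shape, not the policy from the original post: it denies a short list of destructive AWS Backup actions to every principal except a hypothetical break-glass role (the role name `BackupBreakGlassAdmin` is invented for this example), whereas a real deployment may prefer a stricter allow-list.

```python
import json

# Hypothetical break-glass role; the name is an assumption for illustration only.
BREAK_GLASS_ROLE_ARN = "arn:aws:iam::*:role/BackupBreakGlassAdmin"

# AWS Backup actions that delete recovery points or weaken vault governance.
DESTRUCTIVE_BACKUP_ACTIONS = [
    "backup:DeleteRecoveryPoint",
    "backup:DeleteBackupVault",
    "backup:DeleteBackupVaultAccessPolicy",
    "backup:PutBackupVaultAccessPolicy",
    "backup:UpdateRecoveryPointLifecycle",
]


def build_recovery_account_scp(exempt_role_arn: str = BREAK_GLASS_ROLE_ARN) -> dict:
    """Return an SCP document that denies destructive backup actions
    to every principal except the exempt role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyDestructiveBackupActions",
                "Effect": "Deny",
                "Action": DESTRUCTIVE_BACKUP_ACTIONS,
                "Resource": "*",
                # The deny applies unless the caller matches the exempt role ARN.
                "Condition": {
                    "ArnNotLike": {"aws:PrincipalArn": exempt_role_arn}
                },
            }
        ],
    }


if __name__ == "__main__":
    # Print the policy as JSON, ready to review before attaching it to the account.
    print(json.dumps(build_recovery_account_scp(), indent=2))
```

Because SCPs only filter permissions and never grant them, a policy like this would sit alongside the IAM policies in the Recovery Account rather than replace them.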

Logical Air Gap Explained

The AWS Backup logically air-gapped vault provides deletion protection by storing recovery points in AWS service-owned accounts. The vault object in your Recovery Account acts as the governance and access boundary, where sharing and restore authorization (including Multi-Party Approval) are configured. This separation is what makes the air gap *logical* rather than purely network-based: immutability is enforced through service-level controls instead of a physical disconnect.
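
Creating such a vault in the Recovery Account might look like the sketch below. It wraps the boto3 AWS Backup client call `create_logically_air_gapped_backup_vault`; the vault name and retention values in the usage comment are assumptions, since the original text gives no retention settings. The client is passed in as a parameter so the function can be exercised without AWS credentials.

```python
def create_air_gapped_vault(backup_client, vault_name: str,
                            min_retention_days: int, max_retention_days: int) -> dict:
    """Create a logically air-gapped vault using an AWS Backup client.

    `backup_client` is expected to behave like boto3.client("backup").
    """
    # Reject an inverted retention window before calling the service.
    if min_retention_days > max_retention_days:
        raise ValueError("min_retention_days must not exceed max_retention_days")

    return backup_client.create_logically_air_gapped_backup_vault(
        BackupVaultName=vault_name,
        MinRetentionDays=min_retention_days,
        MaxRetentionDays=max_retention_days,
    )


# Example usage (assumed names and values; requires boto3 and credentials
# for the Recovery Account):
#
#   import boto3
#   create_air_gapped_vault(boto3.client("backup"), "recovery-lag-vault", 7, 35)
```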

Key Controls for Backup Protection

Immutable Backups: AWS-native backup mechanisms (e.g., EBS snapshots, RDS snapshots) cannot be modified after creation, but on their own they can still be deleted. The logically air-gapped vault adds deletion protection.
Multi-Party Approval (MPA): Configured via IAM Identity Center, MPA requires multiple approvers for a restore operation. This matters most when the source account might be compromised, because no single identity can authorize the restore alone.
AWS Resource Access Manager (AWS RAM): Used to securely share recovery points from the Recovery Account to the IRE for restoration.
Direct Backup to Vault: For fully managed resources (S3, DynamoDB, EFS), backups can be written directly to the logically air-gapped vault as the primary target, which streamlines protection. Other resource types rely on an orchestration path that transfers temporary snapshots into the vault.
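
The idea behind multi-party approval can be sketched as a simple quorum check. The code below is a conceptual model only and does not use the AWS Multi-party approval API; the class, function names, approver names, and ARN are hypothetical. It shows the two properties that matter: only members of the approval team count, and the requester cannot approve their own restore.

```python
from dataclasses import dataclass, field


@dataclass
class RestoreRequest:
    """A pending restore that needs sign-off from several approvers."""
    requester: str
    recovery_point_arn: str
    approvals: set = field(default_factory=set)


def record_approval(request: RestoreRequest, approver: str, approval_team: set) -> None:
    """Record an approval if the approver is eligible; otherwise raise."""
    if approver not in approval_team:
        raise PermissionError(f"{approver} is not on the approval team")
    if approver == request.requester:
        raise PermissionError("the requester cannot approve their own request")
    # A set ignores repeated approvals from the same person.
    request.approvals.add(approver)


def is_authorized(request: RestoreRequest, threshold: int) -> bool:
    """The restore may proceed once enough distinct approvers have signed off."""
    return len(request.approvals) >= threshold


if __name__ == "__main__":
    team = {"alice", "bob", "carol"}
    req = RestoreRequest(requester="alice", recovery_point_arn="arn:aws:backup:example")
    record_approval(req, "bob", team)
    print(is_authorized(req, threshold=2))  # one approval so far
    record_approval(req, "carol", team)
    print(is_authorized(req, threshold=2))  # two distinct approvals
```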

Validation Pipeline and Safe Recovery Point Selection

A restore confirms readability, but validation confirms usability and safety. A multi-layered validation pipeline operates within the IRE to detect threats in restored data before it reaches production. This includes:

Layer	Capability	What it provides

Selecting a safe recovery point is crucial. Operational recovery typically restores the most recent backup, but recovery from a cyber event requires the most recent *safe* copy. Recovery point candidates are evaluated against a 'compromise boundary' so that the restored data does not still contain the threat. Establishing that boundary often involves reviewing logs and audit trails across the backup window.
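
The selection rule can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the `RecoveryPoint` type, its `passed_validation` flag (standing in for the outcome of the validation pipeline in the IRE), and the sample names and timestamps are all hypothetical. A candidate qualifies only if it was created before the compromise boundary and passed validation; among the qualifying candidates, the newest one wins.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable, Optional


@dataclass(frozen=True)
class RecoveryPoint:
    name: str
    created_at: datetime
    passed_validation: bool  # outcome of validation in the IRE (assumed field)


def select_safe_recovery_point(points: Iterable[RecoveryPoint],
                               compromise_boundary: datetime) -> Optional[RecoveryPoint]:
    """Return the newest validated point created before the boundary, or None."""
    candidates = [
        p for p in points
        if p.created_at < compromise_boundary and p.passed_validation
    ]
    return max(candidates, key=lambda p: p.created_at, default=None)


if __name__ == "__main__":
    utc = timezone.utc
    points = [
        RecoveryPoint("rp-monday", datetime(2026, 5, 11, tzinfo=utc), True),
        RecoveryPoint("rp-tuesday", datetime(2026, 5, 12, tzinfo=utc), False),
        RecoveryPoint("rp-wednesday", datetime(2026, 5, 13, tzinfo=utc), True),
        RecoveryPoint("rp-friday", datetime(2026, 5, 15, tzinfo=utc), True),
    ]
    boundary = datetime(2026, 5, 14, tzinfo=utc)  # assumed earliest sign of compromise
    chosen = select_safe_recovery_point(points, boundary)
    print(chosen.name if chosen else "no safe recovery point found")
```

Returning `None` when nothing qualifies keeps the decision explicit: the caller has to widen the backup window or revisit the boundary rather than silently fall back to an unsafe copy.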
