
Blue-Green Deployment

Run two identical production environments: instant rollback, zero-downtime deployment, database migration challenges, and traffic switching strategies.


What Is Blue-Green Deployment?

Blue-green deployment maintains two identical production environments — called Blue and Green — at all times. At any given moment, only one environment is live (serving production traffic), while the other is idle and available for the next deployment. You deploy the new version to the idle environment, run tests, then flip a router or load balancer to send 100% of traffic to the new environment in a single, near-instant cut.

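The key insight is that the traffic switch is fast and reversible. If the new version has a critical bug, you flip the router back in seconds, with no re-deploy, no rollback script, and no waiting for containers to spin up. This assumes the switch happens at a load balancer or service mesh; DNS-based switches take longer, as covered under Traffic Switching Mechanisms.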


Before the switch: Blue is live, Green holds the new version awaiting validation.
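
To make the "flip the router" idea concrete, here is a minimal sketch in Python. The `Router` class and its method names are hypothetical stand-ins for a real load balancer, used only to show that cut-over and rollback are both pointer swaps rather than deployments.

```python
class Router:
    """Toy stand-in for a load balancer that sends all traffic to one environment."""

    def __init__(self, environments, live):
        if live not in environments:
            raise ValueError(f"unknown environment: {live}")
        self.environments = dict(environments)  # name -> deployed version
        self.live = live
        self.previous = None

    def handle_request(self):
        # Every request is served by whichever environment is currently live.
        return f"{self.live} ({self.environments[self.live]})"

    def switch_to(self, target):
        # The cut-over is a pointer swap: nothing is redeployed.
        if target not in self.environments:
            raise ValueError(f"unknown environment: {target}")
        self.previous, self.live = self.live, target

    def rollback(self):
        # Rolling back flips the pointer to the environment that was live before.
        if self.previous is None:
            raise RuntimeError("no previous environment to roll back to")
        self.switch_to(self.previous)


router = Router({"blue": "v1.0", "green": "v2.0"}, live="blue")
before = router.handle_request()    # served by Blue
router.switch_to("green")
after = router.handle_request()     # served by Green
router.rollback()
restored = router.handle_request()  # served by Blue again
```

Because the old environment is untouched by the switch, rolling back costs the same as cutting over.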

Deployment Flow Step by Step

Identify the idle environment. If Blue is currently live, you deploy to Green.
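Deploy the new version to Green. This is a full deploy: new container images, updated config, and database migrations (only if they are backward-compatible, so the version still running on Blue keeps working).
Run smoke tests and health checks against Green. Because Green is not receiving user traffic, you can validate the release on production infrastructure before anyone sees it. If Green shares the production database, keep test writes isolated from real user data.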
Switch the router. Update the load balancer, DNS weighted routing, or service mesh virtual service to direct 100% of traffic to Green.
Monitor the new environment closely for the first 15–30 minutes.
Keep Blue warm. Do not decommission Blue immediately. It remains your instant rollback target.
After confidence is established, Blue becomes the new idle environment for the next cycle.
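
The flow above can be sketched as a single function. This is an illustration under stated assumptions, not a real deployment tool: `blue_green_release` and the four callables passed to it are hypothetical hooks that a pipeline would supply.

```python
from typing import Callable


def blue_green_release(
    live: str,
    deploy: Callable[[str], None],
    smoke_test: Callable[[str], bool],
    switch_traffic: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> str:
    """Run one release cycle and return the environment that ends up live."""
    idle = "green" if live == "blue" else "blue"

    deploy(idle)                  # new version goes to the idle side only
    if not smoke_test(idle):      # validate before any user traffic arrives
        return live               # abort: the live side was never touched

    switch_traffic(idle)          # cut over
    if not is_healthy(idle):      # result of the post-switch monitoring window
        switch_traffic(live)      # old side is still warm, so flip back
        return live
    return idle                   # the old live side is now the idle one


# Simulate a release that passes smoke tests but fails in production.
events = []
result = blue_green_release(
    live="blue",
    deploy=lambda env: events.append(f"deploy:{env}"),
    smoke_test=lambda env: True,
    switch_traffic=lambda env: events.append(f"switch:{env}"),
    is_healthy=lambda env: False,
)
```

In this run `events` records one deploy and two switches, and Blue is live again at the end without any redeploy.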

Traffic Switching Mechanisms

Mechanism	How It Works	Cutover Speed	Rollback Speed
DNS TTL flip	Update DNS A record to Green's IP	Minutes (TTL-dependent)	Minutes
Load balancer target swap	Change ALB/NLB target group	Seconds	Seconds
Service mesh virtual service	Update Istio VirtualService weights	Sub-second	Sub-second
Feature flag / routing header	Route based on header or flag	Instant	Instant
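
As a sketch of the load balancer row, the function below builds the forward action for an ALB listener with two weighted target groups. The function name and the placeholder ARNs are assumptions for illustration; the dictionary shape follows the `DefaultActions` parameter of the Elastic Load Balancing `ModifyListener` API.

```python
def forward_action(blue_tg_arn: str, green_tg_arn: str, live: str) -> list:
    """Build a listener default action that sends 100% of traffic to `live`."""
    if live not in ("blue", "green"):
        raise ValueError("live must be 'blue' or 'green'")
    return [{
        "Type": "forward",
        "ForwardConfig": {
            "TargetGroups": [
                {"TargetGroupArn": blue_tg_arn, "Weight": 100 if live == "blue" else 0},
                {"TargetGroupArn": green_tg_arn, "Weight": 100 if live == "green" else 0},
            ]
        },
    }]


# Placeholder ARNs, not real resources.
actions = forward_action("arn:blue-target-group", "arn:green-target-group", live="green")

# With boto3 installed and credentials configured, the switch would be roughly:
#   boto3.client("elbv2").modify_listener(ListenerArn=listener_arn, DefaultActions=actions)
```

Rolling back is the same call with `live="blue"`.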

⚠️

DNS TTL Trap

Switching via DNS is the slowest and trickiest option. DNS records are cached by ISPs and clients based on the TTL value. Even if you set TTL to 60 seconds, some resolvers ignore low TTLs. For true zero-downtime cutover, use load balancer target group swaps or a service mesh — not DNS.
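
A small simulation shows why the DNS route is slow. `CachingResolver` is a toy model, not a real DNS library; the hostname, IP addresses, and the 300-second clamp are made-up values chosen to illustrate a resolver that ignores low TTLs.

```python
class CachingResolver:
    """Toy resolver that caches an answer until its TTL expires."""

    def __init__(self, authoritative, min_ttl=0):
        self.authoritative = authoritative  # dict: name -> (ip, ttl_seconds)
        self.min_ttl = min_ttl              # some resolvers clamp low TTLs upward
        self.cache = {}                     # name -> (ip, expires_at)

    def resolve(self, name, now):
        cached = self.cache.get(name)
        if cached and now < cached[1]:
            return cached[0]                # still cached: may be the old IP
        ip, ttl = self.authoritative[name]
        self.cache[name] = (ip, now + max(ttl, self.min_ttl))
        return ip


records = {"app.example.com": ("10.0.0.1", 60)}        # Blue, TTL 60s
honest = CachingResolver(records)
clamping = CachingResolver(records, min_ttl=300)       # ignores TTLs under 300s

honest.resolve("app.example.com", now=0)
clamping.resolve("app.example.com", now=0)

records["app.example.com"] = ("10.0.0.2", 60)          # flip DNS to Green

at_30s = honest.resolve("app.example.com", now=30)     # cached: still Blue
at_90s = honest.resolve("app.example.com", now=90)     # TTL expired: Green
clamped_at_90s = clamping.resolve("app.example.com", now=90)  # still Blue
```

Between the flip and the last cache expiry, both environments receive traffic, so neither cut-over nor rollback is atomic.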

Database Migration: The Hard Part

Blue-green deployment works beautifully for stateless application tiers. The hard part is database schema changes. Both environments typically share a single database (or replicated cluster). If v2.0 requires a schema change that is incompatible with v1.0 — such as dropping a column v1.0 still reads — you cannot flip traffic back without breaking the old environment.

The solution is the expand-contract pattern (also called parallel change or multi-phase migration):

Expand: Deploy v2.0 with a migration that only *adds* new columns or tables. v1.0 still runs fine — it ignores new columns.
Migrate data: Backfill data into the new schema while both environments can run.
Cut over: Switch traffic to v2.0.
Contract: After v2.0 is fully validated and v1.0 is decommissioned, run a second migration to drop the old columns/tables.
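
The four phases can be walked through with an in-memory SQLite database. This is a minimal sketch: the `users` table, the rename from `email` to `contact_email`, and the `v1_read`/`v2_read` helpers are all hypothetical, and a real migration would also need v2.0 to keep both columns in sync until the contract phase.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
db.execute("INSERT INTO users (email) VALUES ('a@example.com')")


def v1_read(conn):
    # Old code path: only knows the original column.
    return conn.execute("SELECT email FROM users ORDER BY id").fetchall()


def v2_read(conn):
    # New code path: reads the new column.
    return conn.execute("SELECT contact_email FROM users ORDER BY id").fetchall()


# Expand: additive change only, so v1 keeps working.
db.execute("ALTER TABLE users ADD COLUMN contact_email TEXT")
v1_after_expand = v1_read(db)

# Migrate data: backfill the new column from the old one.
db.execute("UPDATE users SET contact_email = email WHERE contact_email IS NULL")
v2_after_backfill = v2_read(db)

# Cut over to v2 happens here; rolling back to v1 is still possible.

# Contract: rebuild the table without the old column. Only safe once v1 is gone.
db.executescript("""
    CREATE TABLE users_new (id INTEGER PRIMARY KEY, contact_email TEXT);
    INSERT INTO users_new (id, contact_email) SELECT id, contact_email FROM users;
    DROP TABLE users;
    ALTER TABLE users_new RENAME TO users;
""")

try:
    v1_read(db)
    v1_survives_contract = True
except sqlite3.OperationalError:
    v1_survives_contract = False  # v1 breaks: its column no longer exists
```

The final `try` block is the point of the pattern: after the contract step the old version can no longer run, which is why that step waits until rollback is off the table.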

ℹ️

Separate Database Environments

Some teams run completely separate databases per environment. This eliminates the migration conflict problem but requires synchronizing data before the cut — often done with change data capture (CDC) replication from Blue's database to Green's during the deployment window.
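
The idea behind CDC-based synchronization can be sketched with plain Python data structures. Everything here is a simplification under assumptions: the change log is a list, the databases are dictionaries, and the event shape (`op`, `key`, `value`) is invented for illustration rather than taken from any specific CDC tool.

```python
def apply_change(replica: dict, event: dict) -> None:
    """Apply one change event (insert/update/delete) to a key-value replica."""
    op, key = event["op"], event["key"]
    if op in ("insert", "update"):
        replica[key] = event["value"]
    elif op == "delete":
        replica.pop(key, None)
    else:
        raise ValueError(f"unknown op: {op}")


def replicate(change_log: list, replica: dict, applied_up_to: int) -> int:
    """Apply events after position `applied_up_to`; return the new position."""
    for event in change_log[applied_up_to:]:
        apply_change(replica, event)
    return len(change_log)


def safe_to_cut_over(change_log: list, applied_up_to: int) -> bool:
    # Cut over only when Green has applied every change Blue has recorded.
    return applied_up_to == len(change_log)


blue_log = [
    {"op": "insert", "key": 1, "value": "alice"},
    {"op": "update", "key": 1, "value": "alice2"},
    {"op": "insert", "key": 2, "value": "bob"},
    {"op": "delete", "key": 2},
]
green_db = {}
position = replicate(blue_log, green_db, applied_up_to=0)
caught_up = safe_to_cut_over(blue_log, position)

blue_log.append({"op": "insert", "key": 3, "value": "carol"})  # a late write on Blue
lagging = not safe_to_cut_over(blue_log, position)
```

The late write shows the catch: any change that lands on Blue after the last replication pass is missing from Green, so the cut has to wait for lag to reach zero (or writes have to pause briefly).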

Costs and Trade-offs

Dimension	Blue-Green Advantage	Blue-Green Disadvantage
Rollback speed	Instant (seconds)	—
Downtime	Zero downtime during switch	—
Infrastructure cost	—	Doubles your compute bill at all times
Database migrations	—	Requires expand-contract discipline
Stateful sessions	—	In-flight sessions may drop at cut-over
Testing fidelity	Test against real infra before going live	—

AWS Implementation Example

```yaml
# AWS CodeDeploy appspec.yml for a blue-green ECS deployment
version: 0.0
Resources:
  - TargetService:
      Type: AWS::ECS::Service
      Properties:
        TaskDefinition: <TASK_DEFINITION>
        LoadBalancerInfo:
          ContainerName: "my-app"
          ContainerPort: 8080

# CodeDeploy creates a replacement task set (Green),
# shifts traffic when health checks pass,
# then terminates the original task set (Blue).
# The termination delay (for example 5 minutes, the rollback window)
# is configured on the deployment group, not in this file.
# For ECS deployments each hook names a Lambda function rather than a
# script path; the smoke tests run inside that function.
Hooks:
  - AfterAllowTestTraffic: "<SMOKE_TEST_LAMBDA_FUNCTION_NAME>"
```

On Kubernetes, blue-green is achieved by maintaining two `Deployment` objects with different labels. The `Service` selector is updated to point to the new deployment. Tools like Argo Rollouts and Flagger automate this pattern, including automatic rollback if health checks fail post-switch.
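
A minimal sketch of the selector update, written as a function that builds the patch body. The label keys `app` and `version`, the service name `my-app`, and the `default` namespace are assumptions; real manifests may label pods differently.

```python
def selector_patch(app: str, color: str) -> dict:
    """Build a patch that repoints a Service at one colour's pods."""
    if color not in ("blue", "green"):
        raise ValueError("color must be 'blue' or 'green'")
    return {"spec": {"selector": {"app": app, "version": color}}}


patch = selector_patch("my-app", "green")

# With the official `kubernetes` Python client (not imported here), the patch
# could be applied roughly like this:
#   client.CoreV1Api().patch_namespaced_service("my-app", "default", patch)
```

Rolling back means applying `selector_patch("my-app", "blue")`, provided the Blue `Deployment` has not been scaled down.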

When to Use Blue-Green

High-risk releases where instant rollback capability is non-negotiable
Regulated environments where downtime for maintenance is not acceptable
Infrequent, large batch releases (weekly or monthly) rather than continuous delivery
Stateless services with manageable database migration strategies

💡

Interview Tip

In interviews, when you propose blue-green deployment, immediately address the database problem — it's the follow-up every interviewer expects. Say: 'The tricky part is schema changes. I'd use the expand-contract pattern: first deploy a backward-compatible migration, then cut traffic, then run the cleanup migration after the old environment is decommissioned.' This shows depth beyond just knowing the pattern name.
