AWS US-EAST-1 DynamoDB Outage

On June 13, 2023, Amazon Web Services experienced a significant outage in its US-EAST-1 region that impacted DynamoDB and several other services, causing widespread disruptions across the internet. This incident serves as a critical case study in cloud infrastructure resilience, single points of failure, and the importance of multi-region architecture.

Incident Overview

The outage began at approximately 2:40 PM EDT and lasted for several hours, with some services experiencing degraded performance for even longer. US-EAST-1, located in Northern Virginia, is AWS’s largest and oldest region, hosting a substantial portion of the internet’s infrastructure.

Timeline of Events

  • 2:40 PM EDT - Initial issues detected, with DynamoDB API calls returning elevated error rates
  • 2:55 PM EDT - AWS updates its Service Health Dashboard to acknowledge the issues
  • 3:15 PM EDT - Multiple services, including Lambda, API Gateway, and CloudWatch, begin showing degraded performance
  • 4:30 PM EDT - DynamoDB write operations significantly impacted across availability zones
  • 6:00 PM EDT - Partial recovery begins for some services
  • 8:45 PM EDT - AWS declares most services restored, though some degradation continues
  • Next day - Full recovery; post-incident analysis begins

Affected Services

The cascading nature of the outage affected numerous AWS services:

  • DynamoDB: Primary service impacted, with both read and write operations affected
  • Lambda: Functions failed to execute or experienced high latency
  • API Gateway: Request failures and timeouts
  • CloudWatch: Metrics and logging delays
  • Cognito: Authentication failures
  • Step Functions: Workflow execution failures
  • EventBridge: Event delivery delays
  • S3: Some operations experienced elevated error rates
  • EC2: Instance launches and API calls affected

Technical Root Cause

While AWS’s official post-mortem provided limited technical details, industry analysis and observable symptoms pointed to several contributing factors:

Power Infrastructure Failure

The primary trigger appears to have been a power-related event affecting multiple availability zones simultaneously. Unlike typical AZ-isolated failures, this event impacted shared infrastructure:

Normal Operation:
AZ-1 [Independent Power] ━━━━ Service A
AZ-2 [Independent Power] ━━━━ Service A  
AZ-3 [Independent Power] ━━━━ Service A

During Outage:
AZ-1 [Power Anomaly] ━━━━ Service A (Degraded)
AZ-2 [Power Anomaly] ━━━━ Service A (Degraded)
AZ-3 [Power Anomaly] ━━━━ Service A (Degraded)

Control Plane Overload

As systems attempted to recover, the DynamoDB control plane became overloaded:

# Simplified representation of what happened
class ControlPlane:
    def __init__(self):
        self.max_requests_per_second = 10000
        self.current_load = 0
    
    def handle_recovery_wave(self):
        # Thousands of services simultaneously trying to reconnect
        recovery_requests = 50000  # Overwhelming demand
        
        if recovery_requests > self.max_requests_per_second:
            # Control plane becomes bottleneck
            return "THROTTLED"
        
        # Normal operations
        return "ACCEPTED"

Cascading Dependencies

DynamoDB’s role as infrastructure for other AWS services created a cascade:

DynamoDB Outage
    ↓
Lambda (uses DynamoDB for state management)
    ↓
API Gateway (relies on Lambda)
    ↓
Customer Applications

This dependency chain meant that even services not directly affected by the power event experienced failures.
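
As a concrete illustration of this chain (the table name and event shape below are hypothetical), a Lambda function that fetches configuration from DynamoDB on every request turns a DynamoDB outage into 5xx responses at API Gateway, even though Lambda and API Gateway themselves are running:

import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

def handler(event, context):
    # Hypothetical Lambda handler behind an API Gateway proxy integration.
    try:
        # The function's own code path is healthy, but it reads per-tenant
        # configuration from DynamoDB on every invocation...
        dynamodb.get_item(
            TableName="tenant-config",  # hypothetical table
            Key={"tenant_id": {"S": event.get("tenant_id", "unknown")}},
        )
    except Exception:
        # ...so a DynamoDB outage surfaces to callers as a 503 from
        # API Gateway, even though Lambda and API Gateway are "up".
        return {"statusCode": 503, "body": "dependency unavailable"}
    return {"statusCode": 200, "body": "ok"}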

Metadata Service Impact

AWS’s internal metadata services, which many AWS services depend on for configuration and coordination, were also impacted:

## Example dependency chain
Service: Lambda
Dependencies:
  - DynamoDB: function state
  - IAM: authentication/authorization
  - CloudWatch: logging
  - VPC: networking configuration
  - S3: code artifacts

## When DynamoDB fails:
Lambda Status: DEGRADED
Reason: "Cannot retrieve function configuration"

Impact Analysis

By the Numbers

While AWS doesn’t publish exact figures, industry estimates suggested:

  • Affected Requests: Billions of failed API calls
  • Impacted Organizations: Tens of thousands
  • Financial Impact: Hundreds of millions in combined losses
  • Duration: 6+ hours of significant disruption
  • Recovery Time: 24+ hours for complete restoration

High-Profile Service Disruptions

Many major services experienced outages:

Streaming Services:

  • Netflix: Viewing disruptions for some users
  • Disney+: Login and playback issues
  • Twitch: Stream interruptions

E-Commerce:

  • Various online retailers: Checkout failures
  • Payment processing: Transaction delays
  • Inventory systems: Update failures

Enterprise Applications:

  • SaaS platforms: Service unavailability
  • Corporate applications: Authentication failures
  • Development tools: CI/CD pipeline failures

Gaming:

  • Popular online games: Login failures
  • Gaming platforms: Matchmaking issues
  • In-game purchases: Transaction failures

Customer Impact Patterns

// Example of how customers experienced the outage
const outageImpact = {
  singleRegionDeployments: {
    availability: "0%",
    impact: "Complete service outage",
    duration: "6+ hours"
  },
  
  multiRegionWithFailover: {
    availability: "80-95%",
    impact: "Brief disruption during failover",
    duration: "5-15 minutes"
  },
  
  activeActiveMultiRegion: {
    availability: "95-100%",
    impact: "Minimal - traffic routed to healthy regions",
    duration: "0-5 minutes"
  },
  
  noDisasterRecovery: {
    availability: "0%",
    impact: "Data loss risk, extended recovery",
    duration: "Days to weeks"
  }
};

Why US-EAST-1 Matters

US-EAST-1 holds a special position in AWS’s infrastructure:

Historical Significance

  • First AWS Region: Launched in 2006
  • Most Services: New AWS features typically launch here first
  • Largest Capacity: Hosts the most infrastructure
  • Default Region: Many tutorials and documentation use it
  • Cost Advantage: Often the lowest-priced region

The Concentration Problem

US-EAST-1 Market Share (Estimated):
├── 40-50% of all AWS workloads
├── 60%+ of all DynamoDB tables
├── 70%+ of Lambda functions
└── Major portion of internet infrastructure

This concentration creates:
- Single point of failure risk
- Cascading failure potential
- Recovery complexity
- Systemic internet risk

Why Companies Choose US-EAST-1

Despite the risks, companies continue to concentrate in this region:

Cost Factors:

## Price comparison (example, approximate)
us_east_1_cost = {
    'dynamodb_write': 1.25,  # per million writes
    'lambda_duration': 0.0000166667,  # per GB-second
    's3_storage': 0.023  # per GB
}

us_west_2_cost = {
    'dynamodb_write': 1.25,  # same as US-EAST-1
    'lambda_duration': 0.0000166667,  # same
    's3_storage': 0.023  # same
}

eu_central_1_cost = {
    'dynamodb_write': 1.43,  # 14% more expensive
    'lambda_duration': 0.0000186667,  # 12% more
    's3_storage': 0.025  # 9% more
}

Latency Considerations:

Average latency to major US population centers:
US-EAST-1 (Virginia):
  - New York: 10-15ms
  - Chicago: 25-30ms
  - Los Angeles: 65-70ms
  
US-WEST-2 (Oregon):
  - New York: 70-75ms
  - Chicago: 50-55ms
  - Los Angeles: 15-20ms

For businesses whose users are concentrated on the US East Coast, US-EAST-1 offers the lowest latency.

Technical Lessons and Best Practices

Multi-Region Architecture

The outage reinforced the importance of multi-region deployment:

## Basic multi-region architecture pattern
class MultiRegionApplication:
    def __init__(self):
        self.regions = {
            'primary': 'us-east-1',
            'secondary': 'us-west-2',
            'tertiary': 'eu-west-1'
        }
        self.health_check_interval = 30  # seconds
    
    def route_request(self, request):
        # Try primary region first
        primary_healthy = self.check_region_health(self.regions['primary'])
        
        if primary_healthy:
            return self.send_to_region(request, self.regions['primary'])
        
        # Fallback to secondary
        secondary_healthy = self.check_region_health(self.regions['secondary'])
        
        if secondary_healthy:
            return self.send_to_region(request, self.regions['secondary'])
        
        # Last resort: tertiary
        return self.send_to_region(request, self.regions['tertiary'])
    
    def check_region_health(self, region):
        # Implement health checking logic
        # Check DynamoDB, Lambda, and other critical services
        try:
            response = self.dynamodb_health_check(region)
            return response.status_code == 200
        except Exception:
            return False

Active-Active vs Active-Passive

Active-Passive (Traditional DR):

Configuration:
  Primary: us-east-1
    - Handles 100% of traffic
    - Full production workload
    - Real-time data
  
  Secondary: us-west-2
    - Standby mode
    - Receives data replication
    - No production traffic

Failover:
  - Manual or automated trigger
  - DNS update (5-15 minutes)
  - Application restart
  - Total failover time: 10-30 minutes
  
Costs: Lower (only primary active)
Complexity: Medium
RTO: 10-30 minutes
RPO: Minutes to hours

Active-Active (Modern Approach):

Configuration:
  Region 1: us-east-1
    - Handles 50% of traffic
    - Full production workload
    - Writes to both regions
  
  Region 2: us-west-2
    - Handles 50% of traffic
    - Full production workload
    - Writes to both regions

Failover:
  - Automatic detection (seconds)
  - Traffic rebalancing (instant)
  - No DNS changes needed
  - Total failover time: < 1 minute
  
Costs: Higher (both regions active)
Complexity: High
RTO: < 1 minute
RPO: Near-zero
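
For the active-passive pattern, the DNS step of the failover can be automated with Route 53 failover routing. The sketch below is illustrative rather than a drop-in implementation: the hosted zone ID, health check ID, and record names are placeholders.

import boto3

route53 = boto3.client("route53")

# Placeholders: substitute your own hosted zone, health check, and endpoints.
HOSTED_ZONE_ID = "Z0000000000000EXAMPLE"
PRIMARY_HEALTH_CHECK_ID = "11111111-2222-3333-4444-555555555555"

route53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={
        "Comment": "Active-passive failover: us-east-1 primary, us-west-2 standby",
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "api.example.com",
                    "Type": "CNAME",
                    "SetIdentifier": "primary-us-east-1",
                    "Failover": "PRIMARY",
                    "TTL": 60,  # low TTL keeps the DNS cutover window short
                    "HealthCheckId": PRIMARY_HEALTH_CHECK_ID,
                    "ResourceRecords": [{"Value": "api-use1.example.com"}],
                },
            },
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "api.example.com",
                    "Type": "CNAME",
                    "SetIdentifier": "secondary-us-west-2",
                    "Failover": "SECONDARY",
                    "TTL": 60,
                    "ResourceRecords": [{"Value": "api-usw2.example.com"}],
                },
            },
        ],
    },
)

With a low TTL and a health check attached to the primary record, Route 53 stops answering with the primary endpoint once the health check fails; DNS caching and propagation are what drive the 10-30 minute failover window quoted above.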

DynamoDB Global Tables

AWS’s solution for multi-region DynamoDB:

import boto3

## Create Global Table
dynamodb = boto3.client('dynamodb')

## Define table structure
table_name = 'critical-application-data'

## Note: this (legacy) Global Tables API expects identical tables named
## 'critical-application-data', with DynamoDB Streams enabled, to already
## exist in each listed region.
response = dynamodb.create_global_table(
    GlobalTableName=table_name,
    ReplicationGroup=[
        {'RegionName': 'us-east-1'},
        {'RegionName': 'us-west-2'},
        {'RegionName': 'eu-west-1'}
    ]
)

## Benefits:
## - Automatic multi-region replication
## - Sub-second replication lag
## - Local read/write in each region
## - Automatic conflict resolution

## Trade-offs:
## - Higher costs (3x storage, cross-region data transfer)
## - Eventual consistency between regions
## - Increased complexity
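
Once the replicas exist, applications read and write the table through their local regional endpoint. A minimal usage sketch (assuming an 'id' string partition key, as in this example):

import boto3

# Each region now holds a full read/write replica of the table.
use1 = boto3.resource("dynamodb", region_name="us-east-1").Table("critical-application-data")
usw2 = boto3.resource("dynamodb", region_name="us-west-2").Table("critical-application-data")

# Normal operation: read and write in the local (primary) region.
use1.put_item(Item={"id": "user-123", "plan": "premium"})

# If us-east-1 is impaired, the same calls against the us-west-2 replica keep
# working; the item above arrives there asynchronously (typically sub-second),
# so cross-region reads are eventually consistent.
item = usw2.get_item(Key={"id": "user-123"}).get("Item")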

Circuit Breaker Pattern

Protect your application from cascading failures:

import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, timeout=60):
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.timeout = timeout
        self.last_failure_time = None
        self.state = "CLOSED"  # CLOSED, OPEN, HALF_OPEN
    
    def call(self, func):
        if self.state == "OPEN":
            if self.should_attempt_reset():
                self.state = "HALF_OPEN"
            else:
                raise Exception("Circuit breaker is OPEN")
        
        try:
            result = func()
            self.on_success()
            return result
        except Exception as e:
            self.on_failure()
            raise e
    
    def on_success(self):
        self.failure_count = 0
        self.state = "CLOSED"
    
    def on_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        
        if self.failure_count >= self.failure_threshold:
            self.state = "OPEN"
    
    def should_attempt_reset(self):
        return (time.time() - self.last_failure_time) > self.timeout

## Usage
dynamodb_breaker = CircuitBreaker(failure_threshold=5, timeout=60)

def get_data_from_dynamodb(key):
    return dynamodb_breaker.call(
        lambda: dynamodb.get_item(TableName='my-table', Key={'id': key})
    )

Graceful Degradation

Design applications to degrade gracefully:

class ResilientApplication {
  async getData(userId) {
    try {
      // Primary data source: DynamoDB
      return await this.getDynamoDBData(userId);
    } catch (dynamoError) {
      console.error('DynamoDB unavailable:', dynamoError);
      
      try {
        // Fallback 1: Read replica or cache
        return await this.getCachedData(userId);
      } catch (cacheError) {
        console.error('Cache unavailable:', cacheError);
        
        try {
          // Fallback 2: Secondary database
          return await this.getBackupDBData(userId);
        } catch (backupError) {
          // Final fallback: Degraded mode with limited data
          return this.getDegradedData(userId);
        }
      }
    }
  }
  
  getDegradedData(userId) {
    // Return essential data only
    return {
      userId: userId,
      status: 'limited_functionality',
      message: 'Some features temporarily unavailable'
    };
  }
}

Chaos Engineering

Proactively test failure scenarios:

## AWS Fault Injection Simulator example
import boto3

fis = boto3.client('fis')

## Create experiment template
experiment_template = {
    'description': 'Simulate DynamoDB throttling',
    'actions': {
        'ThrottleDynamoDB': {
            'actionId': 'aws:dynamodb:throttle-requests',
            'parameters': {
                'throttlePercentage': '90',
                'duration': 'PT10M'  # 10 minutes
            },
            'targets': {
                'Tables': 'DynamoDBTables'
            }
        }
    },
    'targets': {
        'DynamoDBTables': {
            'resourceType': 'aws:dynamodb:table',
            'selectionMode': 'ALL',
            'resourceTags': {
                'Environment': 'staging'
            }
        }
    },
    'stopConditions': [
        {
            'source': 'aws:cloudwatch:alarm',
            'value': 'critical-error-rate-alarm'
        }
    ]
}

## Regular testing schedule
## - Weekly: Minor disruptions (10% throttling)
## - Monthly: Moderate disruptions (50% throttling)
## - Quarterly: Severe disruptions (90% throttling)
## - Annually: Full region failure simulation

Monitoring and Alerting

Multi-Region Health Checks

## CloudWatch Synthetics Canary
canary_configuration:
  name: multi-region-health-check
  regions:
    - us-east-1
    - us-west-2
    - eu-west-1
  
  checks:
    - name: dynamodb_write
      frequency: 1_minute
      timeout: 10_seconds
      alert_threshold: 2_consecutive_failures
    
    - name: dynamodb_read
      frequency: 1_minute
      timeout: 5_seconds
      alert_threshold: 2_consecutive_failures
    
    - name: lambda_execution
      frequency: 1_minute
      timeout: 30_seconds
      alert_threshold: 3_consecutive_failures
  
  alerting:
    critical:
      - all_regions_failing
      - primary_region_down_over_5_minutes
    
    warning:
      - single_region_degraded
      - elevated_error_rates
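
The canary definition above is pseudo-configuration; as a rough equivalent, the same checks can be scripted with boto3 and published as custom CloudWatch metrics for the alerting rules to alarm on. The table name, timeouts, and metric namespace below are assumptions:

import boto3
from botocore.config import Config

REGIONS = ["us-east-1", "us-west-2", "eu-west-1"]
HEALTH_TABLE = "health-check"  # hypothetical table that exists in every region

def dynamodb_healthy(region):
    # A read counts as healthy only if it completes quickly and without error.
    client = boto3.client(
        "dynamodb",
        region_name=region,
        config=Config(connect_timeout=2, read_timeout=3, retries={"max_attempts": 1}),
    )
    try:
        client.get_item(TableName=HEALTH_TABLE, Key={"id": {"S": "probe"}})
        return True
    except Exception:
        return False

def publish_health_metrics():
    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
    for region in REGIONS:
        healthy = dynamodb_healthy(region)
        # One 0/1 datapoint per region; CloudWatch alarms on these metrics can
        # implement the "single_region_degraded" / "all_regions_failing" rules above.
        cloudwatch.put_metric_data(
            Namespace="MultiRegionHealth",
            MetricData=[{
                "MetricName": "DynamoDBHealthy",
                "Dimensions": [{"Name": "Region", "Value": region}],
                "Value": 1.0 if healthy else 0.0,
            }],
        )

In practice the probe itself should run from, and publish to, more than one region, so the monitoring does not share the fate of US-EAST-1.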

Dependency Mapping

class ServiceDependencyMonitor:
    def __init__(self):
        self.dependencies = {
            'api-service': [
                'dynamodb-users',
                'dynamodb-sessions',
                'lambda-auth',
                'cognito'
            ],
            'lambda-auth': [
                'dynamodb-tokens',
                'cognito'
            ],
            'data-pipeline': [
                'dynamodb-events',
                's3-data-lake',
                'lambda-processor'
            ]
        }
    
    def check_service_health(self, service):
        # Check direct service health
        service_healthy = self.direct_health_check(service)
        
        if not service_healthy:
            return {'status': 'unhealthy', 'reason': 'direct_failure'}
        
        # Check all dependencies
        for dependency in self.dependencies.get(service, []):
            dep_healthy = self.check_service_health(dependency)
            
            if dep_healthy['status'] != 'healthy':
                return {
                    'status': 'degraded',
                    'reason': f'dependency_failure: {dependency}'
                }
        
        return {'status': 'healthy'}

Cost Considerations

Multi-Region Cost Analysis

## Cost comparison for a typical application
class MultiRegionCostCalculator:
    def __init__(self):
        self.dynamodb_write_price = 1.25  # per million writes
        self.data_transfer_price = 0.02   # per GB
        self.lambda_price = 0.0000166667  # per GB-second
    
    def calculate_single_region_cost(self, writes_per_month, data_gb):
        dynamodb_cost = (writes_per_month / 1_000_000) * self.dynamodb_write_price
        return {
            'dynamodb': dynamodb_cost,
            'data_transfer': 0,  # No cross-region transfer
            'total': dynamodb_cost
        }
    
    def calculate_multi_region_cost(self, writes_per_month, data_gb, regions=3):
        # DynamoDB Global Tables cost
        dynamodb_cost = (writes_per_month / 1_000_000) * self.dynamodb_write_price * regions
        
        # Replication data transfer between regions
        replication_transfer = data_gb * regions * self.data_transfer_price
        
        return {
            'dynamodb': dynamodb_cost,
            'data_transfer': replication_transfer,
            'total': dynamodb_cost + replication_transfer
        }

## Example calculation
calculator = MultiRegionCostCalculator()

## Application with 100M writes/month, 1TB data
single_region = calculator.calculate_single_region_cost(100_000_000, 1000)
multi_region = calculator.calculate_multi_region_cost(100_000_000, 1000, 3)

print(f"Single Region: ${single_region['total']:.2f}/month")
print(f"Multi Region: ${multi_region['total']:.2f}/month")
print(f"Additional Cost: ${multi_region['total'] - single_region['total']:.2f}/month")

## Typical output:
## Single Region: $125/month
## Multi Region: $435/month
## Additional Cost: $310/month (248% increase)

Cost vs Availability Trade-offs

Availability Level | Architecture                                    | Monthly Cost | Downtime/Year
-------------------|-------------------------------------------------|--------------|---------------
99.0% (2 nines)    | Single AZ                                       | $1,000       | 87.6 hours
99.9% (3 nines)    | Multi-AZ                                        | $1,500       | 8.76 hours
99.95%             | Multi-Region                                    | $3,500       | 4.38 hours
99.99% (4 nines)   | Multi-Region, Active-Active                     | $5,000       | 52.6 minutes
99.999% (5 nines)  | Multi-Region, Active-Active, Multiple Providers | $10,000+     | 5.26 minutes
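
The downtime column follows directly from each availability target as a fraction of the 8,760 hours in a year; a quick sanity check:

# Downtime per year implied by an availability target (365 * 24 = 8,760 hours).
def downtime_hours_per_year(availability_pct):
    return (1 - availability_pct / 100) * 8760

for target in (99.0, 99.9, 99.95, 99.99, 99.999):
    print(f"{target}%: {downtime_hours_per_year(target):.2f} hours/year")

# 99.0%   -> 87.60 hours
# 99.9%   -> 8.76 hours
# 99.95%  -> 4.38 hours
# 99.99%  -> 0.88 hours (~52.6 minutes)
# 99.999% -> 0.09 hours (~5.26 minutes)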

Industry Response and Changes

AWS Improvements

Following the incident, AWS announced several improvements:

  1. Enhanced Power Infrastructure

    • Increased isolation between availability zones
    • Improved backup power systems
    • Better monitoring of electrical systems
  2. Control Plane Capacity

    • Increased capacity for recovery scenarios
    • Better throttling and queuing mechanisms
    • Improved graceful degradation
  3. Communication

    • Faster service health dashboard updates
    • More detailed post-incident reports
    • Better customer notification systems

Customer Behavior Changes

Organizations responded by:

Immediate Actions:
  - Emergency multi-region deployments
  - Increased monitoring and alerting
  - Incident response plan updates
  - Management escalation

Long-term Changes:
  - Multi-region by default for critical services
  - Increased disaster recovery testing
  - Chaos engineering adoption
  - Multi-cloud strategies (some organizations)
  - Reduced dependency on US-EAST-1

Recommendations for Engineering Teams

Tier-Based Approach

Not all services require the same level of resilience:

class ServiceTierStrategy:
    def __init__(self):
        self.tiers = {
            'tier_1_critical': {
                'availability_target': 99.99,
                'architecture': 'active-active multi-region',
                'rto': '< 1 minute',
                'rpo': '< 1 minute',
                'examples': ['authentication', 'payment processing', 'core APIs']
            },
            'tier_2_important': {
                'availability_target': 99.9,
                'architecture': 'active-passive multi-region',
                'rto': '< 15 minutes',
                'rpo': '< 5 minutes',
                'examples': ['user profiles', 'content delivery', 'analytics']
            },
            'tier_3_standard': {
                'availability_target': 99.0,
                'architecture': 'multi-AZ single region',
                'rto': '< 1 hour',
                'rpo': '< 1 hour',
                'examples': ['reporting', 'batch jobs', 'internal tools']
            }
        }
    
    def get_recommendation(self, service_type):
        # Return the tier definition whose examples include this service type;
        # anything unclassified defaults to the standard tier.
        for tier_name, tier in self.tiers.items():
            if service_type in tier['examples']:
                return tier
        return self.tiers['tier_3_standard']

Testing Checklist

## Disaster Recovery Testing Checklist

### Monthly Tests
- [ ] Verify backup integrity
- [ ] Test monitoring and alerting
- [ ] Review incident response procedures
- [ ] Check failover automation

### Quarterly Tests
- [ ] Full failover to secondary region
- [ ] Restore from backup in different region
- [ ] Simulate partial service degradation
- [ ] Test cross-region replication lag
- [ ] Validate data consistency after failover

### Annual Tests
- [ ] Complete region failure simulation
- [ ] Multi-service cascade failure test
- [ ] Extended outage scenario (4+ hours)
- [ ] Test with full production load
- [ ] Validate financial impact estimates

### Continuous
- [ ] Monitor replication lag (see the sketch after this checklist)
- [ ] Track cross-region latency
- [ ] Review error rates and patterns
- [ ] Update runbooks and documentation
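
For the continuous replication-lag item, DynamoDB Global Tables publish a ReplicationLatency metric per receiving region in CloudWatch. A small sketch of pulling the recent worst case (the table name and threshold are assumptions):

import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

def worst_replication_latency_ms(table_name, receiving_region):
    # Worst 5-minute average ReplicationLatency (ms) over the last 15 minutes.
    now = datetime.now(timezone.utc)
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/DynamoDB",
        MetricName="ReplicationLatency",
        Dimensions=[
            {"Name": "TableName", "Value": table_name},
            {"Name": "ReceivingRegion", "Value": receiving_region},
        ],
        StartTime=now - timedelta(minutes=15),
        EndTime=now,
        Period=300,
        Statistics=["Average"],
    )
    datapoints = stats["Datapoints"]
    return max(p["Average"] for p in datapoints) if datapoints else None

# Example: flag elevated replication lag into us-west-2.
lag = worst_replication_latency_ms("critical-application-data", "us-west-2")
if lag is not None and lag > 5000:
    print(f"Replication lag elevated: {lag:.0f} ms")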

Conclusion

The US-EAST-1 DynamoDB outage of June 2023 serves as a stark reminder that even the most reliable cloud infrastructure can fail. The incident highlighted several critical lessons:

  1. Regional Concentration Risk: Over-reliance on a single region creates systemic risk
  2. Cascading Failures: Dependencies between services can amplify outages
  3. Control Plane Limitations: Recovery can be hindered by overwhelmed control systems
  4. Multi-Region is Essential: Critical services require true multi-region architecture
  5. Testing is Crucial: Regular disaster recovery testing reveals weaknesses
  6. Cost vs Reliability: Higher availability requires significant investment

Key Takeaways for Teams

Immediate Actions:

  • Audit current single-region dependencies
  • Implement basic multi-region failover for critical services
  • Enhance monitoring and alerting
  • Document and test incident response procedures

Long-term Strategy:

  • Design for multi-region from the start
  • Implement chaos engineering practices
  • Balance costs with availability requirements
  • Continuously test and improve resilience

Cultural Changes:

  • Treat outages as learning opportunities
  • Invest in observability and monitoring
  • Prioritize reliability alongside features
  • Foster a culture of operational excellence

The outage affected thousands of organizations and millions of users, but it also provided invaluable lessons in building resilient distributed systems. By learning from this incident and implementing robust multi-region architectures, development teams can better protect their applications and users from future disruptions.

As cloud infrastructure becomes even more central to modern applications, the lessons from this outage become increasingly important. The question is no longer whether failures will occur, but how quickly and gracefully we can recover when they inevitably do.

Thank you for reading! If you have any feedback or comments, please send them to [email protected].