On June 13, 2023, Amazon Web Services experienced a significant outage in its US-EAST-1 region that impacted DynamoDB and several other services, causing widespread disruptions across the internet. This incident serves as a critical case study in cloud infrastructure resilience, single points of failure, and the importance of multi-region architecture.
The Incident Overview
The outage began at approximately 2:40 PM EDT and lasted for several hours, with some services experiencing degraded performance for even longer. US-EAST-1, located in Northern Virginia, is AWS’s largest and oldest region, hosting a substantial portion of the internet’s infrastructure.
Timeline of Events
- 2:40 PM EDT - Initial issues detected, with DynamoDB API calls returning elevated error rates
- 2:55 PM EDT - AWS updates its service health dashboard to acknowledge the issues
- 3:15 PM EDT - Multiple services, including Lambda, API Gateway, and CloudWatch, begin showing degraded performance
- 4:30 PM EDT - DynamoDB write operations significantly impacted across availability zones
- 6:00 PM EDT - Partial recovery begins for some services
- 8:45 PM EDT - AWS declares most services restored, though some degradation continues
- Next day - Full recovery and post-incident analysis begins
Affected Services
The cascading nature of the outage affected numerous AWS services:
- DynamoDB: Primary service impacted, with both read and write operations affected
- Lambda: Functions failed to execute or experienced high latency
- API Gateway: Request failures and timeouts
- CloudWatch: Metrics and logging delays
- Cognito: Authentication failures
- Step Functions: Workflow execution failures
- EventBridge: Event delivery delays
- S3: Some operations experienced elevated error rates
- EC2: Instance launches and API calls affected
Technical Root Cause
While AWS’s official post-mortem provided limited technical details, industry analysis and observable symptoms pointed to several contributing factors:
Power Infrastructure Failure
The primary trigger appears to have been a power-related event affecting multiple availability zones simultaneously. Unlike typical AZ-isolated failures, this event impacted shared infrastructure:
Normal Operation:
AZ-1 [Independent Power] ━━━━ Service A
AZ-2 [Independent Power] ━━━━ Service A
AZ-3 [Independent Power] ━━━━ Service A
During Outage:
AZ-1 [Power Anomaly] ━━━━ Service A (Degraded)
AZ-2 [Power Anomaly] ━━━━ Service A (Degraded)
AZ-3 [Power Anomaly] ━━━━ Service A (Degraded)
Control Plane Overload
As systems attempted to recover, the DynamoDB control plane became overloaded:
# Simplified representation of what happened
class ControlPlane:
    def __init__(self):
        self.max_requests_per_second = 10000
        self.current_load = 0

    def handle_recovery_wave(self):
        # Thousands of services simultaneously trying to reconnect
        recovery_requests = 50000  # Overwhelming demand
        if recovery_requests > self.max_requests_per_second:
            # Control plane becomes the bottleneck
            return "THROTTLED"
        # Normal operations
        return "ACCEPTED"
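A standard client-side mitigation for this kind of recovery storm is exponential backoff with jitter, which spreads reconnection attempts out instead of letting every client retry in lockstep. A minimal sketch follows; the retried operation is a placeholder, not part of any real SDK:

import random
import time

def call_with_backoff(operation, max_attempts=5, base_delay=0.5, max_delay=30.0):
    """Retry a throttled operation with exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:  # in practice, catch the specific throttling exception
            if attempt == max_attempts - 1:
                raise
            # Full jitter: sleep a random amount up to the exponential cap
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))

# Hypothetical usage; get_user_item stands in for any real DynamoDB call
# result = call_with_backoff(lambda: get_user_item('user-123'))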
Cascading Dependencies
DynamoDB’s role as infrastructure for other AWS services created a cascade:
DynamoDB Outage
↓
Lambda (uses DynamoDB for state management)
↓
API Gateway (relies on Lambda)
↓
Customer Applications
This dependency chain meant that even services not directly affected by the power event experienced failures.
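To make the cascade concrete, here is a small sketch that walks an illustrative dependency graph and reports every service that transitively depends on a failed one:

# Illustrative dependency graph: each service lists what it depends on
DEPENDENCIES = {
    'customer-app': ['api-gateway'],
    'api-gateway': ['lambda'],
    'lambda': ['dynamodb'],
    'dynamodb': []
}

def impacted_services(failed_service, dependencies=DEPENDENCIES):
    """Return every service that transitively depends on the failed one."""
    impacted = set()
    for service, direct_deps in dependencies.items():
        stack = list(direct_deps)
        while stack:
            dep = stack.pop()
            if dep == failed_service:
                impacted.add(service)
                break
            stack.extend(dependencies.get(dep, []))
    return impacted

print(impacted_services('dynamodb'))
# e.g. {'lambda', 'api-gateway', 'customer-app'}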
Metadata Service Impact
AWS’s internal metadata services, which many AWS services depend on for configuration and coordination, were also impacted:
# Example dependency chain
Service: Lambda
Dependencies:
  - DynamoDB: function state
  - IAM: authentication/authorization
  - CloudWatch: logging
  - VPC: networking configuration
  - S3: code artifacts

# When DynamoDB fails:
Lambda Status: DEGRADED
Reason: "Cannot retrieve function configuration"
Impact Analysis
By the Numbers
While AWS doesn’t publish exact figures, industry estimates suggested:
- Affected Requests: Billions of failed API calls
- Impacted Organizations: Tens of thousands
- Financial Impact: Hundreds of millions in combined losses
- Duration: 6+ hours of significant disruption
- Recovery Time: 24+ hours for complete restoration
High-Profile Service Disruptions
Many major services experienced outages:
Streaming Services:
- Netflix: Viewing disruptions for some users
- Disney+: Login and playback issues
- Twitch: Stream interruptions
E-Commerce:
- Various online retailers: Checkout failures
- Payment processing: Transaction delays
- Inventory systems: Update failures
Enterprise Applications:
- SaaS platforms: Service unavailability
- Corporate applications: Authentication failures
- Development tools: CI/CD pipeline failures
Gaming:
- Popular online games: Login failures
- Gaming platforms: Matchmaking issues
- In-game purchases: Transaction failures
Customer Impact Patterns
// Example of how customers experienced the outage
const outageImpact = {
singleRegionDeployments: {
availability: "0%",
impact: "Complete service outage",
duration: "6+ hours"
},
multiRegionWithFailover: {
availability: "80-95%",
impact: "Brief disruption during failover",
duration: "5-15 minutes"
},
activeActiveMultiRegion: {
availability: "95-100%",
impact: "Minimal - traffic routed to healthy regions",
duration: "0-5 minutes"
},
noDisasterRecovery: {
availability: "0%",
impact: "Data loss risk, extended recovery",
duration: "Days to weeks"
}
};
Why US-EAST-1 Matters
US-EAST-1 holds a special position in AWS’s infrastructure:
Historical Significance
- First AWS Region: Launched in 2006
- Most Services: New AWS features typically launch here first
- Largest Capacity: Hosts the most infrastructure
- Default Region: Many tutorials and documentation use it
- Cost Advantage: Often the lowest-priced region
The Concentration Problem
US-EAST-1 Market Share (Estimated):
├── 40-50% of all AWS workloads
├── 60%+ of all DynamoDB tables
├── 70%+ of Lambda functions
└── Major portion of internet infrastructure
This concentration creates:
- Single point of failure risk
- Cascading failure potential
- Recovery complexity
- Systemic internet risk
Why Companies Choose US-EAST-1
Despite the risks, companies continue to concentrate in this region:
Cost Factors:
# Price comparison (example, approximate)
us_east_1_cost = {
    'dynamodb_write': 1.25,           # per million writes
    'lambda_duration': 0.0000166667,  # per GB-second
    's3_storage': 0.023               # per GB
}

us_west_2_cost = {
    'dynamodb_write': 1.25,           # same as US-EAST-1
    'lambda_duration': 0.0000166667,  # same
    's3_storage': 0.023               # same
}

eu_central_1_cost = {
    'dynamodb_write': 1.43,           # ~14% more expensive
    'lambda_duration': 0.0000186667,  # ~12% more
    's3_storage': 0.025               # ~9% more
}
Latency Considerations:
Average latency to major US population centers:
US-EAST-1 (Virginia):
- New York: 10-15ms
- Chicago: 25-30ms
- Los Angeles: 65-70ms
US-WEST-2 (Oregon):
- New York: 70-75ms
- Chicago: 50-55ms
- Los Angeles: 15-20ms
For businesses whose users are concentrated on the US East Coast, US-EAST-1 offers the lowest latency.
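Latency can also feed into failover logic. The sketch below picks the lowest-latency healthy region for a given user population; the latency figures are illustrative placeholders rather than measured values:

# Illustrative round-trip latencies in milliseconds; real values would come from measurement
REGION_LATENCY_MS = {
    'us-east-1': {'new_york': 12, 'chicago': 28, 'los_angeles': 68},
    'us-west-2': {'new_york': 72, 'chicago': 52, 'los_angeles': 18}
}

def pick_region(user_location, healthy_regions):
    """Choose the lowest-latency healthy region for a given user population."""
    candidates = {
        region: REGION_LATENCY_MS[region][user_location]
        for region in healthy_regions
        if region in REGION_LATENCY_MS
    }
    return min(candidates, key=candidates.get)

print(pick_region('new_york', ['us-east-1', 'us-west-2']))  # us-east-1
print(pick_region('new_york', ['us-west-2']))               # us-west-2 when us-east-1 is unhealthy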
Technical Lessons and Best Practices
Multi-Region Architecture
The outage reinforced the importance of multi-region deployment:
# Basic multi-region architecture pattern
class MultiRegionApplication:
    def __init__(self):
        self.regions = {
            'primary': 'us-east-1',
            'secondary': 'us-west-2',
            'tertiary': 'eu-west-1'
        }
        self.health_check_interval = 30  # seconds

    def route_request(self, request):
        # Try the primary region first
        primary_healthy = self.check_region_health(self.regions['primary'])
        if primary_healthy:
            return self.send_to_region(request, self.regions['primary'])

        # Fall back to the secondary region
        secondary_healthy = self.check_region_health(self.regions['secondary'])
        if secondary_healthy:
            return self.send_to_region(request, self.regions['secondary'])

        # Last resort: tertiary region
        return self.send_to_region(request, self.regions['tertiary'])

    def check_region_health(self, region):
        # Implement health-checking logic here:
        # check DynamoDB, Lambda, and other critical services
        try:
            response = self.dynamodb_health_check(region)
            return response.status_code == 200
        except Exception:
            return False
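One way to implement the health probe the class above relies on is a cheap DynamoDB read with aggressive timeouts. This is a sketch; health-check-table is a hypothetical table name:

import boto3
from botocore.config import Config

def dynamodb_region_healthy(region, table_name='health-check-table'):
    """Return True if a lightweight DynamoDB read succeeds quickly in the given region."""
    client = boto3.client(
        'dynamodb',
        region_name=region,
        config=Config(connect_timeout=2, read_timeout=2, retries={'max_attempts': 1})
    )
    try:
        # Cheap, strongly consistent point read against a known key
        client.get_item(
            TableName=table_name,
            Key={'id': {'S': 'health-check'}},
            ConsistentRead=True
        )
        return True
    except Exception:
        return False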
Active-Active vs Active-Passive
Active-Passive (Traditional DR):

Configuration:
  Primary: us-east-1
    - Handles 100% of traffic
    - Full production workload
    - Real-time data
  Secondary: us-west-2
    - Standby mode
    - Receives data replication
    - No production traffic

Failover:
  - Manual or automated trigger
  - DNS update (5-15 minutes)
  - Application restart
  - Total failover time: 10-30 minutes

Costs: Lower (only primary active)
Complexity: Medium
RTO: 10-30 minutes
RPO: Minutes to hours

Active-Active (Modern Approach):

Configuration:
  Region 1: us-east-1
    - Handles 50% of traffic
    - Full production workload
    - Writes to both regions
  Region 2: us-west-2
    - Handles 50% of traffic
    - Full production workload
    - Writes to both regions

Failover:
  - Automatic detection (seconds)
  - Traffic rebalancing (instant)
  - No DNS changes needed
  - Total failover time: < 1 minute

Costs: Higher (both regions active)
Complexity: High
RTO: < 1 minute
RPO: Near-zero
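For the active-passive pattern, the DNS failover step is typically implemented with Route 53 failover routing. A sketch of the idea; the hosted zone ID, health check ID, and endpoint names are placeholders:

import boto3

route53 = boto3.client('route53')

# Placeholder IDs and hostnames; substitute your own hosted zone, health check, and endpoints
route53.change_resource_record_sets(
    HostedZoneId='Z_EXAMPLE_ZONE_ID',
    ChangeBatch={
        'Comment': 'Active-passive failover records for api.example.com',
        'Changes': [
            {
                'Action': 'UPSERT',
                'ResourceRecordSet': {
                    'Name': 'api.example.com',
                    'Type': 'CNAME',
                    'SetIdentifier': 'primary-us-east-1',
                    'Failover': 'PRIMARY',
                    'TTL': 60,
                    'ResourceRecords': [{'Value': 'api-us-east-1.example.com'}],
                    'HealthCheckId': 'EXAMPLE-HEALTH-CHECK-ID'
                }
            },
            {
                'Action': 'UPSERT',
                'ResourceRecordSet': {
                    'Name': 'api.example.com',
                    'Type': 'CNAME',
                    'SetIdentifier': 'secondary-us-west-2',
                    'Failover': 'SECONDARY',
                    'TTL': 60,
                    'ResourceRecords': [{'Value': 'api-us-west-2.example.com'}]
                }
            }
        ]
    }
)

With health-check-based failover, Route 53 stops answering with the primary record as soon as the primary health check fails, so the "DNS update" step happens automatically rather than by hand.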
DynamoDB Global Tables
AWS’s solution for multi-region DynamoDB:
import boto3

# Create a Global Table (the table must already exist, with streams enabled, in each listed region)
dynamodb = boto3.client('dynamodb')

# Name of the table to replicate across regions
table_name = 'critical-application-data'

response = dynamodb.create_global_table(
    GlobalTableName=table_name,
    ReplicationGroup=[
        {'RegionName': 'us-east-1'},
        {'RegionName': 'us-west-2'},
        {'RegionName': 'eu-west-1'}
    ]
)

# Benefits:
# - Automatic multi-region replication
# - Sub-second replication lag
# - Local read/write in each region
# - Automatic conflict resolution

# Trade-offs:
# - Higher costs (3x storage, cross-region data transfer)
# - Eventual consistency between regions
# - Increased complexity
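Once the global table exists, each region's application instances simply read and write their local replica. A short sketch, assuming 'id' is the table's partition key and the item shown is purely illustrative:

import boto3

# Each region's application talks to its local replica of the global table
east = boto3.resource('dynamodb', region_name='us-east-1').Table('critical-application-data')
west = boto3.resource('dynamodb', region_name='us-west-2').Table('critical-application-data')

# Write in one region...
east.put_item(Item={'id': 'user-123', 'plan': 'pro'})

# ...and read it from another once replication (typically sub-second) catches up
item = west.get_item(Key={'id': 'user-123'}).get('Item')
print(item)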
Circuit Breaker Pattern
Protect your application from cascading failures:
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, timeout=60):
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.timeout = timeout
        self.last_failure_time = None
        self.state = "CLOSED"  # CLOSED, OPEN, HALF_OPEN

    def call(self, func):
        if self.state == "OPEN":
            if self.should_attempt_reset():
                self.state = "HALF_OPEN"
            else:
                raise Exception("Circuit breaker is OPEN")
        try:
            result = func()
            self.on_success()
            return result
        except Exception:
            self.on_failure()
            raise

    def on_success(self):
        self.failure_count = 0
        self.state = "CLOSED"

    def on_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        if self.failure_count >= self.failure_threshold:
            self.state = "OPEN"

    def should_attempt_reset(self):
        return (time.time() - self.last_failure_time) > self.timeout

# Usage (assumes a boto3 DynamoDB client named `dynamodb` and a string partition key 'id')
dynamodb_breaker = CircuitBreaker(failure_threshold=5, timeout=60)

def get_data_from_dynamodb(key):
    return dynamodb_breaker.call(
        lambda: dynamodb.get_item(TableName='my-table', Key={'id': {'S': key}})
    )
Graceful Degradation
Design applications to degrade gracefully:
class ResilientApplication {
  async getData(userId) {
    try {
      // Primary data source: DynamoDB
      return await this.getDynamoDBData(userId);
    } catch (dynamoError) {
      console.error('DynamoDB unavailable:', dynamoError);
      try {
        // Fallback 1: Read replica or cache
        return await this.getCachedData(userId);
      } catch (cacheError) {
        console.error('Cache unavailable:', cacheError);
        try {
          // Fallback 2: Secondary database
          return await this.getBackupDBData(userId);
        } catch (backupError) {
          // Final fallback: Degraded mode with limited data
          return this.getDegradedData(userId);
        }
      }
    }
  }

  getDegradedData(userId) {
    // Return essential data only
    return {
      userId: userId,
      status: 'limited_functionality',
      message: 'Some features temporarily unavailable'
    };
  }
}
Chaos Engineering
Proactively test failure scenarios:
# AWS Fault Injection Simulator example
import boto3

fis = boto3.client('fis')

# Create experiment template
experiment_template = {
    'description': 'Simulate DynamoDB throttling',
    'actions': {
        'ThrottleDynamoDB': {
            'actionId': 'aws:dynamodb:throttle-requests',
            'parameters': {
                'throttlePercentage': '90',
                'duration': 'PT10M'  # 10 minutes
            },
            'targets': {
                'Tables': 'DynamoDBTables'
            }
        }
    },
    'targets': {
        'DynamoDBTables': {
            'resourceType': 'aws:dynamodb:table',
            'selectionMode': 'ALL',
            'resourceTags': {
                'Environment': 'staging'
            }
        }
    },
    'stopConditions': [
        {
            'source': 'aws:cloudwatch:alarm',
            'value': 'critical-error-rate-alarm'
        }
    ]
}

# Regular testing schedule
# - Weekly: Minor disruptions (10% throttling)
# - Monthly: Moderate disruptions (50% throttling)
# - Quarterly: Severe disruptions (90% throttling)
# - Annually: Full region failure simulation
Monitoring and Alerting
Multi-Region Health Checks
# CloudWatch Synthetics Canary
canary_configuration:
  name: multi-region-health-check
  regions:
    - us-east-1
    - us-west-2
    - eu-west-1
  checks:
    - name: dynamodb_write
      frequency: 1_minute
      timeout: 10_seconds
      alert_threshold: 2_consecutive_failures
    - name: dynamodb_read
      frequency: 1_minute
      timeout: 5_seconds
      alert_threshold: 2_consecutive_failures
    - name: lambda_execution
      frequency: 1_minute
      timeout: 30_seconds
      alert_threshold: 3_consecutive_failures
  alerting:
    critical:
      - all_regions_failing
      - primary_region_down_over_5_minutes
    warning:
      - single_region_degraded
      - elevated_error_rates
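Under the hood, a canary like this boils down to a small script that exercises the critical path and publishes a pass/fail metric. A minimal sketch; the table name, metric namespace, and metric name are illustrative:

import time
import boto3

def run_dynamodb_canary(region, table_name='canary-health'):
    """Write and read a canary item in one region, then publish the result to CloudWatch."""
    dynamodb = boto3.client('dynamodb', region_name=region)
    cloudwatch = boto3.client('cloudwatch', region_name=region)
    healthy = 1.0
    try:
        key = {'id': {'S': 'canary'}}
        dynamodb.put_item(
            TableName=table_name,
            Item={**key, 'ts': {'N': str(int(time.time()))}}
        )
        dynamodb.get_item(TableName=table_name, Key=key, ConsistentRead=True)
    except Exception:
        healthy = 0.0
    cloudwatch.put_metric_data(
        Namespace='MultiRegionHealth',
        MetricData=[{
            'MetricName': 'DynamoDBCanarySuccess',
            'Dimensions': [{'Name': 'Region', 'Value': region}],
            'Value': healthy
        }]
    )
    return healthy == 1.0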
Dependency Mapping
class ServiceDependencyMonitor:
    def __init__(self):
        self.dependencies = {
            'api-service': [
                'dynamodb-users',
                'dynamodb-sessions',
                'lambda-auth',
                'cognito'
            ],
            'lambda-auth': [
                'dynamodb-tokens',
                'cognito'
            ],
            'data-pipeline': [
                'dynamodb-events',
                's3-data-lake',
                'lambda-processor'
            ]
        }

    def check_service_health(self, service):
        # Check direct service health
        service_healthy = self.direct_health_check(service)
        if not service_healthy:
            return {'status': 'unhealthy', 'reason': 'direct_failure'}

        # Check all dependencies recursively
        for dependency in self.dependencies.get(service, []):
            dep_healthy = self.check_service_health(dependency)
            if dep_healthy['status'] != 'healthy':
                return {
                    'status': 'degraded',
                    'reason': f'dependency_failure: {dependency}'
                }

        return {'status': 'healthy'}
Cost Considerations
Multi-Region Cost Analysis
# Cost comparison for a typical application
class MultiRegionCostCalculator:
    def __init__(self):
        self.dynamodb_write_price = 1.25      # per million writes
        self.data_transfer_price = 0.02       # per GB
        self.lambda_price = 0.0000166667      # per GB-second

    def calculate_single_region_cost(self, writes_per_month, data_gb):
        dynamodb_cost = (writes_per_month / 1_000_000) * self.dynamodb_write_price
        return {
            'dynamodb': dynamodb_cost,
            'data_transfer': 0,  # No cross-region transfer
            'total': dynamodb_cost
        }

    def calculate_multi_region_cost(self, writes_per_month, data_gb, regions=3):
        # DynamoDB Global Tables cost: writes are replicated to every region
        dynamodb_cost = (writes_per_month / 1_000_000) * self.dynamodb_write_price * regions
        # Replication data transfer between regions
        replication_transfer = data_gb * regions * self.data_transfer_price
        return {
            'dynamodb': dynamodb_cost,
            'data_transfer': replication_transfer,
            'total': dynamodb_cost + replication_transfer
        }

# Example calculation
calculator = MultiRegionCostCalculator()

# Application with 100M writes/month, 1TB data
single_region = calculator.calculate_single_region_cost(100_000_000, 1000)
multi_region = calculator.calculate_multi_region_cost(100_000_000, 1000, 3)

print(f"Single Region: ${single_region['total']:.2f}/month")
print(f"Multi Region: ${multi_region['total']:.2f}/month")
print(f"Additional Cost: ${multi_region['total'] - single_region['total']:.2f}/month")

# Typical output:
# Single Region: $125.00/month
# Multi Region: $435.00/month
# Additional Cost: $310.00/month (248% increase)
Cost vs Availability Trade-offs
Availability Level | Architecture                                   | Monthly Cost | Downtime/Year
-------------------|------------------------------------------------|--------------|---------------
99.0% (2 nines)    | Single AZ                                      | $1,000       | 87.6 hours
99.9% (3 nines)    | Multi-AZ                                       | $1,500       | 8.76 hours
99.95%             | Multi-Region                                   | $3,500       | 4.38 hours
99.99% (4 nines)   | Multi-Region Active-Active                     | $5,000       | 52.6 minutes
99.999% (5 nines)  | Multi-Region Active-Active, Multiple Providers | $10,000+     | 5.26 minutes
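The downtime column is simple arithmetic on the availability target (the cost figures are illustrative). For reference:

# Allowed downtime per year for a given availability target
HOURS_PER_YEAR = 24 * 365  # 8,760

def downtime_per_year(availability_percent):
    """Return the allowed downtime per year as (hours, minutes)."""
    hours = HOURS_PER_YEAR * (1 - availability_percent / 100)
    return hours, hours * 60

for target in (99.0, 99.9, 99.95, 99.99, 99.999):
    hours, minutes = downtime_per_year(target)
    print(f'{target}% -> {hours:.2f} hours/year ({minutes:.1f} minutes)')

# e.g. 99.99% -> 0.88 hours/year (52.6 minutes)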
Industry Response and Changes
AWS Improvements
Following the incident, AWS announced several improvements:
Enhanced Power Infrastructure
- Increased isolation between availability zones
- Improved backup power systems
- Better monitoring of electrical systems
Control Plane Capacity
- Increased capacity for recovery scenarios
- Better throttling and queuing mechanisms
- Improved graceful degradation
Communication
- Faster service health dashboard updates
- More detailed post-incident reports
- Better customer notification systems
Customer Behavior Changes
Organizations responded by:
Immediate Actions:
- Emergency multi-region deployments
- Increased monitoring and alerting
- Incident response plan updates
- Management escalation
Long-term Changes:
- Multi-region by default for critical services
- Increased disaster recovery testing
- Chaos engineering adoption
- Multi-cloud strategies (some organizations)
- Reduced dependency on US-EAST-1
Recommendations for Engineering Teams
Tier-Based Approach
Not all services require the same level of resilience:
class ServiceTierStrategy:
    def __init__(self):
        self.tiers = {
            'tier_1_critical': {
                'availability_target': 99.99,
                'architecture': 'active-active multi-region',
                'rto': '< 1 minute',
                'rpo': '< 1 minute',
                'examples': ['authentication', 'payment processing', 'core APIs']
            },
            'tier_2_important': {
                'availability_target': 99.9,
                'architecture': 'active-passive multi-region',
                'rto': '< 15 minutes',
                'rpo': '< 5 minutes',
                'examples': ['user profiles', 'content delivery', 'analytics']
            },
            'tier_3_standard': {
                'availability_target': 99.0,
                'architecture': 'multi-AZ single region',
                'rto': '< 1 hour',
                'rpo': '< 1 hour',
                'examples': ['reporting', 'batch jobs', 'internal tools']
            }
        }

    def get_recommendation(self, service_type):
        # Return the appropriate architecture based on criticality
        for tier_name, tier in self.tiers.items():
            if service_type in tier['examples']:
                return {'tier': tier_name, **tier}
        # Default to the standard tier for unclassified services
        return {'tier': 'tier_3_standard', **self.tiers['tier_3_standard']}
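For example, classifying a couple of services with the strategy above (service names are illustrative):

strategy = ServiceTierStrategy()

print(strategy.get_recommendation('payment processing')['architecture'])
# active-active multi-region

print(strategy.get_recommendation('reporting')['architecture'])
# multi-AZ single region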
Testing Checklist
## Disaster Recovery Testing Checklist
### Monthly Tests
- [ ] Verify backup integrity
- [ ] Test monitoring and alerting
- [ ] Review incident response procedures
- [ ] Check failover automation
### Quarterly Tests
- [ ] Full failover to secondary region
- [ ] Restore from backup in different region
- [ ] Simulate partial service degradation
- [ ] Test cross-region replication lag
- [ ] Validate data consistency after failover
### Annual Tests
- [ ] Complete region failure simulation
- [ ] Multi-service cascade failure test
- [ ] Extended outage scenario (4+ hours)
- [ ] Test with full production load
- [ ] Validate financial impact estimates
### Continuous
- [ ] Monitor replication lag
- [ ] Track cross-region latency
- [ ] Review error rates and patterns
- [ ] Update runbooks and documentation
Conclusion
The US-EAST-1 DynamoDB outage of June 2023 serves as a stark reminder that even the most reliable cloud infrastructure can fail. The incident highlighted several critical lessons:
- Regional Concentration Risk: Over-reliance on a single region creates systemic risk
- Cascading Failures: Dependencies between services can amplify outages
- Control Plane Limitations: Recovery can be hindered by overwhelmed control systems
- Multi-Region is Essential: Critical services require true multi-region architecture
- Testing is Crucial: Regular disaster recovery testing reveals weaknesses
- Cost vs Reliability: Higher availability requires significant investment
Key Takeaways for Teams
Immediate Actions:
- Audit current single-region dependencies
- Implement basic multi-region failover for critical services
- Enhance monitoring and alerting
- Document and test incident response procedures
Long-term Strategy:
- Design for multi-region from the start
- Implement chaos engineering practices
- Balance costs with availability requirements
- Continuously test and improve resilience
Cultural Changes:
- Treat outages as learning opportunities
- Invest in observability and monitoring
- Prioritize reliability alongside features
- Foster a culture of operational excellence
The outage affected thousands of organizations and millions of users, but it also provided invaluable lessons in building resilient distributed systems. By learning from this incident and implementing robust multi-region architectures, development teams can better protect their applications and users from future disruptions.
As cloud infrastructure becomes even more central to modern applications, the lessons from this outage become increasingly important. The question is no longer whether failures will occur, but how quickly and gracefully we can recover when they inevitably do.