Reliability: Ability to recover from infrastructure and application issues
Adapt to changing demands in load
Best Practices
Automate recovery from failure
Health checks and Auto Scaling
Managed services like RDS (Multi-AZ) can automatically fail over to a standby
Scale horizontally (reduces the impact of a single failure)
Maintain Redundancy
Multiple Direct Connect connections
Multiple Regions and Availability Zones
Prefer serverless architectures
Prefer loosely coupled architectures: SQS, SNS
Adhere to Distributed System Best Practices
Use Amazon API Gateway for throttling requests
AWS SDKs provide retries with exponential backoff
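To make the last point concrete, here is a minimal sketch of retry with exponential backoff. It is not the AWS SDK's actual implementation. The names call_with_retries and backoff_delays, the default base and cap values, and the "full jitter" randomization are assumptions chosen for illustration.

```python
import random
import time


def backoff_delays(max_attempts, base=0.1, cap=5.0):
    """Upper bound of the wait before each retry: base * 2**n, limited to cap."""
    # max_attempts calls leave room for at most max_attempts - 1 waits.
    return [min(cap, base * (2 ** n)) for n in range(max_attempts - 1)]


def call_with_retries(operation, max_attempts=5, base=0.1, cap=5.0,
                      sleep=time.sleep, rng=random.random):
    """Call operation(); on an exception, wait and retry up to max_attempts calls.

    Hypothetical helper: sleep and rng are injectable so the behavior
    can be checked without real waiting or real randomness.
    """
    delays = backoff_delays(max_attempts, base, cap)
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            # "Full jitter": wait a random fraction of the exponential bound,
            # so many clients retrying at once do not all hit the service together.
            sleep(rng() * delays[attempt])
```

A real client would normally catch only errors known to be transient (throttling, timeouts) instead of every Exception, and would retry only operations that are safe to repeat.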