Chaos Kong. This is the biggest outage simulation we’ve done so far. In a real-world scenario, this would have been triggered by a severe outage that prompted us to rapidly shift user traffic to another region, and it would inevitably have resulted in some users experiencing total loss or lower quality of service. For the outage simulation, we did not want to degrade the user experience. We augmented what would normally have been an emergency traffic-shifting exercise with extra steps so that users who were still routed to the “failed” region would be redirected to the healthy region. Instead of getting errors, those users would still get appropriate responses. We also shifted traffic more gradually than we normally would under emergency circumstances, to allow services to scale up appropriately and caches to warm up gradually. We didn’t switch every single traffic source, but we switched a majority, enough to prove we could take the full load of Netflix in US-West-2. We kept traffic in the west region for over 24 hours, and then gradually shifted it back to a stable 50/50 state. Below you can see what this exercise looked like. Most traffic shifted from US-East to US-West, while EU-West remained unaffected: