Even before we fully validated multi-regional isolation and redundancy via Chaos Kong, we got a chance to exercise regional failover in real life. One of our middle tier systems in one of the regions experienced a severe degradation that eventually lead to the majority of the cluster becoming unresponsive. Under normal circumstances, this would have resulted in a severe outage with many users affected for some time. This time we had additional tool at our disposal — we decided to exercise the failover and route user requests to the healthy region. Within a short time, quality of service was restored to all the users. We could then spend time triaging the root cause of the problem, deploying the fix, and subsequently routing traffic back to the now healthy region. Here is the timeline of the failover, the black line is a guide from a week before: