All the work described above for the Active-Active project is just the beginning. The project still has an upcoming Phase 2, where we’ll focus on the operational aspects of all of our multi-regional tooling and automate as many of the current manual steps as we can. We’ll focus on minimizing both the time we need to decide to execute a failover and the time it takes to fail over all of the traffic.
We’re also continuing to tackle some of the more difficult problems. For example, how do you deal with dependencies that respond slowly or return errors, but only some of the time? Arguably, this is harder than dealing with Chaos-type scenarios: when something is not there, or is consistently failing, it’s much easier to decide what to do. To help us learn how our systems deal with such scenarios, we have Latency Monkey, which can inject latencies and errors (both client- and server-side) at a given frequency or distribution.
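To make the idea of injecting faults "at a given frequency / distribution" concrete, here is a minimal Python sketch of a client-side wrapper. It is not Latency Monkey's actual implementation; the names `FaultInjector` and `InjectedFault`, the parameters, and the choice of an exponential delay distribution are all assumptions made for illustration.

```python
import random
import time
from dataclasses import dataclass, field
from typing import Any, Callable


class InjectedFault(Exception):
    """Hypothetical error raised in place of a real dependency failure."""


@dataclass
class FaultInjector:
    """Wraps calls to a dependency and degrades some fraction of them.

    All names and defaults here are illustrative assumptions.
    """
    error_rate: float = 0.0      # fraction of calls that fail outright
    latency_rate: float = 0.0    # fraction of calls that are delayed
    mean_delay_s: float = 0.0    # mean of the exponential delay, in seconds
    rng: random.Random = field(default_factory=random.Random)
    sleep: Callable[[float], None] = time.sleep  # replaceable for testing

    def call(self, fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        # Delay a configurable fraction of calls; the delay length is drawn
        # from an exponential distribution with the configured mean.
        if self.mean_delay_s > 0 and self.rng.random() < self.latency_rate:
            self.sleep(self.rng.expovariate(1.0 / self.mean_delay_s))
        # Independently fail a configurable fraction of calls.
        if self.rng.random() < self.error_rate:
            raise InjectedFault("injected failure")
        return fn(*args, **kwargs)
```

Under these assumptions, `FaultInjector(error_rate=0.05, latency_rate=0.2, mean_delay_s=0.5).call(fetch_profile, user_id)` would delay roughly one call in five and fail roughly one in twenty, where `fetch_profile` stands in for any dependency call. Passing a seeded `random.Random` and a stub `sleep` keeps an experiment repeatable and fast to run in tests.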