In general, the failure rates an organization deals with depend largely on two factors: the scale of operational deployment and the velocity of change. If both scale and velocity are small, most of the time things just work. As scale grows, even with slow velocity, the chance of hardware failure increases. Conversely, even at small scale, a fast enough velocity of change increases the chance of software failure. And if you’re running at scale while still pursuing high velocity, things will break all the time.
Of course, not all failures are created equal. The failures we’ll focus on in this post are the most important and the most difficult to deal with: complete and prolonged service outages, with unhappy customers flooding customer service lines, venting their frustration on Twitter, and articles popping up across multiple publications announcing “service X is down!”.
At Netflix, our internal availability goal is 99.99%, which does not leave much time for our services to be down. So in addition to deploying our services across multiple instances and Availability Zones, we decided to deploy them across multiple AWS Regions as well. A complete regional infrastructure outage is extremely unlikely, but our pace of change sometimes breaks critical services in a region, and we wanted to make Netflix resilient to the failure of any of its underlying dependencies. In doing so, we’re leveraging the principles of Isolation and Redundancy: a failure of any kind in one Region should not affect services running in another, and a network partitioning event should not affect quality of service in either Region.
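To make that downtime budget concrete, here is a quick back-of-the-envelope calculation (an illustrative sketch, not Netflix code): a 99.99% availability target allows only on the order of minutes of downtime per year.

```python
# Illustrative calculation: how much downtime a given availability
# target allows per year and per 30-day month.
MINUTES_PER_YEAR = 365 * 24 * 60   # 525,600 minutes
MINUTES_PER_MONTH = 30 * 24 * 60   # 43,200 minutes

def downtime_budget(availability: float) -> tuple[float, float]:
    """Return allowed downtime in minutes per year and per 30-day month."""
    unavailability = 1.0 - availability
    return MINUTES_PER_YEAR * unavailability, MINUTES_PER_MONTH * unavailability

per_year, per_month = downtime_budget(0.9999)
print(f"99.99% availability -> {per_year:.1f} min/year, {per_month:.1f} min/month")
# 99.99% availability -> 52.6 min/year, 4.3 min/month
```

In other words, roughly 52 minutes of total downtime per year, which a single prolonged regional outage could easily consume on its own.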