The terms “fault” and “failure” are often used interchangeably but mean different things. When a part of a system experiences failure but the system as a whole can still continue to operate reliably, we label the system as fault-tolerant or resilient. The system can tolerate components deviating from the spec but still function correctly. A failure occurs when the system as a whole fails. No system can be made fault-tolerant from all types of possible faults and can always potentially fail as a whole.
Netflix’s Chaos Monkey is an example of a tool designed to test the resiliency of services in the Netflix ecosystem. The tool randomly terminates service instances to uncover service failures.