What we have done so far does not qualify as an experiment. We simply executed an action that resulted in the destruction of a Pod. The best we can get from that is a satisfying feeling like “oh, look at me, I destroyed stuff.” However, the goal of chaos engineering is not to destroy for the sake of feeling better or for the purpose of destruction itself. The objective is to find weak points in our clusters, applications, data centers, and other parts of our systems. Therefore, we typically start by defining a steady state that is validated both before and after the actions.
We define what the normal state of something should look like. If that something is in the state we defined, we introduce some chaos into our cluster by destroying stuff. After that, we take another look at the state and check whether it is still the same.
So, if the state is the same both before and after the actions, we can conclude that our cluster is fault-tolerant and resilient to that kind of failure and that everything is just peachy. In the case of Chaos Toolkit, we accomplish this by defining a steady-state hypothesis.
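To make the flow concrete, here is a minimal sketch of the validate, act, validate loop described above. It is not how Chaos Toolkit is implemented, and every name in it (`run_experiment`, `FakeCluster`, the status labels) is hypothetical and invented for illustration. The fake cluster stands in for a real one so that the example can run anywhere.

```python
from typing import Callable


def run_experiment(
    probe: Callable[[], int],
    tolerance: int,
    action: Callable[[], None],
) -> dict:
    """Validate the steady state, run the action, validate again.

    All names and status labels here are hypothetical.
    """
    before = probe()
    if before != tolerance:
        # The system is not in the expected state to begin with,
        # so there is no point in introducing chaos.
        return {"status": "aborted", "before": before, "after": None}

    action()  # introduce the chaos

    after = probe()
    status = "passed" if after == tolerance else "deviated"
    return {"status": status, "before": before, "after": after}


class FakeCluster:
    """A stand-in for a real cluster (an assumption for this sketch)."""

    def __init__(self, running: int, self_healing: bool) -> None:
        self.running = running
        self.self_healing = self_healing

    def count_running(self) -> int:
        return self.running

    def terminate_pod(self) -> None:
        self.running -= 1
        if self.self_healing:
            # Something recreates the Pod we just destroyed.
            self.running += 1


if __name__ == "__main__":
    for healing in (True, False):
        cluster = FakeCluster(running=1, self_healing=healing)
        result = run_experiment(
            probe=cluster.count_running,
            tolerance=1,
            action=cluster.terminate_pod,
        )
        print(f"self_healing={healing}: {result}")
```

The point of the sketch is the order of operations: the same probe runs twice with the same tolerance, and the action sits between the two checks. If the first check fails, the action never runs.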
Inspecting the definition of terminate-pod-ssh.yaml
We’re going to take a look at yet another definition, one that specifies the state that will be validated before and after the actions.