What if a Node Gets Killed?
If, instead of destroying the container, we terminated the server where the Pod is running, the result would, from a functional perspective, be the same. The Pod would be rescheduled to a healthy node, Jenkins would start again, and it would restore its state from the EBS volume. Or, at least, that’s what we’d hope. However, such behavior is not guaranteed in our cluster.
We have only two worker nodes, distributed across two (out of three) availability zones. If the node that hosted Jenkins failed, we’d be left with only one node. To be more precise, we’d have only one worker node running in the cluster until the auto-scaling group detects that an EC2 instance is missing and recreates it. During those few minutes, the single node we’re left with would not be in the same zone as the failed one. As we already mentioned, each EBS volume is tied to a zone, and the one we mounted into the Jenkins Pod would not be associated with the zone where the remaining EC2 instance is running.
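If we’d like to confirm that layout ourselves, we could list the nodes and the PersistentVolumes and compare their zone labels. The commands below are only a sketch; the exact label key depends on the Kubernetes version (older clusters use failure-domain.beta.kubernetes.io/zone, newer ones topology.kubernetes.io/zone).

```bash
# List the worker nodes together with the availability zone each one runs in
kubectl get nodes \
    --label-columns topology.kubernetes.io/zone

# Show the PersistentVolumes in full, including the zone information
# recorded on the one bound to the Jenkins claim
kubectl get pv --output yaml
```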
As a result, Kubernetes could not attach the EBS volume to the remaining node and, therefore, could not recreate the failed container until the failed EC2 instance is replaced.
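The reason the scheduler refuses to place the Pod on the surviving node is visible in the PersistentVolume itself. The manifest that follows is a simplified sketch, not the exact object from our cluster; the name, the volume ID, and the zone are illustrative. The nodeAffinity section restricts the volume, and any Pod that uses it, to nodes in the zone where the EBS volume was created.

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  # Illustrative name; a dynamically provisioned PV gets a generated name
  name: jenkins
  labels:
    topology.kubernetes.io/zone: us-east-1a
spec:
  capacity:
    storage: 10Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Delete
  awsElasticBlockStore:
    # Illustrative ID; an EBS volume exists in exactly one availability zone
    volumeID: vol-0123456789abcdef0
    fsType: ext4
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: topology.kubernetes.io/zone
          operator: In
          values:
          - us-east-1a
```

With only a node in a different zone available, that affinity cannot be satisfied, so the Pod stays in the Pending state. On older clusters the same restriction is enforced through the failure-domain.beta.kubernetes.io/zone label and the scheduler’s volume zone predicate rather than through nodeAffinity, but the effect is the same.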
Chances are that the new EC2 instance would not be in the same zone as the failed server. Since we’re using three availability zones, and one of them already hosts the surviving EC2 instance, the auto-scaling group would recreate the failed server in one of the other two zones. That gives us a fifty percent chance that the new instance would end up in the same zone as the failed server and, therefore, in the zone of the EBS volume. Those are not good odds.