A few days ago I installed a new Kubernetes cluster based on Rancher Kubernetes Engine (RKE). After some tests, I realized that my pods were not rescheduled after a node failed, or at least not as quickly as they should have been.
As I knew it
Let’s assume we are running a cluster on Kubernetes 1.12 (or lower), provisioned by RKE with no additional configuration. In this case, rescheduling pods after a node failure can take five minutes or more (details are described here).
Because five minutes or more might be too long in some environments, there is an easy way to override the defaults with lower settings. To do so, update your cluster.yml with lower values for the relevant parameters, including pod-eviction-timeout (more details here).
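As a rough sketch, the relevant part of cluster.yml could look like the following. The argument names are the regular kube-controller-manager and kubelet flags; the values are assumptions chosen for illustration and should be tuned to your environment.

```yaml
services:
  kube-controller:
    extra_args:
      # How often the node controller checks the status of each node
      node-monitor-period: "2s"
      # How long a node may be unresponsive before it is marked unhealthy
      node-monitor-grace-period: "16s"
      # Grace period before pods on a failed node are deleted
      pod-eviction-timeout: "30s"
  kubelet:
    extra_args:
      # How often the kubelet reports its node status
      node-status-update-frequency: "4s"
```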
Afterward, run “rke up” to apply the updated cluster configuration. From then on, your pods will be rescheduled according to the new values.
As it was in this case
This time, I took exactly the same steps as described above, but my pods behaved as if I hadn’t adjusted the values. I redeployed the cluster several times and changed different settings. After some time I realized that I had installed the latest available version, which in my case was Kubernetes 1.13.4. After some chats with Sebastiaan, we finally found the root cause as well as the solution.
Kubernetes 1.13 promoted “Taint Based Evictions” to a beta feature, which means it is now enabled by default. With this feature, the node controller automatically adds taints to a node when it fails, and pods are evicted based on those taints. The old behavior, based on the node’s Ready state, is no longer active (more details are available here).
Because of this feature, the “pod-eviction-timeout” parameter is no longer used and needs to be replaced with two parameters on the API server, default-not-ready-toleration-seconds and default-unreachable-toleration-seconds.
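A minimal sketch of the corresponding cluster.yml section, assuming a 30-second tolerance for both conditions (the value is an illustrative assumption; the Kubernetes default for both flags is 300 seconds):

```yaml
services:
  kube-api:
    extra_args:
      # Seconds a pod tolerates the node.kubernetes.io/not-ready taint
      default-not-ready-toleration-seconds: "30"
      # Seconds a pod tolerates the node.kubernetes.io/unreachable taint
      default-unreachable-toleration-seconds: "30"
```

As with the older settings, the change only takes effect after another “rke up”.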
With those two default toleration parameters, Kubernetes automatically adds a toleration configuration to every newly created pod. Existing pods are not changed, so it is very important to redeploy all pods to ensure they pick up the tolerations:
- effect: NoExecute
- effect: NoExecute
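The two entries above show only their first line. Assuming the API server was configured with 30 seconds for both parameters, the tolerations section of a redeployed pod would be expected to look roughly like this. You can check it with “kubectl get pod <pod-name> -o yaml”.

```yaml
tolerations:
# Evict the pod once the node has been not-ready for 30 seconds
- effect: NoExecute
  key: node.kubernetes.io/not-ready
  operator: Exists
  tolerationSeconds: 30
# Evict the pod once the node has been unreachable for 30 seconds
- effect: NoExecute
  key: node.kubernetes.io/unreachable
  operator: Exists
  tolerationSeconds: 30
```

If a pod still shows 300 for tolerationSeconds, it was most likely created before the API server settings were changed and needs to be redeployed.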
Thanks again to Sebastiaan for helping with this issue.