Pod rescheduling after a node failure with RKE and Kubernetes

A few days ago I installed a new Kubernetes cluster based on Rancher Kubernetes Engine (RKE). After some tests, I realized that my pods were not being rescheduled after a node failure, or at least not as quickly as they should have been.

In this post, I would like to share my findings as well as some insights. Thanks to Sebastiaan van Steenis from Rancher, who helped me debug this issue.


As I knew it

Let’s assume we are running a Kubernetes cluster based on Kubernetes 1.12 (or lower), provisioned by RKE with no additional configuration. In this case, rescheduling a pod after a node failure can take 5 minutes or more (details are described here).

Because 5+ minutes may be too long in some environments, there is an easy way to override the defaults with lower values. You need to update your cluster.yml with the following parameters (more details here):

services:
  kube-controller:
    extra_args:
      node-monitor-period: Xs
      node-monitor-grace-period: Xs
      pod-eviction-timeout: Xs
  kubelet:
    extra_args:
      node-status-update-frequency: Xs

Afterwards, you need to run “rke up” to apply the new cluster configuration. From then on, your pods will be rescheduled based on your new values.
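For illustration, here is the same snippet with concrete values instead of the Xs. The numbers are examples, not recommendations; note that node-monitor-grace-period must be several times larger than the kubelet’s node-status-update-frequency, so the controller tolerates a few missed status updates before marking a node unhealthy:

```yaml
services:
  kube-controller:
    extra_args:
      # how often the NodeController checks node status (upstream default: 5s)
      node-monitor-period: 2s
      # must be a multiple of node-status-update-frequency (upstream default: 40s)
      node-monitor-grace-period: 16s
      # how long to wait before evicting pods from a failed node (upstream default: 5m)
      pod-eviction-timeout: 30s
  kubelet:
    extra_args:
      # how often the kubelet posts its node status (upstream default: 10s)
      node-status-update-frequency: 4s
```

With these example values, pods on a failed node would be rescheduled after roughly 45 seconds instead of 5+ minutes.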

As it was in this case

This time, I took exactly the same steps as described above, but my pods behaved as if I hadn’t adjusted the values at all. I redeployed the cluster several times and changed various settings. Eventually I realized that I had installed the latest available version, which in my case meant Kubernetes 1.13.4. After some chats with Sebastiaan, we finally found the root cause as well as the solution.

Kubernetes 1.13 introduced “Taint based Evictions” as a beta feature that is enabled by default. With this feature, the NodeController automatically adds taints to a node in case of failure, and pods are evicted based on the tolerations they carry. The old behavior based on the node’s ready state is no longer active (more details are available here).

Because of this new feature, the “pod-eviction-timeout” parameter is no longer honored and needs to be replaced with two new parameters:

services:
  kube-controller:
    extra_args:
      node-monitor-period: Xs
      node-monitor-grace-period: Xs
  kubelet:
    extra_args:
      node-status-update-frequency: Xs
  kube-api:
    extra_args:
      default-not-ready-toleration-seconds: X
      default-unreachable-toleration-seconds: X
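Again for illustration, a version with concrete example values (illustrative only; pick values that match your environment’s tolerance for failover delay versus false positives):

```yaml
services:
  kube-controller:
    extra_args:
      node-monitor-period: 2s
      node-monitor-grace-period: 16s
  kubelet:
    extra_args:
      node-status-update-frequency: 4s
  kube-api:
    extra_args:
      # seconds a pod tolerates the not-ready / unreachable node taints before eviction
      default-not-ready-toleration-seconds: 30
      default-unreachable-toleration-seconds: 30
```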

With those two default toleration parameters, Kubernetes automatically adds a matching toleration to every newly created pod. Existing pods keep their old tolerations, so it is very important to redeploy all pods to ensure the new toleration is applied to every one of them:

tolerations:
- effect: NoExecute
  key: node.kubernetes.io/not-ready
  operator: Exists
  tolerationSeconds: 30
- effect: NoExecute
  key: node.kubernetes.io/unreachable
  operator: Exists
  tolerationSeconds: 30
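One way to trigger the redeployment and then verify that the tolerations were injected is sketched below. The deployment and pod names are placeholders; these commands assume a working kubectl context against the cluster (on kubectl 1.15+ you could use “kubectl rollout restart” instead of the patch):

```shell
# Force new pods by bumping an annotation on the pod template
# ("my-app" is a placeholder name)
kubectl patch deployment my-app --patch \
  '{"spec":{"template":{"metadata":{"annotations":{"redeployed-at":"'"$(date +%s)"'"}}}}}'

# Inspect the tolerations that were injected into one of the new pods
kubectl get pod my-app-7d4b9c6f5-abcde -o jsonpath='{.spec.tolerations}'
```

The output should include the node.kubernetes.io/not-ready and node.kubernetes.io/unreachable tolerations with your configured tolerationSeconds.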

Thanks again to Sebastiaan for helping with this issue.