Improving Kubernetes reaction when nodes fail

Matias Zilli · Published in Nerd For Tech · May 29, 2021 · 4 min read

Distributed systems such as Kubernetes are designed to be robust, resilient to failures, and able to recover automatically in such scenarios, and Kubernetes accomplishes this very well. However, it's common for worker nodes to get disconnected from the master for various reasons, and in those cases you want Kubernetes to react quickly to keep the system available and reliable.

By default, you might have noticed that when a node goes down, the pods on the broken node keep running for some time and still receive requests, and those requests fail. In my opinion, that window should be shortened, because the default is too long.

To see how Kubernetes detects that a node is down, let's describe the communication between the Kubelet and the Controller Manager.

By default, the normal behavior looks like this (a quick way to inspect the current values on a running cluster is shown after the list):

  1. The Kubelet posts its status to the apiserver periodically, as specified by --node-status-update-frequency. The default value is 10s.
  2. The Controller manager checks the Kubelet status every --node-monitor-period. The default value is 5s.
  3. As long as the status was updated within --node-monitor-grace-period, the Controller manager considers the Kubelet healthy. The default value is 40s.
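To check what a running cluster is actually using, you can read the Kubelet configuration through the apiserver proxy and look at the controller manager's static pod. This is only a sketch for a kubeadm-based cluster; label names and output formatting may differ in your setup, and flags that are not listed simply fall back to the defaults above.

# Kubelet: nodeStatusUpdateFrequency is exposed by the kubelet's configz
# endpoint, proxied through the apiserver.
NODE=$(kubectl get nodes -o jsonpath='{.items[0].metadata.name}')
kubectl get --raw "/api/v1/nodes/${NODE}/proxy/configz" \
  | grep -o '"nodeStatusUpdateFrequency":"[^"]*"'

# Controller manager: in kubeadm-based clusters the flags live on the static
# pod command line; no output here means the defaults are in effect.
kubectl -n kube-system get pod -l component=kube-controller-manager \
  -o jsonpath='{.items[0].spec.containers[0].command}' \
  | tr ',' '\n' | grep -- '--node-monitor' || true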

Failure

When a node goes down, this is the workflow of what happens:

  1. The Kubelet posts its status to the masters using --node-status-update-frequency=10s.
  2. A node dies.
  3. The Controller manager checks the node status reported by the Kubelet every --node-monitor-period=5s.
  4. The Controller manager sees the node is unresponsive and waits for the grace period --node-monitor-grace-period=40s before it considers the node unhealthy and sets it to NotReady.
  5. Kube Proxy removes the endpoints pointing to pods on that node from all services, so the pods on the failed node are no longer reachable.

In such a scenario, there will be some request errors because the pods keep receiving traffic until the node is considered down and marked NotReady, roughly 45s after the failure (the 40s grace period plus up to one monitor interval).
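A simple way to observe this sequence on a live cluster (illustrative, nothing specific to this article's setup) is to keep an eye on the node list and node events while the failure plays out; the failed node's status flips from Ready to NotReady once the grace period expires.

# Watch node status transitions as the controller manager reports them.
kubectl get nodes -w

# Optionally, also watch node-related events (NodeNotReady and friends).
kubectl get events --watch --field-selector involvedObject.kind=Node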

There are a bunch of parameters to tweak in the Kubelet and the Controller Manager.

Fast Update and Fast Reaction

To improve Kubernetes reaction to down nodes, you can modify these parameters: if --node-status-update-frequency is set to 1s (10s by default), --node-monitor-period to 1s (5s by default), and --node-monitor-grace-period to 4s (40s by default), the node will be considered down after approximately 4s. These values reduce the reaction time from roughly 40s to roughly 4s.
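On a kubeadm-managed cluster, these values could be wired in roughly as follows. This is a sketch, not a drop-in file: the file name is made up, the API versions match recent kubeadm/kubelet releases, the KubeletConfiguration applies to every node, and the controller manager arguments apply to the control plane.

# Hypothetical kubeadm configuration with the tuned values,
# used at cluster creation time (kubeadm init --config fast-reaction.yaml).
cat <<'EOF' > fast-reaction.yaml
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
controllerManager:
  extraArgs:
    node-monitor-period: "1s"
    node-monitor-grace-period: "4s"
---
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
nodeStatusUpdateFrequency: "1s"
EOF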

Test Kubernetes Fast Reaction

To try this in a test environment, we can create a Kubernetes cluster using Kind or any other tool. We created a custom Kind cluster definition with the parameters specified in the previous section so that we can test the actual behavior.
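A Kind cluster definition carrying those parameters could look roughly like the one below. Treat it as a sketch rather than the exact file used for the article: the file name is illustrative, and the kubeadmConfigPatches mechanism assumes a reasonably recent Kind release that merges ClusterConfiguration and KubeletConfiguration patches.

# Hypothetical Kind cluster definition with the fast-reaction parameters.
cat <<'EOF' > kind-fast-reaction.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
# Kubelet on every node reports its status each second.
kubeadmConfigPatches:
  - |
    kind: KubeletConfiguration
    nodeStatusUpdateFrequency: "1s"
nodes:
  - role: control-plane
    # Controller manager checks every second and tolerates 4s of silence.
    kubeadmConfigPatches:
      - |
        kind: ClusterConfiguration
        controllerManager:
          extraArgs:
            node-monitor-period: "1s"
            node-monitor-grace-period: "4s"
  - role: worker
EOF

# The default cluster name "kind" gives containers named kind-control-plane and kind-worker.
kind create cluster --config kind-fast-reaction.yaml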

Then we created a Deployment of two Nginx pods, placed on the control-plane and worker nodes. We also created an Ubuntu pod on the control-plane node so we can test the availability of Nginx when the worker is down.
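The workload could look something like the manifests below. These are illustrative rather than the article's exact ones: the toleration plus the spread constraint is one way to put one Nginx replica on the control-plane node (which keeps its NoSchedule taint in a multi-node Kind cluster) and one on the worker, and the Ubuntu pod is pinned to the control-plane node so it survives the worker failure.

cat <<'EOF' | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      # Allow scheduling on the control-plane node despite its taint.
      tolerations:
        - key: node-role.kubernetes.io/control-plane
          operator: Exists
          effect: NoSchedule
      # Force the two replicas onto different nodes.
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: nginx
      containers:
        - name: nginx
          image: nginx
          ports:
            - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: nginx
spec:
  selector:
    app: nginx
  ports:
    - port: 80
---
apiVersion: v1
kind: Pod
metadata:
  name: ubuntu
spec:
  # Pin the test pod to the control-plane node so it keeps running
  # when the worker fails; tolerate everything, including NoExecute taints.
  nodeName: kind-control-plane
  tolerations:
    - operator: Exists
  containers:
    - name: ubuntu
      image: ubuntu:22.04
      command: ["sleep", "infinity"]
EOF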

To check Nginx availability, we accessed the service with curl from the Ubuntu pod on the control-plane node, and we also watched the endpoints belonging to the Nginx service from our terminal.
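Sketches of both probes are shown below; a status of 000 means curl could not complete the request at all. Note that the Ubuntu image does not ship curl, so it has to be installed first, and the node names come from the nodeName field carried by Endpoints addresses.

# From the Ubuntu pod: install curl once, then poll the service,
# printing a timestamp and the HTTP status of every request.
kubectl exec ubuntu -- bash -c 'apt-get update -qq && apt-get install -y -qq curl'
kubectl exec ubuntu -- bash -c \
  'while true; do
     echo "$(date +%H:%M:%S.%3N) - Status: $(curl -s -o /dev/null -m 1 -w "%{http_code}" http://nginx)"
     sleep 0.2
   done'

# From the local terminal: print which nodes still back the Nginx service
# (requires GNU date for millisecond timestamps).
while true; do
  echo "------"
  date +%H:%M:%S.%3N
  kubectl get endpoints nginx \
    -o jsonpath='{range .subsets[*].addresses[*]}{.nodeName}{"\n"}{end}'
  sleep 0.2
done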

Finally, to simulate a node failure, we stopped the Kind container that runs the worker node. We also recorded the timestamps of when the node went down and when it was detected as NotReady.
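A sketch of the failure injection, assuming the default Kind container naming:

# Record when the worker goes down, then stop the container backing the
# Kind worker node (default naming: <cluster>-worker, here kind-worker).
date +%H:%M:%S.%3N && docker stop kind-worker

# After the grace period expires, the Ready condition flips to Unknown;
# its transition time is when the controller manager marked the node down.
kubectl get node kind-worker \
  -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}{" "}{.status.conditions[?(@.type=="Ready")].lastTransitionTime}{"\n"}'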

After running the test, we noticed that the node went down at 12:50:22 and the Controller manager detected it as down at 12:50:26, exactly the expected 4 seconds later.

Worker down at 12:50:22.285
Worker detected in down state by Control Plane at
time: "12:50:26Z"

We observed the same behavior in the terminal output. The service started returning errors at 12:50:23 because part of the traffic was still routed to the failed node. At 12:50:26.744, Kube Proxy removed the endpoint pointing to the failed node and service availability fully recovered.

...
12:50:23.115 - Status: 200
12:50:23.141 - Status: 200
12:50:23.161 - Status: 200
12:50:23.190 - Status: 000
12:50:23.245 - Status: 200
12:50:23.269 - Status: 200
12:50:23.291 - Status: 000
12:50:23.503 - Status: 200
12:50:23.520 - Status: 000
12:50:23.738 - Status: 000
12:50:23.954 - Status: 000
12:50:24.166 - Status: 000
12:50:24.385 - Status: 200
12:50:24.407 - Status: 000
12:50:24.623 - Status: 000
12:50:24.839 - Status: 000
12:50:25.053 - Status: 000
12:50:25.276 - Status: 200
12:50:25.294 - Status: 000
12:50:25.509 - Status: 200
12:50:25.525 - Status: 200
12:50:25.541 - Status: 200
12:50:25.556 - Status: 200
12:50:25.575 - Status: 000
12:50:25.793 - Status: 200
12:50:25.809 - Status: 200
12:50:25.826 - Status: 200
12:50:25.847 - Status: 200
12:50:25.867 - Status: 200
12:50:25.890 - Status: 000
12:50:26.110 - Status: 000
12:50:26.325 - Status: 000
12:50:26.549 - Status: 000
12:50:26.604 - Status: 200
12:50:26.669 - Status: 000
12:50:27.108 - Status: 200
12:50:27.135 - Status: 200
12:50:27.162 - Status: 200
12:50:27.188 - Status: 200
...
...
------
12:50:26.523
"kind-control-plane"
"kind-worker"
------
12:50:26.618
"kind-control-plane"
"kind-worker"
------
12:50:26.744
"kind-control-plane"
------
12:50:26.878
"kind-control-plane"
------
...

Conclusions

From these results, it is evident that the reaction time improved considerably. Different combinations of parameters can satisfy specific cases, and you may be tempted to lower the values even further so that Kubernetes reacts faster, but take into account that this creates overhead on etcd, since every node will try to update its status every second. For example, if the cluster has 1000 nodes, there will be 60,000 node status updates per minute, which may require larger etcd instances or even dedicated nodes for etcd.
Also, setting these values too low carries some risk. For instance, a temporary network failure lasting only a few seconds would cause traffic to shift back and forth frequently.

To sum up, the parameter values entirely depend on business requirements such as application SLAs, cluster resource utilization, infrastructure provider, and so on.

I hope this helps. As always, I appreciate your comments.
