Kubernetes Tip: What Happens To Pods Running On A Node That Becomes Unreachable?

Bhargav Bhikkaji
Tailwinds-MajorDomo
4 min read · Jun 9, 2020

It is quite common for worker nodes to get disconnected from the master for a variety of reasons. When that happens, a few questions need answers: Does the master delete the pods running on the unreachable node? How do the Kubernetes controllers behave? Do the pods continue to run on the worker node? In short, what we are looking for is: how does the Kubernetes system behave when a node becomes unreachable?

Definition: An unreachable node is referred to as a partitioned node in Kubernetes parlance.

To understand the modus operandi, let’s create a partitioned node and observe how the system behaves.

Example Cluster.

The example cluster has one master node and three worker nodes. An Nginx deployment with 2 replicas is created, and the replicas run on different nodes: kind-worker2 and kind-worker3. Figure-1 captures the state of the example cluster.

Figure-1: State of example cluster.
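To reproduce the setup in Figure-1, something along these lines should do (a minimal sketch; the deployment name is assumed and the exact node placement may differ):

# Create an Nginx deployment and scale it to 2 replicas
kubectl create deployment nginx --image=nginx
kubectl scale deployment nginx --replicas=2

# Check which worker nodes the replicas landed on
kubectl get pods -o wide
kubectl get nodes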

Create a Node Partition.

A simple way to partition a node is to remove its IP address, which is what we do for kind-worker2. Figure-2 provides the necessary steps.

Figure-2: Create Node Partition.
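On a kind cluster, where each node is a Docker container, the steps in Figure-2 boil down to something like this (the interface name is an assumption; verify it with ip addr first):

# Find the node's address on its primary interface
docker exec kind-worker2 ip addr show eth0

# Remove the address so the node can no longer reach the master
# (substitute the address and prefix reported above)
docker exec kind-worker2 ip addr del <node-ip>/<prefix> dev eth0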

How Does The Kubernetes System Behave?

The worker node (kind-worker2) is set to the NotReady state almost immediately, but the pod continues to run. The reason is that the node controller, the part of the kube-controller-manager responsible for nodes, waits for pod-eviction-timeout to make sure the node is truly unreachable before marking its pods for deletion.

pod-eviction-timeout is set to 5 minutes by default and can be changed via a flag passed to the kube-controller-manager at startup.
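On a kubeadm-style control plane (which is what kind uses under the hood), the flag lives in the kube-controller-manager static pod manifest. A sketch for inspecting it (the node name and manifest path are assumptions):

# Look for the flag on the control-plane node
docker exec kind-control-plane \
  grep pod-eviction-timeout /etc/kubernetes/manifests/kube-controller-manager.yaml

# If it is set explicitly, the flag looks like:
#   --pod-eviction-timeout=5m0s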

After the pod-eviction-timeout interval (5 minutes in our case), the node controller moves the pods running on the partitioned node to the Terminating state. The deployment controller, also part of the kube-controller-manager, then kicks in to create new replicas and schedule them on other nodes. In our example, a new Nginx replica is created on the kind-worker node.

Figure-3 captures all the state changes on the Kubernetes system.

Figure-3: Situation on master-node.
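These transitions can be watched from the master with standard commands (a sketch; pod names will differ in your cluster):

# The partitioned node shows NotReady
kubectl get nodes

# After the eviction timeout, the old replica moves to Terminating
# and a fresh one comes up on another worker
kubectl get pods -o wide --watch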

What Happens To Pods Running On The Partitioned Worker Node?

Let’s get into the partitioned worker node to see what is going on. From Figure-4, we can observe that the pod continues to run. Interesting! The reason is that the API server is unable to reach the partitioned node’s Kubelet to delete the pod, and the Kubelet by itself is not a controller: it does not decide which pods should stop running.

Figure-4: Pod continues to run on partitioned worker-node.
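One way to confirm this from inside the partitioned node (assuming a kind node image, which ships with crictl configured for containerd):

# Open the node container and list what the local runtime is still running
docker exec kind-worker2 crictl ps | grep nginx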

The pod gets removed once the partitioned node rejoins the cluster.
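Restoring connectivity and watching the cleanup might look like this (same placeholder address as before; depending on how the partition was created, the default route may also need to be restored):

# Re-add the address that was removed earlier
docker exec kind-worker2 ip addr add <node-ip>/<prefix> dev eth0

# Once the node turns Ready again, the Terminating pod is cleaned up
kubectl get nodes
kubectl get pods -o wide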

Summary.

Quite a few things happen under the hood when a node gets partitioned. Here is a summary.

When a node becomes unreachable, the master sets it to the NotReady state.

The master waits for pod-eviction-timeout before taking any action. This configurable parameter defaults to 5 minutes and is set as part of the kube-controller-manager bootup process.

After pod-eviction-timeout elapses, the master sets the partitioned node’s pods to the Terminating state and creates new instances of those pods on other nodes.

Meanwhile, the original pods continue to run on the partitioned node.

Recommendation.

There is no magic number for pod-eviction-timeout that works for all cases. The right value depends entirely on business requirements such as application SLAs and cluster resource utilization. If an environment has tight application SLAs (or is running out of error budget) and cluster resource utilization is on the higher side, Tailwinds prescribes a lower value for pod-eviction-timeout. On the other hand, if SLAs are looser, there is more leeway in the error budget, and cluster utilization is on the lower side, Tailwinds prescribes a higher value.
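As a sketch, tuning the timeout means editing the kube-controller-manager flag (the value below is only an example; the manifest path assumes a kubeadm-style setup):

# On the control-plane node, edit the static pod manifest
sudo vi /etc/kubernetes/manifests/kube-controller-manager.yaml

# ...and set the flag to a value that matches your SLAs, e.g.:
#   --pod-eviction-timeout=2m0s
# The kubelet restarts the controller-manager automatically when the manifest changes.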

I hope this helps. As always, I appreciate your comments.
