Kubernetes Tip: How Do StatefulSets Behave Differently Than Deployments When A Node Fails?

Bhargav Bhikkaji
Tailwinds-MajorDomo
3 min read · Aug 25, 2020


Kubernetes is a platform that one needs to mold to make it work for a particular environment. As part of that effort, having a strategy for handling node failures becomes an important criterion. To implement such a blueprint, one needs to understand how the different controllers behave when a node fails. In this blog post, we look at how StatefulSets behave differently from Deployments when a node failure occurs.

Recap: How Does Kubernetes Treat Deployments When A Node Fails?

We have discussed this in detail in this blog post, but the summary is: when a node fails, the Deployment controller terminates the pods running on that node and creates a new set of replicas to be scheduled on the available nodes.

Here is the flow chart for how a Deployment behaves when a node fails:

Deployment Behaviour When A Node Fails.

Let’s understand the behavior of StatefulSets during node failures with an example.

Example Cluster With A StatefulSet.

The example kind cluster is created with one master node and three worker nodes. An Nginx StatefulSet is created with two replicas. These replicas run on different nodes: kind-worker and kind-worker2. Figure-1 captures the state of the example kind cluster.

Figure-1: Example Cluster.
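
To make the setup reproducible, here is a minimal sketch of the commands and manifests that produce a cluster like the one in Figure-1. The file and object names (kind-config.yaml, web, nginx) are assumptions for illustration; the actual names used in the figure may differ.

```
# Sketch: one control-plane node and three workers (kind names the node
# containers kind-control-plane, kind-worker, kind-worker2, kind-worker3).
cat <<EOF > kind-config.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: worker
- role: worker
- role: worker
EOF
kind create cluster --config kind-config.yaml

# Sketch: a 2-replica Nginx StatefulSet (name "web" is assumed) plus the
# headless Service that a StatefulSet requires.
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: nginx
spec:
  clusterIP: None
  selector:
    app: nginx
  ports:
  - port: 80
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  serviceName: nginx
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80
EOF

# The two replicas (web-0, web-1) should land on kind-worker and kind-worker2.
kubectl get pods -o wide
```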

Creating A Node Failure Scenario.

A simple way to create a node failure scenario is to delete kind-worker2. Figure-2 provides the required steps.

Figure-2: Steps To Create A Node Failure.
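
In a kind cluster each node is just a Docker container, so one way to simulate the failure is to stop that container. This is only a sketch of the idea; Figure-2 shows the exact steps used in this example.

```
# Simulate the failure of kind-worker2 by stopping its container,
# which takes the kubelet on that node offline.
docker stop kind-worker2

# Watch the node transition from Ready to NotReady.
kubectl get nodes -w
```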

How Does The Kubernetes System Behave?

The worker node (kind-worker2) is set to the NotReady state almost immediately, but the pod continues to run. The system waits for the pod-eviction-timeout interval (5 minutes by default) before setting the pod to the Terminating state.
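
For reference, the eviction delay comes from the kube-controller-manager flag --pod-eviction-timeout. The sequence below is a sketch of what you would observe, with pod names assumed from the earlier example.

```
# Immediately after the failure: the node is NotReady, the pod still shows Running.
kubectl get nodes
kubectl get pods -o wide

# kube-controller-manager runs with --pod-eviction-timeout=5m0s by default.
# After that interval the pod is marked for deletion, but it stays in
# Terminating because the kubelet on the dead node can never confirm the deletion.
kubectl get pods -o wide
```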

To our surprise, no new pod is created for the StatefulSet, whereas in the same scenario new replicas were spun up for a Deployment. Figure-3 captures the state of the Kubernetes cluster after a node failure with a StatefulSet.

Figure-3: State Of Nginx Stateful Set After Node Failure.
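
Concretely, the observed state looks roughly like this (a sketch, using the assumed names web, web-0, and web-1 from the example above):

```
kubectl get statefulset web
# NAME   READY   AGE
# web    1/2     30m        <- stays at 1/2; no replacement pod is created

kubectl get pods -l app=nginx -o wide
# NAME    READY   STATUS        NODE
# web-0   1/1     Running       kind-worker
# web-1   1/1     Terminating   kind-worker2   <- stuck until the node or pod is removed
```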

Why Is A New Replica Not Spun Up?

According to the pod-safety design document, for clustered software running on Kubernetes the system provides a guarantee of running at most one instance of each pet at any point in time. This rule applies to StatefulSets because their replicas are treated as pets, while it does not apply to Deployments because their replicas are treated like cattle.

The fundamental reason for the at-most-one-pet guarantee is that StatefulSets are mostly used for clustered applications that have their own ways of electing masters and slaves. In a node failure scenario, the Kubernetes control plane does not have enough information to ascertain whether the node has actually failed or whether the failure is due to a network partition. Spinning up a replacement in the latter case could leave two copies of the same pet running and lead to more problems, so the control plane refrains from taking any action. It takes the practical approach of running with one instance fewer, but in a reliable way.

How does one recover from this dangling state?

Recommendations.

There are a few ways one could handle this scenario:

Set terminationGracePeriodSeconds to 0 in the pod spec. This ensures that the StatefulSet pods are deleted forcefully when the node rejoins the cluster. By doing so, the Kubernetes master knows that the pod-safety guarantee is maintained and will spin up a new replica. The downside, obviously, is that pod shutdown is not graceful. A sketch of such a spec follows.
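
Here is a minimal sketch of the first option, reusing the assumed names from the earlier example:

```
cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  serviceName: nginx
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      # Pods are killed immediately on deletion; no graceful shutdown.
      terminationGracePeriodSeconds: 0
      containers:
      - name: nginx
        image: nginx
EOF
```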

Have an automated way to detect node failures and delete those node objects forcefully when you know for sure that the node has actually failed or been removed. This ensures that the StatefulSet pods are re-spun on the available nodes, as in the sketch below.
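
A sketch of the second option: once you are certain the node is really gone, remove its node object (or force-delete the stuck pod), after which the StatefulSet controller recreates the replica on a healthy node.

```
# Only do this if you know kind-worker2 has actually failed and is not
# merely partitioned from the control plane.
kubectl delete node kind-worker2

# Alternatively, force-delete the stuck pod (use with care; see the pod-safety doc).
kubectl delete pod web-1 --grace-period=0 --force
```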

References.

Pod Safety: https://github.com/kubernetes/community/blob/master/contributors/design-proposals/storage/pod-safety.md

StatefulSet Design Considerations: https://github.com/kubernetes/community/blob/master/contributors/design-proposals/apps/stateful-apps.md

Discussions In The Kubernetes GitHub Repo On This Issue: https://github.com/kubernetes/kubernetes/issues/74947 and https://github.com/kubernetes/kubernetes/issues/54368

P.S.: Thanks to Aaron for suggesting to write this post.
