Graceful scaledown of stateful apps in Kubernetes

Deploying stateful applications to Kubernetes is tricky. StatefulSets have made it much easier, but they still don’t solve everything. One of the biggest challenges is how to scale down a StatefulSet without leaving data on the disconnected PersistentVolume orphaned. In this blog post, I’ll describe the problem and two possible solutions.


If you’re unfamiliar with what happens when a StatefulSet is scaled down, here’s a quick summary. Each pod created through a StatefulSet gets its own PersistentVolumeClaim (PVC) and PersistentVolume (PV). When you scale down a StatefulSet by one replica, one of its pods is terminated, but the associated PersistentVolumeClaim and the PersistentVolume bound to it are left intact. They get reattached to the pod upon a subsequent scale up.
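To make this concrete, here’s a minimal sketch of such a StatefulSet (names are illustrative, chosen to match the example used later in this post). Each replica gets its own PVC named after the claim template and the pod, e.g. data-datastore-0, data-datastore-1, and so on; scaling from three replicas down to two deletes the pod datastore-2 but leaves data-datastore-2 behind:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: datastore
spec:
  serviceName: datastore        # headless service assumed to exist
  replicas: 3
  selector:
    matchLabels:
      app: datastore
  template:
    metadata:
      labels:
        app: datastore
    spec:
      containers:
      - name: datastore
        image: my-datastore     # illustrative image
        volumeMounts:
        - name: data
          mountPath: /var/data
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 1Mi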

Scaling a StatefulSet (diagram taken from my book “Kubernetes in Action”)

Now imagine using a StatefulSet to deploy a stateful app whose data is partitioned across its pods. Each instance holds and processes just one part of the data. When you scale down the stateful app, one of the instances will be terminated, and its data should be redistributed (i.e. drained) to the remaining pods. If you don’t redistribute the data, it remains inaccessible until you scale back up again.

Redistributing data on scale-down

Redistributing data during graceful shutdown

You may be thinking: “Since Kubernetes supports mechanisms for the graceful shutdown of pods, can’t a pod simply redistribute its data to other instances as part of its shutdown procedure?” Sadly, it can’t. There are two reasons why not:

  • The pod (or rather its containers) may receive a termination signal for reasons other than scale down. The application running in the container doesn’t know why it’s being terminated and thus doesn’t know whether to drain the data or not.
  • Even if the app could distinguish between being scaled down and being terminated for other reasons, it would need a guarantee that it would be allowed to finish its shutdown procedure even if it takes hours or days. Kubernetes doesn’t provide that guarantee. If the application process dies during the shutdown, it won’t be restarted and so won’t get the chance to drain completely.

Trusting that a pod will be able to redistribute (or otherwise process all its data) during a graceful shutdown is thus not a good idea and would lead to a very fragile system.
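For completeness, these are the shutdown-related knobs Kubernetes does give you (a sketch with illustrative names). They don’t change the conclusion: the preStop hook runs on every termination, not just on scale-down, and nothing re-runs it if the process or the node dies halfway through:

apiVersion: v1
kind: Pod
metadata:
  name: datastore-2                       # illustrative
spec:
  terminationGracePeriodSeconds: 86400    # even a day is only an upper bound, not a guarantee
  containers:
  - name: datastore
    image: my-datastore                   # illustrative
    lifecycle:
      preStop:
        exec:
          command: ["/drain.sh"]          # hypothetical drain script; runs on any termination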

Using tear-down containers?

If you’re not a total novice user of Kubernetes, you most likely know what init containers are. They run just before the pod’s main containers and must all complete before the main containers are started.
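As a quick refresher, here’s a minimal (and purely illustrative) init container:

apiVersion: v1
kind: Pod
metadata:
  name: init-example                 # illustrative
spec:
  initContainers:
  - name: wait-for-db                # must exit with code 0 before the main container starts
    image: busybox
    command: ["sh", "-c", "until nslookup my-db; do sleep 2; done"]
  containers:
  - name: app
    image: my-app                    # illustrative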

What if we had shutdown or tear-down containers, akin to init containers, but which would run after the pod’s main containers terminate? Could they perform the data redistribution in our stateful pods?

If pods also supported tear-down containers

Let’s assume that a tear-down container would be able to tell if the pod is being terminated due to a scale-down or not. And let’s assume that Kubernetes (more specifically, the Kubelet) would ensure that the tear-down container completes successfully (by restarting it every time it returns a non-zero exit code). If both of these assumptions were true, we’d have a mechanism that guarantees that a stateful pod is always able to redistribute its data on scale-down.

Or would we?


Sadly, a tear-down container mechanism like the one just described would only take care of those cases when a transient error occurs in the tear-down container itself and one or more restarts of the container eventually allow it to finish successfully. But what about those unfortunate times when the cluster node hosting the pod dies during the tear-down procedure? Obviously, the procedure can’t be completed, leaving the data inaccessible.

It should now be obvious that we shouldn’t perform the data redistribution on pod shutdown. Instead, we should create a new pod, possibly scheduled to a completely different cluster node, to perform the redistribution procedure.

And this brings us to the following solution:

When a StatefulSet is scaled down, a new pod must be created and bound to the orphaned PersistentVolumeClaim. Let’s call it the drain pod, since its job is to drain the data elsewhere (or process it in some other way). The pod must have access to the orphaned data and can do whatever it wants with it. Because the drain procedure varies greatly from application to application, the new pod should be completely configurable: users should be able to run any container they want inside the drain pod.

The StatefulSet Drain Controller

Since the StatefulSet controller doesn’t provide this feature yet, we can implement an additional controller whose sole purpose is handling StatefulSet scale-downs. I’ve recently implemented a proof-of-concept of such a controller. You’ll find the source code on GitHub at https://github.com/luksa/statefulset-drain-controller.

Let me explain how it works.


After you deploy the controller to your Kubernetes cluster, you can then add a drain pod template to any of your StatefulSets by simply adding an annotation to the StatefulSet’s manifest. Here’s an example:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: datastore
  annotations:
    statefulsets.kubernetes.io/drainer-pod-template: |
      {
        "metadata": {
          "labels": {
            "app": "datastore-drainer"
          }
        },
        "spec": {
          "containers": [
            {
              "name": "drainer",
              "image": "my-drain-container",
              "volumeMounts": [
                {
                  "name": "data",
                  "mountPath": "/var/data"
                }
              ]
            }
          ]
        }
      }
spec:
  ...

The template is not much different from the main pod template in a StatefulSet, except that it’s defined through an annotation. You deploy and scale the StatefulSet as you usually would.

When the controller detects that you’ve scaled down the StatefulSet, it creates a new drain pod from the specified template and ensures it is bound to the PersistentVolumeClaim that was previously bound to the stateful pod that was deleted because of the scale-down.
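The drain pod it creates looks roughly like this (a sketch; the exact pod spec the controller generates may differ in its details). Note how it mounts the PVC left behind by the deleted pod datastore-2:

apiVersion: v1
kind: Pod
metadata:
  name: datastore-2                  # same name as the deleted stateful pod
  labels:
    app: datastore-drainer
spec:
  containers:
  - name: drainer
    image: my-drain-container
    volumeMounts:
    - name: data
      mountPath: /var/data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: data-datastore-2    # the PVC orphaned by the scale-down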

Redistributing data on StatefulSet scale-down

The drain pod gets the same identity (i.e. name and hostname) as the deleted stateful pod. This is necessary for two reasons:

  • Some stateful apps require a stable identity — this may also apply during the data draining procedure.
  • If the StatefulSet is scaled up again while the drain procedure is underway, this prevents the StatefulSet controller from creating a duplicate pod and attaching it to the same PVC.

If the drain pod or its host node crashes, the drain pod is rescheduled onto a different node, where it can retry/resume its operation. Once the drain pod completes, the pod and the PVC are deleted. When you scale the StatefulSet back up, a fresh PVC is created.

Trying it out yourself

If you’d like to try this yourself, first deploy the drain controller:

$ kubectl apply -f https://raw.githubusercontent.com/luksa/statefulset-drain-controller/master/artifacts/cluster-scoped.yaml

And then deploy the example StatefulSet:

$ kubectl apply -f https://raw.githubusercontent.com/luksa/statefulset-drain-controller/master/example/statefulset.yaml

This will run three stateful pods. When you scale the StatefulSet down to two, you’ll see one of those pods start to terminate. Then, immediately after the pod is deleted, a new drain pod with the same name will be created by the drain controller:

$ kubectl scale statefulset datastore --replicas 2
statefulset.apps/datastore scaled
$ kubectl get po
NAME          READY     STATUS        RESTARTS   AGE
datastore-0   1/1       Running       0          3m
datastore-1   1/1       Running       0          2m
datastore-2   1/1       Terminating   0          49s
$ kubectl get po
NAME          READY     STATUS        RESTARTS   AGE
datastore-0   1/1       Running       0          3m
datastore-1   1/1       Running       0          3m
datastore-2   1/1       Running       0          5s    <-- the drain pod

When the drain pod completes its job, the controller deletes it and the PVC:

$ kubectl get po
NAME          READY     STATUS    RESTARTS   AGE
datastore-0   1/1       Running   0          3m
datastore-1   1/1       Running   0          3m
$ kubectl get pvc
NAME               STATUS    VOLUME           CAPACITY   ...
data-datastore-0   Bound     pvc-57224b8f-... 1Mi        ...
data-datastore-1   Bound     pvc-5acaf078-... 1Mi        ...

An added benefit of the controller is that it releases the PersistentVolume, since it’s no longer bound by the PersistentVolumeClaim. This reduces your storage costs if your cluster is running in a cloud environment.
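One thing to keep in mind: the underlying cloud disk is only freed if the PersistentVolume’s reclaim policy is Delete (the default for dynamically provisioned volumes). A StorageClass sketch, with an example provisioner:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard                     # illustrative
provisioner: kubernetes.io/gce-pd    # example cloud provisioner
reclaimPolicy: Delete                # released volumes (and their cloud disks) are deleted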

Wrapping up

Keep in mind that this is a proof-of-concept only. It needs a lot more work & testing to become a proper solution to the StatefulSet scale-down problem. Ideally, the Kubernetes StatefulSet controller itself would support running drain pods like this, instead of requiring an additional controller that races with the original controller (when you scale down and immediately scale back up again).

If this feature were integrated straight into Kubernetes, the annotation could be replaced with a regular field in the StatefulSet spec, so alongside template and volumeClaimTemplates it would also have a drainPodTemplate, which would be much nicer than using an annotation.
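Purely as a sketch of that idea (drainPodTemplate is a hypothetical field, not part of any current Kubernetes API), it might look something like this:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: datastore
spec:
  serviceName: datastore
  replicas: 3
  template:
    ...                              # the regular pod template
  volumeClaimTemplates:
    ...
  drainPodTemplate:                  # hypothetical field
    spec:
      containers:
      - name: drainer
        image: my-drain-container
        volumeMounts:
        - name: data
          mountPath: /var/data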