Avoiding Kubernetes Pod Topology Spread Constraint Pitfalls
In the Wise Cloud Platform squad we take resiliency, capacity planning and costs seriously. We are always looking for ways to improve how services running on our self-managed Kubernetes clusters in AWS can be made more resistant to system failures without significantly increasing cost.
One of the mechanisms we use is Pod Topology Spread Constraints, a built-in Kubernetes feature used to distribute workloads across topology domains, such as Availability Zones or Nodes.
In this blog post, we consider Availability Zones to be our topology domain. We discuss how to configure Pod Topology Spread Constraints correctly and how a naive configuration creates unexpected Pod skew.
If Pod Topology Spread Constraints are misconfigured and an Availability Zone goes down, you could lose two thirds of your Pods instead of the expected one third. This will likely hurt your customers' ability to use your product.
What is wrong with Kubernetes Pod Topology Spread Constraints?
Skew is the difference in the number of pods between the most populated and least populated Availability Zone. If we have 3 Availability Zones and 3 Pods, ideally we want one Pod in each Availability Zone, which gives a skew of 0.
If we were to add another pod, it could be placed in any of the Availability Zones and our pods would still be distributed as evenly as possible. Our skew would then be 1.
If the pods are not distributed as evenly as possible, the skew will always be greater than 1. For example, if Availability Zone 1 contains 2 pods and Availability Zone 3 contains zero pods, the skew is 2.
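The skew arithmetic above can be sketched in a few lines of Python. This is only an illustration of the definition; the zone names and the `skew` helper are hypothetical and not part of any Kubernetes API.

```python
def skew(pods_per_zone: dict[str, int]) -> int:
    """Difference between the most and least populated zone."""
    counts = pods_per_zone.values()
    return max(counts) - min(counts)


# Hypothetical zone names, used only for illustration.
even = {"az-1": 1, "az-2": 1, "az-3": 1}
one_extra = {"az-1": 2, "az-2": 1, "az-3": 1}
uneven = {"az-1": 2, "az-2": 1, "az-3": 0}

print(skew(even))       # 0
print(skew(one_extra))  # 1
print(skew(uneven))     # 2
```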
You can configure Pod Topology Spread Constraints through the topologySpreadConstraints section of a Pod's spec. The Pods to which the configuration applies are determined through LabelSelectors, and what you deem an acceptable distribution is determined by the maxSkew field.
A naive configuration allows a maximum skew of 1 across Availability Zones and selects Pods using only a label such as the app name.
With this configuration, if we have at least 3 pods, we expect to have a minimum of 1 Pod running in each Availability Zone and our skew to never exceed 1.
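As a rough sketch, a naive constraint of this kind and the check it implies could be modelled as below. The field names mirror the Pod spec, but the app name `my-service`, the zone names, the choice of `DoNotSchedule` and the `allowed_zones` helper are assumptions made for illustration. The check is a simplified model of the scheduler's skew rule, not its actual implementation.

```python
# Field names follow the Pod spec; "my-service" is a hypothetical app name
# and "DoNotSchedule" is an assumed setting.
NAIVE_CONSTRAINT = {
    "maxSkew": 1,
    "topologyKey": "topology.kubernetes.io/zone",
    "whenUnsatisfiable": "DoNotSchedule",
    "labelSelector": {"matchLabels": {"app": "my-service"}},
}


def allowed_zones(matching_pods_per_zone: dict[str, int], max_skew: int) -> list[str]:
    """Zones where one more matching Pod keeps the skew within max_skew.

    Simplified model: a zone is allowed if its count after placement, minus
    the smallest current count across all zones, does not exceed max_skew.
    """
    lowest = min(matching_pods_per_zone.values())
    return [
        zone
        for zone, count in matching_pods_per_zone.items()
        if count + 1 - lowest <= max_skew
    ]


max_skew = NAIVE_CONSTRAINT["maxSkew"]
print(allowed_zones({"az-1": 1, "az-2": 1, "az-3": 0}, max_skew))  # ['az-3']
print(allowed_zones({"az-1": 1, "az-2": 1, "az-3": 1}, max_skew))  # ['az-1', 'az-2', 'az-3']
```

With an empty zone only that zone is allowed, which is the behaviour we expect. With one matching Pod per zone, every zone is allowed.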
To understand why we saw a deviation from what was expected, we need to explain how scheduling occurs in Kubernetes. A component called kube-scheduler decides which Nodes a Pod will be scheduled on, and a configurable set of rules informs where a pod is allowed to be scheduled. However, the problem is that these rules only consider the state of the cluster at the point in time when scheduling occurs. Kube-scheduler will not rebalance pods after they are scheduled, and the rules are not considered when pods are removed.
A rolling update of a Deployment in Kubernetes creates a new ReplicaSet, schedules new Pods and terminates the old Pods. The exact way this rollout happens is configurable, but let us assume we want to schedule 1 Pod at a time. Kube-scheduler schedules a new Pod and the Deployment Controller waits for this Pod to report itself ready. Once it is ready, one Pod from the old ReplicaSet is terminated and the Deployment Controller requests that another Pod from the new ReplicaSet is scheduled. This process repeats until all Pods are rotated. Note that the choice of which old Pod to terminate does not take Pod Topology Spread Constraints into account; from the point of view of zone balance, it is effectively arbitrary.
As LabelSelectors are used to define which pods Pod Topology Spread Constraints apply to, if you use only the app name or something similar, kube-scheduler will consider Pods from both old and new ReplicaSets. This makes it possible for a Pod to be scheduled in a way which violates the given Pod Topology Spread Constraints once a Pod from the old ReplicaSet is terminated.
Let’s consider an example where we are in the middle of a rollout. A new Pod has just been scheduled and reported ready, and an old Pod terminated. Kube-scheduler is now deciding in which Availability Zone it can place the next new Pod.
According to kube-scheduler the current skew is 0, and it is allowed to place the Pod in any Availability Zone without violating the Pod Topology Spread Constraints. If it places the Pod in one of the Availability Zones that already contain a new Pod, we will end up in a state that violates the constraints once the remaining old Pod is terminated.
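The scenario can be replayed with a small simulation. It assumes one concrete state that is consistent with the description above: two new Pods in the first two zones and one remaining old Pod in the third. The zone names, the `my-service` app label, the `old`/`new` hash values and the helper functions are all hypothetical, and the placement check is a simplified model of the scheduler's rule.

```python
ZONES = ["az-1", "az-2", "az-3"]
MAX_SKEW = 1


def count_matching(pods, selector):
    """Count Pods per zone whose labels contain every selector key/value."""
    counts = {zone: 0 for zone in ZONES}
    for zone, labels in pods:
        if all(labels.get(k) == v for k, v in selector.items()):
            counts[zone] += 1
    return counts


def allowed_zones(counts, max_skew):
    """Zones where one more matching Pod keeps the skew within max_skew."""
    lowest = min(counts.values())
    return [z for z, c in counts.items() if c + 1 - lowest <= max_skew]


def skew(counts):
    return max(counts.values()) - min(counts.values())


old = {"app": "my-service", "pod-template-hash": "old"}
new = {"app": "my-service", "pod-template-hash": "new"}

# Assumed mid-rollout state: two new Pods and one remaining old Pod.
pods = [("az-1", new), ("az-2", new), ("az-3", old)]
naive_selector = {"app": "my-service"}  # matches old and new Pods alike

view = count_matching(pods, naive_selector)
print(view)                           # {'az-1': 1, 'az-2': 1, 'az-3': 1}
print(allowed_zones(view, MAX_SKEW))  # ['az-1', 'az-2', 'az-3']

# Suppose the new Pod lands in az-1, then the remaining old Pod is terminated.
pods.append(("az-1", new))
pods = [(z, l) for z, l in pods if l["pod-template-hash"] != "old"]

final = count_matching(pods, naive_selector)
print(final, skew(final))  # {'az-1': 2, 'az-2': 1, 'az-3': 0} 2
```

Every individual scheduling decision was valid at the time it was made, yet the final state has a skew of 2.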
To prevent this issue, we need kube-scheduler to distinguish between old ReplicaSet Pods and new ReplicaSet Pods. As Pod Topology Spread Constraints use LabelSelectors, we can give the old and new Pods a label with a distinct value and include that label in the LabelSelector of our Pod Topology Spread Constraints.
To do this, we created a mutating admission webhook which adds a label, topology.wise.com/id, to the Pod metadata. The value of this label is a UUID. The webhook updates the LabelSelector of the Pod Topology Spread Constraints accordingly whenever a Deployment object passes through the Kubernetes API.
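The core of such a mutation might look roughly like the sketch below. It is not our webhook's actual code: it operates on a Deployment represented as a plain Python dict, the `add_topology_id` function and the example Deployment are hypothetical, and a real admission webhook would return the change as a patch in its admission response rather than a modified object.

```python
import copy
import uuid

LABEL_KEY = "topology.wise.com/id"


def add_topology_id(deployment: dict) -> dict:
    """Return a copy of the Deployment with a fresh ID label on its Pod
    template and the same label added to every spread constraint selector."""
    mutated = copy.deepcopy(deployment)
    template = mutated["spec"]["template"]
    topology_id = str(uuid.uuid4())

    labels = template.setdefault("metadata", {}).setdefault("labels", {})
    labels[LABEL_KEY] = topology_id

    for constraint in template["spec"].get("topologySpreadConstraints", []):
        selector = constraint.setdefault("labelSelector", {})
        selector.setdefault("matchLabels", {})[LABEL_KEY] = topology_id
    return mutated


# Hypothetical, heavily trimmed Deployment used only for illustration.
deployment = {
    "spec": {
        "template": {
            "metadata": {"labels": {"app": "my-service"}},
            "spec": {
                "topologySpreadConstraints": [
                    {
                        "maxSkew": 1,
                        "topologyKey": "topology.kubernetes.io/zone",
                        "whenUnsatisfiable": "DoNotSchedule",
                        "labelSelector": {"matchLabels": {"app": "my-service"}},
                    }
                ]
            },
        }
    }
}

mutated = add_topology_id(deployment)
print(mutated["spec"]["template"]["metadata"]["labels"])
```

The sketch leaves out when a new ID should be generated and how the mutation is wired into the API server.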
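If our example configuration were passed through this webhook, the Pod template would carry the topology.wise.com/id label, and the LabelSelector of the constraint would also match on that label.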
Now, kube-scheduler will not consider the Pods from the old ReplicaSet when scheduling the new Pods.
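To illustrate the effect, here is the same kind of mid-rollout state evaluated with a selector that includes the ID label. The zone names, the `my-service` app label, the `id-old`/`id-new` values and the helpers are hypothetical, and the placement check is a simplified model of the scheduler's rule.

```python
ZONES = ["az-1", "az-2", "az-3"]
MAX_SKEW = 1
ID_LABEL = "topology.wise.com/id"


def count_matching(pods, selector):
    """Count Pods per zone whose labels contain every selector key/value."""
    counts = {zone: 0 for zone in ZONES}
    for zone, labels in pods:
        if all(labels.get(k) == v for k, v in selector.items()):
            counts[zone] += 1
    return counts


def allowed_zones(counts, max_skew):
    """Zones where one more matching Pod keeps the skew within max_skew."""
    lowest = min(counts.values())
    return [z for z, c in counts.items() if c + 1 - lowest <= max_skew]


old = {"app": "my-service", ID_LABEL: "id-old"}
new = {"app": "my-service", ID_LABEL: "id-new"}

# Assumed mid-rollout state: two new Pods and one remaining old Pod.
pods = [("az-1", new), ("az-2", new), ("az-3", old)]
selector = {"app": "my-service", ID_LABEL: "id-new"}  # new Pods only

view = count_matching(pods, selector)
print(view)                           # {'az-1': 1, 'az-2': 1, 'az-3': 0}
print(allowed_zones(view, MAX_SKEW))  # ['az-3']

# The new Pod can only go to az-3; then the old Pod is terminated.
pods.append(("az-3", new))
pods = [(z, l) for z, l in pods if l[ID_LABEL] != "id-old"]
final = count_matching(pods, selector)
print(final)  # {'az-1': 1, 'az-2': 1, 'az-3': 1}
```

Because the old Pod no longer counts towards the third zone, that zone looks empty and is the only valid placement.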
Instead of using a mutating webhook, it would have been possible to update tooling to edit the manifests, adding the unique label and associated LabelSelector, before they reach the Kubernetes API. This could have saved us having to deploy and monitor a new component. But to do this we would have to change how deployment manifests are created in every place they are generated, which is not scalable.
Kubernetes v1.25 also adds a matchLabelKeys field to Pod Topology Spread Constraints as an alpha feature, which can be enabled using a feature gate. Kube-scheduler looks up the values for the listed keys from the labels of the Pod being scheduled, so you can use the existing pod-template-hash label to achieve the same behaviour we get from the mutating admission webhook.
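Conceptually, matchLabelKeys extends the selector with values taken from the incoming Pod. The sketch below models that merge in Python. The `effective_match_labels` helper, the `my-service` app name and the hash value are hypothetical, and the real scheduler logic also handles cases (such as matchExpressions) that this ignores.

```python
def effective_match_labels(constraint: dict, incoming_pod_labels: dict) -> dict:
    """Combine labelSelector.matchLabels with the incoming Pod's values for
    each key listed in matchLabelKeys. Keys the Pod does not have are skipped."""
    selector = dict(constraint.get("labelSelector", {}).get("matchLabels", {}))
    for key in constraint.get("matchLabelKeys", []):
        if key in incoming_pod_labels:
            selector[key] = incoming_pod_labels[key]
    return selector


# Field names follow the Pod spec; the values are hypothetical.
constraint = {
    "maxSkew": 1,
    "topologyKey": "topology.kubernetes.io/zone",
    "whenUnsatisfiable": "DoNotSchedule",
    "labelSelector": {"matchLabels": {"app": "my-service"}},
    "matchLabelKeys": ["pod-template-hash"],
}
incoming = {"app": "my-service", "pod-template-hash": "5d4f8c7b9"}

print(effective_match_labels(constraint, incoming))
# {'app': 'my-service', 'pod-template-hash': '5d4f8c7b9'}
```

Since every ReplicaSet stamps its Pods with a different pod-template-hash value, the effective selector only matches Pods from the same ReplicaSet as the Pod being scheduled.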
Neither of the solutions described will protect you against imbalances after scaling down a deployment. If you regularly scale down workloads, you may want to consider using a tool like descheduler.
You could also consider rearchitecting the way you deploy workloads, such as doing a deployment per Availability Zone or even running a cluster per Availability Zone.
Out of the box, Pod Topology Spread Constraints are not enough to ensure even Pod distribution. LabelSelectors that capture Pods from both old and new ReplicaSet versions will cause your availability expectations to be violated. If you use Pod Topology Spread Constraints to distribute Pods across Availability Zones, this could mean that in the event of an Availability Zone failure, two thirds of a service’s pods are lost.
We solved this by implementing a mutating admission webhook to add a new label to Pods controlled by a Deployment. This label has a unique ID as a value per ReplicaSet and is referred to in the LabelSelectors of the Pod Topology Spread Constraints configuration. This allows kube-scheduler to distinguish between old and new Pods, preventing us from ending up in an undesirable state.