How can affinity rules break your Kubernetes rolling updates

And how to prevent it

Maria Ralli · XM Global · Jan 29, 2021

Does Kubernetes confuse you as well?

Kubernetes offers you options to configure every aspect of your deployment, but most of the time we rely on the defaults because they just work. However, there are cases where the default values can be in conflict with your deployment and even break your updates.

In our company, high availability of our applications is a top priority, so when we were preparing the K8s deployment for a new application we wanted to ensure that it would keep running even after potential node failures. In K8s you can achieve HA by spreading the pods across nodes using pod (anti-)affinity rules. When it comes to deciding on your affinity configuration, you can choose between the hard and the soft type of affinity. According to the official K8s documentation:

- The soft affinity rules specify preferences that the scheduler will try to enforce but will not guarantee.

- The hard affinity rules must be met for a pod to be scheduled onto a node.

Soft affinity seems like the safer choice, as it guarantees that the pods will always be scheduled. However, it comes with a drawback: in case of node failures, the pods may end up on the same node and continue running there without you realizing it, thus destroying your application’s HA. To ensure that our application is always distributed across different nodes of the cluster, we opted for hard anti-affinity rules. This guarantees that no two pods of the application will run on the same node.

Soft anti-affinity on node failures
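
For contrast, a soft anti-affinity rule uses preferredDuringSchedulingIgnoredDuringExecution instead. A minimal sketch of what this could look like for our app label (the weight value here is an arbitrary illustration, not part of our actual configuration):

affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      # weight (1-100) tells the scheduler how strongly to prefer satisfying this rule
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: web-store
          topologyKey: "kubernetes.io/hostname"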

Using hard anti-affinity rules means that we cannot have more replicas than nodes, so when we deployed our application on a development cluster consisting of two nodes, we set the number of replicas to two. Our deployment.yaml file looks like this:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  selector:
    matchLabels:
      app: web-store
  replicas: 2
  template:
    metadata:
      labels:
        app: web-store
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: web-store
              topologyKey: "kubernetes.io/hostname"
      containers:
        - name: web-app
          image: nginx:1.16-alpine

After deployment, the pods were scheduled on different nodes as desired. Sounds like a nice deployment configuration, right? Hm, not so much…

After a couple of days, we decided to update our deployment. After applying the new manifest, we noticed that there were now three pods, with the newly created one stuck in Pending, meaning that the update could not be completed.

$ kubectl get pods -o wide
NAME                   READY   STATUS    RESTARTS   AGE    IP          NODE
web-5764db9556-sjs64   1/1     Running   0          105m   10.42.1.2   node1
web-5764db9556-2szm2   1/1     Running   0          105m   10.42.0.2   node0
web-67cd7d7bf8-6gtc2   0/1     Pending   0          25s    <none>      <none>

By examining the new pod, we can see that it remains unschedulable due to the specified anti-affinity rules. But this makes us wonder: why doesn’t K8s kill one of the existing pods to free one node and schedule the new pod?

$ kubectl describe pod web-67cd7d7bf8-6gtc2
...
Events:
  Type     Reason            Age    From               Message
  ----     ------            ---    ----               -------
  Warning  FailedScheduling  2m13s  default-scheduler  0/2 nodes are available: 2 node(s) didn't match pod affinity/anti-affinity, 2 node(s) didn't satisfy existing pods anti-affinity rules.

It turns out, this is due to the default RollingUpdateStrategy in combination with the hard anti-affinity rule and the number of replicas/nodes (two in our case). By default, K8s applies the following update strategy for deployments:

- 25% max unavailable. K8s ensures that only a certain number of pods are down while they are being updated. By default, it ensures that at least 75% of the desired number of pods are up.
- 25% max surge. K8s also ensures that only a certain number of pods are created above the desired number of pods. By default, it ensures that at most 125% of the desired number of pods are up.
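
These defaults are equivalent to declaring the following strategy explicitly in the Deployment spec:

strategy:
  type: RollingUpdate
  rollingUpdate:
    # at most 25% of the desired pods may be unavailable during the update
    maxUnavailable: 25%
    # at most 25% extra pods may be created on top of the desired count
    maxSurge: 25%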

In our case, we have two replicas, so 25% rounds down to 0 pods for max unavailable and up to 1 pod for max surge. So, when updating the deployment the following happens:

  1. A new pod is created.
  2. The new pod cannot be scheduled due to the hard anti-affinity rule.
  3. Neither of the two running pods can be terminated, because the max unavailable setting requires at least two pods to stay up.
  4. The new pod remains unscheduled (Pending) and the update is never completed.

K8s rolling update conflict
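
If you want to see the stuck state on your own cluster, the standard inspection commands (output omitted here, as it varies) make it visible:

# The old ReplicaSet keeps both pods, while the new one cannot make progress
$ kubectl get replicasets -l app=web-store

# The rollout never reports success; with the default progress deadline the
# Deployment is eventually marked as failed to progress
$ kubectl rollout status deployment/web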

The above failure can only occur when running a two-pod deployment on a two-node cluster. Any other combination of nodes and pods leaves either free space for the new pod or room for K8s to kill one of the existing pods.

Sounds very specific but it can happen!

Now that we know why our update is breaking, let’s see how we can fix it. We could scale our cluster to more than two nodes, or relax the anti-affinity rule and use soft anti-affinity instead. However, neither of these solutions satisfies our initial requirements.

In order to allow updates to work under the above requirements, we adjust the RollingUpdateStrategy as shown below and set max unavailable to 50%, which translates to 1 pod in our case. The deployment controller can now terminate one of the two running pods, and the scheduler can place the new pod on the freed node. This way, the anti-affinity rule is still respected and there is no downtime.

strategy:
  rollingUpdate:
    maxUnavailable: 50%
    maxSurge: 25%

Rolling update with 50% maxUnavailable
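
After applying the updated manifest (assuming it is still saved as deployment.yaml), the rollout now goes through, which you can confirm with rollout status:

$ kubectl apply -f deployment.yaml
$ kubectl rollout status deployment/web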

We can see that, even in cases where hard anti-affinity rules are preferable, their strictness can cause unpredictable conflicts. As a rule of thumb, always check how your application behaves during updates, especially when you impose scheduling restrictions on your deployments.
