Node Affinity helps autoscaling reschedule pods

Kenneth Tang
4 min read · Jun 20, 2022


Want to shut down some nodes during non-peak hours to cut some costs, but the autoscaling is not working as you expected? Let’s see if this article helps.

So, in the last post, I talked about my story of scaling out with HPA and CA using the CPU resource request metric. I have two dual-core reserved instances running 24x7 as the AKS worker nodes to support the whole web app at all times. However, users stress the app during office hours, so I’ve added HPA and CA to scale out more CPU resources for that period. One of the beauties of the cloud is pay-as-you-go, meaning I can shut down the extra nodes added by CA during nighttime, and that shutdown is supposed to be handled by HPA and CA too.

If you followed my last post, you should be able to scale out successfully. However, it may not work as you intended during scale-in:

  • Even after HPA scales the pods down to the minimum count, the remaining pods may stay spread across all nodes, including the ones you wish to shut down, which prevents CA from removing them.
  • Even if the pods were redistributed, they might not stick to the nodes you intended, so the wrong nodes end up being shut down. In my case, the nodes to shut down are the extra ones created by CA, since they are in the pay-as-you-go tier. I have to keep my reserved instances alive as the backbone, and they are already charged anyway!

The problem boils down to controlling which nodes specific pods are scheduled, or rescheduled, onto, either as a hard requirement or as a preference.

Here is the answer officially provided by k8s: Node Affinity.

We have to give identities to the nodes before we can use node affinity. In Azure, we can manually assign labels to each node after creation; or we can specify the same label for the whole node pool so that all nodes belonging to the same pool will share the same behaviour. So, if you wish to have different attributes for your nodes, group them into different node pools!
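
For example, labelling one existing node by hand is a one-liner (the node name below is just a placeholder; list yours with kubectl get nodes first). Keep in mind that a label applied this way lives on that single node only, while a node-pool label is stamped on every node the pool ever creates:

# kubectl get nodes
# kubectl label nodes your_node_name turbo=true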

To create a new user node pool with the label “turbo=true” and enable CA with a node count ranging from 0 to 2 (you can use any key/value pair as the label, as long as it is meaningful to you):

# az aks nodepool add -g your_resource_group --cluster-name your_aks_cluster -n your_new_pool_name  --mode User --enable-cluster-autoscaler --node-count 0 --min-count 0 --max-count 2 --node-vm-size your_vm_size --labels "turbo=true"
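
Once CA brings up the first node in that pool, you can double-check that the label is there; the -L flag simply adds a TURBO column to the node listing:

# kubectl get nodes -L turbo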

In my case, due to historical reasons, I didn’t assign any label to my reserved instances. I am adding the label “turbo=true” to my new pool, which will be auto-scaled by CA. So I now have two node pools: one with no label at all, expected to run 24x7; the other with the label “turbo=true”, expected to be turned on only during peak hours, just like a turbo. (Naming that label nearly killed me.)

I want all my pods (apache, login-portal, my CPU-intensive PHP server… just all of them) to be scheduled on my reserved instances, so that my web app still works without that “turbo”. When “turbo” is on, only the CPU-intensive workload, i.e. the PHP server, will be scaled out onto the turbo nodes.

Let’s go back to node affinity. You can find sample YAML in the official documentation, which explains it pretty well, so I won’t repeat it here. Instead, I will show you how I modified the code and present it as a whole.

In my PHP server deployment, I made some modifications so that my PHP pods prefer not to be scheduled on the turbo nodes:

kind: Deployment
metadata:
  name: php-server
...
spec:
  template:
    spec:
      ...
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 1
            preference:
              matchExpressions:
              - key: turbo
                operator: DoesNotExist
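
To see where the PHP pods actually land, kubectl can print the node that each pod is running on. The label selector below assumes the deployment’s pods carry an app=php-server label, which may differ in your setup, so swap in whatever labels you actually use:

# kubectl get pods -l app=php-server -o wide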

For my remaining pods, such as apache and login-portal, I use requiredDuringSchedulingIgnoredDuringExecution node affinity, so that they cannot be scheduled on the turbo nodes:

kind: Deployment
metadata:
  name: apache
...
spec:
  template:
    spec:
      ...
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: turbo
                operator: DoesNotExist

Note: Both match on the same criterion: the key “turbo” does not exist on the node, no matter what value is assigned.
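
As a side note, if you ever want to avoid only a specific label value rather than the key altogether, a NotIn expression does that instead. This variant is purely illustrative and not something my setup needs; it would replace the DoesNotExist entry inside matchExpressions:

- key: turbo
  operator: NotIn
  values:
  - "true"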

It may seem weird that I am trying to avoid scheduling any kind of pod onto the turbo nodes, but in fact that is exactly what I want. I need k8s to schedule, or reschedule, as many pods as it can onto my reserved instances at all times. So for all pods except the PHP server, I force them to stick to my reserved instances, i.e. the nodes without the “turbo” label. I also prefer, but do not require, my PHP server to be scheduled there. This tells k8s to squeeze the pods back onto my reserved instances during HPA scale-in, freeing the turbo nodes so that CA can shut them down.

So, my turbo nodes are used solely to hold the extra PHP-server pods scaled out by HPA when there are no more resources available on my reserved instances.
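
If you want an even harder guarantee that nothing except the PHP server ever lands on the turbo pool, taints and tolerations can enforce that. This is only a rough sketch of the idea, not what my setup actually uses: create the pool with a taint via the --node-taints flag (on top of the same flags as the earlier az command), then add a matching toleration to the PHP server’s pod spec only:

# az aks nodepool add -g your_resource_group --cluster-name your_aks_cluster -n your_new_pool_name --mode User --enable-cluster-autoscaler --node-count 0 --min-count 0 --max-count 2 --node-vm-size your_vm_size --labels "turbo=true" --node-taints "turbo=true:NoSchedule"

spec:
  template:
    spec:
      tolerations:
      - key: "turbo"
        operator: "Equal"
        value: "true"
        effect: "NoSchedule"

With the taint in place, even pods that have no affinity rules at all will stay off the turbo nodes, so the two mechanisms complement each other.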

You should now have some idea of how to set your own rules for pod scheduling. Node affinity is not the only way to plan how pods land on nodes; anti-affinity rules, taints, and tolerations can also be applied. Do let me know your stories and how you tackled them, and don’t forget to give me a like and follow me!
