How to Keep Your Kubernetes Deployments Balanced Across Multiple Zones

Customise the Kubernetes Scheduler policy to meet your desired SLOs

RAHUL GUPTA
Expedia Group Technology
7 min read · Aug 27, 2019

Kubernetes is a container orchestration engine that schedules and manages containerized applications on a set of worker nodes. The Kubernetes API is used to specify a container pod and its scheduling requirements. The Kubernetes Scheduler is responsible for making pod scheduling decisions for the pods that have been specified to run on the cluster. The Scheduler takes into account the pod’s resource requirements, QoS requirements, affinity and anti-affinity specifications, and so on to make effective scheduling decisions.
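
As a rough sketch of how those scheduling requirements are expressed through the API (the service name, image and label values below are hypothetical), a Deployment might declare resource requests and a preferred anti-affinity that asks the Scheduler to spread its replicas across zones:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: service-a                          # hypothetical service name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: service-a
  template:
    metadata:
      labels:
        app: service-a
    spec:
      containers:
      - name: service-a
        image: example.com/service-a:1.0   # placeholder image
        resources:
          requests:
            cpu: "500m"                    # scheduling requirement: requested CPU
            memory: "256Mi"                # scheduling requirement: requested memory
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: service-a
              topologyKey: failure-domain.beta.kubernetes.io/zone   # the zone label on nodes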

Imbalanced cluster

In a multi-zone Kubernetes cluster, the Scheduler tries its best to balance the pods in a replication controller or service across zones, provided the zone information is included in the nodes’ labels, to reduce the impact of zone failures.

The following is quoted from the Kubernetes documentation.

Scheduler will automatically spread the pods in a replication controller or service across nodes in a single-zone cluster (to reduce the impact of failures). With multiple-zone clusters, this spreading behaviour is extended across zones (to reduce the impact of zone failures).

However, in my experience, even though Kubernetes tries its best to balance the pods across zones, it does not try hard enough. Apart from looking at the zone information attached to a node’s label, the default Scheduler also considers the nodes with the least requested resources, affinity/anti-affinity in pod specs, and so on. So you could end up in a situation where pod placement is uneven across zones, putting your service level objectives at risk during zone failures.

Let's consider a very simple, tested example: a multi-zone Kubernetes cluster with 3 worker nodes, one in each zone, hosting 3 services A, B and C. Services B and C run with 2 pods each, and service C pods need to run alongside service B pods using pod affinity. The following diagram depicts the pod placement of all three services. Let's assume each service A pod has requested 10% of its node’s resources, while the service B and C pods have requested 30% each.

Pods placement with requested resources details
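
As a rough sketch of the affinity described above (the app labels simply mirror the B/C placeholders from the example), service C’s pod template might pin its pods to nodes already running service B pods like this:

# In service C's pod template spec
affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app: service-b                   # co-locate with service B pods
      topologyKey: kubernetes.io/hostname  # on the same node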

Now suppose you deploy a service D with 6 pods, each requesting 10% of a node’s resources. As the node in Zone3 has the least requested resources (10%), whereas it is 70% in Zone1 and Zone2, the Scheduler places 3 pods of service D in Zone3, 2 pods in Zone2 and 1 pod in Zone1. The reason is that while the Scheduler tries to balance the pods across zones, it also tries to balance the requested resources (CPU, memory) across nodes, making the distribution of service D pods uneven across zones, as shown in the image below.

Pod placement after service D is deployed

Now, if you initiate a rolling update on service D, because the rolling update brings up new pods before taking out the older ones, you could end up with the majority of the service’s pods in one zone: 4 pods in Zone3 and 1 pod each in the remaining zones. In my experience there are plenty of other scenarios where you may end up with no service pods in a particular zone, or with all the pods of a service running in a single zone.
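
The surge behaviour that makes this worse is controlled by the Deployment’s rolling update strategy; here is a sketch with illustrative values (this shows the mechanism, it is not a fix for the imbalance):

# Deployment rolling update strategy (illustrative values)
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%          # new pods are created before old ones are removed
      maxUnavailable: 25%    # how many old pods may be taken down at once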

So what if my service replicas are not evenly distributed?

Kubernetes is smart enough to spin up new pods should an individual node or a whole zone go down. That is true, but Kubernetes may take some time to recover from such a failure and to provision extra capacity in the surviving zones if they do not already have enough capacity to absorb the entire workload from the failed zone. If the pods in your service are not evenly balanced, a zone failure can bring down the majority, or all, of your service pods, causing breaches of the latency and availability SLOs your service has to offer. The level of disruption could fall into the following categories:

  1. Complete loss of service if all the service replicas are in one zone.
  2. Major service degradation if the majority of your pods are impacted; this could become a complete service failure if the remaining pods are unable to handle the load.
  3. Major service degradation impacting customers by not meeting service latency and throughput objectives.
  4. Minor service degradation with little or no impact on latency and throughput.

Generally, you should decide on the availability level your component should adhere to based on the service level commitments to your customers, and configure your component accordingly, i.e. replica count, data placement and level of over-provisioning. If your service needs to stay in the 4th category above, configuring it for high availability becomes easier if your service pods are evenly distributed.
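
As a simple sketch of the over-provisioning point (the numbers and the service name are hypothetical): if 4 pods are enough to handle peak load and the cluster spans 3 zones, running 6 evenly spread replicas means that losing one zone (2 pods) still leaves 4 pods serving.

# Hypothetical over-provisioning for service D: 4 pods handle peak load,
# 6 replicas across 3 zones means one zone failure still leaves 4 pods serving.
spec:
  replicas: 6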

How can I make sure my pods are evenly distributed?

Let’s first have a look at the steps the Scheduler takes to make a scheduling decision.

Node filtering: For each pod in the pending state, the Scheduler applies a set of predicate functions to filter out the nodes on which the pod cannot be scheduled. Predicate functions filter nodes based on the constraints and requirements specified in the pod’s spec, such as taints, affinity and anti-affinity, and remove nodes that do not have sufficient resources to satisfy the pod’s requests. They also filter out nodes that are already under resource pressure (memory or disk).

All Available Nodes in the cluster
Feasible nodes after predicate function filtering
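
As a rough illustration of what the predicates evaluate (the label, taint key and values below are hypothetical), a pod spec like the following would be filtered down to nodes that carry the matching label, whose taints it tolerates, and that still have the requested CPU and memory available:

# Hypothetical pod spec fragment that drives predicate filtering
spec:
  nodeSelector:
    disktype: ssd                # only nodes labelled disktype=ssd pass
  tolerations:
  - key: "dedicated"             # tolerate a hypothetical "dedicated" taint
    operator: "Equal"
    value: "batch"
    effect: "NoSchedule"
  containers:
  - name: app
    image: example.com/app:1.0   # placeholder image
    resources:
      requests:
        cpu: "1"                 # nodes without 1 free CPU are filtered out
        memory: "1Gi"            # nodes without 1Gi of free memory are filtered out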

Node prioritization: Once the Scheduler has filtered out the nodes on which a pod cannot be scheduled, it calculates a score for each remaining node through a set of priority functions. Priority functions are used to find the node that is the best fit for the pod among all feasible nodes, and each priority function targets a specific criterion, such as spreading the pods across nodes, balancing the requested resources on nodes, or honouring the preferred mode of the pod’s pod and node affinity/anti-affinity requirements. See the Advanced Scheduling in Kubernetes blog post for more details.

Example scores for feasible nodes:

Zone1 Node2 score - 15
Zone2 Node1 score - 12
Zone3 Node1 score - 16
Zone3 Node2 score - 18

Pod scheduling: In the final step, the Scheduler picks the node with the highest score to schedule the pod. If multiple nodes share the highest score, the Scheduler picks one of them at random.

Node with the highest priority score is selected for scheduling the pod: Zone3 Node2

The set of predicates and priority functions to be used by the Scheduler, and the weight associated with each priority function, can be configured through the scheduler policy.

The following are the default priority functions the Scheduler applies to choose the best-fit node from all feasible nodes.

  • Selector spread: Prioritize based on spreading a service’s replicas/pods across nodes and zones
  • Least requested: Prioritize nodes with the least requested resources
  • Balanced resource: Prioritize nodes that balance resource utilization
  • Inter-pod affinity: Prioritize based on the pod spec (pod affinity of type preferredDuringSchedulingIgnoredDuringExecution)
  • Taint toleration: Prioritize nodes with the fewest intolerable taints
  • Node affinity: Prioritize based on the pod spec (node affinity of type preferredDuringSchedulingIgnoredDuringExecution)
  • Node prefer avoid pods: Prioritize based on a node’s preference to avoid pods, ignoring pods owned by a controller other than a ReplicationController

Each priority function scores a node between 0 and 10, where 10 is the most preferred and 0 the least preferred. A weight is associated with each priority function, and the final score for each node is calculated as follows.

Node score = (weight1 * priorityFunction1 score) + (weight2 * priorityFunction2 score) + …

The default scheduler policy gives an equal weight of 1 to all priority functions. By increasing the weight of the “SelectorSpreadPriority” function, which tries to spread the pods evenly across nodes and zones, you are more likely to achieve even spreading of your pods across zones, as long as there are enough resources in all the zones to run your pods. You can override the default scheduler policy through a policy file or through a ConfigMap.
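
To see why the weight matters, here is a small worked example with made-up scores for two feasible nodes, ignoring the other priority functions for brevity. Node X sits in a zone that already runs two of the service’s pods but has plenty of free resources; Node Y sits in a zone with none of the service’s pods but less free capacity.

With the default weights of 1:
Node X score = (1 * 3 selector spread) + (1 * 9 least requested) = 12
Node Y score = (1 * 8 selector spread) + (1 * 4 least requested) = 12  (tie, random pick)

With SelectorSpreadPriority weight increased to 5:
Node X score = (5 * 3) + (1 * 9) = 24
Node Y score = (5 * 8) + (1 * 4) = 44  (the emptier zone wins)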

Scheduler flags to use if you override the default scheduling policy through a policy file:

--use-legacy-policy-config=true
--policy-config-file=<file path>

Scheduler flags to use if you override the default scheduling policy through a ConfigMap:

--use-legacy-policy-config=false
--policy-configmap=scheduler-policy
--policy-configmap-namespace=kube-system
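
For the ConfigMap route, you first need the policy stored in a ConfigMap in the kube-system namespace. A sketch of how that might be created (the upstream examples use policy.cfg as the data key; <path to policy file> is a placeholder):

kubectl create configmap scheduler-policy \
  --from-file=policy.cfg=<path to policy file> \
  --namespace=kube-system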

Here is a sample policy where the weight for SelectorSpreadPriority has been increased to 5, while all other priority functions are kept at their default values.

"priorities": [
{
"name": "SelectorSpreadPriority",
"weight": 5
},
{
"name": "InterPodAffinityPriority",
"weight": 1
},
{
"name": "LeastRequestedPriority",
"weight": 1
},
{
"name": "BalancedResourceAllocation",
"weight": 1
},
{
"name": "NodePreferAvoidPodsPriority",
"weight": 1
},
{
"name": "TaintTolerationPriority",
"weight": 1
},
{
"name": "NodeAffinityPriority",
"weight": 1
}
]

Once you increase the weight on “SelectorSpreadPriority”, the Scheduler will give more importance to evenly spreading the pods in your service across nodes and zones.
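
Note that the "priorities" array above is only a fragment: in the policy file or ConfigMap it sits inside a Policy object. Here is a trimmed-down sketch of that enclosing structure (the predicate and priority entries shown are illustrative, not the full default set):

{
  "kind": "Policy",
  "apiVersion": "v1",
  "predicates": [
    { "name": "PodFitsResources" },
    { "name": "PodFitsHostPorts" },
    { "name": "MatchInterPodAffinity" }
  ],
  "priorities": [
    { "name": "SelectorSpreadPriority", "weight": 5 },
    { "name": "LeastRequestedPriority", "weight": 1 }
  ]
}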

You can also evenly spread the pods of a particular service across groups of nodes, where the groups are defined by the values of a specific label attached to the nodes. Here is an example.

{
  "argument": {
    "serviceAntiAffinity": {
      "label": "<label_name>"
    }
  },
  "name": "<priority_name>",
  "weight": <weight>
}

Nodes that share the same value for the specified label are treated as one group and receive the same score, and the groups with the least concentration of the service’s pods receive higher scores.
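
For example (the priority name and weight here are arbitrary, and the label is the zone label mentioned earlier), an entry that spreads a service’s pods across zones could look like this:

{
  "argument": {
    "serviceAntiAffinity": {
      "label": "failure-domain.beta.kubernetes.io/zone"
    }
  },
  "name": "ZoneSpreadPriority",
  "weight": 3
}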

You can also run a secondary scheduler in your cluster with a different set of policy configs, and reference that scheduler in your service’s deployment YAML to meet different needs. See https://kubernetes.io/docs/tasks/administer-cluster/configure-multiple-schedulers/ for details.
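
Selecting that secondary scheduler is done through the schedulerName field in the pod template; a minimal sketch (the scheduler name and image are hypothetical):

# In the Deployment's pod template spec
spec:
  schedulerName: my-zone-spread-scheduler   # hypothetical secondary scheduler
  containers:
  - name: app
    image: example.com/app:1.0              # placeholder image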

I hope this post has helped you to understand how the Kubernetes Scheduler works. The scheduler default configuration may work for your use case depending on your desired SLOs. If you have a more demanding SLO, you can override the scheduler policy to meet your requirements.
