Kubernetes Operations: Prioritize Workload in Overcommitted Clusters

One of the benefits of adopting a system like Kubernetes is support for burstable and scalable workloads. Horizontal application scaling involves adding or removing instances of an application to match demand. The Kubernetes Horizontal Pod Autoscaler automates pod scaling based on demand. This is cool; however, it can lead to unpredictable load on the cluster, which may put the cluster into an overcommitted state.

The following image represents a three-node cluster that runs three applications. Pink (triangle) is the most critical. Red (pentagon) is burstable and durable, which means that if we need to stop a few instances of red, things will be OK. Blue (circle) is non-critical. I have also tried to depict in this image a cluster that is in a fully maxed-out state.

Imagine now that a scale-out operation is needed on the pink (triangle) application. This puts the cluster in an overcommitted state, with critical workload still waiting to be scheduled.

How can Kubernetes facilitate this critical request in an overcommitted state? One option is to use Pod Priority and Preemption, which allows a priority value to be attached to a scheduling request. In the event of overcommitment, priority is evaluated, and lower-priority workload is evicted (preemption) to make room for the higher-priority workload.

Pod Priority and Preemption tutorial

In this article, we will walk through an end-to-end demonstration of using Pod Priority and Preemption to ensure that critical workload has priority access to cluster resources.

In order to complete this tutorial, you need a Kubernetes cluster that consists of three nodes. I’ve included steps for deploying an appropriately sized Azure Kubernetes cluster. If you need an Azure Subscription or would like to read up on additional operational practices for Azure Kubernetes Service, see the following links.

Create an Azure Kubernetes Service Cluster

First things first, ensure you have an appropriately sized Kubernetes cluster for this tutorial (three nodes). The following script can be used if needed.

Create a priority class for critical workload

Create a Pod Priority Class with a value of 1000000. This class can be used to ensure that high-priority workload is given priority access to cluster resources.

To do so, create a file named pc.yml and copy in the following YAML.
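A minimal priority class manifest might look like the sketch below. The name custom-high-priority matches the class referenced later in this tutorial, and the value comes from the step above. The API version scheduling.k8s.io/v1 assumes Kubernetes 1.14 or later (older clusters used a beta version), and the description text is illustrative only.

```yaml
# pc.yml: a sketch of a high-priority class (API version and description are assumptions)
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: custom-high-priority
# Higher values mean higher scheduling priority.
value: 1000000
# Do not apply this class to pods that omit priorityClassName.
globalDefault: false
description: "Priority class for critical workload."
```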

Create the priority class.

$ kubectl create -f pc.yml

Consume all CPU cores

Run some workload to consume all CPU cores in the cluster. In the following example, a deployment consisting of three replicas is started with a CPU request of one core each. This will effectively consume the available CPU resources of the cluster.

Create a file named slam-cpu.yml and copy in the following YAML.
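A deployment along these lines would match the description above: three replicas, each requesting one CPU core. This is a sketch under assumptions. The deployment name, labels, and the nginx image are placeholders chosen for illustration, and whether three one-core requests actually exhaust the cluster depends on the node size.

```yaml
# slam-cpu.yml: a sketch; the name, labels, and image are hypothetical
apiVersion: apps/v1
kind: Deployment
metadata:
  name: slam-cpu
spec:
  replicas: 3
  selector:
    matchLabels:
      app: slam-cpu
  template:
    metadata:
      labels:
        app: slam-cpu
    spec:
      containers:
      - name: slam-cpu
        image: nginx   # placeholder image; any long-running container works
        resources:
          requests:
            cpu: "1"   # each replica reserves one full core
```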

Run the deployment.

$ kubectl create -f slam-cpu.yml

Start low-priority workload

Now start another pod without specifying a priority class.

Create a file named pod-no-priority.yml and copy in the following YAML.
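A pod manifest without a priority class could look like the following sketch. The pod name pod-no-priority matches the name used in the commands below. The nginx image and the one-core CPU request are assumptions. The request is what makes the pod unschedulable once the cluster's CPU is fully reserved.

```yaml
# pod-no-priority.yml: a sketch; the image and CPU request are assumptions
apiVersion: v1
kind: Pod
metadata:
  name: pod-no-priority
spec:
  # No priorityClassName is set, so the pod gets the default priority.
  containers:
  - name: pod-no-priority
    image: nginx   # placeholder image
    resources:
      requests:
        cpu: "1"
```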

Run the pod.

$ kubectl create -f pod-no-priority.yml

You should find that the new pod cannot be scheduled due to a lack of CPU resources. To see this, list the pods on the cluster (for example, with kubectl get pods) and note that pod-no-priority is in a Pending state.

Return a list of events for the pod to see the actual issue.

$ kubectl describe pod pod-no-priority

In the output, you should see that the pod cannot be scheduled due to insufficient CPU.

Run high priority workload

Run another pod, but this time assign the custom-high-priority class to it.

Create a file named pod-priority.yml and copy in the following YAML. Take note that the pod spec includes the priority class created in a previous step.
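A sketch of such a manifest follows. The priorityClassName field refers to the custom-high-priority class created earlier. The pod name pod-priority, the nginx image, and the one-core CPU request are assumptions made for illustration.

```yaml
# pod-priority.yml: a sketch; the pod name, image, and CPU request are assumptions
apiVersion: v1
kind: Pod
metadata:
  name: pod-priority
spec:
  # Ties this pod to the priority class defined in pc.yml.
  priorityClassName: custom-high-priority
  containers:
  - name: pod-priority
    image: nginx   # placeholder image
    resources:
      requests:
        cpu: "1"
```

Because this pod carries a higher priority than the running workload, the scheduler may evict a lower-priority pod to free the requested CPU.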

Run the pod.

$ kubectl create -f pod-priority.yml

Now return a list of pods. If done quickly, you may be able to catch one of the lower-priority pods being terminated.

Once the lower priority pod has been terminated, the pod with priority is started in its place.

Very cool indeed.

Feel free to contact me on Twitter @nepeters or comment below for discussion on the topic.

Originally published at techcommunity.microsoft.com on November 20, 2018.