K8s scheduling — deep dive

Tsahi Duek
The Spot to be in the Cloud
5 min read · Aug 6, 2019

Kubernetes (AKA K8s), which has been around for the past 5 years, delivers on a promise to free its users from dealing with infrastructure. Software engineers do not have to deal with infrastructure allocation, capacity, or optimization, and for DevOps, well, they just have to make sure there are enough worker nodes for the pods to be scheduled on.

Although working with K8s sometimes gives the experience of one large computer that can hold all of our applications (Mainframe anyone?) where we’re just kubectl apply -ing pods all over the place, there are still some limitations we should take into consideration while deploying our applications. For example, some applications must have at least one replica per availability zone, some must not reside with other applications on the same host, some must run on a specific instance type, and so on…
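
For example, a constraint like "must not reside with other applications on the same host" can be expressed directly in the pod spec with pod anti-affinity. Below is a minimal sketch of what that could look like; the application name, labels, and image are made up for illustration only:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                    # hypothetical application name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      affinity:
        podAntiAffinity:
          # hard rule: never place two "web" pods on the same worker node
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: web
              topologyKey: kubernetes.io/hostname
      containers:
        - name: web
          image: nginx:1.17    # placeholder image
```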

On the infrastructure side, these are some of the things that need to be managed to fulfill the scheduling “requests” from above:

  • A number of worker nodes sufficient to hold all our pods
  • Scaling — based on what metrics?
  • Scheduling limitations, such as PVC locations, affinity, and tolerations
  • Which instance types should be launched as worker nodes

Although some of these scheduling constraints and requirements can be fulfilled by various mechanisms (eksctl nodegroups, instance groups in KOPS, GKE node pools), in this blog post I’d like to demonstrate (with diagrams and K8s deployment snippets) how this can be achieved by using “K8s only” features. As always — let's start with a simple example.

Scenario I — the “simple” scheduling

In this example there are no special requirements — we have a simple pod that wants to run anywhere it can. The challenge starts when there is no available room on the cluster for this pod to run on (there can still be some “room” left in every K8s worker node, but it’s not enough to run even a small application). This situation is illustrated in diagram #1.
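
As a reference point, the pod in this scenario could be as simple as the sketch below (the name, image, and resource numbers are illustrative); the scheduler just needs to find a node with enough free capacity to satisfy the requests:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: simple-app             # hypothetical name
spec:
  containers:
    - name: app
      image: nginx:1.17        # placeholder image
      resources:
        requests:
          cpu: 500m            # the scheduler looks for a node with this much free CPU...
          memory: 256Mi        # ...and this much free memory
```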

diagram #1

So… what should we do? The answer is simple — we need to scale up our cluster. So… let’s assume we did just that, as illustrated in diagram #2.

diagram #2

Scenario II — specific AZ

While our first pod wasn’t that “picky”, in the next example our pod must reside in a specific availability zone (us-west-1b in our example) — this is achieved by using the K8s nodeSelector. The nodeSelector field in the pod spec makes the pod schedulable only on nodes that carry specific labels. Every node comes with some built-in labels that are populated by default and can be used for our needs. In our example, this built-in label will be failure-domain.beta.kubernetes.io/zone. From the infrastructure point of view, there are not enough resources in us-west-1b (in order to simulate the connection between pod scheduling and the infrastructure of the cluster) (diagram #3).
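
A minimal sketch of such a pod spec could look like this (the pod name and image are placeholders; on newer K8s versions the equivalent label is topology.kubernetes.io/zone):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: zonal-app              # hypothetical name
spec:
  nodeSelector:
    # only nodes carrying this zone label will be considered by the scheduler
    failure-domain.beta.kubernetes.io/zone: us-west-1b
  containers:
    - name: app
      image: nginx:1.17        # placeholder image
```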

diagram #3

Although we have enough room in our cluster to hold this pod in terms of overall resources, there is no room left in the availability zone this pod needs to run in. The solution in this example — scale up by one node in the “right” availability zone to allow this new pod to be scheduled and run (diagram #4).

diagram #4

Scenario III — Persistent Volume Claim location

Our cluster keeps growing and pods are being scheduled normally. The next scheduling aspect we need to take into consideration is pretty similar to the previous one — a pod with a PVC needs to be scheduled on our cluster, but this time the pod doesn’t specify a specific availability zone — it inherits its constraints from the PVC location (in our example — a PVC in AZ-a, as shown in diagram #5).
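
A sketch of such a pod might look like the following, assuming a pre-existing PVC named data-pvc that is bound to a volume in us-west-1a (the names are made up for illustration). Note that the pod spec itself contains no zone information; the constraint comes entirely from the volume:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: stateful-app           # hypothetical name
spec:
  containers:
    - name: app
      image: nginx:1.17        # placeholder image
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: data-pvc    # hypothetical PVC, bound to a volume in us-west-1a
```
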
In order to address this scheduling issue, we’ll need to scale our cluster (again), but this time in us-west-1a, which is where the attached PVC resides (diagram #6).

diagram #5
diagram #6

Scenario IV — Specific instance types

Some applications, such as API servers, proxies, and web applications, require high network throughput. Therefore, these pods should run on instances with high network bandwidth (e.g. c5.4xlarge). In order to achieve this, we can specify a nodeAffinity rule with the K8s built-in label beta.kubernetes.io/instance-type and pass a list of possible instance types for our pod.
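
Here is a sketch of such a pod spec; the pod name, image, and exact list of acceptable instance types are placeholders (on newer K8s versions the equivalent label is node.kubernetes.io/instance-type):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: network-heavy-app      # hypothetical name
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: beta.kubernetes.io/instance-type
                operator: In
                values:        # any of these instance types is acceptable
                  - c5.4xlarge
                  - c5.9xlarge
  containers:
    - name: app
      image: nginx:1.17        # placeholder image
```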

Summary

Although K8s gives us the ability to run all of our applications on “one big computer”, it is still our responsibility to specify the scheduling limitations of our applications and to make sure we have enough of the right infrastructure to fulfill our applications’ requirements.

Some of the tools K8s gives us to specify these scheduling requirements from the pod level itself are:

  1. Pod affinity/anti-affinity
  2. nodeAffinity/anti-affinity
  3. Taints and Tolerations (a short sketch follows this list)
  4. Node selector
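
Taints and tolerations weren’t shown in the scenarios above, so here is a minimal sketch; the taint key/value, node name, and pod details are made up for illustration. A node is tainted so that only pods tolerating the taint are allowed onto it:

```yaml
# Taint a node first (only pods tolerating this taint can be scheduled on it):
#   kubectl taint nodes node-1 dedicated=gpu:NoSchedule
apiVersion: v1
kind: Pod
metadata:
  name: gpu-app                # hypothetical name
spec:
  tolerations:
    - key: dedicated
      operator: Equal
      value: gpu
      effect: NoSchedule       # matches the taint applied to the node above
  containers:
    - name: app
      image: nginx:1.17        # placeholder image
```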

By using all of the above, we are able to create an infrastructure spec (a YAML) for our application which is portable across many K8s engines (whether created by KOPS/kubespray/EKS/GKE/etc…) and scalable from the deployment level itself.

On the infrastructure side of things, we need to use cluster autoscaling solutions that will guarantee we have enough instances (of the right type and in the right “place”) available for our applications, without the need to manage this infrastructure manually. Solutions available on the market are:

  1. Ocean — a product by Spot which manages your infrastructure in a cost-efficient way by running your worker nodes on spot instances, while looking at the pod spec for all the requirements we’ve covered (nodeSelectors, node/pod affinity/anti-affinity, taints & tolerations, PVC awareness, and much more)
  2. Cluster-autoscaler — developed and supported by the community. There are some limitations and differences in implementation among cloud providers, and it requires a higher level of operations to address the topics covered in this article

I would like to get your feedback on this.

Feel free to contact me on Linkedin/Twitter/Email: tsahi.duek@spot.io
