Can Kubernetes evict your guaranteed pod from a node ahead of other non-guaranteed pods?

RAHUL GUPTA
Expedia Group Technology
6 min read · Jul 10, 2018


Yes! Read on to see when, why and how Kubernetes handles CPU, memory and disk pressure and how these affect your pods…

Let’s start with the resources that always matter: CPU, memory and disk. These resources are categorized as compressible (CPU) and non-compressible (memory, disk). Compressible resources don’t cause eviction; instead, Kubernetes throttles the pod’s CPU when the node comes under CPU pressure. Non-compressible resources cannot be reclaimed by throttling, so when they come under pressure, Kubernetes evicts one of the pods on the node to maintain sufficient capacity for that resource on the node.

Under non-compressible resource pressure, the pod chosen for eviction by the Kubelet agent is not random; it is selected by the logic described below. The Kubelet agent runs on every node and monitors the node’s resource usage by polling cAdvisor metrics. Note: guaranteed means requests=limits for every container, burstable means requests are set lower than limits (or only requests are set), and best-effort means no requests or limits are set at all (requests=limits=0).
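As a rough illustration, here is a minimal Python sketch of how these QoS classes follow from a pod’s requests and limits. The dict-based pod structure and the qos_class() helper are hypothetical, for illustration only; the real classification happens inside Kubernetes.

```python
# Illustrative sketch: deriving a pod's QoS class from its container
# requests/limits. The data structures and helper are hypothetical.

def qos_class(containers):
    """containers: list of {"requests": {...}, "limits": {...}} dicts."""
    if all(not c.get("requests") and not c.get("limits") for c in containers):
        return "BestEffort"   # no requests or limits set at all
    if all(c.get("requests") and c.get("requests") == c.get("limits")
           for c in containers):
        return "Guaranteed"   # requests == limits for every container
    return "Burstable"        # everything in between

print(qos_class([{"requests": {"cpu": "500m", "memory": "1Gi"},
                  "limits":   {"cpu": "500m", "memory": "1Gi"}}]))  # Guaranteed
print(qos_class([{"requests": {"memory": "1Gi"},
                  "limits":   {"memory": "2Gi"}}]))                 # Burstable
print(qos_class([{}]))                                              # BestEffort
```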

Under disk pressure

The following are the default eviction thresholds for disk pressure.

nodefs.available<10% (the file system used by the kubelet for local volumes, logs, etc.)

imagefs.available<15% (the file system used by the container runtime for images and container writable layers)
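These defaults can be overridden with the kubelet’s --eviction-hard flag. The following minimal Python sketch shows what crossing such a threshold means; the helper function and the numbers are hypothetical and only cover the two threshold formats shown above.

```python
# Hypothetical sketch of evaluating hard eviction thresholds such as
# "nodefs.available<10%" or "memory.available<100Mi".

DEFAULT_HARD_THRESHOLDS = {
    "memory.available": "100Mi",   # absolute quantity
    "nodefs.available": "10%",     # percentage of filesystem capacity
    "imagefs.available": "15%",
}

def under_pressure(signal, available_bytes, capacity_bytes):
    threshold = DEFAULT_HARD_THRESHOLDS[signal]
    if threshold.endswith("%"):
        return available_bytes < capacity_bytes * float(threshold[:-1]) / 100
    return available_bytes < int(threshold[:-2]) * 1024 ** 2  # only "Mi" handled

gi = 1024 ** 3
# A 100Gi nodefs with only 8Gi available is below the 10% threshold:
print(under_pressure("nodefs.available", 8 * gi, 100 * gi))           # True
print(under_pressure("memory.available", 512 * 1024 ** 2, 16 * gi))   # False
```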

  • When the imagefs usage threshold is exceeded, Kubelet tries to reclaim disk space by deleting unused images.
  • When the nodefs usage threshold is exceeded, Kubelet stops admitting new pods on the node and registers a disk pressure event. Kubelet’s first attempt is to reclaim space by removing dead pods and containers; if this still does not bring usage below the threshold, Kubelet starts evicting pods. From Kubernetes 1.9 onwards, Kubelet does not consider the pod’s QoS for disk eviction; it simply ranks the pods by their disk usage, and the pod with the highest usage is evicted first. So even a guaranteed pod will be evicted if it’s the biggest consumer of the file system. Kubelet treats daemonset pods the same as other pods for eviction, but if a daemonset pod is evicted from a node, Kubelet will prevent it from being restarted there until the disk pressure condition has cleared. If the pod has multiple containers, Kubelet sums up the usage of all the containers inside the pod for ranking, as expressed below.

Sum of container_fs_usage_bytes for all containers inside a pod
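As a minimal sketch, assuming a pod is represented simply as a list of per-container container_fs_usage_bytes values (the helper is illustrative, not Kubelet code):

```python
# Illustrative helper: the disk usage counted against a pod is the sum of
# container_fs_usage_bytes over all of its containers.

def pod_fs_usage(container_fs_usage_bytes):
    """container_fs_usage_bytes: per-container usage values in bytes."""
    return sum(container_fs_usage_bytes)

gi = 1024 ** 3
# Hypothetical two-container pod: 1.5Gi + 0.5Gi = 2Gi counted for ranking.
print(pod_fs_usage([int(1.5 * gi), int(0.5 * gi)]) / gi)  # 2.0
```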

Consider an example where a handful of pods of different QoS classes are running on a worker node, each with some amount of nodefs usage.

When nodefs usage exceeds the eviction threshold, Kubelet ranks the pods for eviction by their nodefs usage, and the highest consumer is evicted first. As the sketch below shows, a guaranteed pod can be evicted first.
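The original post shows this as a table; the sketch below makes the same point with made-up pod names and usage numbers.

```python
# Hypothetical pods on one worker node. QoS is shown only to make the point:
# for disk-pressure eviction it is ignored; ranking is purely by nodefs usage.
gi = 1024 ** 3
pods = [
    {"name": "pod-a", "qos": "Guaranteed", "nodefs_usage": 6 * gi},
    {"name": "pod-b", "qos": "Burstable",  "nodefs_usage": 4 * gi},
    {"name": "pod-c", "qos": "BestEffort", "nodefs_usage": 2 * gi},
]

for p in sorted(pods, key=lambda p: p["nodefs_usage"], reverse=True):
    print(p["name"], p["qos"], p["nodefs_usage"] // gi, "Gi")
# pod-a (Guaranteed) is evicted first because it is the biggest disk consumer.
```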

Under memory pressure

The following is the default eviction threshold for memory pressure.

memory.available<100Mi

Once a memory pressure event is registered because usage has crossed the threshold, Kubelet stops admitting new best-effort pods and ranks the pods for eviction by the following criteria, in order:

  1. Whether the pod’s memory usage exceeds requests.
  2. By priority (if configured).
  3. By the pod’s memory consumption relative to its requested memory.

From the above, it follows that guaranteed pods will never be evicted before best-effort pods, but they can be evicted before burstable pods; similarly, burstable pods can be evicted before best-effort pods. If a pod has multiple containers, Kubelet calculates the usage as the sum of usage across all containers relative to the sum of requested memory for all containers, as expressed below.

Sum of (container_memory_usage_bytes relative to container_resource_requests_memory_bytes) for all the containers inside pod
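A minimal sketch of this key, again with per-container metrics as plain numbers; expressing “relative to” as the amount by which usage exceeds requests is just one reading, and the exact comparator is internal to Kubelet.

```python
# Illustrative helper: a pod's memory-eviction key compares its total
# container memory usage with its total requested memory.

def memory_over_request(usage_bytes, request_bytes):
    """Both arguments are per-container lists, in bytes."""
    return sum(usage_bytes) - sum(request_bytes)

mi = 1024 ** 2
# Hypothetical two-container pod: using 900Mi against 700Mi requested.
print(memory_over_request([600 * mi, 300 * mi], [500 * mi, 200 * mi]) / mi)  # 200.0
```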

This can be illustrated with an example: a few pods on a worker node, each with a QoS class, memory requests/limits and memory usage.

When memory usage exceeds the eviction threshold, Kubelet ranks the pods for eviction by their memory usage relative to their memory requests, as sketched below. Again, a guaranteed pod can be evicted first. Note that memory limits are ignored when deciding the eviction order; only requests and usage are considered.
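The pods, requests and usage numbers below are made up purely to illustrate the ranking.

```python
# Hypothetical pods on one worker node under memory pressure.
# Ranking: (1) does usage exceed requests?  (2) priority (omitted here)
#          (3) usage relative to requests.
mi = 1024 ** 2
pods = [
    # A guaranteed pod running close to its request (= limit).
    {"name": "guaranteed-pod", "qos": "Guaranteed", "request": 1000 * mi, "usage": 950 * mi},
    # A burstable pod using far less than it requested.
    {"name": "burstable-pod",  "qos": "Burstable",  "request": 2000 * mi, "usage": 400 * mi},
    # A best-effort pod has no requests, so any usage exceeds its request of 0.
    {"name": "besteffort-pod", "qos": "BestEffort", "request": 0,         "usage": 300 * mi},
]

def eviction_key(p):
    exceeds = p["usage"] > p["request"]   # best-effort always exceeds
    over = p["usage"] - p["request"]      # usage relative to request
    return (exceeds, over)

for p in sorted(pods, key=eviction_key, reverse=True):
    print(p["name"], p["qos"])
# besteffort-pod goes first, but guaranteed-pod is ranked ahead of
# burstable-pod because its usage is much closer to its request.
```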

Node OOM (Out Of Memory)

As Kubelet polls cAdvisor data every 10 seconds, it’s quite possible that a sudden spike in usage triggers the kernel OOM killer before Kubelet can catch it or reclaim memory by eviction. In this situation, the kernel OOM killer, rather than Kubelet, decides which containers to terminate. However, Kubelet still influences the outcome via the oom_score_adj it sets for every container based on the pod’s QoS: guaranteed pods get a large negative value, best-effort pods get 1000, and burstable pods get a value in between that depends on their memory request (see the sketch further below).

If the node experiences a system out of memory (OOM) event before Kubelet can reclaim memory by eviction, the kernel OOM killer steps in to relieve the pressure. It calculates an oom_score for each container based on the following expression.

oom_score = % of node memory the container is using + oom_score_adj

The OOM killer kills the container with the highest score. If multiple containers have the same score, it kills the one using more memory. Because the oom_score_adj reflects the pod’s QoS rather than consumption of the starved resource, the OOM killer may terminate a process that is using relatively little memory, and the node can come under pressure again almost instantly. This can repeat, affecting multiple containers.

Consider the following example with four pods, each running one container. Both burstable pods have requested 4Gi of memory and the total node capacity is 30Gi. Kubelet assigns each pod an oom_score_adj value based on its QoS, as sketched below.
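The original post lists the assigned values in a table; the sketch below reconstructs the idea using the burstable formula from the Kubelet’s QoS policy (the exact constants vary slightly between Kubernetes versions), and assumes the remaining two pods are best-effort.

```python
# Hypothetical reconstruction: two burstable pods requesting 4Gi each and two
# best-effort pods on a node with 30Gi of memory.
# Approximate oom_score_adj per QoS class (version dependent):
#   Guaranteed  -> around -997/-998
#   BestEffort  -> 1000
#   Burstable   -> min(max(2, 1000 - 1000 * request / node_capacity), 999)
gi = 1024 ** 3
NODE_CAPACITY = 30 * gi

def oom_score_adj(qos, memory_request=0):
    if qos == "Guaranteed":
        return -997
    if qos == "BestEffort":
        return 1000
    return min(max(2, 1000 - (1000 * memory_request) // NODE_CAPACITY), 999)

for name, qos, request in [("burstable-1", "Burstable", 4 * gi),
                           ("burstable-2", "Burstable", 4 * gi),
                           ("besteffort-1", "BestEffort", 0),
                           ("besteffort-2", "BestEffort", 0)]:
    print(name, qos, oom_score_adj(qos, request))
# Both burstable pods get roughly 867; both best-effort pods get 1000.
```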

When the OOM killer is invoked, it first targets the containers with the highest score, here the two best-effort containers with an oom_score_adj of 1000. Since two containers are tied, the OOM killer picks whichever of the two is using more memory.

Note

  • As Kubelet polls cAdvisor metrics every 10 seconds, it may evict multiple pods before it notices that usage has dropped back below the eviction threshold.
  • When Kubelet evicts a pod, Kubernetes will reschedule it immediately on another node with enough resources, assuming the pod is managed by a controller such as a Deployment or DaemonSet. Depending on the restart policy, node affinity, or whether the pod runs as part of a daemonset, the scheduler may even place the pod back on the same node.

Hopefully you are now aware of how, and in which scenarios, pods are evicted. If you set the right requests and limits for your components, know the options available, and understand your workload, it becomes much easier to balance your cluster and to overcommit safely.

