Efficient Node Out-of-Resource Management in Kubernetes

Kirill Goltsman
Supergiant.io
Aug 12, 2019

As you might already know, kubelet is a primary node component in Kubernetes that performs a number of critical tasks. In particular, kubelet is responsible for:

  • registering nodes with the kube-apiserver
  • monitoring the kube-apiserver for scheduled Pods and telling the container runtime (e.g., Docker) to start containers after a new Pod is scheduled
  • monitoring running containers and reporting their status to the kube-apiserver
  • executing liveness probes and restarting containers that fail them
  • running static Pods directly managed by the kubelet
  • interacting with the Core Metrics Pipeline and the container runtime to collect container and node metrics

Another important kubelet task, and the focus of this article, is the “primary node agent’s” ability to evict Pods when a node runs out of resources. The kubelet plays a crucial role in preserving node stability when compute resources such as disk, RAM, or CPU run low. It is therefore useful for Kubernetes administrators to understand best practices for configuring out-of-resource handling, so that node resources can be used flexibly while preserving the overall fault tolerance of the system and the stability of critical system processes.

How Does Kubelet Decide that Resources Are Low?

As we have mentioned, kubelet can evict workloads from a node to free up resources for other Pods and/or system tasks like the container runtime or the kubelet itself. However, how does the kubelet decide that the resources are low?

The kubelet determines when to reclaim resources based on eviction signals and eviction thresholds. An eviction signal reflects the current available capacity of a system resource, such as memory or storage. An eviction threshold, in turn, is the minimum amount of that resource the kubelet should maintain.

In other words, each eviction signal is associated with a certain eviction threshold that tells the kubelet when to start reclaiming resources. At this time, the following eviction signals are supported:

  • memory.available — Describes the available memory on the node. The default eviction threshold for memory is 100Mi; in other words, the kubelet starts evicting Pods when available memory falls below 100Mi.
  • nodefs.available — The nodefs is the filesystem used by the kubelet for volumes, daemon logs, etc. By default, the kubelet starts reclaiming node resources if nodefs.available < 10%.
  • nodefs.inodesFree — Describes the free inodes on the nodefs filesystem. By default, the kubelet starts evicting workloads if nodefs.inodesFree < 5%.
  • imagefs.available — The imagefs is an optional filesystem used by the container runtime to store container images and container-writable layers. By default, the kubelet starts evicting workloads if imagefs.available < 15%.
  • imagefs.inodesFree — The free inodes on the imagefs filesystem. It has no default eviction threshold.
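Taken together, the defaults above correspond to a hard-eviction configuration roughly like the following sketch (exact defaults can vary between Kubernetes versions and distributions):

--eviction-hard=memory.available<100Mi,nodefs.available<10%,nodefs.inodesFree<5%,imagefs.available<15%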

The above-described eviction thresholds are quite sensible defaults. However, users can configure their custom eviction thresholds by setting appropriate flags on the kubelet binary. These user-defined thresholds can change the default kubelet eviction behavior.

At this time, Kubernetes supports hard and soft eviction thresholds.

If a hard eviction threshold is reached, the kubelet starts reclaiming resources immediately, without any grace period. In contrast, soft eviction thresholds include a user-defined grace period that should expire before the kubelet starts reclaiming any resources.

You can define a hard eviction threshold with the --eviction-hard flag on the kubelet binary. For example, kubelet --eviction-hard=memory.available<1Gi tells the kubelet to start reclaiming resources when the node’s memory.available drops below 1Gi.

If you want to allow a grace period before eviction, use the --eviction-soft flag in combination with the --eviction-soft-grace-period flag. For example, kubelet --eviction-soft=memory.available<2Gi --eviction-soft-grace-period=memory.available=1m30s makes the kubelet wait 90 seconds after the soft threshold is crossed before triggering an eviction.

You can also cap the termination grace period granted to Pods evicted in response to a soft threshold by setting --eviction-max-pod-grace-period (in seconds).
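Putting the soft-eviction flags together, a kubelet configured roughly as follows would wait 90 seconds after the soft memory threshold is crossed before evicting Pods and would give each evicted Pod at most 60 seconds to terminate (the 60-second cap is an illustrative value):

--eviction-soft=memory.available<2Gi
--eviction-soft-grace-period=memory.available=1m30s
--eviction-max-pod-grace-period=60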

How Does the Kubelet Reclaim Resources?

The kubelet evicts end-user Pods only as a last resort. It first tries to reclaim node-level resources such as unused container images and dead Pods.

The kubelet reclaims node resources differently if a node has a dedicated imagefs filesystem along with the nodefs filesystem. In this case, if the nodefs reaches the eviction threshold, the kubelet deletes all dead Pods and their containers. Correspondingly, if the imagefs reaches the eviction threshold, the kubelet removes all unused container images.

If there is no imagefs used, the kubelet first deletes all dead Pods and their containers and then removes all unused images. For more information about this process, see this article from the Kubernetes documentation.

If reclaiming container images, dead Pods, and other node-level resources does not relieve the resource starvation, the kubelet starts evicting end-user Pods as a last resort. The kubelet decides which end-user Pods to evict based on the Pod’s QoS class, Pod priority, and a number of other parameters discussed below. Before describing this process, let’s recall the basic QoS classes in Kubernetes.

As you may already know from our previous tutorials, Pods in Kubernetes can be Guaranteed, Burstable, or Best-Effort (see the example manifests after this list):

  • Guaranteed Pods are Pods where resource limits and optionally requests are set for both CPU and RAM in all containers, and they are equal.
  • Burstable Pods are Pods where requests and (optionally) limits are set for one or more resources (e.g., CPU, RAM) for one or more containers, and they are not equal.
  • Best-Effort Pods are Pods with no resources set.
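For illustration, here is a minimal sketch of a Pod spec for each QoS class; the Pod names, image, and resource values are hypothetical:

apiVersion: v1
kind: Pod
metadata:
  name: qos-guaranteed            # hypothetical name
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        memory: "256Mi"
        cpu: "250m"
      limits:
        memory: "256Mi"           # limits equal requests for CPU and memory => Guaranteed
        cpu: "250m"
---
apiVersion: v1
kind: Pod
metadata:
  name: qos-burstable             # hypothetical name
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        memory: "128Mi"           # a request is set without a matching limit => Burstable
---
apiVersion: v1
kind: Pod
metadata:
  name: qos-besteffort            # hypothetical name
spec:
  containers:
  - name: app
    image: nginx                  # no requests or limits at all => Best-Effort

You can check the class Kubernetes assigned with kubectl get pod qos-guaranteed -o jsonpath='{.status.qosClass}'.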

This QoS model is implicitly used by the kubelet in its Pod ranking scheme. In general, the kubelet ranks candidates for eviction using the following rules:

  • whether or not a Pod has exceeded its resource requests. In Kubernetes, Pods are scheduled based on their requests rather than their limits, so all containers and Pods are guaranteed the amount of RAM/CPU they request. However, if no limits are set and a Pod has exceeded its resource requests, it can be killed or throttled when a Guaranteed Pod or a system task requires the constrained resource. In certain circumstances, even Pods that consume less than they requested can be killed, for example, when memory for system tasks is critically low and there are no lower-priority Pods to kill.
  • by Pod priority. If no Pods have exceeded their requests, the kubelet checks Pod priority and tries to evict lower-priority Pods first (see the example manifest after this list). Note: Pod priority and preemption graduated to GA in Kubernetes 1.14 and have been enabled by default since 1.11. You can learn more about Pod priority in this article.
  • by the consumption of the starved compute resource (e.g., RAM) relative to the Pods’ resource requests.
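As a quick illustration of Pod priority, a PriorityClass and a Pod that references it might look like the sketch below; the class name, value, and Pod details are hypothetical:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: important-workloads       # hypothetical name
value: 100000                     # higher value = higher priority, evicted later
globalDefault: false
description: "Pods that should be evicted only after lower-priority workloads."
---
apiVersion: v1
kind: Pod
metadata:
  name: batch-worker              # hypothetical name
spec:
  priorityClassName: important-workloads
  containers:
  - name: worker
    image: busybox
    command: ["sleep", "3600"]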

Given these rules, the kubelet evicts end-user Pods in the following order:

  • The first candidates for eviction are Best-Effort and/or Burstable Pods whose usage of the constrained resource exceeds their requests. If there are several such Pods, the kubelet ranks them by priority and then by how much their consumption exceeds the request.
  • Guaranteed and Burstable Pods whose resource usage stays below their requests are evicted last. However, if system tasks such as the kubelet or Docker need the starved resource and there are no Best-Effort Pods on the node, the kubelet can evict Guaranteed Pods that consume less than their requests. In that case, it evicts the Guaranteed and/or Burstable Pods with the lowest priority first (see below for how to spot evicted Pods on a cluster).
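To see the result of this process on a live cluster, note that evicted Pods end up in the Failed phase with the reason Evicted, so commands along these lines (assuming kubectl access to the cluster) can help you audit evictions:

kubectl get pods --all-namespaces --field-selector=status.phase=Failed
kubectl describe pod <pod-name>    # the Status/Events output includes the eviction reason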

Minimum Eviction Reclaim

If the amount of resources the kubelet reclaims is small, the system can repeatedly hit eviction thresholds. This is not desirable because it can lead to poor scheduling decisions and frequent Pod evictions. To avoid this scenario, users can set a per-resource minimum reclaim level with the --eviction-minimum-reclaim flag on the kubelet binary.

For example, take a look at the kubelet configuration below:

--eviction-hard=memory.available<1Gi,nodefs.available<2Gi,imagefs.available<200Gi
--eviction-minimum-reclaim=memory.available=0Mi,nodefs.available=1Gi,imagefs.available=2Gi

This --eviction-minimum-reclaim setting ensures that after a reclaim at least 3Gi of nodefs storage and at least 202Gi of imagefs storage are available. In this way, the configuration above gives the system enough headroom to avoid hitting eviction thresholds too frequently.

Another potential issue with a poor out-of-resource handling configuration is the oscillation of node conditions. When the kubelet receives an eviction signal, it maps the signal to a corresponding node condition. For example, when the memory.available eviction threshold is hit, the kubelet assigns the MemoryPressure condition to the node. This condition is associated with a corresponding taint that prevents new Pods from being scheduled on that node. You can find more information about node conditions in our earlier article.

However, if you use a soft eviction threshold with a long grace period, the node condition can oscillate between true and false during that period. This can lead to eviction indeterminacy and, therefore, poor scheduling decisions. To avoid this situation, you can use the --eviction-pressure-transition-period flag on the kubelet, which defines how long the kubelet must wait before transitioning out of an eviction pressure condition.
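For example, a setting along the following lines tells the kubelet to wait five minutes before transitioning out of a pressure condition (the exact value is an illustrative choice):

--eviction-pressure-transition-period=5m0s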

A Simple Out-of-Resource Handling Scenario

Now we’ll illustrate how to configure out-of-resource handling for your K8s cluster. Let’s imagine a simple scenario where only node RAM is considered. Assume that our node’s memory capacity is 10Gi of RAM. We would like to reserve 10% of total memory for system daemons such as the kernel, the kubelet, and Docker, and we want to evict Pods at 95% memory utilization.

Out of the box, the kubelet uses the default eviction thresholds and has no system-reserved setting; neither matches the policy we just described.

To achieve our goal, we need to set the following flags on the kubelet:

--eviction-hard=memory.available<500Mi
--system-reserved=memory=1.5Gi

As you can see, system-reserved is set to 1.5Gi although, intuitively, it should be set to 10% = 1Gi. This is because “system reserved” should also cover the amount of memory protected by the eviction threshold (1Gi + 0.5Gi).
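The arithmetic behind that figure is straightforward:

  10Gi x 10%  = 1Gi   reserved for system daemons
  10Gi x 5%   = 500Mi eviction threshold (evict at 95% utilization)
  1Gi + 500Mi = 1.5Gi system-reserved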

Depending on how you provision the K8s cluster, the kubelet flags can be set in different ways. For example, if you provision your K8s cluster with Kops, run kops edit cluster $NAME to open the cluster configuration in an editor. If it opens in vi, press “i” to enter Insert mode and edit the file. The kubelet settings for the above out-of-resource handling policy should look as follows:

kubelet:
  eviction-hard=memory.available<500Mi
  system-reserved=memory=1.5Gi
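After saving the file, a typical Kops workflow would be to apply the configuration change and roll the nodes so the new kubelet flags take effect, for example:

kops update cluster $NAME --yes
kops rolling-update cluster $NAME --yes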

Conclusion

That’s it! In this tutorial, we discussed some useful administration practices for customizing the kubelet’s out-of-resource management in Kubernetes. The platform allows administrators to set custom eviction thresholds and eviction grace periods to decide which conditions are considered dangerous for node stability. With that freedom, however, comes responsibility: Kubernetes ships with sensible out-of-resource management defaults, so be cautious about setting eviction thresholds too high or making eviction grace periods too long.

Originally published at https://supergiant.io.
