Kubernetes Container Resource Requirements — Part 1: Memory

Understanding requests, limits and throttling

Will Tomlin
Jun 25, 2018 · 5 min read

In order for Kubernetes (K8s) to reliably allocate the resources your component requires and to make the best use of the infrastructure it runs on, you should specify container resource requirements. You can currently specify two types of requirement, requests and limits, for two types of container resource: memory and CPU. This post explains what these two types of requirement mean and what memory means within the Docker container runtime.

Requests vs. Limits

When defining a pod, you can specify two categories of memory and CPU requirements for each of its containers — requests and limits:

resources:
  requests:
    memory: "8Gi"
    cpu: "4"
  limits:
    memory: "8Gi"
    cpu: "4"

Requests is a K8s concept used for scheduling pods onto the underlying infrastructure: "place my pod's containers where there are enough of these resources to accommodate them". Limits is a hard cap on the resources made available to a container, and it propagates to the underlying container runtime (we'll assume Docker here). Exceeding a limit results either in throttling or, in the worst case, termination of the container.

You might be asking if there’s reason to set limits higher than requests. If your component has a stable memory footprint, you probably shouldn’t since when a container exceeds its requests, it’s more likely to be evicted if the worker node encounters a low memory condition. In the case of CPU, additional resources between requests and limits can be opportunistically scavenged, provided they are not being used by other containers. The Burstable QoS class (see below) allows potentially more efficient use of underlying resources at the cost of greater unpredictability — for example, a CPU-bound component’s latency may be affected by transient co-location of other containers on the same worker node. If you’re new to K8s, you’re best off starting with the Guaranteed QoS class by setting limits the same as requests.
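To illustrate the Burstable case, a container spec with limits higher than requests might look like this (the values are purely illustrative):

```yaml
# Burstable QoS: limits exceed requests. The container is scheduled
# against the requests values, can opportunistically use spare CPU
# between 2 and 4 cores, but becomes a more likely eviction candidate
# if its memory footprint grows beyond 4Gi under node memory pressure.
resources:
  requests:
    memory: "4Gi"
    cpu: "2"
  limits:
    memory: "8Gi"
    cpu: "4"
```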

If requests is omitted for a container, it defaults to limits. If limits is not set either, requests defaults to 0 (unbounded).

QoS classes are employed according to the presence/configuration of requests and limits (from here):

If limits and optionally requests (not equal to 0) are set for all resources across all containers and they are equal, then the pod is classified as Guaranteed. These pods are considered top-priority and are guaranteed not to be killed until they exceed their limits, or if the system is under memory pressure and there are no lower-priority containers that can be evicted.

If requests and optionally limits are set (not equal to 0) for one or more resources across one or more containers, and they are not equal, then the pod is classified as Burstable. When limits are not specified, they default to the node capacity. These pods have some form of minimal resource guarantee, but can use more resources when available. Under system memory pressure, these containers are more likely to be killed once they exceed their requests and no Best-Effort pods exist.

If requests and limits are not set for any of the resources, across all containers, then the pod is classified as Best-Effort. These pods are treated as lowest priority. Processes in these pods are the first to get killed if the system runs out of memory. These containers can, however, use any amount of free memory on the node.
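K8s records the computed class in the pod's status once it has been created, so you can check which class you ended up with. A fragment of hypothetical `kubectl get pod <name> -o yaml` output would include:

```yaml
# status section of the pod object (read-only, set by K8s)
status:
  qosClass: Guaranteed   # or Burstable / BestEffort
```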

The Meaning of Memory

What exactly does memory mean here? In a nutshell, it's the total resident set size (RSS) plus page cache usage of a container. In pure Docker, this figure would normally also include swap; however, K8s permanently disables swap on your behalf.

RSS is the amount of RAM used by a process at any given time. For a Java process, this spans several areas, including the heap, non-heap areas, thread stacks and native memory allocations. RSS is influenced by JVM memory configuration, thread counts and component behaviour.

The page cache is an area of RAM used for caching blocks from disk. All I/O is normally performed through this cache, for performance reasons. Whenever your component reads from, or writes to, a file, you can expect the relevant blocks to be cached here; the more files you read or write, the greater the requirement. Note that the kernel will use available spare memory for the page cache but will reclaim it if it is needed elsewhere, which means a component's performance may suffer if there isn't sufficient space (depending on its reliance on the page cache for performance). There's a potential gotcha here: Docker's overlayfs storage driver enables page cache sharing, meaning multiple containers on the same node accessing the same file share the same page cache entries for that file (think indexes or other shared things). The Docker documentation states that:

Accounting for memory in the page cache is very complex. If two processes in different control groups both read the same file (ultimately relying on the same blocks on disk), the corresponding memory charge will be split between the control groups. It’s nice, but it also means that when a cgroup is terminated, it could increase the memory usage of another cgroup, because they are not splitting the cost anymore for those memory pages.

…so be aware that page cache usage per container may vary depending on what files might be shared with other containers running on the same node.

Memory requests and limits are measured in bytes, though you can specify a number of suffixes. JVM memory configuration is expressed using binary prefixes (e.g. -Xmx1g is 1024³ bytes), so it makes sense to use the equivalent K8s binary suffix (Gi) as the basis for your container specification.
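For example, aligning a container's memory limit with a hypothetical -Xmx2g heap setting (remembering that total container memory must also cover non-heap areas and page cache, so the figure below is illustrative only):

```yaml
# "2Gi" = 2 x 1024^3 = 2,147,483,648 bytes, the same unit as -Xmx2g
# "2G"  = 2 x 10^9   = 2,000,000,000 bytes, a decimal prefix, NOT the same
resources:
  limits:
    memory: "2Gi"
```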

Memory Throttling?

Memory is considered a non-compressible resource in K8s parlance, meaning it cannot be throttled. If your container experiences memory pressure, the kernel will aggressively drop page cache entries to satisfy demand, and the container may eventually be killed by the Linux Out of Memory (OOM) killer. Since K8s disables swap (by passing --memory-swappiness=0 to Docker), it's quite an unforgiving environment for misconfiguration.

You should empirically determine your component’s memory requirements in order to create a good configuration; if your component depends on good page cache efficiency, allow appropriate overhead via your container’s resource requirements. Of course, things change and your component memory requirements will invariably follow suit — consider monitoring and alerting on memory statistics as part of your normal operations function.
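As a sketch, suppose load testing showed a component peaking at around 6Gi of RSS while benefiting from roughly 1.5Gi of page cache (these figures are purely illustrative); a Guaranteed-class configuration with some headroom might then be:

```yaml
resources:
  requests:
    memory: "8Gi"   # ~6Gi RSS + ~1.5Gi page cache + headroom (illustrative)
  limits:
    memory: "8Gi"   # equal to requests, yielding Guaranteed QoS
```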

Expedia Group Technology

Stories from the Expedia Group Technology teams

Will Tomlin is a Principal Engineer at Hotels.com.