Kubernetes Compute Resource Management

Islam Hamdi
Published in Almatar
Apr 18, 2019 · 7 min read

At Almatar, working with Kubernetes (k8s) at scale, a few occasions where we encountered unexpected behavior from some of our services led us to ask how to make things more stable and reliable. One of those questions was: what should we do when a pod stops running because k8s couldn't satisfy the resources (in terms of memory and CPU) that pod needs on a specific node? Should we simply restart the pod, or is there a smarter way to manage resource allocation? Let's look at how we ran into this situation and how we managed to better control it.

Kubernetes, The Big Picture

Image from Dzone

In a containerized world, Kubernetes has multiple components in its control plane (a.k.a. the master node); one of them is the kube-scheduler. It's the component responsible for watching pods in the cluster, assigning workloads to the corresponding node, and tracking resource utilization on each running host in order to match workloads with the available resources.

Resource requests and limits are the parameters Kubernetes uses to know what resources a pod requires to operate properly, and what maximum resources the pod is allowed to utilize.

The scheduler is also responsible for matching the container's needs, in terms of compute resource utilization (memory/CPU), with the available nodes. For a newly created pod requiring X memory and Y CPU, the scheduler will make sure to assign the pod to a node that can handle that workload. But what happens otherwise? By default, Kubernetes does not enforce resource limits for memory/CPU, so a container can use as many resources as it can grab, and pods on the same node end up affecting each other and risking a congestion state.
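As a minimal sketch of that default, here is what an unbounded container spec looks like (the image and service name are just placeholders taken from the later example):

spec:
  containers:
  - image: php                 # placeholder image
    name: flights-x-service    # hypothetical service name
    # no "resources" section at all: this container may consume as much
    # memory/CPU as the node has, competing with every other pod on it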

Load Testing Backstory

Photo by Thomas Kelley on Unsplash

Before releasing our flights product, we started doing some load testing using Apache JMeter, gathering a few metrics through Grafana to make sure things stay stable during high traffic.

One side effect of running the tests multiple times consecutively comes from how Linux kernel memory management works (too complex a topic to discuss here, but worth reading about along with cgroups). By executing a high number of requests against the same pod, memory/CPU usage spiked and the kernel wasn't able to free up space, which caused memory to keep piling up until it started affecting the other services running on the same cluster node, leading to a contention state where pods fought over the available resources.

If a container exceeds the memory limit it's allowed (in our case it was unbounded), the kubelet (the daemon running on each node) will start taking action and restart the problematic pod if it can. If it exceeds its requested memory (in our case also unset) and causes the node to run out of memory, the pod will be EVICTED (or OOM-killed, depending on how the kernel signals it), and other pods on the same node might also be evicted, causing more chaos and making these services stop running intermittently.

At this stage, things might still be fine: Kubernetes will restart the pod, the Linux kernel will reclaim some space, and the scheduler will re-shuffle pods onto other nodes (remember, no allocation policy exists yet), but we still risk the same behavior happening again on other nodes. We need to inform Kubernetes about the expected consumption of each service so that it can make better allocation decisions given the current infrastructure.

What Are These Requests/Limits In K8s?

After the kube-scheduler decides which node to place the pod/container on, it's important that the node has enough resources available for the container to run; watching over that is one role of the kubelet daemon running on each node, which is responsible for the health of its pods. For instance, if your app requires 8 GB of memory (RAM) and the host has less than 8 GB available, the node will run out of memory and things will stop working. It's important to declare these needs so that the scheduler can make better decisions during the placement step. But how, and what exactly are these metrics?

>> Requests (e.g. 8 GB memory, 0.1 CPU): tells Kubernetes to only schedule the pod on a node that has these resources available.

>> Limits (e.g. 10 GB memory, 0.2 CPU): tells Kubernetes to never let the pod exceed these thresholds while consuming resources (it can go up to these limits, then it gets restricted).

spec:
  containers:
  - image: php
    imagePullPolicy: IfNotPresent
    name: flights-x-service
    ports:
    - containerPort: 80
      protocol: TCP
    resources:
      limits:
        cpu: 800m
        memory: 2000Mi
      requests:
        cpu: 700m
        memory: 1000Mi

In short, the above example describes a deployment for a pod with one running container of a PHP image, specifying its requests and limits as follows:

  • requests cpu: 700m (in millicores)
  • requests memory: 1000Mi (in Mebibytes)
  • limits cpu: 800m (can go up to 800 millicores)
  • limits memory: 2000Mi (can take up to 2000 mebibytes)

The scheduler will first try to find a node that can satisfy the requests thresholds; if one is found, the pod is allocated there. But what happens if it couldn't find one? What if the requests thresholds are set too high, or the limits are exceeded at run-time?

  • In case the requests values are set too high, the scheduler will fail to place the pod on any node and it will hang in a Pending state. Looking at the events fired in the namespace, we'll see the reason:

$ kubectl get events | grep flights-x-service
Warning  FailedScheduling  0/x nodes are available: x Insufficient memory, y Insufficient cpu.
  • For other cases where, at run-time, the pod exceeds its memory/CPU limit, Kubernetes behaves differently depending on the resource type. CPU is a compressible resource: Kubernetes will throttle the CPU when the container reaches its limit, but will not terminate the pod, resulting in worse performance for your app. Memory, on the other hand, is not compressible (there is no way to throttle it like CPU), so if the container exceeds its allowed quota, the pod will be terminated.

There are other, more in-depth articles that discuss how the kubelet handles out-of-resource cases.

How To Define Requests-Limits Thresholds (Upper-Limits)?

In the beginning, we can go with a mostly experimental approach: track resource utilization while doing load testing, then use our analytics/monitoring tools to track usage on the pods and adjust the thresholds if needed to prevent over- or under-utilization of the resources. The outcome is a set of upper limits, or ulimits for short:

1. We can get some rough estimations of the resource allocations in the cluster if we manage to mimic the behavior of hundreds or thousands of users accessing our platform within a short time window. If it's a simple API we can use something like loader.io, or build more complex flows with Apache JMeter, then monitor the resource utilization of every pod using kubescope-cli or any other analytics/monitoring tool integrated with Kubernetes, like Grafana.

2. Once we record the utilized memory/CPU, we can start categorizing pods by their utilization type. Remember, the next question we have to answer is: if we have a new service deployment, what resource thresholds should it be assigned? A greedy approach is to have three main categories:

  • High-Traffic: At Almatar, these are the main consumers of requests along the critical path of user-facing actions like search, product selection, etc.
  • Mid-Traffic: services that are hit on a probabilistic basis, e.g. a service backed by a caching layer that is not consumed on every heavy-traffic request path.
  • Low-Traffic: services that are hit with low probability, e.g. a service that serves static data for flights/hotels through some APIs and is mostly used to seed information into another layer like Elasticsearch; these don't need high consumption even while serving requests.

3. Apart from the above buckets, other pods can form separate categories, like database storage. If one pod runs MongoDB, for instance, MongoDB has an extra layer for its ulimits, e.g. https://docs.mongodb.com/manual/reference/ulimit/, which covers a more granular level of limits regarding file sizes, in-memory allocation, stack size, etc.

4. For any new service, the default deployment file should set some default requests/limits depending on which category the service falls into, as sketched below. Later on, these limits can be adjusted depending on the need, redesigning the requests' critical paths, and so on.
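As an illustration of such per-category defaults, here is a minimal sketch for a hypothetical mid-traffic service; the service name and the numbers are assumptions for the example, not our actual production values:

spec:
  containers:
  - image: php
    name: hotels-static-service   # hypothetical mid-traffic service
    resources:
      requests:
        cpu: 250m        # illustrative starting point for the mid-traffic bucket
        memory: 512Mi
      limits:
        cpu: 500m        # room to burst, still throttled before hurting neighbors
        memory: 1024Mi   # exceeding this gets the container OOM-killed

The actual values per category would come out of the load-testing and monitoring steps above and get revised as the traffic profile changes.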

Final Thoughts

  • Some resource quotas can be set at the namespace level; they control the aggregated resource consumption across all pods in the same namespace. A ResourceQuota Kubernetes object can be created per classification (think of it as a tag), with thresholds for memory/CPU set for each tag; each created pod is then placed under the tag that references the resource thresholds defined in the ResourceQuota object, e.g. high-traffic-pod, low-traffic-pod, etc. (a minimal sketch follows this list).
  • Production load testing can be tricky: we need to continuously run these periodic tests in the prod environment during off-peak times, while making sure resources will not be eaten up and cause downtime for other services; setting upper limits is quite important before going into this stage.
  • This article describes a basic understanding of why setting requests/limits for k8s pods matters, how requests and limits differ, and how Kubernetes behaves with regard to these thresholds. Many other approaches can be used to refine these thresholds, with more control over not only pods but also persistent volumes, storage requests, etc.
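As a minimal sketch of the namespace-level quota idea, here is a ResourceQuota capping the aggregate requests/limits of all pods in one namespace; the namespace name and the numbers are illustrative assumptions (per-category "tags" could be modeled, for example, as separate namespaces, each with its own quota):

apiVersion: v1
kind: ResourceQuota
metadata:
  name: high-traffic-quota   # hypothetical name for the high-traffic bucket
  namespace: flights         # assumed namespace; adjust to your own
spec:
  hard:
    requests.cpu: "4"        # sum of CPU requests across all pods in the namespace
    requests.memory: 8Gi     # sum of memory requests
    limits.cpu: "8"          # sum of CPU limits
    limits.memory: 16Gi      # sum of memory limits

Note that once such a compute quota is active in a namespace, Kubernetes rejects new pods that don't declare the corresponding requests/limits, which conveniently enforces the discipline described above.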
