A Guide to Kubernetes Application Resource Tuning — part 3

V. Sevel
10 min read · Jan 4, 2023


This article aims to provide a good understanding of container resource sizing in Kubernetes.

This is the third of 3 parts:

Assessing resource requirements

A key aspect of working with Kubernetes, compared to traditional workloads, is the ability to describe resource requirements, which allows making better-informed placement decisions. Without it, you can still run on Kubernetes, but you are missing a lot of its value.

What are the resource requirements for your application?

First, it is worth making a distinction between memory and CPU, simply because the effects are very different depending on which one you run short of:

  • If you are running out of memory, the pod will get killed and restarted, leading to interruption of in-flight activity.
  • If you are running out of CPU, your containers will be throttled by the CFS, leading to slowness.

For that reason, it is a good idea not to over-commit memory (i.e. not to set a memory limit greater than the memory request).

The memory request should be enough for the application to run for its entire normal life. If you set a limit higher than the request, you risk getting the pod killed when it tries to grow its memory usage while the node is close to OOM. For that strategy to be effective, you also have to make sure that your containers are able to give memory back.

Until Java 11, the JVM would only consume additional memory and never give it back, even if parts of it were not actually used, for instance in the heap. G1 was improved in the context of JEP 346 to return unused committed memory to the OS. ZGC followed in Java 13 through JEP 351. Shenandoah might be an even better option, since it was backported to Java 11.

VMware clusters are licensed by the amount of allocated memory. If the business and operational objectives can be met with non-guaranteed memory, this may be an option to lower the cost of the infrastructure. Note however that most products are priced on CPU allocation rather than memory, so the effect on the TCO will be limited.

If you want to stay on the safe side, and assuming your application does not have a memory leak, assessing memory requirements is a matter of running a simulation of the expected load on the targeted service, with the target application configuration, and making sure the container does not go OOM. Once you have found your memory requirement, set request = limit, which guarantees the resource availability and avoids altogether the risk of a node-level OOM.
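
As an illustration, here is a minimal sketch (hypothetical names and values, using the Python Kubernetes client) of a container whose memory request equals its memory limit:

```python
from kubernetes import client

# Hypothetical sizing obtained from a load test of the target configuration.
# Setting the memory request equal to the memory limit means this container's
# memory is never overcommitted; the CPU limit is deliberately left unset here.
resources = client.V1ResourceRequirements(
    requests={"memory": "1Gi", "cpu": "500m"},
    limits={"memory": "1Gi"},
)

container = client.V1Container(
    name="app",                          # hypothetical container name
    image="registry.example/app:1.0",    # hypothetical image
    resources=resources,
)
```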

For CPU you have 3 choices, intuitively mapping to the 3 Kubernetes QoS classes (illustrated in the sketch after the list):

  • Not defining CPU requirements (aka Best effort).
  • Using a limit equal to the request (aka Guaranteed).
  • Something in the middle (aka Burstable).
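
To make these options concrete, here is a minimal sketch (hypothetical values, plain Python dictionaries mirroring the resources block) of the CPU side of each choice:

```python
# Hypothetical CPU settings for the three options; memory is omitted for brevity.
best_effort = {}  # no CPU request or limit: the pod only gets the default CPU shares

guaranteed = {
    "requests": {"cpu": "1"},
    "limits": {"cpu": "1"},    # request == limit: reserved, but capped at 1 core
}

burstable = {
    "requests": {"cpu": "250m"},
    "limits": {"cpu": "2"},    # can burst up to 2 cores when the node has spare cycles
}
```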

The more precisely you define requirements for your application, the more predictable its behavior will be. But this may also lower CPU efficiency, since unused resources reserved by a specific process will not be usable by neighboring workloads. Guaranteed CPU makes sense when you cannot compromise on predictability, or when your application has a continuous resource consumption near its limit.

For many workloads however, we will want to find the best compromise between behavior predictability and resource usage efficiency.

Recommendation engines and the percentile approach

CPU requests should be set to the minimum amount of resources needed by the application. One approach consists in looking at past behavior and assessing requirements based on how much the application consumes most of the time. In other words, what is the value of y in the following sentence for different values of x: the application consumes less than y cores x% of the time. This measure is the nearest-rank method of the so-called percentile function (see the sketch after the list below). Compared to the average, it has 2 extra benefits:

  • Captures volatility, as opposed to the average, which treats [50, 50, 50, 50] and [0, 25, 75, 100] the same.
  • Offers a cursor to characterize the "most of the time" variable. Well-known percentiles are 99 (e.g. the application consumes less than y cores 99% of the time), 95 and 90, with 50 also known as the median value.
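
As a minimal sketch (assuming per-pod CPU usage samples in millicores collected from your monitoring stack, and NumPy ≥ 1.22 for the method argument), the nearest-rank percentile can be computed like this:

```python
import numpy as np

# Hypothetical CPU usage samples for one pod, in millicores, one value per scrape interval.
samples = np.array([120, 135, 90, 480, 150, 140, 700, 130, 125, 160])

# Nearest-rank percentiles of past usage, candidates for the CPU request.
p99 = np.percentile(samples, 99, method="nearest")
p95 = np.percentile(samples, 95, method="nearest")
median = np.percentile(samples, 50, method="nearest")

print(f"P99={p99}m  P95={p95}m  median={median}m")
```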

When the requests are calculated using a high percentile (e.g. 90, 95, 99), it lowers the pod density on worker nodes, which lowers the node CPU usage, but brings high predictability for the workload behavior, because there is a high probability that it will receive the resources it needs. You only gamble on the spikes.

When the requests are calculated using a low percentile (e.g. 65, 70, 75, 80), it increases the pod density on nodes, which improves CPU usage, but brings less predictability for the workload, because a bigger portion of its resource usage has to be provided through overcommit (i.e. relying on the other pods leaving enough headroom for it to handle its spikes by consuming above its request).

Keeping track of POD resource usage, calculating the targeted percentile and retrofitting it as requests on our Deployments or StatefulSets may be cumbersome. To help with this, resource recommendation engines such as the Vertical Pod Autoscaler (used in Google Autopilot) or Harness provide recommendations that can be applied with different levels of automation.

Harness has an interesting approach, where you can choose between different profiles (cost optimized, performance optimized or custom).

VPA uses slightly different algorithms for CPU and memory, plus a weighting that gives more importance to the most recent observations. However, VPA has some limitations when running alongside HPA, or when running Java workloads, in spite of the recent GC features we saw earlier.

What they both have in common is the use of a percentile calculation on past usage to assess requests.

Research shows that a P95 percentile guarantees the desired access to resources with a very high probability, even when the desired level is well above the configured requests, as we are going to see next.

See also Getting Started with Cloud Cost Optimization (Harness) and Vertical Pod Autoscaling: The Definitive Guide (VPA).

Case study

For this case study, we used Materna-Trace-1 from The Grid Workloads Archive. From that trace, we extracted the CPU usage from 50 VMs (each VM is configured with an average of 4 cores, which can be seen as a limit) over 8000 points in time, as if these were 50 PODs running on a single Kubernetes worker node:

cumulative CPU usage (ms) of 50 pods over 8000 points in time
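
The exact layout of the GWA-T-13 Materna files is not reproduced here; as a rough sketch, assuming each VM has been exported to a CSV file with a cpu_usage_millicores column, the 50 x 8000 usage matrix used below could be built like this:

```python
import glob

import numpy as np
import pandas as pd

# Hypothetical export of Materna-Trace-1: one CSV per VM with a 'cpu_usage_millicores'
# column (the raw GWA-T-13 files use a different, richer layout).
files = sorted(glob.glob("materna_trace_1/vm_*.csv"))[:50]

usage = np.array([
    pd.read_csv(f)["cpu_usage_millicores"].to_numpy()[:8000]
    for f in files
])                                          # shape: (50 pods, 8000 points in time)

# Placeholder limits: the real trace has per-VM core counts averaging ~4 cores.
limits = np.full(usage.shape[0], 4000)      # millicores
```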

The question we are trying to answer is: how many cores do we need on our worker node for our workload to run efficiently? Another question would be: what are the consequences for my workload if I added or removed resources from my worker node?

As we can see on the graph, all workloads can run without throttling if the worker node has around 30 cores. Anything above this value is a waste of resources on our worker node.

If we had used the sum of the maximum usage over time for each POD, we would have needed to configure our worker node with around 80 cores. And, worse, if we had trusted the limits configured on each POD, we would have needed to configure our worker node with around 180 cores.
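
Reusing the usage matrix and limits from the sketch above, the three sizing strategies are simple aggregations (the figures quoted in this article come from the original trace, with its real per-VM limits):

```python
# Peak of the aggregate curve: the minimal node size with no throttling at all (~30 cores).
ideal_node_millicores = usage.sum(axis=0).max()

# Sizing on the sum of per-POD peaks (~80 cores in this trace).
sum_of_pod_maxima = usage.max(axis=1).sum()

# Sizing on the sum of configured limits (~180 cores in this trace).
sum_of_limits = limits.sum()
```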

The 30 cores above is the ideal number if we do not want to throttle. But if we do allow a bit of throttling then we can run our workload with a much smaller configuration. Without any calculation at this point, we can see on the diagram that almost the entire node usage is below 15 cores, and most of it is above 10 cores.

To get more accurate numbers, we created a naive CFS simulator that distributes CPU cycles to all PODs at each point in time (a minimal sketch in Python follows the list). Here is how it works:

  • We calculate a k8s request using a percentile x (e.g. P95) on each POD, based on the observed usage from the source.
  • We optionally assign a k8s limit as a factor of the k8s request for each POD (e.g. limit = request * 20).
  • At each point in time, the entire node capacity (e.g. 30000 millicores) is distributed to all PODs, proportionally to their requests.
  • A POD will try to match its targeted consumption, but may be limited either by its own limit (e.g. a POD has a request of 100 millicores, a limit of 2000, and a targeted consumption of 2500 at a particular point in time), or by the node running out of CPU cycles to distribute. In either case the POD is throttled at that point in time if it does not receive what it is asking for.
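
Below is a minimal sketch of such a simulator, not the exact code used to produce the numbers in this article: requests come from a percentile of observed usage, limits are a multiple of requests, the node is sized at the sum of requests, and at each point in time the capacity is shared proportionally to the requests of the pods that still want CPU.

```python
import numpy as np

def simulate(usage, percentile=99, limit_factor=20):
    """Naive CFS-like simulation over a (pods, points in time) usage matrix in millicores."""
    n_pods, n_points = usage.shape

    # k8s requests from a percentile of observed usage, limits as a multiple of requests.
    requests = np.maximum(np.percentile(usage, percentile, axis=1, method="nearest"), 1.0)
    limits = requests * limit_factor
    node_capacity = requests.sum()       # the worker node is sized at the sum of requests

    throttled = 0    # pod/point combinations that did not get their targeted consumption
    node_maxed = 0   # points in time where the whole node capacity was handed out

    for t in range(n_points):
        target = usage[:, t]
        demand = np.minimum(target, limits)  # a pod never consumes above its own limit
        granted = np.zeros(n_pods)
        capacity = node_capacity
        wanting = demand - granted > 1e-6

        # Water-filling: hand out capacity proportionally to requests, redistributing
        # what satisfied pods leave on the table, until demand or capacity is exhausted.
        while capacity > 1e-6 and wanting.any():
            share = capacity * requests[wanting] / requests[wanting].sum()
            given = np.minimum(share, (demand - granted)[wanting])
            granted[wanting] += given
            capacity -= given.sum()
            wanting = demand - granted > 1e-6
            if given.sum() < 1e-6:
                break

        throttled += int((granted + 1e-6 < target).sum())
        node_maxed += int(capacity <= 1e-6)

    return node_capacity / 1000, throttled / (n_pods * n_points), node_maxed / n_points
```

With the usage matrix from the earlier sketch, `simulate(usage, percentile=95)` returns the node size in cores, the fraction of throttled pod/points, and the fraction of time the node is maxed out.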

Using this approach, here is the result of the simulation if requests were calculated using the P99, and limit = 20*requests:

  • sum of requests (i.e. configuration of worker node) = 37 cores (to be compared with 30 cores, which is the ideal value for no throttling).
  • number of times a POD is throttled = 40 (out of 400000) = 0.01%. In this case PODs get throttled by their limits (since the node never reaches 100% total CPU).

If we had used the P95 to calculate our requests, we would get sum of requests = 21 cores, throttling = 0.03%.

With an even more aggressive P90, we would get sum of requests = 15 cores, throttling = 0.2%, time of maxed out CPU on node = 2%.

impact on worker node with requests calculated using different percentiles

These results show that by accepting a little bit of throttling, we can save a lot of cores on our worker node.

It may be difficult to picture what some throttling on a POD actually looks like. Let's take for instance POD 23 in our dataset, which has a peak of activity around point in time 2200. In the following area-layered diagram, we show in green (at the back) the amount of CPU we would cover using the P95 for our requests, and on top of it, in yellow, the amount we would cover using the P90:

zoom on activity peaks for pod 23

In that case, the request calculated with the P95 (212 millicores) covers all our needs, whereas the P90 (184 millicores) is too aggressive and gets the process throttled for a short period of time (in the bubble where we see some green).

The horizontal black line shows the ideal request, i.e. the minimal request that allows no throttling at all (in this case 200 millicores, obtained with a P93).

pod 23 throttling and requests with different Px

We can make a few observations from this simple example:

  • a minimal amount of request (e.g. 200 millicores) can cover high CPU peaks (e.g. 700 millicores here, more than 3x) with no throttling. This translates into huge savings compared to a Guaranteed mode, where we would need to cover all peaks with reserved resources.
  • aggressive requests (e.g. using the P90) will lead to throttling, but not necessarily during the highest peaks, which might be counter-intuitive: the process gets throttled when asking for 470 millicores (in the bubble), but not when asking for 700 millicores during a later peak. The reason is that, in a mutualized environment with overcommitting, resource access depends on what the other pods are doing at that moment. There is a bit of luck involved.

Hopefully, this simulation helps you visualize both the savings that can be made on worker node resource allocation and the amount of CPU starvation that comes with allowing some level of throttling on your workload.

Conclusion

It is time to wrap up this long article. Hopefully you now have a better idea of how to approach resource tuning for your application workloads, and of the reasons for doing it.

The first idea we discussed is that on our journey from bare metal to virtualization, and from virtualization to containers, we now live in a very dynamic world, and the overcommitting work that the infrastructure team used to do for us at the virtualization level is no longer sufficient. A direct consequence is that it becomes key to assess resource requirements for our workloads.

In the section Running workload in containers, we looked at the underlying mechanisms, such as memory limits, CPU shares and CPU quotas.

Entering the Kubernetes realm, we saw how those low-level tools map to requests and limits, talked about scheduling, and covered the consequences of underestimating or overestimating resources for our workloads.

We then introduced an assessment approach based on the percentile function, incidentally used in well-known recommendation engines, and illustrated it with a case study.

Some of our best practices:

  • unless you want absolute certainty that your process won't be throttled, do not run in the Guaranteed QoS class (and if you do, make sure you do not get throttled anyway by your virtualization layer).
  • set some requests to provide meaningful information to the Kubernetes scheduler.
  • do not set CPU requests based on peaks.
  • set some CPU limits specifically on high requestors to reintroduce some distribution fairness.
  • get some help from a recommendation engine; once confident, use the auto-apply mode.
  • do not overcommit memory (unless you can afford OOMs from time to time).

Tuning is about finding the sweet spot between workload resource availability (and hence behavior predictability) and resource usage, which is a measure of resource efficiency. Resource reservation is not a binary decision between using the Guaranteed QoS or not. It is a cursor that you need to position for each of your workloads, between high density and high throttling probability on one end (high overcommit), and low density and low throttling probability on the other (low overcommit).

This is key to saving resources on worker nodes in a Kubernetes environment, and core to FinOps principles.
