Tuning JVM containers for better CPU and memory utilisation in a K8s environment

Anurag Prakash
7 min read · Nov 5, 2022


The JVM is one of the oldest yet most powerful virtual machines ever built.

Whenever a new JVM process starts, all required classes are loaded into memory by an instance of the ClassLoader. This process takes place in three steps:

  1. Bootstrap Class Loading: The “Bootstrap Class Loader” loads the essential Java classes, such as java.lang.Object, into memory. These classes reside in JRE/lib/rt.jar.
  2. Extension Class Loading: The ExtClassLoader is responsible for loading all JAR files located on the java.ext.dirs path. In non-Maven or non-Gradle based applications, where a developer adds JARs manually, all those classes are loaded during this phase.
  3. Application Class Loading: The AppClassLoader loads all classes located on the application class path.

This initialisation process is based on a lazy loading scheme.
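
To make the hierarchy above concrete, here is a minimal Java sketch (purely illustrative, not from the original setup) that prints which loader owns a core class versus an application class, and loads another class on demand:

```java
// Minimal sketch of the class-loader hierarchy and on-demand loading.
public class ClassLoaderDemo {
    public static void main(String[] args) throws Exception {
        // Core classes are loaded by the bootstrap loader, which Java code sees as null.
        System.out.println("java.lang.String loader: " + String.class.getClassLoader());

        // Application classes are loaded by the application (system) class loader.
        System.out.println("App class loader: " + ClassLoaderDemo.class.getClassLoader());

        // Classes can also be loaded lazily, by name, the first time they are needed.
        Class<?> onDemand = Class.forName("java.util.concurrent.ConcurrentHashMap");
        System.out.println("Loaded on demand: " + onDemand.getName());
    }
}
```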

Once class-loading is complete, the important classes (those used at the time of process start) are compiled and kept in the JVM's code cache as native code, which makes them faster to access at runtime. Other classes are loaded on a per-request basis.

The first request made to a Java web application is often substantially slower than the average response time during the lifetime of the process. This initial startup period can usually be attributed to lazy class loading and just-in-time compilation.

Keeping this in mind, for low-latency applications, we need to cache all classes beforehand — so that they’re available instantly when accessed at runtime.

This process of tuning the JVM is known as warming up.
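
One common way to do this, sketched below as a plain-Java illustration (the endpoint and request count are assumptions, not values from this article), is to fire a burst of synthetic requests at the service's critical paths right after startup, so that the relevant classes are loaded and the hot code is JIT-compiled before real traffic arrives:

```java
import java.net.HttpURLConnection;
import java.net.URL;

// Illustrative warm-up runner: the endpoint and iteration count are placeholders.
public class Warmup {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://localhost:8080/health"); // hypothetical endpoint
        for (int i = 0; i < 500; i++) {
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.getResponseCode();   // exercises class loading and JIT on the request path
            conn.disconnect();
        }
        System.out.println("Warm-up complete, ready to receive traffic");
    }
}
```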

High latency during JVM warm-up is a prominent problem. JVM-based applications deliver great performance, but they need some time to “warm up” before reaching top speed. When the application launches, it usually starts with reduced performance. This can be attributed to things like Just-In-Time (JIT) compilation, which optimises frequently used code by collecting usage profile information. The net negative effect is that requests received during this warm-up period have a very high response time compared to the average. The problem is exacerbated in containerised, high-throughput environments with frequent deploys and auto-scaling.

In this article, I will discuss the JVM warm-up issues and high heap memory utilisation we saw in our Kubernetes cluster, how we approached them, and what we learnt along the way.

High response time whenever a new pod was scheduled

Initially, when it was not yet clear that this was a JVM warm-up issue, the easiest way to solve it was to run around 3 times the number of pods we needed in steady state. This definitely solved the problem, but it resulted in a high infra cost. On digging deeper into the problem, we found that there are other ways to solve it.

The JVM needs more CPU during the initial warm-up phase, which lasts a few minutes; in our case it needed roughly 3x the configured x CPU. After warm-up, the JVM can comfortably run at its full potential with x CPU. If the required CPU is not available, the pod gets throttled to the available resources, and that causes the entire issue.

There is an easy way to verify this. Kubernetes exposes a per-pod metric, container_cpu_cfs_throttled_seconds_total, which denotes how many seconds of CPU throttling the pod has experienced since it started.
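
Assuming the cluster's cAdvisor metrics are scraped by Prometheus (label names can vary between setups, and the namespace below is a placeholder), a query along these lines shows how much throttling each pod is experiencing:

```
# Throttled CPU seconds per second, averaged over the last 5 minutes, per pod
rate(container_cpu_cfs_throttled_seconds_total{namespace="my-namespace"}[5m])
```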

CPU Throttling

Because of this throttling, warm-up takes longer; until it completes, latency increases, requests start queuing up, and a large number of threads end up in the waiting state.

Kubernetes schedules pods based on “requests”, not “limits”.

This was one very important piece of information we came across while debugging the issue.

Kubernetes assigns QoS classes to pods based on the configured resource requests and limits: Guaranteed (requests equal limits), Burstable (requests set lower than limits) and BestEffort (no requests or limits).

QoS Class in Kubernetes

The answer to our problem seemed pretty clear after getting this insight regarding QoS classes: the Kubernetes Burstable QoS class.

Since Kubernetes uses the values specified in requests to schedule pods, it will find a node with x spare CPU capacity to schedule this pod. But since the limit is much higher at 3x, if the application needs more CPU than x at any time and spare CPU capacity is available on that node, the application will not be throttled on CPU. It can use up to 3x if available.

This fits nicely with our problem statement. During the warm-up phase, when the JVM needs more CPU, it can get it by bursting. Once the JVM is optimised, it can go on at full speed within the request. This allows us to use the spare capacity in our cluster to solve the warm-up problem without any additional cost (we will still face the issue if there is no spare capacity, but 8 times out of 10 the extra capacity required for the limits tends to be there).
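
As a sketch, the container resources in such a Burstable configuration take roughly the following shape; the numbers stand in for x and 3x and are illustrative placeholders, not our actual values:

```yaml
# Burstable QoS: requests sized for steady state, CPU limit ~3x higher
# so the JVM can burst during warm-up if the node has spare capacity.
resources:
  requests:
    cpu: "1"        # x: what the scheduler reserves on the node
    memory: "2Gi"
  limits:
    cpu: "3"        # 3x: the ceiling the pod may burst up to
    memory: "2Gi"
```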

During warm-up (pod startup), CPU usage bursts above 100% of requests
Response time during a deployment with the new configuration

After reducing the CPU resources, it was time to look into the high memory usage of our system.

While we were optimising our costs, we figured out that very high memory resources were allocated to our services: the heap size would keep on increasing, and if the allocated memory was not high enough to meet that requirement, the pod would go into a CrashLoop.

Ineffective garbage collection turned out to be the main reason for the high heap size in our service: heap memory would gradually keep increasing until a major GC ran, which resulted in high memory usage and eventually forced us to provision large resources to keep the service stable.

Heap memory: before vs after introducing G1GC

Garbage Collection: Garbage collection, aka GC, is one of the most important features of Java. It is the mechanism the JVM uses to de-allocate unused memory, which is nothing but clearing the space consumed by unused objects. To do this, the garbage collector tracks all the objects that are still in use and marks the rest as garbage, essentially using a mark-and-sweep algorithm. These are the main collector types:

1. Serial Garbage Collector

The serial garbage collector works by pausing all the application threads. It is designed for single-threaded environments and uses just a single thread for garbage collection. Because it freezes every application thread while doing garbage collection, it may not be suitable for a server environment; it is best suited for simple command-line programs.

Turn on the -XX:+UseSerialGC JVM argument to use the serial garbage collector.

2. Parallel Garbage Collector

The parallel garbage collector is also called the throughput collector. It is the default garbage collector in Java 8. Unlike the serial garbage collector, it uses multiple threads for garbage collection; similar to the serial collector, it still freezes all application threads while performing garbage collection. It can be enabled explicitly with the -XX:+UseParallelGC JVM argument.

3. CMS Garbage Collector

The Concurrent Mark Sweep (CMS) garbage collector uses multiple threads to scan the heap, mark instances for eviction and then sweep the marked instances. The CMS collector pauses all the application threads in the following two scenarios only:

  1. while marking the referenced objects in the tenured generation space.
  2. if there is a change in heap memory in parallel while doing the garbage collection.

In comparison with the parallel garbage collector, the CMS collector uses more CPU in order to keep application pauses short. If we can allocate more CPU for better responsiveness, the CMS collector is the preferred choice over the parallel collector.

Turn on the -XX:+UseConcMarkSweepGC JVM argument to use the CMS garbage collector (it is often paired with -XX:+UseParNewGC for the young generation).

4. G1 Garbage Collector

The G1 garbage collector is designed for large heaps. It divides the heap into regions and collects them in parallel. G1 also compacts the free heap space on the go, just after reclaiming memory, whereas the CMS collector only compacts during stop-the-world (STW) pauses. The G1 collector prioritises the regions containing the most garbage first, hence the name.

Turn on the -XX:+UseG1GC JVM argument to use the G1 garbage collector.

We switched from the Java 8 default GC (Parallel GC) to G1 GC, which is designed to handle large heap sizes effectively. This resulted in an overall constant heap size, so we could reduce the memory resources while keeping the same stability in our services, and the results were amazing.
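
For reference, enabling G1 on Java 8 comes down to a handful of JVM flags. The sketch below shows the kind of options involved; the heap sizes, pause-time target and log path are placeholders, not our production values:

```
JAVA_OPTS="-Xms2g -Xmx2g \
  -XX:+UseG1GC \
  -XX:MaxGCPauseMillis=200 \
  -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
  -Xloggc:/var/log/app/gc.log"
```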

Overall memory utilisation stabilised

This helped us reduce the cost of our infra by over 50% and now we’re much more stable than before.

Snapshot of the Reduced Infra Cost

Key Learnings

  • We can still face CPU throttling if the node does not have the spare capacity to honour the limits we specified, since pods are scheduled based on requests and not limits
  • We should not bank too much on the limits for steady state, but keep our requests sufficiently high to cater to the steady-state requirements of the application
  • While using G1 GC we should also try String Deduplication; most web apps use Strings heavily, so the advantage should be pretty evident. String Deduplication is a Java feature that helps save the memory occupied by duplicate String objects. To use it we need to add -XX:+UseStringDeduplication to our JAVA_OPTS, as shown in the sketch below
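
For instance, the relevant flags might be combined like this (String deduplication requires G1 on Java 8; the line below is an illustrative sketch, not our exact production configuration):

```
JAVA_OPTS="$JAVA_OPTS -XX:+UseG1GC -XX:+UseStringDeduplication"
```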
