Warm up the relationship between Java and Kubernetes

Tony Demol
BlaBlaCar · Jan 12, 2024

The BlaBlaCar backend infrastructure is fully running on Kubernetes on the Google Cloud Platform. During the last few years, this backend was migrated from a PHP monolith to a service-oriented architecture, split into hundreds of services, mainly built in Java with SpringBoot.

Java is a dynamically compiled language, in the sense that the byte code produced by the initial compilation is an intermediate state, and the final compilation from byte code to native code is performed dynamically by the Java Virtual Machine (JVM) after the application is started. Consequently, there’s a delay before a Java application reaches peak performance.

Java’s ‘write once, run anywhere’ idea paved the way for modern containers like those in Kubernetes. We could think of the JVM and application servers like Tomcat as the old ‘orchestrators’, and of jars as the forerunners of today’s ‘pods’.

Now, these JVMs and servers live inside containers wrapped into pods. But here’s the catch: pods are ephemeral, meaning that we need to start cold JVMs & applications continuously, without direct correlation with the development & deployment cycles. The formerly long-running JVM & application server platforms are now short-lived.

This change brings a new challenge: making sure the JVM and application are ready and performant as soon as a pod starts receiving traffic. It’s a shift from the old way to this new, faster world of containers.

When your backend applications handle thousands of requests per second, introducing “cold” JVM & application in an existing deployment could lead to sudden spikes of high latency, high CPU consumption, and multiple technical issues.

This article will detail how BlaBlaCar faced cold JVM issues and implemented a warmup system leveraging the Kubernetes native features. It will also explore some other possible existing or emerging alternatives (because warmup is a “hot” topic!).

Workloads are ephemeral

In a Kubernetes cluster, the workloads are wrapped into pods, and the cluster dynamically manages multiple instances (or replicas) of these pods (through replicasets & deployments).

Multiple factors lead to creating or stopping pods at any time:

Elasticity
The number of pods in a given deployment can shrink or grow (thanks to the Horizontal Pod Autoscaler, or HPA) depending on the workload usage (mainly based on CPU consumption at BlaBlaCar).
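As a rough sketch, a CPU-based HPA like the one described here could look like the following (names and bounds are illustrative assumptions; the 70% CPU target is the one discussed later in this article):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-java-service            # illustrative name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-java-service
  minReplicas: 3                   # illustrative bounds
  maxReplicas: 30
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70     # % of the CPU request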

Costs optimization
To optimize the costs of our infrastructure, we use “spot” nodes of the Google GKE platform. These nodes are cheap but can be preempted at any time.
When a node is preempted, all pods running on the node are shut down, and new pods are started on other nodes to replace them (according to the elasticity need). You can find more information about our usage of spot nodes in the Be Lean, Go Far: leveraging Kubernetes for an elastic right-sized platform article.
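For illustration, running a workload on GKE spot capacity is essentially a scheduling concern in the pod spec; a minimal sketch, assuming the standard GKE spot node label:

spec:
  nodeSelector:
    cloud.google.com/gke-spot: "true"   # schedule onto preemptible "spot" nodes
  terminationGracePeriodSeconds: 25     # illustrative: preemptions leave little time to shut down gracefully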

Continuous deployment
At BlaBlaCar, engineers deploy code and configuration changes continuously in production. A given workload could be updated a dozen times in a single day.
When a new version of a workload is deployed, Kubernetes performs a progressive rollout by starting new pods with the new version and stopping previous pods.

Failures
Even if we try to avoid them, random technical failures of any kind are part of any IT infrastructure (network issues, unexpected upstream service failure impacting our workload, external provider latency, …). Kubernetes embraces the concept of fault-tolerance by supporting failure by design. A pod can become unhealthy for any reason, and then be automatically restarted or replaced anytime.

Consequently, workloads are ephemeral by design. This is a challenge when workloads are built with a JVM-based language like Java. New cold JVMs are started continuously and must be able to handle a high number of requests per second immediately.

From 0 to … too many

One of our Java backend applications dealing with ~7000 requests per second suffered from random latency spikes even when the number of requests was almost stable, severely impacting our users’ experience.

Workload P90 latency from the API consumers’ point of view (usual latency is below 100ms)

Even if a latency of 3 or 4 seconds is pretty bad for our end users, this could be considered acceptable for a short period. But with ~7000 requests per second, thousands of processes are queuing in the system, and after a few seconds, we start encountering technical issues and timeouts.

In a distributed system, a user request goes through dozens of services, all participating in the user request latency. If (at least) one suffers from high latency, the whole process chain becomes slow, potentially impacting other processes not directly related to the slow service. This is unacceptable from a user experience point of view and is pretty dangerous for our backend systems.

After looking for potential bottlenecks in upstream services, I/O latency, JVM garbage collection, heap resizing, and slow database queries, we focused on the application’s CPU consumption. It was pretty high and unstable, crossing the 70%-of-CPU-request threshold that our Horizontal Pod Autoscalers (HPA) target for this service:

Average % of CPU request consumed by the application pods

That is pretty unstable, but this doesn’t explain why we encounter sudden high latency. One explanation would be that some pods are terribly slow. When analyzing the potential CPU throttling metrics, we confirmed our guesses:

CPU throttling per pod (number of throttled CPU periods)

A pod is throttled when it tries to consume more CPU than the configured CPU limit. This application has a 6 CPU limit! We quickly identified that the throttled pods were cold pods freshly started (here because of a node preemption leading to stopping and creating new pods, but the issue could also be encountered during deployments or scale-up).
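In Kubernetes terms, this throttling is governed by the container’s resources section. A minimal sketch of such a configuration (the 6 CPU limit is the one mentioned above; the other values are illustrative assumptions):

resources:
  requests:
    cpu: "4"          # illustrative; the HPA targets 70% of this request
    memory: 6Gi       # illustrative
  limits:
    cpu: "6"          # the 6 CPU limit: exceeding it triggers CPU throttling
    memory: 6Gi       # illustrative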

While looking at this CPU throttling graph, we were also surprised by how long the throttling lasted: almost 15 minutes for 2 of them! Does the JVM need 15 minutes to warm up?

Zoom on JVM dynamic compilation

One of the strengths of the Java ecosystem is the JVM, which allows dynamic compilation of the byte code to native code based on code usage, also performing on-the-fly optimizations by rearranging the code to make it faster (Just in Time compilation, or JIT). In practice, these optimizations are done by several tiers of compilations, but we will consider the whole process as atomic for simplicity.

This dynamic optimization mainly works by analyzing the code execution and identifying the “hotspots” (reminds you of a JVM name?). For a backend application exposing a REST API, this means that we need to process multiple requests before the application becomes optimized and fast.

But this also means that the requests used by the JIT to find the hotspots will be slow, since the code is not yet optimized (and even slower because the JIT consumes CPU to optimize the code at the same time). In other words, we need to process slow requests to make the JVM fast, but we don’t want slow requests for our users & systems 🙂.

This optimization process is one of the reasons explaining why we’re observing CPU throttling for almost 15 minutes:

  • Cold JVMs are suddenly receiving hundreds of requests per second with not-yet-optimized code
  • This leads to high CPU consumption and request queuing (and then more and more CPU consumption)
  • This high CPU consumption slows down the JVM JIT compilers from optimizing and converting the byte code to native code

In the end, there’s not enough CPU to handle a lot of requests with non-optimized code and perform the code optimization at the same time.
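As a side note, this JIT activity can be observed directly. A hedged sketch, exposing a standard JVM flag through the container spec (the wiring is illustrative):

env:
- name: JAVA_TOOL_OPTIONS            # standard variable picked up by the JVM at startup
  value: "-XX:+PrintCompilation"     # logs each method compiled by the JIT, making the warmup phase visible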

Note that in addition to the JVM just-in-time compilation, other technical factors contribute to the low performance of a cold pod: framework & resource initialization at runtime, empty caches, thread pool initialization, TCP connection creation, etc.

The infinite loop of problems: HPA flapping

Another side effect: because the average CPU consumption became very high, Kubernetes scaled up the application, adding new pods to compensate. But the new pods also suffered from high CPU consumption, leading to even more scale-up!

The phenomenon is called “HPA Flapping”.

Number of pods part of the deployment of the application. Red metrics are unhealthy pods

This flapping is a consequence of the long-lasting CPU throttling. The Kubernetes HPA ignores the freshly created pods’ CPU consumption for 5 minutes (by default) when computing the CPU consumption average of all the deployment’s pods.

But if pods are throttled for almost 15 minutes, the HPA will scale up. And you end up with more cold pods… Then we enter a vicious cycle of latency, high CPU consumption, and HPA scaling 🙂.

Cold pods consume CPU for an extended period, leading to new HPA scale-up, leading to more cold pods

Leveraging Kubernetes startup probe to warmup

Requirements

To mitigate these cold pod issues, we set up a warm-up system (other alternatives exist as listed below). Its purpose is to shift the period of low application performance & high CPU consumption to before the moment when real production traffic reaches the pod. There are multiple possible ways to do so. Our 4 main requirements were:

The live traffic should not reach a pod that is not considered warm
The warmup must be performed once the Java application is up & running, but before Kubernetes considers the pod as ready. This should also mitigate HPA flapping because non-ready pod CPU consumption is not included in the average CPU consumption metric.

A workload should be considered warm if it complies with configured availability & latency minimum requirements
To guarantee optimal application performance, we must have requirements on request results (availability — in the sense of expected results — & latency). The warmup system must then be smart enough to analyze the results and continue until requirements are met.

The warmup processes should consider the application as a black box
To be efficient, the warmup system must behave like real production processes, without mocks or hacks. Ideally, it should use the public APIs of the Java application.

The system should not be wrapped inside the application code
To limit any coupling between the application code & the warmup system, we consider it an infrastructure concern, agnostic to the application implementation, language & framework. It also avoids running the warmup tool on the very JVM being warmed up, which could lead to the problem of warming up the warmup tool 🙂.

Implementation

The implemented solution is a homemade configurable tool written in Go, wrapped into a sidecar container that is part of each workload pod, and configured to act as a proxy for the startup probe of the Java container.

Once a pod is created, the overall process is split into 4 phases:

Phase 1: Startup probe in error, warmup in progress

Kubernetes polls the startup probe of each container to know whether it has completed its startup:

  • The startup probe of the Java application container is configured to be routed to the warmup tool on a specific HTTP endpoint
  • The warmup tool answers the startup probe with an HTTP error. The whole pod is considered unhealthy while this probe is unsuccessful
  • In the meantime, the warmup tool starts running predefined requests against the Java application APIs. It will continue until defined availability & latency requirements are met.

Phase 2: warmup requirements are met, startup probe is forwarded to the Java application

Once requirements are met, the warmup process is stopped, and the warmup tool stops returning an HTTP error on the proxied startup probe endpoint, forwarding the startup probe calls to the Java application’s own startup probe instead.

Phase 3: Kubernetes starts polling the liveness & readiness probes of the Java application

Kubernetes receives a successful response from the startup probe and then starts polling the liveness & readiness probes of the Java application.

Phase 4: pod is considered up and running, live traffic is opened

Once the pod is considered running by Kubernetes, it starts receiving production requests. The Java application is already partially optimized and peak performance is reached faster.
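To make the probe wiring above concrete, here is a minimal sketch of what such a pod spec could look like. Container names, ports and paths are illustrative assumptions (not the exact BlaBlaCar setup); the key point is that the Java container’s startup probe targets the sidecar’s port, which works because containers in a pod share the same network namespace:

containers:
- name: java-app                         # the Java / SpringBoot application
  ports:
  - containerPort: 8080
  startupProbe:
    httpGet:
      path: /warmup/startup              # illustrative path served by the warmup sidecar
      port: 8081                         # sidecar port: the probe is answered by the warmup tool
    periodSeconds: 10
    failureThreshold: 60                 # leave enough time for the warmup to complete
  readinessProbe:
    httpGet:
      path: /actuator/health/readiness   # illustrative SpringBoot Actuator endpoint
      port: 8080
  livenessProbe:
    httpGet:
      path: /actuator/health/liveness    # illustrative SpringBoot Actuator endpoint
      port: 8080
- name: warmup                           # the Go warmup sidecar
  ports:
  - containerPort: 8081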

The warmup tool configuration

The warmup tool is configured for each workload with 2 elements: the technical execution configuration with requirements and the list of requests (named “samples”).

Execution configuration & requirements
A set of configurations describing how the warmup requests must be run and the expected average latency for all requests. The requirements must be configured carefully to reach acceptable performance within a pragmatic delay.
The tool also implements a backoff system that pauses the warmup process for a configured time when multiple consecutive requests don’t meet the requirements (with an increasing factor on consecutive failures), to leave room for the JVM optimizations. If the warmup is too aggressive, we would end up with the same issue as described earlier: a JVM unable to both process the requests and optimize the code within an acceptable delay.

Here is a simplified example of a warmup configuration:

startup:
  delay_ms: 10000               # Delay before starting warmup after boot
  threads: 2                    # Number of concurrent threads to use for warmup
  goals:
    success_rate: 1.0
    max_response_time_ms: 200   # Expected maximum latency for all warmed up endpoints
    min_samples: 1500           # Minimum number of requests on all warmed up endpoints that must match the requirements

backend:
  host: "127.0.0.1"             # Java application container host
  port: 8080                    # Java application container port

backoff:
  on: ["error", "timeout", "failure"]
  min_ms: 200
  max_ms: 3000
  factor: 1.5

[...]

Requests samples with expected responses
A list of requests is configured in another configuration file with the expected HTTP response code(s). These requests will be executed until requirements are met.
Selecting suitable requests is not an easy task. We need to focus on high-QPS endpoints to be efficient, use dynamic payload data when relevant to limit caching effects, and be careful with requests performing writes.

Here is a simplified example of warmup request samples (the warmup tool provides some basic templating features enabling dynamic request content):

samples:
  - method: GET
    path: /user/v1/{{ uuid }}?filter=blocked
    headers:
      X-Locale: 'fr_FR'
      X-Correlation-Id: '{{ uuid }}'
    acceptable_status_codes: [ '200' ]
  - method: POST
    path: /validation/v1
    headers:
      X-Locale: 'fr_FR'
      X-Correlation-Id: '{{ uuid }}'
      Content-Type: 'application/json'
    body: |-
      [{
        "type": "email",
        "value": "zero-empty-seat@email.com"
      }]
    acceptable_status_codes: [ '200' ]

[...]

Another approach would have been to perform some kind of parroting of real traffic, routing it both to pods already opened to live traffic and to the new cold pod. This approach reduces maintenance and is closer to real traffic (since it is real traffic). But it also means we would need to filter out requests with side effects (mainly writes), which could be complex and would lead to a less relevant warmup process.

Results

After creating efficient samples and fine-tuning the requirements, the results were immediate. When new pods are opened to live traffic, we no longer suffer from high latency storms and the CPU consumption remains low, mitigating the HPA flapping effects.

The following graphs show a regular scale-up where a new pod is added to the deployment:

P90, P95 & P99 latency of the workload while a new pod is entering into the deployment
Number of pods for a given deployment — Regular HPA scale-up at ~8:45, without flapping
Slight CPU throttling of the cold pod during warmup, without impact on live traffic

Trade-Offs

There’s no silver bullet in software engineering, and this warmup system is no exception to the rule:

Rollout time
The warmup is triggered within the Kubernetes rollout process and slows it down. The warmup setup must be fine-tuned to keep the additional rollout time within acceptable boundaries, but it’s not free. In our example, the rollout is 3 times slower with the warmup.

Maintenance
Adding a warmup system to a workload means it must be maintained & updated as the Java application changes. Not all applications deal with many requests per second with low latency requirements. Selecting the most critical workloads is key, but again, it’s not free.

Focused on critical endpoints
As a consequence of both rollout time & maintenance constraints, we need to be pragmatic and only warm up the critical endpoints of a given workload. This means that less critical endpoints could remain slow once opened to traffic. It is acceptable if the number of requests per second is not too high.

Additional optimizations

In addition to the warmup system, we also adjusted multiple parameters to reduce both the frequency of pod terminations and the impact of a rollout:

  • We adjusted the deployment affinity & anti-affinity parameters to spread the pods on different nodes and reduce the impact of a node preemption (see the sketch after this list)
  • We optimized the startup, liveness & readiness probes’ initial delays, polling, and timeouts to leave room for the warmup to do its job while optimizing time after the warmup is completed
  • We reduced the rollout max-surge parameter to limit the number of new pods entering into the deployment at the same time
  • We audited the code to fix incorrect usage of some APIs (example: Java SecureRandom usage optimizations)
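As an illustration of some of these adjustments, here is a minimal sketch of the corresponding deployment settings (all names and values are illustrative assumptions, not our exact production configuration):

spec:
  strategy:
    rollingUpdate:
      maxSurge: 10%                  # limit the number of new (cold) pods created at the same time
      maxUnavailable: 0
  template:
    spec:
      affinity:
        podAntiAffinity:             # prefer spreading replicas across nodes to reduce preemption impact
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              topologyKey: kubernetes.io/hostname
              labelSelector:
                matchLabels:
                  app: my-java-service             # illustrative label
      containers:
      - name: java-app
        readinessProbe:
          httpGet:
            path: /actuator/health/readiness       # illustrative endpoint
            port: 8080
          periodSeconds: 5           # poll frequently once the warmup has completed
          timeoutSeconds: 2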

It is worth noting that warmup & infrastructure adjustments are only one part of the complex subject of Java application performance optimization. You also need to look at JVM memory & heap size configuration, thread pool tuning, I/O timeout management, concurrency bottlenecks, etc.
You should be pragmatic and iterate until you reach acceptable performance without over-engineering.

Future alternatives

The JVM warmup topic has been under the spotlight for a few years now, and multiple solutions have emerged (and are still emerging). We went with a classic warmup solution for the moment, but we also keep an eye on future alternatives:

  • Spring AOT compilation, moving the Spring application context creation & configuration at build time (this was created for native compilation but can be used to optimize an application startup running on a regular JVM)
  • Native compilation, moving the byte code to native code compilation at build time instead of runtime. This solution is interesting but also brings some drawbacks and limitations that we’re not ready to accept for now
  • Java CRaC project (with recent first-class support in Spring), which enables warming up an application, taking a snapshot of the JVM memory into a file, and providing that snapshot to a cold JVM when the same application is started again
  • Java Leyden project, whose purpose is to move part of the JDK code optimization to build time, without the constraints of native compilation
  • Dynamic CPU limit configuration, allowing a higher CPU limit during startup that is reduced once the application is optimized (see the sketch below). This doesn’t fully solve the slow process issues, but can participate in a global setup enabling fast warmup coupled with cost optimization
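As a hedged sketch of that last idea: recent Kubernetes versions are introducing in-place pod resource resizing, which could allow lowering the CPU limit after the warmup without restarting the pod (the feature is still maturing, and the names and values below are illustrative assumptions):

containers:
- name: java-app
  resizePolicy:
  - resourceName: cpu
    restartPolicy: NotRequired   # allow the CPU limit to be changed without restarting the container
  resources:
    requests:
      cpu: "4"                   # illustrative
    limits:
      cpu: "8"                   # generous limit at startup, to be lowered once the JVM is warm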

Conclusion

After suffering from performance issues on freshly started workloads handling a high number of requests per second, leading to high latency and high CPU consumption, we created and set up a warmup system matching our requirements and leveraging a native Kubernetes feature (the startup probe).

Following the setup & additional infrastructure adjustments, we successfully mitigated the issues. This system is currently configured on a dozen BlaBlaCar backend applications, and new workloads could be configured if necessary.

You should keep in mind that when dealing with Java application performance issues, warmup is not a magic solution fixing everything. Multiple factors must be taken into consideration beforehand, from infrastructure configuration and DB query optimization to caching, JVM parameters, memory right-sizing & code modifications. Once done, if your application still suffers from similar symptoms when freshly started, then the warmup solution is a good candidate.

We know that our current system is temporary. Exciting alternatives are emerging in the IT world and the topic will move fast in the coming months and years!

I would like to thank Jose Martin-Sanchez, Edouard Durieux, Nicolas Le Goff, Adrien Hellec, Tim Gallois, Victor Rubin, Guillaume Wuip for the review, and Denis Wernert for both the review and the creation of this amazing warmup tool!
