Cut Container Startup Time for Better Performance and Costs — Part 2

Federico Iezzi
Google Cloud - Community
14 min read · Feb 22, 2024

The first part of this series focused predominantly on the theory behind boosting a Pod's startup time, diving deep into the nitty-gritty details of the process. In this second and final chapter it's time to put those concepts into practice. We're pivoting to a pragmatic approach backed by empirical evidence: an array of benchmarks that substantiate and validate the theory, showing that these ideas aren't merely theoretical jargon but translate into measurable gains.

· A very much Java focused analysis
· Introducing the two sample applications
  ∘ Elasticsearch
  ∘ Custom Pub/Sub and Cloud Storage Client Java Application
· A very much hands-on approach
  ∘ Introducing the GKE Cluster Setup
  ∘ 1 — The Machine Series Does Matter
  ∘ 2 — Need for NVMe
  ∘ 3 — GKE Image Streaming
  ∘ 4 — COS vs. Ubuntu
  ∘ 5 — Turbo Boost for Pods
  ∘ 6 — Code Tuning and Java Native Images
· Key Takeaways

A very much Java focused analysis

Before introducing the experiments, let's dive a bit into the offended applications 😫 sorry, I meant the applications that will be used to profile startup speed :-P

Opting for Java seemed like an obvious choice to me for several reasons:

  • While Java isn’t typically seen as embodying the qualities of modern, lightweight, and swift microservice applications — often being labeled as resource-heavy — it presents a unique challenge.
  • As someone who previously worked at Red Hat, I observed first-hand the slow, years-long process of getting OpenJDK and then JBoss to run efficiently on Docker. It wasn't until 2022 that Java 17 became fully container-aware (although the cgroup enhancements were backported as far back as Java 8), and there's still ongoing effort around container-aware heap sizing. A quick way to see container awareness in action is sketched right below.
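To make "fully container-aware" concrete, here is a minimal, hedged illustration (the base image is just an example): inside a container with a 512MiB memory limit, a recent JDK derives its default maximum heap from the cgroup limit rather than from the host's RAM.

# Hedged example; eclipse-temurin:17-jre is an illustrative base image.
docker run --rm --memory=512m eclipse-temurin:17-jre \
  java -XX:+PrintFlagsFinal -version | grep -E 'UseContainerSupport|MaxHeapSize'
# Expect UseContainerSupport=true and a MaxHeapSize derived from the 512MiB limit.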

So, why still lean towards Java? It’s precisely because of these challenges. If Java’s applications can be significantly optimized, it sets a promising precedent for accelerating other platforms as well.

Java enthusiasts, see you in the comments below. Everybody else, carry on reading.

Introducing the two sample applications

When assessing what type of applications could best capture real startup-speed improvements, I came up with the following two main areas:

  • InfraMod Benefits: this has to be something standard (git clone, terraform apply ✅) that people can readily identify with. Ideally a distributed application with an established community.
  • AppMod Benefits: This area zeroes in on the coding itself and the assembly of the Docker image, focusing purely on the technical improvements.

Elasticsearch

From beans of Java, my structure takes flight,
A web of indices, in data’s soft light.
My thirst for resources, a known appetite.

Yes, that’s Google’s Gemini.

ECK, or Elastic Cloud on Kubernetes, simplifies the setup process. With just a few CRDs, the Operator, and a declarative YAML file, boom, Elasticsearch springs to life. For deploying the CRDs and the Operator, simply follow the official quick start:
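As a hedged sketch, it boils down to two kubectl commands; the ECK version below is illustrative, so use whatever the quick start currently references:

kubectl create -f https://download.elastic.co/downloads/eck/2.12.1/crds.yaml
kubectl apply -f https://download.elastic.co/downloads/eck/2.12.1/operator.yaml
# Optionally, watch the operator come up:
kubectl -n elastic-system logs -f statefulset.apps/elastic-operator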

For the Elasticsearch resource, the following YAML is used:

apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: eck-medium
spec:
  version: 8.12.1
  nodeSets:
  - name: default
    count: 3
    config:
      node.store.allow_mmap: false
      node.roles: ["master", "data", "ingest", "ml"]
    podTemplate:
      spec:
        containers:
        - name: elasticsearch
          imagePullPolicy: Always
          resources:
            limits:
              memory: 16Gi
              cpu: 4
          env:
          - name: ES_JAVA_OPTS
            value: "-Xms14g -Xmx14g"
        volumes:
        - name: elasticsearch-data
          emptyDir: {}
        affinity:
          podAntiAffinity:
            preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    elasticsearch.k8s.elastic.co/cluster-name: eck-medium
                topologyKey: kubernetes.io/hostname

Here are a few key points to keep in mind:

  • We’re dealing with a three-node cluster with a homogenous role configuration.
  • In such a cluster, the single node’s resources (quantity and quality) are as critical as network speed, minimal VPC latency (or lack of — no pun intended), and reliable storage performance.
  • To minimize variables, we’re not using PV/PVC nor a Load Balancer/Ingress to expose the service.

ECK is deployed on a fresh cluster where nothing else is running, ensuring that resources aren’t shared with anything beyond the GKE defaults. Setting resource limits and podAntiAffinity guarantees each Pod is allocated to a separate Worker.

Custom Pub/Sub and Cloud Storage Client Java Application

Some time ago, Aaron Wanjala developed a streamlined Java application specifically designed to showcase the significant enhancements provided by Java Native Image. This turned out to be ideally suited for our scenario. The application is crafted to accurately measure and report two critical metrics:

  • The startup time of the application.
  • The duration required to publish a message to Pub/Sub.

This precise focus makes it a perfect case study for understanding the operational efficiency of software loading times and establishing meaningful connections when playing with the various GKE parameters.

I made several adaptations to the versions of the dependencies used.

Last but not least, I made Docker images of both the JAR and the Native Image versions. For the Native Image variant, the latest Alpine version 3.19.1 was selected as the base image.
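For context, here is a hedged sketch of how the two images could be assembled; the Dockerfile names, tags, and Maven profile are illustrative rather than the exact ones used here:

# JAR flavour: build the fat JAR, then bake it into a JRE base image.
./mvnw clean package
docker build -t cut-container-startup-time:jar -f Dockerfile.jar .

# Native Image flavour: AOT-compile with GraalVM's native-maven-plugin, then copy
# the resulting binary onto Alpine 3.19.1 (a musl-compatible or static build is
# typically required for Alpine).
./mvnw -Pnative native:compile
docker build -t cut-container-startup-time:native -f Dockerfile.native .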

Our custom Java app's Docker image is deployed as a Kubernetes Job with 100 completions and 18 Pods running in parallel. Each container is allocated 500 millicores and 512MiB of memory. To comprehensively assess the underlying infrastructure's impact, the majority of tests utilize the original JAR. Finally, a comparison will be made to see how traditional Java's JIT fares against the new AOT compilation, with a specific focus on infrastructure considerations.

apiVersion: batch/v1
kind: Job
metadata:
  name: cut-container-startup-time-jar
spec:
  template:
    spec:
      containers:
      - name: jar
        image: europe-west4-docker.pkg.dev/medium-quick-pod-414109/medium/cut-container-startup-time:jar
        imagePullPolicy: Always
        resources:
          limits:
            memory: "512Mi"
            cpu: "500m"
      restartPolicy: Never
  backoffLimit: 4
  completions: 100
  parallelism: 18
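A hedged usage sketch for launching the Job and reading back the overall wall-clock time (the manifest file name is illustrative):

kubectl apply -f cut-container-startup-time-jar.yaml
kubectl wait --for=condition=complete job/cut-container-startup-time-jar --timeout=15m
# Start and completion timestamps of the whole 100-completion run:
kubectl get job cut-container-startup-time-jar \
  -o jsonpath='{.status.startTime}{"\n"}{.status.completionTime}{"\n"}'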

A very much hands-on approach

Introducing the GKE Cluster Setup

Before delving into the specifics of the various experiments, let’s outline the setup of our clusters:

  • The clusters are situated in the europe-west4-a zone.
  • Kubernetes version 1.28.5-gke.1217000 from the Regular release channel powers them.
  • The OS image is COS build cos-109-17800-66-54, which translates into Linux Kernel 6.1.58 and containerd v1.7.10.
  • In line with GKE’s default since version 1.21, our clusters are configured as VPC-native.
  • Features such as Vertical Pod Autoscaler (VPA), Horizontal Pod Autoscaler (HPA), Managed Service for Prometheus (GMP), and NodeLocal DNSCache are all activated.

The initial cluster employs a default GKE NodePool configuration comprising:

  • Three e2-standard-8 nodes, each featuring 8 vCPUs (4 physical cores, with SMT enabled) and 32GB of RAM. The selected E2 machines use Intel Skylake (1st Scalable Xeon generation) CPUs.
  • Each node has a 20GB Balanced PD boot disk.
  • Autoscaling is enabled and the NodePool can scale up to 6 nodes and down to 3.

Opting for the e2-standard-8 as our baseline stems from its status as GKE’s default option. This choice allows us to maintain a conservative approach while we explore the C3D, one of GCP’s most advanced machine families, known for its optimal price/performance balance.
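For reference, here is a hedged gcloud sketch that approximates this baseline cluster; the cluster name and project are placeholders, and the flags mirror the bullet points above rather than the exact command used:

gcloud container clusters create baseline-e2 \
  --project my-project --zone europe-west4-a \
  --release-channel regular --cluster-version 1.28.5-gke.1217000 \
  --machine-type e2-standard-8 --num-nodes 3 \
  --enable-autoscaling --min-nodes 3 --max-nodes 6 \
  --disk-type pd-balanced --disk-size 20 \
  --image-type COS_CONTAINERD --enable-ip-alias \
  --enable-vertical-pod-autoscaling --enable-managed-prometheus \
  --addons HttpLoadBalancing,HorizontalPodAutoscaling,NodeLocalDNS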

1 — The Machine Series Does Matter

In our inaugural experiment, we set out to compare the performance of the standard GKE zonal cluster against a modified version. The only deviation in the second cluster is the machine type: instead of e2-standard-8 nodes, it utilizes c3d-standard-8 nodes, powered by AMD Genoa (4th EPYC generation) CPUs. The switch also introduces Google’s Titanium IPU technology. The configuration — 8 vCPUs (4 physical cores, with SMT enabled) and 32GB of memory — remains unchanged, ensuring a consistent comparison.
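In gcloud terms, the second cluster only swaps the machine type (again a hedged sketch):

gcloud container clusters create baseline-c3d \
  --zone europe-west4-a --machine-type c3d-standard-8 \
  --num-nodes 3 --enable-autoscaling --min-nodes 3 --max-nodes 6
# ...all remaining flags identical to the E2 baseline above.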

First, let’s direct our focus towards Elasticsearch:

To begin with, it's fascinating to note the initialization times for the Pods: 26 seconds on the C3D machines versus 35 seconds on the E2 Skylake ones. The Initialized signal encompasses the Pods' networking and storage setup, container image retrieval, and the completion of Elasticsearch's two init containers. It's unfortunate not to have millisecond-level precision here, but we'll live without it. Post-initialization, the C3D configuration signals ContainersReady in an additional 16 seconds, totaling 42 seconds for the cluster to transition from Scheduled to fully Ready. On the other hand, the E2 configuration takes an extra 32 seconds post-initialization to achieve a fully Ready state.

In summary, C3D is 59% quicker. However, does it also cost 59% more? Interestingly, no; the price difference is roughly 30% ($215.42 vs. $278.38). There's more to consider, such as SMT performance improvements in recent CPU μArchs, the increased IPC Zen 4 offers, the superior core load management compared to Skylake, and advancements in ISAs — all of which merit their own detailed discussion.

Second, let’s look at the Java application startup and first execution time:

The graph clearly illustrates the efficiency differences between environments:

  • Startup times for C3D containers are impressively swift, roughly half those of E2: around 330ms for C3D versus 650ms for E2, and even the latter is quite remarkable, especially considering these are Java applications.
  • For C3D, the initial Pub/Sub request is recorded around 3000ms, concluding the cycle near 6000ms. In comparison, E2’s first request lags close to 5000ms, with the cycle completing just under 12 seconds.
  • The total duration for the job on E2 reached 101 seconds, whereas C3D accomplished it in 57 seconds.

Despite C3D’s higher cost relative to E2, its superior performance justifies the investment, particularly for short-lived or dynamically scaling workloads, ultimately enhancing the user experience.

2 — Need for NVMe

Moving forward, we will retain the C3D cluster but pit our two applications against a counterpart employing local NVMe storage. This adjustment involves a switch in Machine Type from c3d-standard-8 to c3d-standard-8-lssd, alongside appending --ephemeral-storage-local-ssd count=1 as an argument to the GCloud CLI. Our spotlight now turns to Elasticsearch:

The integration of Local NVMe in our latest test trims an additional 10 seconds off the total time. Now, Elasticsearch transitions from non-existent to fully operational and initialized in just under 32 seconds. Reflecting on a decade ago, I used SSD-backed Elasticsearch to manage the logs of enormous OpenStack clusters consisting of thousands of nodes. At that time, Elasticsearch demanded significant resources, and even with what was then considered modern hardware, initializing empty clusters was a lengthy process.

As for our bespoke Java application, the difference when incorporating Local NVMe compared to without is subtle yet present, evidenced by a 14% improvement in startup time. This enhancement is by no means insignificant.

In summary, opting for the c3d-standard-8-lssd configuration with Local NVMe incurs an additional cost of approximately 33 USD per node (or +11%). However, this investment greatly benefits I/O-intensive activities, such as running Elasticsearch or Java applications with extensive dependency loads, thanks to the near-unlimited IOPS and significantly reduced NVMe latency.
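For completeness, a hedged sketch of the Local NVMe variant, using the flag mentioned above:

gcloud container clusters create baseline-c3d-lssd \
  --zone europe-west4-a --machine-type c3d-standard-8-lssd \
  --ephemeral-storage-local-ssd count=1 \
  --num-nodes 3 --enable-autoscaling --min-nodes 3 --max-nodes 6
# Node ephemeral storage (emptyDir volumes, writable container layers) now lands on local NVMe.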

3 — GKE Image Streaming

In our forthcoming experiment, we'll return to using the c3d-standard-8 configuration, sans the local NVMe. This time, we'll activate GKE Image Streaming on one cluster to observe its impact on the Java application's performance.
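Enabling it is a single flag at creation time; a hedged sketch follows, noting that Image Streaming requires images hosted in Artifact Registry, which our europe-west4 repository already satisfies:

gcloud container clusters create baseline-c3d-streaming \
  --zone europe-west4-a --machine-type c3d-standard-8 \
  --num-nodes 3 --enable-autoscaling --min-nodes 3 --max-nodes 6 \
  --enable-image-streaming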

The application exhibits a slight yet consistent increase in speed, improving by 8%. This level of optimization, especially when it’s offered at no extra cost, is quite cool.

Elasticsearch was not part of this report, as its readiness time, whether with or without Image Streaming, consistently hit our benchmark — and the Answer to the Ultimate Question of Life, the Universe, and Everything: 42 seconds.

4 — COS vs. Ubuntu

In this experiment, we opted for the default GKE Ubuntu image, specifically the latest LTS version 22.04.3 featuring Kernel 5.15.0-1048-gke and containerd v1.7.0, to compare it against the COS image that has been our go-to until now. The setup is straightforward: both clusters are initiated with a single node and auto-scaling activated. The ECK operator is then installed, together with a dummy Pod designed to utilize almost all resources of the solitary node. Following this, the creation of the ECK cluster prompts the roll-out of three additional Worker nodes (a gcloud sketch of the image-type switch follows the list below). We recorded timings at three critical junctures:

  • The moment Kube’s API calls are made.
  • When the nodes reach a stable Ready state.
  • And finally, when the ECK itself is deemed Ready.
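As promised above, the only gcloud-level difference between the two clusters is the node image type; a hedged sketch, with autoscaling bounds as placeholders:

# The COS cluster uses the default image type (COS_CONTAINERD); the Ubuntu one:
gcloud container clusters create baseline-ubuntu \
  --zone europe-west4-a --num-nodes 1 \
  --enable-autoscaling --min-nodes 1 --max-nodes 6 \
  --image-type UBUNTU_CONTAINERD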

GKE Worker nodes running on COS are fully operational within 33 seconds, whereas those on Ubuntu take over twice as long, clocking in at 82 seconds. Beyond this point, while the timelines are similar, COS consistently outpaces Ubuntu, which fails to overcome the initial 49-second delay in node readiness. The ECK cluster becomes completely Ready in just 84 seconds on COS, compared to 144 seconds on Ubuntu. In summary, unless specific needs dictate otherwise, Container-Optimized OS is the preferable choice.

5 — Turbo Boost for Pods

Our final experiment wraps up the exploration of tuning possibilities from an infrastructure perspective. Specifically, we look at Kube Startup CPU Boost, which hinges on the In-place Resource Resize feature for Kubernetes Pods — an Alpha API. For this purpose, both clusters were set up with the --enable-kubernetes-alpha feature activated; however, only one underwent the Turbo Boost rollout. Within the NodePool, our trusted C3D stood ready.
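A hedged sketch of the setup: an alpha cluster (no release channel, node auto-upgrade and auto-repair off) plus the Kube Startup CPU Boost operator; check the project's releases for the exact manifest URL and version.

gcloud container clusters create boost-c3d \
  --zone europe-west4-a --machine-type c3d-standard-8 --num-nodes 3 \
  --enable-kubernetes-alpha --no-enable-autoupgrade --no-enable-autorepair
# Install the operator on one of the two clusters (URL/version may differ):
kubectl apply -f https://github.com/google/kube-startup-cpu-boost/releases/download/v0.4.0/manifests.yaml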

Deploying ECK with abundant resources, as we've done so far, is the easy case and leaves little for a startup boost to improve. In this trial, we therefore capped the Pod resources at one core and 1GiB of memory:

resources:
  limits:
    cpu: "1"
    memory: 1Gi
  requests:
    cpu: 500m
    memory: 512Mi

The StartupCPUBoost resource was set up to double the resources allocated to the Elasticsearch container, maintaining this enhancement until it received the Ready signal.

apiVersion: autoscaling.x-k8s.io/v1alpha1
kind: StartupCPUBoost
metadata:
  name: boost-eck
selector:
  matchExpressions:
  - key: elasticsearch.k8s.elastic.co/cluster-name
    operator: In
    values: ["eck-medium"]
spec:
  resourcePolicy:
    containerPolicies:
    - containerName: elasticsearch
      percentageIncrease:
        value: 100
  durationPolicy:
    podCondition:
      type: Ready
      status: "True"

Witnessing this in action felt like experiencing magic firsthand. Reflecting on the earlier, somewhat distant days of virtualization, resizing within the same host was deemed both cool and unattainable, primarily due to concerns around balancing and distributing resources.

  • In a setting constrained by resources, ECK impressively reached full initialization in 76 seconds.
  • However, the impact of doubling the resources at startup was transformative, allowing ECK to complete in merely 57 seconds — achieving a 33% improvement in speed.

Before we endorse the implementation of Kube Startup CPU Boost, it's crucial to elaborate on some potential pitfalls:

  • In-place Resource Resize is an Alpha API, which means running it requires an alpha GKE cluster; Google reserves the right to decommission such clusters after 30 days and provides neither an SLA nor updates for the Control Plane and worker nodes.
  • My experience with Kube Startup CPU Boost v0.4.0 showed that the StartupCPUBoost Admission Webhook prolongs Pod deployment times.

While these issues could eventually be smoothed out with subsequent software updates, the practice of temporarily increasing Pod resources at startup — effectively boosting them and then scaling back once the Pod is operational — carries the risk of causing system instability. Although I did not encounter any OOM issues, it remains a critical factor to monitor. It's also important to understand that application needs can vary greatly. Throughout my career, I've observed numerous instances where an application required additional resources during its initial launch phase, with both CPU and memory demands decreasing afterwards. The key takeaway here is the importance of having a thorough understanding of your application's behavior and requirements.

6 — Code Tuning and Java Native Images

The preceding five experiments primarily focused on infrastructure aspects, but our most recent investigation shifts its attention towards the benefits of optimizing the code itself. As previously discussed, Java Native Images offer a method for compiling code Ahead-Of-Time (AOT), diverging from the Just-In-Time (JIT) compilation technique traditionally utilized by Java since its inception.

The initial graph I wish to discuss compares the E2 and C3D performances for our custom Pub/Sub and Cloud Storage application using the Java Native Image approach.

When I first saw these numbers, I was taken aback and had to triple verify them.

  • On average, startup times clock in under 3 milliseconds for E2 instances and under a mind-blowing 🤯 1 millisecond for C3D instances — the performance is nothing short of astonishing, as illustrated by a Pod launching in a mere 888 microseconds in the screenshot above. I am particularly impressed by how deterministic the C3D startup times are.
  • Moreover, the initial request dispatch times are incredibly swift, occurring within 41 milliseconds for E2 and an even more rapid 22 milliseconds for C3D.
  • On the aspect of shutdowns, the Java application concludes its operations in just under 300 milliseconds on E2 and approximately 179 milliseconds on C3D, showcasing remarkable efficiency.

I had always held the view that Java meant slow startups and resource hogging; I completely didn't see this coming. Not anymore 😲.

Moving forward, I’m eager to examine how infrastructure optimization with JAR images on C3D compares against native image Java code on E2.

Despite the intense efforts of the JAR-plus-C3D combination, infrastructure optimization can only do so much here. It's important to remember, though, that this observation is based on a uniquely tailored scenario meant to highlight these differences. Nonetheless, the ability of Java Native Image to massively outperform state-of-the-art infrastructure is nothing less than marvellous.

Key Takeaways

Our exploration, from understanding the theoretical underpinnings to hands-on implementation, has illuminated the profound impact that infrastructure choices, Kubernetes configuration, and innovative technologies like Java Native Images can have on Pod startup times. These experiments demonstrate that achieving significant reductions in startup time is not only possible but can also lead to a cascade of benefits. Streamlined application deployment, reduced operational costs, and ultimately, a superior user experience are all within reach.

This series offers a blueprint for success, but we recognize that optimization is a moving target. We urge you to take these findings as a catalyst for your own optimization journey. Experiment, explore new strategies, and continually strive to unlock the full potential of your containerized environments — it’s a journey well worth taking.
