Garbage Collection (G1GC) Optimisation on Apache Ignite

Segmentify Dev Team
Segmentify Tech Blog
8 min read · Mar 9, 2022

If you work as a Java back-end developer, you have probably struggled with the Java garbage collection (GC) process at some point, either in your own code or in a third-party tool you use. Understanding how the Java Virtual Machine (JVM) manages memory and how garbage collectors operate is essential knowledge. GC has a significant impact on the performance and quality of Java applications: the fewer full GCs you suffer, the better your performance in the end.

At Segmentify, an eCommerce personalisation platform, we aim to help online retailers optimise their conversion rates by enabling them to deliver a unique shopping experience to each visitor. Of course, a service like this needs a very robust system, and to build one you inevitably rely on some supporting systems. One of the supporting systems we use is Apache Ignite, one of the top-five projects of the Apache Software Foundation.

Recently, we decided to upgrade Apache Ignite to the latest version. At that point, we started to face performance issues, especially in the GC process. Let me briefly explain our struggles and how we handled them.

For starters, let me summarise the system on which we conducted our tests before going into production:

System Information (Apache Ignite Clusters)

  • Heap size: 12GB out of 64GB of memory
  • CPUs: 12 cores
  • The number of server instances: 10
  • IOWait: within normal range
  • IOPS: 550–600K
  • JDK: Oracle JDK8, 1.8.0_281
  • JVM Flags: -Xms12g -Xmx12g -XX:+AlwaysPreTouch -XX:+UseG1GC -XX:+ScavengeBeforeFullGC -XX:+DisableExplicitGC -XX:+UseStringDeduplication
  • Ignite Configurations: out-of-the-box properties adjusted to node resources
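
For reference, and assuming the nodes are started with the bundled ignite.sh script (which reads JVM options from the JVM_OPTS environment variable), the flags listed above can be passed roughly like this:

export JVM_OPTS="-Xms12g -Xmx12g -XX:+AlwaysPreTouch -XX:+UseG1GC -XX:+ScavengeBeforeFullGC -XX:+DisableExplicitGC -XX:+UseStringDeduplication"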

Some parts of our system were designed around Apache Ignite as a memory cache, and overall response time is supposed to stay under 30 ms. However, after upgrading Apache Ignite to a newer version, we faced some serious issues:

1. Response Time Issues

[Figures: response time before and after the upgrade]

It can be observed that the response times increased dramatically due to long GCs.

2. Heap Issues

As Apache Ignite suggests, we were using G1GC as the garbage collector. Although the default configuration seemed to work well with the old version, it did not hold up with the newer version. As a result, we started to see too many long GCs, and heap usage was always near its peak value.

[Figures: G1 collection phases statistics, G1 GC time, heap usage after GC, GC causes]

Once heap usage reaches its maximum, GCs cannot free enough memory, and eventually many very long full GCs are performed.
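
As a side note, the GC statistics shown in this post come from analysing GC activity on the nodes. On JDK 8, one common way to capture the underlying data is GC logging; the log path below is just a placeholder, and our exact logging setup may differ:

-Xloggc:/path/to/gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime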

Keeping your software up to date is unavoidable for many reasons. First, you need updates and upgrades for a bug-free environment. They are also necessary for an optimised system. Last but not least, you need to keep your systems up to date for your clients. In other words, upgrading was inevitable for us, so we set out to figure out how to solve the problem.

Solving the Problem with Provider

Before digging into the solution ourselves, we first reached out to the Apache Ignite community and shared our findings about the problem described above. Our first suspicion was a bug that had not been encountered before the release, or not reported until our upgrade scenario. We received some suggestions on how to optimise our use of Apache Ignite in our codebase. We immediately applied the changes, but we were still out of luck…

Solving the Problem with Updating Apache Ignite Configurations

With a new version of any software, new parameters are introduced from time to time, so being aware of them can save the day. In our case there were no new configuration parameters we had missed. However, while going over the configuration we did discover some misconfigured parameters. For example, in Apache Ignite you need to set several thread-pool sizes for your use case. The defaults suggest these pool sizes should not exceed the number of logical CPUs. Unfortunately, we were using much larger pool sizes than recommended.

Old Configuration:

<property name="systemThreadPoolSize" value="128"/>
<property name="publicThreadPoolSize" value="128"/>
<property name="queryThreadPoolSize" value="128"/>
<property name="serviceThreadPoolSize" value="128"/>
<property name="stripedPoolSize" value="128"/>
<property name="dataStreamerThreadPoolSize" value="64"/>
<property name="rebalanceThreadPoolSize" value="8"/>

New Configuration:

<property name="systemThreadPoolSize" value="12"/>
<property name="publicThreadPoolSize" value="12"/>
<property name="queryThreadPoolSize" value="12"/>
<property name="serviceThreadPoolSize" value="12"/>
<property name="stripedPoolSize" value="12"/>
<property name="dataStreamerThreadPoolSize" value="12"/>
<property name="rebalanceThreadPoolSize" value="12"/>

With the old configuration, we suspected we might be starving the GC of the resources it needed to run at the right time. After updating the configuration to the suggested values, we observed a slight improvement in GC behaviour. But again, no luck: this change only delayed the main problem.
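
For teams that configure Ignite programmatically instead of through Spring XML, a minimal sketch of the same sizing in Java could look like the following (the class name is ours for illustration; the setters are the standard IgniteConfiguration ones):

import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;

public class IgnitePoolSizing {
    public static void main(String[] args) {
        // Size every pool to the number of logical CPUs (12 on our nodes).
        int cores = Runtime.getRuntime().availableProcessors();

        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setSystemThreadPoolSize(cores);
        cfg.setPublicThreadPoolSize(cores);
        cfg.setQueryThreadPoolSize(cores);
        cfg.setServiceThreadPoolSize(cores);
        cfg.setStripedPoolSize(cores);
        cfg.setDataStreamerThreadPoolSize(cores);
        cfg.setRebalanceThreadPoolSize(cores);

        // Start the node with the adjusted pool sizes.
        Ignition.start(cfg);
    }
}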

Solving the Problem with GC Optimisations (G1GC Optimisations)

We had not made any GC optimisations with the old version, since the system was working properly. Before moving to the tuning phase, we needed to pinpoint the problem. Our main problem with the Apache Ignite nodes was very high heap usage at all times. Even with every GC run, the JVM could not free enough memory, and eventually full GCs took place. We do not like full GC operations: they are "stop the world" operations, which cause poor throughput across your systems.

[Figures: heap usage before GC and after GC]

Not only can the system not reclaim enough memory, the reclaim period is also far too long.

[Figures: reclaimed bytes, GC duration time]

We started to realise that the problem was either that the JVM could not initiate and run GC activity on time, or that there were not enough resources for both Apache Ignite to do its work and the JVM to keep itself running smoothly. We quickly ruled out the insufficient-resources theory, since the system had been working properly before the Apache Ignite upgrade. We also confirmed with the Apache Ignite developers that there is no major increase in resource consumption in more recent versions.

So we started to dig into tuning the garbage collection process, following the vendor's official tuning recommendations.

The default values are a sensible starting point, but when it comes to basic G1GC tuning, the following flags caught our attention:

-XX:ParallelGCThreads=n
  • Sets the value of the STW worker threads. Sets the value of n to the number of logical processors. The value of n is the same as the number of logical processors up to a value of 8.
  • If there are more than eight logical processors, it sets the value of n to approximately 5/8 of the logical processors. This works in most cases except for larger SPARC systems where the value of n can be approximately 5/16 of the logical processors.

With the default -XX:ParallelGCThreads value on each Apache Ignite node, we realised there were far too many context switches on the CPUs, because a default-configured Apache Ignite system already uses much of the CPU. So we updated the -XX:ParallelGCThreads flag to 5/8 of the logical processors.
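
On a 12-core node like ours, 5/8 of the logical processors works out to roughly 7–8 GC worker threads, so the flag would look something like this (the value 8 is illustrative rounding; test which value works best on your hardware):

-XX:ParallelGCThreads=8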

-XX:ConcGCThreads=n
  • Sets the number of parallel marking threads. Sets n to approximately 1/4 of the number of parallel garbage collection threads (ParallelGCThreads).

As the vendor suggests, we also set the -XX:ConcGCThreads flag to ¼ of the number of parallel garbage collection threads.
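
Continuing the illustration above, ¼ of 8 parallel GC threads gives 2 concurrent marking threads (again an illustrative value, derived from the assumed ParallelGCThreads setting):

-XX:ConcGCThreads=2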

-XX:InitiatingHeapOccupancyPercent=45
  • Sets the Java heap occupancy threshold that triggers a marking cycle. The default occupancy is 45 per cent of the entire Java heap.

As I mentioned before, the system was not reclaiming memory properly, which is why this flag caught our attention. The vendor defines marking as follows:

  • G1 GC uses the snapshot-at-the-beginning (SATB) algorithm, which logically takes a snapshot of the set of live objects in the heap at the start of a marking cycle. The set of live objects also includes objects allocated since the beginning of the marking cycle. The G1 GC marking algorithm uses a pre-write barrier to record and mark objects that are part of the logical snapshot.

We wanted the concurrent marking process to start sooner, so we lowered the threshold so that concurrent marking cycles begin earlier, when heap occupancy exceeds 40%. We arrived at this number after many tests: lower values caused more GC activity than expected and hurt overall throughput. So if you need to update this flag, give it many tries to find the optimal value.
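
Putting the pieces together, the GC-related flags on a node like ours would look roughly like the sketch below. The thread counts are the illustrative values derived above, while the 40% occupancy threshold is the value we settled on after testing:

-Xms12g -Xmx12g -XX:+UseG1GC -XX:+AlwaysPreTouch -XX:+ScavengeBeforeFullGC -XX:+DisableExplicitGC -XX:+UseStringDeduplication -XX:ParallelGCThreads=8 -XX:ConcGCThreads=2 -XX:InitiatingHeapOccupancyPercent=40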

This set of tuning flags solved all the GC problems on our systems, as the post-tuning metrics show:

[Figures after tuning: JVM memory size, key performance indicators, heap after GC, heap before GC, G1 collection phase statistics, G1 GC time, GC causes]

Wrap-up

I think it was not only the GC optimisation that solved the problem; applying Apache Ignite's recommended tunings also helped.

I also believe that relying on the default G1GC flag values can cause performance issues, even if you have plenty of resources. G1GC and other collectors have many flags, some of them complex; however, understanding them and choosing the right values can help more than you might imagine. Also, do not forget that there is no perfect set of values applicable to all systems: G1GC flags need to be tuned to your system's needs.

Upgrading may be harsh and tiring, but it is manageable with a little bit of luck and help. I believe some of you have already faced this problem and may have found different solutions. So keep at it…
