How the Garbage-First Garbage Collector (G1GC) affected the performance of our back-end

Hoan Nguyen
Nov 24, 2018


As a Java back-end developer, I believe that understanding how the JVM manages memory and how garbage collectors operate is essential knowledge. In Java 8, since G1's full GC is single-threaded, the fewer full GCs we suffer, the better.

In our Flights&Hotels booking metasearch company, making the system ever faster and providing a good user experience are also considered key features.

Recently, our back-end application ran into serious performance issues: many out-of-memory (OOM) errors and very high response times. This gave me the chance to dig into the Garbage Collector (GC), particularly G1GC, and experiment with it in production. Since we got better performance by tuning G1GC flags alone, I want to share my practical experience and hope that people with similar use cases can benefit.

The poor performance of our system

Here are the details of our servers and web app:

  • Heap size: 10GB out of 16GB of memory.
  • CPUs: 4 cores, 8 logical processors.
  • The number of server instances: 4.
  • Working-set size: 4.5GB to 5GB (around 45% of the heap). This is the memory occupied by all live data after each full GC.
  • JDK 8: We value low latency and a seamless user experience, so we use G1GC. The following G1GC flags had been set previously and were still in use:

-Xms10G -Xmx10G -XX:+UseG1GC -XX:+UseStringDeduplication -XX:G1NewSizePercent=30 -XX:InitiatingHeapOccupancyPercent=70
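
For context, here is a minimal sketch of the complete launch command with these flags, plus the standard JDK 8 GC-logging flags (logging GC activity to a file is what makes the analysis later in this post possible; the jar name and log path are placeholders, not our actual setup):

java -Xms10G -Xmx10G -XX:+UseG1GC -XX:+UseStringDeduplication \
     -XX:G1NewSizePercent=30 -XX:InitiatingHeapOccupancyPercent=70 \
     -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
     -Xloggc:/var/log/app/gc.log \
     -jar backend.jar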

Our system was designed around an in-memory cache; response times were originally under 50ms. However, these were the performance issues we saw in production:

OOM errors every day

With 4 production servers and an average of 3+ OOM errors per day, it was likely that every single server hit an out-of-memory error every single day.

High CPU and Heap usage on one of four production servers

Because of -XX:InitiatingHeapOccupancyPercent (IHOP)=70, concurrent marking cycles were triggered only once more than 70% of the entire heap was in use. That was too late (see the back-of-the-envelope numbers after the charts below): the heap got full, GC didn't get a chance to clean up the non-live data, and full GCs or OOM errors could happen before any mixed GCs ran. As an inevitable consequence, unstable servers caused high latencies:

High latencies. Fact: the average response time (per minute) of the 4 production servers was usually over 50ms.
2h17m of GC logs showing a poor throughput of 95.77% and far too many full GCs!
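
A back-of-the-envelope view of why IHOP=70 started marking too late for us, using the numbers from the setup above:

  marking starts at:   70% × 10GB = 7GB occupied
  headroom remaining:  10GB − 7GB = 3GB
  young generation:    at least 30% × 10GB = 3GB (G1NewSizePercent=30)

A single refill of the young generation could consume the entire remaining headroom before the marking cycle, and the mixed GCs that follow it, had a chance to finish.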

Using default values did not help

Since there had been no big changes in the source code or the infrastructure at that time, our first action was to remove all the G1GC flags and fall back to the default values. However, performance didn't get better; it got even worse:

High CPU and Heap usage with -XX:+UseG1GC -XX:+UseStringDeduplication
High CPU and Heap usage with only -XX:+UseG1GC

Because the working-set size was around 45% of the heap and the default IHOP is 45%, concurrent marking cycles followed by mixed GCs happened all the time, and most of the CPU was spent on GC activity. This led to low application throughput, which is bad.
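
The arithmetic shows why this configuration thrashes:

  live data:     ~4.5-5GB ≈ 45% of the 10GB heap
  default IHOP:  45%

Heap occupancy sat at or above the marking threshold essentially all the time, so a new concurrent marking cycle kicked off as soon as the previous one finished.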

The concurrent marking cycles started too late

We wanted the concurrent marking cycles, and the mixed GCs that follow them, to start neither too early nor too late. We also needed to shorten the stop-the-world pauses. We tested a few IHOP values, and below was the best result, with IHOP=55 and -XX:ParallelGCThreads=8:

-Xms10G -Xmx10G -XX:+UseG1GC -XX:+UseStringDeduplication -XX:G1NewSizePercent=30 -XX:InitiatingHeapOccupancyPercent=55 -XX:ParallelGCThreads=8
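
While testing IHOP candidates, a quick way to watch the effect on a live JVM is jstat, a standard JDK tool (not part of our original setup); for example, sampling every 5 seconds:

jstat -gcutil <pid> 5000

The E and O columns show Eden and old-generation occupancy as percentages, and the FGC column counts full GCs; if FGC keeps climbing, the IHOP value under test isn't working.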

Heap was controlled under 8GB most of the time with -XX:+UseG1GC -XX:+UseStringDeduplication -XX:G1NewSizePercent=30 -XX:InitiatingHeapOccupancyPercent=55 -XX:ParallelGCThreads=8
OOM errors dropped significantly
Better throughput of 98.06% (compared to the earlier 95.77%), and the number of full GCs also dropped significantly.

Full GCs still happened with a larger heap

Coincidentally, at this time a new feature required more non-heap memory, so we decided to upgrade our servers, hoping to prevent the long pauses of full GCs and to eliminate the OOM errors.

  • Heap size: 16GB out of 32GB of memory.
  • CPUs: 8 cores, 16 logical processors.

However, even with a larger heap and more powerful CPUs, full GCs still happened occasionally and drove the overall response time up:

7 days 23 hours of GC logs showing that full GCs happened 7 times…
The average pause of the 7 full GCs was 8 seconds; they should be avoided
Full GCs still happened even though CPU/Heap usage “looks good” with -XX:+UseG1GC -XX:+UseStringDeduplication -XX:G1NewSizePercent=30 -XX:InitiatingHeapOccupancyPercent=55 -XX:ParallelGCThreads=8
Long pauses of full GCs might cause an overall high response time (over 50ms)
Inspecting the GC logs showed when full GCs happened (the black bars)

From the chart above, before the full GC the young generation (the yellow area) suddenly grew from around 5GB to 9.5GB (keep in mind that the default maximum size of the young generation is 60% of the total heap, meaning 60% × 16GB = 9.6GB). That caused a sudden jump in heap usage to 15GB (to fill up the Eden space of the young generation). After a few minor GCs, the heap reached 16GB, and a full GC was triggered to evacuate the entire heap.
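
To make that concrete, here is the sizing arithmetic for the default cap versus the 30% cap we settled on below:

  default max young gen:  60% × 16GB = 9.6GB → heap can spike to ~15-16GB
  tuned max young gen:    30% × 16GB = 4.8GB → the same allocation burst is bounded far lower
  tuned min young gen:    25% × 16GB = 4.0GB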

Given those observations, I decided to reduce the minimum and maximum sizes of the young generation to 25% and 30% of the heap, respectively. We also wanted the concurrent marking cycles to start earlier, by falling back to the default IHOP=45:

-Xms16G -Xmx16G -XX:+UseG1GC -XX:+UseStringDeduplication -XX:G1NewSizePercent=25 -XX:G1MaxNewSizePercent=30 -XX:ParallelGCThreads=8

Heap usage was kept under 12GB most of the time with -Xms16G -Xmx16G -XX:+UseG1GC -XX:+UseStringDeduplication -XX:G1NewSizePercent=25 -XX:G1MaxNewSizePercent=30 -XX:ParallelGCThreads=8

However, once again, we occasionally still got full GCs.

From this chart, and by checking the GC logs, we saw that this full GC was triggered because the heap got full while the concurrent marking cycle was still running:

  • JVM started the concurrent marking cycle:
    2018-11-19T17:43:26.562+0800: 9530.178: [GC concurrent-mark-start]
  • This process took a few seconds to complete; meanwhile, coincidentally, the application allocated objects at a very high rate, which caused multiple young GCs:
2018-11-19T17:43:31.381+0800: 9534.997: [GC pause (G1 Evacuation Pause) (young)
2018-11-19T17:43:33.606+0800: 9537.222: [GC pause (G1 Evacuation Pause) (young)
2018-11-19T17:43:35.769+0800: 9539.384: [GC pause (G1 Evacuation Pause) (young)
2018-11-19T17:43:38.077+0800: 9541.693: [GC pause (G1 Evacuation Pause) (young)
2018-11-19T17:43:39.954+0800: 9543.569: [GC pause (G1 Evacuation Pause) (young)
2018-11-19T17:43:43.955+0800: 9547.570: [GC pause (G1 Evacuation Pause) (young)
2018-11-19T17:43:45.978+0800: 9549.593: [GC pause (G1 Evacuation Pause) (young)
...
  • As a consequence, the heap was full before marking could finish; the JVM couldn't trigger a mixed GC, and a full GC happened (the log line is decoded below):
2018-11-19T17:43:59.029+0800: 9562.644: [Full GC (Allocation Failure)  15G->4830M(16G), 10.4119543 secs]
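
For readers less used to JDK 8 GC logs, that last line decodes as:

  Full GC (Allocation Failure)   the JVM could not satisfy an allocation request
  15G->4830M                     heap occupancy before -> after (the ~4.8GB that survived matches our ~4.5-5GB live set)
  (16G)                          total heap capacity
  10.4119543 secs                the length of the stop-the-world pause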

Once more, we needed to speed up the concurrent marking cycles

What we wanted here was for the concurrent marking process to finish faster, so we decided to increase the parallel GC threads to 10 (according to Oracle, about 5/8 of the number of logical processors), set the concurrent GC threads to 3 (roughly 1/4 of the parallel GC threads), and make the concurrent marking cycles start earlier, once the heap is 40% occupied:

-Xms16G -Xmx16G -XX:+UseG1GC -XX:+UseStringDeduplication -XX:G1NewSizePercent=25 -XX:G1MaxNewSizePercent=30 -XX:ParallelGCThreads=10 -XX:ConcGCThreads=3 -XX:InitiatingHeapOccupancyPercent=40
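
The thread counts follow directly from the rules of thumb above, given our 16 logical processors:

  ParallelGCThreads: 5/8 × 16 = 10
  ConcGCThreads:     1/4 × 10 = 2.5, rounded up to 3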

Heap usage was kept under around 10GB most of the time
No full GCs were triggered for more than 4d10h, with a good throughput of 98.91%. The server also held up well against multiple consecutive young GCs.
The average pause was 203ms, which is acceptable for us

Since then, this set of tuning-flag values has been working fine in production; no more full GCs have been encountered.
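
To keep watching for regressions without digging through GC logs every time, here is a minimal monitoring sketch using the standard GarbageCollectorMXBean API; the class name and the 60-second interval are arbitrary choices of mine, not part of our production code:

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcWatcher {
    public static void main(String[] args) throws InterruptedException {
        while (true) {
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                // With G1 on JDK 8, the beans are named "G1 Young Generation"
                // and "G1 Old Generation"; a rising collection count on the
                // old-generation bean typically signals that full GCs are back.
                System.out.printf("%s: collections=%d, total time=%dms%n",
                        gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
            }
            Thread.sleep(60_000L);
        }
    }
}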

Final thoughts

In conclusion, a poorly tuned G1GC (or, in certain use cases, even the default flag values) can cause performance issues, even with a large heap. G1GC is complex, with many tuning flags; however, understanding and using the most common flags properly helped us. Also, no single set of values fits all similar systems: G1GC flags need to be tuned and monitored per system. Finally, it's good practice to always print GC logs; they help a great deal during inspection.

References:

Oracle Java 8 G1GC tuning guide: https://docs.oracle.com/javase/8/docs/technotes/guides/vm/gctuning/

Tools:

GCViewer: https://github.com/chewiebug/GCViewer

VisualVM: https://visualvm.github.io/
