How we moved to Java 15, or the story of a JVM bug that lived for six years

Nikolai Gribanov · hh.ru

Aug 3, 2022 · 9 min read

We had been preparing for the release of Java 15 because of some of its new features, in particular text blocks. Yes, they appeared in Java 14 (you can read about the new features of Java 14 here), but only as a preview feature. Text blocks became a complete feature in Java 15.
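For example, with text blocks a multi-line SQL query no longer needs concatenation and escape sequences. A minimal sketch (the query and the class name are made up purely for illustration):

public class TextBlockExample {
    public static void main(String[] args) {
        // Before Java 15: concatenation and escape sequences
        String oldStyle = "SELECT id, name\n"
                + "FROM vacancy\n"
                + "WHERE area_id = ?\n";

        // Java 15: a text block keeps the formatting as written
        String textBlock = """
                SELECT id, name
                FROM vacancy
                WHERE area_id = ?
                """;

        System.out.println(oldStyle.equals(textBlock)); // prints true: same content
    }
}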

At hh.ru, we are used to introducing and using the latest technologies in software development. Trying something new is one of the key tasks of the Architecture team. While many people write in Java 8, we are already close to sending Java 11 to the dustbin of history.


As you know, Java now releases new versions much more frequently, which means more work keeping them up to date. On the one hand, we had to adapt to the new reality and change our habits, which is not always comfortable. On the other hand, we can target features that will appear in an upcoming version of the language in advance, instead of waiting three years for a release.

Migration from Java 14 to Java 15. Something went wrong

Having waited for the new Java to come out, we began the transition. Without a second thought, we chose one of the busy services that was already running on Java 14. In theory there shouldn't have been any difficulties with the migration, and in practice that is how it turned out: updating from Java 14 to Java 15 is nothing like updating from Java 8 to Java 11.

The service is updated and rolled out to production, the work is done. What's next? Next comes monitoring. We use okmeter to collect metrics, and we used it to watch the behaviour of the updated service. There were no anomalies compared to the previous version of Java, except for one thing: native memory. In particular, the Code Cache area had almost doubled!

Native memory usage: Java 14 until the end of November 17, Java 15 after

For a service with a large number of instances, every megabyte of memory counts. In addition to the spike, we can also see an increasing trend on the graph. It looks like we are dealing with a native memory leak in the Code Cache.

What is this Code Cache of yours all about?

The Code Cache is an area of native memory that stores the code of the Java bytecode interpreter and the optimized code produced by the JIT compilers C1 and C2. The primary user is the JIT: all compiled and recompiled code ends up in the Code Cache.

Since Java 9, the Code Cache has been divided into three separate segments, each storing a different type of compiled code (JEP 197). But in the graph above you can see only one highlighted area, even though both Java 14 and Java 15 are there. Why just one?

The thing is that we had fine-tuned the memory size when migrating services to Docker (you can read about it here) and deliberately set the Code Cache size flag (ReservedCodeCacheSize) to 72 MB in this service.

The three segments can be obtained in two ways: leave ReservedCodeCacheSize at its default value (256 MB) or enable segmentation explicitly with the SegmentedCodeCache flag. This is what these zones look like on the graph from one of our other services.
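By the way, you can see which code heap segments a running JVM actually has without any external metrics: the standard java.lang.management API exposes them as non-heap memory pools. A small sketch (not from our codebase, just an illustration):

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;

public class CodeCachePools {
    public static void main(String[] args) {
        // With a segmented Code Cache the JVM exposes three pools
        // (CodeHeap 'non-nmethods', 'profiled nmethods', 'non-profiled nmethods');
        // with a single-segment cache there is just one pool named "CodeCache".
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() == MemoryType.NON_HEAP && pool.getName().contains("Code")) {
                System.out.printf("%s: used %d KB of %d KB max%n",
                        pool.getName(),
                        pool.getUsage().getUsed() / 1024,
                        pool.getUsage().getMax() / 1024);
            }
        }
    }
}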

Finding native memory leaks in Code Cache

Where do we start investigating? The first thing that comes to mind is Native Memory Tracking, a feature of the HotSpot virtual machine that lets you track native memory changes by specific zones. In our case there was no need for it: thanks to the metrics we had already gathered, we knew the problem was in the Code Cache. So we decided to do the following: run service instances on Java 14 and Java 15 side by side. Since the service had already been running on 15 for three days, we added one instance on 14.

We decided to continue the search for the leak using the standard Java utilities, starting with jcmd. Since we know it is the Code Cache that is leaking, we query it directly. If the service is running in Docker, the command can be run like this for each instance:

docker exec <container_id> jcmd 1 Compiler.CodeHeap_Analytics

We get two very long and detailed reports on the state of the Code Cache. After painstakingly comparing them, we noticed the following interesting fact related to Code Cache cleanup:

// Java 14
Code cache sweeper statistics:
Total sweep time: 9999 ms
Total number of full sweeps: 17833
Total number of flushed methods: 10681 (thereof 1017 C2 methods)
Total size of flushed methods: 20180 kB

// Java 15
Code cache sweeper statistics:
Total sweep time: 5592 ms
Total number of full sweeps: 236
Total number of flushed methods: 11925 (thereof 1146 C2 methods)
Total size of flushed methods: 44598 kB

Note the number of full cleanup cycles, Total number of full sweeps. Recall that the service on Java 15 had been running for three days, while the one on Java 14 had been running for only 20 minutes. Yet the number of full Code Cache cleanups is strikingly different: almost 18 thousand in 20 minutes versus 236 in three days.

How Code Cache Sweeper works

Now it's time to dig into the details. A separate JVM thread, CodeCacheSweeperThread, which is woken up according to certain heuristics, is responsible for cleaning up the Code Cache. The thread is implemented as an infinite while loop, inside which it blocks until either a 24-hour timeout expires or it is woken up by a call to

CodeSweeper_lock->notify();

After waking up, the thread checks whether the timeout has expired and whether at least one of the two flags that trigger Code Cache cleanup is set to true. Only if these conditions are met will the thread call sweep() to clean up the Code Cache. Let's take a closer look at the flags:

should_sweep. This flag drives the two strategies for cleaning up the Code Cache: standard and aggressive. We will talk about the strategies below.

force_sweep. This flag is set to true when the Code Cache cleanup needs to be forced without the conditions of the standard and aggressive strategies being met. It is used in JDK test classes.
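Schematically, the sweeper thread loop can be sketched like this. The real implementation is C++ inside HotSpot; this is a simplified Java rendering of the behaviour described above, with invented names:

class SweeperSketch {
    private final Object codeSweeperLock = new Object();
    private volatile boolean shouldSweep; // set when the standard/aggressive conditions are met
    private volatile boolean forceSweep;  // set to force a sweep (used in JDK tests)

    void sweeperLoop() throws InterruptedException {
        final long timeoutMs = 24L * 60 * 60 * 1000; // the 24-hour fallback timeout
        while (true) {
            synchronized (codeSweeperLock) {
                // Block until notify() is called on the lock or the timeout expires.
                codeSweeperLock.wait(timeoutMs);
            }
            if (shouldSweep || forceSweep) {
                sweep();
            }
        }
    }

    void sweep() {
        // Phase 1: scan thread stacks for methods that are still active.
        // Phase 2: flush inactive (non-alive) methods from the Code Cache.
    }
}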

Standard cleanup

  1. During a GC cycle, methods stored in the Code Cache can change their state according to the following scenario: alive -> non-entrant -> zombie. Non-alive methods are marked as "must be removed from the Code Cache on the next run of the cleanup thread".
  2. At the end of its work, the GC passes a reference to each method that became non-alive to the report_state_change method.
  3. The total size of the methods marked as non-alive in this GC pass is then added to a dedicated counter, bytes_changed.
  4. When bytes_changed reaches the limit set in the sweep_threshold_bytes variable, the should_sweep flag is set to true and the cleanup thread is unblocked.
  5. The Code Cache cleanup algorithm starts, resetting bytes_changed at the beginning. The algorithm itself consists of two phases: scanning the stacks for active methods and removing inactive ones from the Code Cache. This completes the standard cleanup process.

Starting from Java 15, this limit can be controlled with the SweeperThreshold JVM flag, which takes its value as a percentage of the total Code Cache memory set with the ReservedCodeCacheSize flag.
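Put together, the standard-cleanup trigger can be sketched like this (simplified Java mirroring the description above, not the actual HotSpot C++; the names follow the variables just mentioned):

class StandardCleanupTrigger {
    private long bytesChanged;              // size of methods that became non-alive
    private final long sweepThresholdBytes; // the limit that wakes up the sweeper

    StandardCleanupTrigger(long reservedCodeCacheSize, double sweeperThresholdPercent) {
        // Since Java 15 the limit is derived from the SweeperThreshold flag,
        // expressed as a percentage of ReservedCodeCacheSize.
        this.sweepThresholdBytes =
                (long) (reservedCodeCacheSize * sweeperThresholdPercent / 100.0);
    }

    // Conceptually called from report_state_change at the end of a GC cycle.
    void reportStateChange(long sizeOfMethodsMarkedNonAlive) {
        bytesChanged += sizeOfMethodsMarkedNonAlive;
        if (bytesChanged >= sweepThresholdBytes) {
            bytesChanged = 0;  // reset at the start of the cleanup
            wakeUpSweeper();   // set should_sweep = true and notify the sweeper lock
        }
    }

    private void wakeUpSweeper() {
        // In HotSpot this sets should_sweep and calls CodeSweeper_lock->notify().
    }
}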

Aggressive sweeping

This type of cleanup appeared in Java 9 as one of the ways to combat Code Cache overflow. It kicks in when the free space in the Code Cache drops below a predefined percentage. This percentage can be set with the StartAggressiveSweepingAt flag; by default it is 10.

Unlike standard cleanup, where we wait for the buffer of "dead" methods to fill up, the check for starting aggressive cleanup runs every time we try to allocate memory in the Code Cache. In other words, when the JIT compiler wants to put newly optimized methods into the Code Cache, it first checks whether cleanup needs to start before the allocation. The check is quite simple: if there is less free space than specified in StartAggressiveSweepingAt, cleanup is triggered forcibly. The cleanup algorithm is the same as in the standard strategy. Only after the cleanup is done will the JIT be able to put new methods into the Code Cache.
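A similarly simplified sketch of this check (illustrative Java, not the actual HotSpot code):

class AggressiveSweepSketch {
    private final long reservedCodeCacheSize;
    private final double startAggressiveSweepingAt; // free-space threshold in percent, 10 by default
    private long usedBytes;

    AggressiveSweepSketch(long reservedCodeCacheSize, double startAggressiveSweepingAt) {
        this.reservedCodeCacheSize = reservedCodeCacheSize;
        this.startAggressiveSweepingAt = startAggressiveSweepingAt;
    }

    // Conceptually called whenever the JIT wants to store freshly compiled code.
    void allocate(long requestedBytes) {
        double freePercent = 100.0 * (reservedCodeCacheSize - usedBytes) / reservedCodeCacheSize;
        if (freePercent < startAggressiveSweepingAt) {
            sweep(); // same two-phase algorithm as in the standard strategy
        }
        usedBytes += requestedBytes; // only after the check does the new code go in
    }

    private void sweep() {
        // Scan stacks, flush inactive methods, free up space in the Code Cache.
    }
}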

What have we got?

In our case the Code Cache size was limited to 72 MB, and we didn't set the StartAggressiveSweepingAt flag, so it defaults to 10. Looking at the Code Cache cleanup statistics, it seems that it is the aggressive strategy that is at work on Java 14. The same graph at a larger scale convinced us further:

Java 14

It has a jagged shape, which indicates that cleanup is frequent: methods are most likely being unloaded from the Code Cache in circles, put back in on the next JIT compilation iteration, then removed again, and so on.

But how is this possible? Why is the aggressive cleanup strategy kicking in? By default it should run when the free space in the Code Cache is less than 10%, which in our case means only when usage reaches about 65 megabytes, yet we can see it also happening at 30–35 megabytes of used memory.

In comparison, the graph with Java 15 running looks different:

Java 15

There is no jagged shape: there is smooth growth, then a cleanup, then growth again. The clue is somewhere nearby.

A leak is not a leak

Since the Code Cache is managed by the JVM, we went looking for answers in the OpenJDK sources, comparing Java 14 and Java 15. While searching, we found an interesting bug. It said that aggressive Code Cache cleanup had not worked correctly since it was introduced in Java 9. Instead of starting aggressive cleanup at 10% free space, it was triggered at 90% free space, which is to say almost always. In other words, leaving StartAggressiveSweepingAt = 10 at its default actually behaved like StartAggressiveSweepingAt = 90. The bug was fixed on July 3, 2020, and the fix came down to a single line.
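As a rough illustration (this is not the actual HotSpot diff, just simplified Java mirroring the description above), the intended check and the effect of the broken one differ like this:

public class AggressiveSweepBugIllustration {
    public static void main(String[] args) {
        double startAggressiveSweepingAt = 10.0; // free-space threshold in percent, the default
        double freePercent = 50.0;               // e.g. half of our 72 MB Code Cache is free

        // Intended behaviour: sweep only when less than 10% of the Code Cache is free.
        boolean intended = freePercent < startAggressiveSweepingAt;          // false here

        // Effect of the bug: the sweep fired while up to 90% was still free,
        // i.e. as if StartAggressiveSweepingAt had been set to 90.
        boolean broken = freePercent < (100.0 - startAggressiveSweepingAt);  // true here

        System.out.println("intended sweep: " + intended + ", broken sweep: " + broken);
    }
}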

This fix is included in all versions of Java after Java 9. But then why isn't it in our Java 14? It turns out that our Java 14 Docker image was built on April 15, 2020, which makes it clear why the fix didn't make it in.

So there is no native memory leak in the Code Cache after all? Right: it's just that all this time the cleanup hadn't been working properly and was wasting CPU resources. After watching the service on Java 15 for a few more days, we concluded that this was indeed the case. The overall native memory graph had plateaued and stopped trending upwards:

The spike in the graph is the migration to Java 15

Summary

  1. Update your Java as often as possible. This applies not only to major versions but also to patch versions: they may contain important fixes.
  2. Wise use of metrics helps you detect potential problems and anomalies.
  3. Switch to Java 15, it's worth it. Here's a list of all the new features that came out in 15.
  4. If you are using Java 8, you don't have the aggressive Code Cache cleanup problem, simply because that functionality doesn't exist there. However, there is a risk that the Code Cache will overflow and JIT compilation will be forcibly disabled.
