Everybody leaks

Detecting native memory leaks in Java

Milan Mimica
Apr 15, 2017

Mandatory intro

What is a memory leak? According to Wikipedia, it is a type of resource leak that occurs when memory which is no longer needed is not released. The “no longer needed” part is ambiguous and left open to interpretation. If you ask a C programmer, they will tell you a block of memory has leaked when there is no reference to it from any of the program roots (thread stack pointers and the like). By this definition you cannot (easily) leak memory in Java: it is a garbage-collected language, and the garbage collector will reclaim any unreachable memory blocks. Still, Java folks constantly talk about memory leaks, just on another level. In Java, you leak memory by keeping references to objects you no longer use. For example, you allocate an array, put it in a map, use that map for something else as well, and forget about the array. No leak detector will reliably catch such leaks. Technically, it's not even a leak.
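
A minimal sketch of that kind of “leak”, with made-up class and field names, looks like this:

    import java.util.HashMap;
    import java.util.Map;

    // Illustrative only: a cache that keeps growing because nothing ever removes
    // the entries that are no longer needed. The GC cannot help, because the map
    // still references every array.
    class RequestCache {
        private final Map<String, byte[]> payloads = new HashMap<>();

        void remember(String requestId, byte[] payload) {
            payloads.put(requestId, payload); // entries go in...
        }

        // ...but no code path ever calls payloads.remove(requestId), so every
        // payload stays reachable (and effectively leaks) for the life of the service.
    }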

When memory is gone

When I had a problem with leaking memory on one of our Messaging Core services, it did not fall into any of the above categories. The service would exhaust all of its 32GB of memory over two weeks and eventually get killed by the Linux OOM killer. The Java process was given a 25G heap.

Side note: I like to use fixed heap sizes (the same value for the -Xmx and -Xms flags), because I see no point in making the JVM manage the heap size for a service running on a virtual machine that has been sized for peak memory usage. The JVM simply must have enough memory whenever it needs it, so let it reserve it up front. It also keeps things simpler.

Because the heap size was fixed, I knew without any doubt that I had a problem with native (off-heap) memory management. The HotSpot JVM uses native memory for various things: metaspace, housekeeping metadata, thread stacks, GC metadata… all sorts of internal things needed for the JVM to work. Or, from the JVM user's point of view: wasted memory. Without much thinking, one would say that a 25:7 heap-to-waste ratio leaves plenty of room, but apparently it didn't.
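
For reference, a fixed-size heap amounts to nothing more than passing the same value twice (the jar name here is a placeholder):

    java -Xms25g -Xmx25g ... -jar service.jar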

I couldn't nail down the root cause of the native memory leak just by searching the Internet, but I could quickly rule out the four causes of native memory leaks people seem to complain about the most:

  1. Not closing java.util.zip.DeflaterInputStream/OutputStream – These streams internally use java.util.zip.Deflater, which goes through JNI and allocates native memory in the process. If you don't close them you have a temporary leak; Deflater itself overrides java.lang.Object.finalize(), where the memory is eventually reclaimed (see the sketch after this list). My leak was permanent, and not even a full GC made any difference.
  2. sun.misc.Unsafe.allocateMemory et al. – The HotSpot JVM exposes this functionality to manage native memory. It's easy to misuse and create C-style memory leaks. I scanned the classpath, and the only use of this feature was by netty 4. After thoroughly examining netty's buffer management code, I had to admit I couldn't blame netty. There is also a static AtomicLong, io.netty.util.internal.PlatformDependent.DIRECT_MEMORY_COUNTER, which keeps track of the amount of memory allocated this way.
  3. Direct java.nio.ByteBuffer – This is actually not much more than a wrapper around Unsafe.allocateMemory. It's used by netty 3, which I also happen to have on the classpath. It's easy to check how much memory it accounts for via the java.nio:type=BufferPool,name=direct JMX MBean (also shown in the sketch after this list), and that too showed no leak. There is also a buffer-caching feature which is known to produce memory leaks, but setting jdk.nio.maxCachedBufferSize didn't change anything.
  4. The class loader: if you keep creating new class definitions on the fly you are bound to leak memory. Just don’t. I didn’t.
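
To make the first and third checks concrete, here is a small sketch: releasing a Deflater-backed stream deterministically with try-with-resources, and reading the direct buffer pool usage through the platform MXBeans. The class and method names I chose are illustrative:

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.lang.management.BufferPoolMXBean;
    import java.lang.management.ManagementFactory;
    import java.util.zip.DeflaterOutputStream;

    class NativeMemoryChecks {

        // Item 1: always close Deflater-backed streams. try-with-resources releases
        // the native buffers deterministically instead of waiting for finalization.
        static byte[] compress(byte[] input) throws IOException {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (DeflaterOutputStream out = new DeflaterOutputStream(bytes)) {
                out.write(input);
            }
            return bytes.toByteArray();
        }

        // Item 3: how much memory direct ByteBuffers hold right now, i.e. the same
        // data exposed by the java.nio:type=BufferPool,name=direct MBean.
        static void printDirectBufferUsage() {
            for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
                if ("direct".equals(pool.getName())) {
                    System.out.printf("direct buffers: count=%d, used=%d bytes%n",
                            pool.getCount(), pool.getMemoryUsed());
                }
            }
        }
    }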

Native memory profiling

So it looked like I was not facing a well-known problem, and if I wanted to find the cause of this memory leak I had to do it myself. HotSpot does offer a Native Memory Tracking facility. When enabled, it keeps track of native memory allocations, divided into several categories, and lets you record a baseline for later comparison. Here is the profile after the service had been running for 5 days, diffed against a 10-hour-old baseline:
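
For context, this is roughly how such a profile is produced; the PID is a placeholder:

    # at startup
    java -XX:NativeMemoryTracking=summary ...
    # record a baseline, and later diff against it
    jcmd <pid> VM.native_memory baseline
    jcmd <pid> VM.native_memory summary.diff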

It allocated almost 600MB of memory in the last 10 hours alone! And that was after 5 days of high load, which one would expect to be enough time for the system to stabilize. Unfortunately, it wasn't: the “Internal” category just kept growing until the process was killed. For reference, here are all the native memory categories (the ones with a zero delta against the baseline were omitted in the previous output):

Unfortunately, mtInternal doesn't narrow things down at all; it's basically “uncategorized”. At least I could now be sure that the memory wasn't taken by the Java heap, or thread stacks, or the class loader, or the JIT, or the GC… or could I? What else is there?

This guy successfully used jemalloc to find the cause of his Java native memory leak, so I gave it a try. (Lesson learned: don't just clone the master branch from GitHub and assume it's production-ready. It is not.) The jemalloc library has a profiling facility that tracks the allocation sites of live (not yet free()-d) memory blocks at a given moment. It periodically dumps profiles, between which you can create delta profiles. It literally draws you a picture of where recent memory consumption comes from. It does so by sampling calls to malloc(3) et al., and it even demangles C++ symbols. What a nice time to live in!
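
For the record, the setup looks roughly like this; the library path, the sampling parameters and the dump file names are illustrative, so check your jemalloc build's documentation:

    # preload jemalloc into the JVM with allocation profiling enabled
    export LD_PRELOAD=/usr/local/lib/libjemalloc.so
    export MALLOC_CONF=prof:true,lg_prof_interval:30,lg_prof_sample:17
    java ... &
    # later, turn two of the periodic heap dumps into a delta report
    jeprof --show_bytes --pdf --base=jeprof.<old>.heap $(which java) jeprof.<new>.heap > delta.pdf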

This is what I got for the last 10 hours on a service with 5 days of uptime:

jemalloc profiling output

In that period, 755MB worth of memory blocks were allocated (via malloc(3)) and not released. About 711MB came from the G1RemSet and HeapRegion classes. The values are sampled, but the ratios are representative. The leak is coming from G1 GC code! It just keeps allocating more memory. This is G1's remembered sets: per-region bookkeeping of the heap areas (cards) in other regions that hold references into it, which keeps changing as objects move between generations. Sorry, but there is no simpler way to put it. Here is a glossary. What I found strange is that this allocation is classified as mtInternal by the VM, and not as mtGC. There is a ticket filed for this now.

Meanwhile, in the Internal G1 department

The most effective way to reduce the overhead of tracking references between heap regions is to reduce the number of heap regions, and you do that by increasing the heap region size. By default, G1 tries to divide the heap into 2048 regions of equal size, with the constraint that the region size is a power of 2 between 1M and 32M. For my 25G heap it calculated 3150 regions of 8M. By setting -XX:G1HeapRegionSize=16M the region count halved, reducing the housekeeping overhead and thus stabilizing native memory usage. It has other, potentially negative side effects, like increased heap fragmentation and a higher humongous-object threshold, but you should know that already. The good news is that the native memory starvation problem was gone: the mtInternal category didn't go above 2G. Still, the 25:7 heap-to-native-memory ratio kept bugging me.
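
As a back-of-the-envelope check of that sizing rule, here is a hedged sketch of the heuristic described above; the real HotSpot ergonomics are more involved, so treat it as an approximation:

    class G1RegionSizeSketch {
        private static final long M = 1024 * 1024;

        // Simplified version of the default rule: aim for ~2048 regions, round the
        // region size down to a power of two, clamp it to the [1M, 32M] range.
        static long defaultRegionSize(long heapBytes) {
            long target = heapBytes / 2048;
            long powerOfTwo = Long.highestOneBit(Math.max(target, 1));
            return Math.min(Math.max(powerOfTwo, 1 * M), 32 * M);
        }

        public static void main(String[] args) {
            long heap = 25L * 1024 * M;                       // a 25G heap, as in the article
            System.out.println(defaultRegionSize(heap) / M);  // prints 8, i.e. 8M regions
            // Doubling it via -XX:G1HeapRegionSize=16M halves the number of regions.
        }
    }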

It turns out the trick is in the OtherRegionsTable::add_reference method. When storing remembered sets, G1 first tries a sparse representation (very fine grained, memory efficient, linear lookup); when it overflows, it falls back to a coarser-grained representation based on bitmaps, which uses far more memory. This branching is also visible in the allocation graph as the BitMap::resize and HeapObj::new flows. We want the HeapObj::new path to be taken more often, and the parameter that controls it is -XX:G1RSetSparseRegionEntries. Here is a possibly better explanation, but it's hard to explain code. [Edit: and another one]
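
This is not HotSpot's code (the real OtherRegionsTable is C++ and far more subtle), but the shape of the trade-off described above looks roughly like this sketch, with made-up class and constant names:

    import java.util.Arrays;
    import java.util.BitSet;
    import java.util.HashMap;
    import java.util.Map;

    // Sketch of the idea only: remember which cards of another region reference
    // this one, cheaply while there are few of them, expensively once there are many.
    class RememberedSetSketch {
        private static final int MAX_SPARSE_ENTRIES = 20;   // cf. G1RSetSparseRegionEntries

        private final Map<Integer, int[]> sparse = new HashMap<>();  // region -> a few card indices
        private final Map<Integer, BitSet> coarse = new HashMap<>(); // region -> bitmap over all cards

        void addReference(int fromRegion, int card, int cardsPerRegion) {
            BitSet bitmap = coarse.get(fromRegion);
            if (bitmap != null) {          // already fell back to the expensive representation
                bitmap.set(card);
                return;
            }
            int[] entries = sparse.computeIfAbsent(fromRegion, r -> new int[0]);
            for (int c : entries) {
                if (c == card) return;     // already recorded
            }
            if (entries.length < MAX_SPARSE_ENTRIES) {
                int[] grown = Arrays.copyOf(entries, entries.length + 1);
                grown[entries.length] = card;
                sparse.put(fromRegion, grown);
            } else {                       // overflow: switch to a bitmap costing ~cardsPerRegion bits
                BitSet bits = new BitSet(cardsPerRegion);
                for (int c : entries) bits.set(c);
                bits.set(card);
                coarse.put(fromRegion, bits);
                sparse.remove(fromRegion);
            }
        }
    }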

In my case, the default value of G1RSetSparseRegionEntries was 20. I increased it to 64 and the mtInternal category halved, with no measurable performance impact. Finally.
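
For completeness, the tuning from this article boils down to adding two flags to the existing command line. The values are the ones used here, so tune them for your own heap; -XX:+UseG1GC is spelled out only in case G1 is not your JDK's default:

    java ... -XX:+UseG1GC \
             -XX:G1HeapRegionSize=16M \
             -XX:G1RSetSparseRegionEntries=64 \
             ...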

Real men use default settings

… as this guy put it.

This was a long journey, and I still don't know what it is in my service that makes the JVM behave this way, or whether this qualifies as a leak at all. Maybe it just needed more off-heap memory and would have stabilized eventually. Surely there must be some easier, less time-consuming, business-friendly, pragmatic thinker's solution. And I agree. Let's imagine a patient explaining the symptoms to a pragmatic doctor:

Patient: “After my service has been running two weeks…”
Doctor: “Wait, you run the same version for two weeks? You’re not deploying fast enough. Fix your agile process.”
Patient: “I set a heap size of 25G and …”
Doctor: “Wait, that’s way too much! Here, have a prescription for micro-services.”

Or as one of my colleagues puts it: if you have to set anything more than -Xmx and -Xms, you must be doing something wrong.

Update June 3rd 2019

The default value of G1RSetSparseRegionEntries is going to change in JDK 13, to scale exponentially with heap size: JDK-8223162.
