JVM garbage collector murder mystery

The garbage collector is like the plumbing in your house. You know it’s there, you rely upon it every day, you (sort of) understand how it works, but when water is pouring out of a light fitting, you really don’t look forward to punching a hole in the ceiling to try and work out what went wrong

This post is about a mysterious bug I encountered while developing a compute heavy asynchronous task service. The tasks processed by this service generate toolpaths for our large scale 6-axis 3D printers — they are demanding both in terms of memory and CPU. The basic architecture is simple — tasks are created, passed to a FIFO queue and a worker pool processes them. To keep things as simple as possible, all IO (e.g. database queries and fetching geometric data from S3) is completed during the creation of the task itself, before it is en-queued. As such, the actual processing of the task is pure, depending only on its inputs. This makes it much easier for us to pass tasks off to physically separate machines if demand requires it. Furthermore, the output is immutable. Once a toolpath is generated it is never changed. This is just some background however, so there is some context around the issue I want to discuss.

We first noticed the problem when the server was attempting to generate toolpaths for particularly large (in terms of data) geometries. The strange behaviour we observed was that during a particular stage of the toolpath generation, the server would become totally unresponsive. Login attempts would fail, CRUD operations would hang, and in general the server would not respond to any requests. The server would remain stuck like this for up to several minutes at a time.

My first guess was that since we were running on a t2.small instance (having one core), the heavy computation running on the toolpath task thread was just starving the request thread pool of CPU time. I’d just bump it up to a t2.medium (2 cores), restrict the task worker count to one, and be done with it — or at least have proven my theory. Thinking about it more though, I realised this didn’t actually make any sense. The JVM’s thread scheduler would surely find at least some breathing space for other threads, especially considering the sort of undemanding requests we were trying to make. A theory like this might explain higher latencies and lower throughput — but complete unresponsiveness for minutes? Not likely.

My next guess was ‘hmm, must be a GC problem’. I knew the memory demands were large, and in certain steps of the algorithm there was a lot of memory churn, so the GC was triggering quite regularly. Perhaps it was just triggering a ‘stop the world’ GC event which was taking a particularly long time to complete.

Armed with a renewed sense of overconfidence, I told the CEO I’d have the problem solved in 20 minutes, rolled up my sleeves, and got to work. My guess that it was a GC related problem was indeed correct, but as we will see, the mechanism of the failure took a lot longer to diagnose than expected and ended up teaching me a lot about various dark corners of the JVM, some of which I hadn’t previously known existed.

The first thing I did was reproduce the bug on my 8 core, 16gb RAM development machine. This was straightforward and exhibited identical behaviour to the server. There goes the theory about thread starvation. I now wanted to rule out the usual suspects in terms of GC. Despite not having received any OutOfMemoryErrors, I figured that maybe the required heap was very close to the max heap size and the GC was constantly running to reclaim a tiny amount of available memory. I cranked up the heap size to 10gb with -Xmx10000m

Unfortunately this did not work. The behaviour was identical.

I now decided to try changing the GC algorithm. Even if this didn’t truly offer a solution (I would feel very uncomfortable choosing a GC algorithm on this basis alone) I hoped that if one of them remedied the problem then it would give me a lot of clues about the root cause of the issue.

The JVM offers the following GC algorithms

  • Serial GC
  • Parallel GC
  • Parallel Old GC (Parallel Compacting GC)
  • Concurrent Mark & Sweep GC (or “CMS”)
  • Garbage First (G1) GC

It took only a few minutes to try all of these. None worked. The bug stubbornly persisted.

I figured at this stage I’d have to get my hands dirty and start monitoring the GC directly. I booted up VisualVM, triggered the integration test I had been using to reproduce the bug, and patiently waited until it reached the point at which it had been stalling.

Red arrow shows the 73s interval where the JVM becomes totally unresponsive and the data ‘gaps’

I had been hoping to see the GC activity go crazy during this period but instead all I got was silence. VisualVM stopped receiving metrics from the JVM and did not resume until 73s later, after the unresponsiveness ceased. How could this be happening? I was starting to run out of ideas.

I knew roughly at which stage of the algorithm the JVM was stalling, but I wanted to work out the exact lines of code to see if that might offer any clues. After taking a closer look at the logs I found the problem area

The GC seemed to be triggering upon entering this method but not actually completing until this method was returned from. Other than the fact that this is particularly CPU intensive, especially for large matrices, there was nothing I could immediately see in the code that ought to cause problems. All I knew was that the 73s the JVM was hanging for overlapped with the execution of this method.

I decided to do some broader research and see if anyone else had experienced similar problems. I searched things like ‘jvm unresponsive during garbage collection’ and ‘jvm garbage collection taking seconds’. Most of the articles were about generic GC tuning — improving the performance of a JVM, but not necessarily dealing with such an egregious fault. Eventually, after 30 minutes or so of Googling, I made a breakthrough. I came across an article on something I hadn’t heard of before called safepoints. A safepoint is a mechanism used by the HotSpot VM to allow threads to safely park themselves when a ‘stop the world’ GC event needs to run. Any time a thread encounters a safepoint, it checks a flag to see if it needs to stop. When the GC wants to run a ‘stop the world’ event, it flips the flag and waits for all running threads to park themselves at the next safepoint they encounter. Usually this happens very quickly. Usually.

I started to wonder if maybe this had something to do with the bizarre stalling I was experiencing. I learnt that you can tell the JVM to report on how long it takes to get all threads into a safepoint by adding the XX:+PrintGCApplicationStoppedTime flag.

I ran the test again and waited to see what would be reported. After trawling through the logs I found the proverbial smoking gun.

Total time for which application threads were stopped: 73.0001055 seconds, Stopping threads took: 72.9945621 seconds

Here we can see that the JVM tries to start a full GC event but has to wait an entire 73s just for all the threads to park themselves at safepoints. Just to reiterate — this is not 73s spent actually cleaning up garbage, this is 73s spent waiting for running threads to stop just so that the actual garbage collection can proceed. This is really unusual. A typical event of this nature looks more like this

Total time for which application threads were stopped: 0.0001009 seconds, Stopping threads took: 0.0000395 seconds

There are six orders of magnitude difference between the thread stopping times in these two cases.

I had found the cause of the problem but now I had to fix it. I figured that the fundamental issue here was that the JIT compiler was choosing not to place a safepoint inside this chunk of code. The GC therefore has to wait until any threads executing this chunk of code have left it and reached a safepoint somewhere else. I wondered if there might be a way to manually place a safepoint. Since this is the responsibility of HotSpot, and as such is merely a JVM implementation detail, it would be a leaky abstraction to have a ‘manual safepoint’ feature. However, perhaps there might be conditions under which HotSpot would always try and place a safepoint. It is possible to hint to the JVM that you want to offer up CPU time to competing threads with a call to Thread.yield(). This doesn’t guarantee a thread context switch, but the JVM may decide to take you up on the offer. I figured that the JIT compiler would surely place an opportunistic safepoint here, since the conditions for thread switching must closely mirror those needed to run a STW GC.

I placed a single call to Thread.yield() at the top of the outermost for-loop and ran the test again. Lo and behold, the 73s safepoint delay totally disappeared. I removed it and replaced it a few more times just to really convince myself I hadn’t fluked it. As expected, the section of code where the problem had manifested still took ~73s to execute, but crucially, garbage collection could now start and finish promptly at any point during this period.

Well, that’s the end of my ‘entire day to write one line of code’ story — hope you enjoyed it and I hope this might help somebody else with the same problem someday. We haven’t had any issues since then, so I’m reasonably confident the problem is fixed for good. I’m still a bit uncomfortable about the presence of a Thread.yield() in the middle of a mathematical algorithm, so if anyone has a better solution please get in touch, I’d love to hear it.

Update 12/07/18:

Some users on Reddit provided very insightful feedback on the story.

u/sluukkonen clarified that the C2 compiler only would insert a safepoint here at the end of the method (at the return point) and inside the for-loop, but only if the for-loop was uncounted (uncounted loops being loops that can’t be proven to terminate in a finite time (e.g. for (;;)). Since this is a counted loop he concludes that there would be no safepoints placed inside, thus providing a mechanistic explanation. This issue is apparently fixed in JDK 10

u/mike_hearn provided a suggestion to use the JVM flag -XX:+UseCountedLoopSafepoints as a better solution to this issue. I remember encountering this solution at the time and trying it, but it didn’t resolve the issue. I am not discounting the possibility that I screwed up and in fact it does solve the problem, so I will give it another try when I next get a chance. If u/sluukkonen’s explanation is correct it would seem that this JVM option should solve the issue — assuming it does what it says on the tin