A few months ago, our production operations team created an enigmatic ticket for our team: “spikes in 504s”. The story ahead depicts our long winding road with this issue and how we religiously followed the seven steps of engineering. Shock, Denial, and Whack-a-mole Repair took the better part of this timespan, as different people were on-call, suspected random pull requests got rolled back, and different unrelated framework pieces were upgraded. The simple reason is that debugging distributed systems is just difficult; it might seem simple at first, but between our clients and our servers and between our servers and downstream servers lie load balancers, NGINX reverse proxies, Varnish caches, and a few more devices.
What is a 504? Sure, it’s a Gateway Timeout. But where? A 504 is a response to an HTTP request to a proxy that tried forwarding it to a downstream service but the latter failed to respond in a timely fashion — some pre-configured timeout elapsed. Each 504 represents a client waiting for a page and waiting and waiting. Instead of a 100[ms] response, it would take, say, 5.1[s] when a 5[s] retry mechanism is in place; if there isn’t a built-in retry, a failed page is likely shown, requiring user intervention to retry; either case leads to a decrease in customer satisfaction, lower conversion rate, and tarnished brand. Luckily for Groupon, after enough back and forth, some higher-up imposed step 4: Escalation.
This document mostly details our next step, step 5: Investigation; we rolled up our sleeves and started digging in. What made this issue sneaky is that it only happens once in a while for a short duration. Because sometimes the 504s show up only upstream — in our clients — we have to switch from looking for spikes of 504s to looking at downward drops of 200s. A 200 responses graph shows the problem:
Not only the downwards spikes are short-lived, but they happen at different times on different servers, with different severity. Let’s look more closely at a single down spike:
What we see is that for 2 to 4 minutes a server stops replying positively. What might be causing this? It will be fruitless to try to reproduce it locally, setting breakpoints, inspecting logging, or anything like that.
Instead, we dug some more into available graphs and discovered that
jvm.non_heap_memory.g1_old_gen.used.max was piling up till these crucial points in time:
It’s not easy to see still in this week-long resolution, but the small drops in 200s (blue) drop close to 0. Zooming in on one, shows the correlation perfectly:
The only phenomenon that reduces the total Java heap size is a major GC (Garbage Collector) event. A major GC event has many phases, some of which are “stop-the-world”, meaning that your code hangs while the JVM cleans up. A major GC event happens when the tenured generation becomes full.
The graphs show a clear picture though: something was hogging memory, but that memory was reclaimable. Contrary to what several engineers declared, this is NOT a memory leak. A memory leak is a memory used by objects that are not reclaimable. This will result in memory usage increasing, in spite of major GC events happening, until the server draws its last breath and an OutOfMemoryError (OOM) is thrown.
While a mere mortal might give up and exclaim “that’s Java”, we say “bring it on”. Our server deals with incoming HTTP/S requests and makes outgoing HTTP/S requests; these generate a lot of temporary garbage as new objects get created but stopped being referenced soon after. But this is just temporary garbage. Surely the mighty JVM is smarter than just accruing all this garbage, then faltering once a day in a major GC event, and that premise is correct — we use G1GC for garbage collector with Java8. G1GC differentiates between new garbage and old garbage and reclaims new garbage fast and with smaller hiccups while it cleans up — in events called minor GC. The idea is simple, tracing what is not garbage and what Is garbage — or more technically what is reachable from a root and what is not — is time-consuming and thus doing it on a small part of the memory is much faster than on the whole; also, since most Java applications, especially servers, create a lot of temporal garbage, it is worthwhile to only consider the latest memory allocations as possible garbage to be reclaimed. For this, G1GC designates a section called Eden Space where all new heap memory allocations happen. When Eden reaches 45% of the entire JVM heap, the GC cleans Eden entirely by “promoting” non-garbage (things that are still referenced) out to a separate section called Survivor Space. If some object got promoted from Eden to Survivor and survived for the number of epochs defined by flag
-XX:MaxTenuringThresholwith the default value of 15, it will be promoted to Tenured Space (aka Old Gen), where it will no longer be considered for minor GC events. But our problem was, again, with a major GC event. So, with all this nifty G1GC explanation, how does temporal memory end up in Old Gen, eventually causing a major GC event? Now we knew what we were looking for.
Since we were only interested in seeing the items in the tenured space, we took a heap-dump before Major GC and one after major GC. The difference showed all the objects which are either in Survivor space or Tenured space of heap but are not being strongly referenced and can be reclaimed.
With the help of VisualVM, we started analyzing the items ending up in the tenured space of the heap. We started randomly sampling our objects for the root object i.e. an object in the current java frame referencing them. In most cases it was
SSLSessionImpl. So now the question was why did
SSLSessionImpl objects were ending up in Tenured space of heap.
So we started reading about
SSLSessionImpl class. Here’s what we found.
SSLSessionImplrepresents an instance of an SSL session and gets created for every new session
- And at the end of SSL connection, the server puts
SSLSessionImplobject in a cache.
- This is done so that if a client uses the same SSL session-id, an SSL session can be resumed.
- This cache has a size defined by
javax.net.ssl.sessionCacheSizeenvironment variable and has an expiry period defined as 24 hours by default
This made sense, so basically, for every SSL connection, we were creating an object which would not get dereferenced for 24 hours, so such objects would become tenured after 15 rounds of Young GC and would keep on accumulating and when the JVM reaches the threshold of total usage, Garbage Collector would clean up everything in tenured space which is dereferenced. But this would take long since the old gen is big and it takes a long time to clean it.
The solution for this problem was rather simple as we found that our service does not even use the feature of resuming the session, so we were holding on to objects which would never get reused. We decided to disable the SSL session cache.
After the change was rolled out on production hosts we have seen the drop in 504s for all the hosts for our service.
Although it was a simple fix, in the end, we learned quite a few things about Java and Garbage Collection along the way.