Suddenly we ran out of memory….

Andreas Sandberg
PriceRunner Tech & Data
2 min read · Sep 7, 2020

In every software developer's career there are moments when you can't initially understand why an application suddenly crashes in production even though nothing major has changed.

This week we had just such a moment at PriceRunner, and we were surprised when we finally realized that we had actually stumbled upon a bug in the JDK (version 11.0.8).

It all started with a really small change in one of our services: we wanted to record certain events and send them to our event tracking service.

// Post the tracking event asynchronously and log any error reported by the event tracking service
eventClient.postTrackingEvent(trackingEvent)
    .whenComplete((response, exception) -> logEventServiceError(response, exception, trackingEvent));

The underlying eventClient used the HttpClient introduced in JDK 11 to send a simple HTTP POST request to our event tracking service.
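To give a rough idea of what such a wrapper can look like, here is a minimal sketch built on java.net.http.HttpClient. The class name, endpoint and JSON payload are assumptions for illustration, not our actual code, and serialization of the event is assumed to happen elsewhere.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.concurrent.CompletableFuture;

class EventClient {
    private final HttpClient httpClient = HttpClient.newHttpClient();
    private final URI endpoint;

    EventClient(URI endpoint) {
        this.endpoint = endpoint;
    }

    // Posts the serialized tracking event asynchronously and returns the pending response.
    CompletableFuture<HttpResponse<String>> postTrackingEvent(String trackingEventJson) {
        HttpRequest request = HttpRequest.newBuilder(endpoint)
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(trackingEventJson))
                .build();
        return httpClient.sendAsync(request, HttpResponse.BodyHandlers.ofString());
    }
}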

This change was released into production and everything looked as usual. However, 24 hours later the service crashed with an OutOfMemoryError. Looking at the heap utilization we could immediately see that the service had a memory leak.

Memory heap utilization before crash

But how could such an insignificant change cause this behavior?

By connecting to the service with VisualVM we could identify a large number of jdk.internal.net.http.common.MinimalFuture objects on the heap, which in turn pointed us in the direction of this registered bug.
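If attaching VisualVM directly is not an option, the running service can also trigger a heap dump of itself via the HotSpotDiagnosticMXBean (for example from an admin endpoint) and the dump can be analyzed offline. A minimal sketch, with a hypothetical output path:

import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.lang.management.ManagementFactory;

class HeapDumps {
    // Writes a heap dump of the current JVM (live objects only) to the given file,
    // which can then be opened in VisualVM or a similar heap analyzer.
    static void dumpHeap(String outputFile) throws IOException {
        HotSpotDiagnosticMXBean diagnostics = ManagementFactory.newPlatformMXBeanProxy(
                ManagementFactory.getPlatformMBeanServer(),
                "com.sun.management:type=HotSpotDiagnostic",
                HotSpotDiagnosticMXBean.class);
        diagnostics.dumpHeap(outputFile, true);
    }
}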

Suddenly all the pieces fell into place: the event tracking service is one of the few services at PriceRunner that responds with a (proper) 204 HTTP status code, indicating that the client should not expect a response body.

The HttpClient in the JDK version we were using (11.0.8) does not return the HttpConnection to the internal pool after a 204 response, which in turn causes a memory leak.
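The pattern can be reproduced in isolation with a standalone sketch like the one below (hypothetical, not our production code): a local server that always answers 204, hammered by a single HttpClient. On an affected JDK 11 build you would expect heap usage to grow and a heap histogram to show jdk.internal.net.http.common.MinimalFuture instances accumulating.

import com.sun.net.httpserver.HttpServer;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class NoContentLeakRepro {
    public static void main(String[] args) throws Exception {
        // Local server that always responds with 204 No Content and no body.
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/event", exchange -> {
            exchange.sendResponseHeaders(204, -1);
            exchange.close();
        });
        server.start();

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create("http://localhost:8080/event"))
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();

        // Keep posting; on an affected JDK 11 build, heap usage grows because
        // connections are never handed back to the internal pool.
        for (int i = 0; i < 1_000_000; i++) {
            client.send(request, HttpResponse.BodyHandlers.discarding());
        }
    }
}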

We decided to upgrade the service to JDK 14, and the heap utilization graph no longer shows linear growth (the red area at the right end of the graph below).

Memory heap utilization before and after upgrade

This story also confirms that small, frequent releases make it a lot easier to narrow down the root cause of issues like this. Imagine if this small change had been included in a much larger release with many other changes. Would you have suspected the tiny tracking event change? We most probably would not have.
