How to Analyze Memory Leaks in Containerized Java Applications?

Mehmet Berker Ot · Published in Trendyol Tech · 8 min read · Nov 29, 2023

In this article, based on a memory leak issue we recently encountered, I will explain how to analyze a suspected or confirmed memory leak in a containerized Java application, and I will share the solution we implemented after identifying the cause of the leak.

First, let’s briefly introduce our problematic application. Heimdallr sends us messages via Slack when there is lag in the Kafka topics used by our team, or when there is data in a topic we refer to as the error topic. It achieves this by creating a consumer for the relevant topic at runtime and calculating the lag on the consumer group’s partitions as the difference between each partition’s max offset and the group’s committed offset.
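The lag calculation described above can be sketched in plain Java. The maps below stand in for the results of `KafkaConsumer`'s `endOffsets` and `committed` lookups; this sketch deliberately avoids the Kafka client dependency, so the partition numbers and offsets are illustrative:

```java
import java.util.HashMap;
import java.util.Map;

public class LagCheck {
    // Lag per partition = latest (max) offset in the partition minus the
    // consumer group's committed offset on that partition. With a real
    // KafkaConsumer, these maps would come from endOffsets() and committed().
    static long totalLag(Map<Integer, Long> endOffsets, Map<Integer, Long> committed) {
        long lag = 0;
        for (Map.Entry<Integer, Long> e : endOffsets.entrySet()) {
            long committedOffset = committed.getOrDefault(e.getKey(), 0L);
            lag += Math.max(0, e.getValue() - committedOffset);
        }
        return lag;
    }

    public static void main(String[] args) {
        Map<Integer, Long> end = new HashMap<>();
        end.put(0, 120L);
        end.put(1, 80L);
        Map<Integer, Long> committed = new HashMap<>();
        committed.put(0, 100L);
        committed.put(1, 80L);
        System.out.println(totalLag(end, committed)); // prints 20
    }
}
```

When the total lag exceeds a threshold, Heimdallr raises a Slack alert.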

Usually, our alerts run at 01:00 PM, and one day we noticed that no alerts or informational messages were arriving at that time. That is how we realized something was wrong. When we investigated why the alerts were missing, the first thing that came to mind was a connection problem with Kafka or Slack. The actual problem turned out to be quite different: the application was crashing with an OutOfMemoryError, after which Kubernetes created a new instance of it. The first action we took was to spread the 85 tasks piled up at 01:00 PM across 01:15, 01:30, and 01:45. The second was to slightly increase the application's memory resources. These actions bought us the time we needed to investigate the root cause of the application's increased memory usage.

When we examined the application's memory usage in Grafana to find the root cause, we discovered that memory usage was increasing continuously. As a result, the application crashed whenever it exceeded the memory limit we had defined in Kubernetes.

What is a Memory Leak?

A memory leak occurs when objects that are no longer needed continue to occupy space in memory. Unused objects are normally removed by the Java garbage collector (GC), but objects that are still being referenced are not eligible for collection.
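A minimal illustration of this definition: a static collection keeps a strong reference to every object added to it, so the GC can never reclaim them. The class and sizes here are illustrative, not from our application:

```java
import java.util.ArrayList;
import java.util.List;

public class LeakExample {
    // The static list holds a strong reference to every byte[] added,
    // so none of them are ever eligible for garbage collection.
    static final List<byte[]> CACHE = new ArrayList<>();

    static void handleRequest() {
        CACHE.add(new byte[1024 * 1024]); // 1 MB retained per call, never removed
    }

    public static void main(String[] args) {
        for (int i = 0; i < 5; i++) handleRequest();
        System.out.println(CACHE.size()); // prints 5; grows without bound in a real app
    }
}
```

Run this in a loop with a small heap and the JVM eventually throws an OutOfMemoryError, just as our application did.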

At this stage, we examined the source code to find the specific piece of code that could be causing this situation. However, we were unable to reach a definitive conclusion.

What is a Heap Dump?

A heap dump is a snapshot, generated by the JVM, of all the live objects on the Java heap that are in use by the running Java application.

Since nothing at the source-code level appeared to explain the leak, we decided to capture a heap dump of the application and look for the cause there.

How to Take a Heap Dump?

To capture a heap dump from a JVM, we can use the jmap tool that ships with the JDK. jmap prints memory statistics for a running JVM and can be used for both local and remote processes.

jmap -dump:live,format=b,file=/tmp/dump.hprof <pid>

This approach works well in the local environment. In production, however, executing this command would require accessing the pod and running it inside the container.

Due to security measures, we do not have permission to access production pods. At this point, we used Spring Boot Actuator Production-ready Features.

<dependencies>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-actuator</artifactId>
    </dependency>
</dependencies>

After adding the dependency, send a request to the /actuator/heapdump endpoint.
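Note that, depending on the Spring Boot version, the heapdump endpoint may not be exposed over HTTP by default (typically only health and info are). An application.properties snippet like the following, shown here as an illustrative configuration rather than our exact setup, exposes it:

```properties
# Expose the heapdump Actuator endpoint over HTTP alongside the defaults
management.endpoints.web.exposure.include=health,info,heapdump
```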

curl --location --request GET 'http://localhost:8080/actuator/heapdump' \
--output myheapdump

When we sent the request in the local environment, it returned 200 and the heap dump file was downloaded.

When we deployed the application and sent a request to the /actuator/heapdump endpoint, we received a 503 Service Unavailable response, even though the application's other endpoints were working fine.

At this point, we reached out to our SRE teams for assistance, expressing the need to access the pod in the production environment and capture a heap dump. We requested the execution of the

jmap -dump:live,format=b,file=/tmp/dump.hprof <pid>

command within the respective pod.

As a result of the request, the SRE team reported that the command failed with an error.

To simulate the pod in the production environment, we pulled the application’s latest version image from the registry and deployed it as a container locally. After entering the container, we executed the command and encountered the same error.

In summary, we could capture the dump when running locally, but not when running inside a container.

What could be the difference between the two environments?

In the production environment, we build the application with the JDK but run it in an image based on a JRE. We use a JRE because in production we only need to run the application; it fulfills the requirements while keeping the container footprint smaller.

https://blog.devops.dev/how-to-reduce-jvm-docker-image-size-by-at-least-60-459ec87b95d8

As seen in the example, the image size is reduced to roughly one-third of the original. Considering the number of pods at Trendyol, this approach yields significant disk savings.

JDK vs JRE

JDK (Java Development Kit) is used to develop Java applications and contains numerous development tools such as compilers and debuggers.

JRE (Java Runtime Environment) is the implementation of the JVM (Java Virtual Machine) and is designed specifically to execute Java programs.

https://www.geeksforgeeks.org/difference-between-jdk-and-jre-in-java/

As evident from these definitions, the JRE does not include development tools such as the debugger, the compiler, or jmap. This is why capturing a heap dump from within the container was failing.

Once we realized this, we modified our Dockerfile to run the application on a JDK-based image, built a new image locally, and successfully obtained the heap dump file.
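The Dockerfile change can be sketched as follows. The base-image tags, build tool, and jar path are illustrative assumptions, not our actual Dockerfile:

```dockerfile
# Build stage: the full JDK is needed to compile (image tags are illustrative)
FROM eclipse-temurin:17-jdk AS build
WORKDIR /app
COPY . .
RUN ./mvnw -q package -DskipTests

# Runtime stage: previously a JRE image; switched to a JDK image so that
# diagnostic tools such as jmap are available inside the container
FROM eclipse-temurin:17-jdk
COPY --from=build /app/target/app.jar /app.jar
ENTRYPOINT ["java", "-jar", "/app.jar"]
```

A smaller alternative is to keep the JRE image and add a tool like jattach, trading some convenience for the reduced image size.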

Analysis of Heap Dump File

The heap dump is downloaded as a binary file. To open it, we use VisualVM, which lets us inspect the instances on the JVM's heap.

When we open the heap dump file using VisualVM, we can examine the objects present in the memory at that moment.

Since we captured the heap dump right after deploying the application, we didn’t observe any abnormalities. From this point onwards, our task will be to periodically capture heap dumps and identify objects in memory that are continuously increasing in number.
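On HotSpot-based JVMs, heap dumps can also be triggered programmatically through the HotSpot diagnostic MXBean, the same mechanism that jmap and the Actuator heapdump endpoint use under the hood. The dump path below is illustrative; a scheduler could call this periodically:

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;

public class PeriodicDump {
    // Writes an .hprof heap dump of the current JVM to the given path.
    // liveOnly=true dumps only reachable (live) objects, like jmap's "live" option.
    static void dumpHeap(String path, boolean liveOnly) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        HotSpotDiagnosticMXBean bean = ManagementFactory.newPlatformMXBeanProxy(
                server, "com.sun.management:type=HotSpotDiagnostic",
                HotSpotDiagnosticMXBean.class);
        bean.dumpHeap(path, liveOnly); // fails if the file already exists
    }

    public static void main(String[] args) throws Exception {
        // Unique file name, since dumpHeap refuses to overwrite existing files
        String path = System.getProperty("java.io.tmpdir")
                + "/dump-" + System.currentTimeMillis() + ".hprof";
        dumpHeap(path, true);
        System.out.println("wrote " + path);
    }
}
```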

(Screenshots: Dump 1, Dump 2, and Dump 3, heap dumps captured at regular intervals)

Comparing the heap dumps captured at regular intervals, the issue became apparent: instances of the KafkaMetric class were increasing significantly. This gave us a valuable clue for identifying the problem in our code.

We continuously open and close a KafkaConsumer to measure lag for alerts. Upon reviewing the code again, we noticed that one method used the consumer without closing it afterward. However, this was not the cause of the memory leak, because the heap dump showed an accumulation of KafkaMetric instances, not KafkaConsumer instances. Still, for reliability, we updated the code to close the consumer after use.
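The pattern for that fix is try-with-resources, which guarantees close() runs even when an exception is thrown. KafkaConsumer implements AutoCloseable, so the shape below applies directly; a stand-in class is used here only to keep the sketch free of the Kafka client dependency:

```java
public class ConsumerCloseExample {
    // Stand-in for KafkaConsumer, which also implements AutoCloseable.
    static class FakeConsumer implements AutoCloseable {
        static int openCount = 0;
        FakeConsumer() { openCount++; }
        long pollLag() { return 42L; }
        @Override public void close() { openCount--; }
    }

    static long checkLag() {
        // try-with-resources always calls close(), even if pollLag() throws;
        // this is the guarantee the unclosed-consumer code path was missing.
        try (FakeConsumer consumer = new FakeConsumer()) {
            return consumer.pollLag();
        }
    }

    public static void main(String[] args) {
        System.out.println(checkLag());             // prints 42
        System.out.println(FakeConsumer.openCount); // prints 0: nothing left open
    }
}
```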

After a brief investigation, we found a known issue in Kafka 2.4.0: closing a consumer did not release its KafkaMetric instances. Since we were continuously opening and closing a KafkaConsumer to measure lag, these metrics accumulated. We upgraded the Kafka client version and monitored whether this resolved the problem.

Issue:

Dump After Fix

Conclusion

If you are experiencing a memory leak issue in a containerized JVM, here are important and effective steps to identify the problem:

1. Use a JDK:

  • Make sure you are using a JDK, because you cannot capture a heap dump from a JVM running on a JRE. (Note: https://github.com/jattach/jattach is a tool that allows heap dump collection from a JVM running on a JRE.)

2. Periodically Capture Heap Dumps:

  • Capture heap dumps at regular intervals, before the application exceeds its memory limit and restarts.

3. Analyze the Heap Dumps:

  • Compare instance counts between heap dumps and look for continuously growing collections.

4. Investigate the Problematic Code:

  • Examine the code where instances of the growing classes are used. This helps you understand where and how they are created and retained.

5. Check Library Versions:

  • If the issue remains unresolved, investigate whether the version of the library containing the problematic class has a previously reported memory leak. Check if there is an open memory leak issue for that specific library version.

These steps can assist you in identifying, isolating, and resolving memory leak problems.

Thanks for reading.

We’re building a team of the brightest minds in our industry. Interested in joining us? Visit the pages below to learn more about our open positions.

References

https://www.geeksforgeeks.org/difference-between-jdk-and-jre-in-java/

https://issues.apache.org/jira/browse/KAFKA-9504

https://issues.apache.org/jira/browse/KAFKA-9306

https://www.ibm.com/blog/jvm-vs-jre-vs-jdk/

https://www.baeldung.com/spring-boot-actuators

https://blog.devops.dev/how-to-reduce-jvm-docker-image-size-by-at-least-60-459ec87b95d8

https://spring.io/guides/gs/actuator-service/

https://medium.com/trendyol-tech/synchronizing-data-sources-a-journey-of-decomposition-with-couchbase-transactions-78f907c9adfb
