Practical Leak Hunting in Jenkins

Shlomi Benita
CyberArk Engineering
6 min read · May 15, 2024


For Jenkins administrators dealing with high-scale operations, the phrase “Jenkins is slow” is a recurring theme. But what’s causing this slowdown?

Sometimes, the answer is simple, like too many jobs or resource starvation (e.g., CPU, memory, disk).

But in some cases, pinpointing the exact cause behind the slowness is a complicated task with multiple factors at play. One of the most challenging is a memory leak.

In this blog post, I’ll share our journey to hunt down a memory leak that significantly impacted Jenkins’ performance and stability, offering insights and strategies for addressing similar issues in the future.

Since Jenkins is a Java application, it is crucial to understand how the Java Virtual Machine (JVM) behaves in order to get the best performance out of it.

A Crash Course in Java Memory

The JVM uses two main memory areas:

  • Heap: This is where objects are stored. The JVM allocates memory in the heap as needed for Jenkins objects.
  • Stack: A temporary storage area used for method calls, local variables, and parameters. When a method exits, its stack frame is destroyed and the memory is freed.

Within the JVM heap, memory is further divided according to object lifespan:

Young Generation: The place for new objects, further subdivided into:

  • Eden Space: This is where all newly created objects are initially allocated.
  • Survivor Space: This is where objects that survive a garbage collection cycle in Eden are moved.

Old Generation: This is for long-lived objects that have survived multiple garbage collection cycles in the Young Generation. It has a larger capacity than Eden.
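
To make this concrete, these areas are sized with standard JVM flags. A minimal sketch, assuming the flags end up on the Jenkins controller's java command line; the JAVA_OPTS variable name and all the values here are examples, not recommendations:

# Illustrative sizing flags for a Jenkins controller (example values only)
#   -Xms / -Xmx          : initial / maximum total heap size
#   -XX:NewRatio=2       : the Old generation gets twice the space of the Young generation
#   -XX:SurvivorRatio=8  : Eden gets 8 times the space of each Survivor space
JAVA_OPTS="-Xms8g -Xmx8g -XX:NewRatio=2 -XX:SurvivorRatio=8"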

What Does a Memory Leak Look Like?

First, you need to monitor the JVM using your favorite tool. The important metrics to watch are the heap areas: Eden, Survivor, and Old.
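
If you don't have a monitoring stack in place, the JDK's own jstat can give you the same picture from the command line. A minimal sketch, assuming the JDK tools are installed on the controller and you know the Jenkins PID:

# Print heap utilization (as a percentage of each area's capacity) every 5 seconds
# Columns of interest: S0/S1 = Survivor spaces, E = Eden, O = Old generation
jstat -gcutil <jenkins-pid> 5000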

Let’s take a look at a graph (a real one):

Heap Areas Graph Over Work Week

We can see that, throughout the work week, Eden goes up and down as expected, but the Old generation keeps climbing.

This is the best indication of a memory leak. Why?

The Old generation is the area for long-lived objects, and the graph shows that more and more long-lived objects are accumulating in memory.
A healthy Old generation graph should have a flat trendline, not a steadily increasing one.

The Impact of a Memory Leak

During its regular operation, the Garbage Collector (GC) has to pause the Java program while it works on areas of memory, to avoid corruption. Each pause looks like a Jenkins freeze, but it usually lasts milliseconds and doesn’t interrupt users.

Now imagine a GC fighting against an Old generation that is running out of space. It has to kick in more and more often, which leads to more pauses (i.e., freezes) in the program.
Jenkins becomes slower and slower until there is no room left for new objects, and Jenkins freezes.

Tip: The GC is one of the most important things to tune in Jenkins, as it has a significant impact on stability and responsiveness.
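
As a starting point, here is a minimal sketch of GC-related options for a controller running on Java 11 or later. The values and paths are assumptions, not a definitive recipe, and where exactly JAVA_OPTS is set depends on how Jenkins is installed (systemd unit, Docker image, etc.):

# Illustrative GC settings for a Jenkins controller on Java 11+ (example values only)
#   -XX:+UseG1GC is already the default on modern JVMs; the flag just makes the choice explicit
#   -XX:MaxGCPauseMillis is a pause-time target for G1, not a hard guarantee
#   -Xlog:gc* writes a GC log so pauses can be correlated with "Jenkins is slow" reports
#   -XX:+HeapDumpOnOutOfMemoryError captures a heap dump automatically when the heap is exhausted
JAVA_OPTS="$JAVA_OPTS -XX:+UseG1GC -XX:MaxGCPauseMillis=200 \
  -Xlog:gc*:file=/var/log/jenkins/gc.log:time,uptime \
  -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/jenkins"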

Now that we have a good understanding of JVM and memory management, let’s hunt for memory leaks!

Take a Heap Dump

To pinpoint a memory leak in Jenkins, we’ll need to capture a snapshot of the heap at a specific point in time (heap dump). This snapshot provides a detailed look at all the objects currently residing in the different areas.

Since we need to perform the analysis on a different machine, we can upload the heap dump to S3 (or any other shared storage).

#!/bin/bash

# Get current timestamp
timestamp=$(date +"%Y%m%d%H%M%S")

# Define filename for heap dump
heap_dump_file="heap_dump_${timestamp}.hprof"

# PID of the Java process you want to generate the heap dump for
# (adjust the pattern if more than one Java process runs on the machine)
java_process_pid=$(pgrep -f "java.*" | head -n 1)

# Check that a Java process was found
if [ -z "$java_process_pid" ]; then
  echo "Java process not found."
  exit 1
fi

# Dump only live objects (note: this triggers a full GC first)
jmap -dump:live,format=b,file="$heap_dump_file" "$java_process_pid"
echo "Heap dump generated: $heap_dump_file"

echo "Compressing heap dump file…"
gzip "$heap_dump_file"

# Upload the compressed heap dump file to the S3 bucket
echo "Uploading compressed heap dump to S3…"
aws s3 cp "$heap_dump_file.gz" "s3://<your bucket name>/$heap_dump_file.gz"

echo "Upload complete."

# Remove it from local storage
rm "$heap_dump_file.gz"
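
As a side note, if jmap is unavailable or restricted in your environment, jcmd can produce the same .hprof file. This is only an alternative invocation, reusing the variables from the script above:

# Alternative: ask the JVM itself to write the heap dump (by default, live objects only)
jcmd "$java_process_pid" GC.heap_dump "/tmp/heap_dump_${timestamp}.hprof"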

Heap Dump Analysis

For the analysis, we use a great tool called Eclipse Memory Analyzer (MAT).

The Eclipse Memory Analyzer is a fast and feature-rich Java heap analyzer that helps you find memory leaks and reduce memory consumption.

(There are alternatives as well, such as VisualVM, JProfiler, etc.)

Here are a few of my personal tips for MAT usage:

  • Tip 1: I recommend using a powerful machine to run MAT if your heap dump is large. For my 30GB heap dump, I used 16 cores and 64GB of RAM.
  • Tip 2: MAT, by default, is limited to 1GB of memory. I recommend raising this to whatever free memory the machine has. The setting lives in “MemoryAnalyzer.ini”, in a flag named “-Xmx” (see the snippet after these tips).
  • Tip 3: Load the heap dump and grab some coffee. It takes time to load and analyze.
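
For reference, this is roughly what the relevant part of “MemoryAnalyzer.ini” looks like after the change. Everything after -vmargs is passed to MAT’s own JVM; the default line is -Xmx1024m, and 48g below is just an example for a machine with 64GB of RAM:

-vmargs
-Xmx48g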

After loading the heap dump, I got this excellent report:

This is a fantastic report to start working with, but luckily, MAT also has a handy “Leak Suspects” report. Let’s take a look at it:

MAT generates a list of leak suspects and provides more specific data about each potential leak.

In this use case, let’s look into suspect №1:

What an amazing insight! It points to a specific class instance that holds 11.2 GB (!) in memory.

Digging a bit more (by clicking on details) reveals an insightful stack trace:

At this point, I had to go to the code to understand what it was.

A quick search led me to the LoggingHandler class of JavaMelody (the library behind the Jenkins Monitoring plugin), which holds the leaking singleton (LOG_COUNTER).

I jumped back to the Jenkins script console to explore more about it:

import net.bull.javamelody.*

// Print the Counter singleton that aggregates log statistics
println(LoggingHandler.getLogCounter())
Result of printing Counter script

Here you go: 1,631,058 requests are sitting in the map and consuming the memory.

From the code, I saw that the Counter instance has a public method for clearing the map.

So, I ran this Jenkins Script:

import net.bull.javamelody.*

// Print the counter before clearing, clear the underlying map, then print it again
println(LoggingHandler.getLogCounter())
LoggingHandler.getLogCounter().clear()
println(LoggingHandler.getLogCounter())

// Suggest a full GC so the freed entries are reclaimed right away
System.gc()

Looking back at the heap monitoring graph:

Heap Areas Graph After Clear

It released the memory, and Jenkins performance immediately improved. Hooray!

After identifying the source of the leak, we can start a new analysis of why it is leaking and plug it. In this specific case, we found a repetitive warning message that fills the map and occupies the memory. Fixing the root cause of that message will solve the issue.

No More Memory Mysteries

We saw how MAT helped us find a memory leak and identify the plugin causing the issue, and how it gave us an excellent starting point for fixing the issue and improving Jenkins’ performance.

Have you encountered memory leaks in your Jenkins? Now you know how to hunt them down :).
