Tracking down a Golang memory leak with grmon
Discovering that there’s a memory leak in your application is the easy part; diagnosing where the problem lies so you can fix it can be tricky. Using the open-source grmon command-line tool, I was recently able to quickly track down leaking goroutines in a web application I deployed a few days ago.
A “leaking goroutine” is a lightweight thread that keeps running in the background of your program, unintentionally, forever.
💧Oops, I’m Leaking Goroutines
In short-lived applications like command-line tools, leaking goroutines tend to have a less noticeable effect on performance and memory.
However, in long-lived applications like a web server, improperly managed goroutines can degrade performance or reliability and raise the costs associated with your application over time.
While memory is relatively cheap, it’s certainly not free.
The memory leak in my application boiled down to the fact that I had a bunch of goroutines stuck sending data down a channel, even after the HTTP request that spawned them had finished.
To help illustrate this problem, here is an example “command-line” version of the application:
This application is centered around the newProducer function. For the most part, it seems harmless; it even looks like it cleans up after itself by closing its channel once it has finished producing.
Problems won’t start to bubble up until you decide you want to turn this command-line program into an API endpoint for a web application to use.
In the exampleEndpoint function, let’s pretend that to prevent abuse we’ve added a limit in the form of a counter, which we increment upon each result from newProducer. After three results have been written to the client connection, we stop the function with an early return.
So now we feel confident, we’ve run some local tests, and we’re ready to deploy this baby to the cloud!
Depending on the amount of traffic your application receives, you might almost immediately start noticing a steady increase in memory usage that just doesn’t stop.
👩🏽💻Back on our localhost, we want to start debugging this issue.
Setting up grmon, a command-line goroutine monitor, in our application to help with debugging is pretty straightforward:
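Assuming the grmon agent package from the project’s README (github.com/bcicen/grmon/agent), the wiring is a two-line change; the handler here is a stub standing in for the endpoint above:

```go
package main

import (
	"net/http"

	grmon "github.com/bcicen/grmon/agent" // grmon's in-process agent
)

// exampleEndpoint is the handler from earlier, stubbed here for brevity.
func exampleEndpoint(w http.ResponseWriter, r *http.Request) {}

func main() {
	// Start the grmon agent before the server so the grmon CLI can attach.
	grmon.Start()

	http.HandleFunc("/example", exampleEndpoint)
	http.ListenAndServe(":8080", nil)
}
```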
The only things about the application that have changed are the grmon.Start() call before http.ListenAndServe() and the import that brings in the grmon agent package.
Now, when we run our server, we can monitor our goroutines by running grmon in another terminal window. For good measure, and because I want to start making requests against the server, I’ll open a third terminal window to curl the endpoint.
We can use tmux to easily see everything running at once:
By default, we see five goroutines running when we fire up our server. This is fine; if we didn’t touch the server from here, we shouldn’t see much activity at all. It’s not until we start interacting with the server that the problems occur.
Let’s curl the server’s HTTP endpoint to see what happens with those goroutines.
Do you see that chan send state associated with the main.newProducer function in the lower half of the screenshot? If we didn’t touch the server from here, it would never go away or change state.
Let’s go ahead and curl the endpoint a second time:
Again, one of those main.newProducer goroutines is stuck in the chan send state. For the record, if we were to use a tool like pprof, we would see similar indicators, but not as interactively.
🤔How do we fix this?
Setting a Channel Send Timeout
We can fix this problem by implementing a channel send timeout. This is essentially a select statement with two cases: one where the send succeeds, and one where a timeout occurs.
We can adjust our newProducer function to accept a time.Duration as its first argument and use it in a select upon each iteration over the given products:
Now, even if the consumer were to cut the producer short for any reason, the <-time.After(readTimeout) case protects us from leaking for an extended period of time, and we can tune this value, or modify it per request, if we want to.
Note: this is just one way to fix the problem. Another, perhaps more idiomatic, way to solve it would be to use the context package, as gophers like Rakyll have shown before.
For this application, we’ll set a default timeout of 1 second, stored in the readTimeout value.
Monitoring the application with grmon, we can confirm that we no longer have the stuck chan send state. Instead, we see a select state, which goes away after about a second.
You can increase the timeout to a large value to make it easy to identify when monitoring with grmon (look for the select state on main.newProducer):
Leaking goroutines aren’t always obvious to diagnose. Using grmon gave me exactly the tooling I needed to find and fix my issue, and I hope it can be of some use to you in the future too.
⛵️Now my web application is cruising along nicely.
Until next time, that’s all folks!