Tracking down a Golang memory leak with grmon

🎉Finally fixing this on Sunday evening was such a relief! 🎉

Discovering there’s a memory leak in your application is the easy part. Diagnosing exactly where the problem lies so you can fix it can be tricky. Using the open-source grmon command-line tool, I was recently able to quickly track down leaking goroutines in a web application I deployed a few days ago.

A “leaking goroutine” is a lightweight thread that runs in the background of our program and, unintentionally, never exits.

💧Oops, I’m Leaking Goroutines

In short-lived applications like command-line tools, leaking goroutines tend to have a less noticeable effect on performance and memory.

However, in long-lived applications like a web server, improperly managed goroutines can degrade performance or reliability and raise the costs associated with your application over time.

While memory is relatively cheap, it’s certainly not free.

Example Application

The memory leak in my application boiled down to the fact that I had a bunch of goroutines getting stuck sending data down a channel, even after the HTTP request that spawned them had finished.

To help illustrate this problem, here is an example “command-line” version of the application:
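The original post embedded this code as a gist that isn’t reproduced here; below is a minimal sketch of what that command-line version might look like, assuming an unbuffered results channel and a fixed number of items to produce:

```go
package main

import "fmt"

// newProducer spawns a background goroutine that sends results
// down an unbuffered channel, closing the channel when done.
func newProducer() <-chan int {
	results := make(chan int)
	go func() {
		defer close(results)
		for i := 0; i < 10; i++ {
			results <- i
		}
	}()
	return results
}

func main() {
	// The command-line version drains every result, so the
	// producer goroutine always runs to completion. No leak here.
	for result := range newProducer() {
		fmt.Println("result:", result)
	}
}
```

Because `main` reads the channel until it is closed, the producer never blocks for long, and everything exits cleanly.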

This application is centered around the newProducer function. For the most part, it seems harmless — it even looks like it cleans up after itself with defer close(results)!

Problems won’t start to bubble up until you decide you want to turn this command-line program into an API endpoint for a web application to use.

For the exampleEndpoint function, let’s pretend that, to prevent abuse, we’ve added a limit in the form of a counter that we increment upon each result from newProducer. After three results have been written to the client connection, we stop the function with a return.

So now we feel confident: we’ve run some local tests, and we’re ready to deploy this baby to the cloud!

Depending on the amount of traffic your application receives, you might notice, almost right away, a steady increase in memory usage that just doesn’t stop.

This is the exact graph I shared with the team to let them know there’s a problem.

👩🏽‍💻Back on our localhost, we want to start debugging this issue.

👁‍🗨 grmon

Setting up grmon, a command-line goroutine monitor, in our application to help with debugging is pretty straightforward:
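The wiring itself isn’t shown in this extract; assuming the agent import path from the grmon project, it might look something like this sketch (reusing the exampleEndpoint handler from before):

```go
package main

import (
	"net/http"

	grmon "github.com/bcicen/grmon/agent"
)

func main() {
	// Start the grmon agent before the server so the grmon
	// CLI can attach and watch our goroutines.
	grmon.Start()

	http.HandleFunc("/", exampleEndpoint)
	http.ListenAndServe(":8080", nil)
}
```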

The only things about the application that have changed are the grmon.Start() call before http.ListenAndServe() and the import that brings in the grmon agent.

Now, when we go to run our server, we can monitor our goroutines by running grmon in another terminal window. For good measure, and because I want to start making requests at the server, I’ll open a third terminal window to curl the endpoint.

We can use tmux to easily see everything running at once:

By default, we see five goroutines running when we fire up our server. This is fine. If we didn’t touch the server from here, we shouldn’t see much activity at all. It’s not until we start interacting with the server that the problems occur.

Now, let’s curl the server’s HTTP endpoint to see what happens with those goroutines.

Do you see that chan send that is associated with the main.newProducer function in the lower half of the screenshot?

If we don’t touch the server from here, that will never go away or change state.

Let’s go ahead and curl the endpoint a second time:

Again, one of those main.newProducer functions is stuck in the chan send state. For the record, if we were to use a tool like pprof, we would see similar indicators — but not as interactively.

🤔How do we fix this?

Setting a Channel Send Timeout

We can fix this problem by implementing a channel send timeout. This is essentially a select statement with two cases: one where the send succeeds, and one where a timeout occurs.

We can adjust our newProducer function to accept a time.Duration as its first argument and use it in a select on each send:

Now, even if the consumer were to cut the producer short for any reason, the <-time.After(readTimeout) case protects us from leaking for an extended period of time; and we can tune this value, or vary it per request, if we wanted to.

Note: this is just one way to fix the problem. Another, perhaps more idiomatic, way to solve it would be to use the context package, as gophers like Rakyll have shown before.

For this application, we’ll set a default timeout of 1 second stored in the timeout variable.

Monitoring the application with grmon, we can confirm we no longer have the stuck chan send state. Instead, we will see a select state which will go away after about a second.

You can increase the timeout to a large value to make it easy to spot when monitoring with grmon, like so (look for the select state for main.newProducer):

Conclusion

Leaking goroutines aren’t always obvious to diagnose. Using grmon gave me the exact tooling I needed to find and fix my issue, and I hope it can be of some use to you in the future too.

⛵️Now my web application is cruising along nicely.

Look at that beautiful, low memory usage.

Until next time, that’s all folks!