Investigating a Goroutine Leak

Wungmathing Shaiza
Jul 22, 2022


A few months back, our monitoring channel in Slack got inundated with warnings about a micro-service restarting at an unusually high frequency. Taking a closer look, it appeared that the micro-service in question was restarting due to an out-of-memory issue. The memory usage chart looked something like this. Each drop to the bottom is a restart 😬

memory usage over time

Diagnosis 🕵

This was interesting because we saw no major spike in our traffic, and we hadn’t made any major change to this particular micro-service in a while. But clearly something was wrong. So we began our investigation 🔍 We soon realized that the issue cropped up only after a particular PR was merged. This was golden because it narrowed our scope of investigation: we could now focus on the changes made in that PR. Lucky for us, it was a small PR, and the change was a call to one of our helper functions that looked something like this
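A rough sketch of what such a helper might look like, reconstructed from how it is described later in this post (doSomething’s result is sent into an unbuffered channel ch that RunAsync returns); treat it as a sketch, not the exact production code:

// Rough reconstruction of the helper: it kicks off work in a goroutine
// and returns a channel that will carry the result.
func RunAsync() <-chan error {
    ch := make(chan error) // unbuffered
    go func() {
        err := doSomething()
        ch <- err // blocks until a receiver reads from ch
    }()
    return ch
}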

Looking at it, nothing really stood out. Surely, we thought, the problem couldn’t be with RunAsync. This helper function was not new code, and it had been in production for a while.

Continuing our debugging journey, we added pprof endpoints to our micro-service to get more insight into what was going on. If you are unfamiliar with pprof, I’d suggest reading this article. It should give you enough information, and links to other useful resources, to get started. In short, pprof samples your application and provides different profiling data that give you more insight into its state.
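Wiring pprof into a Go HTTP service typically looks something like the sketch below (a minimal example, not our service’s actual setup); the blank import registers the /debug/pprof/* handlers on the default mux:

// Minimal pprof setup sketch.
package main

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers the /debug/pprof/* endpoints
)

func main() {
    // Serve the pprof endpoints; a real service would mount this
    // alongside (or separately from) its application routes.
    log.Println(http.ListenAndServe("localhost:6060", nil))
}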

After we added pprof to the micro-service and deployed it, we downloaded the heap profile

curl <baseUrl>/debug/pprof/heap --output heap1.out

Opening the file,

# opens the profile in an interactive web UI in the browser
go tool pprof -http=":9090" heap1.out

our attention was drawn to a section of the output that looked something like this

The thing that stood out was the big red rectangle on the left. We initially did not know what this was, but after some Googling around and looking at the source code, we learned that runtime.malg is a call that allocates a stack for a newly created goroutine. The goroutine holds on to this stack until it completes execution.

https://github.com/golang/go/blob/master/src/runtime/proc.go#L4066-L4081

Seeing that the heap profile showed relatively high memory usage by this function, it could mean one of two things:

  1. a memory leak in the function itself
  2. a high frequency of calls to the function

We quickly ruled out 1 because, while not impossible, the chance of there being a bug in runtime.malg seemed super low. On the other hand, 2 made sense to us because we were calling RunAsync (which starts a goroutine) on one of our busier endpoints, and it was also part of the change introduced in our problematic PR.
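For completeness, the same pprof endpoints also expose a goroutine profile, which lists live goroutines grouped by the stack they are blocked on; a leak like this shows up there as an ever-growing count. The command below assumes the same <baseUrl> as before:

# goroutine profile: counts of live goroutines grouped by stack trace
curl "<baseUrl>/debug/pprof/goroutine?debug=1" --output goroutine1.out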

Feeling a bit more confident 🤠, we took a closer look at the implementation of RunAsync again but came to the same conclusion: nothing suspicious 😑. Something had to be wrong though 🤔 Our metric report in DataDog wasn’t lying 😕 We then took a look at how RunAsync was called, and then it dawned on us 💡 Turned out, the caller of RunAsync was ignoring its return value 🤦‍♂

func handler(w http.ResponseWriter, r *http.Request) {
    // set up for background processing
    RunAsync() // 👈 problem point

    // complete the request without waiting for the task to finish
    w.WriteHeader(http.StatusOK)
}

The way an unbuffered channel works is that a send into it blocks until another goroutine receives from the channel. In our case, because nothing ever received from the channel returned by RunAsync, the goroutine it started was blocked at the channel send. So while doSomething() did execute, the goroutine was stuck on the very next instruction: sending the return value of doSomething() into the channel ch. This resulted in a goroutine leak, where none of these goroutines completed execution because each was blocked forever waiting for a receiver on its channel ch. And that meant the resources allocated to each goroutine never got released. So, over time, they exhausted all available memory in the micro-service and eventually crashed it.
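A minimal, self-contained reproduction of that failure mode (hypothetical code, not from our service) makes the leak easy to see via runtime.NumGoroutine:

// Hypothetical reproduction: each ignored RunAsync call leaves one
// goroutine parked forever on its channel send.
package main

import (
    "fmt"
    "runtime"
    "time"
)

func doSomething() error { return nil }

func RunAsync() <-chan error {
    ch := make(chan error)              // unbuffered
    go func() { ch <- doSomething() }() // blocks: nobody ever receives
    return ch
}

func main() {
    before := runtime.NumGoroutine()
    for i := 0; i < 1000; i++ {
        RunAsync() // returned channel ignored, just like the buggy handler
    }
    time.Sleep(100 * time.Millisecond) // let the goroutines reach the send
    fmt.Printf("goroutines: %d -> %d\n", before, runtime.NumGoroutine())
}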

Solution 👷

We evaluated two options to fix it:

Option 1:

func handler(w http.ResponseWriter, r *http.Request) {
    // set up for background processing
    go func() {
        ch := RunAsync()
        if err := <-ch; err != nil {
            log.Print(err)
        }
    }()

    // complete the request without waiting for the task to finish
    w.WriteHeader(http.StatusOK)
}

Option 2:

func handler(w http.ResponseWriter, r *http.Request) {
    // set up for background processing
    go func() {
        if err := doSomething(); err != nil {
            log.Print(err)
        }
    }()

    // complete the request without waiting for the task to finish
    w.WriteHeader(http.StatusOK)
}

While both solve the goroutine leak, option 1 starts 2 goroutines (because RunAsync itself starts 1) while option 2 starts just 1. Seeing that there was no additional benefit to using RunAsync for what we were trying to accomplish, we opted for option 2. And lo and behold, after we deployed the fix (marked by the red arrow below), the memory usage graph normalized and our service stopped restarting 🎊

memory usage over time

This was a lot of learning for us, and I am super satisfied with how we were able to systematically zero in on the issue 🙌 Hopefully, someone somewhere finds this insightful, and maybe it saves them some hair-pulling moments 😉 Until next time, cheers and never stop learning! Also, stay safe 😷

