Memory Profiling a Go service

Theodore Rose
Published in Compass True North
6 min read · Aug 25, 2020

The service

Over the past few months, we had been receiving multiple high-memory alerts on one of our services, referred to as WriteApi. WriteApi is an event-driven service that receives events over SQS. Most of the events are fairly low latency, on the order of milliseconds, but a few slower ones run for hours. Each event received by the service is processed on its own goroutine. The alerts started after one particular event was implemented, but we didn't know exactly why. Here is where pprof comes in to help us analyze the problem.

Go tool pprof

Go’s profiling tool is useful for identifying and correcting potential bottlenecks. The tool can be used in a variety of modes, including

  • Heap — a sampling of memory allocations of live objects
  • Threadcreate — stack traces that led to the creation of new OS threads
  • Goroutine — stack traces of all current goroutines
  • Allocs — a sampling of all past memory allocations
  • Block — stack traces that led to blocking on synchronization primitives
  • Mutex — stack traces of holders of contended mutexes

When starting a profile with the github.com/pkg/profile package used below, these modes are mutually exclusive, so only one profile can be active at any given time.

The pprof server

Profiling can also be enabled over HTTP, which exposes the profiles through web-based endpoints.

To start the pprof server:

import (
	//...
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/ handlers on the default mux

	"github.com/pkg/profile"
)

//...

func main() {
	// Write a memory profile to disk when the program exits.
	defer profile.Start(profile.MemProfile).Stop()

	// Serve the pprof endpoints (e.g. /debug/pprof/heap) on port 8080.
	go func() {
		log.Println(http.ListenAndServe(":8080", nil))
	}()
	//...
}

Memory profiling

While investigating the memory alerts on the service described above, we would consistently see a rise in memory that was never released.

This graph covers a one-month time span. Memory would spike at 2:00 a.m. and would only reset after the service was redeployed. Correlated with the spike in memory is a spike in CPU usage; the CPU, however, would return to normal after the processing was done. The result was high-memory alarms on the service.

The sample code used to enable profiling for our WriteApi service can be found here: Sample Code

Writing the profile to a file

After enabling the pprof server, we can hit the heap endpoint to get a snapshot of the current memory. Writing this information to a file automatically (a sketch of doing this from inside the process follows the curl commands below) lets us deploy the service and have it collect samples on its own. If we were running the service locally, we could also use curl and dump the results to files:

curl http://localhost:8080/debug/pprof/heap > heap.0.pprof
sleep 30
curl http://localhost:8080/debug/pprof/heap > heap.1.pprof
sleep 30
curl http://localhost:8080/debug/pprof/heap > heap.2.pprof
sleep 30
curl http://localhost:8080/debug/pprof/heap > heap.3.pprof
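
If we prefer to collect the snapshots from inside the process, the standard runtime/pprof package can write the same heap profile to a file on a timer. The sketch below is a minimal illustration: the directory, file-naming scheme, and 30-second interval are assumptions, not the actual WriteApi code.

import (
	"fmt"
	"log"
	"os"
	"runtime/pprof"
	"time"
)

// snapshotHeap periodically writes the current heap profile to a timestamped
// file, mirroring from inside the process what the curl loop above does from
// the outside.
func snapshotHeap(dir string, interval time.Duration) {
	for i := 0; ; i++ {
		time.Sleep(interval)

		name := fmt.Sprintf("%s/heap.%d.%d.pprof", dir, time.Now().Unix(), i)
		f, err := os.Create(name)
		if err != nil {
			log.Println("create heap profile:", err)
			continue
		}
		// debug=0 writes the same compressed format that the
		// /debug/pprof/heap endpoint returns.
		if err := pprof.Lookup("heap").WriteTo(f, 0); err != nil {
			log.Println("write heap profile:", err)
		}
		f.Close()
	}
}

// In main: go snapshotHeap("/tmp/profiles", 30*time.Second)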

The advantage of deploying the service with the profiler enabled is that we can get real samples without having to reproduce the conditions locally. In fact, I ran the service over a weekend with the profiler enabled, and the CPU and memory footprints were about the same as with the profiler disabled. The graph below covers a two-month period.

After the samples were collected, the directory was zipped and exported for offline analysis. The output could also be directed to an external file system such as S3; I didn't do this in the interest of time. There is also a big drawback to writing to non-persistent storage: if the pod gets restarted, all the files are lost.

Analyzing with pprof

To start the pprof tool and analyze a file:

go tool pprof heap.out

Where “heap.out” is the name of the file you want to analyze. It is also possible to diff two files:

go tool pprof --base heap.1592551172883.pprof heap.1592835572964.pprof

Here I'm diffing an early file against the last file that was generated.

There are plenty of guides on how to use pprof itself, so I'll focus mainly on what was noticed during my analysis. The pprof tool is capable of generating an image file using the png option. The image for the diff command above is shown below:
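
For reference, the diff can be rendered straight to an image from the command line, using the same two files as above:

go tool pprof -png --base heap.1592551172883.pprof heap.1592835572964.pprof > diff.png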

This is pretty hard to see, so let's zoom in on the red area.

This image shows a significant amount of memory being retained from a db query, which looked rather suspicious. The query was also used by another service, but nothing in that service showed any high memory spikes. Looking deeper into how the query was being invoked, I found that the calling function was making a very large request, namely using skip=0, limit=17,000. As we can see from the fifth red box in the stack, this resulted in a scanAll operation. That placed a significant burden on the underlying library: pq.textDecode at 40.22% and reflect.New at 51.58%, which combined to 96.71% of the total memory used by this service for a single operation. This was one part of the problem. The other part was that the processing for this event was slow, taking 2–3 hours to complete.

The Fix

Action items:

  • Change the query to chunk the requests
  • Parallelize the external requests
  • Break the post processing into its own event

Change the query:

We were fortunate enough to be able to determine a few action items immediately: this code needed to be refactored. Part of the refactoring was changing the query to fetch the rows in chunks of 1,000 instead of all at once, as sketched below. This ensured that less memory was held while the processing was happening.
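
Here is a minimal sketch of the chunked query, assuming a database/sql connection; the events table, its columns, and the row type are hypothetical stand-ins for the real WriteApi query.

import (
	"context"
	"database/sql"
)

// row is a hypothetical stand-in for the records returned by the real query.
type row struct {
	ID      int64
	Payload string
}

const chunkSize = 1000

// fetchInChunks pages through the table in batches of chunkSize instead of
// asking for all 17,000 rows at once, so only one batch is held in memory
// while it is processed.
func fetchInChunks(ctx context.Context, db *sql.DB, process func([]row) error) error {
	for offset := 0; ; offset += chunkSize {
		rows, err := db.QueryContext(ctx,
			`SELECT id, payload FROM events ORDER BY id LIMIT $1 OFFSET $2`,
			chunkSize, offset)
		if err != nil {
			return err
		}

		batch := make([]row, 0, chunkSize)
		for rows.Next() {
			var r row
			if err := rows.Scan(&r.ID, &r.Payload); err != nil {
				rows.Close()
				return err
			}
			batch = append(batch, r)
		}
		rows.Close()
		if err := rows.Err(); err != nil {
			return err
		}

		if len(batch) == 0 {
			return nil // no more rows
		}
		if err := process(batch); err != nil {
			return err
		}
		if len(batch) < chunkSize {
			return nil // last page
		}
	}
}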

Parallelize the external requests:

The second part of the changes was to optimize the processing itself. Instead of making 17,000 requests to another team's API one at a time, the code was changed to make up to 1,000 requests in parallel until all 17,000 were completed, as sketched below. That team's API was more than capable of handling 1,000 requests per second.
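
A minimal sketch of the fan-out, using a buffered channel as a semaphore to cap the number of in-flight requests; the item type and callExternalAPI are hypothetical stand-ins for the real client code.

import (
	"context"
	"sync"
)

// item is a hypothetical stand-in for the records being processed.
type item struct{ ID int64 }

// callExternalAPI stands in for the real HTTP call to the other team's API.
func callExternalAPI(ctx context.Context, it item) error {
	return nil
}

// fanOut issues the requests with at most maxInFlight (e.g. 1000) running at
// once, instead of thousands of sequential calls.
func fanOut(ctx context.Context, items []item, maxInFlight int) error {
	sem := make(chan struct{}, maxInFlight) // bounded-concurrency semaphore

	var (
		wg       sync.WaitGroup
		mu       sync.Mutex
		firstErr error
	)
	for _, it := range items {
		it := it // capture the loop variable for the goroutine
		wg.Add(1)
		sem <- struct{}{} // acquire a slot
		go func() {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			if err := callExternalAPI(ctx, it); err != nil {
				mu.Lock()
				if firstErr == nil {
					firstErr = err
				}
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return firstErr
}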

Break the post processing into a new event:

This particular service is event driven; at the top of the memory profile you can see where a new goroutine was created for the event. After all the third-party requests were done, the handler would update the local db. Once the db updates were completed, those values were used to calculate some other results for another team, and those results were sent over Kafka.

I made changes to create a new event that starts the post-processing step, so that this step runs independently: if the previous stage fails, the post-processing doesn't run, but if an error occurs in the post-processing itself, we can now restart that step on its own. When the previous stage completes, it sends a new event to kick off the post-processing.

Since there was no dependency between any of the messages produced by the post-processing, I also changed the code to send them over Kafka in parallel, as sketched below.
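
A minimal sketch of the decoupling, assuming a hypothetical publisher abstraction over the event bus and the Kafka producer; the topic names and payloads are illustrative, not the actual WriteApi code.

import (
	"context"
	"sync"
)

// publisher is a hypothetical abstraction over whatever emits events (SQS)
// and result messages (Kafka).
type publisher interface {
	Publish(ctx context.Context, topic string, payload []byte) error
}

// handleMainEvent finishes the db updates and then only emits a
// "post-process" event instead of doing the follow-up work inline.
func handleMainEvent(ctx context.Context, bus publisher, jobID string) error {
	// ... chunked query, parallel external requests, db updates ...
	return bus.Publish(ctx, "write-api.post-process", []byte(jobID))
}

// handlePostProcess runs independently, so it can be retried on its own if it
// fails. The computed results are sent over Kafka in parallel because the
// messages have no ordering dependency on each other.
func handlePostProcess(ctx context.Context, bus publisher, results [][]byte) error {
	var wg sync.WaitGroup
	errs := make(chan error, len(results))

	for _, msg := range results {
		msg := msg // capture the loop variable
		wg.Add(1)
		go func() {
			defer wg.Done()
			if err := bus.Publish(ctx, "team-results", msg); err != nil {
				errs <- err
			}
		}()
	}
	wg.Wait()
	close(errs)
	return <-errs // nil if no send failed, otherwise one of the errors
}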

End results

The changes have now been deployed for the last two weeks, and the graphs look better, though the memory is not as drastically improved as I had hoped.

The CPU usage has actually improved as a side effect of the refactoring, while the memory has improved less dramatically. Profiling is a work in progress, and this is something we can keep improving. Using the technique described above, we can continue to gather detailed information about the performance of this service over time.
