Analyzing and improving memory usage in Go

Scott Gangemi
SafetyCulture Engineering
9 min read · Jul 11, 2021

The last article I wrote focused on how the garbage collector (GC) worked in Go. The next step was to combine this with my love of performance optimisation. The intent of this article is to show you how to optimise the memory used in your Go applications.

When exploring this space, I found that a lot of articles focused on contrived examples, which made it difficult to apply this to production code. This article uses a real-world application that I work on every day, and explores some improvements I made to it. By the end of this article, I hope you’ll have some new knowledge about how to identify and tune performance bottlenecks.

The application that we’ll focus on powers the Issues product at SafetyCulture. It, and all of our other services, report application stats to Grafana, including GC pause time. Before any improvements were made, the average GC pause time for this service was about 16ms, with the max GC pause time spiking to between 500ms and 1s.

As Go can do sub-millisecond GC pauses, this seemed high. However, considering this is a constantly running webserver handling a large volume of HTTP and gRPC requests, I wouldn’t expect numbers that low. In any case, I wanted to see how I could reduce this max GC pause time.

GC pause time before any improvements were made
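The Grafana side of this isn’t the focus here, but if you want to sample GC pause times directly from a Go process, the runtime exposes the recent pause history via runtime/debug.ReadGCStats. A minimal sketch (in a real service you’d push these numbers to your metrics system rather than log them):

package main

import (
    "log"
    "runtime/debug"
    "time"
)

func main() {
    // Sample GC stats periodically. stats.Pause holds the pause
    // history with the most recent pause first.
    var stats debug.GCStats
    for range time.Tick(30 * time.Second) {
        debug.ReadGCStats(&stats)
        if len(stats.Pause) > 0 {
            log.Printf("GCs: %d, last pause: %s, total pause: %s",
                stats.NumGC, stats.Pause[0], stats.PauseTotal)
        }
    }
}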

Analysis with pprof

After finding an area to investigate, I ran a pprof analysis. pprof is a tool baked into the Go toolchain that collects profiling data from a running application and lets you analyse and visualise it, which makes it a great starting point for performance work. I’d recommend running pprof in production so you get a realistic sample of what your customers are doing.
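If your service doesn’t already expose pprof, the standard library makes this a one-line import for anything running a net/http server. A minimal sketch (the port and setup are illustrative, not our production configuration):

package main

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
    // Serve the pprof endpoints on a separate, internal-only port so they
    // aren't exposed alongside customer-facing traffic.
    go func() {
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }()

    // ... the rest of your application ...
    select {}
}

With that in place, you can pull a profile straight from the running process (for example, go tool pprof http://localhost:6060/debug/pprof/heap) or save the profile to a file and analyse it later.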

When you run pprof you’ll get some files that focus on goroutines, CPU, memory usage and some other things according to your configuration. We’re going to focus on the heap file to dig into memory and GC stats. I like to view pprof in the browser because I find it easier to find actionable data points. You can do that with the below command.

go tool pprof -http=:8080 profile_name-heap.pb.gz

pprof has a CLI tool as well, but I prefer the browser option because I find it easier to navigate. My personal recommendation is to use the flame graph: I find it’s the easiest visualiser to make sense of, so I use that view most of the time. The flame graph is a visual version of a function’s stack trace. The function at the top is the caller, and everything underneath it is called during the execution of that function. You can click on individual function calls to zoom in on them, which changes the view and lets you dig deeper into the execution of a specific function. Note that the flame graph only shows the functions that consume the most resources, so some functions won’t appear; this makes it easier to spot the biggest bottlenecks.

When you’re selecting the sample you want to view, these are your options:

  • inuse_space: the amount of memory in use (allocated and not yet released) at the time of profiling.
  • inuse_objects: the number of objects in use (allocated and not yet released) at the time of profiling.
  • alloc_space: the total amount of memory allocated over the lifetime of the application.
  • alloc_objects: the total number of objects allocated over the lifetime of the application.
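You can switch between these in the browser UI, or pick one up front on the command line. For example (the profile filename is illustrative):

go tool pprof -sample_index=alloc_objects -http=:8080 profile_name-heap.pb.gz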

To analyze GC, we want to look at the alloc_objects option. This will help us understand how objects are allocated over the lifetime of the application. Looking at the flame graph, one function stood out as responsible for a third of all allocated objects; we’ll call it fetchData. This function seemed like a great place to begin investigating.

When I’m investigating performance improvements I frame it as an experiment, and pprof gave me enough data for my first experiment.

Experiment 1

Objective: Reduce the amount of memory being allocated by fetchData.

Hypothesis: After reading through the code that calls fetchData, I don’t think we need every field that is requested. If we remove the unused field from the request, the amount of allocated memory might be reduced.

Investigation: At SafetyCulture, inspections are a big part of how our customers use our product. Because of this, bringing inspections into other parts of the product is really valuable. fetchData gets inspections data from an API in JSON format. The response size varies, but the largest responses can be quite big, at around 20MB. Unmarshaling and parsing a 20MB JSON object uses up a lot of memory. This is corroborated by the stack trace of fetchData, which mostly consists of a large number of encoding/json functions. As mentioned in the hypothesis, it looks like a requested field is no longer used. It’s still referenced in the code, but in a path that looks like it will never be executed. Can we remove this field, and will it improve memory usage?

We now have a few questions to answer:

  • Is the suspected code actually dead?
  • Are we able to remove the suspect field from the request without breaking anything?
  • Will removing this field from the request improve memory usage?

Method: Since this field looks unused, I wanted to find out why it was unused. I went back to the PR that first added this field and traced the history of the file. I found a merged PR from a few months ago (2 years after the initial commit!) that made this field redundant. I now felt confident that removing this field would be safe. I deleted the field from the request, deleted the code referring to it, did some testing and merged it.

I monitored the application in production and everything looked ok. It seemed that the part of my hypothesis about the unneeded field was correct. Now, we need to see whether the number of memory allocations has been reduced.

Results:

  • Memory utilisation had reduced. On the left, you can see memory use before the changes, and afterwards on the right. The constant blue line is what the Go runtime uses, and the coloured spikes are from our application. As you can see, memory use is lower on average, and the spikes in memory usage are far smaller as well. Before, spikes in memory usage were around 1018MB, whereas afterwards, the biggest spikes were around 649MB. On average, memory use before was between 100MB and 250MB, whereas afterwards, it was between 80MB and 120MB. These changes caused less memory to be used and reduced the variance significantly.
  • Allocated objects for fetchData had reduced. Before, 33% of allocated objects were going towards this function; afterwards, this had dropped to around 28%.
  • The 99th percentile latency for the endpoint has dropped. The changes were deployed on Wednesday 26 (which isn’t displayed on the graph). The bigger spikes in the 99th percentile have dropped (except for the one weird spike on Thursday 27), and the 99th percentile latency is now around 800ms, down from 1.2s.
  • The max GC pause time has dropped (a bit). It’s not a huge drop, but it’s still a drop. Since deploying on the 26th, the biggest spikes in max GC pause time have dropped, and the number of large spikes has reduced as well.

Overall, I was really happy with these results! Memory usage was reduced, and the customers who experience the 99th percentile response time will now receive a better experience.

But, can we do more?

Experiment 2

The fetchData function is still responsible for 28% of all memory allocations. This is because the big JSON object is put into memory and unmarshaled into a Go struct. This takes us to the second experiment.

Objective: Reduce the amount of memory allocations from unmarshaling the JSON object.

Hypothesis: If the process of extracting data from JSON can be improved, the number of memory allocations might be reduced.

Investigation: I wanted to understand why this function uses so much memory. A POST request fetches the JSON, and the calling function then decodes the response body via json.NewDecoder(response.Body).Decode(), which unmarshals the JSON into a suitable struct.

This is fine for handling most JSON responses, but this JSON response can be massive (up to 20MB), which makes it a bottleneck for this particular code path. Under the hood, Decode reads the entire JSON value into memory and then unmarshals it. Because this JSON object can be so big, it’s no surprise that this function uses up so much memory.
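To make that concrete, here’s a sketch of that pattern. The types and function shape are simplified stand-ins, not our actual code:

package inspections

import (
    "context"
    "encoding/json"
    "net/http"
)

// inspectionsResponse is a simplified stand-in for the real response shape.
type inspectionsResponse struct {
    Inspections []struct {
        ID    string `json:"id"`
        Title string `json:"title"`
    } `json:"inspections"`
}

// fetchData sketches the original approach: POST for the data, then decode
// the whole response body into one struct in a single call.
func fetchData(ctx context.Context, client *http.Client, url string) (*inspectionsResponse, error) {
    req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, nil)
    if err != nil {
        return nil, err
    }
    resp, err := client.Do(req)
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close()

    // Decode buffers the entire JSON value before unmarshaling it, so a
    // ~20MB response means a lot of allocation in one go.
    var out inspectionsResponse
    if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
        return nil, err
    }
    return &out, nil
}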

Method: I read through the docs for the encoding/json package to see what other approaches were available, and came across the Decoder’s Token method, which lets you consume the JSON as a stream. This means the application can decode the JSON on the fly, instead of reading the entire object into memory first. Token() lets you manually step through the JSON to extract the data you need. This method has some tradeoffs: if the JSON response schema changes, your parsing code won’t work anymore, and it’s more verbose and complex than json.Unmarshal. However, used sparingly, it can lead to some real performance improvements. There are a number of examples available online, such as this post, and the sketch below shows the rough shape. As you can see, the code becomes much more complex than a plain json.Unmarshal, so if you take this approach, I’d recommend annotating the code so that your colleagues can understand it in the future.
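Here’s a rough sketch of the streaming version, assuming a response shaped like {"inspections": [...]} and that we only need a couple of fields from each element (this is illustrative, not our production code):

package inspections

import (
    "encoding/json"
    "io"
)

// inspection holds only the fields we actually need from each element.
type inspection struct {
    ID string `json:"id"`
}

// decodeInspections walks the JSON token by token and decodes the
// "inspections" array one element at a time, instead of holding the
// whole document in memory as a single value.
func decodeInspections(r io.Reader) ([]inspection, error) {
    dec := json.NewDecoder(r)

    var out []inspection
    for {
        tok, err := dec.Token()
        if err == io.EOF {
            return out, nil
        }
        if err != nil {
            return nil, err
        }

        // When we reach the "inspections" key, step into its array.
        if key, ok := tok.(string); ok && key == "inspections" {
            if _, err := dec.Token(); err != nil { // consume the opening '['
                return nil, err
            }
            for dec.More() {
                var ins inspection
                if err := dec.Decode(&ins); err != nil {
                    return nil, err
                }
                out = append(out, ins)
            }
            if _, err := dec.Token(); err != nil { // consume the closing ']'
                return nil, err
            }
        }
    }
}

Each element is still unmarshaled into a small struct, but only one at a time, so the peak allocation is per inspection rather than for the whole 20MB document.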

Results:

  • Memory utilisation has reduced even further: As before, the left shows memory use before the changes were deployed, and the right shows it afterwards. Experiment 1 resulted in memory usage of around 80MB to 120MB, and this has now dropped to 70MB to 90MB. The big spikes are much less severe as well.
  • The 99th percentile latency for the calling function has dropped even further. In the below screenshot, my changes went out just before 12PM, between Tue 22 and Wed 23, corresponding to the dip in the 99th percentile. Experiment 1 brought the 99th percentile down to around 800ms, but after Experiment 2 this has dropped even further, to around 500ms.
  • The max GC pause time has dropped. In the below screenshot, the code changes went live at around 8:30 AM. After this time, the max GC pause time is lower for all garbage collections, and many of the GC pauses take around 100ms. This brings the max GC pause time from around 300ms-400ms down to around 100ms-300ms. There are still some high spikes here, but the max GC pause time is much lower.

Overall Results

  • Average memory usage has dropped from around 100MB-250MB to 70MB-90MB, and spikes in memory usage have dropped from around 1018MB to 120MB.
  • Max GC pause time has dropped. When we first started investigating, the max GC pause time was around 400ms. It’s now around 300ms but is often much quicker. We’re now seeing GC pause time events of 100ms, which never happened before these experiments. There were also occasional spikes in GC pause time that would take a second, and these no longer appear.
  • Most importantly, the endpoint using fetchData has had a big drop in 99th percentile response time. At the start, the 99th percentile was around 1.2s but now sits around 500ms. As far as I’m concerned, this is the most impactful change. It’s easy to lose sight of why performance improvements are important. If you focus on your customer and their experience of using your product, you’ll always head in the right direction.

I hope that after reading this you’ll have more confidence exploring your application’s performance. First, figure out which pieces of your application take up the most resources. Then you can start asking questions about why they use those resources. From there, you can take measurable steps towards improving your application and your customers’ experience.
