Daily code optimization using benchmarks and profiling in Golang - Gophercon India 2016 talk

Need help figuring out whether algorithm A runs faster than algorithm B? Whether method A performs better than method B? Whether approach A for sending a JSON response to the client beats approach B? How do you find answers to such everyday performance questions?

These questions can be answered easily if you happen to program in Go. Go comes bundled with benchmarking and profiling facilities that make performance-oriented development easier than ever before.

Let's take a look at this simple code snippet: an HTTP handler that parses the query parameters, converts them to numbers, adds them, and sends the HTML result back to the client using a template. The code is followed by its benchmark and the simple template used.

Running the benchmark

$ go test -run=^$ -bench=.

Now let's perform CPU profiling to identify the CPU hotspots:

$ go test -run=^$ -bench=. -cpuprofile=cpu.out

This will create a test binary ending in .test, and the profile information will be stored in cpu.out.

$ ls
bench.test cpu.out simple_add.go simple_add_test.go template.html

Let's look at the type of bench.test:

$ file bench.test
bench.test: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), not stripped

Before we begin optimizing, let's run the benchmark again and save its results to a file for later comparison:

$ go test -run=^$ -bench=. | tee bench0

Now let's analyse the CPU profile:

$ go tool pprof bench.test cpu.out

This opens an interactive CPU profiler.

You can check the top 20 CPU-consuming functions:

(pprof) top20
1340ms of 2230ms total (60.09%)
Dropped 78 nodes (cum <= 11.15ms)
Showing top 20 nodes out of 154 (cum >= 220ms)
flat flat% sum% cum cum%
320ms 14.35% 14.35% 320ms 14.35% runtime/internal/atomic.Xchg
120ms 5.38% 19.73% 120ms 5.38% runtime/internal/atomic.Xadd
110ms 4.93% 24.66% 850ms 38.12% runtime.findrunnable
70ms 3.14% 27.80% 350ms 15.70% runtime.mallocgc
70ms 3.14% 30.94% 360ms 16.14% runtime.mapassign1
70ms 3.14% 34.08% 70ms 3.14% runtime.usleep
50ms 2.24% 36.32% 50ms 2.24% runtime.acquirep1
50ms 2.24% 38.57% 50ms 2.24% runtime.heapBitsSetType
50ms 2.24% 40.81% 80ms 3.59% runtime.heapBitsSweepSpan
50ms 2.24% 43.05% 90ms 4.04% runtime.scanobject
50ms 2.24% 45.29% 50ms 2.24% runtime.stringiter2
50ms 2.24% 47.53% 60ms 2.69% syscall.Syscall
40ms 1.79% 49.33% 40ms 1.79% runtime.memclr
40ms 1.79% 51.12% 40ms 1.79% runtime.memmove
40ms 1.79% 52.91% 230ms 10.31% runtime.newarray
40ms 1.79% 54.71% 100ms 4.48% text/template.goodName
30ms 1.35% 56.05% 30ms 1.35% runtime.(*mspan).sweep.func1
30ms 1.35% 57.40% 70ms 3.14% runtime.evacuate
30ms 1.35% 58.74% 30ms 1.35% runtime.releasep
30ms 1.35% 60.09% 220ms 9.87% runtime.unlock

Let's see the CPU consumption by cumulative time:

(pprof) top --cum
0.13s of 2.23s total ( 5.83%)
Dropped 78 nodes (cum <= 0.01s)
Showing top 10 nodes out of 154 (cum >= 0.68s)
flat flat% sum% cum cum%
0 0% 0% 1.33s 59.64% runtime.goexit
0 0% 0% 1.11s 49.78% _/home/hackintosh/bench.BenchmarkHandleStructAdd
0 0% 0% 1.11s 49.78% testing.(*B).launch
0 0% 0% 1.11s 49.78% testing.(*B).runN
0.01s 0.45% 0.45% 1.10s 49.33% _/home/hackintosh/bench.handleStructAdd
0 0% 0.45% 0.89s 39.91% runtime.mcall
0.01s 0.45% 0.9% 0.87s 39.01% runtime.schedule
0.11s 4.93% 5.83% 0.85s 38.12% runtime.findrunnable
0 0% 5.83% 0.84s 37.67% runtime.goexit0
0 0% 5.83% 0.68s 30.49% html/template.(*Template).Execute

Cumulative time is useful for understanding the execution tree and its cumulative CPU usage.

The most useful feature is the ability to get the execution tree, with its CPU usage, in the form of an image:

(pprof) web

Or, if you wish to get the result as a PDF, you could try the following command:

$ go tool pprof --pdf bench.test cpu.out > cpu0.pdf

If you'd like to take a detailed look, here's the link to the PDF.

Let's zoom into the tree and see the places where the CPU is being utilized substantially.

Now, if you trace the lines branching out from bench.BenchmarkHandleStructAdd, almost one third of the function's cumulative time is spent parsing the template file.

We could parse the template once, when the program starts, and only execute it when the handler receives a request.

Let’s do that simple change to begin with and see the difference in performance.

Let's benchmark this change, save the result to a file, and compare it with the previous benchmark:

$ go test -run=^$ -bench=. | tee bench1

Let's use benchcmp to analyse the performance gain from the optimization:

$ benchcmp bench0 bench1
benchmark                    old ns/op   new ns/op   delta
BenchmarkHandleStructAdd-4   41808       3965        -90.52%


Wow, you can clearly see a performance gain of over 90%.

You can use the parse package to parse saved benchmark results and build your own tools: https://godoc.org/golang.org/x/tools/benchmark/parse

Let's do the CPU profiling again:

$ go test -run=^$ -bench=. -cpuprofile=cpu.out

Let's profile the code again, save the execution tree with its CPU usage to a PDF, and analyse it:

$ go tool pprof --pdf bench.test cpu.out > cpu1.pdf

Let's look at the profile tree visualization to understand what changed. You can access the profile PDF here.

You can clearly see that template.ParseFiles is no longer in the picture.

Let's see the CPU consumption hotspots once more:

$ go tool pprof bench.test cpu.out
(pprof) list handleStructAdd

Clearly template.Execute is the heaviest consumer of CPU. w.Header().Set (line 37) and w.Write are consuming considerable CPU resources too.

Let's now see the top 10 CPU-consuming functions.

Clearly mallocgc tops the list. mallocgc is part of Go's memory allocator and garbage collector; the GC's job, as usual, is to sweep heap allocations once they start spiking. Memory profiling can help us understand memory consumption and allocation hotspots, and then take steps towards memory optimization.

But how do we identify the reason behind the high CPU usage of some of these runtime functions and understand its root cause? Let's say I want to know which functions contribute most to the mallocgc invocations. How do I do that? There's too much noise in the profile tree.

Hmm, that's a great question. Here is one way to trace the roots of a runtime function's invocation. Let's say you would like to know which functions are responsible for the heavy invocation of mallocgc:

(pprof) web mallocgc

Here is an extract (zoomed) from the profile tree for web mallocgc.

From the execution tree you can clearly see the roots of the mallocgc invocations. Here is my analysis from the above tree.

  1. Template.Execute is the primary culprit; you can follow the tree deeper to understand more about its behaviour.
  2. runtime.convT2E is using a considerable amount of CPU, and it's invoked by template.Execute. The documentation of template.Execute makes it clear that its second argument is an interface, but we're passing a struct here, so runtime.convT2E kicks in to convert the struct into an empty interface.
  3. runtime.stringtoslicebyte is invoked to convert strings to byte slices.

So now we know how to reduce the load on mallocgc, which will reduce the load on both CPU and memory: we have to work on those functions.
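To make the convT2E point concrete, here is a hypothetical micro-benchmark (not from the talk): storing a struct value into an interface{} forces the runtime to heap-allocate a boxed copy on every iteration, and ReportAllocs makes that cost visible.

```go
package main

import (
	"fmt"
	"testing"
)

type pair struct{ A, B int }

// sink is package-level so the compiler cannot optimize the store away.
var sink interface{}

// BenchmarkConvT2E measures the struct-to-empty-interface conversion
// that runtime.convT2E performs, e.g. when a struct is passed to
// template.Execute's interface{} parameter.
func BenchmarkConvT2E(b *testing.B) {
	b.ReportAllocs()
	v := pair{A: 1, B: 2}
	for i := 0; i < b.N; i++ {
		sink = v // boxed into an interface; allocs/op shows the cost
	}
}

func main() {
	// testing.Benchmark lets us run a benchmark outside `go test`.
	r := testing.Benchmark(BenchmarkConvT2E)
	fmt.Println(r.String(), r.MemString())
}
```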


Wow, that was really useful! But I still see a lot of noise in the profile tree: too many nodes and too many connections. I want to focus only on the heaviest functions. Is there a way?

$ go tool pprof --nodefraction=0.2 bench.test cpu.out

Yes. By profiling with the command above you tell go tool pprof to ignore nodes that don't account for at least 20% of the samples, and to display only the nodes that do.

But to understand more about the heap allocations and to optimize the program, we need to resort to memory profiling. Let's look at memory profiling and then return to the cases above.

Take a look at all the functionality provided by testing.B here.

There are a bunch of interesting ones; I find ReportAllocs and RunParallel the most interesting.

Let’s begin memory profiling …


First, let's add ReportAllocs() to the benchmark and see the difference in the resulting benchmark output.
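The change is a single call at the top of the benchmark. Here is a self-contained sketch, with a pared-down stand-in handler (the real one is the handler we've been profiling):

```go
package main

import (
	"bytes"
	"html/template"
	"net/http"
	"net/http/httptest"
	"strconv"
	"testing"
)

// Inline stand-in for template.html, parsed once.
var tmpl = template.Must(template.New("add").Parse(`{{.Sum}}`))

func handleStructAdd(w http.ResponseWriter, r *http.Request) {
	a, _ := strconv.Atoi(r.URL.Query().Get("a"))
	b, _ := strconv.Atoi(r.URL.Query().Get("b"))
	var buf bytes.Buffer
	tmpl.Execute(&buf, struct{ Sum int }{a + b})
	w.Header().Set("Content-Type", "text/html")
	w.Write(buf.Bytes())
}

func BenchmarkHandleStructAdd(b *testing.B) {
	b.ReportAllocs() // adds allocs/op and B/op columns, no -benchmem needed
	req := httptest.NewRequest("GET", "/add?a=1&b=2", nil)
	for i := 0; i < b.N; i++ {
		handleStructAdd(httptest.NewRecorder(), req)
	}
}
```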

$ go test -bench=.

Here's how the benchmark results look now.

This clearly shows a total of 18 allocations per operation, consuming 1080 bytes per operation.

Before making any changes, let's save the benchmark results to a file so that we can measure the performance difference after our changes:

$ go test -run=^$ -bench=. | tee profile.1

The goal from here on is to reduce the memory allocations and see how that affects mallocgc:

$ go test -run=^$ -bench=. -memprofile=mem0.out

Now let's start the interactive profiler.

It can be run with various options for memory profiling; the two most important ones are

  --inuse_objects (counts of objects still in use)
  --alloc_space   (total bytes allocated, including memory since freed)

In this case I'm interested in the allocation space:

$ go tool pprof --alloc_space bench.test mem0.out

First, let's look at the cumulative memory usage.

Let's look at the top 10 memory-consuming functions.

Let's go one step deeper and see where MIMEHeader.Set is consuming memory.

Clearly the memory is consumed during map allocation: setting up the response headers consumes a lot of memory, along with template execution.

Let us now list the memory allocations of the handler under scrutiny.

From the above screenshot you can clearly see the memory hotspots.

Now, here are the places where optimizations can be done.

From the documentation of template.Execute you can see that its first argument is an io.Writer.

But http.ResponseWriter also satisfies io.Writer, so it can be used to write the response back to the client directly, without setting headers manually or rendering the template to a string and writing it back with ResponseWriter.Write().

Here is the modification.

Let's run the benchmark now, save the result, and compare it with the old benchmark:

$ go test -run=^$ -bench=. | tee profile.2
$ benchcmp profile.1 profile.2

Wow, you can see that the CPU usage has increased, but the memory consumption has gone down.

But how do we understand this behavior? Simple: follow the same steps as before and identify the hotspots with Go's benchmarking and profiling tools. There is no exact workflow for code optimization, but with Go's benchmarks and profilers it is easy to experiment, measure and compare performance, and understand behavior; with experience you can quickly make the more performant choice.

Here are some other tools that can come in handy when optimizing code:

  1. Golang blocking profiler
  2. sync.Pool, to pool and reuse resources
  3. Garbage collector tracer
  4. Memory allocator tracer
  5. Scheduler tracer
  6. runtime.ReadMemStats
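As an illustration of item 2, here is a hypothetical sketch (not from the talk) of a sync.Pool that recycles bytes.Buffers across requests, cutting per-request allocations and thus the load on mallocgc:

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool hands out reusable buffers; New is only called when the pool
// has nothing to recycle.
var bufPool = sync.Pool{
	New: func() interface{} { return new(bytes.Buffer) },
}

// render builds a small response using a pooled buffer instead of
// allocating a fresh one each call.
func render(name string) string {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset() // a recycled buffer may still hold old data
	defer bufPool.Put(buf)

	fmt.Fprintf(buf, "hello, %s", name)
	return buf.String()
}

func main() {
	fmt.Println(render("gopher"))
}
```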

Here is a link to my talk on the same topic at Gophercon India 2016: https://www.youtube.com/watch?v=-KDRdz4S81U
