Go code refactoring : the 23x performance hunt

A few weeks ago, I read an article called “Good Code vs Bad Code in Golang” where the author guides us step-by-step through the refactoring of an actual business use case.

The article focuses on turning “bad code” into “good code”: more idiomatic, more legible, leveraging the specifics of the go language. But it also insists on performance being an important aspect of the project. This triggered my curiosity: let’s dig in!

The program basically reads an input file, and parses each line to populate an object in memory.

The author not only published the source on Github, he also wrote an idiomatic benchmark. This was a really great idea, like an invitation to tweak the code and reproduce the measurements with the command:

$ go test -bench=.
μs per execution (smaller is better)

So, on my machine the “good code” is 16% faster. Can we gain more?

In my experience there is an interesting correlation between code quality and performance. When you successfully refactor your code to make it clearer and more decoupled, you often end up making it faster, as it’s less cluttered with irrelevant instructions that were previously executed in vain, and also because some possible optimizations become obvious and easy to implement.

On the other hand, if you push further in your quest for performance, you will have to give up simplicity and resort to hacks. You will indeed shave milliseconds, but code quality will suffer, insofar as it will become more difficult to read and to reason about, more brittle, and less flexible.

Climbing mount Simplicity, and then descending it

It’s a trade-off: how far are you willing to go?

In order to properly prioritize your performance effort, the most valuable strategy is to identify your bottlenecks and focus on them. To achieve this, use profiling tools! Pprof and Trace are your friends:

$ go test -bench=. -cpuprofile cpu.prof
$ go tool pprof -svg cpu.prof > cpu.svg
A pretty big CPU usage graph (click for SVG)
$ go test -bench=. -trace trace.out
$ go tool trace trace.out
A rainbow trace : lots of tiny tasks (click to open, in Chrome only)

The trace proves that all CPU cores are used (bottom lines 0, 1, etc.), which looks like a good thing at first. But it shows thousands of small colored computation slices, and also some blank slots where some of the cores are idle. Let’s zoom in:

A 3ms window (click to open, in Chrome only)

Each core actually spends a lot of its time idle, and keeps switching between micro-tasks. It looks like the granularity of the tasks is not optimal, leading to a lot of context switches and to contention due to synchronization.

Let’s check with the race detector if the synchronization is correct (if it’s not, then we have bigger issues than performance):

$ go test -race

Yes!! it seems correct, no data race condition was encountered. Test funcs and Benchmark funcs are distinct (see the doc), but here they call the same function ParseAdexpMessage and we can use either with -race.

The concurrency strategy in the “good” version consists in processing each line of input in its own goroutine, to leverage multiple cores. This is a legitimate intuition, as goroutines have the reputation to be lightweight and cheap. How much are we gaining thanks to concurrency? Let’s compare with the same code in a single sequential goroutine (just remove the go keyword preceding the line parsing function call)

μs per execution (smaller is better)

Oops, it’s actually faster without any parallelism. This means that the (non-zero) overhead of launching a goroutine exceeds the time saved by using several cores at the same time.

The natural next step, as we are now processing lines sequentially instead of concurrently, is to avoid the (non-zero) overhead of using a channel of results: let’s replace it with a bare slice.

μs per execution (smaller is better)

We’ve now gained a ~40% speedup from the “good” version, only by simplifying the code, removing the concurrency (diff).

With a single goroutine, only 1 CPU core is working at any given time.

Now let’s have a look at the hot function calls in the Pprof graph:

Spotting the bottleneck

The benchmark of our current version (sequential, with slice) spends 86% of its time actually parsing messages, which is fine. We quickly notice that 43% of the total time is spent matching a regular expression with (*Regexp).FindAll .

While regexps are a convenient and flexible way to extract data from raw text, they have drawbacks, including a cost in memory and runtime. They are powerful, but probably overkill for many use cases.

In our program, the pattern

patternSubfield = "-.[^-]*"

is mostly intended to recognize “commands” starting with a dash “-”, and a line may have several commands. This, with a little tuning, could be done with bytes.Split. Let’s adapt the code (commit, commit) to replace the regexp with a Split:

μs per execution (smaller is better)

Wow, this is an extra 40% perf gain!

The CPU graph now looks like this:

No more regexp huge cost. A fair share of the time (40%) is spent allocating memory, from 5 various functions. It’s interesting that 21% of the total time is now accounted for by bytes.Trim .

This function calls intrigue me: can we do better?

bytes.Trim expects a “cutset string” as an argument (for the characters to be removed on the left and on the right), but we use it only with the single space byte as the cutset. This is an example where you may gain performance by introducing a bit of complexity: implement your own custom “trim” func in lieu of the standard library one. The custom “trim” deals with only a single cutset byte.

μs per execution (smaller is better)

Yay, another 20% shaved off. The current version is 4x as fast as the original “bad” one, while using only 1 CPU core of the machine. Pretty substantial!

Earlier we gave up on concurrency at the line-processing level, however there is still room for improvement by going concurrent, with a coarser granularity. For example, processing 6,000 files (6,000 messages) is faster on my workstation when each file is processed in its own goroutine:

μs per Message (smaller is better, purple is concurrent)

A 66% win (i.e. a 3x speedup), which is good but “not that much” considering that it utilizes all of my 12 CPU cores! This might mean that with the new, optimized code, processing a whole file is still a “small task” for which the overhead of a goroutine and synchronization is not negligible.

Interestingly, increasing the number of messages from 6,000 to 120,000 has no effect on the performance of the sequential version, and it degrades the performance of the “1 goroutine per message” version. This is because launching a very large number of goroutines is possible and sometimes useful, however it does put some pressure of the go runtime scheduler.

We can reduce the execution time further (not by a 12x factor, but still) by creating only a handful of workers, e.g. 12 long-running goroutines, each goroutine processing a subset of the messages (commit):

μs per Message (smaller is better, purple is concurrent)

Tuned concurrency for a big batch of messages removed 79% of the execution time, compared to the sequential version. Note that this strategy makes sense only if you do have a lot of files to process.

The optimal utilization of all the CPU cores consists in a few goroutines that each processes a fair amount of data, without any communication and synchronization until it’s done.

It’s a common heuristic to choose a number of processes (goroutines) equal to the number of available CPU cores, but it’s not always the optimum: your mileage may vary depending of the nature of the task. For example, if your task reads from the filesystem or makes network requests, then it makes perfect sense for performance to have more goroutines than CPU cores.

μs per Message (smaller is better, purple is concurrent)

We have reached a point where the efficiency of our parsing code is hard to improve by localized enhancements. The execution time is now dominated by the allocation and the garbage collection of small objects (e.g. the Message struct), which makes sense because memory management operations are known to be relatively slow. Further optimization of the allocation strategy… is left as an exercise to the crafty reader.

Using a completely different algorithm can also lead to great speedups.

Here, I loosely take inspiration from this talk

Lexical Scanning in Go — Rob Pike

to build a custom Lexer (source) and a custom Parser (source). It’s a proof of concept (I didn’t implement all of the corner cases), it’s less intuitive than the original algorithm, and error handling can be tricky to implement correctly. However, it’s frugal and 30% faster than the previous optimized version.

μs per Message (smaller is better, purple is concurrent)

Yes, that’s a 23x speedup factor, compared to the initial code.

That’s it for today, I hope you enjoyed the journey. Here are a few disclaimers and take-aways:

  • Performance can be improved at many levels of abstraction, using different techniques, and the gains are multiplicative.
  • Tune the high-level abstractions first: data structures, algorithms, proper decoupling. Tune the low-level abstractions later: I/O, batching, concurrency, stdlib usage, memory management.
  • Big-O analysis is fundamental but often it’s not the relevant tool to make a given program run faster.
  • Benchmarking is hard. Use profiling and benchmarks to discover bottlenecks and get insight about your code. Keep in mind that the benchmark results are not the “real” latencies experienced by the end-users in production, and take the numbers with a grain of salt.
  • Fortunately, the tooling (Bench, Pprof, Trace, Race detector, Cover) makes performance exploration approachable and exciting.
  • Writing good, relevant tests is non-trivial. But they are extremely precious to help “stay on track”, i.e. to refactor while preserving the original correctness and semantics.
  • Take a moment to ask yourself how fast will be “fast enough”. Don’t waste time over-optimizing a one-shot script. Consider that optimization comes with a cost: engineering time, complexity, bugs, technical debt.
  • Think twice before obscuring the code.
  • Algorithms in Ω(n²) and above are usually expensive.
  • Complexity in O(n) or O(n log n), or below, is usually fine.
  • The hidden factors are not negligible! For example, all of the improvements in the article were achieved by lowering those factors, and not by changing the complexity class of the algorithm.
  • I/O is often a bottleneck: network requests, DB queries, filesystem.
  • Regular expressions tend to be a more expensive solution than really needed.
  • Memory allocation is more expensive than computations.
  • An object in the stack is cheaper than an object in the heap.
  • Slices are useful as an alternative to costly reallocation.
  • Strings are efficient for read-only usage (including reslicing), but for any other manipulations, []byte are more efficient.
  • Memory locality is important (CPU-cache-friendliness).
  • Concurrency and parallelism are useful, but tricky to get right.
  • When digging deeper and lower-level, there’s a “glass floor” that you don’t really want to break through in go. If you crave asm instructions, intrinsics, SIMD… maybe you should consider go for prototyping, and then switch to a lower-level language to take full advantage of the hardware and of every nanosecond!