Merge Sort is one of those classic “divide and conquer” algorithms you study in your freshman year of CS, and for me it was one of the best tools for developing an intuition for recursion and O(n log n) complexity.
In this writeup we’re going to look at a naive concurrent version of MergeSort in Go, use Go’s built-in benchmarking tools to see what’s wrong with it, and then apply a simple fix.
In case you are not familiar with the algorithm, Harvard’s CS50.tv channel has a great explanation of it.
The basic idea is this: as with every recursive solution, we assume we already have the solution in hand (a function called “MergeSort”) and apply it to smaller subsets of the input. We split the original slice in half and feed the left and the right halves back into that same function.
Now we have two sorted slices in hand, the left half and the right half, and we need to merge them (the fun part).
The merge function combines two sorted slices into one: we compare the first element of the left slice with the first element of the right. Depending on the desired sort order (ascending or descending), we take the smaller (or larger) of the two and append it to the result. Each time we do that, we drop the element we took from its slice (so if we took it from the left, the left slice now has one less element). We repeat until one of the slices runs out; whatever remains in the other is already sorted, so it gets appended wholesale.
Let’s make it concurrent
The MergeSort function above can easily be sprinkled with some concurrency goodness. The heart of the recursion is the two calls:
l = MergeSort(s[:n])
r = MergeSort(s[n:])
What’s this? Synchronous computes? This calls for some goroutines! Let’s rewrite MergeSort with some concurrency and call it MergeSortMulti:
Simple enough. At every step of the recursion, we fire up goroutines to handle the computation of each half asynchronously. We then wait for the computation to finish on both parts of the slice (this happens concurrently) and merge the results.
Surely this is faster, right? We don’t have to wait for the first recursion to end before we begin with the second one. Wrong!
When Too Much is Too Much
Benchmarking MergeSort vs MergeSortMulti reveals that the former, synchronous version performs about 13 times faster than the concurrent one. Here’s the benchmarking code for a million elements in the slice:
It reveals that the concurrent version is a lot slower than the normal synchronous version:
BenchmarkMergeSortMulti-8 1 1711428093 ns/op
BenchmarkMergeSort-8 10 131232885 ns/op
What’s going on? Well, in our quest to make the computation concurrent, we are firing off goroutines like crazy: each step of the recursion spawns two of them. Sorting a million elements ends up creating millions of goroutines that fight over the CPU and put pressure on Go’s scheduler. The result is slower code.
The Solution — Use Concurrency, But Hold Your Horses
So how can we gain the performance benefits of concurrent code without setting off a shower of goroutines? One nice way to limit concurrency in Go is a buffered channel semaphore. Buffered channels give us a nice and easy way to gate execution based on the number of concurrent units of work we want in flight.
We set up a buffered channel with a capacity of 100, and when we spawn goroutines to perform the asynchronous computation, we fall back to the synchronous version of MergeSort if 100 workers (goroutines) are already busy with the computation:
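The final gist is also missing from this text; one way the semaphore-limited version might look follows below. The capacity of 100 and the fallback to the synchronous MergeSort come from the description above; merge and MergeSort are repeated for completeness:

```go
package main

import (
	"fmt"
	"sync"
)

// sem is a buffered channel used as a counting semaphore: at most
// 100 goroutines can hold a slot at any one time.
var sem = make(chan struct{}, 100)

// merge combines two sorted slices into one sorted slice (ascending).
func merge(l, r []int) []int {
	ret := make([]int, 0, len(l)+len(r))
	for len(l) > 0 && len(r) > 0 {
		if l[0] <= r[0] {
			ret = append(ret, l[0])
			l = l[1:]
		} else {
			ret = append(ret, r[0])
			r = r[1:]
		}
	}
	ret = append(ret, l...)
	return append(ret, r...)
}

// MergeSort is the synchronous version we fall back to.
func MergeSort(s []int) []int {
	if len(s) <= 1 {
		return s
	}
	n := len(s) / 2
	return merge(MergeSort(s[:n]), MergeSort(s[n:]))
}

// MergeSortMulti spawns a goroutine for a half only if a semaphore
// slot is free; otherwise it sorts that half synchronously.
func MergeSortMulti(s []int) []int {
	if len(s) <= 1 {
		return s
	}
	n := len(s) / 2

	var l, r []int
	var wg sync.WaitGroup
	wg.Add(2)

	select {
	case sem <- struct{}{}: // a slot is free: sort the left half concurrently
		go func() {
			defer wg.Done()
			l = MergeSortMulti(s[:n])
			<-sem // release the slot
		}()
	default: // all 100 slots are busy: stay synchronous
		l = MergeSort(s[:n])
		wg.Done()
	}

	select {
	case sem <- struct{}{}: // same pattern for the right half
		go func() {
			defer wg.Done()
			r = MergeSortMulti(s[n:])
			<-sem
		}()
	default:
		r = MergeSort(s[n:])
		wg.Done()
	}

	wg.Wait()
	return merge(l, r)
}

func main() {
	fmt.Println(MergeSortMulti([]int{5, 2, 8, 1, 9, 3})) // [1 2 3 5 8 9]
}
```

The non-blocking `select` with a `default` branch is what makes this a soft limit: instead of waiting for a slot to free up, the code simply does the work inline, so deep recursion degrades gracefully into the plain synchronous sort.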
The result: the concurrent version now beats the synchronous one by roughly 2.5X:
The same benchmarking code now yields:
BenchmarkMergeSortMulti-8 30 36741152 ns/op
BenchmarkMergeSort-8 20 90843812 ns/op