Optimizing parallel Merge Sort with Go tools

George Papadrosou
Apr 4, 2020 · 7 min read


In a previous post, I wrote a sequential and a parallel implementation of the Merge Sort algorithm. The parallel implementation was faster, but it was based on my intuition. This time, I try to further optimize it using pprof and the trace tool that Go provides.

The current parallel implementation tries to bound the number of active goroutines to the number of available processors. We will use it as a baseline for our optimizations.
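To make the rest of the post easier to follow, here is roughly the shape of that baseline. It is a simplified reconstruction, not the exact code from the previous post: semChan is a buffered channel with capacity runtime.NumCPU() used as a semaphore, and merge(dst, left, right) is the usual sequential merge.

func mergesort(src []int64, semChan chan struct{}) {
    if len(src) <= 1 {
        return
    }
    mid := len(src) / 2
    left := make([]int64, mid)
    right := make([]int64, len(src)-mid)
    copy(left, src[:mid])
    copy(right, src[mid:])

    var wg sync.WaitGroup
    select {
    case semChan <- struct{}{}: // a "processor slot" is free: sort the left half in a new goroutine
        wg.Add(1)
        go func() {
            defer wg.Done()
            mergesort(left, semChan)
            <-semChan // release the slot
        }()
    default: // no slot free: sort the left half in the current goroutine
        mergesort(left, semChan)
    }
    mergesort(right, semChan) // the right half always stays on the current goroutine
    wg.Wait()
    merge(src, left, right) // merge the sorted copies back into src
}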

Let’s use pprof to see what our program spends the most time on. All benchmarks use an input array of 16 million elements.
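If you want to follow along, a CPU profile can be collected from the benchmark with something like the following (the benchmark regex and file names are up to you):

go test -bench=. -cpuprofile cpu.prof
go tool pprof cpu.prof   # then type `top` at the (pprof) prompt to see the hottest functions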

merge() dominates the runtime, as we would probably expect. But the implementation also spends a lot of time juggling memory. This happens because of these lines:

left := make([]int64, mid)
right := make([]int64, len(src)-mid)
copy(left, src[:mid])
copy(right, src[mid:])

They create copies of the left and right part of src before passing them to mergesort(), so that merge() can safely merge them back into the original array.

We get rid of them by allocating a large temp array once and passing it to merge(). Merge results are first written to temp and then copied back to src, our original array. Here is the diff:
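Conceptually, the change looks roughly like this (a sketch, not the exact code from the diff): the recursive calls now sort src[:mid] and src[mid:] in place, passing temp[:mid] and temp[mid:] along, so the left/right copies disappear and merge() turns into something like:

// merge assumes the two halves of src are already sorted: it merges them
// into temp and then copies the result back into src.
func merge(src, temp []int64) {
    mid := len(src) / 2
    i, j := 0, mid
    for k := range src {
        switch {
        case i == mid: // left half exhausted
            temp[k] = src[j]
            j++
        case j == len(src): // right half exhausted
            temp[k] = src[i]
            i++
        case src[i] <= src[j]:
            temp[k] = src[i]
            i++
        default:
            temp[k] = src[j]
            j++
        }
    }
    copy(src, temp) // one copy per merge, instead of two allocations and two copies
}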

Now we only allocate one extra array and copy memory once per merge operation. Let’s run the benchmark again:

And compare it with the previous one:

The program now runs much faster and uses a lot less memory! Let’s see pprof:

Clearly, less time is spent allocating and moving memory. The flat and cumulative times of runtime.memmove and runtime.mallocgc have decreased by about 50%, as we would expect. merge() expectedly takes more time, since it now has to copy the results from temp to src.

Now, let’s use the trace tool to check what our goroutines are doing and how they utilize the available processors.
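Collecting and viewing an execution trace from the benchmark is similar to profiling; one way to do it:

go test -bench=. -trace trace.out
go tool trace trace.out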

Let’s see the first trace:

The first thing we notice is the gaps: time slices during which our code doesn’t run. This is not what we want, since it means we are not fully utilizing our processing power. But why do we have these gaps?

Let’s zoom in a bit and enable all flow events, so we can see how our goroutines are related to each other (a goroutine may spawn or unblock another goroutine).

Let’s observe goroutine G23 (light blue) in the middle of the image. It spawns some goroutines that run on the same processor, but it also creates and waits for a new one, G2473, which runs on another processor. Clicking on G23 near its end shows its end stack trace: it is blocked on the WaitGroup, waiting for G2473 to finish. Given that the current implementation limits the number of alive goroutines to the number of processors, this is a problem. We want all scheduled goroutines to run in parallel, not wait on one another.

There is also another issue: many goroutines run for a very short time, even less than the Go scheduler wait time. We can confirm this in the goroutine analysis screen by scrolling down to the goroutines with the smallest run times.

Let’s address these bottlenecks by first eliminating the very small goroutines. The problem is in this area:

func mergesort(src, temp []int64, semChan chan struct{}) {
    if len(src) <= 1 {
        return
    }

Our program may create new goroutines even for single-element inputs. We need to use a bigger cutoff here and resort to a sequential sort if the input is small enough. But how small? In the image below, you can see a comparison of the different cutoff values I benchmarked. All cutoff implementations outperform our previous one, which validates our “cutoff” idea.

On my machine and for 16 million elements to sort, a cutoff of 10000 elements gives the best results (combining the runtime and memory improvements). Generally, we should run all the benchmarks for various input array sizes so that we don’t overfit our implementation to a single input. However, I am skipping this here as the post is already long.

So here is the change. If the input array doesn’t have more elements than the cutoff number, we fall back to the built-in sequential sort.
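A simplified sketch of the guard (sort.Slice here stands in for the sequential fallback; the exact fallback and cutoff handling live in the repo):

const cutoff = 10000 // best value found by benchmarking on my machine for this input size

func mergesort(src, temp []int64, semChan chan struct{}) {
    if len(src) <= cutoff {
        // Small segment: sorting it sequentially is cheaper than spawning
        // and scheduling more goroutines.
        sort.Slice(src, func(i, j int) bool { return src[i] < src[j] })
        return
    }
    // ...otherwise split, recurse and merge exactly as before...
}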

Let’s examine the new trace. Because our program runs faster now, the benchmarking framework has run our benchmark function more times: once with b.N=1 and once with b.N=2, for a total of three sorts. That’s why we can distinguish 3 sections in the trace below.

Let’s just zoom in on the middle section and see what we get:

We no longer have as many small goroutines as before. The goroutine analysis page also confirms it: no goroutines run for less than the scheduler wait time (3rd column from the right).

Now let’s tackle the other bottleneck: goroutines blocking on each other. A solution would be to give our processors more work by creating more goroutines. But how many?

We will have to benchmark. First, let’s simply allow twice as many goroutines as processors. On my machine this means 16 goroutines.
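In code this is a one-line change to the semaphore’s capacity, something like:

// Allow up to twice as many concurrent sorting goroutines as logical CPUs.
semChan := make(chan struct{}, 2*runtime.NumCPU())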

Let’s benchmark this change:

So we got another ~16% improvement in the run time. Let’s see the new trace:

That’s much, much better! There are still some gaps where our code does not run. Let’s check the synchronization blocking profile that the trace UI provides:

As expected, the program spends a lot of time waiting on a WaitGroup. There is no way (to my knowledge) to get rid of this with this algorithm, since some of our goroutines must wait for others to finish. A lot of time is also spent reading from the channel that is used to limit the number of alive goroutines.

We can remove this constraint and let our program create as many goroutines as it needs. It probably seems like a bad idea, as it would put a lot of stress on the Go scheduler, but we need to validate our hypothesis. Here is the diff:
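Roughly, the function ends up looking like this (a sketch; the actual diff is in the repo):

func mergesort(src, temp []int64) {
    if len(src) <= cutoff {
        sort.Slice(src, func(i, j int) bool { return src[i] < src[j] })
        return
    }
    mid := len(src) / 2
    var wg sync.WaitGroup
    wg.Add(1)
    go func() {
        defer wg.Done()
        mergesort(src[:mid], temp[:mid]) // every "left" half gets its own goroutine
    }()
    mergesort(src[mid:], temp[mid:]) // the "right" half stays on the current goroutine
    wg.Wait()
    merge(src, temp)
}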

We get rid of the semaphore channel and just spawn a new goroutine for every “left” input. Comparing its benchmark with the previous one:

It’s actually faster! Let’s see the trace:

Beautiful, isn’t it? We need to zoom in a lot to find any gaps, and even those are negligible.

We could say at this point that the job is done. Except we need to clarify one more thing: why is our program faster even though we create more goroutines?

Checking the goroutine analysis UI of the trace, we see that we create only ~6k goroutines for an input of 16 million elements.

This is due to our fallback to sequential sorting when the array contains fewer than 10k elements. If we removed that fallback, the program would create millions of goroutines and would indeed perform badly.
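As a rough sanity check: halving 16 million elements until each segment holds at most 10,000 takes about 11 splits (2^11 = 2,048 segments of roughly 8,000 elements each), so a single sort spawns on the order of a couple of thousand goroutines, and the total across the benchmark’s iterations stays in the thousands rather than the millions.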

This last image shows the comparison between the sequential mergesort, our baseline “intuitive” parallel implementation, and our final one, for an input array of 16 million 64-bit integers:

Our optimized parallel implementation is ~2x faster than our baseline parallel one, and ~4.8x faster than the sequential.

Conclusion

Benchmarking and profiling are necessary to validate our assumptions and changes when working on performance improvements. Go makes this quite easy by providing not only a built-in framework for benchmarks, but also tools to profile and trace the execution of our programs. Sometimes, understanding the output of these tools and finding the root cause is tricky and challenging, but that is when we learn.

The code is on GitHub, with separate branches for every optimization step, in case you want to reproduce the results.
