# Sequential and Parallel Merge Sort in Go

Recently, I stumbled upon the PingCAP talent-plan, which is “a series of training courses about writing distributed systems in Go and Rust”. Having recently picked up Go, I decided to begin with the “Distributed systems in Go” course, whose first project is to implement the merge sort algorithm to sort 16 million 64-bit integers.

The project requirements are:

• Pass the provided unit test.
• Make it perform better than `sort.Slice()`.
• Have a document to describe your idea and record the process of performance optimization with pprof.
• Have a good code style.

In this post we will look at two implementations of merge sort: a sequential one and a parallel (more accurately, concurrent) one. These implementations are completely unoptimized. In the next post, we will use the parallel implementation as a baseline for further optimizations.

# What is merge sort?

Merge sort is a divide and conquer algorithm that sorts the input array by breaking it into subarrays until each has a single element, and then merging the results back into a sorted array. The time and space complexities are O(n log n) and O(n) respectively. We won’t go into details of the asymptotic analysis as there are many great resources about it online.

# Sequential implementation

We can easily implement merge sort using a recursive function. The function splits the array in half, calls itself for each of the subarrays and then merges the resulting arrays in ascending order.

The merge function compares the elements of the two input arrays and writes them in sorted order to the result array (which is the input to `MergeSort`). When it finishes traversing one of the two inputs, it appends the remaining elements of the other to the result.

Running the provided test suite verifies our implementation.

We won’t do any profiling in this post; we will simply write a “parallel” implementation and run a benchmark to ensure it is faster than the sequential one. We will improve it in the second post with the `pprof` and `trace` tools.

# Parallel implementation

An optimization for our solution would be to take advantage of multiple processors and use multiple goroutines to parallelize the execution of some `Merge` calls. Since our task is CPU-bound, we want to cap the number of goroutines running in parallel at the number of available processors. Creating more goroutines may even make our program slower, since the Go scheduler would spend more time switching between them and less time running our code.

To limit the number of simultaneous goroutines, we can use a buffered channel as a semaphore. Writing to a full buffered channel blocks, which stops us from creating extra goroutines we cannot utilize. `MergeSort()` will spawn a new goroutine to sort the left subarray but will sort the right subarray itself, then wait for the other to finish.
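Here is one way to sketch that idea. This is my own variant, not the course’s reference code: I use a non-blocking `select` to acquire a semaphore slot, so that when the semaphore is full the call simply sorts both halves itself rather than blocking, which avoids the deadlock where every slot is held by a goroutine that is itself waiting for a slot:

```go
package main

import (
	"fmt"
	"runtime"
)

// sem is a buffered channel used as a counting semaphore: at most
// runtime.NumCPU() extra sorting goroutines can hold a slot at once.
var sem = make(chan struct{}, runtime.NumCPU())

func MergeSort(src []int64) {
	if len(src) <= 1 {
		return
	}
	mid := len(src) / 2
	left := append([]int64(nil), src[:mid]...)
	right := append([]int64(nil), src[mid:]...)

	select {
	case sem <- struct{}{}:
		// A slot was free: sort the left half in a new goroutine while
		// this goroutine sorts the right half, then wait for it.
		done := make(chan struct{})
		go func() {
			defer close(done)
			MergeSort(left)
			<-sem // release the slot
		}()
		MergeSort(right)
		<-done
	default:
		// Semaphore full: fall back to sorting sequentially instead of
		// blocking on a slot that may never free up.
		MergeSort(left)
		MergeSort(right)
	}
	merge(src, left, right)
}

// merge writes the elements of the sorted slices left and right into
// dst in ascending order.
func merge(dst, left, right []int64) {
	i, j := 0, 0
	for k := range dst {
		if i < len(left) && (j >= len(right) || left[i] <= right[j]) {
			dst[k] = left[i]
			i++
		} else {
			dst[k] = right[j]
			j++
		}
	}
}

func main() {
	s := []int64{4, 8, 1, 3, 9, 2}
	MergeSort(s)
	fmt.Println(s) // [1 2 3 4 8 9]
}
```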

Running the provided benchmark suite shows a ~2.5x improvement over the built-in `sort.Slice`.

While the parallel implementation is faster than the sequential one, it is still quite bad. After all, the sequential implementation ran on 1 processor while the parallel one ran on 8. Linear speedup is difficult to achieve anyway, but for 8x the processing power we have to do better than a 2.5x speedup.

Also, someone could say we got lucky. We should not rely on intuition for performance, but on benchmarks and the available tools. So in the next post, I use this parallel implementation as a baseline and try to improve it using the `pprof` and `trace` tools that Go provides.