Recently, I stumbled upon the PingCAP talent-plan, which is “a series of training courses about writing distributed systems in Go and Rust”. Having started learning Go recently, I decided to start with the “Distributed systems in Go” course, where the first project is to implement the merge sort algorithm to sort 16 million 64 bit integers.
The project requirements are:
- Pass the provided unit test.
- Make it perform better than sort.Slice().
- Have a document to describe your idea and record the process of performance optimization with pprof.
- Have a good code style.
In this post we will just see two implementations of merge sort, a sequential and a parallel (actually concurrent) one. These implementations will be completely unoptimized. In the next post, we will use the parallel implementation as a baseline for further optimizations.
What is merge sort?
Merge sort is a divide and conquer algorithm that sorts the input array by breaking it into subarrays until they have 1 element, and then merging the results back into a sorted array. The time and space complexities are O(NlogN) and O(n) respectively. We won’t go into details of the asymptotic analysis as there are many great resources about it online.
We can easily implement merge sort using a recursive function. The function splits the array in half, calls itself for each of the subarrays and then merges the resulting arrays in ascending order.
The merge function compares the elements of the two input arrays and writes them in sorted order to the result array (which is the input to
Mergesort). When it finishes traversing one of the two inputs, it appends the remaining elements of the other to the result.
Running the provided test suite verifies our implementation:
> make test
OK: 1 passed
ok pingcap/talentplan/tidb/mergesort 1.288s
We won’t do any profiling in this post but will just write a “parallel” implementation and just run a benchmark to ensure it is faster than the sequential one. We will improve it in the second post with the pprof and trace tools.
An optimization for our solution would be to take advantage of multiple processors and use multiple goroutines to parallelize the execution of some
Merge calls. We could say that, since our task is CPU-bound, we want to limit the max number of goroutines running in parallel to equal the number of available processors. Creating more goroutines may even make our program slower, since the Go scheduler will run more and our program less.
To limit the number of simultaneous goroutines, we can use a buffered channel as a semaphore. Writing to a full buffered channel will block and will block us (:D) from creating extra goroutines we cannot utilize.
Mergesort() will spawn a new goroutine to sort the left subarray but will sort the right subarray itself, while it waits for the other to finish.
Running the provided benchmark suite shows us a ~2.5x improvement over the builtin sort.
> make bench
go test -bench Benchmark -run xx -count 5 -benchmem
BenchmarkMergeSortParallel-8 1 1334344722 ns/op 3489690688 B/op 50331744 allocs/op
BenchmarkMergeSortParallel-8 1 1371397711 ns/op 3489701856 B/op 50331783 allocs/op
BenchmarkMergeSortParallel-8 1 1386620551 ns/op 3489690272 B/op 50331746 allocs/op
BenchmarkMergeSortParallel-8 1 1350215546 ns/op 3489668848 B/op 50331694 allocs/op
BenchmarkMergeSortParallel-8 1 1325400066 ns/op 3489662752 B/op 50331657 allocs/op
BenchmarkNormalSort-8 1 3325640673 ns/op 64 B/op 2 allocs/op
BenchmarkNormalSort-8 1 3318650150 ns/op 64 B/op 2 allocs/op
BenchmarkNormalSort-8 1 3317344986 ns/op 64 B/op 2 allocs/op
BenchmarkNormalSort-8 1 3309457091 ns/op 64 B/op 2 allocs/op
BenchmarkNormalSort-8 1 3303682084 ns/op 64 B/op 2 allocs/op
ok pingcap/talentplan/tidb/mergesort 27.938s
While the parallel implementation is faster than the sequential one, it is still quite bad. After all, the sequential implementation ran one 1 processor while the parallel on 8. It is difficult to achieve linear speedup anyway, but for 8x the processing power we have to do something better than 2.5x speedup.
Also, someone could say we were lucky. We should not rely on intuition for performance but use benchmarks and available tools. So in the next post, I use this parallel implementation as a baseline and try to improve it using the
trace tools that Go provides.