Recently, I stumbled upon the PingCAP talent-plan, which is “a series of training courses about writing distributed systems in Go and Rust”. Having just started learning Go, I decided to begin with the “Distributed systems in Go” course, whose first project is to implement the merge sort algorithm to sort 16 million 64-bit integers.
The project requirements are:
- Pass the provided unit test.
- Make it perform better than sort.Slice().
- Have a document to describe your idea and record the process of performance optimization with pprof.
- Have a good code style.
I cut some corners and didn’t use pprof to find opportunities for optimization. I just went straight into parallelizing it, but using pprof to compare different parallel implementations would make an interesting next post.
What is merge sort?
Merge sort is a divide-and-conquer algorithm that sorts the input array by breaking it into subarrays, sorting each one, and then merging the results back into a sorted array. Its time and space complexities are O(n log n) and O(n), respectively. We won’t go into the details of the asymptotic analysis in this post, as there are many great resources about it online.
We can easily implement merge sort using a recursive function. The function splits the array in half, calls itself for each of the subarrays and then merges the resulting arrays in ascending order.
The merge function compares the elements of the two input arrays and writes them in sorted order to the result array (which is the input to each `MergeSort` call). This overwriting preserves the `MergeSort` signature provided in the project’s skeleton. When it finishes traversing one of the two inputs, it appends the remaining elements of the other to the result.
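A sequential version along these lines might look like the following sketch. The `func MergeSort(src []int64)` signature and the `merge` helper name are assumptions based on the description above, not necessarily the skeleton’s exact code:

```go
package main

import "fmt"

// MergeSort sorts src in ascending order by recursively sorting
// copies of each half and merging them back into src.
func MergeSort(src []int64) {
	if len(src) <= 1 {
		return
	}
	mid := len(src) / 2
	left := make([]int64, mid)
	right := make([]int64, len(src)-mid)
	copy(left, src[:mid])
	copy(right, src[mid:])
	MergeSort(left)
	MergeSort(right)
	merge(left, right, src)
}

// merge writes the two sorted inputs into dst in ascending order,
// overwriting dst to preserve MergeSort's in-place signature.
func merge(left, right, dst []int64) {
	i, j, k := 0, 0, 0
	for i < len(left) && j < len(right) {
		if left[i] <= right[j] {
			dst[k] = left[i]
			i++
		} else {
			dst[k] = right[j]
			j++
		}
		k++
	}
	// One input is exhausted; append the remainder of the other.
	for i < len(left) {
		dst[k] = left[i]
		i++
		k++
	}
	for j < len(right) {
		dst[k] = right[j]
		j++
		k++
	}
}

func main() {
	a := []int64{5, 2, 9, 1, 7}
	MergeSort(a)
	fmt.Println(a) // [1 2 5 7 9]
}
```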
Running the provided test suite verifies our implementation:
> make test
OK: 1 passed
ok pingcap/talentplan/tidb/mergesort 1.288s
An optimization for our solution is to take advantage of multiple processors and use goroutines to parallelize some of the MergeSort() calls. At most as many goroutines can run in parallel as there are available CPUs; creating more puts an unnecessary burden on the Go scheduler, which can even make our parallel implementation slower than the sequential one!
To limit the number of simultaneous goroutines, we can use a buffered channel as a semaphore. Writing to a full buffered channel blocks, which blocks us (:D) from creating extra goroutines we cannot utilize. MergeSort() tries to spawn a new goroutine to sort the left subarray but sorts the right subarray itself while it waits for the other to finish.
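The parallel variant can be sketched as follows. The capacity choice (`runtime.NumCPU()`) and the `select`/`default` fallback are my assumptions: instead of blocking on a full semaphore (which could deadlock in a recursive call tree, since parents wait on children), this sketch falls back to sorting the left half sequentially when no slot is free. The `merge` helper is the same as in the sequential version:

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// sem is a buffered channel used as a counting semaphore: its capacity
// caps how many sorting goroutines can run at once.
var sem = make(chan struct{}, runtime.NumCPU())

// MergeSortParallel sorts src, spawning a goroutine for the left half
// when a semaphore slot is free and always sorting the right half itself.
func MergeSortParallel(src []int64) {
	if len(src) <= 1 {
		return
	}
	mid := len(src) / 2
	left := make([]int64, mid)
	right := make([]int64, len(src)-mid)
	copy(left, src[:mid])
	copy(right, src[mid:])

	var wg sync.WaitGroup
	select {
	case sem <- struct{}{}:
		// A slot was free: sort the left half concurrently.
		wg.Add(1)
		go func() {
			defer wg.Done()
			MergeSortParallel(left)
			<-sem // release the slot
		}()
	default:
		// Semaphore full: sort the left half in this goroutine instead.
		MergeSortParallel(left)
	}

	MergeSortParallel(right) // always sort the right half ourselves
	wg.Wait()                // wait for the left half before merging
	merge(left, right, src)
}

// merge writes the two sorted inputs into dst in ascending order.
func merge(left, right, dst []int64) {
	i, j, k := 0, 0, 0
	for i < len(left) && j < len(right) {
		if left[i] <= right[j] {
			dst[k] = left[i]
			i++
		} else {
			dst[k] = right[j]
			j++
		}
		k++
	}
	for i < len(left) {
		dst[k] = left[i]
		i++
		k++
	}
	for j < len(right) {
		dst[k] = right[j]
		j++
		k++
	}
}

func main() {
	a := []int64{9, 4, 7, 1, 8, 2}
	MergeSortParallel(a)
	fmt.Println(a) // [1 2 4 7 8 9]
}
```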
Running the provided benchmark suite shows a ~2.5x improvement over the built-in `sort.Slice()`.
> make bench
go test -bench Benchmark -run xx -count 5 -benchmem
BenchmarkMergeSortParallel-8 1 1334344722 ns/op 3489690688 B/op 50331744 allocs/op
BenchmarkMergeSortParallel-8 1 1371397711 ns/op 3489701856 B/op 50331783 allocs/op
BenchmarkMergeSortParallel-8 1 1386620551 ns/op 3489690272 B/op 50331746 allocs/op
BenchmarkMergeSortParallel-8 1 1350215546 ns/op 3489668848 B/op 50331694 allocs/op
BenchmarkMergeSortParallel-8 1 1325400066 ns/op 3489662752 B/op 50331657 allocs/op
BenchmarkNormalSort-8 1 3325640673 ns/op 64 B/op 2 allocs/op
BenchmarkNormalSort-8 1 3318650150 ns/op 64 B/op 2 allocs/op
BenchmarkNormalSort-8 1 3317344986 ns/op 64 B/op 2 allocs/op
BenchmarkNormalSort-8 1 3309457091 ns/op 64 B/op 2 allocs/op
BenchmarkNormalSort-8 1 3303682084 ns/op 64 B/op 2 allocs/op
ok pingcap/talentplan/tidb/mergesort 27.938s