Goroutines and Channels Aren’t Free
I used to think the performance overhead of goroutines and channels was basically negligible, especially compared to things like IO, but a recent experience gave me a reality check.
I’ve been working on a toy relational database as the project for a course I’m taking. I started out by loading tables from CSV files, and eventually I needed to add a binary table format. Unfortunately, my initial attempt performed much worse than CSV:
$ ./csv_scan_benchmark -path table.csv
Done scanning 20000263 records after 15.69807191s
$ ./binary_scan_benchmark -path table.bt
Done scanning 20000263 records after 27.01220384s
Scanning the binary table format was almost twice as slow! I found this extremely counterintuitive, since the binary format was more compact and didn't require any string conversion. Fortunately, I was able to investigate the issue using Go's excellent tools for performance profiling.
The CPU profile for scanning a CSV table looked reasonable; most of the time was spent on IO related system calls:
But the CPU profile for scanning a binary table looked extremely *un*reasonable; only a tiny fraction of the time was spent on IO!
It turns out that this ridiculous CPU profile was caused by the way I used Go's concurrency primitives. At the time, I thought decoupling production from consumption using goroutines and channels would be a clean way to structure the scanner code.
Creating a scanner launches a producer goroutine, which performs IO / decoding and writes to a result channel:
func NewBinaryScan(path string) (*binaryScan, error) {
    ...
    go s.producer()
    return s, nil
}

func (s *binaryScan) producer() {
    for {
        record, err := s.readAndDecodeRecord(s.bufferedReader)
        s.resultChan <- &result{record, err}
        if err != nil {
            return
        }
    }
}
The consumer goroutine, which retrieves results by repeatedly calling NextRecord, only has to read from the result channel:
func (s *binaryScan) NextRecord() (Record, error) {
    result := <-s.resultChan
    return result.record, result.err
}
However, the CPU profile showed that with this approach, the goroutines spent a lot of time blocked on channel operations, and the Go runtime wasted a lot of resources on scheduling and concurrency primitives.
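To see that per-item cost in isolation, here's a minimal sketch (separate from the scanner code; the function names are mine) that moves the same items once with a plain loop and once through an unbuffered channel. Every send on the unbuffered channel is a synchronization point between the two goroutines:

```go
package main

import "fmt"

// sumDirect consumes n items with a plain loop: no synchronization at all.
func sumDirect(n int) int {
	total := 0
	for i := 0; i < n; i++ {
		total += i
	}
	return total
}

// sumViaChannel moves the same n items through an unbuffered channel,
// paying for a goroutine hand-off on every single send.
func sumViaChannel(n int) int {
	ch := make(chan int)
	go func() {
		for i := 0; i < n; i++ {
			ch <- i
		}
		close(ch)
	}()
	total := 0
	for v := range ch {
		total += v
	}
	return total
}

func main() {
	// Same answer either way; the channel version just pays a
	// scheduling cost per item.
	fmt.Println(sumDirect(100_000) == sumViaChannel(100_000)) // true
}
```

Timing these two with the testing package's benchmarks makes the gap obvious: the work per item here is trivial, so the channel rendezvous dominates, which is exactly the shape of overhead the profile showed.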
Rewriting the binary table scanner to do everything from the consumer goroutine was very straightforward:
func NewBinaryScan(path string) (*binaryScan, error) {
    ...
    // No more producer.
    return s, nil
}

func (s *binaryScan) NextRecord() (Record, error) {
    return s.readAndDecodeRecord(s.bufferedReader)
}
However, this small change had a dramatic effect on performance. Here’s the updated CPU profile, which looks much more sensible than before:
And here’s the updated benchmark result:
$ ./binary_scan_benchmark -path ratings.bt
Done scanning 20000263 records after 8.160765247s
Almost 2x faster than CSV, and over 3x faster than the initial attempt. Now that’s more like it!
I still think proper use of goroutines and channels can result in clean and well-structured code, and that in most cases they won't turn out to be the root cause of performance issues. But this experience was a good reminder that even Go's awesome concurrency primitives aren't free.
[EDIT]
Thanks to Raghav for suggesting more cases to benchmark!
- Adding a select with a context.Context and a short timeout to the producer: 59.4s
- Adding a select with a context.Context and a short timeout to the consumer: 58.0s
- Using a buffered channel with size 1: 23.5s
- Using a buffered channel with size 1000: 17.4s
- Removing the select on a done channel (the done channel wasn't included in the code snippets above): 19.9s
- Removing the select on a done channel AND using a buffered channel with size 1000: 14.3s
It looks like there's another lesson to be learned here: both channel buffer size and the complexity of select statements can have a big effect on performance.
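For reference, here's a rough reconstruction of what the done-channel variant of the producer looks like (the done channel wasn't shown in the snippets above, so the details are my guess), combined with a buffered result channel:

```go
package main

import "fmt"

type result struct {
	record int
	err    error
}

// produce sends n results, aborting early if done is closed.
// Reconstruction for illustration, not the original scanner code.
func produce(resultChan chan<- *result, done <-chan struct{}, n int) {
	defer close(resultChan)
	for i := 0; i < n; i++ {
		// Every send is now a two-case select instead of a bare send,
		// which the benchmark numbers above show is measurably slower.
		select {
		case resultChan <- &result{record: i}:
		case <-done:
			return
		}
	}
}

func main() {
	// Size 1000 was the fastest buffer size in the benchmarks above.
	resultChan := make(chan *result, 1000)
	done := make(chan struct{})
	go produce(resultChan, done, 5)

	count := 0
	for range resultChan {
		count++
	}
	fmt.Println(count) // 5
}
```

The buffer lets the producer run ahead instead of blocking on every send, while the done case gives the consumer a way to stop the producer early; the benchmark list above suggests that extra select case alone costs a few seconds over 20M records.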
[EDIT 2]
Thanks Stuart Carnie for the suggestion to send batches of records over the channel instead of a single record at a time! Here are the benchmark results I got for various batch sizes:
- 1: 28.83s
- 10: 12.36s
- 100: 8.92s
- 1,000: 8.68s
- 10,000: 8.74s
- 100,000: 9.32s
It turns out that “reducing the number of channel write operations by 3 orders of magnitude”, as Stuart pointed out, also has a significant effect here.
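A sketch of the batched variant looks something like this (the function and type names are mine; batchSize is the knob being benchmarked above):

```go
package main

import "fmt"

type Record struct{ id int }

// produceBatches groups records into slices of up to batchSize before
// sending, cutting the number of channel operations by a factor of batchSize.
func produceBatches(out chan<- []Record, total, batchSize int) {
	defer close(out)
	batch := make([]Record, 0, batchSize)
	for i := 0; i < total; i++ {
		batch = append(batch, Record{id: i})
		if len(batch) == batchSize {
			out <- batch
			batch = make([]Record, 0, batchSize)
		}
	}
	if len(batch) > 0 {
		out <- batch // final partial batch
	}
}

func main() {
	out := make(chan []Record)
	go produceBatches(out, 2500, 1000)

	count := 0
	for batch := range out {
		count += len(batch)
	}
	fmt.Println(count) // 2500
}
```

In the real scanner, NextRecord would presumably hold on to the current batch and an index into it, only touching the channel when the batch runs out, so the per-record fast path involves no synchronization at all.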