Goroutines and Channels Aren’t Free

Elliott Jin
Feb 22, 2018 · 4 min read

I used to think the performance overhead of goroutines and channels was basically negligible, especially compared to things like IO, but a recent experience gave me a reality check.

I’ve been working on a toy relational database as the project for a course I’m taking. I started out by loading tables from CSV files, and eventually I needed to add a binary table format. Unfortunately, my initial attempt performed much worse than CSV:

$ ./csv_scan_benchmark -path table.csv
Done scanning 20000263 records after 15.69807191s
$ ./binary_scan_benchmark -path table.bt
Done scanning 20000263 records after 27.01220384s

Scanning the binary table format was almost twice as slow! I found this extremely counterintuitive, since the binary format was more compact and didn’t require any string conversion. Fortunately, I was able to investigate the issue using Go’s excellent tools for performance profiling.
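(If you haven’t used it before: the standard runtime/pprof package makes this easy. Here’s a minimal sketch of how a benchmark binary might collect a CPU profile; the -cpuprofile flag and the surrounding main are illustrative, not my exact code.)

package main

import (
    "flag"
    "log"
    "os"
    "runtime/pprof"
)

var cpuprofile = flag.String("cpuprofile", "", "write a CPU profile to this file")

func main() {
    flag.Parse()
    if *cpuprofile != "" {
        f, err := os.Create(*cpuprofile)
        if err != nil {
            log.Fatal(err)
        }
        if err := pprof.StartCPUProfile(f); err != nil {
            log.Fatal(err)
        }
        defer pprof.StopCPUProfile()
    }
    // ... run the scan and print the benchmark result ...
}

The resulting profile can then be explored interactively with go tool pprof.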

The CPU profile for scanning a CSV table looked reasonable; most of the time was spent on IO-related system calls:

CPU profile for scanning a CSV table

But the CPU profile for scanning a binary table looked extremely *un*reasonable; only a tiny fraction of the time was spent on IO!

CPU profile for scanning a binary table (initial attempt)

It turns out that this ridiculous CPU profile was caused by the way I used Go’s concurrency primitives. At the time, I thought decoupling production from consumption using goroutines and channels would be a clean way to structure the scanner code.

Creating a scanner launches a producer goroutine, which performs IO / decoding and writes to a result channel:

func NewBinaryScan(path string) (*binaryScan, error) {
    ...
    go s.producer()
    return s, nil
}

func (s *binaryScan) producer() {
    for {
        record, err := s.readAndDecodeRecord(s.bufferedReader)
        s.resultChan <- &result{record, err}
        if err != nil {
            return
        }
    }
}

The consumer goroutine, which retrieves results by repeatedly calling NextRecord, only has to read from the result channel:

func (s *binaryScan) NextRecord() (Record, error) {
    result := <-s.resultChan
    return result.record, result.err
}
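For completeness, here’s a minimal sketch of how a caller drives the scanner (this isn’t the exact benchmark code, and it assumes readAndDecodeRecord surfaces io.EOF once the table is exhausted):

// Consumer loop; uses the standard io, log, and fmt packages.
scan, err := NewBinaryScan(path)
if err != nil {
    log.Fatal(err)
}
count := 0
for {
    if _, err := scan.NextRecord(); err == io.EOF {
        break
    } else if err != nil {
        log.Fatal(err)
    }
    count++
}
fmt.Printf("Done scanning %d records\n", count)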

However, the CPU profile showed that with this approach, goroutines spent a lot of time blocking on channel operations, and the Go runtime wasted a lot of resources on scheduling / concurrency primitives.

Rewriting the binary table scanner to do everything from the consumer goroutine was very straightforward:

func NewBinaryScan(path string) (*binaryScan, error) {
    ...
    // No more producer.
    return s, nil
}

func (s *binaryScan) NextRecord() (Record, error) {
    return s.readAndDecodeRecord(s.bufferedReader)
}

Small as it was, this change had a dramatic effect on performance. Here’s the updated CPU profile, which looks much more sensible than before:

CPU profile for scanning a binary table (optimized)

And here’s the updated benchmark result:

$ ./binary_scan_benchmark -path table.bt
Done scanning 20000263 records after 8.160765247s

Almost 2x faster than CSV, and over 3x faster than the initial attempt. Now that’s more like it!

I still think proper use of goroutines and channels can result in clean and well-structured code, and that in most cases, they won’t turn out to be the root cause of performance issues. But this experience was a good reminder that even Go’s awesome concurrency primitives aren’t free.

[EDIT]

Thanks to Raghav for suggesting more cases to benchmark!

  • Adding a select with a context.Context and a short timeout to the producer: 59.4s
  • Adding a select with a context.Context and a short timeout to the consumer: 58.0s
  • Using a buffered channel with size 1: 23.5s
  • Using a buffered channel with size 1000: 17.4s
  • Removing the select on a done channel (the done channel wasn’t included in the code snippets above): 19.9s
  • Removing the select on a done channel AND using a buffered channel with size 1000: 14.3s

It looks like there’s another lesson to be learned here: channel buffer size and complexity of select statements can both have a big effect on performance.
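For anyone curious what those variants look like, here’s a rough sketch of the producer with a buffered result channel and a select on a done channel; the buffer size and field names are illustrative rather than my exact code:

func NewBinaryScan(path string) (*binaryScan, error) {
    ...
    // A buffered channel lets the producer run ahead of the consumer
    // instead of blocking on every single send.
    s.resultChan = make(chan *result, 1000)
    s.doneChan = make(chan struct{})
    go s.producer()
    return s, nil
}

func (s *binaryScan) producer() {
    for {
        record, err := s.readAndDecodeRecord(s.bufferedReader)
        select {
        case s.resultChan <- &result{record, err}:
            if err != nil {
                return
            }
        case <-s.doneChan:
            // The consumer closed the scanner; stop producing.
            return
        }
    }
}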

[EDIT 2]

Thanks Stuart Carnie for the suggestion to send batches of records over the channel instead of a single record at a time! Here are the benchmark results I got for various batch sizes:

  • 1: 28.83s
  • 10: 12.36s
  • 100: 8.92s
  • 1,000: 8.68s
  • 10,000: 8.74s
  • 100,000: 9.32s

It turns out that “reducing the number of channel write operations by 3 orders of magnitude”, as Stuart pointed out, also has a significant effect here.
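As a rough sketch, the producer side of the batching variant looks something like this; here the result struct carries a slice of records instead of a single one, and batchSize and the field names are illustrative:

func (s *binaryScan) producer() {
    batch := make([]Record, 0, s.batchSize)
    for {
        record, err := s.readAndDecodeRecord(s.bufferedReader)
        if err != nil {
            // Flush whatever we have along with the error, then stop.
            s.resultChan <- &result{batch, err}
            return
        }
        batch = append(batch, record)
        if len(batch) == s.batchSize {
            // One channel send per batch instead of one per record.
            s.resultChan <- &result{batch, nil}
            batch = make([]Record, 0, s.batchSize)
        }
    }
}

NextRecord then hands out records from the current batch and only reads from the channel when the batch runs out, which is what cuts the number of channel operations by orders of magnitude.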
