Debugging Golang concurrency issues

Or how the (awesome) race detector saved my ass

Recently I’ve been working on a Golang application that uses Google Cloud Pub/Sub to migrate large amounts of Google Cloud Datastore entities efficiently and resiliently. Basically, a master process queries entities and publishes them to a Pub/Sub topic, ready to be consumed by multiple workers.

To make sure the Pub/Sub topic and subscription are cleaned up afterwards, I made sure the master waits until all entities are processed before cleaning up and exiting. The following snippet contains the code I used to do just that.

Initial code

This all worked great, the master process ran for as long as it took the workers to process all entities, after which it exited with everything cleaned up nicely. Or so I thought.

While writing my integration tests that run the entire process against the Datastore and Pub/Sub emulators I noticed that sometimes not all entities were processed correctly. Me being suspicious, I tried running the tests using the -race flag, which gave me the following output (I removed the file paths for brevity and possible NDA-ness):

❯ go test -v -race ./...[...]WARNING: DATA RACE
Write at 0x00c420160d40 by goroutine 189:
app/migration.(*master).waitUntilQueueEmpty.func2()
app/vendor/cloud.google.com/go/pubsub.(*Subscription).receive.func2()
Previous read at 0x00c420160d40 by goroutine 184:
app/migration.(*master).waitUntilQueueEmpty.func1()
Goroutine 189 (running) created at:
app/vendor/cloud.google.com/go/pubsub.(*Subscription).receive()
app/vendor/cloud.google.com/go/pubsub.(*Subscription).Receive.func2()
app/vendor/golang.org/x/sync/errgroup.(*Group).Go.func1()
Goroutine 184 (running) created at:
app/migration.(*master).waitUntilQueueEmpty()
app/migration.(*master).Run()
==================
--- FAIL: TestMigration (20.56s)
testing.go:699: race detected during execution of test
FAIL
FAIL app/test 20.607s

Looking at the code I realised that I made the exact same mistake they showed in Golang’s own blog post on race detection: I was reading and writing the lastMessageTimestamp in two different goroutines. Luckily for me, the fix used in the blog post was just as usable for me. Using the code below the race detector was happy and my random failures disappeared.

Revised code

So if you haven’t already, just try running your app or tests using the -race flag. Who knows, you might find that nasty bug you’ve been debugging for days!

Senior Software Engineer at Topic Embedded Systems, currently positioned at ASML