Goroutines, Closures, and You

Greg Poirier
The Opsee Blog · Mar 30, 2016

The Opsee bastion is built to run many checks against external services simultaneously. It measures time-to-first-byte and response latency, and it evaluates assertions against the bodies and headers of the responses it gets back--all while guaranteeing that each check executes in a timely manner, cancelling checks that exceed a timeout, and notifying us via metrics when that happens. It's a thing I helped build and architect, and I'm really quite proud of it.

There are a number of components to check execution, but the two most important are the Checker and Runner:

  • Checker - Responsible for maintaining check configuration and scheduling checks. It lives in a separate network namespace that is connected to Opsee's network, and it communicates with the rest of the components via NSQ.
  • Runner - Manages pools of check execution workers (one pool per check type). The Runner listens to NSQ and dispatches checks to the first available worker. This component communicates with the customer's network via the regular network path through docker0 to eth0.

In the Runner, there is a tiny component called the Dispatcher. Its job is to listen to NSQ and then wait for a free Worker. If no Worker is available within 30 seconds, the check's execution context is cancelled. We use Go's golang.org/x/net/context package for check execution contexts. In particular, context.WithDeadline() and context.WithValue() are helpful for contextual execution of code in general--not just in the context of network requests.
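Here's a minimal sketch of how a deadline like that can be wired up with those two functions. The dispatchCheck helper and the "check_id" key are hypothetical stand-ins for illustration, not our actual code:

package main

import (
    "log"
    "time"

    "golang.org/x/net/context"
)

// dispatchCheck is a hypothetical stand-in for handing a check to a Worker.
// It gives up as soon as the context's deadline passes.
func dispatchCheck(ctx context.Context, checkID string) error {
    select {
    case <-time.After(45 * time.Second): // simulate no Worker freeing up in time
        return nil
    case <-ctx.Done():
        return ctx.Err() // context.DeadlineExceeded once the 30 seconds are up
    }
}

func main() {
    // Give the dispatch 30 seconds, mirroring the Dispatcher's Worker timeout.
    ctx, cancel := context.WithDeadline(context.Background(), time.Now().Add(30*time.Second))
    defer cancel()

    // Attach request-scoped data; the "check_id" key is purely illustrative.
    ctx = context.WithValue(ctx, "check_id", "check-123")

    if err := dispatchCheck(ctx, "check-123"); err != nil {
        log.Printf("dispatch failed: %v", err)
    }
}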

In the Dispatcher is the following seemingly innocuous code (annotated for your convenience):

// A Task is an object that encapsulates our Request and Response interface objects.
// A TaskGroup is basically []*Task with a few helper methods.
func (dispatcher *Dispatcher) Dispatch(ctx context.Context, tg *TaskGroup) <-chan *Task {
    finished := make(chan *Task, len(*tg))
    defer close(finished)
    wg := &sync.WaitGroup{}

    for _, task := range *tg {
        select {
        // If the context is cancelled, we do not perform the task
        // and note it in the task's response.
        case <-ctx.Done():
            task.Response = Response{Error: ctx.Err()}
            finished <- task
        // If we get a worker, we wait for it to finish.
        // The worker also adheres to the context's deadline, etc.,
        // so it will return almost immediately if the context is
        // cancelled for any reason.
        case w := <-dispatcher.WorkerFor(task):
            wg.Add(1)
            go func() {
                finished <- w.Do(ctx, task)
                wg.Done()
            }()
        }
    }

    wg.Wait()
    return finished
}

I think this is some fairly idiomatic Go. This is the latest version of this function, and it's quite performant. With 10 workers of 2 types, the Runner has a total of 43 Goroutines for a Bastion executing 25 checks (NSQ clients, main Goroutine, workers, concurrent tasks, etc.). Given that our previous iteration regularly maintained something like 300 Goroutines, I'd say we've come far. In general, a task's execution is entirely bound by network latency to the service. Good job us. Except, there's a problem. You see it, right? OR DO YOU? Because I didn't.

case w := <-dispatcher.WorkerFor(task):
    wg.Add(1)
    go func() {
        finished <- w.Do(ctx, task)
        wg.Done()
    }()

This is just bad Go, and it's all my fault. In a feverish haze over the weekend I completely rewrote the check execution pipeline, and introduced this bug. I'd blame it on the fever, but it wasn't really the fever's fault. After deploying this code to a couple of bastions, I quickly got a bug report from a user.

"Greg. We're seeing something really weird. This service has three instances behind the ELB, and in the UI we're seeing three responses, but they're all from the same instance."

Okay, so I basically started freaking out. We're two days from public beta launch, and there's basically a showstopper bug, it's my fault, self-loathing ensues. This is actually where things break down too, because when you introduce a bug like this, it's really easy to spiral into an unwinding mess of emotions. Especially when you're prone to depression like I am.

So I stared at this bug for days.

In the golang wiki on GitHub there is an article titled "CommonMistakes".

https://github.com/golang/go/wiki/CommonMistakes

There is only one mistake listed there, and I just made it. It is most definitely not the only common mistake you can make in Go, but it is the only one the wiki lists--and I still made it. I may look like an idiot because of this post, but I don't particularly care how I appear, because this can and will happen to you. This post is my hair shirt. It is penance for making such an egregious mistake.

Introducing a bug like this after a year of working in Go makes me wonder whether it's ever prudent to rely on external bindings in goroutines. By an external binding, I mean the binding of a variable to a value in a scope external to the goroutine:

val := "val"
go func() { fmt.Println(val) }()

I've never really thought about it before now. Relying on an external binding makes perfect sense in most code, but in Goroutines, not so much. Most often it makes sense for a variable to be visible inside a nested scope; we do it all the time without really thinking about it. For example:

val := "val"
if val == "val" {
fmt.Println(val)
}

There's no danger that val will change between the assignment and the execution of fmt.Println(val). But when you're dealing with Goroutines, there is absolutely no guarantee when that statement will be executed. And that's fine, but it's extremely important to understand exactly what is happening with any variables passed into the scope of the Goroutine. Consider the following trivial example:

package main

import (
    "fmt"
    "sync"
    "sync/atomic"
)

func main() {
    var equal, notEqual int64
    wg := &sync.WaitGroup{}
    val := "val"
    for i := 0; i < 60000; i++ {
        val = "val"
        wg.Add(1)
        go func() {
            defer wg.Done()
            // val is captured by reference: by the time this runs, the loop
            // may already have reassigned it. That race is the whole point.
            if val == "val" {
                atomic.AddInt64(&equal, 1)
            } else {
                atomic.AddInt64(&notEqual, 1)
            }
        }()
        val = "not val"
    }
    wg.Wait()
    fmt.Println("Val was val! ", equal)
    fmt.Println("Val was not val. :( ", notEqual)
}

Go ahead. Run it a few times. The results are hilarious. For even more exciting results add a time.Sleep(1 * time.Nanosecond) at the top of the Goroutine.
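If you try that, the Goroutine body becomes something like this (add time to the imports):

go func() {
    defer wg.Done()
    // Yielding here makes it even more likely that the loop has already
    // reassigned val by the time we read it.
    time.Sleep(1 * time.Nanosecond)
    if val == "val" {
        atomic.AddInt64(&equal, 1)
    } else {
        atomic.AddInt64(&notEqual, 1)
    }
}()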

Now, the thing that makes this kind of bug so insidious is that it's not always apparent that your bound variable is changing. In fact, the example in the "CommonMistakes" article isn't clear if you're not really thinking about what's happening--or familiar with exactly how loops in Go work. First, make sure you read the Go specification for variable declarations (and, in particular, short variable declarations). The most important sentence:

"Redeclaration does not introduce a new variable; it just assigns a new value to the original."

In essence this means that there is no difference between:

for i := 0; i < 1; i++ {
    something
}

And:

i := 0
something
i = 1

The problem here is that because of the way language syntax works, we see := and think immediately about initialization of a new variable. In other words, if we're not very careful, we will think about loops incorrectly. This is why the for-loop-goroutine bug is so common, I think.
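For reference, the pattern the wiki warns about boils down to this (a paraphrase of its example, not a copy):

package main

import (
    "fmt"
    "sync"
)

func main() {
    wg := &sync.WaitGroup{}
    for _, val := range []string{"a", "b", "c"} {
        wg.Add(1)
        // Buggy: every Goroutine reads the same val variable, which the
        // loop keeps reassigning, so most (often all) print "c".
        go func() {
            defer wg.Done()
            fmt.Println(val)
        }()
    }
    wg.Wait()
}

The fix is exactly the one I ended up applying, shown next.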

So I changed the Goroutines spawned by the Dispatcher to take arguments.

go func(worker Worker, task *Task) {
    finished <- worker.Do(ctx, task)
    wg.Done()
}(w, task)

This introduces the new Goroutine-scoped variables worker and task, and assigns to them the current values of w and task. Easiest fix ever, but it took two engineers' time to find it. Mark, another engineer at Opsee, and I sat in a Google Hangout for thirty minutes staring at this — and, most importantly, spending time calming me down. We stepped through all of the code, and then it hit Mark.

The value bound to task was being reassigned on every iteration of the loop, and the value bound to w was also problematic in a totally different way. And that's the thing that's so tricky about this mistake — it's non-deterministic code, our old friend.

Your code won't be trivial like the examples. It might be something like this:

val := "something"
go func() { doSomethingWithVal(val) }()
// 9 billion lines of code later
val = "LOL F U BUDDY"

It will be difficult to reason about these things. Get into the habit of passing values into Goroutines. Make sure your tests cover _everything_ about the behavior you've specified. Really think about the code that you're writing. I don't know. There are a million things to take away from a mistake like this, but those are the ones I'm mostly focusing on.
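Applied to the snippet above, that habit looks like this:

val := "something"
// The Goroutine gets its own copy of the current value of val.
go func(v string) { doSomethingWithVal(v) }(val)
// 9 billion lines of code later
val = "LOL F U BUDDY" // no longer affects the Goroutine above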

Update — Only hours later…

Many thanks to Kevin Burke for this. Data race detection would have pointed this bug out. Because of reasons, I was under the impression that gb did not support arbitrary flags passed to go test (it is a strange CLI — and one we are actively replacing with govendor/GOPATH). So now our builds include race and deadlock detection.
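With the plain Go toolchain, that just means passing -race when you build or test:

go test -race ./...
go build -race

Run the 60,000-iteration example from earlier under the race detector and it should flag the racy read of val right away.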
