Dm03514 Tech Blog

Golang: Concurrency is Hard; So What Can We Do About It?

Concurrency In Go

The Go language provides absolutely amazing concurrency primitives and truly makes concurrency a first-class citizen. Unfortunately, ensuring concurrent correctness requires combining many different techniques in order to minimize the chances of concurrency-related errors. Most of these techniques are not automated (enforced by the compiler) and are largely dependent on experience. I have found that it’s often difficult to reason about common concurrency errors without having encountered them at some point. In many of the organizations I’ve been a part of, this leads to a greater likelihood that less experienced engineers introduce concurrency errors.

This post will walk through what race conditions are and why concurrent programming is so hard. It will then survey the current solutions for detecting and preventing race conditions in Go. Finally, a manual analysis technique called Candidates and Contexts, which I have found effective in identifying race conditions, will be covered. (All code examples can be found in the grokking-go github repo.)

Dangers Of Concurrency

Concurrency is when two or more operations are making progress at the same time. Contrast this with synchronous execution, where a program is executed one step after another and nothing else happens in the runtime except the current operation. Concurrent operations are non-deterministic and are therefore unpredictable and extremely hard to reason about.

The difficulty is that in concurrent programming a unit of work can be preempted, creating a huge number of potential execution orderings. This non-determinism means that shared memory can be acted upon by different threads of work in unexpected ways. Shared memory which isn’t protected, and explicitly made safe to access, can result in race conditions; this results in unsafe code. In contrast, thread-safe code is code that’s correct and safe to use in a concurrent environment.

Since concurrency is non-deterministic it can result in extremely insidious bugs. Code may look OK and pass tests, but then under high concurrency, or along certain code paths, it panics or exhibits subtly broken behavior.

This article focuses on a single class of concurrent bugs: Race conditions.

Race Conditions

A race condition occurs when two threads access the same memory at the same time and at least one of them is writing. Race conditions are caused by unsynchronized access to shared memory.

Explicit Unsynchronized Memory Access

The following code illustrates an http handler with a race condition:

reqCount := Counter{}

http.Handle("/", http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
    value := reqCount.Value()
    fmt.Printf("handling request: %d\n", value)
    time.Sleep(1 * time.Nanosecond)
    reqCount.Set(value + 1)
    fmt.Fprintln(w, "Hello, client")
}))
log.Fatal(http.ListenAndServe(":8080", nil))

The race condition becomes apparent when load is applied (source). The test below makes 200 requests using 200 goroutines. The expectation is that the counter is incremented once per request, but the results are way off:

$ go test -run TestExplicitRace ./races/ -v -total-requests=200 -concurrent-requests=200
...
handling request: 9
handling request: 9
handling request: 9
handling request: 10
handling request: 11
handling request: 12
handling request: 13
handling request: 14
handling request: 15
handling request: 16
handling request: 17
handling request: 18
handling request: 19
handling request: 20
handling request: 21
handling request: 22
handling request: 23
handling request: 24
handling request: 25
handling request: 26
handling request: 26
handling request: 26
handling request: 26
Num Requests TO Make: 200
Final Count: 27
--- FAIL: TestExplicitRace (0.08s)
explict_test.go:72: expected 200 requests: received 27
FAIL
FAIL github.com/dm03514/grokking-go/candidates-and-contexts/races 0.083s

An interleaving of execution steps creates an incorrect result.

The chart above illustrates the issue; it is read from top to bottom. The counter starts at 0. Multiple http handler goroutines read the current value through the Value() function and save it locally:

value := reqCount.Value()

They then do some work simulated by:

time.Sleep(1 * time.Nanosecond)

After which they increment the counter saying that they completed:

reqCount.Set(value + 1)

The issue is that multiple goroutines are overwriting the same value! In the case of the chart, both write the value 1 when the result should actually be 2!

In the case of an explicit race we can have two concurrent goroutines operating on the same memory at the exact same time, one writing and one reading:

This results in undefined behavior. The worst part is that even though there is a race condition, there are no explicit errors; the program is just incorrect. Because concurrency is so incredibly difficult, many code bases I have worked on have these sorts of bugs everywhere (and I have contributed more than my fair share).

Logical Race Conditions

Even if access to the reqCount Counter from the above example were thread safe, there is still the issue of a logical race condition. The following test (source) is executed using a fully synchronized, thread-safe counter (covered later), yet the results are still incorrect:

$ go test -run TestLogicalRace ./races/ -v -total-requests=200 -concurrent-requests=200
...
handling request: 25
handling request: 25
handling request: 25
handling request: 25
handling request: 25
handling request: 25
handling request: 25
handling request: 25
handling request: 25
handling request: 25
handling request: 20
handling request: 25
handling request: 26
handling request: 27
Num Requests TO Make: 200
Final Count: 26
--- FAIL: TestLogicalRace (0.12s)
logical_test.go:67: expected 200 requests: received 26
FAIL
FAIL github.com/dm03514/grokking-go/candidates-and-contexts/races 0.123s

While reads and writes no longer take place at the same time, the application is still acting on stale data (in the same way as the explicit race condition above). Concurrent Value() calls can return the same value and result in concurrent threads setting the same value, which is logically incorrect: each http handler invocation should add +1 to the counter, i.e. counter operations should be serialized across all threads.

Logical Race Conditions are an entirely different class of problem, and this article will only be focusing on the explicit race condition case.

Solutions

Now that we’ve covered what race conditions are, the following are some of the tools available to Go engineers to mitigate them:

Race Detector

Go’s race detector instruments memory accesses in order to determine if memory is ever being acted on concurrently. The go test framework exposes the race detector through the -race flag. The very first line of the official race detector documentation states: “Race conditions are among the most insidious and elusive programming errors.”

While the race detector is an extremely useful tool, it is passive: a race condition must actually occur while it is running. This puts the burden on the engineer to identify which routines would benefit from the race detector, write a test using concurrency, and then execute the test with race detection enabled. Even with a test that applies a high level of concurrency, there is still no guarantee that a given invocation will produce overlapping reads and writes. This shouldn’t discount the race detector: it can, and will, detect races most of the time and is still an amazingly powerful tool.

Unfortunately, the race detector isn’t about prevention, it’s about detection.

Let’s use the race detector against our first example to see how it works. Below, the explicit race condition test (source here) is executed using the -race flag.

$ go test -run TestExplicitRace ./races/ -v -total-requests=200 -concurrent-requests=200 -race
=== RUN   TestExplicitRace
handling request: 0
handling request: 0
handling request: 0
handling request: 0
==================
WARNING: DATA RACE
Write at 0x00c4200164e8 by goroutine 326:
github.com/dm03514/grokking-go/candidates-and-contexts/races.TestExplicitRace.func1.1()
/vagrant_data/go/src/github.com/dm03514/grokking-go/candidates-and-contexts/races/counters.go:18 +0x115
net/http.HandlerFunc.ServeHTTP()
/usr/local/go/src/net/http/server.go:1947 +0x51
net/http.(*ServeMux).ServeHTTP()
/usr/local/go/src/net/http/server.go:2340 +0x9f
net/http.serverHandler.ServeHTTP()
/usr/local/go/src/net/http/server.go:2697 +0xb9
net/http.(*conn).serve()
/usr/local/go/src/net/http/server.go:1830 +0x7dc
Previous read at 0x00c4200164e8 by goroutine 426:
github.com/dm03514/grokking-go/candidates-and-contexts/races.TestExplicitRace.func1.1()
/vagrant_data/go/src/github.com/dm03514/grokking-go/candidates-and-contexts/races/counters.go:14 +0x5b
net/http.HandlerFunc.ServeHTTP()
/usr/local/go/src/net/http/server.go:1947 +0x51
net/http.(*ServeMux).ServeHTTP()
/usr/local/go/src/net/http/server.go:2340 +0x9f
net/http.serverHandler.ServeHTTP()
/usr/local/go/src/net/http/server.go:2697 +0xb9
net/http.(*conn).serve()
/usr/local/go/src/net/http/server.go:1830 +0x7dc
Goroutine 326 (running) created at:
net/http.(*Server).Serve()
/usr/local/go/src/net/http/server.go:2798 +0x364
net/http.(*Server).ListenAndServe()
/usr/local/go/src/net/http/server.go:2714 +0xc4
net/http.ListenAndServe()
/usr/local/go/src/net/http/server.go:2972 +0xf6
github.com/dm03514/grokking-go/candidates-and-contexts/races.TestExplicitRace.func1()
/vagrant_data/go/src/github.com/dm03514/grokking-go/candidates-and-contexts/races/explict_test.go:36 +0xd9
Goroutine 426 (running) created at:
net/http.(*Server).Serve()
/usr/local/go/src/net/http/server.go:2798 +0x364
net/http.(*Server).ListenAndServe()
/usr/local/go/src/net/http/server.go:2714 +0xc4
net/http.ListenAndServe()
/usr/local/go/src/net/http/server.go:2972 +0xf6
github.com/dm03514/grokking-go/candidates-and-contexts/races.TestExplicitRace.func1()
/vagrant_data/go/src/github.com/dm03514/grokking-go/candidates-and-contexts/races/explict_test.go:36 +0xd9
==================

Awesome! Very early in our test the Go race detector recognizes a race and alerts us.

WARNING: DATA RACE
Write at 0x00c4200164e8 by goroutine 326:
github.com/dm03514/grokking-go/candidates-and-contexts/races.TestExplicitRace.func1.1()
/vagrant_data/go/src/github.com/dm03514/grokking-go/candidates-and-contexts/races/counters.go:18 +0x115
...
Previous read at 0x00c4200164e8 by goroutine 426:
github.com/dm03514/grokking-go/candidates-and-contexts/races.TestExplicitRace.func1.1()
/vagrant_data/go/src/github.com/dm03514/grokking-go/candidates-and-contexts/races/counters.go:14 +0x5b

These lines indicate that there is a concurrent read and a concurrent write:

package races

import "sync"

type Counter struct {
    count int
}

func (c *Counter) Value() int {
    return c.count // line 14
}

func (c *Counter) Set(v int) {
    c.count = v // line 18
}

There’s lots written about the race detector. Its downsides are the resource overhead it incurs and the need to write focused concurrent tests in order to gain the benefit. Another approach is to canary a release, routing a small subset of traffic to a build with -race enabled.

Explicit Synchronization

Explicit synchronization is where variable accesses are protected through synchronization primitives such as a mutex. Explicit synchronization puts the burden on the engineer to recognize candidates for concurrent execution and the contexts in which they’ll be executed, and then requires that the engineer know to write the locking code that actually synchronizes access.

This is tricky because a variable increment is safe synchronously but unsafe concurrently. This is where the dependency on experience mentioned above comes into play. Explicitly synchronizing memory access requires predicting and identifying all contexts in which a piece of code will be executed. Since we know our handler is executed concurrently by Go’s net/http library, we can add explicit synchronization to the counter and provide future developers with a thread-safety guarantee:

type SynchronizedCounter struct {
    mu    sync.Mutex // a value, not a pointer: the zero value is ready to use
    count int
}

func (c *SynchronizedCounter) Inc() {
    c.mu.Lock()
    defer c.mu.Unlock()

    c.count++
}

func (c *SynchronizedCounter) Value() int {
    c.mu.Lock()
    defer c.mu.Unlock()

    return c.count
}

func (c *SynchronizedCounter) Set(v int) {
    c.mu.Lock()
    defer c.mu.Unlock()

    c.count = v
}

We can now offer the assurance that our counter is thread safe: all state modifications are protected by a mutex and are serialized.

Static Analysis (go vet)

Static analysis (specifically mutex misuse detection) helps with misuse of mutexes and is another supportive, reactive form of detection. It doesn’t directly detect when a variable needs a mutex, only when a mutex isn’t being used correctly. It catches the case where the engineer recognized that a mutex was necessary and added one, but accidentally misused it by passing a copy instead of a reference:

type MisSynchronizedCounter struct {
    mu    sync.Mutex
    count int
}

func (c MisSynchronizedCounter) Inc() {
    c.mu.Lock()
    defer c.mu.Unlock()

    c.count++
}

Vet recognizes that a copy of the lock is being passed:

$ go vet -copylocks ./races/counters.go
# command-line-arguments
races/counters.go:52: Inc passes lock by value: races.MisSynchronizedCounter contains sync.Mutex

vet is an important tool to have and is trivial to add to any build process, but it is not sufficient for identifying or detecting race conditions on its own.

Design Based

Design-based correctness leverages Go’s safe primitives and design-pattern best practices in order to minimize the likelihood of race conditions. Two common patterns are:

  • Monitor goroutines
  • Worker pools

Both of these leverage Go channels. The design-based approach embodies the Go mantra: “Share Memory By Communicating”. The following shows what happens when the counter is refactored to be encapsulated behind a monitor goroutine (source here):

countChan := make(chan struct{})
go func() {
    for range countChan {
        reqCount.Inc()
        fmt.Printf("handling request: %d\n", reqCount.Value())
    }
}()

This is awesome because it’s a sort of hybrid approach: it allows for scheduling concurrent operations, but the monitor goroutine is the only thing accessing reqCount, meaning reqCount doesn’t need to be synchronized (other than by the main test thread, which accesses it for an assertion after countChan is closed). As we can see, the program behaves as expected and the race is removed (source here):

$ go test -run TestDesignNoRace ./races/ -v -total-requests=200 -concurrent-requests=200 -race
handling request: 1
...
handling request: 190
handling request: 191
handling request: 192
handling request: 193
handling request: 194
handling request: 195
handling request: 196
handling request: 197
handling request: 198
handling request: 199
handling request: 200
Num Requests TO Make: 200
Final Count: 200
--- PASS: TestDesignNoRace (0.64s)
PASS
ok github.com/dm03514/grokking-go/candidates-and-contexts/races 1.691s

This is an extremely powerful pattern and can be extended to create worker pools. Imagine that instead of counting we were inserting data into a database. We could spawn a number of goroutines to bound the maximum number of inserts occurring at any one time (e.g. 10) and have all of them share the same channel. Each goroutine could own resources it doesn’t share with any other goroutine. This allows each individual goroutine to be its own little universe that requires no synchronization: each goroutine is its own synchronous little world, scheduled externally to run concurrently.
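The worker-pool idea described above can be sketched as follows. The job counts and the worker-local processed counter are illustrative stand-ins for real DB inserts:

```go
package main

import (
	"fmt"
	"sync"
)

// countJobs fans totalJobs out to numWorkers goroutines. Each worker
// owns its private state (a local counter) and shares nothing with its
// siblings except the input channel, so no locking is needed.
func countJobs(totalJobs, numWorkers int) int {
	jobs := make(chan int)
	results := make(chan int, numWorkers)

	var wg sync.WaitGroup
	for w := 0; w < numWorkers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			processed := 0 // worker-local: no synchronization required
			for range jobs {
				processed++ // stand-in for e.g. a DB insert
			}
			results <- processed
		}()
	}

	for i := 0; i < totalJobs; i++ {
		jobs <- i
	}
	close(jobs) // signals workers to drain and exit
	wg.Wait()
	close(results)

	total := 0
	for n := range results {
		total += n
	}
	return total
}

func main() {
	fmt.Println(countJobs(200, 10)) // 200: every job handled exactly once
}
```

The pool size (numWorkers) is the concurrency bound: no matter how many jobs arrive, at most that many are in flight at once.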

Analysis

Analysis is required when using 3rd-party or external components in a concurrent context. Go convention is to assume everything is thread-unsafe unless an explicit guarantee is provided.

The common candidates for this sort of analysis are:

  • DB connections
  • TCP connections
  • 3rd Party SDK/Drivers
  • Anything 3rd party that is shared between goroutines!!

Suppose our application needs to write to a database. We have db initialization and would like a pool of workers that read and write data to the db concurrently through db.QueryContext. Is it safe to pass around the db instance?

The first step is to check the documentation, which in this case says:

DB is a database handle representing a pool of zero or more underlying connections. It’s safe for concurrent use by multiple goroutines.

I personally consider this strong enough assurance, because it has yet to bite me. For large or core projects (Cassandra, aws-sdk-go, the Go standard library) thread-safety guarantees are usually published.

For smaller projects, guarantees are sometimes NOT published, which requires auditing the code for concurrent correctness.

No Concurrency

If concurrency can be avoided, this whole class of extremely difficult errors disappears. This approach has gotten many runtimes very far; take Python or Ruby as examples. In a standard pre-fork web server deployment, a pool of Python/Ruby processes is initialized; each process takes a connection, performs work synchronously, and returns a result. Concurrency is achieved out of band by using a reverse proxy such as uWSGI or nginx. This allows for extremely simple local and dev execution environments and offloads concurrency to an out-of-band process.

This may only work for certain classes of problems, as one of the main reasons to use Go is its performance. Because of how amazing the runtime is, an http server on a standard 4-core machine can efficiently handle an impressive amount of load and active connections.

While this seems like sliding backwards, if ruling out concurrency is at all possible there are extremely compelling reasons to do so. This pattern is closely related to the design pattern outlined above: each individual worker function is its own little world and has no knowledge of how it will be scheduled. Because it receives input through a channel (a concurrency primitive) and shares nothing with any other worker, it is concurrency safe.

As the above shows, there is no objective, active solution. The ideal would be a compiler that could magically detect when and where code is being executed concurrently and alert on issues with high fidelity.

Since the compiler can’t do this for us yet, the following outlines a strategy that I regularly employ to help fill the gap:

Candidates and Contexts (C&C)

This is an application-level, manual analysis approach that I’ve been working on to try to fill some of the gap left by the lack of an active check in the compiler. It’s based on identifying candidate statements: statements at high risk of causing concurrency errors. It then checks whether any of those statements are executed within a concurrent context. C&C is about identification, since identification is a precursor to detection. I have found it to be a great tool for determining which routines should have concurrency-exercising tests run under Go’s race detector.

A race condition requires two things in order to occur:

  • shared memory (candidate) and
  • concurrent access (context)

We can visualize this relationship as follows:

The thought here is that if memory is not shared and/or not executed concurrently, there is no potential for races. Likewise, if memory is shared (like a standard global variable) but executed synchronously, there won’t be race conditions either (although globals have plenty of design and understandability issues). Additionally, if there is concurrent execution but no shared state, there are also no race conditions (this is the monitor pattern listed above, or the Python/Ruby pattern). Race conditions can occur only when there are shared mutable state and concurrent actors operating on that state. Candidates and Contexts is based on identifying the shared state being concurrently operated on.
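The “concurrent execution, no shared memory” quadrant can be seen in a few lines of code. Each goroutine below touches only local variables and communicates its result over a channel, so it passes -race without any locks. This sketch is mine, not from the original post:

```go
package main

import (
	"fmt"
	"sync"
)

// localSums launches n goroutines; each computes a sum using only
// goroutine-local variables and reports it over a channel.
func localSums(n int) int {
	results := make(chan int, n)
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			sum := 0 // local state: concurrent execution, no shared memory
			for j := 0; j <= id; j++ {
				sum += j
			}
			results <- sum // communicate instead of sharing
		}(i)
	}
	wg.Wait()
	close(results)
	total := 0
	for s := range results {
		total += s
	}
	return total
}

func main() {
	fmt.Println(localSums(4)) // 0+1+3+6 = 10
}
```

No candidate exists here (no shared mutable state), so despite heavy concurrency there is nothing for C&C to flag.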

A candidate by itself doesn’t indicate a race condition, just as using a goroutine doesn’t by itself represent a race condition, but candidates that are also executed concurrently are potential race conditions:

The goal of C&C is to identify areas of code that are a high risk for concurrency errors.

Candidates

The first step of C&C is identifying the candidates: the things with a disproportionately large risk of a race condition. A variable allocated within a frame and not shared outside of it carries less risk than a global variable or a pointer passed between multiple function calls. The candidate detection step is done independently of concurrency.

The candidates that I look for are:

  • Global Variables
  • Pointer receivers

Contexts

The next step is to identify contexts: areas of code that are being executed concurrently, i.e. in a goroutine. The issue is that many libraries provide abstractions over goroutines, so searching a code base for all goroutines (e.g. grep for "go ") could involve deep analysis into multiple degrees of dependencies.

Because of this, the contexts step identifies common application-level abstractions which exist on top of goroutines. Take HTTP for example: for this approach the context is a handler function, but the root is still just a goroutine, because Go’s net/http library serves each accepted connection in its own goroutine.

Common contexts are:

  • Explicit goroutines: `grep -rnIF 'go ' .`
  • HTTP server handlers
  • Other application-level abstractions, e.g. a worker pool
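A rough first pass over both candidates and contexts can be scripted with grep. This sketch builds a throwaway tree so it is self-contained; the file contents and patterns are illustrative only:

```shell
#!/bin/sh
set -eu

# build a tiny throwaway tree to demonstrate the searches
dir=$(mktemp -d)
cat > "$dir/counter.go" <<'EOF'
package races

func (c *Counter) Set(v int) { c.count = v }

func launch() {
	go work()
}
EOF

# contexts: explicit goroutine launches
grep -rnIF 'go ' "$dir"

# candidates: methods with pointer receivers
grep -rnE 'func \([a-zA-Z]+ \*[A-Za-z]+\)' "$dir"

rm -rf "$dir"
```

The output of each grep is a starting worklist, not a verdict; every hit still needs the overlap analysis described below before it counts as a potential race.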

Analysis

The next step of C&C is to check for overlaps in candidates and contexts. This will identify code in the Shared Memory / Concurrent Execution quadrant of the matrix above (when a candidate is executed within a context) and should be used to flag code for additional manual analysis.

Example

type Counter struct {
    count int
}

func (c *Counter) Value() int {
    return c.count
}

func (c *Counter) Set(v int) {
    c.count = v
}

reqCount := Counter{}

http.Handle("/", http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
    value := reqCount.Value()
    fmt.Printf("handling request: %d\n", value)
    time.Sleep(1 * time.Nanosecond)
    reqCount.Set(value + 1)
    fmt.Fprintln(w, "Hello, client")
}))
log.Fatal(http.ListenAndServe(":8080", nil))

Candidates

Remember, the first step is to look for candidates: checking all diffs for global variables or pointer receivers. In this case there are both:

  • reqCount is a global variable
  • reqCount.Value() is a method with a pointer receiver
  • reqCount.Set() is a method with a pointer receiver

Contexts

Next we identify all areas of a diff that are being executed concurrently; in this case there is only the http.HandlerFunc.

Overlap

Finally, we check whether any of the candidates are executed in the concurrent context, and all of them are. This step flags each of the candidates as a strong candidate for a potential race, which allows for a more in-depth analysis to ensure that access to the identified functions/variables is synchronized.

Conclusion

Even with the solutions Go provides, I’ve still found that an unwavering dedication to safety analysis is required to ensure concurrently correct programs in Go. I feel this reflects the state of concurrent programming and its immaturity. It’s extremely promising to see languages and tooling that provide more active race condition analysis. Go has become my personal tool of choice, with the best balance of performance, concurrency, simplicity and speed of any tool I’ve operated, but it’s definitely exciting to think of a future when these sorts of active checks are built into and supported by Go’s compiler!
