Write a static code analysis tool for detecting race conditions

Amit Davidson
11 min read · Aug 15, 2021


This blog post walks you through how to implement a static code analysis tool for detecting race conditions. While the tool is written in Go and analyzes Go code, the concepts presented can be applied to any language.

The tutorial is based on a tool called Chronos that does static code analysis to find race conditions. You can find the code here.

What are race conditions?

Data race

First, let’s look at a class of race conditions known as a “data race”. It refers to a situation where a memory operation in one thread accesses a memory location while a memory operation in another thread is writing to that memory location.

func race() {
    wait := make(chan struct{})
    x := 0
    go func() {
        x++ // read, increment, write
        close(wait)
    }()
    x++ // conflicting access
    <-wait
    fmt.Println(x) // Output: 1 or 2, unpredictable
}

In the example above, the two goroutines try to increment the variable “x”. Incrementing isn’t a single operation but is made of three sub-operations: reading the value, incrementing it, and writing the new value back. The variable is accessed at the same time, so the result can be either 1 or 2, which is unpredictable behavior.

To solve it, we can use a mutex to guarantee that the variable is accessed at separate times.

func norace() {
    wait := make(chan struct{})
    x := 0
    mutex := &sync.Mutex{}
    go func() {
        mutex.Lock()
        x++
        mutex.Unlock()
        close(wait)
    }()
    mutex.Lock()
    x++
    mutex.Unlock()
    <-wait
    fmt.Println(x) // Output: 2
}

Race condition

The second type occurs when the program’s behavior depends on the sequence or timing of its processes or threads.

func race() {
    x := 0
    go func() {
        x = 1
    }()
    if x == 1 {
        fmt.Println("Success")
    }
}

In this example, the problem is ordering. The expectation is that the goroutine finishes before the condition is evaluated, since it starts first, but in reality it might complete afterwards, so the assignment won’t happen in time. A mutex won’t solve this, as it only provides mutual exclusion. The solution is a syncing mechanism that enforces order, such as an unbuffered channel or a waitgroup.

func norace() {
    c := make(chan bool)
    x := 0
    go func() {
        x = 1
        c <- true
    }()
    <-c
    if x == 1 {
        fmt.Println("Success")
    }
}

Detecting race conditions

Now that we know what race conditions are and how to solve them, we need an algorithm to detect them programmatically.

In Chronos, we use a “guarded access” to record each memory access. Guarded accesses are collected and compared: write-write or read-write accesses from different goroutines that are not guarded by mutexes or another syncing mechanism are flagged as race conditions.

A guarded access contains data about the access that is flow-agnostic, i.e. related to the memory access itself and identical across flows:

  1. Access kind: write/read and write/write memory accesses are considered race conditions, so the kind tells whether two memory accesses conflict.
  2. Code position (from the beginning of the program): serves as an id for the memory access.
  3. Value modified: used to tell whether the same value is being accessed when comparing memory accesses.

and the data related to the flow (goroutine):

  1. Goroutine id: serves as an id for the flow.
  2. Stacktrace: used when printing the results to explain the flows easily.
  3. Mutexes acquired/released (lockset).
  4. Vector clock: explained later. A sketch of the whole structure follows this list.
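Putting the two lists together, a guarded access might be modeled roughly like this (a minimal sketch with illustrative field names, not Chronos’s actual implementation; the Lockset and VectorClock types are sketched further down):

import (
    "go/token"

    "golang.org/x/tools/go/ssa"
)

// AccessKind marks whether the memory access is a read or a write.
type AccessKind int

const (
    Read AccessKind = iota
    Write
)

// GuardedAccess records a single memory access together with the
// flow-related state at the moment of the access.
type GuardedAccess struct {
    Kind        AccessKind  // read or write
    Pos         token.Pos   // code position, serves as an id
    Value       ssa.Value   // the value being accessed
    GoroutineID int         // id of the flow
    Stacktrace  []string    // used to explain the flows when printing results
    Lockset     Lockset     // mutexes acquired/released so far
    Clock       VectorClock // vector clock of the goroutine
}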

If we take the memory access “x = 1” inside the goroutine, for example, the guarded access is:

func guardedaccess() {
    x := 0
    go func() {
        x = 1
    }()
}
  • Access kind: write
  • Position: 60
  • Value modified: x
  • Goroutine id: 2
  • Stack trace: guardedaccess → func1 → “x = 1”
  • Mutexes: none
  • Vector clock: irrelevant for now

Detecting mutexes

As with memory accesses, we track mutex locks and unlocks in a data structure called a lockset. Each goroutine holds its own lockset, which contains the locks that have been acquired and released so far.

func locks() {
    m := sync.Mutex{} // lockset: {locks: {}, unlocks: {}}
    m.Lock()          // lockset: {locks: {m}, unlocks: {}}
    m.Unlock()        // lockset: {locks: {}, unlocks: {m}}
    a := 5            // lockset: {locks: {}, unlocks: {m}}
    _ = a
}

If the code contains branches, a join operation is performed between the branches when a mutex is acquired inside one of them: for the mutex to be considered locked, all branches must acquire it. On the other hand, when a release happens in one of the branches, a union is performed over the branches, so a single release is sufficient for the mutex to be considered unlocked.

This algorithm is sound but not precise: it gives up accuracy and flags more race conditions than actually exist. It suffices for one branch to release a lock, but all branches must acquire a lock for it to be considered locked. This means locks are treated as unlocked more often, so fewer guarded accesses are “defended” by mutexes, and more race conditions are flagged.

func mutexes() {
    m := sync.Mutex{}   // lockset: {locks: {}, unlocks: {}}
    if rand.Int() > 0 { // lockset: {locks: {}, unlocks: {}}
        m.Lock()        // lockset: {locks: {m}, unlocks: {}}
    } else {
        fmt.Println("Lock wasn't acquired")
    }
    if rand.Int() > 0 { // lockset: {locks: {}, unlocks: {}}
        m.Unlock()      // lockset: {locks: {}, unlocks: {m}}
    } else {
        fmt.Println("Lock was released")
    }
    if rand.Int() > 0 { // lockset: {locks: {}, unlocks: {m}}
        m.Lock()        // lockset: {locks: {m}, unlocks: {}}
    } else {
        m.Lock()        // lockset: {locks: {m}, unlocks: {}}
    }
    // lockset: {locks: {m}, unlocks: {}}
}

In the first condition, the lock is acquired only inside one branch, so it doesn’t count, since it wasn’t acquired in all branches. In contrast, when taking the union across the branches of the second condition, the unlock survives, since an unlock in at least one branch is sufficient. Finally, in the last condition, the lock is acquired in all branches, so the lock operation counts and the mutex is no longer considered released.
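The lockset and the branch rule can be sketched as follows (illustrative code, not Chronos’s exact implementation): an intersection of the lock sets and a union of the unlock sets when two branches join.

// Lockset tracks the mutexes acquired and released so far in a flow.
type Lockset struct {
    Locks   map[string]struct{}
    Unlocks map[string]struct{}
}

// MergeBranches joins the locksets of two branches. A mutex counts as
// locked only if every branch acquired it (intersection), while a single
// release is enough for it to count as unlocked (union).
func MergeBranches(a, b Lockset) Lockset {
    out := Lockset{
        Locks:   map[string]struct{}{},
        Unlocks: map[string]struct{}{},
    }
    for m := range a.Locks {
        if _, ok := b.Locks[m]; ok {
            out.Locks[m] = struct{}{} // locked in both branches: still locked
        }
    }
    for m := range a.Unlocks {
        out.Unlocks[m] = struct{}{} // one branch is enough: unlocked
    }
    for m := range b.Unlocks {
        out.Unlocks[m] = struct{}{}
    }
    return out
}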

Vector clocks

To understand how to trace channels, waitgroups, and the rest of the order-based syncing mechanisms, we need to look at vector clocks.

A vector clock is a data structure used for determining the partial ordering of events in a distributed system. In other words, it helps track the order of events across different goroutines and establish a happens-before relationship between them.

Each goroutine has its own vector clock: a map from the ids of the goroutines to the number of events (memory accesses) taken in each goroutine.

  1. Initially, all clocks are zero.
  2. Each time a goroutine performs an event (a memory access), it increments its own logical clock in the vector by one.
  3. Each time a goroutine communicates with another one, it increments its own logical clock by one and sends its whole vector along.
  4. When the other goroutine receives it, it increments its own logical clock by one and updates each element in its vector to the maximum of its own value and the value in the received vector.

For example, take processes p1, p2, and p3, each starting with its own empty clock. Every time an event happens, a process increments its own entry in the vector. When two processes sync, as in p1→p2, p1 increments its own clock and sends its vector along; p2 increments its own clock and takes the maximum of each entry in the two vectors.
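These rules translate into only a few lines of code. A sketch, keyed by goroutine id:

// VectorClock maps a goroutine id to the number of events observed
// in that goroutine.
type VectorClock map[int]int

// Tick implements rules 2 and 3: a goroutine increments its own
// entry on every event, including sends.
func (vc VectorClock) Tick(goroutineID int) {
    vc[goroutineID]++
}

// Merge implements rule 4: on receive, after ticking its own entry,
// the goroutine takes the element-wise maximum of the two clocks.
func (vc VectorClock) Merge(other VectorClock) {
    for id, t := range other {
        if t > vc[id] {
            vc[id] = t
        }
    }
}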

Putting it in the code

func norace() {
    // VC1: {1:0}
    c := make(chan bool) // VC1: {1:1}
    x := 0               // VC1: {1:2}
    go func() {          // Clocks synced! VC1: {1:3, 2:0} VC2: {1:3, 2:0}
        x = 1            // VC1: {1:3, 2:0} VC2: {1:3, 2:1}
        c <- true        // VC1: {1:3, 2:0} VC2: {1:3, 2:2}
    }()
    <-c          // Clocks synced! VC1: {1:3, 2:2} VC2: {1:3, 2:2}
    if x == 1 {  // VC1: {1:4, 2:2} VC2: {1:3, 2:2}
        fmt.Println("Success")
    }
}

The clocks were synced twice: once when the new goroutine was created, and again when the message was sent over the unbuffered channel. To avoid complications, in Chronos, clocks are synced only when a new goroutine is created.

Once we have the vector clocks, we can establish the relationship between two memory accesses: either they happened one after another, or they are parallel, hence a race condition. To do that, consider the clocks VCᵢ = {c1ᵢ, c2ᵢ} and VCⱼ = {c1ⱼ, c2ⱼ}, where VCᵢ and VCⱼ are the clocks and inside are the vector components of each clock. Event e₁ has the clock VCᵢ and event e₂ has the clock VCⱼ.

  • e₁ is before e₂ (e₁ < e₂) if c1ᵢ < c1ⱼ and c2ᵢ ≤ c2ⱼ, e.g. e₁: {4,2}, e₂: {5,2}
  • e₁ is after e₂ (e₁ > e₂) if c1ᵢ > c1ⱼ and c2ᵢ ≥ c2ⱼ, e.g. e₁: {7,2}, e₂: {4,1}
  • If neither holds, they are concurrent (e₁ || e₂), e.g. e₁: {3,2}, e₂: {4,1} (see the sketch below)
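In code, this is the standard happens-before check for vector clocks: e₁ is before e₂ when no entry of VCᵢ exceeds the matching entry of VCⱼ and at least one is strictly smaller. A sketch building on the VectorClock type above (missing entries count as zero):

// Before reports whether every entry of a is <= the matching entry
// of b, with at least one entry strictly smaller.
func (a VectorClock) Before(b VectorClock) bool {
    for id, t := range a {
        if t > b[id] {
            return false
        }
    }
    strict := false
    for id, t := range b {
        if a[id] < t {
            strict = true
        }
    }
    return strict
}

// Concurrent reports whether neither event happens before the other,
// i.e. the two events are parallel (e₁ || e₂).
func Concurrent(a, b VectorClock) bool {
    return !a.Before(b) && !b.Before(a)
}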

In the norace example above, the write “x = 1” precedes the read “x == 1” in the condition, because for “VCᵢ: {1:4, 2:2}, VCⱼ: {1:3, 2:2}” we have 4 > 3 and 2 ≥ 2.

On the other hand, if we didn’t send a message through the channel, the clocks would never sync, and we could conclude the read and the write are parallel to each other: for “VCᵢ: {1:3, 2:0}, VCⱼ: {1:2, 2:1}”, 3 > 2 but 0 ≥ 1 doesn’t hold.

func race() {
    // VC1: {1:0}
    x := 0      // VC1: {1:1}
    go func() { // Clocks synced! VC1: {1:2, 2:0} VC2: {1:2, 2:0}
        x = 1   // VC1: {1:2, 2:0} VC2: {1:2, 2:1}
    }()
    if x == 1 { // VC1: {1:3, 2:0} VC2: {1:2, 2:1}
        fmt.Println("Success")
    }
}

Performing static analysis

We know how to detect race conditions once we have the information about the guarded accesses, the mutexes, and the syncing mechanisms. The question is how to extract that information from the code. In this part, we will talk about how to perform static analysis.

SSA

SSA stands for static single assignment. It’s a representation of code that compilers use when optimizing, which requires each variable to be assigned exactly once.

y = 1        y1 = 1
y = 2 -----> y2 = 2
x = y        x1 = y2

“y” is assigned twice, so each assignment gets its own version (“y1”, “y2”), ensuring every variable is assigned only once.

In Go, the package golang.org/x/tools/go/ssa can generate SSA form, allowing us to work with the generated code from within our own code!
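For example, this is roughly how to load a package and build its SSA form using golang.org/x/tools/go/packages together with the ssa and ssautil packages (a minimal sketch):

package main

import (
    "fmt"

    "golang.org/x/tools/go/packages"
    "golang.org/x/tools/go/ssa"
    "golang.org/x/tools/go/ssa/ssautil"
)

func main() {
    // Load the target package with full syntax and type information.
    cfg := &packages.Config{Mode: packages.LoadAllSyntax}
    pkgs, err := packages.Load(cfg, "./...")
    if err != nil {
        panic(err)
    }

    // Build the SSA form of all loaded packages.
    prog, ssaPkgs := ssautil.AllPackages(pkgs, ssa.SanityCheckFunctions)
    prog.Build()

    for _, pkg := range ssaPkgs {
        fmt.Println(pkg)
    }
}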

Basic block

In the package, functions are represented using the following struct:

type Function struct {
    Signature *types.Signature
    Pkg       *Package      // enclosing package
    Prog      *Program      // enclosing program
    Params    []*Parameter  // function parameters
    Blocks    []*BasicBlock // basic blocks of the function
    ...
}

Signature, Pkg, Prog, and Params are self-explanatory, but what are Blocks (basic blocks)? In SSA, functions are divided into basic blocks, where each block is a straight-line sequence of instructions without branches. This makes the code highly amenable to analysis.

type BasicBlock struct {
    Index        int           // index of this block within Parent().Blocks
    Instrs       []Instruction // instructions in order
    Preds, Succs []*BasicBlock // predecessors and successors
}

Using “Preds” (predecessors) and “Succs” (successors), the blocks form the control flow graph (CFG) of the function: a graph that represents all the paths that might be traversed during execution. This way, we can mimic how the code might run.

Instructions

A block has a list of instructions. “Instruction” is an interface representing all possible instructions in Go, such as “ssa.BinOp”, “ssa.TypeAssert”, “ssa.Go”, and so on.

type Instruction interface {
    String() string
    Parent() *Function
    Block() *BasicBlock
    Operands(rands []*Value) []*Value
    Pos() token.Pos
}

Operands() and Pos() are the most important: Operands() returns the values used in the instruction, and Pos() returns its position in the code. With them, it’s possible to fill in the position and the value when creating a guarded access for a memory access. The concrete instruction type also tells us whether the access is a read or a write.
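For instance, in the ssa package a write shows up as an *ssa.Store, while a read of a memory location shows up as an *ssa.UnOp whose operator is token.MUL (a pointer dereference, i.e. a load). A sketch of walking a function and classifying its memory accesses:

import (
    "fmt"
    "go/token"

    "golang.org/x/tools/go/ssa"
)

// classify walks a function's basic blocks and reports every
// memory read and write it finds.
func classify(fn *ssa.Function) {
    for _, block := range fn.Blocks {
        for _, ins := range block.Instrs {
            switch ins := ins.(type) {
            case *ssa.Store:
                // A write: ins.Addr is the location being written to.
                fmt.Println("write:", ins.Addr.Name(), "at", ins.Pos())
            case *ssa.UnOp:
                if ins.Op == token.MUL {
                    // A load (*addr): a read of ins.X.
                    fmt.Println("read:", ins.X.Name(), "at", ins.Pos())
                }
            }
        }
    }
}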

Putting it all together

Let’s consider the following code. It has no branches, so there’s only a single basic block, and we iterate over its instructions until we find the ones that interest us.

func race() {
    wait := make(chan struct{})
    x := 0
    mutex := &sync.Mutex{}
    go func() {
        mutex.Lock()
        x = 5
        mutex.Unlock()
        close(wait)
    }()
    mutex.Lock()
    x++
    mutex.Unlock()
    x = 4
    <-wait
    fmt.Println(x)
}

For “x = 5”, it’s the “Store” instruction.

type Store struct {
    Addr Value
    Val  Value
    // contains filtered or unexported fields
}

To create the guarded access: the operation is “write”, the position is 132, and the value accessed is “Addr”, i.e. x.

For the mutex, we can look for call instructions (“ssa.Call”) and check whether the call is made to “(*sync.Mutex).Lock” or “(*sync.Mutex).Unlock”. Then we can add the mutex to the lockset and attach it to the guarded access.
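A sketch of that check, usable inside the same instruction loop as before (the fully qualified names are the ones the ssa package prints for methods; illustrative, not Chronos’s exact code):

// mutexOp reports whether the instruction is a call to
// (*sync.Mutex).Lock or (*sync.Mutex).Unlock.
func mutexOp(ins ssa.Instruction) (name string, ok bool) {
    call, isCall := ins.(*ssa.Call)
    if !isCall {
        return "", false
    }
    callee := call.Call.StaticCallee()
    if callee == nil {
        return "", false // dynamic call, callee unknown statically
    }
    switch callee.String() {
    case "(*sync.Mutex).Lock", "(*sync.Mutex).Unlock":
        return callee.String(), true
    }
    return "", false
}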

Finally, for the vector clock, we can look for the “ssa.Go” instruction. We then sync the clocks by adding an entry for the new goroutine to the vector clock of the main goroutine and copying the clock over to the new one. The clock is incremented every time a memory access is performed inside the goroutine, and it is attached to the guarded access when the access is created.

type Go struct {
    Call CallCommon
    // contains filtered or unexported fields
}

Only the value, the vector clock, the lockset, and the read/write kind are relevant when comparing guarded accesses. We can do the same for all the memory accesses to “x”:

  • x := 0: write, vc: {1:2}, ls: {}
  • x = 5: write, vc: {1:3, 2:1}, ls: {locks: [mutex], unlocks: []}
  • x++: write, vc: {1:4, 2:0}, ls: {locks: [mutex], unlocks: []}
  • x = 4: write, vc: {1:5, 2:0}, ls: {locks: [], unlocks: [mutex]}

Then we compare guarded accesses from different goroutines:

  • x = 5 and x := 0 don’t conflict, as {1:3, 2:1} > {1:2}
  • x = 5 and x++ don’t conflict, as both are guarded by the same lock
  • x = 5 and x = 4: x = 4 isn’t guarded by any lock and the vector clocks are concurrent ({1:3, 2:1} || {1:5, 2:0}), so that’s a race condition, as the sketch below shows.
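Combining all the sketches above, the comparison of two guarded accesses boils down to a handful of checks (illustrative code): different goroutines, at least one write, the same value, no common lock held, and concurrent clocks.

// conflicting reports whether two guarded accesses form a race
// condition.
func conflicting(a, b GuardedAccess) bool {
    if a.GoroutineID == b.GoroutineID {
        return false // same flow: ordered by program order
    }
    if a.Kind != Write && b.Kind != Write {
        return false // read/read accesses never conflict
    }
    if a.Value != b.Value {
        return false // different memory locations
    }
    for m := range a.Lockset.Locks {
        if _, ok := b.Lockset.Locks[m]; ok {
            return false // both guarded by the same mutex
        }
    }
    // Unordered accesses are parallel, hence a race condition.
    return Concurrent(a.Clock, b.Clock)
}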

Conclusion

This post covered the implementation in broad strokes. It didn’t cover how to traverse the control flow graph when there are branches, or how to handle race conditions on variables shared between different functions. I decided not to go into those topics as they are complex. I hope you will have a look at the source code if you are interested.

If you have any questions or feedback, or found any mistakes, it would be great if you could contact me in any way you like.
