Dissecting Golang sync.Mutex

Denis Shilkin
Published in Gett Tech
11 min read · Oct 4, 2023

Previous post: Dissecting Golang sync.Once

Introduction

Generally, the web applications for which we use Golang the most don’t need a lot of manually written concurrent code. A standard web server serves each incoming request in a separate goroutine, and in most cases this is enough. Our services are usually simple and, frankly speaking, not heavily loaded. Also, our cloud infrastructure allows us to scale the system up and, to some extent, hides bottlenecks caused by suboptimal code. If you work on CRUD applications, the vast majority of bottlenecks are located in database schemas and queries. But what if your service becomes more and more popular? Given the I/O-bound nature of web services and the growing number of cross-service dependencies, you may face performance issues related to the lack of concurrency. Good examples are parallelising independent HTTP calls or an in-memory cache that is updated and invalidated in the background. This is where concurrency comes into play.

The Golang sync package provides various synchronisation primitives that help us develop concurrent applications:

  • sync.Once
  • sync.Mutex
  • sync.RWMutex
  • sync.WaitGroup
  • sync.Cond
  • sync.Map
  • sync.Pool
  • a few other auxiliary types

In this post, we are going to dive deep into the implementation of sync.Mutex, just as we did with sync.Once.

Before we start, let me remind you what a mutex is: mutex stands for “MUTual EXclusion”. As the name suggests, we use a mutex to ensure that a resource is accessed by only one thread at a time. By “thread” I mean any logical thread, whether it’s a goroutine or an operating system thread. The “resource” may be memory, a device or any other entity you want to serialise access to.

A mutex provides two operations: lock and unlock. If the mutex is unlocked, a lock operation succeeds, and all subsequent lock operations block until the first thread unlocks the mutex. Then one of the blocked threads acquires the mutex, and the cycle repeats.

This is a simplified definition of a mutex, but keep in mind that the real semantics may be a bit more complex: a mutex may or may not be recursive (in Java a recursive mutex is known as a reentrant lock), and it may provide an optional “try lock” method, possibly with a timeout. It depends on the implementation. These topics are out of the scope of this post. We will focus on sync.Mutex from the standard Golang library and its Lock and Unlock methods.
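
As a quick refresher before we build our own, here is a minimal sketch (my own example, not from the post) of how sync.Mutex is typically used to protect a critical section:

package main

import (
    "fmt"
    "sync"
)

func main() {
    var (
        mu      sync.Mutex
        wg      sync.WaitGroup
        counter int
    )

    for i := 0; i < 100; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()

            mu.Lock()         // only one goroutine at a time passes this point
            defer mu.Unlock() // release the mutex when the goroutine returns
            counter++         // the critical section
        }()
    }

    wg.Wait()
    fmt.Println(counter) // always prints 100
}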

Also, in this article I refer to the runtime wait queue without going into details. It is a complex mechanism that deserves a separate article. For the purposes of this post, we can think of it simply as an abstract queue.

So, without further ado, let’s try to come up with our own mutex implementation:

type Mutex struct {
    // ?
}

func (m *Mutex) Lock() {
    // ?
}

func (m *Mutex) Unlock() {
    // ?
}

Naive implementation

The naive implementation may look like a well-known spin lock: we try to acquire the mutex in a loop by setting a flag to 1, and we set the flag back to 0 when releasing the mutex.

package mutex

import (
    "sync/atomic"
)

type SpinLock struct {
    acquired int32
}

func (l *SpinLock) Lock() {
    for !atomic.CompareAndSwapInt32(&l.acquired, 0, 1) {}
}

func (l *SpinLock) Unlock() {
    atomic.StoreInt32(&l.acquired, 0)
}

Semantically this is correct, but as you may notice, this is “busy waiting”. We are wasting CPU time because, while the mutex is locked, all goroutines keep trying to compare-and-swap the flag.

We can conclude that we need to yield control to some other goroutine that is ready to do its work. We need to tell the scheduler that we are waiting for the mutex to be released and meanwhile we don’t want to occupy the CPU. How can we express this?
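
One crude way to express this is to yield inside the spin loop with runtime.Gosched(), which asks the scheduler to let other runnable goroutines use the CPU. This is a sketch of my own, not one of the implementations compared later:

package mutex

import (
    "runtime"
    "sync/atomic"
)

type YieldingSpinLock struct {
    acquired int32
}

func (l *YieldingSpinLock) Lock() {
    for !atomic.CompareAndSwapInt32(&l.acquired, 0, 1) {
        // give up the rest of the time slice instead of burning it
        runtime.Gosched()
    }
}

func (l *YieldingSpinLock) Unlock() {
    atomic.StoreInt32(&l.acquired, 0)
}

This still wakes up and re-checks the flag over and over. It would be better to block until the mutex is actually released, which is what the semaphore-based implementation does.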

Semaphore-based implementation

A semaphore is a well-known synchronisation primitive in its own right. In a few words, a semaphore limits the maximum number of threads that can enter the critical section at once, so a semaphore that allows only 1 is equivalent to a lock. From the implementation perspective, a thread decrements a counter on entry; if the counter is zero, the thread has to wait until the counter becomes greater than zero before it can decrement it.

Let’s take advantage of the blocking nature of Golang channels and implement the mutex using a channel with buffer size 1.

type SemaLock struct {
    sema chan struct{}
}

func NewSemaLock() *SemaLock {
    sema := make(chan struct{}, 1)
    sema <- struct{}{}

    return &SemaLock{
        sema: sema,
    }
}

func (l *SemaLock) Lock() {
    // the competing goroutine empties the channel;
    // all others wait for a non-empty channel
    <-l.sema
}

func (l *SemaLock) Unlock() {
    // put it back, allowing other competitors
    // to read from the channel
    l.sema <- struct{}{}
}

This looks better: now we are not actively waiting, we are rather “sleeping”, which means that only one goroutine will be “woken up” at a time to acquire the mutex. But what is the downside of this approach? Let’s try to find out.

Imagine a flow of arriving goroutines competing for the mutex. Which of them are most likely to be “woken up”? Obviously those that are running, i.e. occupying the CPU. So we basically rely on the scheduler, since the scheduler controls this process. Even though reading from the channel is a blocking operation and triggers rescheduling, we may still see a “long-waiting tail” of goroutines that are not scheduled for longer than the others.

So we need to somehow wake up long-waiting goroutines to make the mutex “fair”. Before analysing the real sync.Mutex, let’s compare the performance of our implementations in terms of CPU consumption and speed.

Performance

In this benchmark I vary the number of goroutines concurrently trying to acquire the lock and increment a counter, from 10 to 100,000. The operation is complete when the counter has been incremented by all goroutines. I also added the real sync.Mutex to the comparison.

My hardware configuration is:

  • MacBook Pro 2017
  • CPU 2.9 GHz Quad-Core Intel Core i7
  • Memory 16 GB 2133 MHz

func BenchmarkSema(b *testing.B) {
    benchmark(b, mutex.NewSemaLock())
}

func BenchmarkSpin(b *testing.B) {
    benchmark(b, &mutex.SpinLock{})
}

func BenchmarkMutex(b *testing.B) {
    benchmark(b, &sync.Mutex{})
}

func benchmark(b *testing.B, l sync.Locker) {
    b.Helper()

    for _, c := range concurrency {
        for i := 0; i < b.N; i++ {
            b.Run(fmt.Sprintf("concurrency-%d", c), func(b *testing.B) {
                incrementCounterConcurrently(l, c)
            })
        }
    }
}
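
The concurrency slice and the incrementCounterConcurrently helper are referenced above but not shown in the post; here is a minimal sketch of what they might look like (the names come from the benchmark code, the bodies are my assumption):

var concurrency = []int{10, 100, 1000, 10000, 100000}

func incrementCounterConcurrently(l sync.Locker, goroutines int) {
    var (
        wg      sync.WaitGroup
        counter int
    )

    for i := 0; i < goroutines; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()

            l.Lock()
            counter++
            l.Unlock()
        }()
    }

    wg.Wait()
}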

The CPU profiles of all implementations, for comparison:


+========================+=================+====================+
| Frame | CPU Consumption | CPU Time |
+========================+=================+====================+
| mutex.(*SpinLock).Lock | 97.80% | 146,410,000,000 ns |
+------------------------+-----------------+--------------------+
| mutex.(*SemaLock).Lock | 31.00% | 9,180,000,000 ns |
+------------------------+-----------------+--------------------+
| sync.(*Mutex).Lock | 3.50% | 150,000 ns |
+------------------------+-----------------+--------------------+

The line chart of nanoseconds per operation is the following:

You can see that the spin lock shows the same results as the standard mutex, but its CPU time is six orders of magnitude higher.

Also, the semaphore-based implementation is slower than both of them. This is because of the rescheduling triggered by the “sleeping” behaviour. In fact, the top 3 CPU frames for SemaLock are the following:

+========================+=================+==================+
| Frame | CPU Consumption | CPU Time |
+========================+=================+==================+
| mutex.(*SemaLock).Lock | 31.00% | 9,180,000,000 ns |
+------------------------+-----------------+------------------+
| runtime.mcall | 27.00% | 7,980,000,000 ns |
+------------------------+-----------------+------------------+
| runtime.systemstack | 14.60% | 4,330,000,000 ns |
+------------------------+-----------------+------------------+

So let’s put the semaphore-based implementation aside and try to find out what makes sync.Mutex as fast as the spin lock, yet much less demanding on the CPU.

sync.Mutex

sync.Mutex combines both a spin lock and a wait queue based on a semaphore. It also has two modes: normal and starvation.

Roughly, its algorithm may be described like this:

  1. the goroutine tries to rapidly acquire the lock by actively waiting for a certain condition (spin lock)
  2. if it fails, it goes to the tail of the wait queue (FIFO) to be eventually woken up by some unlocker
  3. once woken up, it repeats the spin-lock procedure
  4. if this takes more than 1ms, the goroutine goes to the head of the wait queue (LIFO) and switches the mutex into starvation mode
  5. in starvation mode, each arriving locker is immediately enqueued at the tail of the wait queue (FIFO)
  6. in starvation mode, each unlocker hands off control directly to the goroutine at the head of the wait queue
  7. in starvation mode, the goroutine that receives the mutex checks whether the wait queue is empty or its own waiting time is below 1ms, and if so, it switches the mutex back to normal mode

Starvation mode is less efficient than the normal one, but it allows the mutex to avoid the long-waiting tail: if some goroutine has been trying to acquire the mutex for longer than expected, it gets priority over the others.
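
To make this more concrete, here is a simplified view of the mutex internals. The field and constant names are paraphrased from src/sync/mutex.go in recent Go versions; the comments are mine:

type Mutex struct {
    state int32  // bit flags plus the number of waiters
    sema  uint32 // runtime semaphore backing the wait queue
}

const (
    mutexLocked      = 1 << iota // the mutex is locked
    mutexWoken                   // a woken goroutine is spinning
    mutexStarving                // the mutex is in starvation mode
    mutexWaiterShift = iota      // the remaining bits count the waiters

    // a waiter switches the mutex into starvation mode
    // after waiting for more than 1ms
    starvationThresholdNs = 1e6
)

The Lock fast path is a single atomic CompareAndSwap of state from 0 to mutexLocked; everything else, including the spinning and the wait queue backed by sema, happens in the slow path.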

Lock step

These are two animated representations of switching between normal and starvation modes.

Switching from normal to starvation mode

Switching from starvation to normal mode

Unlock step

The unlock step in the two modes differs mainly in how control is handed off:

  1. in normal mode, the scheduler picks some goroutine from the scheduler’s queue
  2. in starvation mode, the unlocking goroutine hands off control directly to the goroutine at the head of the wait queue

This is exactly what we were missing in the semaphore-based implementation. It is a way of telling the scheduler: don’t schedule just any goroutine, schedule the one that has been waiting for a long time.

This behaviour doesn’t come for free, since all goroutines on the corresponding runtime processor P are affected, even those that are not competing for the mutex. That’s why starvation mode is less efficient than normal mode.

Other implementations

If you have ever looked at other popular mutex implementations, you may have found that the combination of a spin lock and a waiting mechanism is not something Golang-specific.

The glibc library implements pretty much the same ideas.

/* pthread_mutex_lock.c */

do
  {
    /* In each loop, spin count is exponential backoff plus
       random jitter, random range is [0, exp_backoff-1].  */
    spin_count = exp_backoff + (jitter & (exp_backoff - 1));
    cnt += spin_count;
    if (cnt >= max_cnt)
      {
        /* If cnt exceeds max spin count, just go to wait queue.  */
        LLL_MUTEX_LOCK (mutex);
        break;
      }

    do
      atomic_spin_nop ();
    while (--spin_count > 0);

    /* Prepare for next loop.  */
    exp_backoff = get_next_backoff (exp_backoff);
  }
while (LLL_MUTEX_READ_LOCK (mutex) != 0
       || LLL_MUTEX_TRYLOCK (mutex) != 0);

The popular C++ library Qt uses futex, which is conceptually close to the combination of a spin lock and a wait queue, with the wait queue provided by a system call. Futex stands for “Fast Userspace muTEX” and is a userspace optimisation of the mutex: it supports WAIT and WAKE operations and serves as a wait queue.

Practical aspects

The optimisations made in Golang’s sync.Mutex are interesting, but they don’t actually affect the way you use the mutex. So now let’s move on to the practical things.

Do not copy sync.Mutex

sync.Mutex contains internal state that will cause problems if the mutex is copied. Theoretically, once locked and then copied, the mutex will remain locked, since you will unlock the copy but not the original. Practically, in real source code you will most likely run into the problem when the mutex is part of a structure that gets copied:

package main

import "sync"

type Counter struct {
    m sync.Mutex // exclusive lock
    c map[string]int
}

// This code is wrong.
// Using a value receiver leads to copying of sync.Mutex,
// which causes a runtime error: "fatal error: concurrent map writes".
//
// The right way is to use a pointer receiver:
//
// func (c *Counter) Inc(key string) {
func (c Counter) Inc(key string) {
    c.m.Lock()
    defer c.m.Unlock()

    c.c[key]++
}

func main() {
    counter := Counter{c: map[string]int{"foo": 0}}

    var wg sync.WaitGroup

    for i := 0; i < 1000; i++ {
        wg.Add(1)
        go func() {
            counter.Inc("bar")
            wg.Done()
        }()
    }

    wg.Wait()
}

More about pointer receivers.

Use sync.RWMutex if possible

sync.RWMutex is also known as a “shared mutex” or “shared lock”. It allows read operations to be performed “simultaneously” (not necessarily with real parallelism), switching to an exclusive mode only when a write operation occurs. This lets us significantly speed up read access to a resource, especially when the vast majority of operations are reads, which is reasonable for in-memory caches. We can modify our example to demonstrate how it works:

type Counter struct {
    m sync.RWMutex // shared lock
    c map[string]int
}

// Inc exclusively acquires the lock.
func (c *Counter) Inc(key string) {
    c.m.Lock()
    defer c.m.Unlock()

    // note that we do
    // modify memory here
    c.c[key]++
}

// Value acquires the shared lock, allowing
// all readers to get the value of a counter
// “simultaneously”.
func (c *Counter) Value(key string) int {
    c.m.RLock()
    defer c.m.RUnlock()

    // note that we don’t
    // modify memory here
    return c.c[key]
}

Avoid deadlocks

According to Wikipedia:

“In concurrent computing, deadlock is any situation in which no member of some group of entities can proceed because each waits for another member, including itself…” You may also consider a more complex version of this problem called the “Dining Philosophers Problem”.

Practically, we need to think about how to avoid this. Is there any rule of thumb that allows us to avoid at least some stupid mistakes? There is.

First of all, note that sync.Mutex is not recursive (reentrant), so avoid locking it more than once from the same goroutine:

type Counter struct {
    m     sync.Mutex
    value int
}

func (c *Counter) Inc() {
    c.m.Lock()
    defer c.m.Unlock()

    c.inc()
}

// inc is a private method that mistakenly
// locks the mutex a second time, which leads to
// an error:
// fatal error: all goroutines are asleep - deadlock!
func (c *Counter) inc() {
    c.m.Lock()
    defer c.m.Unlock()

    c.value++
}

Second of all, avoid locking more than one mutex, or any other blocking primitive, in the opposite order to that used by another goroutine. This means that every place in the code where you see more than one mutex for one logical entity should be reviewed more pedantically.

type Counter struct {
    // m0 and m1 are red flags
    m0     sync.Mutex
    m1     sync.Mutex
    value0 int
    value1 int
}

// Inc0 locks both m0 and m1, which is also a red flag.
// Note that we could still lock m0 and m1 in the right order,
// but using more than one mutex is prone to deadlocks anyway.
func (c *Counter) Inc0() {
    c.m0.Lock()
    defer c.m0.Unlock()

    // lines of code

    c.m1.Lock()
    defer c.m1.Unlock()

    c.value0++
}

func (c *Counter) Inc1() {
    c.m1.Lock()
    defer c.m1.Unlock()

    // lines of code

    c.m0.Lock()
    defer c.m0.Unlock()

    c.value1++
}
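
If two mutexes are really required, a common mitigation is to establish a single global acquisition order and follow it everywhere. A sketch of my own (not from the original post), fixing Inc1 to take the locks in the same order as Inc0:

// Inc1 now acquires m0 before m1, just like Inc0, so two goroutines
// can never hold the locks in opposite orders while waiting for each other.
func (c *Counter) Inc1() {
    c.m0.Lock()
    defer c.m0.Unlock()

    // lines of code

    c.m1.Lock()
    defer c.m1.Unlock()

    c.value1++
}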

Make sure to test for races

If we use concurrency, we can potentially face a data race (often loosely called a race condition). This usually happens when one goroutine modifies a resource that is not sufficiently protected. If some other goroutine reads or modifies the resource at the same time, it can cause unexpected behaviour: for example, the value of some object may end up in an inconsistent state, or we may get an “index out of range” panic.

I recommend using the race detector built into the Golang toolchain. Just run your tests with the `-race` option, and use it on CI to catch problems as soon as they appear.
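
As an illustration (my own sketch, not from the post), the following test has a data race on counter. Running it with go test -race should make the detector report the race, while without the flag the test may pass silently:

package counter_test

import (
    "sync"
    "testing"
)

func TestRacyCounter(t *testing.T) {
    var (
        wg      sync.WaitGroup
        counter int
    )

    for i := 0; i < 100; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            counter++ // unprotected write: a data race
        }()
    }

    wg.Wait()
    t.Logf("counter = %d", counter)
}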
