Switching a goroutine from one OS thread to another has a cost and can slow down the application if it happens too often. Over time, however, the Go scheduler has addressed this issue, and it now provides affinity between goroutines and threads when working concurrently. Let's go back a few years to understand this improvement.
During the early days of Go (Go 1.0 and 1.1), the language faced degraded performance when running concurrent code with more OS threads, i.e., a higher value of
GOMAXPROCS. Let's start with an example used in the documentation that calculates prime numbers:
Here is the benchmark with Go 1.0.3 when calculating the first hundred thousand prime numbers with multiple values of GOMAXPROCS:
Sieve 19.2s ± 0%
Sieve-2 19.3s ± 0%
Sieve-4 20.4s ± 0%
Sieve-8 20.4s ± 0%
To understand those results, we need to understand how the scheduler was designed at that time. In the first versions of Go, the scheduler had only one global queue, where all threads could push and get goroutines. Here is an example of an application running with a maximum of two OS threads, labeled
M on the following schema, defined by setting
GOMAXPROCS to two:
Having only one queue does not guarantee that a goroutine will resume on the same thread. The first thread ready will pick up a waiting goroutine and run it. Goroutines can therefore move from one thread to another, which is costly in terms of performance. Here is an example with a blocking channel:
- Goroutine #7 blocks on the channel and is waiting for a message. Once the message is received, the goroutine is pushed to the global queue:
- Then, the channel pushes messages and goroutine #X will run on an available thread while the goroutine #8 blocks on the channel:
- The goroutine #7 now runs on the available thread:
The goroutines now run on different threads. Having a single global queue also forces the scheduler to have a single global mutex covering all goroutine scheduling operations. Here is the CPU profile of our application with
GOMAXPROCS set to eight:
Total: 8679 samples
3700 42.6% 42.6% 3700 42.6% runtime.procyield
1055 12.2% 54.8% 1055 12.2% runtime.xchg
753 8.7% 63.5% 1590 18.3% runtime.chanrecv
677 7.8% 71.3% 677 7.8% dequeue
438 5.0% 76.3% 438 5.0% runtime.futex
367 4.2% 80.5% 5924 68.3% main.filter
234 2.7% 83.2% 5005 57.7% runtime.lock
230 2.7% 85.9% 3933 45.3% runtime.chansend
214 2.5% 88.4% 214 2.5% runtime.osyield
150 1.7% 90.1% 150 1.7% runtime.cas
procyield, xchg, futex, and lock are all related to the global mutex of the Go scheduler. We clearly see that the application is spending most of its time on this lock.
These issues prevented Go from taking advantage of multiple processors, and they were addressed in Go 1.1 with a new scheduler.
Affinity in concurrency
Go 1.1 came with the implementation of a new scheduler and the creation of local goroutine queues. This improvement avoids locking the entire scheduler when there are local goroutines and allows them to keep working on the same OS thread.
Since threads can block on system calls and the number of blocked threads is not limited, Go introduced the notion of processors. A processor
P represents a running OS thread and manages the local goroutine queue. Here is the new schema:
Here is the new benchmark with the new scheduler in Go 1.1.2:
Sieve 18.7s ± 0%
Sieve-2 8.26s ± 0%
Sieve-4 3.30s ± 0%
Sieve-8 2.64s ± 0%
Go now really takes advantage of all the available CPUs. The CPU profile has also changed:
Total: 630 samples
163 25.9% 25.9% 163 25.9% runtime.xchg
113 17.9% 43.8% 610 96.8% main.filter
93 14.8% 58.6% 265 42.1% runtime.chanrecv
87 13.8% 72.4% 206 32.7% runtime.chansend
72 11.4% 83.8% 72 11.4% dequeue
19 3.0% 86.8% 19 3.0% runtime.memcopy64
17 2.7% 89.5% 225 35.7% runtime.chansend1
16 2.5% 92.1% 280 44.4% runtime.chanrecv2
12 1.9% 94.0% 141 22.4% runtime.lock
9 1.4% 95.4% 98 15.6% runqput
Most of the operations associated with the lock have been removed; the ones marked
chanXXXX are only related to the channels. However, even though the scheduler improved the affinity between a goroutine and a thread, there are some cases where this affinity can be reduced.
To understand the limits of the affinity, we have to understand what goes to the local and global queues. The local queue will be used for all operations except system calls, such as blocking operations on channels and selects, or waiting on timers and locks. However, two features could restrict the affinity between a goroutine and a thread:
- Work-stealing. When a processor
P does not have enough work in its local queue, it will steal goroutines from other processors
P if the global queue and the network poller are empty. The stolen goroutines will then run on another thread.
- System calls. When a syscall occurs (e.g., file operations, HTTP calls, database operations, etc.), Go moves the running OS thread into a blocking mode, letting a new thread process the local queue on the current P.
However, by better managing the priority of the local queue, those two limitations can be avoided. Go 1.5 gives more priority to goroutines communicating back and forth on a channel, thus optimizing the affinity with the assigned thread.
Ordering to enhance affinity
A goroutine communicating back and forth on a channel results in frequent blocks, i.e., frequent re-queuing in the local queue, as seen previously. However, since the local queue has a FIFO implementation, the unblocked goroutines have no guarantee of running as soon as possible if another goroutine is hogging the thread. Here is an example with a goroutine that is now runnable after previously being blocked on a channel:
The goroutine #9 resumes after being blocked on the channel. However, it will have to wait for #2, #5, and #4 before running. In this example, the goroutine #5 will hog its thread, delaying the goroutine #9 and putting it at risk of being stolen by another processor. Since Go 1.5, goroutines returning from a blocking channel now run in priority thanks to a special attribute of their P, runnext:
The goroutine #9 is now marked as the next one runnable. This new prioritization allows the goroutine to run quickly before being blocked on the channel again. Then, the other goroutines will get running time. This change had an overall positive effect on the Go standard library, improving the performance of some packages.