ℹ️ This article is based on Go 1.13.
Goroutines are light; they need only 2 KB of stack memory to run. They are also cheap to run; switching from one goroutine to another does not require many operations. Before jumping into the switch itself, let’s review how the switch works at a higher level.
Before continuing this article, I strongly suggest reading my article “Go: Goroutine, OS Thread and CPU Management” to understand the notions explained here.
Go schedules the goroutines onto the threads based on two kinds of breakpoints:
- When a goroutine blocks: system call, mutex, or channel. The blocked goroutine goes into sleeping mode/into a queue and allows Go to schedule and run an awaiting goroutine.
- During a function call, at the prolog, if the goroutine has to grow its stack. This breakpoint allows Go to schedule another goroutine and avoid the running one hogging the CPU.
In both cases, `g0`, which runs the scheduler, replaces the current goroutine with another one that is ready to run. Then, the chosen goroutine replaces `g0` and runs on the thread.
For more information about `g0`, I suggest you read my article “Go: g0, Special Goroutine.”
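The first kind of breakpoint, a goroutine blocking on a channel, can be observed from user code. The sketch below is only an illustration: it assumes a single logical processor (forced with `runtime.GOMAXPROCS(1)`), and the names `handoff` and `record` are hypothetical. With one processor, the worker typically cannot run until the main goroutine blocks on the channel receive.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// handoff records the order in which two goroutines run.
// Hypothetical helper, written for illustration only.
func handoff() []string {
	// Assumption: force a single P so only one goroutine runs at a time.
	prev := runtime.GOMAXPROCS(1)
	defer runtime.GOMAXPROCS(prev)

	var (
		mu     sync.Mutex
		events []string
	)
	record := func(s string) {
		mu.Lock()
		events = append(events, s)
		mu.Unlock()
	}

	done := make(chan struct{})
	go func() {
		record("worker runs")
		close(done)
	}()

	record("main about to block")
	<-done // main blocks here, so the scheduler can run the worker
	record("main resumed")
	return events
}

func main() {
	for _, e := range handoff() {
		fmt.Println(e)
	}
}
```

The only ordering the program guarantees is that "main resumed" comes last, since the receive on `done` completes only after the worker has closed the channel.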
Switching from a running goroutine to another one involves two switches:
- The running `g` to `g0`
- `g0` to the next `g` to run
In Go, a goroutine switch is really light. To save the state of a goroutine, Go only needs two things:
- The line where the goroutine stopped before being unscheduled. The current instruction to run is recorded in a program counter (`PC`). The goroutine will later resume at the same point.
- The stack of the goroutine, in order to restore its local variables when it runs again.
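The two items above can be pictured with a toy model. This is not the runtime's code: the types `savedContext`, `cpu`, and `toyG` are hypothetical, and the addresses are made up. It only shows that saving and restoring two values is enough to leave a goroutine and come back to it later.

```go
package main

import "fmt"

// savedContext is a toy stand-in for what is saved per goroutine:
// where to resume (pc) and where its stack is (sp).
type savedContext struct {
	pc uintptr
	sp uintptr
}

// cpu models the two registers that matter for the switch.
type cpu struct {
	pc, sp uintptr
}

// toyG is a toy goroutine that carries its saved context.
type toyG struct {
	name  string
	sched savedContext
}

// save copies the live registers into the goroutine's saved context.
func save(c *cpu, g *toyG) { g.sched = savedContext{pc: c.pc, sp: c.sp} }

// restore loads the goroutine's saved context back into the registers.
func restore(c *cpu, g *toyG) { c.pc, c.sp = g.sched.pc, g.sched.sp }

func main() {
	regs := &cpu{pc: 0x1000, sp: 0xc000} // arbitrary, made-up values
	g := &toyG{name: "producer"}
	g0 := &toyG{name: "g0", sched: savedContext{pc: 0x2000, sp: 0xf000}}

	save(regs, g)     // the producer is unscheduled
	restore(regs, g0) // scheduler code now runs on g0's stack
	fmt.Printf("on %s: pc=%#x sp=%#x\n", g0.name, regs.pc, regs.sp)

	restore(regs, g) // later, the producer resumes where it stopped
	fmt.Printf("on %s: pc=%#x sp=%#x\n", g.name, regs.pc, regs.sp)
}
```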
Let’s see how it works in practice.
For the sake of the example, I will use goroutines that communicate through a channel, one that produces data and some that consume them. Here is the code:
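A minimal sketch of such a program could look like the following. The buffer size of 5, the two consumers, and the helper name `run` are assumptions made for illustration, not values taken from the original listing.

```go
package main

import (
	"fmt"
	"sync"
)

// run starts one producer and several consumers that share a buffered
// channel, and returns how many values were consumed in total.
func run(bufSize, consumers int) int {
	c := make(chan int, bufSize)
	var (
		wg    sync.WaitGroup
		mu    sync.Mutex
		total int
	)

	// Producer: sends the even numbers below 100, then closes the channel.
	// Once the buffer is full, the send blocks and the goroutine is parked.
	wg.Add(1)
	go func() {
		defer wg.Done()
		for i := 0; i < 100; i++ {
			if i%2 == 0 {
				c <- i
			}
		}
		close(c)
	}()

	// Consumers: print every value they receive until the channel is closed.
	for n := 0; n < consumers; n++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for v := range c {
				fmt.Println(v)
				mu.Lock()
				total++
				mu.Unlock()
			}
		}()
	}

	wg.Wait()
	return total
}

func main() {
	run(5, 2) // hypothetical buffer size and consumer count
}
```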
The consumers print the even numbers from 0 to 99. We will focus on the first goroutine, the producer, which adds numbers to the buffer. When the buffer is full, the producer blocks on its next send. At this point, Go has to switch to `g0` and schedule another goroutine.
As seen previously, Go first needs to save the current instruction in order to resume the goroutine at the same instruction. The program counter (`PC`) is saved in an internal structure of the goroutine. Here is an example with the previous code:
The instructions and their addresses can be found with the command `go tool objdump`. Here are the instructions of the producer:
The program runs instruction by instruction until it blocks on the channel in the function `runtime.chansend1`. Go saves the current program counter to an internal property of the current goroutine. In our example, Go saves the program counter with the address `0x4268d0`, which is inside the runtime. When `g0` wakes the goroutine up, it resumes at the same instruction, looping on the values and pushing them into the channel. Let’s now move on to stack management during the goroutine switch.
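As an aside on the program counter discussed above: it is simply an address that maps back to a function and a line. The sketch below shows that mapping from user code with `runtime.Caller` and `runtime.FuncForPC`. It is an illustration only; the helper name `whereAmI` is hypothetical, and the address it prints is not the value the scheduler saves when a goroutine is parked.

```go
package main

import (
	"fmt"
	"runtime"
)

// whereAmI reports the program counter of its own call to runtime.Caller,
// together with the function name and line that address maps to.
func whereAmI() (uintptr, string, int) {
	pc, _, line, ok := runtime.Caller(0) // 0 selects this function's frame
	if !ok {
		return 0, "", 0
	}
	fn := runtime.FuncForPC(pc)
	if fn == nil {
		return pc, "", line
	}
	return pc, fn.Name(), line
}

func main() {
	pc, name, line := whereAmI()
	fmt.Printf("pc=%#x func=%s line=%d\n", pc, name, line)
}
```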
Before being blocked, the running goroutine has its original stack. This stack contains temporary memory, such as the goroutine’s local variables.
Then, when the goroutine blocks on the channel, execution switches to `g0`, which runs on its own, bigger stack:
Before the switch, the goroutine’s stack pointer is saved so that its stack can be restored when the goroutine runs again:
We now have a complete view of the different operations involved in a goroutine switch. Let’s see now how it impacts performance.
We should note that some architectures, like `arm`, need to save one more register: `LR`, the link register.
To measure the time a switch can take, we will use the program seen previously. However, it will not give a perfect view of the performance, since the result also depends on the time it takes to find the next goroutine to schedule. The way the goroutine is switched also affects performance: a switch from a function prolog has more operations to do than a switch from a goroutine blocking on a channel.
Let’s summarize the operations we are going to measure:
- `g` blocks on a channel and switches to `g0`:
  - `PC` is saved along with the stack pointer in an internal structure
  - `g0` is set as the running goroutine
  - `g0`’s stack replaces the current stack
- `g0` looks for a new goroutine to run
- `g0` has to switch to the selected goroutine:
  - `PC` and the stack pointer are extracted from its internal structure
  - The program jumps to the extracted `PC` address
Here are some results:
The switches from `g` to `g0` and from `g0` to `g` are the fastest phases. They contain a small, fixed number of instructions, unlike the scheduler, which checks many sources to find the next goroutine to run. That phase could take even more time, depending on the running program.
This benchmark gives an order-of-magnitude estimate of the performance. It should be taken with a pinch of salt; there is no standard tool to measure that. Also, the performance depends on the architecture, the machine (I’m running it on my Mac with a 2.9 GHz Dual-Core Intel Core i5), and the running program.
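For a rough number on your own machine, one option is a ping-pong between two goroutines over unbuffered channels. This is a sketch under stated assumptions, not the method used for the results above: the helper name `pingPong` is hypothetical, `runtime.GOMAXPROCS(1)` is used so that every handoff goes through the scheduler on a single processor, and the measured time includes the channel operations and the scheduler, not only the two switches.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// pingPong bounces a token between two goroutines `rounds` times and
// returns the average time per round trip.
func pingPong(rounds int) time.Duration {
	if rounds <= 0 {
		return 0
	}
	// Assumption: a single P, so each handoff requires a goroutine switch.
	prev := runtime.GOMAXPROCS(1)
	defer runtime.GOMAXPROCS(prev)

	ping, pong := make(chan struct{}), make(chan struct{})
	go func() {
		for range ping {
			pong <- struct{}{}
		}
	}()

	start := time.Now()
	for i := 0; i < rounds; i++ {
		ping <- struct{}{} // blocks until the other goroutine receives
		<-pong             // blocks until the other goroutine replies
	}
	elapsed := time.Since(start)
	close(ping) // lets the helper goroutine exit
	return elapsed / time.Duration(rounds)
}

func main() {
	fmt.Println("average round trip:", pingPong(100000))
}
```

Each round trip involves at least two handoffs between goroutines, so halving the result gives an upper bound on one full switch, scheduler included.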