Infinite Goroutines
All this started while I was working on my first Go application, where I ran into a situation in which I had to write millions of data points to memory (say, key-value pairs into a hashmap) with minimum latency.
This article details how spawning millions of goroutines can degrade performance and increase overall execution time, and how this can be solved with a small, fixed number of goroutines working as a thread pool.
Let's understand this through code.
I have a simple struct type, StudentDirectory, with a map of [string]string type and a mutex lock; the map holds all the key-value pairs. To achieve maximum parallelism I created multiple StudentDirectory instances, equal to the number of available cores (say 4).
type StudentDirectory struct {
	dir  map[string]string
	lock sync.Mutex
}

var studentDirMap map[int]*StudentDirectory
var dirSize int = 4
var wg sync.WaitGroup
My first thought: concurrency in Go is very simple, just do go write(key, value) in a loop and the problem is solved!
Spawning a new goroutine for each task that writes data into one of the student directories:
func prepareStudentDirMap(dirSize int) {
	studentDirMap = make(map[int]*StudentDirectory)
	for i := 0; i < dirSize; i++ {
		dir := &StudentDirectory{
			dir: make(map[string]string),
		}
		studentDirMap[i] = dir
	}
}

func PrepareWriteWithoutChannel() {
	prepareStudentDirMap(dirSize)
	wg = sync.WaitGroup{}
}

func WriteWithoutChannel(key string, value string) {
	wg.Add(1)
	go func() {
		directory := studentDirMap[int(murmur3.Sum32([]byte(key)))%dirSize]
		directory.lock.Lock()
		directory.dir[key] = value
		directory.lock.Unlock()
		wg.Done()
	}()
}

func CloseWriteWithoutChannel() {
	wg.Wait()
}
Latency was 1073 ns/op for a workload of 5M writes. Was it satisfactory? Let's implement the same with the thread-pool model and compare the results.
I replaced this with 4 goroutines, each listening on a channel (task queue) for tasks, plus another goroutine that fills this queue with task details like the key and value to write.
type ChannelRec struct {
	key   string
	value string
}

type StudentDirectory struct {
	dir  map[string]string
	lock sync.Mutex
}

var channelRecs chan *ChannelRec
var studentDirMap map[int]*StudentDirectory
var dirSize int = 4
var wg sync.WaitGroup

func PrepareWriteWithChannel() {
	prepareStudentDirMap(dirSize)
	channelRecs = make(chan *ChannelRec, 90000000)
	wg = sync.WaitGroup{}
	wg.Add(dirSize)
	for j := 0; j < dirSize; j++ {
		go writeToDir(channelRecs, &wg)
	}
}

func WriteWithChannel(key string, value string) {
	cRec := &ChannelRec{
		key:   key,
		value: value,
	}
	channelRecs <- cRec
}

func writeToDir(recs chan *ChannelRec, wg *sync.WaitGroup) {
	for {
		rec, valid := <-recs
		if !valid {
			break
		}
		directory := studentDirMap[int(murmur3.Sum32([]byte(rec.key)))%dirSize]
		directory.lock.Lock()
		directory.dir[rec.key] = rec.value
		directory.lock.Unlock()
	}
	wg.Done()
}

func CloseWriteWithChannel() {
	close(channelRecs)
	wg.Wait()
}
The latency was 759 ns/op for the same workload, an improvement of ~30%.
To understand the behavior, I tested with different workloads and compared the results:

When working on a similar problem where a channel is used as a task queue, the size of the channel has to be chosen carefully. In the example above, the channel is created with a capacity of 90M to avoid blocking the writer goroutine (the one that fills the channel with task details), which keeps more CPU time on executing tasks and reduces context switches.
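The blocking behavior that a large buffer avoids can be seen with a tiny one. A minimal sketch (the size-2 channel and the select/default probe are illustrative, not part of the article's code):

```go
package main

import "fmt"

func main() {
	// A buffered channel accepts sends without blocking only until
	// its capacity is reached; further sends block the producer.
	tasks := make(chan string, 2)
	tasks <- "task-1"
	tasks <- "task-2" // buffer is now full

	// A third plain send would block here, since no consumer is
	// draining the channel. select with a default case lets us
	// detect the would-block condition instead of stalling.
	select {
	case tasks <- "task-3":
		fmt.Println("enqueued task-3")
	default:
		fmt.Println("channel full: producer would block")
	}
}
```

This is exactly the stall an undersized task queue causes: the producer stops feeding the workers, and CPU time shifts from doing work to waiting.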
Learning
- "After a certain point, adding more system resources doesn't speed up the execution of a task" (Amdahl's law): the serial portion of the work limits the achievable speedup.
- Executing each small task on a new thread (or goroutine) will always take longer than delegating it to a thread pool, because creating a thread involves memory allocation, starting and stopping the thread, and so on.
- The maximum parallelism that can be achieved equals the number of virtual cores. I created 4 goroutines because creating more would lead to context switching and further reduce performance. This can be tested by changing the dirSize parameter in the code above.
I tested this on a MacBook Pro with a 2.2 GHz Intel Core i7 processor and 16 GB of memory.
Thanks for taking the time to read this. For more, please check GitHub.
