What is false sharing and how to solve it (using Golang as an example)
Before explaining false sharing, it’s necessary to briefly introduce how caches work in a CPU architecture.
The smallest unit in a CPU cache is a cache line (nowadays a common cache line size is 64 bytes). Thus when the CPU reads a variable from memory, it also reads the variables stored next to it in the same cache line. Figure 1 is a simple example:
When core1 reads variable a from memory, it loads variable b into its cache at the same time. (BTW, I think the main reason the CPU reads variables from memory in batches is spatial locality: when the CPU accesses one variable, it is likely to access the one next to it soon.)
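To make the idea of "variables nearby" concrete, here is a minimal sketch that checks whether two adjacent fields land on the same cache line. It assumes a 64-byte line; the struct `pair`, the constant `cacheLineSize` and the helper `lineIndex` are hypothetical names used only for illustration.

```go
package main

import (
	"fmt"
	"unsafe"
)

// cacheLineSize is an assumption: 64 bytes is a common value, not a guarantee.
const cacheLineSize = 64

// pair is a hypothetical struct with two adjacent 8-byte variables.
type pair struct {
	a uint64
	b uint64
}

// lineIndex returns the index of the cache line that contains addr,
// assuming lines are cacheLineSize bytes long.
func lineIndex(addr uintptr) uintptr {
	return addr / cacheLineSize
}

func main() {
	p := new(pair)
	addrA := uintptr(unsafe.Pointer(&p.a))
	addrB := uintptr(unsafe.Pointer(&p.b))
	fmt.Printf("a at %#x, b at %#x (%d bytes apart)\n", addrA, addrB, addrB-addrA)
	fmt.Println("same cache line:", lineIndex(addrA) == lineIndex(addrB))
}
```

The two fields are only 8 bytes apart, so they will normally share a line; the exact addresses vary from run to run.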
There is a problem with this cache architecture when the same variable exists in the cache lines of two different CPU cores, as in figure 2:
When core1 updates variable a:
It causes a cache miss on core2 when core2 reads variable b, even though variable b was not modified. So core2 has to reload the whole cache line, as in figure 4:
That’s what false sharing is: one core updating a variable forces other cores to refresh their caches as well. And we all know that the CPU reads variables from its cache much faster than from memory. So when variables on the same cache line are constantly accessed and updated by multiple cores, performance suffers significantly.
The common way to solve this problem is cache padding: inserting some meaningless variables between the real ones. That forces each variable to occupy a cache line of its own, so when other cores update other variables, this core does not have to reload its variable.
Let’s use a short piece of Golang code to illustrate false sharing.
Here is a Golang struct with three uint64 fields:
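A minimal sketch of such a struct follows. The type name `NoPad` is borrowed from the benchmark output; the field names `a`, `b`, `c` and the method `IncreaseAll` are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"unsafe"
)

// NoPad holds three counters back to back, with no padding between them.
// Field and method names are assumptions.
type NoPad struct {
	a uint64
	b uint64
	c uint64
}

// IncreaseAll atomically adds 1 to every counter.
func (n *NoPad) IncreaseAll() {
	atomic.AddUint64(&n.a, 1)
	atomic.AddUint64(&n.b, 1)
	atomic.AddUint64(&n.c, 1)
}

func main() {
	n := &NoPad{}
	n.IncreaseAll()
	// The struct is 3 * 8 = 24 bytes, well below a 64-byte cache line.
	fmt.Println(unsafe.Sizeof(*n), n.a, n.b, n.c) // prints "24 1 1 1"
}
```

At 24 bytes, all three counters will usually sit on the same cache line.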
And here is another struct, in which I add uint64 padding between the variables:
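A sketch of the padded variant, assuming a 64-byte cache line: each counter is followed by seven unused uint64 values, so counter plus padding fills exactly 64 bytes. The type name `Pad` is borrowed from the benchmark output; the field layout and the helper `counterOffsets` are assumptions.

```go
package main

import (
	"fmt"
	"unsafe"
)

// Pad places 56 bytes of unused padding after every counter, so consecutive
// counters are 64 bytes apart (assumed cache line size).
type Pad struct {
	a uint64
	_ [7]uint64
	b uint64
	_ [7]uint64
	c uint64
	_ [7]uint64
}

// counterOffsets returns the byte offsets of a, b and c inside Pad.
func counterOffsets() (a, b, c uintptr) {
	var p Pad
	return unsafe.Offsetof(p.a), unsafe.Offsetof(p.b), unsafe.Offsetof(p.c)
}

func main() {
	a, b, c := counterOffsets()
	fmt.Println(a, b, c, unsafe.Sizeof(Pad{})) // prints "0 64 128 192"
}
```

Because the counters are 64 bytes apart, no two of them can fall on the same 64-byte line, whatever the address of the struct itself.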
Then I wrote a simple benchmark:
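The sketch below shows one way such a benchmark could look; it is not necessarily the code that produced the numbers in this post. It assumes the `NoPad` and `Pad` layouts described above and a hypothetical helper `hammer`. Each goroutine updates only its own counter, so any slowdown in the unpadded case comes from the counters sharing a cache line.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"testing"
)

// NoPad and Pad follow the layouts described above (field names assumed).
type NoPad struct{ a, b, c uint64 }

type Pad struct {
	a uint64
	_ [7]uint64
	b uint64
	_ [7]uint64
	c uint64
	_ [7]uint64
}

// hammer starts one goroutine per counter; each goroutine atomically
// adds 1 to its own counter n times.
func hammer(n int, counters ...*uint64) {
	var wg sync.WaitGroup
	for _, c := range counters {
		wg.Add(1)
		go func(c *uint64) {
			defer wg.Done()
			for i := 0; i < n; i++ {
				atomic.AddUint64(c, 1)
			}
		}(c)
	}
	wg.Wait()
}

func BenchmarkNoPad(b *testing.B) {
	var s NoPad
	b.ResetTimer()
	hammer(b.N, &s.a, &s.b, &s.c)
}

func BenchmarkPad(b *testing.B) {
	var s Pad
	b.ResetTimer()
	hammer(b.N, &s.a, &s.b, &s.c)
}

func main() {
	// testing.Benchmark lets the sketch run as a plain program.
	fmt.Println("NoPad:", testing.Benchmark(BenchmarkNoPad))
	fmt.Println("Pad:  ", testing.Benchmark(BenchmarkPad))
}
```

To run it with go test -bench=. instead, the types, the helper and the two Benchmark functions can be moved into a _test.go file and main dropped. The absolute numbers depend heavily on the CPU and on the number of cores available.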
The benchmark results on a 2014 MacBook Air (MBA) were as follows:
$> go test -bench=.
BenchmarkNoPad-4 2000000000 0.07 ns/op
BenchmarkPad-4 2000000000 0.02 ns/op
The benchmark shows that padding reduces the time from 0.07 ns/op to 0.02 ns/op, roughly a 3.5x improvement.
You can also test this in other languages like Java, and I believe you would get similar results.
There are two key points you should know before you apply this in production:
- Check the cache line size of the CPU in your system: the amount of padding you need depends on it.
- More padding means more memory consumption. Benchmark your own scenario and make sure the cost is worth it.
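Both points can be sketched in a few lines. The example below assumes a 64-byte line and derives the padding from a single constant, so it can be adjusted for CPUs with a different line size; `cacheLineSize`, `plainCounter`, `paddedCounter` and `overhead` are hypothetical names. The package golang.org/x/sys/cpu also provides a `CacheLinePad` type for this purpose, which may be a better fit than a hand-written constant.

```go
package main

import (
	"fmt"
	"unsafe"
)

// cacheLineSize is an assumption; adjust it for the target CPU.
const cacheLineSize = 64

// plainCounter is an unpadded 8-byte counter.
type plainCounter struct {
	v uint64
}

// paddedCounter fills the rest of the assumed cache line with unused bytes.
type paddedCounter struct {
	v uint64
	_ [cacheLineSize - unsafe.Sizeof(uint64(0))]byte
}

// overhead returns how many extra bytes n padded counters occupy
// compared with n plain counters.
func overhead(n int) uintptr {
	return uintptr(n) * (unsafe.Sizeof(paddedCounter{}) - unsafe.Sizeof(plainCounter{}))
}

func main() {
	fmt.Println(unsafe.Sizeof(plainCounter{}), unsafe.Sizeof(paddedCounter{})) // prints "8 64"
	fmt.Println(overhead(1000))                                                // prints "56000"
}
```

With these assumptions every padded counter costs 56 extra bytes, which adds up quickly when there are many of them.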
All my sample code is on GitHub.