Cache Thrashing and False Sharing

Ali Gelenler
Feb 7, 2024


Introduction:

Cache thrashing is a flaw that can occur in any caching mechanism: the cache is constantly populated and evicted without letting client processes take advantage of it, so useful data is evicted in an undesirable way.

A common example of cache thrashing is false sharing, which occurs in the CPU cache line mechanism. Before diving into false sharing, let's explore how cache lines work in a common CPU architecture. Then we can look at how to resolve the false sharing issue in Java.

CPU Caches:

[Figure: CPU architecture]

The CPU cache generally consists of three levels: L1, L2, and L3, where L1 is the closest to the CPU and L3 is the closest to main memory. Each CPU has its own cache, which sits between the CPU and main memory.

The L1 cache is the fastest and usually resides on the processor chip. Its capacity is typically between 8 KB and 64 KB, and it uses SRAM, which is faster than the DRAM generally used for main memory.

The L2 cache sits between the L1 cache and the RAM and has a larger capacity, typically between 64 KB and 4 MB.

The L3 cache comes after the L2 cache and is the closest to RAM. It is used to keep data shared between multiple cores; on older systems it sometimes lived on the motherboard rather than the processor, while modern processors typically place it on the CPU die.

When fetching data, the L1 cache is looked up first, then the L2 cache, then the L3 cache; if the data cannot be found in any CPU cache, it is read from main memory.

In a multi-core environment, CPUs use cache coherence protocols (such as MESI) to keep the per-core caches consistent. Analogous strategies exist in application-level caching: a replicated cache holds all keys on all nodes, while a distributed cache holds each key on only some of the nodes, providing redundancy and fault tolerance with a more scalable solution.

Cache Lines:

When the CPU reads data from main memory or from a cache level, it doesn't read a single byte; instead, it reads a block of bytes, usually 64 bytes. We call this block of bytes a cache line. A cache line will most likely store more than one variable, since it consists of multiple bytes.

If the CPU needs to access a variable that was already fetched as part of a previous cache line, the read will be faster. However, since the CPU maintains consistency on a cache line basis, if any single byte of a cache line is changed, the whole cache line is invalidated, and this invalidation takes place on every CPU that holds a copy of the line. If different CPUs need to work on variables shared on the same cache line, this creates cache thrashing and causes false sharing.
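To build intuition for cache lines, here is a minimal sketch (my own illustration, not from this article) that writes every element of a large int array versus every 16th element. Assuming a 64-byte cache line, 16 ints fit in one line, so both loops load roughly the same number of cache lines and the timings end up much closer than the 16x difference in work would suggest.

public class CacheLineDemo {
    // 16M ints = 64 MB; with 64-byte cache lines, 16 ints share one line
    private static final int[] DATA = new int[16 * 1024 * 1024];

    public static void main(String[] args) {
        long t1 = System.nanoTime();
        for (int i = 0; i < DATA.length; i++) {
            DATA[i] *= 3; // touches every int: 16 writes per cache line
        }
        System.out.println("every element: " + (System.nanoTime() - t1) / 1_000_000 + " ms");

        long t2 = System.nanoTime();
        for (int i = 0; i < DATA.length; i += 16) {
            DATA[i] *= 3; // 1/16th of the writes, but the same cache lines loaded
        }
        System.out.println("every 16th:    " + (System.nanoTime() - t2) / 1_000_000 + " ms");
    }
}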

False Sharing:

False sharing occurs when multiple CPUs or CPU cores work on variables stored on the same cache line, while the variable each CPU is actually working on is different. The cache line is invalidated for a CPU even though nothing changed in the variable it is interested in.

For example, let's say we have 2 threads that operate on 2 different cores in a multi-core environment. If we have a shared variable X that is read by thread 1 and thread 2, both threads will pull that variable into their CPU cache lines. If thread 1 then updates the variable X, thread 1's cache line must be written back and thread 2's copy of the cache line invalidated. This is expected behaviour, and it is called true sharing. If we make the variable X volatile, we can guarantee this behaviour; for non-volatile variables, since we do not force the CPU to make the write visible across cores, we may not see the invalidation.
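As a minimal sketch of true sharing (again my own illustration), one thread spins reading a volatile field while another thread writes it; the write invalidates the reader's cached line, and volatile guarantees the new value becomes visible:

public class TrueSharingDemo {
    static volatile int x = 0; // the shared variable X from the description above

    public static void main(String[] args) throws InterruptedException {
        Thread reader = new Thread(() -> {
            while (x == 0) { /* spin until the writer's update becomes visible */ }
            System.out.println("Reader observed x = " + x);
        });
        reader.start();
        Thread.sleep(100); // let the reader cache the line holding x
        x = 42;            // invalidates the reader's copy of the cache line
        reader.join();
    }
}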

Now say we have an additional variable Y in the same cache line as X, thread 1 updates X, and thread 2 only wants to operate on Y. Both threads hold the cache line containing X and Y, since memory is read in 64-byte blocks. In that case thread 2 is told to invalidate its cache line because of the change to variable X, which thread 2 is not interested in. This situation is called false sharing.

If false sharing happens many times, our system can suffer performance problems, since a CPU must wait for the cache line to be reloaded when it could have performed many iterations in that time. This waiting is called a stall, and it can cause silent performance problems.

To solve this performance problem and prevent false sharing, we can try the @Contended annotation, which arrived with Java 8.

@Contended Annotation:

With JEP 142, we can annotate fields or classes that we think are candidates for false sharing with the @Contended annotation. JEP 142 mainly works by adding padding at class loading time, and it touches the allocation code (to make sure objects are allocated correctly aligned), the JIT (so it knows which allocations need to be aligned), and the GC (to make sure objects remain aligned after garbage collection).

This annotation is a hint for the JVM to place the annotated fields on different cache lines. The result may be padding or a combination of padding with an aligned allocation. The side effect is, of course, increased memory usage, since we are spending additional space on padding.
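It's worth noting the classic manual alternative to the annotation: surrounding a hot field with unused padding fields so it ends up on its own cache line. The sketch below (my own, not from this article) is only a best-effort trick, since the JVM is free to reorder fields and may defeat it:

public class ManuallyPaddedModel {
    // 7 longs = 56 bytes of padding on each side of the hot field;
    // together with the field itself this spans a 64-byte cache line.
    // The JVM may reorder or even eliminate unused fields, so unlike
    // @Contended this gives no guarantee.
    long p1, p2, p3, p4, p5, p6, p7;
    volatile long value;
    long q1, q2, q3, q4, q5, q6, q7;
}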

Note that the @Contended annotation does not work on the user classpath by default; it only works for classes on the boot classpath. So we need to add the -XX:-RestrictContended VM argument on JVM startup. This is because it is an internal API that shouldn't really be used unless a real performance benefit has been proven.
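For example, the ContendedExample class used later in this article would be started like this (assuming it is compiled into a falsesharing package, as the JOL output below suggests):

java -XX:-RestrictContended falsesharing.ContendedExample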

public class ContendedModel {
    @jdk.internal.vm.annotation.Contended
    volatile int int1;
    @jdk.internal.vm.annotation.Contended
    volatile int int2;
}

In the above example we want to keep int1 on one padded cache line and int2 on a different padded cache line.

If we want to keep two variables on the same cache line, we can force it by giving the @Contended annotation a group value, as shown below.

public class ContendedModel {
    @jdk.internal.vm.annotation.Contended("ContendedModel_group1")
    volatile int int1;
    @jdk.internal.vm.annotation.Contended("ContendedModel_group1")
    volatile int int2;
}

Please note that to be able to use this internal @Contended annotation, we need to add the "--add-exports java.base/jdk.internal.vm.annotation=ALL-UNNAMED" option to the compiler options.
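If the project is built with Maven (as the JOL dependency below suggests), a sketch of that compiler option in the maven-compiler-plugin configuration looks like this:

<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-compiler-plugin</artifactId>
    <configuration>
        <compilerArgs>
            <arg>--add-exports</arg>
            <arg>java.base/jdk.internal.vm.annotation=ALL-UNNAMED</arg>
        </compilerArgs>
    </configuration>
</plugin>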

We can use the JOL (Java Object Layout) tool to see whether the @Contended annotation works. Add the following dependency for that.

<dependency>
    <groupId>org.openjdk.jol</groupId>
    <artifactId>jol-core</artifactId>
    <version>0.9</version>
</dependency>

Then run the ContendedMemoryLayout class below.

import org.openjdk.jol.info.ClassLayout;
import org.openjdk.jol.vm.VM;

public class ContendedMemoryLayout {
    public static void main(String[] args) {
        System.out.println(VM.current().details());
        System.out.println(ClassLayout.parseClass(ContendedModel.class).toPrintable());
    }
}

By default it prints the following output along with the VM details. As we can see, no padding took place; the only space loss reported in the last line is the 4-byte external alignment gap.

...
falsesharing.ContendedModel object internals:
OFF  SZ   TYPE DESCRIPTION                  VALUE
  0   8        (object header: mark)        N/A
  8   4        (object header: class)       N/A
 12   4    int ContendedModel.int1          N/A
 16   4    int ContendedModel.int2          N/A
 20   4        (object alignment gap)
Instance size: 24 bytes
Space losses: 0 bytes internal + 4 bytes external = 4 bytes total

Now let's add the -XX:-RestrictContended JVM argument and run the same code.

falsesharing.ContendedModel object internals:
OFF   SZ   TYPE DESCRIPTION                  VALUE
  0    8        (object header: mark)        N/A
  8    4        (object header: class)       N/A
 12  128        (alignment/padding gap)
140    4    int ContendedModel.int1          N/A
144  128        (alignment/padding gap)
272    4    int ContendedModel.int2          N/A
276    4        (object alignment gap)
Instance size: 280 bytes
Space losses: 256 bytes internal + 4 bytes external = 260 bytes total

This time we see a space loss of 256 bytes caused by the padding that the @Contended annotation produced. The annotation assures us that the int1 and int2 fields of the ContendedModel class are placed on different cache lines by applying padding around each of them.

After checking the effect of the @Contended annotation on the memory layout, let's run an example that manipulates the two variables of a ContendedModel object from different threads and observe the effect of false sharing.

The ContendedExample class below creates a ContendedModel object and passes it to 2 different threads. The first thread repeatedly sets a new value on the int1 variable in a loop, while the second thread does the same on the int2 variable.

public class ContendedExample {
    private static final long NUM_OF_ITERATION = 10_000_000_000L;

    public static void main(String[] args) {
        // Both threads share the same ContendedModel instance,
        // but each writes to a different field of it.
        ContendedModel contendedModel = new ContendedModel();
        Thread thread1 = new Thread(new Thread1Runnable(contendedModel));
        Thread thread2 = new Thread(new Thread2Runnable(contendedModel));
        thread1.start();
        thread2.start();
    }

    public static class Thread1Runnable implements Runnable {
        ContendedModel contendedModel;

        public Thread1Runnable(ContendedModel contendedModel) {
            this.contendedModel = contendedModel;
        }

        public void run() {
            long start = System.nanoTime();
            long i = NUM_OF_ITERATION;
            while (--i != 0) {
                this.contendedModel.int1 = (int) i; // writes only int1
            }
            System.out.println("End of thread 1! Last value of int1 is " + contendedModel.int1
                    + ". It took " + (System.nanoTime() - start) / 1_000_000_000.0 + " seconds!");
        }
    }

    public static class Thread2Runnable implements Runnable {
        ContendedModel contendedModel;

        public Thread2Runnable(ContendedModel contendedModel) {
            this.contendedModel = contendedModel;
        }

        public void run() {
            long start = System.nanoTime();
            long i = NUM_OF_ITERATION;
            while (--i != 0) {
                this.contendedModel.int2 = (int) i; // writes only int2
            }
            System.out.println("End of thread 2! Last value of int2 is " + contendedModel.int2
                    + ". It took " + (System.nanoTime() - start) / 1_000_000_000.0 + " seconds!");
        }
    }
}

I will first run this class after removing the @Contended annotations from the ContendedModel class, as shown below.

public class ContendedModel {
    volatile int int1;
    volatile int int2;
}

This produces the following log output. As you can see, it took around 12 seconds for each thread to count down from 10_000_000_000L to 0 while writing the int1 and int2 variables.

End of thread 2! Last value of int2 is 1. It took 11.878212417 seconds!
End of thread 1! Last value of int1 is 1. It took 11.913652584 seconds!

Now I will put the @Contended annotation on both the int1 and int2 variables, as shown below.

public class ContendedModel {
    @jdk.internal.vm.annotation.Contended
    volatile int int1;
    @jdk.internal.vm.annotation.Contended
    volatile int int2;
}

This time, after running the ContendedExample class, I get the following log output. As you can see, it now took around 4 seconds for each thread to count down from 10_000_000_000L to 0 while writing the int1 and int2 variables.

End of thread 2! Last value of int2 is 1. It took 4.019336625 seconds!
End of thread 1! Last value of int1 is 1. It took 4.019737542 seconds!

Here we can clearly see the effect of false sharing. In the first run, the int1 and int2 variables were held on the same cache line, causing false sharing and preventing the CPU caches from being used effectively while the variables were being updated. In the second run, the @Contended annotation applied padding so that each variable was held on its own cache line, and the threads no longer affected each other while working on their different variables, int1 and int2.

Please note that these results are very platform dependent; I ran these examples on an Apple M1 Pro machine. However, since false sharing can happen on any multi-core hardware, a similar performance benefit should be observed on other architectures when the @Contended annotation is used as in the example above.

Another thing to note here is the use of the volatile keyword on the int1 and int2 variables. With volatile variables, the @Contended annotation improves performance dramatically compared to non-volatile variables. This is because a volatile variable is forced to be read from and written to memory on every access. That memory can be a CPU cache or main memory, depending entirely on the hardware and the cache coherence protocol it uses. Because of this requirement, it is slower to update and read a volatile variable. With the @Contended annotation, false sharing and the undesirable invalidation of cache lines are prevented, so the variable can be read from the cache itself instead of main memory, and the overall performance improves.

Note that the @Contended annotation is already used in many JDK classes, such as Thread, ForkJoinPool, and ConcurrentHashMap. In most cases it probably won't be necessary to use it in client Java code, but knowing its effect can still help you identify silent performance problems.
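As a practical illustration, LongAdder is another such class: its internal counter cells are padded with @Contended, so heavily contended increments from multiple threads land on separate cache lines. A small usage sketch:

import java.util.concurrent.atomic.LongAdder;

public class LongAdderExample {
    public static void main(String[] args) throws InterruptedException {
        LongAdder counter = new LongAdder();
        Runnable task = () -> {
            for (int i = 0; i < 1_000_000; i++) {
                counter.increment(); // contending threads write to separate,
                                     // padded cells instead of one hot field
            }
        };
        Thread t1 = new Thread(task);
        Thread t2 = new Thread(task);
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        System.out.println("Total: " + counter.sum()); // prints 2000000
    }
}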

Conclusion:

False sharing is an example of cache thrashing; it occurs when a CPU cache line is invalidated because variables are unintentionally held on the same cache line. With the @Contended annotation, the JVM takes care of the heap allocation and informs the JIT and GC about the contended fields, so that undesirable cache line invalidation can be prevented. Depending on the CPU cache architecture and the cache coherence protocol, preventing false sharing can bring important performance benefits.
