Effects of CPU caches

The other day at the office, we were discussing the choice between an array and a linked list. Frankly, I’d always thought that a linked list was a great alternative for complex list manipulations and that, apart from the cost of the pointers linking the elements together, iterating over a linked list cost no more than iterating over an array.

But things are not that simple.

Lately, I have become very interested in how memory works. So I decided to experiment with what I had read about CPU caches and evaluate the cost of a linked list depending on when (or where) each element is allocated.

I didn’t invent anything. Everything I talk about here is described, far better, in Ulrich Drepper’s paper about memory. I followed the same experiments, and if you want to dive deeper into this subject I highly recommend reading the entire paper. Keep in mind that the absolute numbers are not the most important part of my explanations; they are shown mainly to point out the differences.

I don’t pretend to understand everything involved in each test I ran; I have more to learn. What mattered most to me was first to experiment with these concepts myself, and then to become aware of the effects of CPU caches.

Experiment’s foundations

All experiments were built the same way: I created a linked list of structs in Go and iterated over it. With go test I could easily launch a benchmark and observe the results. Benchmarks from Go’s testing package report a time (ns) per loop iteration; here, since each iteration visits one element, that number is a time per element.

When playing with CPU caches, choosing the size of the working set (the total amount of memory used at one time) is essential, as is the size of each element.

Below is the foo struct I used as the node of the linked list. The first member of this struct is an array of int64 (8 bytes each), which let me control the size of each item of the linked list.

For example, with N=7 elements in the array I’d end up with a struct of 64 Bytes (on a 64-bit machine the member n takes 8 bytes as well).
const N = 7           // for an element size of 64 Bytes
const S = 8 + (8 * N) // the struct size in bytes

type foo struct {
    i [N]int64 // lets us control the size of each element
    n *foo     // the next element
}

Then choosing my working set size (WSS) was easy: I just had to tune the number of elements in the list.

For instance, to work with a WSS of 2¹⁰ Bytes, I iterate over 16 elements if each element takes 64 Bytes (N=7).
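The arithmetic above can be sketched as a small helper. This is only an illustration: the names elementSize and elementsFor are hypothetical (they are not part of the benchmark code), and a 64-bit machine is assumed so that the pointer takes 8 bytes.

```go
package main

import "fmt"

// elementSize returns the size in bytes of one list element holding
// n int64 values plus one 8-byte pointer (64-bit machine assumed).
func elementSize(n int) int {
	return 8 + 8*n
}

// elementsFor returns how many elements of that size fit in a
// working set of 2^exp bytes.
func elementsFor(exp uint, n int) int {
	return (1 << exp) / elementSize(n)
}

func main() {
	fmt.Println(elementsFor(10, 7)) // 16 elements of 64 Bytes
	fmt.Println(elementsFor(10, 0)) // 128 elements of 8 Bytes
}
```

Keeping the WSS as a power-of-two exponent makes it easy to sweep the sizes in a loop and to line them up with the cache sizes listed further down.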

Because the point of this experiment was to iterate over a list by accessing memory, and to measure the time of each iteration, I had to be sure that each iteration executed the most stripped-down set of instructions possible. It is hard to get shorter than one CPU instruction for the loop body:

func loop(el *foo, b *testing.B) {
    b.ResetTimer()
    for it := 0; it < b.N; it++ {
        // for N=0 : movq (CX), CX
        // for N=7 : movq 56(CX), CX
        el = el.n
    }
}

To run tests in several situations, I wanted to control the way the memory was allocated. I chose to allocate large arrays of the struct foo, as large as needed for the memory to be spread out. Then I picked as many elements of the allocated array as necessary to reach the targeted WSS and linked them in the desired order (sequential, random, densely packed or not…).

// compute the number of elements needed to reach the WSS
// workingSetSize is the base-2 exponent: the WSS is 2^workingSetSize bytes
func computeLen(workingSetSize uint) int {
    return (1 << workingSetSize) / S
}

// here we make a simple contiguous array, and link all elements
// the resulting elements are laid out sequentially, densely packed
func makeContinuousArray(workingSetSize uint) *foo {
    l := computeLen(workingSetSize)
    a := make([]foo, l)
    for i := 0; i < l; i++ {
        a[i].n = &a[(i+1)%l] // the last element links back to the first
    }
    return &a[0]
}
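As a sanity check on this construction, here is a minimal self-contained sketch that builds the same kind of ring and counts the hops needed to come back to the head. The helper names buildSequential and cycleLen are hypothetical and only serve the illustration; the element count is passed directly instead of being derived from a WSS.

```go
package main

import "fmt"

const N = 7 // 7 * 8 Bytes of padding + an 8-Byte pointer = 64 Bytes

type foo struct {
	i [N]int64 // padding that sets the element size
	n *foo     // the next element
}

// buildSequential links l densely packed elements into a ring.
func buildSequential(l int) *foo {
	a := make([]foo, l)
	for i := range a {
		a[i].n = &a[(i+1)%l]
	}
	return &a[0]
}

// cycleLen counts the hops needed to come back to head.
func cycleLen(head *foo) int {
	hops := 1
	for el := head.n; el != head; el = el.n {
		hops++
	}
	return hops
}

func main() {
	head := buildSequential(16) // 16 * 64 Bytes, a WSS of 2^10 Bytes
	fmt.Println(cycleLen(head))
}
```

Because the list is a ring, the benchmark loop can run for any b.N without ever reaching a nil pointer, and the walk touches exactly the working set and nothing else.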

Finally, to interpret the numbers correctly, it is important to know the system on which the benchmarks were run. I ran the tests on a MacBook Pro with the following characteristics (the sysctl command can give you all this information):

Processor: 2.4 GHz Intel Core i5
L1d size: 32 KiB (2¹⁵ Bytes)
L2 size: 256 KiB (2¹⁸ Bytes)
L3 size: 3 MiB (between 2²¹ and 2²² Bytes)
LineSize: 64 Bytes
TLB: 1024 shared elements

Observing the CPU caches

As an introduction, I wanted to see for myself the effect of the different CPU caches (L1d, L2 and L3).

The goal here was to iterate over a linked list of densely packed elements (see above) and observe the difference between working set sizes. I did this with several values of N.

Depending on the chosen size, the working set can fit partially or completely into the different CPU cache levels.

Benchmark results for N=0

Above we can observe the result of the benchmark for N=0. As you can see, the result is not very telling. The reason is that the CPU does a great job with prefetching. The processor loads a 64-Byte cache line and, in anticipation of consecutive memory being used, prefetches the next line. So while we iterate over the loaded items (8 items per line with N=0), the next cache line is already halfway loaded.

Benchmark results for N=7

Let’s now observe the result with N=7. Each element occupies an entire cache line, so the prefetch effect is considerably reduced because each iteration requires loading a new cache line.

Figure 1 — Sequential Read for Several Sizes

Above (Figure 1), the benchmarks from N=1 to N=31 are plotted on a chart. For each line we can easily deduce the sizes of the different cache levels from the plateaus where the curve flattens.

As we can see, starting from a WSS of 2²² Bytes our working set no longer fits into the L3 cache, so the lines have to be fetched from main memory. This observation surely explains the slowdown, but it doesn’t explain why the slowdown seems to grow exponentially as N increases.

The hardware prefetch is not really efficient when N>7, because each list element is spread over several cache lines; in addition, prefetching cannot cross page boundaries. As you may know, virtual memory is split into pages of 4 KiB (2¹² Bytes). This means that for N=7 the processor crosses a page boundary every 64 elements, and for N=31 every 16 elements.
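The page-crossing figures follow directly from the element size. A small sketch, assuming 4 KiB pages and a 64-bit machine (the function name elementsPerPage is hypothetical):

```go
package main

import "fmt"

const pageSize = 4096 // 4 KiB pages assumed

// elementsPerPage returns how many elements made of n int64 values
// plus one 8-byte pointer fit in a single page.
func elementsPerPage(n int) int {
	return pageSize / (8 + 8*n)
}

func main() {
	for _, n := range []int{0, 7, 15, 31} {
		fmt.Printf("N=%d: a page boundary every %d elements\n", n, elementsPerPage(n))
	}
}
```

The larger the element, the fewer elements the prefetcher can cover before it has to stop at a page boundary.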

But such a serious drop in performance has to come from something else as well. We have to take into account another cache we haven’t talked about yet: the TLB (Translation Lookaside Buffer). You’re certainly aware that the processor has to translate each virtual memory address into a physical address, page by page. Because these translations are expensive, the processor uses the TLB to cache them and avoid redoing the work every time. For efficiency reasons the TLB is kept really small: 1024 entries on my MacBook, and with a WSS of 2²² Bytes we just reach this limit (2²²/2¹² = 1024 pages).
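To see where a densely packed working set starts to exceed the TLB, the page count can be compared with the number of TLB entries. The sketch below assumes 4 KiB pages, a page-aligned array and the 1024 entries quoted above; pagesTouched is a hypothetical helper.

```go
package main

import "fmt"

const pageExp = 12 // 4 KiB pages, i.e. 2^12 Bytes (assumption)

// pagesTouched returns the number of pages covered by a densely
// packed, page-aligned working set of 2^wssExp bytes.
func pagesTouched(wssExp uint) int {
	if wssExp <= pageExp {
		return 1
	}
	return 1 << (wssExp - pageExp)
}

func main() {
	const tlbEntries = 1024 // value reported for the test machine
	for exp := uint(20); exp <= 24; exp++ {
		p := pagesTouched(exp)
		fmt.Printf("WSS=2^%d Bytes: %d pages, fits in the TLB: %t\n", exp, p, p <= tlbEntries)
	}
}
```

With these assumptions, 2²² Bytes is the last size whose pages can all be cached by the TLB; every larger working set needs more translations than the TLB can hold.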

So, putting all those elements together (bigger structs, page crossings, TLB misses), we can understand the heavy impact on performance with N=31 and a WSS>=2²² Bytes.

That said, looking at the assembly instruction movq, I’m not completely certain that when N>7 the entire memory of the structure is loaded by the CPU; I’m actually pretty sure it is not. The working set is initialized at the beginning of each benchmark, though.
To improve this experiment I should probably access some of the elements inside the array.
In the following experiments I’ll use only N<=7.

TLB Influence

In the previous section we observed that overflowing the TLB had a terrible impact on performance.

However, we were in an optimal situation: the items were allocated contiguously, which is maybe not very realistic. Indeed, the elements of a linked list are seldom packed together; more often they are scattered across various memory locations.

I took another extreme scenario, not realistic either, but one with which I could magnify the TLB influence.

Instead of arranging items contiguously, I placed one item per page, and to avoid any prefetching attempt I took N=7. Now, for each item, the processor loads a new cache line and crosses into a new page. No doubt the benchmark results will be awful.

The way I constructed the linked list is quite different here.
I wanted to allocate as many pages as there are elements in the linked list for the tested WSS. For example, with a WSS of 2¹⁰ Bytes I had to allocate 16 pages, and 262,144 pages with a WSS of 2²⁴ Bytes (as a reminder, N=7). As a side note, if we do the math, allocating 262,144 pages means allocating 1 GiB.

To perform the allocation, I filled the desired number of pages with the struct foo (I needed 64 instances of struct foo to fill one page).

Then I created the linked list by picking the first element of each page.

Below you can find the code that constructs the linked list for this test:

const PAGE_SIZE = 4096 // the page size in bytes

func dispatchOnePerPage(workingSetSize uint) *foo {
    l := computeLen(workingSetSize)
    // compute how many items fit in one page
    d := PAGE_SIZE / S
    // compute the number of items to allocate in order to fill l pages
    ls := d * l
    // allocate pages
    a := make([]foo, ls)
    // link to the next element on the next page
    for i := 0; i < l; i++ {
        a[i*d].n = &a[((i+1)%l)*d]
    }
    return &a[0]
}
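One way to convince yourself that consecutive elements really sit one page apart is to compare their addresses. The following sketch is an illustration under stated assumptions, not the benchmark code: it assumes a 64-bit machine (so that foo is 64 Bytes), the helper names onePerPage and stride are hypothetical, and it relies on the unsafe package only to read addresses.

```go
package main

import (
	"fmt"
	"unsafe"
)

const N = 7
const pageSize = 4096 // 4 KiB pages assumed

type foo struct {
	i [N]int64
	n *foo
}

// onePerPage allocates l pages' worth of elements and links the
// first element of each page-sized chunk into a ring.
func onePerPage(l int) *foo {
	d := pageSize / int(unsafe.Sizeof(foo{})) // elements per page
	a := make([]foo, d*l)
	for i := 0; i < l; i++ {
		a[i*d].n = &a[((i+1)%l)*d]
	}
	return &a[0]
}

// stride returns the distance in bytes between an element and its successor.
func stride(el *foo) uintptr {
	return uintptr(unsafe.Pointer(el.n)) - uintptr(unsafe.Pointer(el))
}

func main() {
	head := onePerPage(16)
	fmt.Println(unsafe.Sizeof(foo{}), stride(head)) // 64 4096 on a 64-bit machine
}
```

Note that make does not promise a page-aligned slice, so an element is not necessarily at the very start of a page. What matters for the experiment is that two consecutive elements are 4096 Bytes apart and therefore always live on different pages.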

In the following figure (Figure 2), we can easily observe the performance loss caused by TLB misses (orange curve). The slowdown becomes worse from a WSS of 2¹⁶ Bytes: with one 64-Byte element per page, this working set spans 1024 pages, so at this point we certainly reach the TLB size limit (1024 entries).

Figure 2 — TLB influence

For comparison, the green curve represents the sequential read of the first section at N=7.

Random memory access

Now that I had played with two extreme cases (the elements of a linked list either densely packed or spread over different pages), I wanted to experiment with linked lists in more realistic situations.

The goal was to perform random walks, with different orderings, to point out the differences. Here again, to limit the effect of the processor’s prefetch optimisation, I worked with N=7.

For the first experiment I simply shuffled the entire array. As I expected, the performance was catastrophic compared to a sequential walk. This first experiment is plotted as the red curve in the following figure (Figure 3).

func makeShuffledArray(workingSetSize uint) *foo {
    l := computeLen(workingSetSize)
    a := make([]foo, l)
    // we create an array of permutations
    // using the same seed to produce the same result each time
    rand.Seed(42)
    p := rand.Perm(l)
    // shuffle the array
    for i := 0; i < l; i++ {
        a[p[i]].n = &a[p[(i+1)%l]]
    }
    return &a[p[0]]
}
Figure 3 — Sequential vs Random
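A property worth checking on the shuffled list is that it is still a single ring visiting every element; otherwise the walk would silently cover a smaller working set. Here is a minimal sketch of such a check. The names shuffledRing and visits are hypothetical, and it swaps rand.Seed for a local rand.New(rand.NewSource(seed)) generator, which is a choice of this sketch rather than what the benchmark uses.

```go
package main

import (
	"fmt"
	"math/rand"
)

const N = 7

type foo struct {
	i [N]int64
	n *foo
}

// shuffledRing links l elements into a ring that follows a random
// permutation of the array indices.
func shuffledRing(l int, seed int64) *foo {
	a := make([]foo, l)
	p := rand.New(rand.NewSource(seed)).Perm(l)
	for i := 0; i < l; i++ {
		a[p[i]].n = &a[p[(i+1)%l]]
	}
	return &a[p[0]]
}

// visits returns the number of distinct elements reachable from head.
func visits(head *foo) int {
	seen := map[*foo]bool{head: true}
	for el := head.n; !seen[el]; el = el.n {
		seen[el] = true
	}
	return len(seen)
}

func main() {
	fmt.Println(visits(shuffledRing(1024, 42))) // 1024: every element is on the ring
}
```

Linking p[i] to p[i+1] along a permutation always yields one cycle through all elements, which is why the construction is safe for any seed.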

The next experiments did not go that well. I knew that the bad performance of the previous results, especially when the L3 overflows, was due to TLB misses as well as a high rate of L3 misses. To lower the TLB effect, still following the paper this article is based on, I divided the array into equal contiguous regions for the next experiments and shuffled each region individually. The size of a region is measured in number of pages.

const REGION_SIZE = 128 // the region size, in pages

func makeArrayShuffledPerRegion(workingSetSize uint) *foo {
    l := computeLen(workingSetSize)
    a := make([]foo, l)
    // compute how many foo we need to fill a region
    s := REGION_SIZE * (PAGE_SIZE / S)
    // initialize the permutation array
    p := make([]int, l)
    for i := range p {
        p[i] = i
    }
    rand.Seed(42)
    // compute the permutations inside each region
    for i := 0; i < l; i++ {
        r := i / s // the region index
        j := r*s + rand.Intn(i%s+1)
        p[i], p[j] = p[j], p[i]
    }
    // shuffle the array
    for i := 0; i < l; i++ {
        a[p[i]].n = &a[p[(i+1)%l]]
    }
    return &a[p[0]]
}

I first ran a test with a region size of 64 pages, and surprisingly the result was worse than I expected. I ran a lot of tests, varying the region size, until I found a breaking point at around 34 pages. I even tested with a region size of 1 page; in this situation the TLB misses have exactly the same impact as with a sequential walk, since we cross pages at exactly the same moments. You can observe all these results in the figure above.

But I couldn’t figure out what was going on here, and why the performance fell apart around 34 pages. I needed more tools to measure and understand the situation.

I had heard about perf, a great tool for capturing CPU events, which would certainly have helped me; unfortunately perf doesn’t work on OS X. Since virtualization tools did not let me capture CPU events, I decided to run some tests on another machine running Linux, with an obviously different configuration.

I could surely have re-run all the benchmarks on this new system and updated all my charts, but to be honest, I didn’t have the courage. I just decided to analyse the CPU events with a WSS of 2²⁸ Bytes, large enough to overflow the L3 on this machine as well. Keep in mind that on this machine the TLB has 512 entries.

I’ll try to be brief on this part.
To avoid measuring the initialization phase, which allocates the memory needed for the benchmark, I wrote a new program that, at each run (one run per use case), walks indefinitely through the linked list. Then, once I was sure that the initialization was done and the loop had started (a println was enough for this purpose), I ran the following command, which collects the CPU events for 10 seconds.

perf stat -e dTLB-load,dTLB-load-misses,LLC-load,LLC-load-misses,LLC-prefetches,LLC-prefetch-misses,L1-dcache-loads,L1-dcache-misses,cycles:u,instructions:u -p PROCID sleep 10

I gathered all the information in a spreadsheet and first plotted the cache misses on the graph below (Figure 4).

Figure 4 — Cache Misses

Fair enough: we can easily observe that starting from a region size of one page, the misses on the Last Level Cache (L3 in our case) increase, surely due to the randomization, and that starting from a region size of 512 pages (the number of entries in the TLB on this machine) the TLB misses increase.
This surely explains the shape of some curves in Figure 3, but not what I was looking for: the breakdown around a region size of 34 pages.

So I decided to plot the number of events themselves. The number of instructions executed (and so the number of events emitted) during the 10 s of measurement was completely different between, for instance, a sequential and a random walk, so to be able to compare them I had to use a different scale.

On the following chart, the bars, which represent the number of events, are expressed in PTI (rate Per Thousand Instructions). To compare the performance of each test case, I also plotted the IPC, the number of instructions per cycle, on the right axis using a logarithmic scale.
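Both scales are simple ratios of the raw counters reported by perf. A small sketch of the two formulas follows; the function names pti and ipc are hypothetical, and the counter values in main are made up purely for illustration (they are not measurements from this article).

```go
package main

import "fmt"

// pti normalizes an event count to a rate per thousand instructions.
func pti(events, instructions uint64) float64 {
	return float64(events) * 1000 / float64(instructions)
}

// ipc returns the number of instructions executed per cycle.
func ipc(instructions, cycles uint64) float64 {
	return float64(instructions) / float64(cycles)
}

func main() {
	// Made-up counter values, for illustration only.
	const instructions, cycles, llcMisses = 2_000_000, 8_000_000, 50_000
	fmt.Println(pti(llcMisses, instructions)) // 25
	fmt.Println(ipc(instructions, cycles))    // 0.25
}
```

Normalizing by the instruction count removes the effect of a slow walk simply executing fewer iterations in 10 seconds, while the IPC shows how much of the time the processor spends waiting.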

Figure 5 — Sequential Vs Random CPU Events
Honestly, I don’t completely understand all the values (the measurements given by perf depend a lot on the CPU family), but I’m still working on it!

What I can see is that starting from a region size of 64 pages, there is no sign of prefetching at all. Remember, with N=7 we have 64 elements per page, so with a shuffled region of 64 pages there is a great chance that we have to cross a page boundary for almost every element, completely defeating the hardware prefetch.
My guess is that at some point, around a region size of 32 pages (meaning a page crossing every 2 elements), the processor disables prefetching entirely.

Conclusion

So far, I have only run experiments involving read operations. Of course, it would be highly interesting to run more tests with write operations or even multi-threaded operations.

However, I can already draw up some additional questions to ask myself when writing code. Is it worth using a pre-allocated array, if I can anticipate the size, in order to take advantage of prefetching and avoid page crossings? Can I avoid or reduce TLB misses while iterating? Can I avoid spreading lists over too many pages? More generally, can I benefit from cache-line loading by packing elements into one line? Can I measure and control my working set size to make clever use of CPU caches?

To conclude, I would say that writing optimized code is hard, endless and involves a lot of knowledge, but it is very exciting ;) There is still so much to understand!

Again, I’m not an expert. I’m eager for any comments that improve or correct my understanding of all this! 😊
Thanks for reading, I hope you enjoyed it as much as I did!
