Which code is faster? Part 2

Marcus Zetterquist
Floyd Programming Language
4 min read · Mar 16, 2019

Disclaimer: If you find errors, don’t just laugh — please help me understand. I want to get this right.

Another disclaimer: this post is about getting maximum performance in a small and hugely critical piece of code. Like the inner loop of a game’s software renderer or inside a hotspot in a hash map implementation. This excludes 99.9% of the code in most projects. These critical functions are often leaf functions — they don’t call other functions from their hot inner loops.

Here is the code again:
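
A sketch of what the two benchmarks look like, based on the descriptions in this post; the element counts, the uint64_t accumulator and the benchmark::DoNotOptimize call are my assumptions rather than the original source:

#include <benchmark/benchmark.h>
#include <cstdint>
#include <vector>

static void BM_read_vector_uint8(benchmark::State& state) {
    // 33 554 432 bytes = 32 MiB, bigger than the 8 MB L3 cache (size is an assumption).
    std::vector<std::uint8_t> data(32 * 1024 * 1024, 1);
    for (auto _ : state) {
        std::uint64_t sum = 0;
        for (const auto e : data) {
            sum += e;
        }
        benchmark::DoNotOptimize(sum);   // keep the sum, and thus the loads, alive
    }
}
BENCHMARK(BM_read_vector_uint8);

static void BM_read_vector_uint32(benchmark::State& state) {
    // Same number of bytes, a quarter of the element count.
    std::vector<std::uint32_t> data(8 * 1024 * 1024, 1);
    for (auto _ : state) {
        std::uint64_t sum = 0;
        for (const auto e : data) {
            sum += e;
        }
        benchmark::DoNotOptimize(sum);
    }
}
BENCHMARK(BM_read_vector_uint32);

BENCHMARK_MAIN();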

As mentioned in the previous post, I use Google Benchmark for these microbenchmarks.

Notice that Google Benchmark only measures what’s within the for(auto _ : state){ … }. Each iteration of this loop is its own measurement iteration. Constructing the vectors is not part of the measurement and the same vector is read over and over.

I designed these (silly) tests to give the CPU a working set bigger than the 8 MB L3 cache, so that all data is read from main RAM on every test iteration. I sum the data just to stop the optimiser from removing the loads; I’m not trying to make a fast sum function.

I fully expected both functions to be completely memory bound, mostly waiting for data to be loaded from RAM. I expected the actual instructions to be mostly irrelevant to the benchmark. I was wrong.

BM_read_vector_uint8() runs about 8x slower than BM_read_vector_uint32()!

When actually reading more data than fits in the L3 cache (these are the two tests that touch 33.5544 MB per test iteration), only about 54% of the maximum memory bandwidth is ever reached.

THE COMPUTER

I’m running these tests on a 2014 iMac Retina.

Run on (8 X 4000 MHz CPU s)
CPU Caches:
L1 Data 32K (x4)
L1 Instruction 32K (x4)
L2 Unified 262K (x4)
L3 Unified 8388K (x1)

machdep_cpu_brand_string: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
machine: x86_64
model: iMac15,1
ncpu: 8
physmem: 2147483648
usermem: 3896381440
tbfrequency: 1000000000
cpu_freq_hz: 4000000000
bus_freq_hz: 100000000
mem_size: 17179869184
page_size: 4096
cacheline_size: 64

The computer has a max memory bandwidth of 25.6 GB/s according to Intel’s spec sheets. It could be that my computer has slower DRAM or something, but the numbers do seem to suggest 25.6 GB/s could be true on this machine. “About This Mac” tells me “16 GB 1600 MHz DDR3”.

From here on I’m using “CL” as short for “cache line”.

  • 25.6 GB/s memory bandwidth and 64 byte cache lines => 400 M CL/s = 2.5 ns/CL.
  • The CPU runs at 4 GHz = 0.25 ns / clock. Modern CPUs are superscalar and can run several instructions per clock, even without using SIMD.
  • 2.5 ns / 0.25 ns = 10 CPU clocks per CL.
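
The same arithmetic spelled out as code, in case you want to plug in your own machine’s numbers (the 25.6 GB/s and 4 GHz figures are the ones above):

// Back-of-the-envelope numbers as compile-time constants.
constexpr double bytes_per_second = 25.6e9;            // Intel spec sheet figure
constexpr double bytes_per_cache_line = 64.0;
constexpr double cache_lines_per_second = bytes_per_second / bytes_per_cache_line;   // 400 M CL/s
constexpr double ns_per_cache_line = 1e9 / cache_lines_per_second;                   // 2.5 ns/CL
constexpr double ns_per_clock = 1e9 / 4.0e9;                                         // 0.25 ns at 4 GHz
constexpr double clocks_per_cache_line = ns_per_cache_line / ns_per_clock;           // 10 clocks/CL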

INNER LOOP INSTRUCTIONS

These are the actual instructions for the inner loops of the two functions, compiled with Xcode 10.1’s Clang and -Ofast:

Both loops have been unrolled; the uint32 version more aggressively, and it also uses SIMD. I don’t know why the uint8 version is unrolled less and doesn’t use SIMD. Theoretically, the compiler could have generated similar code for both functions.

There is probably a good reason why Clang generates slower code for the uint8 version, but that reason is far removed from me, the programmer writing the source code.

  • The uint32 version reads an entire CL each loop iteration, using 19 instructions.
  • The uint8 version only reads 4 bytes per loop iteration and needs 16 loop iterations to process the entire CL => 244 instructions per CL.
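
For contrast, a uint8 inner loop can be written by hand to consume a whole cache line per iteration, for example with SSE2’s byte-summing instruction. This is only my sketch of what “similar code” could look like, not what the compiler emitted:

#include <emmintrin.h>   // SSE2 intrinsics
#include <cstddef>
#include <cstdint>

// Sum 'count' bytes, consuming one 64-byte cache line per loop iteration.
// Assumes count is a multiple of 64.
std::uint64_t sum_bytes_cacheline_at_a_time(const std::uint8_t* data, std::size_t count) {
    const __m128i zero = _mm_setzero_si128();
    __m128i acc = _mm_setzero_si128();
    for (std::size_t i = 0; i < count; i += 64) {
        const __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i*>(data + i));
        const __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i*>(data + i + 16));
        const __m128i c = _mm_loadu_si128(reinterpret_cast<const __m128i*>(data + i + 32));
        const __m128i d = _mm_loadu_si128(reinterpret_cast<const __m128i*>(data + i + 48));
        // _mm_sad_epu8 against zero sums 16 bytes into two 16-bit partial sums per register.
        acc = _mm_add_epi64(acc, _mm_sad_epu8(a, zero));
        acc = _mm_add_epi64(acc, _mm_sad_epu8(b, zero));
        acc = _mm_add_epi64(acc, _mm_sad_epu8(c, zero));
        acc = _mm_add_epi64(acc, _mm_sad_epu8(d, zero));
    }
    std::uint64_t lanes[2];
    _mm_storeu_si128(reinterpret_cast<__m128i*>(lanes), acc);
    return lanes[0] + lanes[1];
}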

I believe this is what goes on while running these functions for 60 ns:

This is a nanosecond time axis showing instructions executed and CL loads from DRAM for the first 60 ns of each inner loop. The blue lines show when a loop iteration depends on a green CL having been loaded.

This makes it painfully clear why BM_read_vector_uint8() is slow: it is CPU bound, not memory bound! With roughly 244 instructions needed per cache line, executing them takes far longer than the ~10 clocks in which memory can deliver the next cache line, even with several instructions retiring per clock.

It also looks like even this silly example could take advantage of more than one thread — one hardware thread has trouble keeping up with the memory performance.
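
A minimal sketch of what that could look like, with the sum split over a few threads so more than one core issues loads (the chunking and thread count are arbitrary; this is not part of the benchmarks above):

#include <cstdint>
#include <numeric>
#include <thread>
#include <vector>

// Sum a large vector with num_threads threads, each reading its own contiguous slice.
std::uint64_t parallel_sum(const std::vector<std::uint8_t>& data, unsigned num_threads) {
    std::vector<std::uint64_t> partial(num_threads, 0);
    std::vector<std::thread> threads;
    const std::size_t chunk = data.size() / num_threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = t * chunk;
        const std::size_t end = (t + 1 == num_threads) ? data.size() : begin + chunk;
        threads.emplace_back([&, t, begin, end] {
            // Each thread writes only its own slot in partial.
            partial[t] = std::accumulate(data.begin() + begin, data.begin() + end, std::uint64_t{0});
        });
    }
    for (auto& th : threads) {
        th.join();
    }
    return std::accumulate(partial.begin(), partial.end(), std::uint64_t{0});
}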

Also: getting this type of diagram from your IDE would be immensely helpful when tuning code.

Thanks to Lars Sjödin, Fredrik Kiby Zetterman, Jonas Nockert, Viktor Sehr, Johan Kotlinski for playing this game with me.
