Which code is faster? Part 3

Marcus Zetterquist
Floyd Programming Language
3 min read · Mar 19, 2019

MY EXPECTATIONS VS REALITY

As I said in parts 1 and 2, I expected both functions to be completely memory bound, so the number of instructions would be almost irrelevant and the two functions should execute in roughly the same time.

I expected the graph below to be green (memory loads) with sprinkles of gray (instructions).

I kept thinking this even after I disassembled the functions and saw that the uint8 version was far less optimised than the uint32 version.

To be clear: I’m not surprised that unrolling loops and using fewer, wider instructions speeds up instruction execution, only that it would matter here.

The reality is that since the functions read data linearly from RAM, the CPU’s prefetcher does a fantastic job of loading cache lines (CLs) before the instructions even try to load the data. The load instructions never stall.

My slow DRAM can deliver a new CL every 10 CPU clock cycles!

This means that:

if the code needs more than 10 CPU clock cycles to process 64 bytes, it becomes CPU bound!

This was a huge surprise to me, even after looking at the disassembly and reading CPU data sheets.
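
To make that concrete, here is the back-of-the-envelope arithmetic as a tiny C++ program. The 3 GHz clock is my assumption for illustration; plug in your own machine’s numbers.

```cpp
#include <cstdio>

int main() {
    // Assumed numbers for illustration -- substitute your own hardware's.
    const double clock_hz         = 3.0e9; // 3 GHz core clock (assumption)
    const double cycles_per_cl    = 10.0;  // DRAM delivers one CL per 10 cycles
    const double cache_line_bytes = 64.0;

    // Streaming bandwidth the prefetcher can sustain:
    // 64 bytes every 10 cycles at 3 GHz = ~19.2 GB/s.
    const double bytes_per_second = cache_line_bytes * clock_hz / cycles_per_cl;
    std::printf("sustained stream: %.1f GB/s\n", bytes_per_second / 1e9);

    // The flip side: spend more than 10 cycles on each 64-byte cache line
    // and the loop, not memory, is the bottleneck.
    return 0;
}
```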

WHAT I HAVE LEARNED

  1. When you access memory in a pattern the prefetcher handles well, you get data fast. Surprisingly fast. Even simple loops can become CPU bound rather than memory bound. Code that doesn’t access memory that neatly, by following pointers for example, is usually memory bound, and then you have plenty of instructions to waste. (Both access patterns are sketched after this list.)
  2. Clang’s SIMD implementation still maxes out at about 54% of memory bandwidth when we access more data than fits in the L3 cache. I fully expected to max out at 100% of the memory bandwidth. I believe this shows it’s hard for one CPU core to saturate the memory bandwidth, though maybe something like memcpy() can use it all.
  3. Trivial code can show 8x performance differences, depending on what the compiler’s optimiser can do, or decides to do. Not 8% or 80%. 800%! If this were an important inner loop, the program could be “broken”. Your game would run at 7 FPS instead of 60 FPS.
  4. Looking at the source code, it’s impossible to predict whether the compiler will generate 1x or 8x code. And later on, your program could go from 1x to 8x when you make trivial changes to the source code, update the compiler, and so on. A function critical to your program can break “spontaneously”, and you will get no warning.
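
To illustrate point 1, here is a minimal sketch of the two access patterns. The names and types are mine, not from the functions benchmarked in parts 1 and 2.

```cpp
#include <cstdint>
#include <vector>

// Prefetcher-friendly: a linear scan over contiguous memory. The address
// of the next cache line is trivially predictable, so the prefetcher can
// run far ahead of the loads.
uint64_t sum_linear(const std::vector<uint8_t>& data) {
    uint64_t sum = 0;
    for (uint8_t v : data) sum += v;
    return sum;
}

// Prefetcher-hostile: chasing pointers through a linked list. Each load
// depends on the previous one, so the prefetcher cannot run ahead and
// the loop stalls on memory instead.
struct Node {
    uint64_t value;
    Node*    next;
};

uint64_t sum_list(const Node* head) {
    uint64_t sum = 0;
    for (const Node* n = head; n != nullptr; n = n->next) sum += n->value;
    return sum;
}
```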

Current state-of-the-art compiler technology is very unreliable when it comes to micro-performance: you write a function, dial up optimisation settings and hope it’s fast.

WHAT YOU CAN DO

To make important functions fast:

  1. Learn how a specific piece of hardware works: cache line size, cache hierarchy, memory bandwidth, prefetcher and instruction set.
  2. Calculate the ideal performance of your function on this hardware.
  3. Set up micro-benchmarks for your actual code and see how close to the ideal it gets. I recommend Google Benchmark: https://github.com/google/benchmark (a minimal harness is sketched after this list).
  4. Disassemble the compiler-generated code and try to figure out whether it is the best possible.
  5. Tinker with your source code and see if benchmarks improve.
  6. When happy, set up “some sort of performance test” that fails if the compiler shifts from 1x code to 8x code later on. You don’t want this to happen without knowing it.
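
For step 3, a minimal Google Benchmark harness could look like the sketch below. The function under test is a placeholder; the SetBytesProcessed call is what lets you compare the measured throughput against the ideal you calculated in step 2.

```cpp
#include <benchmark/benchmark.h>

#include <cstdint>
#include <numeric>
#include <vector>

// Placeholder for the function you actually care about.
static uint64_t sum_bytes(const std::vector<uint8_t>& data) {
    return std::accumulate(data.begin(), data.end(), uint64_t{0});
}

static void BM_SumBytes(benchmark::State& state) {
    std::vector<uint8_t> data(static_cast<size_t>(state.range(0)), 1);
    for (auto _ : state) {
        uint64_t sum = sum_bytes(data);
        benchmark::DoNotOptimize(sum); // keep the compiler from eliding the work
    }
    // Report bytes/second so you can compare against the hardware's ideal.
    state.SetBytesProcessed(state.iterations() * state.range(0));
}
// 64 MB: large enough that the data cannot live in the L3 cache.
BENCHMARK(BM_SumBytes)->Arg(64 << 20);

BENCHMARK_MAIN();
```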

You need to redo this work separately for each important piece of hardware and platform. You also need alternate versions of the function for each hardware target and a way to switch between them at runtime or compile time (or install time).
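
As a sketch of what runtime switching could look like, assuming GCC or Clang on x86 (the variant names are hypothetical, and __builtin_cpu_supports is a GCC/Clang builtin):

```cpp
#include <cstddef>
#include <cstdint>

// Two hypothetical variants of the same hot function; in practice each
// would be hand-tuned for (or compiled for) a specific target.
static uint64_t sum_scalar(const uint8_t* p, size_t n) {
    uint64_t s = 0;
    for (size_t i = 0; i < n; ++i) s += p[i];
    return s;
}

static uint64_t sum_wide(const uint8_t* p, size_t n) {
    // Stand-in for a SIMD/unrolled implementation.
    return sum_scalar(p, n);
}

using SumFn = uint64_t (*)(const uint8_t*, size_t);

// Pick the implementation once, at startup, based on a CPU feature check.
// __builtin_cpu_supports is GCC/Clang-specific and x86-only; other
// toolchains need their own feature query.
SumFn pick_sum() {
#if defined(__x86_64__) || defined(__i386__)
    if (__builtin_cpu_supports("avx2")) return sum_wide;
#endif
    return sum_scalar;
}
```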

Another thing: this assumes the optimised function is used in one specific scenario. If the function is used in many situations, you might want a vanilla version for the common case.

It’s obvious to me that a tool is missing here: one that lets you do explicit micro-tuning and works together with the compiler.
