RDTSC: the only way to benchmark.

Fletch
Published in Geek Culture
Aug 18, 2021

If you are benchmarking small single functions that boil down to only a few lines of instructions, or functions that complete execution in a very short amount of time, then the only way to get an accurate benchmark is to use the RDTSC instruction.

Traditionally, when one decides to performance-benchmark a segment of code, one picks a timer of some resolution; the lower the resolution, the more iterations the benchmark must run for to get a good approximation.

Virtually all computers offer a one-second-resolution ticker in time.h via the time() function. Using this low-precision timer we can do something like the following.
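As a minimal sketch, assuming a 3-second sample window, a simple execution counter, and the float accumulator trick explained below:

#include <stdio.h>
#include <math.h>
#include <time.h>

int main(void)
{
    float r = 0.f;                 // accumulator, keeps the loop observable
    unsigned long long c = 0;      // counts completed executions
    const time_t st = time(0) + 3; // sample for 3 seconds
    while(time(0) < st)
    {
        r += sqrtf(time(0));       // input the compiler cannot predict
        c++;
    }
    printf("Executions in 3 seconds: %llu\n", c);
    printf("%c\n", (char)r);       // use r so the loop is not optimised away
    return 0;
}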

Executions in 3 seconds: 1,098,775,381
Executions per millisecond: 366,258
Executions per microsecond: 366
~0.4 executions every nanosecond

We specify time(0) as the input to the sqrt() function because this keeps sqrt() supplied with values the compiler cannot predict, and thus ensures the compiler does not optimise our while loop in a way that would make the benchmark stop measuring what we intend. At the end we also have to do something with our r variable to, again, prevent the compiler optimising the benchmark in undesirable ways: we accumulate the sqrt() output of each iteration into r and then print it at the end. In our case, we convert the float r to a single character and print that, so as not to clutter the console output.

We can increase the precision of our result by increasing the sample interval from 3 seconds to a higher value, or we can get a little more sophisticated by using the clock() function from time.h, which offers a higher-resolution timer and thus yields more accurate results over the same 3-second interval.
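As a sketch, this is the same loop re-timed with clock(), again over a 3-second window (CLOCKS_PER_SEC converts seconds into clock ticks):

#include <stdio.h>
#include <math.h>
#include <time.h>

int main(void)
{
    float r = 0.f;
    unsigned long long c = 0;
    const clock_t st = clock() + 3 * CLOCKS_PER_SEC; // 3 seconds of clock ticks
    while(clock() < st)
    {
        r += sqrtf(time(0));
        c++;
    }
    printf("Executions in 3 seconds: %llu\n", c);
    printf("%c\n", (char)r);
    return 0;
}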

On Windows, as part of the Win32 API, you have QueryPerformanceCounter(), which is accurate to less than one microsecond. Although, as far as I am aware, when calling this function you have to lock it to a single CPU core or OS-level context switching will muddy the results. This can be done by calling QueryPerformanceCounter() from a thread whose affinity has been set to a single CPU core using the SetThreadAffinityMask() function.
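A sketch of that pattern might look like the following; the loop body and the 1,000,000 iteration count are illustrative placeholders of my own:

#include <stdio.h>
#include <math.h>
#include <windows.h>

int main(void)
{
    // pin this thread to core 0 so the counter is always read on one core
    SetThreadAffinityMask(GetCurrentThread(), 1);

    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq); // counter ticks per second

    float r = 0.f;
    QueryPerformanceCounter(&t0);
    for(unsigned int i = 0; i < 1000000; i++)
        r += sqrtf((float)i);
    QueryPerformanceCounter(&t1);

    const double us = (double)(t1.QuadPart - t0.QuadPart) * 1000000.0
                      / (double)freq.QuadPart;
    printf("1,000,000 sqrtf() calls: %.2f microseconds\n", us);
    printf("%c\n", (char)r); // keep r live
    return 0;
}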

Windows also has two popular millisecond-resolution timers: GetTickCount() and the WinMM function timeGetTime(), the latter with a higher precision of 5 ms or better. Although they are millisecond timers, their actual precision can be worse than a single millisecond.

Although on Windows it would seem that GetSystemTime() is the more popular option these days, as referenced in this StackOverflow response.

So far we've covered the low- and high-precision timers available on most operating systems, time() and clock(), plus the Windows-specific timers. While they are adequate, they require larger sample sizes to claw back the precision lost to the timing function itself.

The best solution is of course to use the RDTSC instruction, which counts ticks in clock cycles of the CPU: for every tick registered by RDTSC, one cycle of the CPU has taken place (strictly speaking, on modern CPUs the TSC ticks at a constant reference rate independent of dynamic frequency scaling, but it remains the finest-grained timer available). A typical modern CPU as of 2021 operates at around 4.2 GHz, which means it executes about 4.2 billion cycles per second.

Even RDTSC can return varying results each time it is called, so it too needs to be sampled many times before we arrive at a stable average. For my Intel processor running at 4.2 GHz, I find that 3,000,000 samples is adequate to obtain a stable average.

Using the RDTSC timer only requires that you include x86intrin.h to make use of the __rdtsc() intrinsic, although Intel's documentation likes to claim otherwise.

You can also call the instruction directly, grabbing only its lower 32 bits.
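For example, with GCC inline assembly: RDTSC returns the low 32 bits of the counter in EAX and the high 32 bits in EDX, so we bind EAX as the output and clobber EDX (rdtsc32 is my own name for this sketch):

static inline unsigned int rdtsc32(void)
{
    unsigned int lo;
    __asm__ __volatile__("rdtsc" : "=a"(lo) : : "edx"); // keep only EAX
    return lo;
}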

So this is how our high precision RDTSC implementation would look on the sqrt() function:
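A minimal sketch, using the 3,000,000-sample count from above; the unpredictable input is prepared outside the timed region so that, aside from the __rdtsc() calls themselves, only sqrtf() is being measured:

#include <stdio.h>
#include <math.h>
#include <time.h>
#include <x86intrin.h>

int main(void)
{
    const unsigned int samples = 3000000;
    const float x = (float)time(0); // value the compiler cannot predict
    float r = 0.f;
    unsigned long long total = 0;
    for(unsigned int i = 0; i < samples; i++)
    {
        const unsigned long long st = __rdtsc();
        r += sqrtf(x + i);
        total += __rdtsc() - st;
    }
    printf("sqrt() Cycles: %llu\n", total / samples);
    printf("%c\n", (char)r); // keep r live
    return 0;
}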

sqrt() Cycles: 17

This gives us a very precise benchmark, in actual CPU cycles, of how long a single function takes on average. We could also output the highest and lowest values alongside the average, like so:
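A sketch, extending the loop above to track the fastest and slowest individual samples:

#include <stdio.h>
#include <math.h>
#include <time.h>
#include <x86intrin.h>

int main(void)
{
    const unsigned int samples = 3000000;
    const float x = (float)time(0);
    float r = 0.f;
    unsigned long long total = 0, mn = ~0ULL, mx = 0;
    for(unsigned int i = 0; i < samples; i++)
    {
        const unsigned long long st = __rdtsc();
        r += sqrtf(x + i);
        const unsigned long long c = __rdtsc() - st;
        total += c;
        if(c < mn) mn = c;
        if(c > mx) mx = c;
    }
    printf("sqrt() Min Cycles: %llu\n", mn);
    printf("sqrt() Max Cycles: %llu\n", mx);
    printf("sqrt() Avg Cycles: %llu\n", total / samples);
    printf("%c\n", (char)r);
    return 0;
}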

sqrt() Min Cycles: 12
sqrt() Max Cycles: 7,709
sqrt() Avg Cycles: 17

Finally, to put all this together into one large high-precision test covering a handful of math.h functions:
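A sketch of how that could be structured, wrapping the sampling loop in a macro; the BENCH macro and the selection of functions are my own, and the original gist may differ:

#include <stdio.h>
#include <math.h>
#include <time.h>
#include <x86intrin.h>

#define SAMPLES 3000000

#define BENCH(func) \
{ \
    const float x = (float)time(0); \
    float r = 0.f; \
    unsigned long long total = 0; \
    for(unsigned int i = 0; i < SAMPLES; i++) \
    { \
        const unsigned long long st = __rdtsc(); \
        r += func(x + i); \
        total += __rdtsc() - st; \
    } \
    printf("%s() Avg Cycles: %llu \t(r=%g)\n", #func, total / SAMPLES, r); \
}

int main(void)
{
    BENCH(sqrtf);
    BENCH(cosf);
    BENCH(sinf);
    BENCH(tanf);
    BENCH(logf);
    BENCH(expf);
    return 0;
}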

You should be able to compile this with any optimisation flags, such as -Ofast, and it will not break the desired result. If you would like to see the other snippets of code included in this article, refer to this gist for the extended version.

It is true that your benchmarks will include the latency of calling the __rdtsc() function itself, but that is a lesser evil; I personally don't think it is worth trying to calculate the average latency of the __rdtsc() calls and subtracting it from the final average.

You may also be interested in a benchmark I did on multiply-accumulate operations, and more recently a benchmark of my favourite random functions here.

I really liked this benchmark code by Jm15itch.

Other various benchmarks I have done: AtomicBench, BitBoolBench, SineBench, Audio Interpolator Bench, Neural Unit Vector Bench.

https://godbolt.org/ is a useful tool which outputs the compiled assembly instructions of your C or C++ code. Otherwise you can pass the -S flag to the GCC Compiler, or use objdump -d on a compiled binary.

gcc bench.c -Ofast -lm -o bin
gcc -S bench.c -Ofast -lm -o bin.asm
objdump -d bin > objdump_bin.asm

Intel, “How to Benchmark Code Execution Times”, 2010
https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-32-ia-64-benchmark-code-execution-paper.pdf
(This paper has since been removed by Intel, presumably because it dates from 2010 and no longer holds much relevance in 2024; it can still be found mirrored on various sites, such as here. I originally linked it only as a reference for anyone interested in the finer details; it has no direct bearing on the techniques in this article beyond covering the various RDTSC functions and their usage as of 2010.)
