Published in Geek Culture
RDTSC: the only way to benchmark.

If you are benchmarking small, single functions that boil down to only a few lines of instructions, or functions that complete execution in a very short amount of time, then the only way to get an accurate benchmark is the RDTSC instruction.

Traditionally, when deciding to benchmark a segment of code, one chooses a timer of some resolution; the lower the resolution, the more iterations the benchmark must run to arrive at a good approximation.

Practically all systems expose a one-second-resolution timer through the time() function in time.h. Using this low-precision timer we can do something like this:

Executions in 3 seconds: 1,098,775,381
Executions per millisecond: 366,258
Executions per microsecond: 366
~0.4 executions every nanosecond

We pass time(0) as the input to sqrt() because this keeps sqrt() supplied with values the compiler cannot predict, ensuring the compiler does not optimise our while loop in a way that would make the benchmark stop measuring what we intend. We also have to do something with the result at the end, again to prevent the compiler optimising the benchmark in undesirable ways: we accumulate the sqrt() output of each iteration into a variable and print it when the loop finishes. In our case, we convert the float r to a single character and print that, so as not to clutter the console output.

We can increase the precision of this approach by lengthening the sample interval from 3 seconds to a higher value, or we can get a little more sophisticated and use the clock() function (also from time.h), which offers a higher-resolution timer and thus more accurate results over the same 3-second interval.

On Windows, the Win32 API provides QueryPerformanceCounter(), which is accurate to under a microsecond. As far as I am aware, though, you have to lock the call to a single CPU core or OS-level context switching will muddy the results; this can be done by calling QueryPerformanceCounter() from a thread whose affinity has been restricted to one core with SetThreadAffinityMask().

Windows also has two popular millisecond-resolution timers: GetTickCount() and the WinMM function timeGetTime(). Although both report milliseconds, their actual resolution is coarser: GetTickCount() typically updates every 10-16 ms, while timeGetTime() defaults to around 5 ms or better (and can be lowered with timeBeginPeriod()).

So we’ve covered the low- and high-precision timers available on most operating systems, time() and clock(), plus the Windows-specific timers. While they are adequate, they require large sample sizes to claw back the precision lost to the timing function itself.

The best solution is of course the RDTSC instruction, which reads the CPU's time-stamp counter; every tick it registers corresponds to one clock cycle of the CPU. A modern CPU as of 2021 operates at around 4.2 GHz, which means it executes roughly 4.2 billion cycles per second.

Even RDTSC returns varying results each time it is called, so it too must be sampled many times before we arrive at a stable average. For my Intel processor running at 4.2 GHz, I find that 3,000,000 samples are adequate to obtain a stable average.

Using the RDTSC timer only requires including x86intrin.h to make use of the __rdtsc() intrinsic, although Intel likes to claim otherwise.

You can also invoke the instruction directly with inline assembly, grabbing only its lower 32 bits.

So this is how our high-precision RDTSC implementation would look for the sqrt() function:

sqrt() Cycles: 17

This gives us a very precise benchmark, in actual CPU cycles, of how long a single function takes on average. We could also output the highest and lowest values alongside the average, like so:

sqrt() Min Cycles: 12
sqrt() Max Cycles: 7,709
sqrt() Avg Cycles: 17

Finally, to put all this together into one large high-precision test covering a handful of math.h functions:

You should be able to compile this with any optimisation flags, such as -Ofast, without breaking the desired result. If you would like to see the other snippets of code from this article, refer to this gist for the extended version.

It is true that your benchmarks will include the latency of the __rdtsc() call itself, but that is a lesser evil; I personally don’t think it is worth trying to measure the average latency of the __rdtsc() calls and subtract it from the final average.

You may also be interested to see a benchmark I did on multiply-accumulate operations. To inspect the compiled assembly instructions of your C or C++ code, you can pass the -S flag to the GCC compiler, or use objdump -d on a compiled binary:

gcc bench.c -Ofast -lm -o bin
gcc -S bench.c -Ofast -lm -o bin.asm
objdump -d bin > objdump_bin.asm

Since writing this article I have discovered that the GetSystemTime() function on Windows is also a popular option.

Intel, “How to Benchmark Code Execution Times”, 2010



