Learning the value of good benchmarking technique with C++ magic squares

If you’re a working programmer and haven’t already seen it, you should check out Challenge your performance intuition with C++ magic squares. This article by wordsandbuttons, along with its follow-up, C++ magic squares demystified with Valgrind and disassembly, takes a long look at some deceptively simple code.

It asks you to estimate the performance impact of various optimisations, both exposing you to exploratory optimisation work and making you practise testing and validating your assumptions. The follow-up looks at some possibilities under the assembly microscope, a useful technique to know even for those of us who don’t get to use it frequently.

Unfortunately, though the articles explored the design space, they didn’t do so in a structured manner, by creating and testing hypotheses about the run-time characteristics of the code. The lack of performance analysis allowed the benchmark to introduce bias, which we’ll explore later, and some of the apparent improvements were likely regressions. Whilst I cannot hope to produce an Aleksey Shipilëv-quality analysis (see The Black Magic of (Java) Method Dispatch for a perfect example), hopefully I can provide some insight, and explore how a more scientific approach can pay dividends.


3x3 magic squares and a direct test

An N×N magic square, for anyone who skipped the original posts, is a square of the unique numbers from 1 to N² where each row, column and diagonal sums to the same value. Since the N rows between them contain each of the numbers exactly once, that value must be (1 + 2 + ⋯ + N²)/N = ½N(N²+1), which is 15 for a 3x3 magic square.


8 3 4        8 → 3 → 4 → 15
             ↓ ↘ ↓ ↙ ↓
1 5 9        1 → 5 → 9 → 15
             ↓ ↙ ↓ ↘ ↓
6 7 2        6 → 7 → 2 → 15
            ↙  ↓   ↓   ↓  ↘
           15  15  15  15  15

We can pack this square into a std::string, representing it in reading order as "834159672". A function taking a std::string& of digit characters as input can test for magic squares fairly simply. This code is offered as a solution.

auto c15 = '5' * 3;
uint_fast64_t ideal_char_map =
    static_cast<uint_fast64_t>(0x1FF) << 49;
uint_fast64_t char_map_one = 1u;

bool check_if_magic(const std::string& sq)
{
    if ((sq[0] + sq[1] + sq[2] != c15)
     || (sq[3] + sq[4] + sq[5] != c15)
     || (sq[6] + sq[7] + sq[8] != c15)

     || (sq[0] + sq[3] + sq[6] != c15)
     || (sq[1] + sq[4] + sq[7] != c15)
     || (sq[2] + sq[5] + sq[8] != c15)

     || (sq[0] + sq[4] + sq[8] != c15)
     || (sq[2] + sq[4] + sq[6] != c15))
        return false;

    auto char_map = ideal_char_map;
    for(auto i = 0u; i < 9; ++i)
        char_map ^= char_map_one << sq[i];
    if (char_map != 0)
        return false;

    return true;
}

This first tests rows, then columns, then diagonals, and only finally checks that all of the numbers are unique.

To measure performance, this is tested inside the following test harness, which iterates through all valid inputs from "111111111" to "999999999". It is worth taking a moment to make sure you understand the process it uses to generate inputs, as we will revisit it.

static std::string buffer = "000000000";

void generate_or_check(int index_or_check = 8)
{
    if(index_or_check == -1){
        if(check_if_magic(buffer))
            std::cout << buffer << " ";
        return;
    }

    for(auto i = 1u; i < 10; ++i){
        buffer[index_or_check] = '0' + i;
        generate_or_check(index_or_check-1);
    }
}

Testing with gcc version 7.2.0, running g++ -std=c++14 -O2, I get a reported time of 1.74 seconds using the minimum of five runs.

Exploring the benchmark

First it’s useful to quickly check whether inlining is causing any complications by running with __attribute__ ((noinline)). This gives us a runtime of 1.72 seconds, which is within error, showing that inlining is neither helping nor hurting. There will be benchmarks where this matters, so it’s important to keep inlining disabled.

Since we now know that the benchmark is isolated from the harness, we can check to see what the benchmark overhead itself is. We might like to check with a simple function call that just returns false.

bool __attribute__ ((noinline)) check_if_magic(const std::string&) {
return false;
}

However, verifying the assembly shows that although the function is not inlined, it is hoisted outside the loop. We can suppress this with inline assembly; I’ve kept it minimal for readability and verified that this is sufficient through directly checking the output.

bool __attribute__ ((noinline)) check_if_magic(const std::string&) {
asm volatile ("");
return false;
}

This gives us timings of… 1.71 seconds. Oops.

Somehow only 1% of our test seems to be spent measuring the function, with the rest just measuring benchmark overhead. Removing __attribute__ ((noinline)) and asm volatile (""); makes the time fall to 0.77 seconds, which shows that any tests we might have seen that gave a result between 1.7 and 0.8 seconds are not trustworthy measurements, and without further analysis could be down solely to inlining or global optimisation differences.

When I spoke with wordsandbuttons, he said he verified that no inlining happened in his test cases by looking at the assembly produced. Unfortunately those analyses won’t apply to this post since we use a newer compiler and different testing harnesses, but it is simple enough to always disable inlining anyway.

Letting our function breathe

There are 9⁹, about 387 million, different inputs. This means each test takes about 4.4 ns, or 14 cycles. That’s not a lot of time to generate 9 bytes and call a function on each slice.

What happens if we remove the recursion?

void generate_or_check() {
    std::string buffer = "000000000";
    for (buffer[8] = '1'; buffer[8] <= '9'; ++buffer[8]) {
     for (buffer[7] = '1'; buffer[7] <= '9'; ++buffer[7]) {
      for (buffer[6] = '1'; buffer[6] <= '9'; ++buffer[6]) {
       for (buffer[5] = '1'; buffer[5] <= '9'; ++buffer[5]) {
        for (buffer[4] = '1'; buffer[4] <= '9'; ++buffer[4]) {
         for (buffer[3] = '1'; buffer[3] <= '9'; ++buffer[3]) {
          for (buffer[2] = '1'; buffer[2] <= '9'; ++buffer[2]) {
           for (buffer[1] = '1'; buffer[1] <= '9'; ++buffer[1]) {
            for (buffer[0] = '1'; buffer[0] <= '9'; ++buffer[0]) {
                if (check_if_magic(buffer)) {
                    std::cout << buffer << " ";
                }
    } } } } } } } } }
}

Now a noinline call to return false takes 0.75 seconds and the full test takes 0.96 seconds, a difference of 0.21 seconds. This difference was hidden before, likely because of the CPU’s implicit parallelism; now we can see it.

One thing that is interfering with the results is the use of std::string. String types are very heavily specialised for their common use cases, which hurts the use cases outside of that. In particular, short strings are generally used in an opaque fashion: created, combined and destroyed frequently but only ever split up or indexed on occasion. This motivates platform-specific optimisations like the short string optimisation, where characters are stored inside what would normally be the space for the string’s pointer, length and capacity. This is not appropriate for us; it adds platform variability and means we suffer from more complex code paths without ever benefitting.
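
As a quick illustration (a sketch, not part of the benchmark), here is one way to see the short string optimisation in action on libstdc++: for a nine-character string, .data() points inside the string object itself.

#include <iostream>
#include <string>

int main() {
    std::string s = "834159672";
    const char *obj_begin = reinterpret_cast<const char *>(&s);
    const char *obj_end   = obj_begin + sizeof(s);
    bool stored_inline = s.data() >= obj_begin && s.data() < obj_end;
    std::cout << "sizeof(std::string): " << sizeof(s) << "\n";   // 32 with libstdc++'s current ABI
    std::cout << "characters stored inside the object: "
              << std::boolalpha << stored_inline << "\n";        // true for a 9-character string
}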

Replacing the string with a char[10], then, reduces the time from 0.75 to 0.65 for an empty function and from 0.96 to 0.83 for a full function call. This is about 5.7 cycles per iteration; the benchmarked code takes about 1.6 of that.

1.6 cycles isn’t very long

It might initially seem crazy that a 20-line function takes just 1.6 cycles; that’s more than ten lines per cycle, and there’s even a loop in there! So let’s analyse that claim a little more.

The large if is effectively a chain of many ifs, and writing it as such a chain does not change the timings.

if (sq[0] + sq[1] + sq[2] != c15) { return false; }
if (sq[3] + sq[4] + sq[5] != c15) { return false; }
if (sq[6] + sq[7] + sq[8] != c15) { return false; }
if (sq[0] + sq[3] + sq[6] != c15) { return false; }
if (sq[1] + sq[4] + sq[7] != c15) { return false; }
if (sq[2] + sq[5] + sq[8] != c15) { return false; }
if (sq[0] + sq[4] + sq[8] != c15) { return false; }
if (sq[2] + sq[4] + sq[6] != c15) { return false; }

We want to analyse control flow through this chain of ifs. Before doing so, it’s important to verify that the compiler is compiling as you’d expect; if it’s reordering or coalescing statements then you need to account for that. Luckily for us, the compiler is making a straightforward translation. Then we can draw out some statistics.

(sq[0] + sq[1] + sq[2] != c15) => 387420489 hits,   8.4% passed
(sq[3] + sq[4] + sq[5] != c15) =>  32417901 hits,   8.4% passed
(sq[6] + sq[7] + sq[8] != c15) =>   2712609 hits,   8.4% passed
(sq[0] + sq[3] + sq[6] != c15) =>    226981 hits,   9.3% passed
(sq[1] + sq[4] + sq[7] != c15) =>     21195 hits,  10.4% passed
(sq[2] + sq[5] + sq[8] != c15) =>      2191 hits, 100.0% passed
(sq[0] + sq[4] + sq[8] != c15) =>      2191 hits,   8.6% passed
(sq[2] + sq[4] + sq[6] != c15) =>       189 hits,  21.7% passed
                               =>        41 hits

As we can see, with the exception of one superfluous test with a 100% pass rate and the last test, which passes about 20% of the time, each test reduces the space of possibilities by a factor of 10 or more. The first three tests handle 99.95% of the test cases, so the rest of the function doesn’t matter to timings unless it’s many times more expensive.

We can analyse this by rerunning with only the first two tests enabled. However, we run into the problem that we end up printing almost 200k strings, which is many times more expensive than the test, so drowns out any comparisons. Let’s fix this by writing them to an in-memory buffer.

char volatile output[10] = {'\0'};
char buffer[10] = {'\0'};
for (buffer[8] = '1'; buffer[8] <= '9'; ++buffer[8]) {
    ...
    for (buffer[0] = '1'; buffer[0] <= '9'; ++buffer[0]) {
        if (check_if_magic(buffer)) {
            for (size_t i = 0; i < 9; ++i) {
                output[i] = buffer[i];
            }
        }
    } ...
}
std::cout << (char const *)output << " ";

Surprisingly, this actually reduces timings even for the full case (the trivial case remains unaffected), down to 0.75 seconds, likely because the compiler is now able to avoid a second memory access in the inner loop, whether because of reduced register pressure or because the copy loop is easier to optimise around.

Using only the first three ifs in check_if_magic reduces the time a tiny bit to 0.72 seconds. This is unexpected, given the circumstances, but checking the assembly verifies that this is just because the compiler is better able to keep temporaries in registers and can avoid some push-pops. Unfortunately we aren’t able to directly see the cost of the ifs this way, but the claim seems to hold up.

How expensive is that if?

There are two costs to each if: the cost of the calculation and the cost of misprediction. Generally speaking, a modern desktop out-of-order CPU can run 4 arithmetic instructions per cycle (IPC). perf tells us that our code is running at an IPC of 3.0, though this includes generate_or_check and the function call, so it’s likely check_if_magic is running at a higher IPC. The assembly starts

check_if_magic(char const*):
        movsx   esi, BYTE PTR [rdi]
        movsx   ecx, BYTE PTR [rdi+1]
        xor     eax, eax
        movsx   r8d, BYTE PTR [rdi+2]
        lea     edx, [rsi+rcx]
        add     edx, r8d
        cmp     edx, DWORD PTR c15[rip]
        je      .L18
        rep ret

so most calls are 9 instructions long, or presumably 2-3 cycles. Note that xor eax, eax is generally free. The earlier measure of 1.6 presumably had the out-of-order machinery hide some of these operations in cycles inside generate_or_check.

We can test this. perf stat says that the empty function results in 4.60 billion instructions over 2.14 billion cycles, whereas the full function results in 7.43 billion instructions over 2.44 billion cycles. This means we’re running 2.83 billion instructions extra (7.3 more per call) in 0.3 billion cycles, for an incredible IPC of 9.4, which is much higher than the maximum CPU throughput! This confirms the result: almost all of the function’s arithmetic cost is hidden.

The other costs are branch misses. One would naïvely expect about 8–9% of the branches to miss if the branches were unpredictable; the core would end up with a static always-return-false prediction, and that would be wrong 8.4% of the time for the first test. However, with the full function disabled, perf stat says there are about 1.64 billion branches, 10m–10.3m of which miss; with the full function enabled, there are 2.06 billion branches, or 420 million more, 14.1m–14.3m of which miss. Taking the difference to see the cost of the function, this means we have just over one branch per call, and a miss rate of about 1%.

Having just over one branch per call makes a lot of sense, but the low miss rate probably surprises many of you. This is the consequence of both the in-order testing framework and fairly simple branch prediction strategies. Though the exact mechanics of these cores’ branch predictors are obviously secret, some of this can be explained using simple inference.

Generally speaking, a predictor is able to learn a static pattern of short length, and the branch on sq[0] + sq[1] + sq[2] != c15 mostly follows such a pattern. The sequence of outcomes is shown below, with 1 marking an early return false and 0 marking a fallthrough.

111:                  1  1  1  1
115: { 1 1 1 1 1 1 1 1 } × 5
159: { 0 1 1 1 1 1 1 1 } × 5 (miss)
214: { 1 1 1 1 1 1 1 1 } × 4 (miss)
249: { 0 1 1 1 1 1 1 1 } × 6 (miss)
313: { 1 1 1 1 1 1 1 1 } × 3 (miss)
339: { 0 1 1 1 1 1 1 1 } × 7 (miss)
412: { 1 1 1 1 1 1 1 1 } × 2 (miss)
429: { 0 1 1 1 1 1 1 1 } × 8 (miss)
511: { 1 1 1 1 1 1 1 1 } × 1 (miss)
519: { 0 1 1 1 1 1 1 1 } × 9 (miss)
599: { 1 1 1 1 1 1 1 1 } × 1 (miss)
618: { 0 1 1 1 1 1 1 1 } × 8 (miss)
689: { 1 1 1 1 1 1 1 1 } × 2 (miss)
717: { 0 1 1 1 1 1 1 1 } × 7 (miss)
779: { 1 1 1 1 1 1 1 1 } × 3 (miss)
816: { 0 1 1 1 1 1 1 1 } × 6 (miss)
869: { 1 1 1 1 1 1 1 1 } × 4 (miss)
915: { 0 1 1 1 1 1 1 1 } × 5 (miss)
959: { 1 1 1 1 1 1 1 1 } × 4 (miss)
995: 1 1 1 1 1

That is, we swap between the states of having a fallthrough once every 8 and having no fallthrough at all. The fallthroughs within a state are predictable, since they’re a simple repeat. It’s only changing between states that can cause misses.

If you predict a fallthrough only when the branch 8 back was a fallthrough, so that an F follows an F, you will mispredict 18 times out of those 729 branches, annotated on the chart as (miss). This gives a mispredict rate of just 2.5%.
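
We can sanity-check that count with a few lines of code. This is a sketch of the simple rule above, not of any real predictor: it walks the 729 combinations in generation order and predicts a fallthrough exactly when the branch 8 calls back fell through.

#include <iostream>
#include <vector>

int main() {
    std::vector<int> history;   // 1 = early return false, 0 = fallthrough (sum == 15)
    int misses = 0;
    for (int d1 = 1; d1 <= 9; ++d1)
    for (int d2 = 1; d2 <= 9; ++d2)
    for (int d3 = 1; d3 <= 9; ++d3) {
        int outcome = (d1 + d2 + d3 != 15) ? 1 : 0;
        auto n = history.size();
        int predicted = (n >= 8 && history[n - 8] == 0) ? 0 : 1;
        misses += (predicted != outcome);
        history.push_back(outcome);
    }
    std::cout << misses << " misses in " << history.size() << " branches\n";   // 18 in 729
}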

But the actual mispredict rate is even lower, at 1%! The other way to improve predictions is through the global history. There are two simple properties such a predictor might be able to use: switches between the two cycles happen on a cycle boundary when the middle digit is a 1 or when the last digit is a 9. These directly correspond to the history of branches, so can be learnt at least to some extent, though precisely how well depends on microarchitectural details.

Refining the benchmark

Now, is this predictable behaviour wanted? I’d argue it’s not: if you’re doing an in-order traversal you might as well just enumerate the 8 solutions, or hoist the conditionals, or something of that sort. If this function’s runtime is actually important, you probably have an unpredictable source.

Generally speaking, the most reliable way to test these cases is with a simple precalculated stream of inputs. This minimises the testing cost in the inner loop. However, to avoid too much memory pressure it’s best to restrict the size to no more than a few megabytes and run over it a few times. Note that using an array that is too short allows the branch predictor to learn its properties, so it’s best to keep it at least somewhat sizable.

#include <chrono>
#include <cstring>
#include <iostream>
#include <random>
#include <vector>

bool __attribute__ ((noinline)) check_if_magic(char const *sq) {
    ...
}

void generate_or_check(std::vector<char> &test_vectors) {
    char volatile output[10] = {'\0'};
    // 81 passes over a 1/81-sized sample gives the same number of calls as the full sweep.
    for (size_t pass = 0; pass < 81; ++pass) {
        for (size_t i = 0; i < test_vectors.size(); i += 9) {
            if (check_if_magic(&test_vectors[i])) {
                for (size_t j = 0; j < 9; ++j) {
                    output[j] = test_vectors[i + j];
                }
            }
        }
    }
    std::cout << (char const *)output << " ";
}

int main() {
    std::mt19937 rng(0);
    // uniform_int_distribution isn't specified for char, so draw ints and narrow them.
    std::uniform_int_distribution<int> digit_dist('1', '9');
    std::vector<char> test_vectors(43046721, '\0');
    for (char &digit : test_vectors) {
        digit = static_cast<char>(digit_dist(rng));
    }

    auto start = std::chrono::system_clock::now();
    generate_or_check(test_vectors);
    auto end = std::chrono::system_clock::now();
    std::chrono::duration<double> difference = end - start;
    std::cout << difference.count() << "\n\n";
}

This does the same number of tests as the complete sweep, though it does it over a random sample of about 1% of the possible vectors.

For an empty function we get a time of 0.50 seconds, down from 0.65, and for the full function we get 0.96, up from 0.83, for a massive net function cost of 0.46 seconds, where before it was 0.18!

perf stat says the empty function executes 8.51 billion instructions (mt19937 costs a lot) over 5.1 billion cycles. The full function executes 10.95 billion instructions over 6.5 billion cycles. This is an additional 2.44 billion instructions, running at an IPC of a mere 1.7! Last time it was 9.4! Note that previously there were an additional 2.83 billion instructions; the reduction is likely down to instructions being flushed before completion during branch mispredictions, though I am just guessing.

This catastrophic reduction in IPC is almost certainly in part due to the 39 million branch misses, or one every 35 cycles. Since a branch miss takes 15–20 cycles, that’s a huge fraction of our time. 39 million branch misses suggests a miss rate of 10%, which is 1% higher than I expected. I’m not sure why, though one possibility is that the predictor is trying too hard to be smart, rather than locking onto an always-return false prediction, and suffers for doing so.

Round 1. Direct solution vs. the oddity heuristic

Now we can finally start actually looking at the original changes, and comparing our results to the official ones. Knowing what you do now, would you assume that an “early-out” using the following is going to be profitable?

if ( (sq[0] & 1) != 0 || (sq[1] & 1) == 0
  || (sq[2] & 1) != 0 || (sq[3] & 1) == 0
  || (sq[4] & 1) == 0
  || (sq[5] & 1) == 0 || (sq[6] & 1) != 0
  || (sq[7] & 1) == 0 || (sq[8] & 1) != 0)
    return false;

One immediately sees that we’re actually competing against just the first pair of original ifs, not the whole function, and that the main cost of that if was not the arithmetic but the predictability. Unless this is much more predictable, which it isn’t likely to be, it will be a net loss.

When I first did this, I expected a difference of ~25%, thinking that the mispredict frequency would be around 10% originally and 50% after, so a factor-5 increase in mispredict penalty, and then compensating for the assumed cost of the rest of the function. The challenge, using predictable inputs to the benchmark, said the new version was 10% faster.

In fact, the time taken skyrockets to 3.2 seconds, which is actually a factor of 5.4 worse when adjusting for fixed overheads. It shouldn’t be a surprise that my early guess of 25% was too low, now that we know mispredictions are at least 50% of the function’s cost, but increasing those by a factor of 5 should only make the function take three times as long! Why is this so slow, then?

The secret is that the function can mispredict multiple times. It turns out it averages 0.93 misses per call, a 0.85 per call increase!

~56%: (sq[0] & 1) != 0  (taken)
~20%: (sq[0] & 1) != 0  (mispredict)
      (sq[1] & 1) == 0  (mispredict, taken)
~14%: (sq[0] & 1) != 0  (mispredict)
      (sq[1] & 1) == 0
      (sq[2] & 1) != 0  (taken)
 ~5%: (sq[0] & 1) != 0  (mispredict)
      (sq[1] & 1) == 0
      (sq[2] & 1) != 0  (mispredict)
      (sq[3] & 1) == 0  (mispredict, taken)
 ... etc ...

The probability is the chance of that particular path; for the 20% case the first test needs to fail, which means sq[0] is even (4 in 9), and the second test needs to pass, which means sq[1] is also even (4 in 9). The total probability is the product of the two, 16 in 81, which is just under 20%. Both branches miss because it’s always more likely for a particular value to be odd.

The total misprediction count is roughly the sum of the number of mispredicts on each path multiplied by that path’s probability, (56%×0) + (20%×2) + … + (0.17%×5), which comes out to around 0.85, which is what we measure.
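
That sum is easy to reproduce. The sketch below assumes, as above, that the predictor always guesses the more likely digit-is-odd outcome for every test, so a test mispredicts exactly when its digit is even; each digit is odd with probability 5/9 and even with probability 4/9.

#include <cstdio>

int main() {
    // For each test in order: does it return false when its digit is odd?
    // (true for the corner tests, false for the edge and centre tests.)
    const bool returns_on_odd[9] = {true, false, true, false, false,
                                    false, true, false, true};
    const double p_odd = 5.0 / 9.0, p_even = 4.0 / 9.0;
    double p_reach = 1.0, expected_misses = 0.0;
    for (int i = 0; i < 9; ++i) {
        expected_misses += p_reach * p_even;            // mispredict whenever the digit is even
        p_reach *= returns_on_odd[i] ? p_even : p_odd;  // otherwise we fall through to the next test
    }
    std::printf("%.2f expected mispredicts per call\n", expected_misses);   // prints 0.86, in line with the ~0.85 above
}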

Round 2. Direct solution vs. the central 5 heuristic

Now that we know, we’re likely to be a lot more cautious when we see another early-out. The next round takes the initial solution and sticks the following in front of it.

if(sq[4] != '5')
    return false;
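
For reference, the combined function under test looks roughly like the sketch below (my reconstruction, reusing the globals from the first listing and writing the big if as the equivalent chain from earlier).

bool __attribute__ ((noinline)) check_if_magic(char const *sq)
{
    if (sq[4] != '5')
        return false;

    if (sq[0] + sq[1] + sq[2] != c15) return false;
    if (sq[3] + sq[4] + sq[5] != c15) return false;
    if (sq[6] + sq[7] + sq[8] != c15) return false;
    if (sq[0] + sq[3] + sq[6] != c15) return false;
    if (sq[1] + sq[4] + sq[7] != c15) return false;
    if (sq[2] + sq[5] + sq[8] != c15) return false;
    if (sq[0] + sq[4] + sq[8] != c15) return false;
    if (sq[2] + sq[4] + sq[6] != c15) return false;

    auto char_map = ideal_char_map;
    for (auto i = 0u; i < 9; ++i)
        char_map ^= char_map_one << sq[i];
    return char_map == 0;
}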

The fixed arithmetic cost is low, just a load and a compare. It even has a decently predictable branch that triggers 8 in 9 times, so an expected 11% mispredict rate. This seems like a bad trade, since we now know we’re not arithmetically bound, so the savings aren’t worth more mispredictions. However, since the mispredictions are still fairly infrequent, it’s unlikely to be disastrous like the previous one.

When I initially looked at this, I thought the new version would be 50% faster, mostly guided by the fact that the previous official result suggested branch prediction didn’t matter, so I looked at the reduction in arithmetic cost. This turned out to match the official results.

But, indeed, with the new test harness we can see the cost to branch prediction, and the time increases from 1.00 seconds to 1.08 seconds, owing mostly to an extra 12 million mispredicts, just like we reasoned. Note that this suggests each new misprediction is taking us just under 23 cycles!

Round 3. Shifts vs. no shifts

The next question is sold as measuring the cost of the shifts in the original test. It replaces the line containing the shift, shown below in context.

auto char_map = ideal_char_map;
for(auto i = 0u; i < 9; ++i)
    char_map ^= char_map_one << sq[i];
if (char_map != 0)
    return false;

The replacement has a hardcoded result array, bit_shifts, and instead of shifting it performs a lookup.

auto char_map = 0u;
for(auto i = 0u; i < 9; ++i)
    char_map ^= bit_shifts[sq[i]];
if (char_map != 511)
    return false;

We now know that this part is free, so it cannot cause any material change to the timings. And indeed, if we make only this change, we see no difference. I noticed this the first time I tried this puzzle, but strangely the new version was measured as 25% faster! How can that be?

This is because there’s also a sneaky confounder: the if was changed as well!

if ((sq[0] + sq[1] + sq[2] != magic_number)
 || (sq[3] + sq[5] != not_so_magic_number)
 || (sq[6] + sq[7] + sq[8] != magic_number)

 || (sq[0] + sq[3] + sq[6] != magic_number)
 || (sq[1] + sq[7] != not_so_magic_number)
 || (sq[2] + sq[5] + sq[8] != magic_number)

 || (sq[0] + sq[4] + sq[8] != magic_number)
 || (sq[2] + sq[4] + sq[6] != magic_number))

When the function was predicting well, we were arithmetic-bound, so the second-line change would have reduced arithmetic pressure. If anything it’s surprising the change was so large, since the first line was left unchanged, so the difference should be sub-10%, and checking the assembly on GCC 5.4.0 only reinforces that conclusion.

Now that we’re mispredict bound, it’s going to matter even less, though we might see the second line being slightly more or less predictable. Running the code locally, I see no change to the time taken.

Round 4. Fair magic test vs. precached answers

The next test asks you to predict the performance of a linear traversal over all eight precached answers, rather than a first-principles calculation. I’ve modified the test to take and compare a char const *, because producing a std::string on input would destroy our timings, but the std::array<std::string, 8> we’re performing lookup into is unchanged for now.

const std::array<std::string, 8> all_magic_squares = {
    "816357492", "492357816",
    "618753294", "294753618",
    "834159672", "672159834",
    "438951276", "276951438"
};

bool __attribute__ ((noinline)) check_if_magic(char const *sq) {
    for (auto i = 0u; i < 8; ++i) {
        if (!all_magic_squares[i].compare(0, 9, sq, 9)) {
            return true;
        }
    }
    return false;
}

I’ve mentioned that std::string is inefficient, and that will affect us here. The short string optimisation means the strings we check against have their values stored inline, and this check is predictable, so there is that upside; but we’re still doing doubly indirect accesses, and since we’re not worried about the strings falling out of cache we don’t actually save anything.

Now that we know we’re comparing against just the first couple of conditions, it shouldn’t be surprising that doing 8 comparisons is slower, especially given they are slow ones. However, it probably is unexpected that the new test takes 11.8 seconds, which adjusting for the fixed test overhead means a difference of a factor of more than 20! This is crazy because the new version doesn’t even suffer mispredictions!

The problem seems to be std::string::compare, which I assumed would just resolve to a memcmp but doesn’t seem to. Using memcmp directly,

        if (!memcmp(all_magic_squares[i].c_str(), sq, 9))

gives us a time of 2.4 seconds, which seems reasonable. Using an array of char const * instead reduces this overhead even more, to 1.8 seconds, and further using an array of char[9] gives a time of 1.6 seconds, so an adjusted time of 1.1 seconds versus the direct solution’s adjusted 0.5 seconds.
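
For concreteness, the char const * variant looks roughly like this (a sketch; the exact declarations aren’t shown in the original post).

#include <cstring>

const char *all_magic_squares[8] = {
    "816357492", "492357816",
    "618753294", "294753618",
    "834159672", "672159834",
    "438951276", "276951438"
};

bool __attribute__ ((noinline)) check_if_magic(char const *sq) {
    for (auto i = 0u; i < 8; ++i) {
        if (!memcmp(all_magic_squares[i], sq, 9)) {
            return true;
        }
    }
    return false;
}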

This is a good demonstration of the silent costs of abstraction, and I hope it makes you think twice before overusing std::string in scenarios it was not explicitly designed for.

This was also a demonstration of how you can trade branchy code for arithmetic code. In this case the original, branchy code was best, but if you think about how much extra arithmetic we’re doing to remove a few bad branches, it’s actually pretty impressive.

Round 5. String array vs. string set

The next test compares the std::array with a std::set, testing an asymptotically faster data structure on short, statically defined data (this was the first benchmark not tested against the primary implementation). Since we’ve abandoned std::string, it’s sensible to do the same for the std::set variant, and the code tested is shown below. I think casting from char const * to std::array<char, 9> const * is well-defined; please correct me if it is not.

const std::set<std::array<char, 9>> all_magic_squares = {
    { '8', '1', '6', '3', '5', '7', '4', '9', '2' },
    { '4', '9', '2', '3', '5', '7', '8', '1', '6' },
    { '6', '1', '8', '7', '5', '3', '2', '9', '4' },
    { '2', '9', '4', '7', '5', '3', '6', '1', '8' },
    { '8', '3', '4', '1', '5', '9', '6', '7', '2' },
    { '6', '7', '2', '1', '5', '9', '8', '3', '4' },
    { '4', '3', '8', '9', '5', '1', '2', '7', '6' },
    { '2', '7', '6', '9', '5', '1', '4', '3', '8' }
};

bool __attribute__ ((noinline)) check_if_magic(char const *sq) {
    auto &arr = *reinterpret_cast<std::array<char, 9> const *>(sq);
    return all_magic_squares.find(arr) != all_magic_squares.end();
}

A std::set is, at least in this implementation, a red-black binary tree. Since we don’t care about mutations, the principal expectation is that it behaves like a balanced binary tree. For eight items, this means a tree of three full layers and one extra element.

        1
      /   \
     /     \
    2       3
   / \     / \
  4   5   6   7
 /
8

The average element access will take about one and a half traversals; the exact tests are unlikely to miss because they are almost always false, but you need to reach the end. This is obviously slower, and that you’re doing it indirectly through a data structure only makes it worse.

On the original benchmark, my guess was a small speed hit, since std::string hid timing differences and the lack of branch prediction issues was fairly obvious by this point, but the actual results were a factor-2 speed improvement. This makes sense when you consider that the in-order traversal will always go through the branches in a predictable order, so the set variant hides arithmetic cost, though I am surprised that it was so effective.

With the new benchmark I get a time of 8.3 seconds, a factor-16 increase in time. This follows largely from the 1.5 branch mispredictions per call, on top of an increased instruction count for the traversal; now that the string comparisons themselves are relatively cheap, cutting their number down doesn’t buy much arithmetic saving.

Round 6. String set vs. string unordered set

Given that a std::set tested well on the original benchmark, it made sense to explore the alternatives. This next test swapped it out for a std::unordered_set, a hash set using separate chaining.

Using std::unordered_set requires a std::hash, but we’ve abandoned the hashable std::string for a std::array. This is easiest to get back by stealing from C++17’s std::string_view.

// Needs <string_view> and -std=c++17.
struct array_hash {
    template <size_t N>
    std::size_t operator()(const std::array<char, N> &arr) const {
        return std::hash<std::string_view>()({&arr[0], N});
    }
};

const std::unordered_set<std::array<char, 9>, array_hash>
    all_magic_squares = { ... };

We expect this to reduce the number of mispredictions, since rather than an array traversal we just test against the relevant value, but we’re still going to mispredict if there is a mix of empty buckets, length-1 buckets and length-2 buckets. In addition, hashing elements can be slow.

Since the original test was arithmetic bound and predicted well, I expected a 50% slowdown relative to the std::set; hash maps work best for less trivially sized data. The original results were slightly more optimistic about the hash map, but mostly agreed.

In fact, the new version also has a similar slowdown, taking 12.4 seconds. A bunch of this seems to be down to having 48 billion instructions rather than 35 billion for the set, especially given the number of branch mispredictions fell from 600 million to 280 million, and using a more efficient std::hash to reduce the instruction count to 270 million does reduce the time taken to a much closer 9.4 seconds.

struct array_hash {
    template <size_t N>
    std::size_t operator()(const std::array<char, N> &arr) const {
        // Just read the first eight characters as an integer and scramble them.
        return *(std::size_t *)&arr * 52341237;
    }
};

Round 7. String array vs. string array plus the oddity heuristic

Going back to the linear traversal over std::array, which you might not recall took 1.6 seconds, or 1.1 adjusting for benchmark overhead, how do you think adding the oddity heuristic would change the timing characteristics?

As we previously found, the oddity heuristic applied to the original code took 3.2 seconds, and since it handles most of the cases we would expect that measurement to transfer over directly. Indeed, we see as expected that the time taken is also 3.2 seconds.

Since the original benchmark had the inefficient std::string array and a predictable code path, my prediction then was that the heuristic would do what we had seen it do before, namely run far faster, and this showed in the results.

Round 8. Direct solution vs. answers in uint64_t

The next question introduces a new technique, where the array is not of type std::array<char[9], 8> but std::array<uint64_t, 8>, and compares it against the very first code block. The input is handled by first returning false if sq[4] != '5', then packing the remaining eight characters into an integer.

const std::array<uint64_t, 8> magic_numbers {
    3545515123101087289, 3690191062107239479,
    3544956562637535289, 3978984379655991859,
    3689073941180135479, 4123101758198592049,
    3977867258728887859, 4122543197735040049
};

bool __attribute__ ((noinline)) check_if_magic(char const *sq)
{
    if (sq[4] != '5') {
        return false;
    }
    uint64_t magic_number = *(reinterpret_cast<const uint32_t*>(sq));
    magic_number <<= 32;
    magic_number += *(reinterpret_cast<const uint32_t*>(sq+5));
    for(auto i = 0u; i < 8; ++i) {
        if(magic_number == magic_numbers[i]) {
            return true;
        }
    }
    return false;
}

We know how well the first test does, and the rest of the checks will roughly match, maybe beat, the char[9] variant, so we expect somewhere between 1.1 and 1.6 seconds. My original commentary for this round weighed the sq[4] test being fast, because it was predictable then, against the very slow long test after it, and this ended up significantly in favour of the early-out.

For the new, unpredictable test harness, it turns out we get 1.35 seconds, balancing the much larger cost of a long traversal against the somewhat-predictable early out.

Round 9. Array on uint64_t vs. set of uint64_t

The penultimate test replaces the array of uint64_t with a std::set, rerunning an older test with a newer type. Since we know the std::set search was four times slower than the array traversal but that the difference is only a ninth as relevant, we would expect a time of around, say, 1.8 seconds. My original guess used a similar balancing of known results, as did the original results.
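
The code under test isn’t shown here, but it presumably looks something like this sketch: the same early-out and packing as Round 8, with the linear search replaced by a set lookup.

#include <cstdint>
#include <set>

const std::set<uint64_t> magic_numbers {
    3545515123101087289, 3690191062107239479,
    3544956562637535289, 3978984379655991859,
    3689073941180135479, 4123101758198592049,
    3977867258728887859, 4122543197735040049
};

bool __attribute__ ((noinline)) check_if_magic(char const *sq)
{
    if (sq[4] != '5') {
        return false;
    }
    uint64_t magic_number = *(reinterpret_cast<const uint32_t*>(sq));
    magic_number <<= 32;
    magic_number += *(reinterpret_cast<const uint32_t*>(sq + 5));
    return magic_numbers.count(magic_number) != 0;
}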

Surprisingly to me, though, under the new tests the std::set takes a solid 1.4 seconds. Delaying the test of sq[4] shows that the use of uint64_t in the std::set makes the search take only 4.5 seconds, or 4.0 seconds adjusted for benchmark overhead, which is twice as fast as the previous use of the set. Somehow this has allowed the number of branch mispredictions to plummet from 590 million to 320 million. It seems either we’re luckier or sets are simply faster with simple comparisons and small types.

I’m not going to investigate this in detail, but the offer is open to readers.

Round 10. Array on uint64_t vs. unordered set of uint64_t

As we come to our final round, we repeat the last test with a std::unordered_set. This takes 2.1 seconds, or 1.6 seconds adjusted for benchmark overhead, and in the spirit of learning I shall leave a deeper explanation as an exercise to the reader. The answer should explain why this is much slower than the adjusted 0.9 from the previous, and reference other previously measured results.


Conclusion

When you’re next faced with a performance question, try to ask why you’re seeing what you are, and then prove it with tests. I’ve corrected myself several times writing this based on the data I’ve found, but just as many times I noticed the data was wrong because it didn’t match my model.

When people skilled at performance analysis tell you to measure, they are not telling you to fire and forget. Measuring is not the method with which you gradient descent to your solution, it’s how you verify and expand your understanding. Only with a working model are you able to act with purpose.

You see a lot these days of people talking about the infeasibility of understanding the levels below. Compilers are too smart, they say, and you should leave them to do their own thing. CPU pipelines are too complicated to reason about, it goes, so you can only guess and measure. Any optimisation, the stories tell, is premature if it hasn’t first slowed down your production server.

Hopefully I have shown that this is perhaps less true than it sounded before, and if just one person reading this is encouraged to cast off the veil of magic, to discover what there is to be discovered, this post will have been well worth writing.

Thanks to wordsandbuttons for the posts this one builds off of and for reviewing an early copy of this post.