When competing with C, fudge the benchmark

Franklin He
Jun 26, 2017


EDIT: This post originally stated that the C version was 60 times faster; that was due to an error, which has since been fixed. Cross-language benchmarking is still hard.

A few days ago, I encountered an article about the performance of Haskell vs. C:

The blog post also links to another article where the author rewrote his library in pure Haskell, along with some microbenchmarks. Both are very interesting and I definitely learned a few things, but they fall short in their benchmark methodology. Cross-language benchmarks are notoriously hard to perform correctly, and many have been debunked when put under a microscope. Examples here, here, and here.

I took the benchmarks and tried, to the best of my abilities, to verify them, and the result is:

the C version of this algorithm is about 4 to 8 times faster.

Rather than just giving you the numbers and sparking a holy war among language enthusiasts, let's ask: exactly WHY was the C version faster? And what does this tell us about C and Haskell?

If Hammingway wrote C…

The algorithm in question computes the Hamming distance: the number of positions at which two strings of equal length differ. So, given

“abcde” and “abcdf”

You’d have a Hamming distance of 1, and so on.

In the text-metrics article, the original author used the following C snippet
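Roughly speaking, it is a straightforward character-by-character loop. A minimal sketch of it (my own reconstruction for illustration, assuming 16-bit characters to match the 2-byte characters discussed later, and not necessarily the author's exact code) looks like this:

    #include <stddef.h>
    #include <stdint.h>

    /* Sketch of the scalar loop being benchmarked: walk two equal-length
       strings of 16-bit code units and count the positions that differ.
       This is a reconstruction, not the article's exact snippet. */
    size_t hamming(const uint16_t *a, const uint16_t *b, size_t len)
    {
        size_t dist = 0;
        for (size_t i = 0; i < len; i++) {
            if (a[i] != b[i])
                dist++;
        }
        return dist;
    }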

The article also provided a series of Criterion benchmarks, showing the code running with strings under 160 characters.

Here we run into some common problems with cross-language benchmarks:

  • Running the benchmark for an extremely short amount of time
    The runs measured here are extremely short, around 500 nanoseconds each (the original measurements did not have units; this is an educated guess). 500 nanoseconds is a tiny amount of time, and the result can easily be thrown off by a few cache misses, or by your code being scheduled off the processor at the end of its timeslice, interrupted, and so on. When profiling, it is important to run your code over an extended period, preferably seconds, to smooth out any irregularities.
    In my benchmarks, I made the C code and the Haskell code run on inputs of 100 thousand, 1 million, and 10 million characters, and each measurement is repeated 1000 times, to obtain decent results (see the harness sketch after this list).
  • Writing non-idiomatic code in the target language
    Another way to throw off a benchmark in another language is to write code that isn't idiomatic, so the compiler can't optimize it well. In this example, I would have preferred either walking direct pointers (*a != *b, then incrementing both pointers) or indexing with a[i] != b[i] (which is semantically equivalent to *(a + i)); as it turns out there is no performance penalty here, but it is worth being on the lookout for this kind of problem.
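Here is a rough sketch of the kind of harness I mean: time many iterations over a large input rather than one run over a tiny one. The sizes, the generated input data, and the hamming() function are illustrative assumptions, not the exact harness behind the published numbers.

    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>
    #include <stddef.h>
    #include <time.h>

    /* Illustrative harness: repeat the measurement many times over a large
       input so scheduling noise and cache effects average out. */

    static size_t hamming(const uint16_t *a, const uint16_t *b, size_t len)
    {
        size_t dist = 0;
        for (size_t i = 0; i < len; i++)
            if (a[i] != b[i])
                dist++;
        return dist;
    }

    int main(void)
    {
        const size_t len = 10 * 1000 * 1000;   /* 10 million characters */
        const int iterations = 1000;

        uint16_t *a = malloc(len * sizeof *a);
        uint16_t *b = malloc(len * sizeof *b);
        for (size_t i = 0; i < len; i++) {
            a[i] = (uint16_t)(i & 0x7F);
            b[i] = (uint16_t)((i * 31) & 0x7F);
        }

        volatile size_t sink = 0;              /* keep the calls from being optimized away */
        struct timespec start, end;
        clock_gettime(CLOCK_MONOTONIC, &start);
        for (int it = 0; it < iterations; it++)
            sink += hamming(a, b, len);
        clock_gettime(CLOCK_MONOTONIC, &end);

        double secs = (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9;
        printf("%d runs in %.3f s (%.3f ms per run, checksum %zu)\n",
               iterations, secs, 1000.0 * secs / iterations, (size_t)sink);

        free(a);
        free(b);
        return 0;
    }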

After setting up a test bench for this code (which you can find here), I compiled it with gcc -O3 (a script to compile the code is included) and ran it on an Intel i7-6700K under Ubuntu 16.04. I obtained the following results:

Cool, what about the Haskell version?

The Types are strong with this one…

I ran into a problem:

I didn’t know any Haskell.

After pleading with half of my friends, one of them was nice enough to sit with me and help me set up a working Haskell stack, get a profiling bench up and running, and battle the type checker. Much thanks to him!

Anyways, I copied the fastest algorithm I could find from the page (the one that compares the internal arrays), passed -O2 to GHC for maximum optimization, and benchmarked it under Criterion on the same machine (i7-6700K, 32 GB of RAM, Ubuntu 16.04).

My full code can be found here.

If we put the results next to each other, we get this.

You can see that the C code is consistently faster than the Haskell code, by a factor of around 9, except on the largest input.

Now, before you grab your pitchforks, two questions are worth answering:

  1. Why is the C version faster?
  2. Why do we get a smaller speedup when we reach 10 million characters?

The second question is easier to answer: my CPU has an 8 MB last-level cache, and 10 million 2-byte characters come to about 20 MB, which does not fit. That slows the code down and shrinks the speedup we get from whatever code differences there are.

The first question is harder to answer; at a glance, the algorithms all look the same. This suggests we must dig deeper, so to the assembly code we go!

Side Note: How do you disassemble Haskell?

GHC, through a series of intermediate representations (STG, C--, etc.), eventually compiles down to assembly and links it against the Haskell runtime. If you add -g to the GHC compiler options, it will emit debugging symbols that help your disassembler tell which functions are which. For this simple example, I'm using the Linux utility objdump.

To disassemble a binary, I run the following command:

objdump -D -M intel ./program_to_disassemble > result.asm

(-M intel is optional)

What do we get when we dump the Haskell code?

For those who don’t understand assembly, a quick walkthrough:

  • A lot of the code at the start is function-call busywork and can be safely ignored.
  • The block at 16364 finally seems to be the part where we compare the two characters in r10 and r9, increment r8, which seems to be the Hamming distance counter, and then jump away.
  • The rest of the code keeps loading different parts of the strings for us to check, and then we return.

Nice and simple, relatively straightforward assembly.

Now, what happens if we disassemble the C version?

Is that it?

I lied; this actually goes on for several pages… I have no idea what I'm doing.

Hang on, I think I see something here:

GOTCHA

Looking at the Intel Programmer's Manual, or Google, turns up some very interesting results for the v… blah blahs that show up here.

SIMD instructions! Of Course!

If you don’t know what SIMD instructions are, in a nutshell:

Modern CPUs have special hardware for processing data in parallel. With SIMD (Single Instruction, Multiple Data) instructions, the CPU can perform an operation on several pieces of data at once: rather than looping through and comparing one character at a time, we can compare entire chunks at a time.

Looking at the Intel manuals, VPCMPEQW compares packed words (a word is 16 bits) in 128-bit-wide registers; in other words, 8 characters at once! There are also SIMD instructions for adding the results together, so not only do we do the comparison 8 at a time, we also accumulate the results roughly 8 at a time. This should give us about an 8x speedup, which, looking at the tables, is in the right ballpark.
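To make that concrete, here is a rough sketch of the same comparison done 8 words at a time. It uses SSE2 intrinsics by hand purely for illustration; in the actual benchmark GCC's auto-vectorizer generates this kind of code on its own, and its exact instruction sequence and accumulation strategy will differ.

    #include <stddef.h>
    #include <stdint.h>
    #include <emmintrin.h>   /* SSE2 intrinsics */

    /* Illustrative sketch: count mismatching 16-bit characters 8 at a time.
       Assumes len is a multiple of 8; not identical to GCC's generated code. */
    size_t hamming_simd(const uint16_t *a, const uint16_t *b, size_t len)
    {
        size_t dist = 0;
        for (size_t i = 0; i < len; i += 8) {
            __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
            __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
            /* 0xFFFF in every word position where the two words are equal */
            __m128i eq = _mm_cmpeq_epi16(va, vb);
            /* movemask gives one bit per byte, so each equal word sets 2 bits */
            int equal_words = __builtin_popcount(_mm_movemask_epi8(eq)) / 2;
            dist += 8 - equal_words;
        }
        return dist;
    }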

Not only do we have data, we have an explanation for the data. Success!

If you have read up to here, congratulations!

Okay, What are you trying to prove?

Honestly, all of this analysis really doesn’t prove that you can’t get “C-like speeds” with Haskell. Nor does it prove that Haskell code will always be slower than C code.

  1. This is a microbenchmark: a small piece of code, normally embedded in a larger code base, measured on its own. It is most likely not representative of anyone's real workload, and with code this simple the compiler can pull off fancy tricks that it could not with a larger, more complex code sequence.
  2. The C code is executing in a pure C environment. If it were embedded in Haskell, you would pay extra foreign function call overhead.
  3. This code hits what Vyacheslav Egorov calls the “performance sweet spot” of C: it is exactly the kind of code that GCC's optimizer can have a field day with. Not all the code you write can be optimized with SIMD instructions.

Quoting him again, in the same article:

You can hit the sweet spot, no doubt about that. But you have to follow rules to hit it. And those rules are strict (application runs best when it’s actually pretty static in it’s behavior) and sometimes unknown (different VMs rely on different heuristics). And hitting the sweetest part requires some strange engineering decisions as demonstrated by the mr. C.’s example above.

The optimized Haskell did exactly that: it relied on the Text internals and wrote a bunch of code that essentially looks like C. It's neither elegant, nor would any sane person do this for an entire project.

So no, I didn't particularly like the benchmark. I think every benchmark should come with an analysis of why the results are what they are, rather than just the numbers. Hence this entire post.

Side Note:

GHC apparently had a branch that contained SIMD optimizations, as detailed here: https://ghc.haskell.org/trac/ghc/wiki/SIMD

Unfortunately, that doesn't seem to have been merged. I didn't get a chance to try it, nor to try the LLVM backend, which may also have vectorization primitives. Maybe in the future Haskell will optimize this to “C-like” speeds, but for now, we'll wait.

Wrapping up

tl;dr

  • Benchmarking is HARD; there are a thousand pitfalls you can fall into, e.g. the code being limited by some other resource, the code not being written idiomatically, and so on. Figure out why your benchmark runs at X speed rather than at 2X or 3X of that. Debuggers and profilers help greatly in this regard.
  • Benchmarks get screwed up by mistakes in code! The initial version of this blog post had the C version 60 times faster because of a bug that caused the Haskell version to produce longer strings than the C version.
  • Cross-language benchmarks are complex, and people on the internet get them wrong all the time. Never believe benchmark numbers at face value; do your own verification.
  • You can ALWAYS dig deeper to find a performance problem.

Anyways, I’d love to hear your thoughts on this, as well as any comments you have. All code is up on GitHub if you’d like to replicate my results, and I can be reached on Twitter here:
