Once more on Haskell vs C performance

J Ho · Published in Superstring Theory · Apr 3, 2017

Bottom line: Haskell is faster :)

We’ve been experimenting with machine learning in Haskell lately (the functional approach is a great fit for ML tasks but hasn’t seen much use there yet). As you probably know, the bread and butter of any machine learning algorithm is matrix multiplication. So, before delving into the Accelerate or Repa libraries, we set out to do some quick-and-dirty checks on basic Vector performance.

The result: with the latest GHC (8.*), compiled with -O2 and optimized via LLVM 3.7, Haskell runs 15–25% faster than gcc -O3 when computing the dot product of two vectors. A standard ghc -O2 build (without LLVM) is about on par with C; ghc -O runs significantly slower.
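For reference, the optimization level and the LLVM backend can be requested per module with an OPTIONS_GHC pragma or with the equivalent command-line flags; this is just one possible setup, not necessarily the exact build configuration behind these numbers:

{-# OPTIONS_GHC -O2 -fllvm #-}
-- equivalent to: ghc -O2 -fllvm Main.hs
-- -fllvm needs a compatible LLVM toolchain installed (3.7 for the GHC 8.0 series)
module Main where

main :: IO ()
main = putStrLn "built with -O2 via the LLVM backend"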

Haskell performs about the same with sum or foldl', and the function that calculates the dot product is extremely elegant (as you would expect):

import qualified Data.Vector.Unboxed as U
dotV5 v1 v2 = U.sum $ U.zipWith (*) v1 v2

Detailed criterion benchmark results are below; on average, the dot product of two 10,000,000-element Double vectors runs in about 12.3 ms.
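For context, a minimal criterion harness along these lines produces that kind of measurement (the group names, vector contents, and use of whnf are illustrative assumptions, not the original benchmark code):

import Criterion.Main
import qualified Data.Vector.Unboxed as U

dotV5 :: U.Vector Double -> U.Vector Double -> Double
dotV5 v1 v2 = U.sum $ U.zipWith (*) v1 v2

main :: IO ()
main = do
  let n  = 10000000 :: Int                          -- 10,000,000 elements
      v1 = U.generate n fromIntegral
      v2 = U.generate n (\i -> fromIntegral (n - i))
  -- the vectors are forced on the first run; for stricter setup the inputs
  -- could be built with criterion's env combinator instead
  defaultMain
    [ bgroup "main" [ bgroup "small" [ bench "sum" $ whnf (dotV5 v1) v2 ] ] ]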

The C code is pretty basic (and somewhat ugly). On a Xeon-based ThinkPad P50 running Ubuntu, compiled with gcc -O3, it runs in about 15.3 ms, more than 20% slower than Haskell! Now, granted, we didn’t compare against specialized libraries such as LAPACK and similar (that is still down the road), but this is yet another piece of quick evidence, similar to this (and a bunch of others), that Haskell can indeed perform as well as or better than C, while giving you the power of much more concise, elegant, error-free code.

In Haskell, we tried different approaches to see how well the advertised stream-fusion framework works with unboxed Vectors: zipping the two vectors and then left-folding, a strict left fold, the specialized fold (sum), and ifolds with some light hand-optimization (these perform much better than the standard folds under -O, but -O2 shows you don’t need the trickery and can rely on the compiler).
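Here is a sketch of what those variants can look like over unboxed vectors (illustrative definitions, not the original code):

import qualified Data.Vector.Unboxed as U

-- lazy left fold over the zipped vectors (the slow one)
dotFoldl :: U.Vector Double -> U.Vector Double -> Double
dotFoldl v1 v2 = U.foldl (+) 0 (U.zipWith (*) v1 v2)

-- strict left fold over the zipped vectors
dotFoldl' :: U.Vector Double -> U.Vector Double -> Double
dotFoldl' v1 v2 = U.foldl' (+) 0 (U.zipWith (*) v1 v2)

-- specialized fold: sum of the zipped vectors (dotV5 above)
dotSum :: U.Vector Double -> U.Vector Double -> Double
dotSum v1 v2 = U.sum (U.zipWith (*) v1 v2)

-- indexed strict fold with no intermediate zip (the "no zip" variants)
dotIFoldl' :: U.Vector Double -> U.Vector Double -> Double
dotIFoldl' v1 v2 = U.ifoldl' (\acc i x -> acc + x * U.unsafeIndex v2 i) 0 v1

The "no zip" variant indexes into the second vector directly instead of going through zipWith, which is roughly the kind of light hand-tuning meant above.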

Detailed results:

benchmarking main/small/foldl
time 37.09 ms (36.60 ms .. 37.83 ms)
0.999 R² (0.998 R² .. 1.000 R²)
mean 36.99 ms (36.75 ms .. 37.20 ms)
std dev 491.1 μs (324.1 μs .. 704.2 μs)

benchmarking main/small/foldl'
time 12.43 ms (12.29 ms .. 12.59 ms)
0.999 R² (0.998 R² .. 1.000 R²)
mean 12.69 ms (12.59 ms .. 12.88 ms)
std dev 328.0 μs (232.9 μs .. 483.7 μs)

benchmarking main/small/sum
time 12.38 ms (12.33 ms .. 12.43 ms)
1.000 R² (1.000 R² .. 1.000 R²)
mean 12.43 ms (12.39 ms .. 12.48 ms)
std dev 116.1 μs (59.69 μs .. 166.4 μs)

benchmarking main/small/foldl no zip
time 12.47 ms (12.38 ms .. 12.60 ms)
0.999 R² (0.998 R² .. 1.000 R²)
mean 12.51 ms (12.46 ms .. 12.61 ms)
std dev 181.1 μs (134.5 μs .. 262.7 μs)

benchmarking main/small/foldl' no zip
time 12.38 ms (12.34 ms .. 12.42 ms)
1.000 R² (1.000 R² .. 1.000 R²)
mean 12.40 ms (12.38 ms .. 12.44 ms)
std dev 78.49 μs (45.18 μs .. 131.7 μs)

The lazy fold is very slow, as you would expect (as are right folds, which are not quoted above), but strict left folds, sum, and the lightly hand-optimized ifolds all perform very similarly.
