Are Mutable References Fast?
I was profiling a simple loop (years ago). It was similar to:
countForever ref = forever $ modifyIORef' ref (+1)
I counted four assembly instructions.
Addition takes ~300 picoseconds, but there is memory access so let’s say two nanoseconds
To profile, I wrote:
iorefLoop = do
ref <- newIORef (0 :: Int)
replicateM_ 100000000 $ modifyIORef' ref (+1)
I expected this to take ~ 0.2 seconds, but …
$ bench ./bin/ioref-loopbenchmarking ./bin/ioref-loop
time 332.3 ms (330.4 ms .. 333.8 ms)
1.000 R² (1.000 R² .. 1.000 R²)
mean 334.4 ms (333.2 ms .. 335.9 ms)
std dev 2.822 ms (2.182 ms .. 3.942 ms)
Was this a good time? Was I expecting something unrealistic from Haskell? As a sanity check, I wrote an analogous program in C:
int main (int argc, const char** argv)
{
int ref = 0;
while (ref < 100000000)
{
ref++;
}
}
I compiled and ran it like:
$ gcc -O2 ./src/CLoop.c -o ./bin/CLoop && bench ./bin/CLoopbenchmarking ./bin/CLoop
time 3.999 ms (3.794 ms .. 4.235 ms)
0.980 R² (0.972 R² .. 0.992 R²)
mean 3.940 ms (3.838 ms .. 4.093 ms)
std dev 593.3 μs (422.8 μs .. 816.0 μs)
variance introduced by outliers: 90% (severely inflated)
This was much faster than I had expected.
Maybe
bench
isn’t accurate when programs are fast?
I went on IRC (different times because GHC makes faster code now):
jfischoff ~ Is there anyway to make modifyIORef' faster?
jfischoff ~ I am surprised that in a second I was only able to
↪ update this ref 100 million times:
↪ timeout 1000000 $ forever $ modifyIORef' x (1+)
jfischoff ~ where as c++ was able to do the same in 4 milliseconds
glguy ~ c++ was able to do 1 update every 0.04 nanoseconds?
glguy ~ an update rate of 25 gigahertz?
dv- ~ gcc probably just replaced it with a constant
jfischoff ~ dv-: perhaps
glguy ~ That or C++ unlocks the fast mode of an Intel processor
Burn.
$ gcc -O2 CLoop.c -S
Here’s the assembly:
; CLoop.s
; Notice, there are no jumps.
; There is no looping.
_main: ## @main
.cfi_startproc
## BB#0:
pushq %rbp
Ltmp0:
.cfi_def_cfa_offset 16
Ltmp1:
.cfi_offset %rbp, -16
movq %rsp, %rbp
Ltmp2:
.cfi_def_cfa_register %rbp
xorl %eax, %eax
popq %rbp
retq
.cfi_endproc
dv-
was right. The compiler got rid of the loop, I’m assuming, because I wasn’t using the result. I added a volatile
keyword to prevent optimizations:
//CountForver.c
int main (int argc, const char** argv)
{
volatile int ref = 0;
while (ref < 100000000)
{
ref++;
}
}
Compiling:
$ gcc -O2 ./src/CLoopVolatile.c -o ./bin/CLoopVolatile
$ bench ./bin/CLoopVolatilebenchmarking ./bin/CLoopVolatiletime 158.2 ms (156.7 ms .. 159.5 ms)
1.000 R² (0.999 R² .. 1.000 R²)
mean 159.1 ms (158.4 ms .. 160.2 ms)
std dev 2.423 ms (1.755 ms .. 3.772 ms)
The assembly of the loop is:
loop:
incl -4(%rbp) # Increment the value at the address stored
# in rbp displaced by 4.
movl -4(%rbp), %eax # move the value at the address stored in rbp
# displaced by 4 to eax.
cmpl $100000000, %eax # Compare to 100000000 to eax.
jl loop # Jump to the start of the loop if the eax
# was less than when compared above.
So, incrementing C’s mutable refs is over twice as fast as IORef
? Maybe, though, handwritten assembly could be faster? Maybe the volatile
keyword threw off optimizations too much?
; AsmLoop.asm
EXTERN _exit
GLOBAL start
SECTION .data
align 8
iterationCount DD 100000000
result DD 0
SECTION .text
start:
; Copy the iteration count to the ecx register
; which is used by the loopnz instruction
mov ecx, [iterationCount]
loopBlock:
inc dword [result] ; Increment the value at the address of result
loopnz loopBlock ; Decrement the ecx counter and loop until ecx
; is zero
exitBlock:
push dword 0 ; Set the exit code to zero
mov eax, 1 ; Place the system call number (exit) in the eax reg
sub esp, 4 ; I have to add some dummy space to the stack for
; some reason
int 0x80 ; Exit
Compiling:
$ nasm -fmacho src/AsmLoop.asm
$ ld -o ./bin/AsmLoop -macosx_version_min 10.7.0 ./src/AsmLoop.o
$ bench ./bin/AsmLoop
benchmarking ./bin/AsmLoop
time 185.2 ms (183.7 ms .. 186.9 ms)
1.000 R² (0.999 R² .. 1.000 R²)
mean 187.0 ms (185.9 ms .. 188.5 ms)
std dev 3.132 ms (2.229 ms .. 4.269 ms)
Wow, my l33t skillz increased the time by around 15% (three years ago this version was faster … just saying).
So IORef
is about two times slower than naive approaches in C and assembly. What gives? To the core!
$ stack exec ghc-core -- --no-syntax -- -O2 \
-dsuppress-all src/IORefLoop.hs
Looking through the core I saw:
...
case readMutVar# ipv1_aVW w2_a35m
of _ { (# ipv2_aWt, ipv3_aWu #) ->
case ipv3_aWu of _ { I# x_aUW ->
case writeMutVar# ipv1_aVW (I# (+# x_aUW 1#)) ipv2_aWt
...
I find GHC core almost unreadable.
One trick is to try to ignore most of the case statements. The first and third case statements are not for scrutinizing alternatives, but are to ensure proper sequencing of IO actions. Also, renaming the generated variables helps.
Here is a cleaned up version of what was above:
...
case readMutVar# ioRef token of
{ (# token', x #) ->
case x of
{ I# unbox -> case writeMutVar# ioRef (I# (+# unbox 1#)) token'
...
The second case statement is unboxing an Int
to a primitive unboxed Int#
,
case x of
{ I# unbox
and then boxing when setting.
case writeMutVar# ioRef (I# (+# unbox 1#)) token'
I#
is the Int
constructor (the #
means it is a compiler primitive). It wraps or boxes an unpacked, unboxed, "machine" int. Most of the time, GHC can unbox primitives automatically, but it can't in this case. Boxing/unboxing could be the root of the slowdown.
We need an unboxed mutable reference. I can think of two options: Ptr Int and MutableByteArray.
First the Ptr
version:
main = alloca $ \ptr -> do
poke ptr (0 :: Int)
replicateM_ 100000000 $ do
i <- peek ptr
let !i' = i + 1
poke ptr i'
Running, I get:
bench ./bin/ptr-loopbenchmarking ./bin/ptr-loop
time 175.2 ms (174.1 ms .. 176.2 ms)
1.000 R² (0.999 R² .. 1.000 R²)
mean 174.7 ms (174.1 ms .. 175.2 ms)
std dev 1.626 ms (1.372 ms .. 2.089 ms)
This is good! Almost twice as fast IORef
.
Now the MutableByteBuffer
version:
main = do
mba <- newAlignedPinnedByteArray 4 8
replicateM_ 100000000 $ do
i <- readByteArray mba 0 :: IO Int
let !i' = i + 1
writeByteArray mba 0 i'
Running:
bench ./bin/mutable-byte-buffer-loopbenchmarking ./bin/mutable-byte-buffer-loop
time 175.2 ms (173.5 ms .. 177.3 ms)
0.999 R² (0.999 R² .. 1.000 R²)
mean 177.4 ms (176.5 ms .. 178.4 ms)
std dev 2.423 ms (1.987 ms .. 2.919 ms)
Basically identical.
I’ve made progress, but I don’t want to use either Ptr
or MutableByteBuffer
. They are not very safe (Ptr
is arguably safe enough, but I prefer not to manually allocate memory directly if I can help it).
If only there were a safe wrapper around one of the types? There is! Vector
is a safe wrapper around MutableByteBuffer
. Here is the Vector
version:
main = do
mba <- M.new 1
replicateM_ 100000000 $ do
i <- M.read mba 0 :: IO Int
let !i' = i + 1
M.write mba 0 i'
Running:
bench ./bin/vector-loopbenchmarking ./bin/vector-loop
time 178.6 ms (177.3 ms .. 179.9 ms)
1.000 R² (0.999 R² .. 1.000 R²)
mean 178.2 ms (177.1 ms .. 179.3 ms)
std dev 3.021 ms (2.082 ms .. 4.288 ms)
Nice! With only a small increase in time, we get a safe interface for fast, unboxed mutable references. It is a little odd to use a collection type for a single element, but it’s not terrible, either.
What’s on the Table
The CLoopVolatile
and the AsmLoop.asm
are not the fastest way one could write a loop that increments a number. That would involve putting the reference in a register for the duration of the loop, and other tricks like loop unrolling. Just to give a sense of how much faster things could be here is a some simple assembly example that does that.
; FastAsmLoop.asm
EXTERN _exit
GLOBAL startSECTION .text
start:
mov rcx, 0
mov rax, 100000000loopBlock:
inc qword rcx ;
cmp rcx, rax ;
jl loopBlock ; exitBlock:
mov rax, 0x2000001 ; exit
mov rdi, 0
syscall
Compiling
$ nasm -fmacho64 src/FastAsmLoop.asm
$ ld -o ./bin/FastAsmLoop -macosx_version_min 10.7.0 \
./src/FastAsmLoop.o
$ bench ./bin/FastAsmLoop;benchmarking ./bin/FastAsmLoop
time 38.38 ms (37.96 ms .. 39.15 ms)
0.999 R² (0.995 R² .. 1.000 R²)
mean 38.22 ms (37.98 ms .. 38.87 ms)
std dev 715.8 μs (282.6 μs .. 1.261 ms)
I barely believe these numbers, but I think it shows that GHC still is leaving some performance on the table.
Something else worth noting, this isn’t how a tight loop would be written in Haskell anyway. One would not need a mutable reference. In a future post I would like to do more direct comparison between Haskell and assembly similar to above. The C code was really just a baseline that lead me to look into what GHC was doing.
Update: Michael Snoyman alerted me to a package of his that provides an unboxed mutable reference, among other nice things: https://www.stackage.org/haddock/lts-8.16/mutable-containers-0.3.3/Data-Mutable.html#t:URef
The a project with this code that can also produce these timings can be found here: https://github.com/jfischoff/are-mutable-references-in-haskell-fast
Hacker Noon is how hackers start their afternoons. We’re a part of the @AMIfamily. We are now accepting submissions and happy to discuss advertising & sponsorship opportunities.
To learn more, read our about page, like/message us on Facebook, or simply, tweet/DM @HackerNoon.
If you enjoyed this story, we recommend reading our latest tech stories and trending tech stories. Until next time, don’t take the realities of the world for granted!