5 Compilers Inlining Memcpy

May 9, 2024

I am going to compare the disassembly produced by 5 different C compilers (gcc, clang, zig cc, icx, and ccomp) for one simple C program, focusing on how they inline memcpy.

I will be looking at some specific assembly instructions, explaining what they do and why I think they are used. Each compiler takes a different approach to translating my simple program into executable assembly.

My qualifications to write this are basically nothing. It's true that I did work on some research into C code generation, but I was there as a Python dev, and at the time I didn't even know C (I still don't know FORTRAN and I have a paper published on it).

This is purely me playing around with things I think are neat. So if this is dead obvious to people who actually use assembly regularly, I am sorry.

First Test GCC

So what drove me to this in the first place? I was mindlessly looking at the NASM instruction list (as one does) and found “movsb” really interesting.

A quick recap for those who don't know what movsb does:

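It copies one byte from the address in the source pointer (rsi) to the address in the destination pointer (rdi) and then increments both pointers (assuming the direction flag is clear, which is the normal case).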

In C this would look something like this:

*rdi=*rsi;
rdi+=1;
rsi+=1;

My first thought was “DAMN that can be really useful for memcpy” so I wanted to check if gcc used it.

I wrote this basic test program:

#include <string.h>
#include <stdio.h>

char input[100]={"helo world"};

int main(){
 char output[100];
 memcpy(output,input,100);
 printf("%s\n",output);

}

and compiled it with -O3. It did not use that instruction; instead it did a bunch of SIMD operations. I was disappointed and a bit confused, so I tried changing it to memmove instead of memcpy, and it still preferred SIMD.

I then guessed that a single instruction would be very attractive at -Os, and I was not disappointed.

It used movsd (the string instruction, not the SSE instruction of the same name), which is similar to movsb but moves 4 bytes at a time instead of 1.

Here is the core loop that gcc compiled for both memcpy and memmove:

rep movsd

That’s it… 1 instruction… the rep prefix makes it loop by itself, decrementing the counter (rcx) until it hits zero. For context, a C loop that does the same would look something like this, with rdi and rsi being pointers to 4-byte values:

for(;rcx>0;rcx--){
  *rdi=*rsi;
  rdi+=1;
  rsi+=1;
}
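To make that loop concrete, here is a small sketch of what `rep movsd` does for the 100-byte copy. The function name `rep_movsd_sketch` is made up for illustration, the parameter names just mirror the registers, and it assumes the direction flag is clear so both pointers move forward. It is only a model of the behaviour, not what gcc emits.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of `rep movsd`: copy rcx 4-byte units from rsi to rdi.
   Assumes the direction flag is clear (pointers move forward). */
void rep_movsd_sketch(unsigned char *rdi, const unsigned char *rsi, size_t rcx) {
    for (; rcx > 0; rcx--) {
        uint32_t unit;
        memcpy(&unit, rsi, sizeof unit);  /* load 4 bytes from the source */
        memcpy(rdi, &unit, sizeof unit);  /* store them at the destination */
        rdi += 4;                         /* both pointers advance by 4 bytes */
        rsi += 4;
    }
}
```

For the test program, the call would be `rep_movsd_sketch(output, input, 100 / 4)` (with the char arrays cast to `unsigned char *`), so the counter starts at 25.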

It also turns out that for my code with -Os and my gcc (11.4.0) there is no difference between memcpy and memmove: they both compile to the exact same assembly.

Clang and Icx

Okay cool gcc does this trick. What about the other compilers?
So clang does something similar but it uses:

rep movsq es:[rdi], [rsi]

A slightly different instruction, but it does basically the same thing. The only difference is the size of each element: movsq copies 8 bytes per iteration while movsd copies 4. That means a shorter loop (12 iterations instead of 25), but it also needs to deal with a remainder, since 100 doesn't divide cleanly by 8 (12 × 8 = 96, leaving 4 bytes).
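The arithmetic behind the 8-byte version can be sketched in C as follows. The helper name `copy_qwords_sketch` is hypothetical, and the way the leftover bytes are handled here (one small trailing copy) is an assumption for illustration; the exact instructions clang and icx use for the tail are not shown in this post.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical model of `rep movsq` plus a tail copy.
   Returns the number of leftover bytes that did not fit in 8-byte units. */
size_t copy_qwords_sketch(unsigned char *dst, const unsigned char *src, size_t len) {
    size_t rcx  = len / 8;   /* 100 / 8 = 12 iterations for `rep movsq` */
    size_t tail = len % 8;   /* 100 % 8 = 4 bytes left over */

    for (; rcx > 0; rcx--) {
        memcpy(dst, src, 8); /* one 8-byte move */
        dst += 8;
        src += 8;
    }
    memcpy(dst, src, tail);  /* the remainder, handled separately */
    return tail;
}
```

Calling it with `len = 100` runs the loop 12 times and returns 4, which is the remainder the compilers have to deal with outside the `rep` instruction.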

Clang's output also spells out the source and destination pointers, while gcc's leaves them implicit. This is just a syntax difference in how the assembly is printed: the instruction always uses rsi and rdi.

Icx does basically the same thing as clang. Funnily enough, you need to tell it to output Intel-syntax assembly. I thought it would do that by default, but I guess since gcc and clang won't, it decided it won't either.

Zig

So yes, zig does ship with a C compiler. Zig cc is quite good and it's widely used because of how easy it makes cross compilation. I used it to compile a school project for Windows and it was fairly nice.

Zig did some SIMD stuff, similar to what I saw with gcc on -O3. It mainly used vmovups, an instruction that just moves a chunk of data (32 bytes at a time here) without requiring it to be aligned.

More specifically, it's nominally aimed at floats: the name says it moves packed single-precision values, and it loads data to and from vector registers like ymm0. So zig figured “hey no one is using these here and it's all just bytes so I may as well put chars in”.

Very clever! It makes code with no loops that is still fairly short. You just need 3 of these bad boys (3 × 32 = 96 bytes), plus the same remainder trick we saw clang and icx do for the last 4 bytes.
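The shape of that loop-free copy can be sketched in portable C. The function name `copy_100_unrolled_sketch` and the constants are made up for illustration, and each fixed-size `memcpy` here merely stands in for one 32-byte vector load/store pair; whether a given compiler actually lowers it to vmovups depends on the target and flags.

```c
#include <string.h>

enum { BLOCK = 32, TOTAL = 100 };  /* 32 bytes = the width of one ymm register */

/* Hypothetical loop-free copy of exactly 100 bytes: three 32-byte blocks
   followed by a 4-byte tail. */
void copy_100_unrolled_sketch(unsigned char *dst, const unsigned char *src) {
    memcpy(dst,             src,             BLOCK);              /* bytes  0..31 */
    memcpy(dst + BLOCK,     src + BLOCK,     BLOCK);              /* bytes 32..63 */
    memcpy(dst + 2 * BLOCK, src + 2 * BLOCK, BLOCK);              /* bytes 64..95 */
    memcpy(dst + 3 * BLOCK, src + 3 * BLOCK, TOTAL - 3 * BLOCK);  /* last 4 bytes */
}
```

The point of the sketch is only the layout: no counter and no branch, just four straight-line moves.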

gcc on -O3 uses movaps, which is basically the same idea as vmovups, but it moves an aligned 16-byte piece of data through an xmm register (not a general purpose one). So gcc needed to fiddle around with the alignment, which took too much space.

Zig actually does not change much with -O3. The only difference in the code is that it moves one of the pointers to rdi first. I have no idea why, but it seems to think that makes things faster.

ccomp

Lastly ccomp… a formally verified compiler built with Coq as the verifier and gcc as the linker. In theory a cool idea, but I will be honest: this is my least favorite of the bunch.

Ccomp acted up a bit. It did give out assembly like all the rest, but actually compiling and running code took some effort. Their website says they use gcc for linking, so I just asked it to give me the assembly and fed that to gcc.

Even the assembly was super lazy and just did

call memcpy

like… wow guys very original great compiler… the worst part is that memcpy was compiled and linked with gcc. So what it's really doing is saying “hey gcc please do this for me”.

This is actually what gcc does if you don't include the headers it needs for memcpy. It did write its own setup code so at least there is that.

You may think that not inlining is a deliberate choice since we asked for fewer instructions. NO, it does this on -O3 as well, so this is a skill issue on their part.

Still for a compiler that is focused on predictability it makes sense to do exactly what you asked for. So if you really care about your compiler being predictable to the point it can be proven, go with ccomp.

That being said, ccomp is NOT the most security oriented compiler we looked at. That honor actually goes to gcc, because it is extremely paranoid about hardware oriented attacks.

Gcc being weird

Gcc does some extra security things the rest of the pack doesn’t. Specifically “endbr64”, which stops some vulnerabilities I am not smart enough to know about, and “call __stack_chk_fail@PLT”, which is about stack/buffer overflows.

Frankly they both seem like overkill. Our 100 bytes are not going to crash any 64-bit system, and all the other C compilers agree with me here. Maybe something else overflows? That shouldn't really be an issue considering we only depend on glibc.

I don't know enough about security to say if we need endbr64, but given that it is an Intel hardware feature you would think icx would know about it and use it if it's needed.

Then again icx is mainly aimed at being fast so it may be okay with dropping security for performance.

I can't really blame gcc for being extra careful. Considering it is used to compile an OPERATING SYSTEM, it should be as careful as it gets. And it does seem to deliver on that.

Sizes

So now the main question: “which did better?” Since we asked for a small binary (and since timing a 100-byte copy is futile), let's look at the sizes.

zig: 3656 bytes
ccomp: 16096 bytes
icx: 16104 bytes
gcc: 16152 bytes
clang: 16160 bytes

Zig is actually the most impressive here: it's both the smallest binary by a factor of more than 4 AND it uses SIMD and no loops.

The rest are pretty much the same size and all but ccomp use a similar strategy. Which is the same strategy I came up with when randomly browsing NASM’s instruction set.

Ccomp doesn’t have much to show for its refusal to inline, which is just sad. Still, for our criterion of making a small binary it did pretty well. I suppose sometimes being lazy pays off.

Clang making a larger binary than icx is definitely interesting since they both used the EXACT same technique. I did hear that icx took some code from clang back in the day, but I can't confirm this rumor since it's closed source. Clang does seem to be doing the worst in general.

Seems like the less established compilers actually did better on size. And since this is what we asked for that is saying a lot. Gcc and clang are actually at the bottom of the list.

Conclusion

I am fairly surprised by the results, both by how my prediction that an instruction like movsb could be useful turned out to be right, and by how different zig was from clang. I was really thinking that zig would end up similar to clang, since they both use LLVM and I believe zig uses a lot of the clang code for stuff like parsing.

Honestly my main conclusion is that zig is a very good c compiler and that it can give gcc/clang a run for their money.

Another thing that is really nice to see is just how rich the landscape of C compilers is, and that they are all actually different in how they implement things. Most of these have a C++ version as well (icx is actually mainly C++; I would have used icc but it doesn't ship with oneAPI), so this also applies to C++.

Credits

(Imagine this is rolling like in the movies)

Compilers:

gcc --version
gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

clang --version
Ubuntu clang version 14.0.0-1ubuntu1.1
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin

icx --version
Intel(R) oneAPI DPC++/C++ Compiler 2024.0.2 (2024.0.2.20231213)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/intel/oneapi/compiler/2024.0/bin/compiler
Configuration file: /opt/intel/oneapi/compiler/2024.0/bin/compiler/../icx.cfg

zig cc --version
clang version 17.0.6 (https://github.com/ziglang/zig-bootstrap 4c78aa1bba84dbd324e178932cd52221417f63da)
Target: x86_64-unknown-linux-musl
Thread model: posix
InstalledDir: /usr/bin

ccomp --version
The CompCert C verified compiler, version 3.14

Technical Help:

assembly and C subreddits for answering my dumb questions.
NASM reference for explaining the instruction sets
gpt4, claude3 and google for giving me answers on things that are not documented anywhere

Editorial Help:

gpt4 for going over this text almost as many times as I did
my dad (who is a software architect) for giving me tips on tone