How is a ‘Hello World’ C Program compiled in (MinGW) GCC? <Part 2-Linking>

EvilSapphire
9 min read · Feb 19, 2020

This is the second part of a two-part article that attempts to explain how Pre-processing, Compilation and Linking work together to turn a simple ‘main.c’ C source file into the ‘main.exe’ executable binary. This part explains the Object Files that are linked together to produce the final executable. The first part, which mostly covers Pre-processing, can be found here; please give it a read first, for the sake of consistency and for a better understanding of this article.

As mentioned in the first part, my goal in this article is to custom compile/link the small program ‘main.c’ shown below with the MinGW toolchain (the GCC implementation on Windows) and, in the process, give you a very beginner-friendly, illustrated example of the compilation/linking procedure by revealing the inner workings of the generated intermediate files. In short, instead of relying on GCC to do all the work in the background, we’re going to build ‘main.exe’ out of ‘main.c’ by hand.

The simple ‘main.c’ that we are trying to compile (the function prototype in the second line was manually included in the file in the first article, please refer to that for an explanation of the prototype):

//main.c
__cdecl __attribute__((__nothrow__)) int printf(const char *, ...);
void main()
{
    printf("Hello World");
}

In the last part, we generated the Object File for this Source Code with the following GCC command:

gcc -c main.c -o main.o

Object Files contain machine code that directly corresponds to the binary instructions your CPU understands.

To recap, let me briefly reiterate the purpose of linking. Linking is necessary because, in practice, one Object File may call a function defined in another Object File (a function definition is basically the code for what the function actually does). Take our own example ‘main.c’: it contains a call to the printf function, but printf is not defined in the same file. In fact, we can use Ghidra, an awesome Decompiler/Disassembler tool developed by the National Security Agency (NSA), to look at the generated Object File ‘main.o’ and see which functions it expects to import from other Object Files.

As you can see, in the _main() function (now converted to Assembly Language after compilation has finished), there is a call to printf() via the CALL _printf assembly instruction. main.o also adds this Symbol to its Symbol Table, which Ghidra displays inside the EXTERNAL Section in the picture below. This tells the linker to find the Symbol _printf in some other Object File at link time and link the two Object Files together.

So we must link ‘main.o’ with an Object File/Library that contains the definition of printf and exports it. Where do we go looking for it?

The C standard functions (like printf) are defined in something called the C Runtime Library (CRT), which comes bundled with any C Compiler that you download. In the case of the Microsoft Visual Studio C Compiler (cl.exe), the CRT is ‘MSVCR100.DLL’, ‘MSVCR110.DLL’ etc., depending on the version of Visual Studio that you have (DLL stands for Dynamic-Link Library). However, MinGW GCC still uses the old MSVCRT.DLL as its CRT, where all the C standard functions are implemented, so for our little exercise of custom linking, MSVCRT.DLL is what we are going to use (on 64-bit Windows, the 32-bit version of MSVCRT.DLL resides in C:\Windows\SysWOW64 and the 64-bit version in C:\Windows\System32). DLLs in Windows use the PE (Portable Executable) format, so we can use a PE format parser called PE Bear to see whether MSVCRT.DLL exports the printf function (the same information can, of course, be viewed in Ghidra too, but in this specific case I find PE Bear gives a much better view of the export table).

As you can see in the Export Table for MSVCRT.DLL, among other functions (the entire export table is too long to be shown here), printf is listed as a function that this DLL exports. So we can link ‘main.o’ with msvcrt.dll to resolve the external reference to printf, which is exactly what the MinGW GCC linker (ld.exe) does in the background.

However, if you look at the _main function in main.o in the first Ghidra Disassembly picture above, there is another instruction, CALL ___main, which issues a call to a function/symbol named ___main. ___main is also listed in the Symbol Table of ‘main.o’ in the EXTERNAL section viewed in Ghidra (which means this ___main Symbol also needs to be exported from a different Object File). Which Object File would you find this ___main symbol in? This is where the concept of the ‘main Startup Code’ kicks in. Whenever you compile a C source file containing main(), the first line of code in main is actually not the first instruction to be executed by the application. The first instruction to be executed by the application is called its ‘Entry Point’; the code starting there eventually calls main, and only then does the code that you’ve written in main get executed. Why do we need an Entry Point different from main? Well, think about the standard declaration of main:

int main(int argc, char **argv)

Right from the first line in main, you can use the arguments supplied to main, where argc is the number of Command Line Parameters supplied to the application plus one, and argv is an array of Strings containing the name of the application followed by the Command Line Parameters supplied to it. But of course, some code must make sure these parameters are fetched from the Command Line and passed as arguments in a call to main. This is one of the jobs performed by the code between the Entry Point and the call to main. That code also takes care of any variable initialization that is needed before the code in main gets executed, and it can do various other tasks.

When you compile a C source file to an executable using GCC, during linking GCC supplies its own main Startup Code and links it with the Object File produced from the source file that contains the programmer’s main function (main.o in our case) to produce the final executable. GCC also inserts a CALL ___main assembly instruction in main.o, which calls ___main, a function defined in the Startup Code that GCC supplies.

Since we are custom compiling, we can write our own simple Startup Code. It will define a ___main function that returns zero (to satisfy the call to ___main in main.o), export the Entry Point to the linker, and then call the _main function defined and exported by main.o. (I know there are a lot of ‘main’s to keep track of in this paragraph; a careful read should make it clear. Please post a comment if this creates any confusion, I would be happy to reply.) The picture underneath should explain the entire control flow of the final, linked executable.

The best way to write the Startup Code is directly in assembly language, because writing it in another C file inside a main() function would generate its own CALL ___main instruction when compiled. Of course, we could use the asm directive in C to write assembly directly in a C source file, but using an Assembler is the cleanest way to go.

Let us quickly whip up some assembly code in a file named maincaller.asm, which we will assemble using MASM (ml.exe). As discussed, maincaller.asm will do 3 things:

  1. Have a definition for __main to satisfy the CALL ___main instruction generated by GCC in the main.o Object File. __main will do nothing but return a zero value.
  2. Define the Startup function called myEntry and export its starting instruction via a Symbol to the linker to be used as the Entry Point.
  3. After the Entry Point, call the _main function in main.o.

Here’s the necessary compact assembly code which should be pretty much self explanatory. If not, as always post a comment.

; maincaller.asm
.386
.model flat, C
option casemap :none
EXTERN main :PROC
PUBLIC myEntry
PUBLIC __main
.code
__main proc
    xor eax,eax
    ret
__main endp
myEntry proc
    call main
    ret
myEntry endp
end

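The EXTERN directive tells MASM that the function main is defined elsewhere (which basically means instructing MASM to add main to the Object File's Symbol Table as an imported Symbol), and the PUBLIC directive tells MASM to export the Symbols myEntry and __main from the Object File to be generated. Because of the C language type in .model flat, C, MASM prefixes each of these names with an underscore, so main, __main and myEntry appear in the Object File as _main, ___main and _myEntry. That is why a function written as __main here satisfies the CALL ___main in main.o.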

This is assembled using MASM (ml.exe) with the command:

ml /c /coff maincaller.asm

The /c switch tells MASM to only assemble the file and generate an Object File without linking, and /coff forces it to generate the Object File in the COFF format (Common Object File Format). Since we are going to link this Object File with the Object File main.o generated by GCC, a standard format like COFF is the way to go. This produces a maincaller.obj file that can be linked with main.o.

We can, of course, take a quick peek at this maincaller.obj using Ghidra as before.

As we can see in the Disassembly window in the middle, the code in the maincaller.obj file is basically a line-by-line translation of the maincaller.asm file, since it was written in assembly to begin with. Our ___main and myEntry functions have been assembled to machine code, and the Symbol Tree view in the left pane shows that this Object File imports the Symbol _main (defined in main.o) and exports two Symbols: ___main (to be called by main.o) and _myEntry (which we will tell the linker to use as the Entry Point).

Finally, we can use GCC’s linker ld.exe to link these three files together (and set the entrypoint to _myEntry ) to produce the final Executable using the command:

ld maincaller.obj main.o "C:\Windows\SysWow64\msvcrt.dll" -e _myEntry -o main.exe

And voila! We have a main.exe executable which is properly linked and, upon running, produces the expected “Hello World” output.

(NOTE: Linking Object Files directly against a DLL like this is not how it is normally done. Normally you link against an import library (a .lib file) that describes the functions the DLL exports, rather than against the DLL itself. MinGW GNU BinUtils version 2.28 allows direct DLL linking, though, so for the sake of simplicity I went for it.)

We can view main.exe in PE Bear and that gives us a nice parsed view that the application imports only one dll which is msvcrt.dll from which it imports only one function which is printf.

We can also view the value of the Entry Point of the main.exe application in its Optional Header, which is 0x1003 (a Relative Virtual Address).

In the Disassembly of main.exe below, I highlighted the Entry Point, and if you study the disassembly you’ll see it is nothing but the code of maincaller.obj and main.o appended together with the Symbols in the disassembly view in Ghidra replaced with Addresses of the Symbols/Functions.

For anyone wondering, addresses 0x1000 to 0x100B belong to our maincaller.obj file. The function at 0x1000 is the ___main function returning 0, and 0x1003 is the myEntry function, with the Entry Point of the application set to it. The code from main.o begins at 0x100C, where the _main function starts; its CALL ___main instruction is at 0x1015 and its CALL to printf at 0x1021. Refer to the Control Flow I posted above.

That’s pretty much it! Tell me if you noticed any error that I need to correct. And if you found this article even a bit helpful in demystifying how linking multiple Object Files together works, I would really appreciate a clap or a comment, as that would let me know that over 6 hours of my life did not completely go to waste. See you in the next article. Sayonara!

EvilSapphire

A guy who likes to fiddle with OS Userland internals. Professionally coming from a Network Security background, looking to switch to Reverse Engineering.