How is a ‘Hello World’ C Program compiled in (MinGW) GCC? <Part 1-Preprocessing>

EvilSapphire
9 min read · Feb 19, 2020

Originally I intended to write this article in one part, but it turned out to be really long, as I like to explain as much as I know about any topic I’m discussing, so I broke it into multiple parts. This part mostly discusses the procedure of Pre-processing. The actual meat of the article, and the reason I started writing the entire thing in the first place, is the discussion of how the Linker works in combination with MinGW GCC object files, which is in the 2nd part. So skip to that if that’s what you’re looking for. :)

For the last couple of weeks I had been trying to understand how the simplest C Source Code you start your programming journey with:

//main.c
void main()
{
printf("Hello World");
}

is run by your computer to achieve its majestic goal: printing the string “Hello World” on the console. Now, if you’re anything like me, the day you dipped your toe in the water of the programming world you were instructed to use an IDE, probably Visual Studio (or, for poor souls like me, Turbo C inside the DOSBox emulator in the twenty-first century!). It let you write your code inside a white text area and, once satisfied that you’d done a reasonably good job, hit the ‘Build and Run’ button, which would do some magic in the background and show you the output (if you were lucky enough) in the area designated as the ‘Console Output’. And if you weren’t, it would show you some cryptic message which 90% of the time would be solved by putting that elusive semicolon in its proper place.

This makes the procedure that goes on under the hood opaque to the programmer, which is a blessing when you are building a practical project and can’t be bothered with the many idiosyncrasies of the pre-processors, compilers and linkers working in the background. However, understanding how it all works together makes you better at your craft and honestly gives you a better appreciation for this immensely complex machine that makes life in the twenty-first century possible.

Of course, a quick search on the internet would give you a sea of information on how the pre-processing + compiling + linking + loading process works in detail. My attempt in this article will be to custom Pre-process/compile/link the small program ‘main.c’ above with the MinGW toolchain (the GCC port for Windows) and in the process give you a very beginner-friendly, illustrated example of the compilation/linking procedure by revealing the inner workings of the generated intermediate files. In short, instead of relying on GCC to do all the work in the background, we’re going to build ‘main.exe’ out of ‘main.c’ by hand ourselves.

Now, if you have experience outside of an IDE, you’ve probably worked with GCC (the GNU Compiler Collection, originally the GNU C Compiler) to turn your ‘main.c’ file into an Executable with the following command:

gcc main.c -o main.exe

main.exe is an ELF (Executable and Linkable Format) or PE (Portable Executable) format file depending on your OS (Linux or Windows respectively; here it will be PE, as we are using the MinGW toolchain to compile/link the file on a Windows machine). It basically contains machine code (the binary form of Assembly Code) that is specific to your CPU Architecture’s Instruction Set (which, if you’re working on a Laptop/PC, is most likely x86-64), and information on how this code is to be loaded into Main Memory (RAM) when you actually execute the file. The job of loading the ELF/PE file into Main Memory at runtime belongs to the loader, which is a different beast in itself. This article will try to explain how the ‘main.exe’ binary is generated upon invoking the gcc command.

Here’s a high level diagram of the compiling/linking procedure that is abundant on the Internet.

The first hint of machine code, the binary form of Assembly Language that is directly understood by the CPU, is in the Object file (main.o, *.o, *.obj, *.lib). The Pre-processing step before that reads any Header file included (with the #include directive) into the Source Code and, following the directives in the included Header File, expands the macros, brings in function prototype declarations, and inserts different lines of code into the Pre-processed output depending on which Pre-processor Symbols are defined. The Pre-processor symbols can be supplied by the Programmer when they compile the program. For example, consider the following ‘example.c’ file:

//example.c
#include "example.h"
void main()
{
int p=0;
func(p);
}

With the header file ‘example.h’ as:

//example.h
#ifdef intReturn
int func(int );
#else
void func(int );
#endif

The Pre-processor checks whether the intReturn Pre-processor symbol is defined and accordingly declares the function func() to have either an int return type or a void return type. The programmer can define the Pre-processor symbol in gcc with the -D flag during compilation. We can compile the example.c file with the command gcc --save-temps -c example.c -D intReturn, which defines the symbol intReturn before compilation starts (the --save-temps flag is supplied to save the Pre-processed output file ‘example.i’). The resulting Pre-processed output is:

//example.i
# 1 "example.c"
# 1 "<built-in>"
# 1 "<command-line>"
# 1 "example.c"
# 1 "example.h" 1
int func(int );
# 2 "example.c" 2
void main()
{
int p=0;
func(p);
}

If instead the Pre-processing is done with intReturn not defined, using the command gcc --save-temps -c example.c, the resulting Pre-processed output is:

//example.i
# 1 "example.c"
# 1 "<built-in>"
# 1 "<command-line>"
# 1 "example.c"
# 1 "example.h" 1
void func(int );
# 2 "example.c" 2
void main()
{
int p=0;
func(p);
}

As you can see the Pre-processing stage provides the Programmer with control over the Pre-processor symbols which can be leveraged in a variety of ways to generate different output and subsequently different object files/libraries with the same header file included in their Codebase.

However, in our small ‘main.c’ file the only work the Pre-processor needs to do is to declare the proper prototype for the printf() function, which is normally done by including the ‘stdio.h’ header file with the directive #include <stdio.h>. We will not include the entire header file, for the sake of custom Pre-processing, and this also serves to explain how exactly the prototype of printf() is declared.

But first, why is declaring a function prototype before calling the function even important? This is because the Compiler ‘reads’ the Source Code from top to bottom, and it is critical for the Compiler to know a few things before it actually generates the assembly code of a function calling another function. The Compiler needs to know beforehand what return value to expect from the Callee function so that it is collected properly by the Caller, and what type of parameters are passed to the Callee and HOW they are passed.

This is because at runtime (while the executable is executing), parameters are passed to a ‘Callee function’ by a ‘Caller function’ either via a specific region of the process memory (in Main Memory, while the process is executing) known as the ‘Stack’, or via pre-defined CPU Registers, and the return value (for integer and pointer values on x86/x64) is passed back to the Caller function via the CPU Register EAX/RAX. To make everything work correctly, the assembly code of the ‘Caller function’ and the assembly code of the ‘Callee function’ must follow the same protocol (formally known as a Calling Convention); otherwise, after a function call takes place, there may remain rogue values on the Stack or in the Registers that could cause Faults in the instructions following the function call.

Now let us scan through the ‘stdio.h’ header file inside the include folder of the GCC installation and find the relevant function prototype declaration:

_CRTIMP __cdecl __MINGW_NOTHROW  int printf (const char *, ...);

Before this prototype is explained, however, a quick explanation of the Linker is necessary. The Linker, as the name suggests, links multiple Object files together to produce the final ‘main.exe’ binary/executable. Now, why is this needed? If you read the entirety of the ‘main.c’ Source Code, you find that there is only a call to the printf() function, but obviously without a definition of the function (basically what the function should actually do), the call is useless. So, where is the function printf() defined? The answer is the C Runtime Library, which is the backbone of all your Programs coded in C. C Runtime Libraries (CRTs) contain machine code that they export via ‘Symbols’ (like printf), and applications call the different functions in the CRT through these symbols.

CRTs differ depending on the OS/CPU architecture you use. The CPU architecture defines the actual Instruction Set that the machine code of an Object File/Library needs to be written in, and different OSes are implemented differently (even when running on the same CPU), which is to say the C Runtime Library needs to be different to cater to whatever OS/CPU architecture you are working on. The CRTs that the Visual Studio C Compiler (cl.exe) uses on Windows are ‘MSVCR100.DLL’, ‘MSVCR110.DLL’ etc. (DLL standing for Dynamic Link Library), depending on the version of Visual Studio. However, the MinGW GCC Compiler on Windows still uses the old MSVCRT.DLL, so for this custom compiling/linking of our small ‘main.c’ program, MSVCRT.DLL is what we are going to use.

Now that the job of the Linker has been explained, the different components in the declaration of the printf() prototype can be easily explained. A quick search on Google reveals that _CRTIMP is actually another Pre-processor symbol defined as:

#ifndef _CRTIMP
#ifdef _DLL
#define _CRTIMP __declspec(dllimport)
#else /* _DLL */
#define _CRTIMP
#endif /* _DLL */
#endif /* _CRTIMP */

Which means if the _DLL Pre-processor symbol is defined, _CRTIMP resolves to __declspec(dllimport); otherwise it expands to nothing. __declspec(dllimport) is something called a ‘Storage-class Specifier’ that tells the compiler that the function printf() is defined in an external DLL (like msvcrt.dll). However, including this symbol in the prototype only does some time-optimization in the generated assembly, as it saves one instruction. (For anybody curious to dig deeper, my findings on this were that including __declspec(dllimport) helps the generated PE binary (main.exe) call printf directly via its address in the Import Address Table, by moving the address into EAX and calling it, MOV EAX,[->MSVCRT.DLL::printf] followed by CALL EAX, instead of calling printf via an ‘indirect JMP’ instruction in the jump table; not too sure why.) Some test runs of MinGW gcc also revealed that during regular compilation with the command gcc main.c -o main.exe, the Pre-processor symbol _DLL is not defined by any header file anyway, so in our custom compilation/linking we can omit this symbol.

The symbol __cdecl defines what we discussed earlier: the Calling Convention for printf. There are different Calling Conventions, like STDCALL, CDECL and FASTCALL. Discussing the intricacies of Calling Conventions is out of scope for this article, so suffice it to repeat what was explained earlier: the Compiler needs to be told to generate assembly code for this call to printf() that follows the CDECL Calling Convention, as that is the convention that the code in the definition of printf() in the CRT msvcrt.dll follows. __MINGW_NOTHROW similarly resolves to the GCC function attribute __attribute__((__nothrow__)), which tells the compiler that the function printf() throws no exceptions.

So, the prototype that we declare in our ‘main.c’ file is:

//main.c
__cdecl __attribute__((__nothrow__))  int printf(const char *, ...);
void main()
{
printf("Hello World");
}

Now that the necessary Pre-processor output is included in the Source Code, we can compile this file to an Object File that contains the corresponding machine code. The -c switch tells GCC to only generate the Object File without attempting any kind of linking.

gcc -c main.c -o main.o

The discussion of the generated object files and the Linking of multiple object files will be continued in Part 2 (link). If this series ends up helping even a single beginner find their feet in the initially confounding world of understanding how programming/Userland applications work under the hood, I would consider my effort worth it! For now, ciao.

EvilSapphire

A guy who likes to fiddle with OS Userland internals. Professionally coming from a Network Security background, looking to switch to Reverse Engineering.