C Source Code Compilation is Its Own Type of Magic!

Derek Kwok
Jan 18, 2018

Image taken from: https://www3.ntu.edu.sg/home/ehchua/programming/cpp/gcc_make.html

When we type a command like gcc -Wall file.c -o file, it can seem like transfiguration magic has occurred when we look into our directory again and see a new executable file. The deceptively simple action of compiling a file is actually A LOT more complicated than it seems. This blog post explains what happens during compilation, using C source code and the gcc compiler as the reference.

The gcc Compiler

In a nutshell, compilation is the process of converting source code written in a programming language into binary (1’s and 0’s) machine code that the computer’s central processing unit (CPU) can execute. Every function or line of code you type means nothing to the CPU until it has been translated into machine code. To do this, the source code needs to be run through a program called a compiler. For C, there are many compilers that vary in complexity, but one of the most widely used is GCC, which originally stood for GNU C Compiler and now stands for GNU Compiler Collection. It ships with, or is easily installed on, most Linux distributions and many other Unix-like systems. The compiler is where all the magic happens.

Overview of The Compilation Process

The compilation process actually involves passing through four different “machines” where each step translates some aspect of the code and gets it closer and closer to its final binary code. These “machines” are as follows:

  1. Preprocessor
  2. Compiler
  3. Assembler
  4. Linker

Below, I have created a very simple C program that prints out “Hello, World!”. This human-readable text is the source code, and the file is called hello_world.c.

When I type the command gcc -Wall hello_world.c -o hello_world, I get an executable file called hello_world alongside my source code file. When run in the terminal, it prints out what I told it to:

So let me start over and try this: gcc -Wall -save-temps hello_world.c -o hello_world . Here are the contents of my folder now:

There are now three more files in this folder than there were before: hello_world.i, hello_world.s, and hello_world.o. The -save-temps option tells gcc to keep the intermediate files produced between the steps of the compilation process instead of deleting them.

So what actually happens when the source code passes through each “machine”? Let’s dive in!

Step 1 — The Preprocessor

This is “Machine 1” that the source code is run through. The primary steps that this “machine” takes care of include:

  • Comment removal

In programming, it is always good practice to comment parts of source code so that it is easier for humans to read and understand. The compiler does not need these comments, so this first step removes them (each comment is replaced by a single space). In C, comments are written between /* and */, like the /*This is a comment*/ text I had in my program above. Since C99, // also starts a comment that runs to the end of the line.

  • Macro expansion

In C, macros are names that act like aliases for pieces of code. For example, say we have the following directive: #define PI 3.1415926535. With this definition, I can use the word PI in the rest of my code any time I want the actual number. The preprocessor replaces every later use of the identifier PI with 3.1415926535, although it leaves alone any PI that appears inside a string literal or as part of a longer name. In my code, I included a macro #define HELLOWORLD "Hello, World!\n" which puts the quoted string in place of HELLOWORLD in the printf call.

  • Header file expansion

In C, there are certain files called header files that must be included in programs in order to use predefined functions, variables, structures, data types, and other things. Just like we can type int age; to declare an integer variable called age, these header files contain such declarations written in C so we do not have to waste space or hurt readability by typing them ourselves. Many header files come with the C standard library, and you can also create your own. In my hello_world.c file, the first line #include <stdio.h> includes the header file stdio.h, which declares useful functions like printf along with related types and macros so they can be used in the program. The functions themselves are defined in the library, not in the header. When the preprocessor reads this line, it literally pastes the entire contents of stdio.h in its place.

Thus, after the preprocessor step, the output is put into a file called hello_world.i . In the gif below, I am scrolling through the whole file to show you that the header file stdio.h was physically inserted in place of the command #include <stdio.h>. Additionally, in the last frame of this gif, you can see that "Hello, World!\n" was indeed placed in the printf and that last comment was removed!
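To make the three preprocessor jobs concrete, here is a small sketch in the same spirit as hello_world.c. The function names circle_area and say_hello are made up for illustration and are not part of the original program. Running gcc -E on a file like this prints the preprocessed text, which is the same kind of content that lands in the .i file.

```c
#include <stdio.h>              /* header expansion: replaced by the contents of stdio.h */

#define PI 3.1415926535         /* macro: every later PI is swapped for this number */
#define HELLOWORLD "Hello, World!\n"

/* This comment is removed by the preprocessor. */

/* After preprocessing, the body reads: return 3.1415926535 * r * r; */
double circle_area(double r)
{
    return PI * r * r;
}

/* After preprocessing, the call reads: printf("Hello, World!\n"); */
int say_hello(void)
{
    return printf(HELLOWORLD);  /* printf returns the number of characters written */
}
```

Note that the preprocessor works purely on text: it does not know that PI is a number or that HELLOWORLD is a string, it just substitutes one piece of text for another.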

Step 2 — The Compiler

Now the real converting and transforming begins. In “Machine 2”, the preprocessed source code from the previous step is passed through the compiler proper (cc1), which converts it into Assembly Code that “Machine 3”, called the Assembler, will understand.

So what exactly is assembly code? Assembly language is a low-level, human-readable notation for the instructions the CPU executes. It is low-level because each line mostly corresponds to exactly one CPU instruction. Typically, assembly code can still be somewhat understood by humans since it uses things like mnemonics, operands, and comments to break a task up into simple steps. Thus, the ‘Assembly’ in the name makes sense since the process of assembling something often involves performing a series of simple actions. Think about an assembly line and how each part of the assembly line is in charge of one specific part of the final product. One of the downsides of assembly code is that it is not portable at all. It is tied to a specific processor architecture, so assembly code written for one architecture (say, x86-64) has to be rewritten, which is very difficult to automate, before it can run on another (say, ARM).
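As a rough illustration of “one line, one CPU action” and of why assembly is not portable, here is a hypothetical pair of functions you could feed to gcc -S. The names add_then_double and target_arch are my own inventions. __x86_64__ and __aarch64__ are macros that GCC predefines depending on the target architecture, and the comments about which instructions appear are typical expectations, not guarantees.

```c
/* Each C statement below usually becomes only a handful of instructions. */
int add_then_double(int a, int b)
{
    int sum = a + b;    /* typically a single add instruction */
    return sum * 2;     /* often a shift or another add rather than a multiply */
}

/* The generated .s file differs per architecture. This function reports
   which one the compiler is targeting, using compiler-predefined macros. */
const char *target_arch(void)
{
#if defined(__x86_64__)
    return "x86-64";
#elif defined(__aarch64__)
    return "AArch64";
#else
    return "some other architecture";
#endif
}
```

The C source stays the same on every machine. Only the assembly that the compiler emits for it changes, which is exactly the portability that assembly itself lacks.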

The assembly code from the example can be seen in the following which is the hello_world.s file:

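You can sort of see where the program is told to print “Hello, World!”. You can even see the main: label at Line 6! This assembly code is made up of three main kinds of lines:

  1. Labels: shown in blue above, these end with : . For example, .LC0:, main:, .LFB0:, .LFE0:.
  2. Directives: these begin with . and give instructions to the assembler itself. For example, .string, .size, and .file.
  3. Instructions: the actual CPU instructions, each written as a mnemonic followed by its operands. In x86-64 output like this, the mnemonics are typically things like pushq, movq, and call, while %rsp, %rbp, %eax, and $0 are operands (registers and constants). By contrast, main, @function is the argument of a .type directive, not an instruction.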

This code is then passed to an Assembler.

Step 3 — The Assembler

“Machine 3” is called the Assembler (as), and its primary job is to convert the assembly code passed to it from “Machine 2” into binary machine code that the CPU can execute. The result is the hello_world.o file, and the code contained within is called Object Code. This process is a very tedious, nearly one-to-one mapping of the assembly language statements above to their encodings in machine language (binary). The only things left unresolved are the addresses of external functions like printf(), which are recorded as undefined symbols and filled in by “Machine 4”, the Linker, which links the code to libraries. The following images show the Object Code as opened in emacs and as dumped in hexadecimal 2-byte units using the command od -x hello_world.o.
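Here is a minimal sketch of what “left unresolved” means. The function name greet is hypothetical. The file only declares puts (a standard C library function) without defining it, so compiling it with gcc -c produces an object file in which puts is an undefined symbol. Running nm on that object file should list puts with a U next to it.

```c
/* Declaration only: this tells the compiler how to call puts,
   but the function's machine code lives in the C library. */
int puts(const char *s);

/* The call below is assembled without knowing where puts lives.
   The linker fills in the real location later. */
int greet(void)
{
    return puts("Hello, World!");   /* non-negative on success */
}
```

This is also all that #include <stdio.h> really gives you for such functions: declarations. The definitions arrive at link time.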

Step 4 — The Linker

Remember in the previous step when I mentioned that calls to external functions, which live in the system and C run-time libraries, were left unresolved by the Assembler? Now the linker resolves them by matching each undefined symbol against the symbols defined in the standard library and any other libraries being linked. By default, gcc uses dynamic linking: the executable only records which shared libraries it needs, and those libraries are loaded into memory and connected to the program when it is run. This is why some applications require you to install certain packages first, so that the shared libraries and run-time environments they depend on are present on your system. The application is dynamically linked and will call into these pre-installed libraries at run time.

The goal of the linker is to resolve all remaining symbol references and create a final executable file. On Linux, this executable file is in a format called ELF (Executable and Linking Format), which is also the format of the .o object file. The cool thing is, in the first Object Code image above, even though you cannot understand anything, the letters ELF can actually be seen in the first line! Even though this is not how the computer actually reads the file, it is still cool to see little traces of things going on in the process!

The Linker also has another job: adding standard start-up and run-time routines, which are fairly complex pieces of code that let the program start, exit, and communicate properly with the operating system. Some of these tasks include setting up the run-time environment, passing command-line arguments to main, and handing main's return value back as the exit code. Additionally, if there are multiple files that contain Object Code, the linker combines them. The whole process itself is complicated!
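Those visible letters are part of a four-byte signature at the very start of every ELF file: the byte 0x7F followed by the characters E, L, and F. Here is a minimal sketch of checking for it, with function names has_elf_magic and file_is_elf that I made up for this example:

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Returns 1 if the buffer begins with the ELF signature, 0 otherwise. */
int has_elf_magic(const unsigned char *buf, size_t len)
{
    static const unsigned char magic[4] = { 0x7F, 'E', 'L', 'F' };
    return len >= sizeof magic && memcmp(buf, magic, sizeof magic) == 0;
}

/* Reads the first four bytes of a file and checks them.
   Returns 1 for ELF, 0 for anything else, -1 if the file cannot be opened. */
int file_is_elf(const char *path)
{
    unsigned char head[4];
    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
        return -1;
    size_t n = fread(head, 1, sizeof head, fp);
    fclose(fp);
    return has_elf_magic(head, n);
}
```

On Linux, calling file_is_elf("hello_world.o") and file_is_elf("hello_world") should both return 1, since object files and executables share the format, while file_is_elf("hello_world.c") should return 0.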

Derek Kwok

BS in Materials Engineering from the University of Illinois. Software Engineering Student at Holberton School. Foreign language enthusiast. Musician. Engineer.