The Magic Black Box of GCC Explained

Ever programmed in C? Then you must know GCC, the magic three letters that transform C source code into a program that can be executed. But what exactly is going on when you send your source code through it?

First, let’s get the name straight. GCC stands for GNU Compiler Collection. It was first developed by Richard Stallman, a free software advocate and founder of the GNU Project. Known as the GNU C Compiler at the time, GCC has since grown to handle other languages such as C++, Fortran, and Ada (and, for a number of years, Java), and it was renamed the GNU Compiler Collection to reflect the multitude of languages it supports.

There are four stages in the GCC compilation process: preprocessor, compiler, assembler, and linker. Essentially, GCC takes the source code, which is written in human-readable form, and breaks it down into binary machine code that the computer can execute. The process is not unlike our digestive system: a hamburger is chewed into small pieces in our mouth, then broken down into smaller molecules as it passes through our stomach and intestines (so that the body is able to absorb it), until it, uh, gets merged with all the other food processed the same way and forms a cohesive (ideally) output from our rear end.

I hope the above analogy gives you a good idea of what GCC does. If not, don’t worry. We will expand on what each component does in more detail in the following sections.

Stage 1: Preprocessor

In this first stage, GCC takes the source code (usually saved as a .c file) and processes it by removing comments, including header files (e.g. <stdio.h>, <stdlib.h>), and replacing macro names with the code they stand for (as defined with #define). Below is an example of a source file named “main.c,” and what it looks like before and after going through the preprocessor:

Left: Source Code. Right: the bottom few lines from the preprocessed file

We can run a file through just the preprocessor stage of gcc by typing the command gcc -E filename into the shell. The -E flag tells gcc to stop after preprocessing instead of passing the result on to the next stage. By default the preprocessed text is printed to standard output, so it is usually redirected to a file. For example, gcc -E main.c > c takes the source file named “main.c” and redirects the preprocessed output to a file named c.

Stage 2: Compiler

After the file is preprocessed, gcc passes it to the compiler. The compiler translates the preprocessed code into assembly language: human-readable mnemonics that correspond closely, almost one-to-one, to the machine instructions the computer executes. We can run a file through the preprocessor and compiler stages, and no further, with the command gcc -S filename (for example, gcc -S main.c).

The -S flag stops GCC after the compiler stage instead of passing the file on to the next one. The compiled file keeps the same base name but with the “.s” extension (main.c becomes main.s), and it can be opened in any text editor to see the assembly language inside.

Stage 3: Assembler

Now that the file has been compiled into assembly language, the assembler translates it into object code: binary machine instructions that the computer can process. An object file is not yet a runnable program, because references to code defined elsewhere (in other files or in libraries) are still unresolved. We can use the command gcc -c filename to stop gcc after this stage instead of moving on to the linker.

The file produced by the assembler keeps the same base name but has the “.o” extension (main.c becomes main.o).

Stage 4: Linker

As the last step of the compilation process, the linker serves two purposes:

  1. It can take multiple object files (one for each C file) and merge them into one program
  2. If the code calls functions from a library, the linker finds them and connects them to the output file

The linker then outputs an executable file that is ready to be run. If no name is given, gcc calls it “a.out”; the -o option lets us pick a name instead. Now that you understand the steps involved in GCC compilation, remember to take a moment to appreciate the complexity of the process the next time you type the command gcc filename -o output_filename to compile your C code. It ain’t as simple as it seems!