Brian Seel
cylussec
5 min read · Jul 25, 2013


How Compilation Works: preprocessing, compilation, assembly and linking

Note: I found this article after I wrote this, and it does a much better job of explaining all of this. As I have said before, this blog is mostly an exercise in forcing myself to learn things at a deeper level. As such, this is a half-baked explanation. Don’t read mine. Read theirs.

I am going to take a break from looking at shell code to dive into the compilation process. Most of the time, compilation just works and it doesn’t matter what’s going on under the hood… there is just a magical gnome that takes your C++ code and gives you something that will actually run. But when it doesn’t work, it usually REALLY doesn’t work. And that’s when you need to understand the process. So let’s dive in.

According to the gcc man page, “Compilation can involve up to four stages: preprocessing, compilation proper, assembly and linking, always in that order.” So let’s break it down like that. I will also outline the input and output file names for each stage, because for any given input file, the file name suffix determines what kind of compilation is done. We will use the following example files.

cube.h

#ifndef CUBE_H
#define CUBE_H
#define PI 3.14159
class Cube{
public:
Cube();
~Cube();
void setSide(double s);
double getSide();
double Area();
double Volume();
void Properties();
private:
double Side;
};
#endif

cube.cpp

#include <iostream>
#include "cube.h"
using std::cout;
Cube::Cube(){}
Cube::~Cube(){}
void Cube::setSide(double s){
Side = s <= 0 ? 1 : s;
}

double Cube::getSide(){
return Side;
}

double Cube::Area(){
return 6 * Side * Side;
}

double Cube::Volume(){
return Side * Side * Side;
}

void Cube::Properties(){
cout << "Characteristics of this cube";
cout << "\nSide = " << getSide();
cout << "\nArea = " << Area();
cout << "\nVolume = " << Volume() << "\n\n";
}

main.cpp

#include "cube.h"
int main(){
Cube cube;
cube.setSide(PI);
cube.Properties();
//this is a comment
Cube de;
de.setSide(28.15);
de.Properties();
return 1;
}

Preprocessing

Expected Input
filename.c: C code that should be preprocessed
filename.cpp: C++ code that should be preprocessed
filename.i: C code that should not be preprocessed
filename.ii: C++ code that should not be preprocessed
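To make the suffix rule a bit more concrete, here is a minimal sketch in C++ of how a suffix picks the stage a file starts at. The names Stage and firstStageFor are made up for illustration and are not part of gcc. The .s rule and the "anything else goes to the linker" rule come from the gcc man page rather than from the list above, and other suffixes gcc knows about are left out.

```cpp
#include <string>

// The four stages, in the order the gcc man page lists them.
enum class Stage { Preprocessing, CompilationProper, Assembly, Linking };

// Hypothetical helper (not part of gcc): returns the first stage that would
// run for a file, judged only by its suffix. Only a few suffixes are covered.
Stage firstStageFor(const std::string& filename) {
    const std::string::size_type dot = filename.rfind('.');
    const std::string suffix =
        dot == std::string::npos ? "" : filename.substr(dot);

    if (suffix == ".c" || suffix == ".cpp") return Stage::Preprocessing;
    if (suffix == ".i" || suffix == ".ii") return Stage::CompilationProper;
    if (suffix == ".s") return Stage::Assembly;
    // Anything unrecognized (such as a .o object file) goes to the linker.
    return Stage::Linking;
}
```

So firstStageFor("main.cpp") gives Stage::Preprocessing, while firstStageFor("main.ii") skips straight to Stage::CompilationProper.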

According to the g++ man page, running g++ -D prog=1 -E main.cpp will stop the compilation process after preprocessing, so what do we get?

# 1 "main.cpp" //Says that this section is from main.cpp. Debugger hint.
# 1 "<command-line>" //Quick break to reference the command line defines [1]
# 1 "main.cpp" //We are back to processing main.cpp
# 1 "cube.h" 1 //First line in main.cpp is the #include, so this section is cube.h. [2]


class Cube{ //This section is where the preprocessor basically pastes cube.h
public: //directly in. That's all an #include is.
Cube();
~Cube();
void setSide(double s);
double getSide();
double Area();
double Volume();
void Properties();
private:
double Side;
};
# 2 "main.cpp" 2 //Line 2 is back to main.cpp. This is a debugger hint. [2]
int main(){
Cube cube;
cube.setSide(3.14159); //PI was #define'd as this value so the preprocessor replaces that value
cube.Properties();
//Our comment was here, so it is replaced with whitespace.
Cube de;
de.setSide(28.15);
de.Properties();
return 1;
}

What happened?

  • [1] Command line defines can be passed by using the -D flag with g++. “But nothing shows up there,” you say. Well, notice that we did a #define of PI as well, and that doesn’t show up anywhere either. The job of the preprocessor is to make those substitutions, so you will notice those macros replaced in the code, but not defined at the top (look at where PI was replaced on line 4 of main.cpp to see this). Since nothing in our code uses prog, it never appears in the output.
  • [2] After the file name in one of these debugger hints (the gcc docs call them linemarkers), there can be flags that give more information about that file. In our example, we see that twice, with # 1 "cube.h" 1 and # 2 "main.cpp" 2. The trailing number is a flag with one of the following meanings:
      • '1': This indicates the start of a new file.
      • '2': This indicates returning to a file (after having included another file).
      • '3': This indicates that the following text comes from a system header file, so certain warnings should be suppressed.
      • '4': This indicates that the following text should be treated as being wrapped in an implicit extern "C" block.

For more information, see the gcc docs at http://gcc.gnu.org/onlinedocs/gcc-4.6.2/cpp/Preprocessor-Output.html
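Here is a small sketch of the substitutions described above, written so it can be run through g++ -E by itself. PI and prog are the names used in this post. The fallback value of 0 for prog and the three function names are my own assumptions for the sake of the example.

```cpp
// Fallback so this also compiles without -D prog=1 on the command line.
// The value 0 is an assumption for this sketch.
#ifndef prog
#define prog 0
#endif

#define PI 3.14159

// After preprocessing, this body is just `return 3.14159;`.
double piValue() { return PI; }

// After preprocessing, `prog` is replaced by whatever -D gave it,
// or by the fallback 0 if -D was not used.
int progValue() { return prog; }

// Whole blocks can be kept or dropped before the compiler proper sees them.
bool builtWithNonZeroProg() {
#if prog
    return true;
#else
    return false;
#endif
}
```

Running this through the preprocessor with and without -D prog=1 should show the body of builtWithNonZeroProg flipping between the two return statements, with no trace of the #if left behind.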

Compilation Proper

Compilation proper is the stage where the preprocessed source code is converted to assembly. Back in college, we were given this as an assignment, with the idea basically being that you take a piece of code and figure out how to do it in assembly. So something like

int i = 1;
i +=1;

would translate (roughly) to

xor eax, eax //clear eax
mov eax, 1 //set eax to 1
add eax, 1 //add 1 to eax

Compilers do this same exercise after preprocessing. We saw that the output of preprocessing is basically a source file with all of its included headers pasted in, so the compilation proper stage converts that to assembly. I won’t output the whole cube example here because of its length, but the compiler will basically turn each line of C++ into assembly (and one line of C++ can translate into multiple lines of assembly). Note that compilers are able to optimize code to reduce its size or make it faster, but we are going to keep it simple here.

If you are following along, run gcc (or g++) with the -S flag to stop after compilation proper and get the human-readable assembly (a .s file). To go one step further and get the assembled object file (a .o file), stopping before the linking process, use the -c flag instead.
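If you want something small to try -S on, here is the two-line snippet from above wrapped in a function (the name incrementExample is made up). The assembly in the comments is only a rough guess in the same spirit as the listing above. An unoptimized build will usually keep i on the stack instead of in a register, and an optimizing build will likely just return the constant 2.

```cpp
// The snippet from the text, wrapped in a function so it can be compiled
// by itself and inspected with -S.
int incrementExample() {
    int i = 1;   // roughly: mov eax, 1
    i += 1;      // roughly: add eax, 1
    return i;    // on x86 the int result is typically handed back in eax
}
```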

There are ways that the compiler can optimize (such as using a bitwise shift left, since adding 1 to an i of 1 essentially doubles it). However, all this step does is translate C++ to assembly. What if we are calling a function like MessageBox, which is defined in an external library? And the assembly we have so far doesn’t do anything relating to setting up the process or tearing it down.
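As a quick sanity check on the shift idea, here is a sketch showing that doubling by addition and doubling by a one-bit left shift give the same answer. The function names are made up, and I am using unsigned so that overflow wraps around instead of being undefined. Whether a given compiler actually picks the shift is up to the compiler.

```cpp
// Doubling written the obvious way.
unsigned doubleByAdd(unsigned i) { return i + i; }

// The same result via a one-bit left shift, which an optimizer may choose.
unsigned doubleByShift(unsigned i) { return i << 1; }
```

Both functions return 2 for an input of 1, which is the i from the example above.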

Assembly

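The third part is the assembler. It takes the assembly produced by compilation proper (the .s file) and translates it into machine code, stored in an object file (the .o file). Running gcc with the -c flag stops after this stage.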

Linking

Linking takes care of those issues: calls to functions that are defined somewhere else, and the code that sets up and tears down the process. In fact, because we have main.cpp and cube.cpp, we are going to have to link the object files that each of those produces. Basically, everything that we have done so far has treated main.cpp and cube.cpp as separate files. But for this to work, they are going to have to be combined into a single binary. Linking does that.
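To see why the compiler can get away with treating the files separately, here is a single-file sketch of the declaration/definition split (cubeVolume and volumeOfTwoCubes are made-up names, not from the cube example). The compiler only needs the declaration to compile a call. If the definition lived in a different .cpp file, matching the call up with the actual code would be the linker's job.

```cpp
// Declaration only: enough for the compiler to compile calls to this
// function, much like cube.h lets main.cpp call Cube's member functions.
double cubeVolume(double side);

// This compiles without the compiler having seen the body of cubeVolume.
double volumeOfTwoCubes(double side) {
    return cubeVolume(side) + cubeVolume(side);
}

// The definition. With separate files, this would sit in its own .cpp file
// and the linker would connect the calls above to this code.
double cubeVolume(double side) {
    return side * side * side;
}
```

If you move the definition into a second file and only compile the first one with -c, it should still build fine. The error about an undefined reference only shows up when you try to link without the second object file.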

Source code from http://www.functionx.com/cpp/examples/simpleclass.htm
