The Compilation Process

Sibasish Ghosh
Published in Coding Den
6 min read · May 8, 2018

Many of us write code and run it. But do all of us (well, mostly the newbies, I guess :P) know the underlying process that converts source code into an executable program? Not many, I guess. Well, this article is for all those not-many who are looking for the secret ;)

Image Source — worldcomputerarticle

“Programmers write programs in a form called source code. Source code must go through several steps before it becomes an executable program. The first step is to pass the source code through a compiler, which translates the high-level language instructions into object code. The linker combines modules and gives real values to all symbolic addresses, thereby producing machine code.”

The above lines summarize the entire process of compilation. So basically, you have a program in the form of source code, which must go through compilation and linking to become an executable program. Running that executable is what finally produces the output.

So, if there is a “compilation” process, there must be a tool that performs it. There is, and it is called a compiler. Now let us see what exactly a compiler is.

What is a compiler?

Image Source — CraftingInterpreters

A compiler is a special program that processes statements written in a particular programming language and turns them into machine language or “code” that a computer’s processor uses. Typically, a programmer writes language statements in a language, such as Pascal or C, one line at a time using an editor. The file that is created contains what are called the source statements. The programmer then runs the appropriate language compiler, specifying the name of the file that contains the source statements.

When executing (running), the compiler first parses (or analyzes) all of the language statements syntactically one after the other and then, in one or more successive stages or “passes”, builds the output code, making sure that statements that refer to other statements are referred to correctly in the final code. Traditionally, the output of the compilation has been called object code or sometimes an object module. The object code is machine code that the processor can execute one instruction at a time.

Traditionally in some operating systems, an additional step was required after compilation — that of resolving the relative location of instructions and data when more than one object module was to be run at the same time and they cross-referred to each other’s instruction sequences or data. This process was sometimes called linkage editing and the output known as a load module.

A compiler works with what are sometimes called 3GL and higher-level languages. An assembler works on programs written using a processor’s assembler language.

(Source -WhatIs)

What about those languages that do not compile?

Some programming languages are usually not compiled ahead of time. So can programs written in them still be executed? Yes, they can: they are run with the help of interpreters.

In computer science, an interpreter is a computer program that directly executes, i.e. performs, instructions written in a programming or scripting language, without requiring them previously to have been compiled into a machine language program. An interpreter generally uses one of the following strategies for program execution:

1. Parse the source code and perform its behavior directly;

2. Translate source code into some efficient intermediate representation and immediately execute this;

3. Explicitly execute stored precompiled code made by a compiler which is part of the interpreter system.

Early versions of Lisp programming language and Dartmouth BASIC would be examples of the first type. Perl, Python, MATLAB, and Ruby are examples of the second, while UCSD Pascal is an example of the third type.
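To make the first strategy a bit more concrete, here is a minimal sketch of an interpreter that parses integer arithmetic and performs it directly, with no machine code and no intermediate form. The `TinyInterpreter` class, its little grammar and the `demoTinyInterpreter` helper are all made up for this illustration; no real language works exactly like this.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical toy interpreter: evaluates input such as "2 + 3 * (4 - 1)"
// while it is being parsed.
class TinyInterpreter {
public:
    explicit TinyInterpreter(std::string source) : src_(std::move(source)) {}

    // Evaluates the whole input and rejects anything left over at the end.
    long run() {
        pos_ = 0;
        long value = expression();
        skipSpaces();
        if (pos_ != src_.size()) throw std::runtime_error("unexpected character");
        return value;
    }

private:
    // expression := term (('+' | '-') term)*
    long expression() {
        long value = term();
        for (;;) {
            skipSpaces();
            if (match('+')) value += term();
            else if (match('-')) value -= term();
            else return value;
        }
    }

    // term := factor (('*' | '/') factor)*
    long term() {
        long value = factor();
        for (;;) {
            skipSpaces();
            if (match('*')) {
                value *= factor();
            } else if (match('/')) {
                long divisor = factor();
                if (divisor == 0) throw std::runtime_error("division by zero");
                value /= divisor;
            } else {
                return value;
            }
        }
    }

    // factor := number | '(' expression ')'
    long factor() {
        skipSpaces();
        if (match('(')) {
            long value = expression();
            skipSpaces();
            if (!match(')')) throw std::runtime_error("expected ')'");
            return value;
        }
        if (!atDigit()) throw std::runtime_error("expected a number");
        long value = 0;
        while (atDigit()) value = value * 10 + (src_[pos_++] - '0');
        return value;
    }

    bool atDigit() const {
        return pos_ < src_.size() &&
               std::isdigit(static_cast<unsigned char>(src_[pos_])) != 0;
    }

    bool match(char c) {
        if (pos_ < src_.size() && src_[pos_] == c) { ++pos_; return true; }
        return false;
    }

    void skipSpaces() {
        while (pos_ < src_.size() &&
               std::isspace(static_cast<unsigned char>(src_[pos_])) != 0) ++pos_;
    }

    std::string src_;
    std::size_t pos_ = 0;
};

// Example use: the "program" is executed the moment it is read.
void demoTinyInterpreter() {
    assert(TinyInterpreter("2 + 3 * (4 - 1)").run() == 11);
}
```

Notice that nothing is saved between runs: every call to `run()` reads the text again from the start, which is one reason a purely direct interpreter tends to be slower than compiled code.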

(Source — Wikipedia)

Alright, I’m done with most of the theory part. To understand how this actually works, let’s take the help of a famous programming language — C++.

How is C++ code compiled?

The compilation of a C++ program involves three steps:

1. Preprocessing: the preprocessor takes a C++ source code file and deals with the #include’s, #define’s and other preprocessor directives. The output of this step is a "pure" C++ file without pre-processor directives.

2. Compilation: the compiler takes the pre-processor’s output and produces an object file from it.

3. Linking: the linker takes the object files produced by the compiler and produces either a library or an executable file.
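Here is a small sketch of how the three steps map onto one source file. It assumes the GCC driver `g++` (the `-E` and `-c` options shown in the comments belong to GCC; other compilers have their own). The file names `greet.cpp` and `main.o` and the helper `demoGreet` are hypothetical.

```cpp
// greet.cpp (hypothetical file name)
//
// Assuming g++, each stage can be run on its own:
//   g++ -E greet.cpp -o greet.ii     (1) preprocess only
//   g++ -c greet.cpp -o greet.o      (2) compile to an object file, no linking
//   g++ greet.o main.o -o app        (3) link; main.o is a hypothetical second
//                                        object file that provides main()
#include <cassert>
#include <string>

// Step 1 replaces every use of GREETING with the text "Hello".
#define GREETING "Hello"

// Step 2 turns this function into machine code inside greet.o.
std::string greet(const std::string& name) {
    return std::string(GREETING) + ", " + name + "!";
}

// Step 3 connects calls like the one below to the code they refer to.
void demoGreet() {
    assert(greet("world") == "Hello, world!");
}
```

Running `g++ greet.cpp main.cpp -o app` does all three steps in one go; the separate commands are only there to make each stage visible.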

1. Preprocessing

The preprocessor handles the preprocessor directives, like #include and #define. It is agnostic of the syntax of C++, which is why it must be used with care.

It works on one C++ source file at a time by replacing #include directives with the content of the respective files (which is usually just declarations), doing replacement of macros (#define), and selecting different portions of text depending on #if, #ifdef and #ifndef directives.

The preprocessor works on a stream of preprocessing tokens. Macro substitution is defined as replacing tokens with other tokens (the operator ## enables merging two tokens when it makes sense).
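A short sketch of these ideas follows. The macro names (`BAD_SQUARE`, `SQUARE`, `MAKE_GETTER`, `USE_MACRO_VERSION`) and the functions are invented for the example. It also shows why "agnostic of the syntax of C++" means you have to be careful: the preprocessor pastes tokens and knows nothing about operator precedence.

```cpp
#include <cassert>

// Function-like macros are plain token substitution, so parentheses matter.
#define BAD_SQUARE(x) x * x
#define SQUARE(x) ((x) * (x))

static_assert(BAD_SQUARE(1 + 2) == 5, "expands to 1 + 2 * 1 + 2");
static_assert(SQUARE(1 + 2) == 9, "expands to ((1 + 2) * (1 + 2))");

// The ## operator merges two tokens into one: get_ and width become get_width.
#define MAKE_GETTER(name) \
    int get_##name() { return name; }

static int width = 4;
MAKE_GETTER(width)  // expands to: int get_width() { return width; }

// Conditional inclusion: only one branch ever reaches the compiler.
#define USE_MACRO_VERSION 1  // hypothetical switch for this sketch

int widthSquared() {
#if USE_MACRO_VERSION
    return SQUARE(get_width());
#else
    return get_width() * get_width();
#endif
}

void demoPreprocessor() {
    assert(get_width() == 4);
    assert(widthSquared() == 16);
}
```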

After all this, the preprocessor produces a single output that is a stream of tokens resulting from the transformations described above. It also adds some special markers that tell the compiler where each line came from so that it can use those to produce sensible error messages.

Some errors can be produced at this stage with clever use of the #if and #error directives.
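For instance, a build can be stopped early when a configuration value makes no sense. This is only a sketch: `APP_MAX_USERS`, `maxUsers` and `demoChecks` are hypothetical names, and the `-D` option mentioned in the comment is the GCC/Clang way of defining a macro from the command line.

```cpp
#include <cassert>
#include <climits>

// Hypothetical configuration value; in a real build it might come from a
// header or from a compiler option such as -DAPP_MAX_USERS=64.
#ifndef APP_MAX_USERS
#define APP_MAX_USERS 64
#endif

// Both checks run in the preprocessor. If one fails, the build stops with the
// given message before the compiler proper ever sees the code.
#if APP_MAX_USERS <= 0
#error "APP_MAX_USERS must be a positive number"
#endif

#if CHAR_BIT != 8
#error "This sketch assumes 8-bit bytes"
#endif

int maxUsers() { return APP_MAX_USERS; }

void demoChecks() {
    assert(maxUsers() > 0);
}
```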

2. Compilation

The compilation step is performed on each output of the preprocessor. The compiler parses the pure C++ source code (now without any preprocessor directives) and converts it into assembly code. It then invokes the underlying back-end (the assembler in the toolchain), which assembles that code into machine code and produces an actual binary file in some format (ELF, COFF, a.out, etc.). This object file contains the compiled code (in binary form) of the symbols defined in the input. Symbols in object files are referred to by name.

Object files can refer to symbols that are not defined. This is the case when you use a declaration, and don’t provide a definition for it. The compiler doesn’t mind this, and will happily produce the object file as long as the source code is well-formed.
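A minimal sketch of that situation is shown here. The names (`add`, `counter`, `addAndCount`, `demoDeclarations`, and the file `math.cpp` mentioned in the comments) are made up for illustration. The definitions sit at the bottom of the same file only so that the sketch builds on its own.

```cpp
#include <cassert>

// Declarations only: they give the compiler names and types, nothing more.
// Up to this point, `add` and `counter` are symbols without a definition.
int add(int a, int b);
extern int counter;

// This function compiles fine using nothing but the declarations above.
int addAndCount(int a, int b) {
    ++counter;          // refers to a variable defined elsewhere
    return add(a, b);   // calls a function defined elsewhere
}

// In a real project these definitions would normally live in another source
// file (say, a hypothetical math.cpp) and be matched up later by the linker.
int add(int a, int b) { return a + b; }
int counter = 0;

void demoDeclarations() {
    int before = counter;
    assert(addAndCount(2, 3) == 5);
    assert(counter == before + 1);
}
```

If the two definitions were moved to a separate file, this file would still compile into an object file by itself; the missing pieces would only be noticed at link time.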

Compilers usually let you stop compilation at this point. This is very useful because with it you can compile each source code file separately. The advantage this provides is that you don’t need to recompile everything if you only change a single file.

The produced object files can be put in special archives called static libraries, for easier reusing later on.

It’s at this stage that “regular” compiler errors, like syntax errors or failed overload resolution, are reported.

3. Linking

The linker is what produces the final compilation output from the object files that the compiler produced. This output can be either a shared (or dynamic) library (and while the name is similar, they haven’t got much in common with static libraries mentioned earlier) or an executable.

It links all the object files by replacing the references to undefined symbols with the correct addresses. Each of these symbols can be defined in other object files or in libraries. If they are defined in libraries other than the standard library, you need to tell the linker about them.

At this stage the most common errors are missing definitions or duplicate definitions. The former means that either the definitions don’t exist (i.e. they are not written), or that the object files or libraries where they reside were not given to the linker. The latter is obvious: the same symbol was defined in two different object files or libraries.
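Both kinds of error can be pictured with a small sketch. The file names in the comments (`shapes.h`, `shapes.cpp`) and all function names are hypothetical, and everything is squeezed into one block so it builds as a single file. The use of `inline` is one common way to avoid duplicate definitions, not the only one.

```cpp
#include <cassert>

// ---- shapes.h (hypothetical header, included by several .cpp files) ----

// Declaration only: safe to repeat in every file that includes the header.
int perimeter(int width, int height);

// A function *defined* in a header should be `inline`. Without it, every .cpp
// file that includes the header would produce its own `area` symbol, and the
// linker would report a duplicate definition.
inline int area(int width, int height) { return width * height; }

// ---- shapes.cpp (hypothetical source file) ----

// The one and only definition of `perimeter`. If the object file built from
// this code were not given to the linker, every call to `perimeter` would end
// in a missing definition error instead.
int perimeter(int width, int height) { return 2 * (width + height); }

void demoShapes() {
    assert(area(3, 4) == 12);
    assert(perimeter(3, 4) == 14);
}
```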

(Source — stackoverflow)

So the next time you run your code, remember that there is a hell of a lot going on underneath! Have a good day, amigos :)
