How do compilers work?

Sevcan Doğramacı
Skydome Academy
Published in
4 min readDec 23, 2021

Computers. As we interact with them every day mostly using dozens of software, we sometimes forget that they are working with hardware indeed. So, the thing that they just understand is whether electric is on/off, or 1/0.

But wait? How do they understand our program then? Let’s look 👀

In this post, we will talk about how computers understand our program written in a compiled programming language. Let’s say, we develop an application using a programming language which might be C, Go, Haskell, or others which we can understand easily (sometimes not that much easy though 😀). And let’s see how we start talking to computers mainly. The compiler takes this application code and converts it to assembly code. Finally, the assembler converts this assembly code to machine code so that the computers can execute them.

A high-level view of application code to machine code

If we learn application to machine code journey in a broad perspective, then be ready to dive more into what the compiler does to convert the application code to assembly code 🚀.

Here is the outline of compiler stages :

  1. Lexical Analysis (Tokenization)
  2. Syntax Analysis (Parsing)
  3. Semantic Analysis
  4. Intermediate Code Generation
  5. Code Optimization
  6. Code Generation

1. Lexical Analysis (Tokenization)

Example of tokens

The first step of compilation is lexical analysis or so-called tokenization. In this step, the lexical analyzer which is a part of the compiler scans the application code character by character and converts sequences of characters into tokens. A token is a classification of groups of characters. To convert the sequences of characters into tokens, each programming language has rules on which tokens are defined.

Let’s say, there are 3 rules in our programming language.

  • We want to use a if keyword for creating if conditions.
  • We want to use a ; symbol for ending a line.
  • We want to define variables only using small letters and underscore.
"if" { return IF; }
";" { return END_OF_LINE; }
[a-z_]+ { return IDENTIFIER; }

Lexical analyzer starts to read characters in code. If the sequences of characters read by the lexical analyzer match with one of the token rules defined, then it understands that it finds that token 🥳 and sends the token to the 2nd step. If the lexical analyzer cannot match the sequences with any rule 😡 then it generates an error.

2. Syntax Analysis (Parsing)

Each programming language has its own syntax. This syntax is defined by some rules which are called context-free grammar (CFG). These rules are defined using tokens in the lexical analysis step.

Example of CFG for variable definition

Lexical analyzer sends tokens to parsing as it finds a token. In this step, the parser tries to match tokens to a rule defined. To achieve this, it continuously checks if the tokens kept from the lexical analyzer conform to any rule.

If the tokens match with one of the rules, then it constructs a parse tree using these tokens. Else the tokens cannot form a valid syntax meaning that conforms with no rule 😡, then it throws a syntax error.

Parse tree visualization of x => INT;

3. Semantic Analysis

After constructing a parse tree of the code, semantic analysis is done. In this step, the compiler

  • checks for type mismatches,
  • checks for incompatible operands,
  • checks if a function called with improper arguments,
  • checks for an undeclared variable, etc. for ensuring that the application is semantically correct.

4. Intermediate Code Generation

After the semantic analysis is done, an intermediate code is generated. This is for making the conversion of the code to the machine-level code (assembly code) easier. Rather than directly transforming the code into machine-level code, the compiler generates an intermediate-level code. Typically, these intermediate codes have two important features which make them preferable:

  • they are easier to optimize than machine-level code,
  • they are easy to convert to machine-level code.

5. Code Optimization

The intermediate code just generated previously is used in this step for optimization. For code optimization, there are a couple of tasks done so that the code performance might be increased in terms of execution speed and resource usage. For example, in this step,

  • unnecessary code lines are removed,
  • unreachable code is removed,
  • unused variables are removed,
  • the sequence of statements is rearranged.

6. Code Generation

Finally, the intermediate code that is just optimized in the previous step is converted into the machine-level code which is an assembly code.

After this step, the compiler finishes its work and now, it is the assembler’s responsibility to convert the assembly code into machine code which consists of only 1s and 0s.

This was the end of our journey in how a compiler works, see u in the next episodes 🧐 🚗

References

--

--