The Magic Behind Compilers - Part 1

Hinduja Balasubramaniyam
Published in The Startup
7 min read · Jan 13, 2021

We all know the struggle of not understanding a new language. Have you ever wondered how a computer understands its programs? Our computers are not as skilled with languages as we are. They only comprehend binary code.

But in reality, we use a handful of programming languages to communicate with computers, like Java, Python, C#, Ballerina… It’s a never-ending list. So, something must translate these languages into a form the computer can understand. Thankfully, we have compilers to the rescue here. So let us take a closer look at how compilers accomplish this task.

What is a compiler?

A compiler is a program that translates source code into a target language that is understandable and executable by a computer. Basically, the source language is the higher-level one that a developer uses to program, and the target language can be assembly or machine language.

How do compilers work?

With various source languages having different syntax and semantics, we can clearly see that compiling is not done in one go. Though the compilers of these languages do the compilation in their own way, every compiler sticks to the following common structure.

Let’s briefly look at what these phases do.

  • Scanning & Lexical Analysis: Reading the source code character by character and converting the text into a set of known character sequences (tokens).
  • Syntax Analysis: Translating and validating the tokens according to a collection of language-specific grammar rules.
  • Semantic Analysis: Validating whether the translated statements are semantically meaningful.

At the end of the front-end operations, we get an Intermediate Representation (IR) of the program that is passed to the compiler back-end.

  • Code Generation: Translating the Intermediate Representation into assembly-level code that is executable on a machine.
  • Code Optimization: Simplifying complex code in order to increase performance and efficiency in terms of speed and size.

Yes, your thoughts are correct. This is the real magic behind compilers. To gain a better understanding of these phases, we need to dive deep into the workings of each operation mentioned earlier.

Let us focus on the compiler front-end at first.

Scanning

Just as we mentioned earlier, scanning is the process of reading the characters in the source code one by one and forming known tokens using the lexical rules.

The lexical rules of a language indicate how a legal word (token) is formed by concatenating characters, along with the list of allowed words.

For example, the following code snippet contains tokens such as public, function, main, ( and so on. These tokens are listed in the lexical rules by the language designers.

public function main() {}

The scanner performs the following operations iteratively until it finds a valid token.

  1. Reading a character from the source code.
  2. Storing the character temporarily in the token buffer.
  3. Checking the lexical rules to see whether the character sequence in the buffer is a valid token.
  4. If it’s not valid -> Continuing to read the next character.
  5. If it’s valid -> Returning the token to the parser and clearing the buffer.
Tokenization of “public” keyword
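The loop above can be sketched in Python. This is a simplified model under a few stated assumptions: the keyword and symbol sets are illustrative (not from any real language), and real scanners use longest-match rules and handle identifiers, literals and comments too.

```python
# Minimal sketch of the scanning loop described above.
# KEYWORDS and SYMBOLS stand in for the language's lexical rules.
KEYWORDS = {"public", "function", "main"}
SYMBOLS = {"(", ")", "{", "}"}

def next_token(source, pos):
    """Read characters from `pos`, buffering until a valid token forms."""
    buffer = ""
    while pos < len(source):
        ch = source[pos]                 # step 1: read a character
        pos += 1
        if ch.isspace():                 # skip separators between tokens
            continue
        buffer += ch                     # step 2: store in the token buffer
        if buffer in KEYWORDS or buffer in SYMBOLS:  # step 3: check rules
            return buffer, pos           # step 5: valid -> return the token
        # step 4: not valid yet -> continue reading
    return None, pos                     # end of input

def scan(source):
    tokens, pos = [], 0
    while True:
        tok, pos = next_token(source, pos)
        if tok is None:
            return tokens
        tokens.append(tok)

print(scan("public function main() {}"))
# ['public', 'function', 'main', '(', ')', '{', '}']
```

Note that the buffer is cleared implicitly here by starting each `next_token` call fresh, mirroring step 5.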

After the tokenization that happens in the scanner, we have a set of tokens ready for the next phase, the parser. So now we enter another important part of the compiler front-end.

Parser : The Syntax Analysis

The parser is the key component of a compiler front-end, carrying out the following major operations.

  1. Demanding the scanner to produce the next token.
  2. Validating the tokens according to the grammatical rules (syntax) of the language.
  3. Controlling the overall operation of the front-end.

Every language has a predefined syntax: a collection of rules that determine the constructs of the language. The parser demands a token from the scanner and validates the received token against the language syntax.

  • If it finds a rule that matches the acquired token, it goes on to demand the next token from the scanner.
  • If no suitable rule is found for the token, there is a syntax error in the source code, so the compilation stops with an error.
  • If it finds more than one suitable rule: this is a special case that indicates ambiguity in the language grammar. (Oops… this means more work for the designers!)

As this is already complicated enough, let’s go through an example. The following shows a grammar rule that defines a function in a language.

<access-modifier> function <function-name> (<arguments>) {   
<statements>
}

This means a function definition in this new language must start with an access modifier (public/private etc.), then the function keyword, the identifier of the function, an open parenthesis ( and so on.

Suppose we compile source code that contains the following.

public function test() {}

The parsing of this can be represented as below.

Parsing of first 3 tokens in the example

What if we don’t have the function keyword after the public modifier in the code? Well… The rule says that an access modifier must be followed by the function keyword, and in this example there is no rule that says otherwise. So the parser would report a syntax error. In general, we can have multiple rules, and the parser goes through all of them until it finds a match.
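The check above can be sketched as a tiny recursive-descent-style routine. Everything here is hypothetical (the modifier set, the rule shape); it only illustrates how a parser matches tokens against the function-definition rule and stops at the first mismatch.

```python
# Sketch of parsing: <access-modifier> function <function-name> ( ) { }
ACCESS_MODIFIERS = {"public", "private"}   # illustrative set

def parse_function(tokens):
    """Validate a token list against the function-definition rule above."""
    it = iter(tokens)

    def expect(pred, what):
        tok = next(it, None)
        if tok is None or not pred(tok):
            raise SyntaxError(f"expected {what}, got {tok!r}")
        return tok

    expect(lambda t: t in ACCESS_MODIFIERS, "an access modifier")
    expect(lambda t: t == "function", "the 'function' keyword")
    name = expect(str.isidentifier, "a function name")
    for sym in ("(", ")", "{", "}"):
        expect(lambda t, s=sym: t == s, f"'{sym}'")
    return name

print(parse_function(["public", "function", "test", "(", ")", "{", "}"]))
# test
```

If the function keyword is missing (e.g. tokens start with public, test), the `expect` call raises a `SyntaxError`, which is exactly the "no suitable rule found" case.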

So, that explains most of the compiler front-end part. But, the story does not end here.

Semantic Analysis

Semantic analysis is the phase that examines the code and gives it meaning. In this phase, the code is analysed back and forth in order to find any meaningless context.

To fully understand the semantic actions, we need to know the difference between syntax and semantic errors. See the following code snippet.

int x;
int y;
int z;
// example 1
x = y +;
// example 2
z = x + m;

This shows some variable declarations and operations.

If we look at the statements, we can see that ; is used as a statement terminator and + is some kind of operator. As per the syntax of most languages (Java, Ballerina, C etc.), an operator cannot be immediately followed by a terminator. Therefore, example 1 clearly contains a syntax error.

In example 2, the syntax seems to be correct: a + operation is performed between two variables and the result is assigned to another variable. But if we look closely at the entire code snippet, we don’t find a declaration for the variable m. This indicates a semantic error, because the statement is meaningless without knowing what m is. We also realize another thing here: we need the entire context to identify this semantic error.

So semantic analysis is about finding these kinds of context-related errors in the source code. We need a more intelligent approach for this analysis, as it is more complex, context-sensitive and computationally expensive.
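The undefined-variable check from example 2 can be sketched like this. The representation is deliberately toy-sized: statements are pre-tokenized into (target, operand names) pairs, and the set of declared names stands in for a symbol table.

```python
# Sketch: detecting use of an undeclared variable, as in `z = x + m;`.
def find_undeclared(declared, statements):
    """statements: list of (target, operand_names). Returns undeclared names."""
    errors = []
    for target, operands in statements:
        if target not in declared:
            errors.append(target)      # assigning to an unknown variable
        for name in operands:
            if name not in declared:
                errors.append(name)    # reading an unknown variable
    return errors

declared = {"x", "y", "z"}             # from the int declarations
statements = [("x", ["y"]),            # x = y ... (syntax aside)
              ("z", ["x", "m"])]       # z = x + m
print(find_undeclared(declared, statements))
# ['m']
```

Note how the check needs the whole set of declarations, not just the single statement: that is the "entire context" point made above.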

How is this analysis done?

To make a long story short, it’s done with the help of a set of operations called Semantic Actions, an in-memory stack called the Semantic Stack and a special structure that stores information called the Symbol Table.

Symbol Table

This is a table structure that contains metadata about the elements used in the source code. For example, for the variable declaration int x = 5;, the symbol table is populated with the name, type and scope of that variable (in this example, "x", int and local).
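As a toy illustration, a symbol table entry for that declaration could be modeled as a dictionary; the field names here are illustrative, not a standard layout.

```python
# A toy symbol table entry for `int x = 5;` (fields are illustrative).
symbol_table = {
    "x": {"type": "int", "scope": "local", "value": 5},
}

entry = symbol_table["x"]
print(entry["type"], entry["scope"])
# int local
```

Real compilers typically use one such table per scope (or chained tables) so that lookups respect nesting.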

Semantic Actions

These are operations performed on the code by manipulating the symbol table. They include:

  • Binding : Accessing and updating symbol table
  • Type checking
  • Scope checking
  • Context processing : Translating the code context into a more elaborated representation with the help of temporary values.

These operations are carried out on a memory structure called the Semantic Stack. (Is the process still not clear? No worries… We’ll look at these in more detail in future sections.)
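Binding and type checking over a semantic stack can be sketched as follows. The typing rule here is an assumption made for the example (int + int gives int, everything else is an error), as are the variable types in the toy symbol table.

```python
# Sketch of a type-checking semantic action driven by a semantic stack.
symbols = {"x": "int", "y": "int", "s": "string"}   # toy symbol table
stack = []                                          # the semantic stack

def push_var(name):
    """Binding: look up the variable's type and push it on the stack."""
    stack.append(symbols[name])

def apply_plus():
    """Type checking for `+`: pop two operand types, push the result type."""
    right, left = stack.pop(), stack.pop()
    if left == right == "int":
        stack.append("int")             # int + int -> int
    else:
        raise TypeError(f"cannot add {left} and {right}")

push_var("x")
push_var("y")
apply_plus()
print(stack)
# ['int']
```

Trying the same with `x + s` would raise a `TypeError`, which is the semantic analogue of the parser's syntax error.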

So, at the end of this smart Semantic Analysis, we are gifted with a beautiful Intermediate Representation that allows the compilation to continue.

These operations of the compiler front-end are all controlled by the parser itself. Even though we covered the phases sequentially, they do not happen one after another at compile time. The parser demands tokens from the scanner and does the syntax analysis, and it invokes certain semantic actions at the required places during parsing.

So, that’s the happy ending of our story about the compiler front-end. Hope you enjoyed it! We’ll see the other parts of the compiler in my next posts. Thank you.


Software Engineer at WSO2, BSc. (Hons.) in Information Technology, University of Moratuwa.