Introduction to Compiler Construction: Part 1

From Source Language to Intermediate Representation

Patience Shyu
Mar 3, 2017

What is a compiler?

A compiler is a translator that turns a program written in a source language into an equivalent program in a target language. Usually that target language is executable, which means that a machine can run it.

A translator? That sounds kind of like an interpreter, doesn’t it? Well, an interpreter gives you the result of running a program. A compiler does not run the source program.

In this post, I will discuss the front-end of the compiler (the first 5 or so stages of compilation). In real-world compilers, these stages are sometimes combined, but in general all of these steps need to happen, roughly in this order.

1. Lexical Analysis

String of ASCII characters -> String of Lexical Tokens

This stage is all about transforming our program into tokens. A token is just a sequence of characters that corresponds to a unit in the grammar of our language. For example, these are some valid tokens in many popular programming languages:

  • 15 (INT), -0.6 (FLOAT), 6.022e23
  • “compilers rock” (STRING)
  • foo, x (IDENTIFIER)
  • +, &&, <
  • ; (SEMICOLON)
  • return (RETURN)

As you can see, besides our typical String and Number literals, we also have punctuation tokens like ; and words like return, if, void, etc. that cannot be used as identifiers (variable names): they are reserved words (keywords).

We also have things that get ignored during this process, and we call these nontokens! Examples include:

  • // This is a single-line comment
  • /* This is a multi-line comment */
  • #define ARRAY_SIZE 10 (preprocessor directives)
  • whitespace, tabs, newlines (careful with this one, Python for example is white-space sensitive)

At the end of the lexing stage,

public static void main(String[] args) {
    System.out.println("Hi Mom!");
}

might turn into something like this:

PUBLIC STATIC VOID MAIN LPAREN STRING LBRACKET RBRACKET ID(args) RPAREN LBRACE ID(System) DOT ID(out) DOT ID(println) LPAREN STRING(Hi Mom!) RPAREN SEMICOLON RBRACE 
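To make this concrete, here is a minimal sketch (in Python) of how such a lexer could be written. The token names follow the output above, but the regular expressions, the KEYWORDS set, and the tokenize function are hypothetical and only illustrate the idea; a production lexer would also report unexpected characters and handle /* ... */ comments.

import re

# Reserved words get their own token types; every other word becomes an ID.
KEYWORDS = {"public", "static", "void", "main", "String", "return"}

# Each token type is tried in order; the first pattern that matches wins.
TOKEN_SPEC = [
    ("FLOAT",     r"-?\d+\.\d+(?:[eE][+-]?\d+)?"),
    ("INT",       r"-?\d+"),
    ("STRING",    r'"[^"]*"'),
    ("ID",        r"[A-Za-z_][A-Za-z0-9_]*"),
    ("LPAREN",    r"\("), ("RPAREN",   r"\)"),
    ("LBRACKET",  r"\["), ("RBRACKET", r"\]"),
    ("LBRACE",    r"\{"), ("RBRACE",   r"\}"),
    ("SEMICOLON", r";"),  ("DOT",      r"\."),
    ("SKIP",      r"\s+|//[^\n]*"),   # whitespace and comments are non-tokens
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(source):
    """Turn a string of characters into a list of (token_type, text) pairs."""
    tokens = []
    for match in MASTER.finditer(source):
        kind, text = match.lastgroup, match.group()
        if kind == "SKIP":
            continue                      # drop the non-tokens entirely
        if kind == "ID" and text in KEYWORDS:
            kind = text.upper()           # reserved words become their own token type
        tokens.append((kind, text))
    return tokens

print(tokenize('public static void main(String[] args) { System.out.println("Hi Mom!"); }'))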

2. Parsing

String of Lexical Tokens -> Abstract Syntax Tree (AST)

The goal of this stage is to check that the structure of these tokens is legal and follows the rules of the language’s grammar (syntactic analysis). While we’re doing that, since a flat list of tokens isn’t really very useful (and hard to read to boot), we are going to turn it into a tree.

I am not going to go into the nitty-gritty bits of writing unambiguous grammars or leftmost vs rightmost derivation here. To read more about that, check out the resources section at the bottom of this article.

Let’s do an example.

i = 3 + (4 * 17)^2;

After lexing, we’ll have

ID(i) EQUALS INT(3) PLUS LPAREN INT(4) TIMES INT(17) RPAREN EXPONENT INT(2) SEMICOLON

And after parsing,

                EQUALS
               /      \
          ID(i)        PLUS
                      /    \
                 INT(3)     EXPONENT
                            /      \
                       TIMES        INT(2)
                      /     \
                INT(4)       INT(17)

Notice how the BEDMAS order of operations is captured in this tree (the bracketed multiplication happens before the exponent, which happens before the addition). The parentheses and the semicolon are no longer needed and have been removed.
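To make the parsing step concrete, here is a minimal sketch of a precedence-climbing expression parser that builds this kind of tree out of plain tuples. The PRECEDENCE table, the Parser class, and the tuple node shapes are hypothetical choices for illustration, not any particular compiler’s API; note that it complains about exactly the kind of mismatched-parenthesis problem listed below.

# Binding power of each operator; a higher number binds more tightly.
# EXPONENT is right-associative, the other operators are left-associative.
PRECEDENCE = {"PLUS": 1, "TIMES": 2, "EXPONENT": 3}
RIGHT_ASSOC = {"EXPONENT"}

class Parser:
    """Builds AST nodes as plain tuples, e.g. ("PLUS", left, right)."""

    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0

    def peek(self):
        return self.tokens[self.pos][0] if self.pos < len(self.tokens) else "EOF"

    def advance(self):
        token = self.tokens[self.pos]
        self.pos += 1
        return token

    def parse_atom(self):
        kind, text = self.advance()
        if kind == "LPAREN":                      # parenthesised sub-expression
            node = self.parse_expression(0)
            if self.advance()[0] != "RPAREN":
                raise SyntaxError("mismatched or missing parenthesis")
            return node
        if kind in ("INT", "ID"):
            return (kind, text)
        raise SyntaxError(f"unexpected token {kind}")

    def parse_expression(self, min_precedence):
        left = self.parse_atom()
        while self.peek() in PRECEDENCE and PRECEDENCE[self.peek()] >= min_precedence:
            op = self.advance()[0]
            # Right-associative operators recurse at the same precedence level,
            # left-associative ones at one level higher.
            next_min = PRECEDENCE[op] + (0 if op in RIGHT_ASSOC else 1)
            left = (op, left, self.parse_expression(next_min))
        return left

# The right-hand side of  i = 3 + (4 * 17)^2;  (the assignment and the
# semicolon would be handled by statement-level parsing).
tokens = [("INT", "3"), ("PLUS", "+"), ("LPAREN", "("), ("INT", "4"),
          ("TIMES", "*"), ("INT", "17"), ("RPAREN", ")"),
          ("EXPONENT", "^"), ("INT", "2")]
print(Parser(tokens).parse_expression(0))
# ('PLUS', ('INT', '3'), ('EXPONENT', ('TIMES', ('INT', '4'), ('INT', '17')), ('INT', '2')))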

The kinds of errors that would be caught during this stage of compilation are your traditional “syntax errors” such as:

  • Mismatched or missing parentheses in if conditions or while loops
  • Missing semicolons after variable assignment, statements, etc
  • Missing a return statement

Errors like failing to recognise that multiplication expressions bind more tightly than (should be evaluated before) addition expressions indicate a problem with your grammar.

3. Semantic Analysis

Abstract Syntax Tree -> Abstract Syntax Tree

This is the stage that does type-checking. To do this, we typically construct and maintain symbol tables (also called environments) which keep track of bindings between variables and their types, and methods and their return types.

We must walk through the abstract syntax tree and put into the table:

  • Variables and parameters with their type
  • Methods with their return types
  • Classes with their variables and methods

Then we can walk through the AST again and make sure that all of the types match what we were expecting. We don’t necessarily have to do this in two steps, but it might help to think about it as building the symbol table first, then doing the checking.

How do we construct this table? If you’re thinking about using a hash table, you’re right on the money: hash tables are great for our use case because they allow us to insert and look up things quickly. But once we have methods and local variables in our language, one master table isn’t going to cut it.

There are two main styles of implementing tables that express scope: the imperative and the functional. Imperative symbol tables work by keeping track of local bindings when you put them in the table, so that you can remove them when you exit the scope (but always making changes to the original table). Functional symbol tables work by creating a new table whenever you enter a new scope which contains the sum of the old table and the new local stuff. Unfortunately, creating new hash tables, copying over values, and destroying them is not very efficient. Instead we can use a binary search tree-like data structure (in the same spirit as using a linked list, but find operations are more efficient with a search tree).
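Here is a minimal sketch of what a functional-style scoped symbol table could look like; the class and method names are hypothetical. Instead of copying the whole table on scope entry, each new scope simply keeps a reference to its parent, which is what makes the functional style cheap in practice (an imperative table would instead mutate one shared table and remember what to undo when the scope is exited).

class SymbolTable:
    """A functional-style symbol table: each scope chains to its parent."""

    def __init__(self, parent=None):
        self.bindings = {}          # name -> type, for this scope only
        self.parent = parent

    def enter_scope(self):
        # A "new table" for the inner scope; the outer table is left untouched.
        return SymbolTable(parent=self)

    def insert(self, name, type_):
        if name in self.bindings:
            raise TypeError(f"'{name}' is already declared in this scope")
        self.bindings[name] = type_

    def lookup(self, name):
        table = self
        while table is not None:
            if name in table.bindings:
                return table.bindings[name]
            table = table.parent
        raise TypeError(f"'{name}' is used before it is defined")

# Usage: a method scope nested inside a class scope.
globals_table = SymbolTable()
globals_table.insert("Main", "class")
method_scope = globals_table.enter_scope()
method_scope.insert("args", "String[]")
print(method_scope.lookup("Main"))   # found in the enclosing scope -> "class"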

So the kinds of errors we would be able to catch during this stage include:

  • Trying to multiply two strings
  • Declaring two integer variables with the same name
  • Trying to use a variable that hasn’t been defined
  • Returning an int from a method that is declared with return type void
  • Passing the wrong number of arguments in a method call
  • Passing the wrong type of arguments in a method call
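With a table like the one sketched above, the checking pass is just another walk over the AST. Below is a rough, hypothetical sketch of how the first and third errors in that list could be caught; the tuple node shapes and the type names are made up for illustration.

def check_expression(node, env):
    """Return the type of an expression node, or raise a type error.

    Nodes are plain tuples such as ("INT", 3), ("ID", "x") or
    ("TIMES", left, right); env maps variable names to their types.
    """
    kind = node[0]
    if kind == "INT":
        return "int"
    if kind == "STRING":
        return "String"
    if kind == "ID":
        if node[1] not in env:
            raise TypeError(f"'{node[1]}' is used before it is defined")
        return env[node[1]]
    if kind in ("PLUS", "MINUS", "TIMES"):
        left = check_expression(node[1], env)
        right = check_expression(node[2], env)
        if left != "int" or right != "int":
            raise TypeError(f"cannot apply {kind} to {left} and {right}")
        return "int"
    raise NotImplementedError(kind)

# Trying to multiply two strings is rejected:
try:
    check_expression(("TIMES", ("STRING", "a"), ("STRING", "b")), env={})
except TypeError as error:
    print(error)   # cannot apply TIMES to String and String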

4. Translation to Intermediate Representation

Abstract Syntax Tree -> Intermediate Representation (IR)

So we’ve got these nice trees that make it easy for us to understand what is going on in the program, but we’re still very far away from our ultimate goal: machine code. Now, we could try to directly translate ASTs to machine code at this point, but unfortunately machine code contains tons of architecture-dependent information which our ASTs are still free from. This means that for each source language, we’d need a separate compiler for each target machine: m source languages and n machines would mean m × n compilers, instead of m front-ends and n back-ends sharing a common intermediate form. That’s a lot of combinations!

The intermediate representation gets us closer to machine level code while still hiding the details of the target machine architecture. There are many different kinds of IR being used in practice, but they all need to be:

  • Convenient to translate from ASTs
  • Convenient to translate to machine code (for all the desired supported target machines)
  • Made up of components that describe very simple things like add, move, jump so that optimisation and rewrites can be easily implemented in later stages

Suppose our IR had a few constructs like:

/* Expressions: instructions that return a value. */

/* Gets the value inside the register r. */
REG(r)

/* Applies a binary operator to two expressions. */
BINOP(Binop, Expression, Expression)

/* Statements: instructions that do not return a value but may have
   side effects. */

/* Gives a name to the current machine instruction address. */
LABEL(name)

/* Evaluates sourceExpression and moves it into the destination.
   The destination could be a register, or an address in memory. */
MOVE(destinationExpression, sourceExpression)

/* Unconditional jump to a label. */
JUMP(label)

/* Conditional jump. If the RelOp applied to the two expressions results
   in true, go to the thenLabel. Else, go to the elseLabel. */
CJUMP(RelOp, Expression, Expression, thenLabel, elseLabel)

/* Constants: integer literals like 0 and -1 appear directly as expressions. */

/* Types of binary operators. */
Binop ::= PLUS, MINUS, MULT

/* Types of relational operators. */
RelOp ::= EQUAL, LESSTHAN, GREATERTHAN

Then such a program

sum = a + b + c;
/* Return the absolute value of the sum. */
result = (sum > 0) ? sum : (sum * -1);

could be translated into

MOVE(REG(sum), BINOP(PLUS, BINOP(PLUS, REG(a), REG(b)), REG(c)))
CJUMP(GREATERTHAN, REG(sum), 0, trueBranch, falseBranch)
LABEL(trueBranch)
MOVE(REG(result), REG(sum))
JUMP(done)
LABEL(falseBranch)
MOVE(REG(result), BINOP(MULT, REG(sum), -1))
LABEL(done)

This way of evaluating the conditional is lazy: we never compute sum * -1 if sum turns out to be positive. How would the IR change if we wanted to always evaluate both branches?
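To show where IR like this might come from, here is a rough sketch of the translation step for the conditional. The IR "constructors" are just tuples standing in for the constructs defined above, and the new_label and translate_conditional names are hypothetical; a real translator would recursively translate the sub-expressions rather than receive them ready-made.

import itertools

_label_counter = itertools.count()

def new_label(hint):
    """Invent a fresh, unique label name."""
    return f"{hint}_{next(_label_counter)}"

def translate_conditional(relop, left, right, then_expr, else_expr, dest):
    """Translate `dest = (left relop right) ? then_expr : else_expr` into IR statements."""
    then_label, else_label, done_label = new_label("trueBranch"), new_label("falseBranch"), new_label("done")
    return [
        ("CJUMP", relop, left, right, then_label, else_label),
        ("LABEL", then_label),
        ("MOVE", dest, then_expr),
        ("JUMP", done_label),
        ("LABEL", else_label),
        ("MOVE", dest, else_expr),
        ("LABEL", done_label),
    ]

# result = (sum > 0) ? sum : (sum * -1);
ir = translate_conditional(
    "GREATERTHAN", ("REG", "sum"), 0,
    ("REG", "sum"),
    ("BINOP", "MULT", ("REG", "sum"), -1),
    ("REG", "result"),
)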

5. Canonicalization

Intermediate Representation -> Linearized Intermediate Representation (Linearized IR)

In order to do the optimisation that I alluded to earlier, we want straight-line IR code, not the jumping-all-over-the-place complex IR we might have at the beginning of this stage. It wasn’t so bad in the previous example, but IR full of statements whose results depend on execution order is hard to reorganise and rewrite later.

Let’s look at some more complicated IR. But first, we have to define a few more constructs:

/* Run the statement, then return the value of the expression. */ 
ESEQ(Statement, Expression)
/* Run the first statement, then the second statement. */
SEQ(Statement, Statement)

Recall that expressions calculate and return a value, but statements do not (though they may have side effects that do things like change the value of a variable, perform I/O, etc.). Thus an ESEQ would be an Expression instruction, while SEQ would be a Statement instruction.

What can we do now that we have these new additions? Well, how about:

// a = 1;
// b = 2;
// c = 3;
SEQ(MOVE(REG(a), 1),
    SEQ(MOVE(REG(b), 2),
        MOVE(REG(c), 3)))

// a = 1;
// b = a + 1;
// c = b + 1;
SEQ(MOVE(REG(a), 1),
    SEQ(MOVE(REG(b), BINOP(PLUS, REG(a), 1)),
        MOVE(REG(c), BINOP(PLUS, REG(b), 1))))

// a - (++a * b)
BINOP(MINUS, REG(a), ESEQ(MOVE(REG(a), BINOP(PLUS, REG(a), 1)),
                          BINOP(MULT, REG(a), REG(b))))

What’s wrong with that? It’s totally awesome, we can write denser IR code, with single instructions that do a lot of stuff.

The problem is that ESEQ and SEQ make it hard for us to reorder the statements, because different evaluation orders may yield different results. In the first case, the order of execution doesn’t matter. It does matter in the second and third cases — can you see why?

We should rewrite the IR to remove the ESEQs nested inside other instructions (pull them up to the top).

// a - (++a * b)
ESEQ(SEQ(MOVE(REG(temp), REG(a)),                    // save the value of a
         MOVE(REG(a), BINOP(PLUS, REG(temp), 1))),   // then increment a
     BINOP(MINUS, REG(temp), BINOP(MULT, REG(a), REG(b))))

This is just one example of simplifying the IR. There are other problems we need to deal with like handling return values if you have an expression like

BINOP(PLUS, CALL(...), CALL(...))

and the call expressions compete for the return value register. (How do you fix this?!) I won’t go into detail on this, but check out the resources listed below if you’re interested.
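Before moving on, here is a rough sketch of what the ESEQ-hoisting rewrite shown above (for a - (++a * b)) could look like over tuple-shaped IR. The fresh_temp helper and the single rewrite rule are hypothetical simplifications: a full canonicaliser would apply rules like this one everywhere in the tree until no ESEQs remain.

import itertools

_temp_counter = itertools.count()

def fresh_temp():
    """Invent a fresh temporary register."""
    return ("REG", f"temp{next(_temp_counter)}")

def hoist_eseq_from_binop(binop):
    """Rewrite BINOP(op, e1, ESEQ(s, e2)) so that the statement s runs first.

    Because s might change something that e1 depends on, e1 is saved into
    a fresh temporary before s executes.
    """
    op, left, right = binop[1], binop[2], binop[3]
    if isinstance(right, tuple) and right[0] == "ESEQ":
        statement, inner = right[1], right[2]
        temp = fresh_temp()
        return ("ESEQ",
                ("SEQ", ("MOVE", temp, left), statement),
                ("BINOP", op, temp, inner))
    return binop            # nothing to hoist

# a - (++a * b)
before = ("BINOP", "MINUS", ("REG", "a"),
          ("ESEQ", ("MOVE", ("REG", "a"), ("BINOP", "PLUS", ("REG", "a"), 1)),
                   ("BINOP", "MULT", ("REG", "a"), ("REG", "b"))))
print(hoist_eseq_from_binop(before))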

Now that we’ve got our canonical trees, we will group them into basic blocks that don’t have internal labels or jumps. Finally, we’ll order these blocks so that every CJUMP is immediately followed by its false label (this step is called trace scheduling). We want this because we can then eliminate the false label in the jump — if the condition in the CJUMP is not satisfied, we can go directly to the next instruction and that’s exactly what we want to execute.
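As a last sketch, here is roughly how the grouping into basic blocks could be done over a flat list of IR statements. The function name and the tuple IR shape are hypothetical, and a real implementation would also invent labels and jumps so that every block begins with a LABEL and ends with a JUMP or CJUMP before the trace scheduling happens.

def split_into_basic_blocks(statements):
    """Group a flat list of IR statements into basic blocks.

    A new block starts at every LABEL, and a block ends right after a
    JUMP or CJUMP, so blocks contain no internal labels or jumps.
    """
    blocks, current = [], []
    for statement in statements:
        kind = statement[0]
        if kind == "LABEL" and current:
            blocks.append(current)        # a label can only start a block
            current = []
        current.append(statement)
        if kind in ("JUMP", "CJUMP"):
            blocks.append(current)        # a jump can only end a block
            current = []
    if current:
        blocks.append(current)
    return blocks

# The IR from the absolute-value example in section 4:
ir = [
    ("MOVE", ("REG", "sum"), ("BINOP", "PLUS", ("BINOP", "PLUS", ("REG", "a"), ("REG", "b")), ("REG", "c"))),
    ("CJUMP", "GREATERTHAN", ("REG", "sum"), 0, "trueBranch", "falseBranch"),
    ("LABEL", "trueBranch"),
    ("MOVE", ("REG", "result"), ("REG", "sum")),
    ("JUMP", "done"),
    ("LABEL", "falseBranch"),
    ("MOVE", ("REG", "result"), ("BINOP", "MULT", ("REG", "sum"), -1)),
    ("LABEL", "done"),
]
for block in split_into_basic_blocks(ir):
    print(block)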

Now our IR is ready for translation to machine code where we’ll do exciting stuff like instruction selection, control flow analysis, and register allocation!
