Bet You Didn’t Know This About Compilers (Part 2)

Daniel Jensen
8 min read · Jan 31, 2023


If you haven’t read the first part yet, you can find it here!

Alright. So, where did we leave off? Oh yeah, right: a more complicated follow-up on the inner workings of a compiler.

We’ll walk through:

  • Lexical Analysis
  • Syntax Analysis
  • Type Checking
  • Intermediate Code Generation
  • Register Allocation
  • Machine Code Generation
  • Assembly and Linking

… each of these will get its own post digging even deeper into the subject. Stay tuned for that!

Lexical Analysis

Have you ever heard of lexical analysis? No? Well, let me break it down for you. It’s the first step in the compilation process, also known as lexing or tokenization. It’s like the appetizer before the main course, if the main course is the rest of the compilation process.

Photo by Sebastian Coman Photography on Unsplash

So, what does it involve? It’s pretty simple, really. It takes the source code written in a high-level programming language and breaks it down into its individual elements, known as tokens. Tokens are the building blocks of a program: keywords, operators, punctuation, and identifiers.

The process of lexical analysis typically involves several steps. First, it reads the source code character by character. Then, it identifies the lexemes, which are the smallest meaningful unit of a program. Finally, it groups the lexemes into tokens.

The lexical analyzer, or lexer, uses a set of rules called a lexical grammar to identify the lexemes and group them into tokens. The lexical grammar is typically defined using regular expressions, which are a formal way of describing patterns in text.

Once the lexical analysis is complete, the tokens are passed to the next stage of the compilation process, which is parsing. Parsing is like trying to make sense of a jigsaw puzzle: it takes the tokens and analyzes the structure of the program to understand the meaning of the code.

Lexical analysis is an important step in the compilation process because it is the foundation for the rest of the process. By breaking the code down into its individual elements, the compiler can more easily understand the structure and meaning of the program. Understanding the lexemes of a program also helps to identify errors early in the process, before they can cause more serious problems later on. In other words, it’s like catching a typo before it becomes a bigger problem.

What does it look like in code?

Here is a simple example of lexical analysis in Python:

import re

# Define the lexical grammar using regular expressions
lexical_grammar = {
    'NUMBER': r'\d+',
    'PLUS': r'\+',
    'MINUS': r'-',
    'MULTIPLY': r'\*',
    'DIVIDE': r'/',
    'LPAREN': r'\(',
    'RPAREN': r'\)'
}

# Create a list of tokens based on the lexical grammar
def tokenize(expression):
    tokens = []
    # Find every lexeme that matches one of the patterns
    source = re.findall('|'.join(lexical_grammar.values()), expression)
    for item in source:
        # Label each lexeme with the first token name whose pattern matches it
        for key, value in lexical_grammar.items():
            if re.match(value, item):
                tokens.append((key, item))
                break
    return tokens

# Test the tokenize function with a simple arithmetic expression
expression = "2 + 3 * 4 - 5 / 6"
print(tokenize(expression))
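Running this on the test expression should print something close to: [('NUMBER', '2'), ('PLUS', '+'), ('NUMBER', '3'), ('MULTIPLY', '*'), ('NUMBER', '4'), ('MINUS', '-'), ('NUMBER', '5'), ('DIVIDE', '/'), ('NUMBER', '6')]. Note that the whitespace simply disappears: it never matches any pattern, so it never becomes a token.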

This code is an example of lexical analysis in Python. It takes a simple arithmetic expression and uses a set of regular expressions to identify the lexemes and group them into tokens.

The lexical grammar is defined in a dictionary, where the keys are the token names, and the values are the regular expressions that match the lexemes for each token.

The code uses the re library to perform the regular expression matching: re.findall pulls the individual lexemes out of the input string, and re.match then decides which token name each lexeme belongs to. The matched lexemes are stored in a list as tuples, with the first element being the token name and the second element being the lexeme.

Once the lexical analysis is complete, the tokens are passed to the next stage of the compilation process, which is parsing. The parsing process uses the tokens to analyze the structure of the program and understand the meaning of the code.

This is an important step in the compilation process, as it helps to catch malformed input early on, before the later stages try to make sense of the code.

Syntax Analysis

Parsing, also known as syntax analysis, is the second course in the compilation feast. It takes the yummy tokens from lexical analysis and studies the recipe of the program to understand what it’s all about.

Photo by Sebastian Coman Photography on Unsplash

The chef of syntax analysis, or parser, uses a cookbook called a grammar to whip up the structure of the code. This cookbook lists the ingredients and cooking instructions of the programming language, including the rules for how to mix and match the elements to create a perfect program. The parser then bakes a parse tree, which is a fancy presentation of the code showing the relationships between the ingredients.

Next up, the parse tree is transformed into a simpler dish called the abstract syntax tree (AST). The AST discards any unnecessary ingredients and focuses on the key relationships between the elements of the code. This dish is then served to the rest of the compiler for the main course of generating machine code or checking for any semantic errors, like using spoiled ingredients or mismatched recipe types.

Parsing is crucial to the compilation process as it provides a deeper understanding of the code’s recipe and structure. It catches errors in the code that would be hard to spot later on and also provides a handy representation of the code for other parts of the compiler to use, like code generators or recipe optimizers.

In a nutshell, parsing takes the tokens from lexical analysis and studies the structure of the program to understand its meaning. The parser uses a grammar to make a parse tree, which is then transformed into an abstract syntax tree that the rest of the compiler uses to serve up machine code or check for semantic errors.

In rough pseudocode, the whole stage looks something like this (construct_parse_tree, simplify_parse_tree and the other helpers just stand in for the real work):

def parse(tokens):
    # Use a grammar to construct a parse tree
    parse_tree = construct_parse_tree(tokens, grammar)

    # Transform the parse tree into an abstract syntax tree
    ast = simplify_parse_tree(parse_tree)

    # Use the abstract syntax tree for further processing
    generate_machine_code(ast)
    check_semantic_errors(ast)

tokens = lexical_analysis(code)
parse(tokens)
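To make this a bit more concrete, here is a minimal sketch of a recursive-descent parser for the tokens produced by the lexer from the previous section. It skips the separate parse-tree step and builds a small AST out of nested tuples directly; the expression/term/factor helpers are my own names, not part of any library, and the grammar only covers the arithmetic example.

# A minimal recursive-descent parser sketch (illustrative, not a real compiler front end).
# It consumes the (token name, lexeme) pairs from tokenize() and builds an AST of nested tuples.

def parse(tokens):
    pos = 0

    def peek():
        # Look at the name of the next token without consuming it
        return tokens[pos][0] if pos < len(tokens) else None

    def consume(expected):
        # Consume the next token, failing loudly if it is not what the grammar expects
        nonlocal pos
        name, lexeme = tokens[pos]
        if name != expected:
            raise SyntaxError(f"expected {expected}, got {name}")
        pos += 1
        return lexeme

    # expression := term (('PLUS' | 'MINUS') term)*
    def expression():
        node = term()
        while peek() in ('PLUS', 'MINUS'):
            op = peek()
            consume(op)
            node = (op, node, term())
        return node

    # term := factor (('MULTIPLY' | 'DIVIDE') factor)*
    def term():
        node = factor()
        while peek() in ('MULTIPLY', 'DIVIDE'):
            op = peek()
            consume(op)
            node = (op, node, factor())
        return node

    # factor := NUMBER | '(' expression ')'
    def factor():
        if peek() == 'NUMBER':
            return ('NUMBER', consume('NUMBER'))
        consume('LPAREN')
        node = expression()
        consume('RPAREN')
        return node

    return expression()

print(parse(tokenize("2 + 3 * 4 - 5 / 6")))
# ('MINUS', ('PLUS', ('NUMBER', '2'), ('MULTIPLY', ('NUMBER', '3'), ('NUMBER', '4'))),
#           ('DIVIDE', ('NUMBER', '5'), ('NUMBER', '6')))

Notice that the nesting already reflects operator precedence: multiplication and division bind tighter than addition and subtraction because term() is called from inside expression().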

Type Checking

Photo by Kenny Eliason on Unsplash

Type checking is a critical step in the compilation process. It is used to ensure that the program uses its values in ways that are consistent with their types. During type checking, the compiler examines each of the program’s variables, functions, expressions, and statements to determine whether they conform to the language’s type rules. It is like a teacher checking a student’s homework for accuracy and completeness.

Type checking can be a bit of a tricky process. It involves comparing the type of the expression or variable to the type of the value being assigned to it. If the types are not compatible, the compiler will report an error. Think of it like trying to fit a square peg into a round hole. It just won’t work.

In some cases, the compiler may be able to make an educated guess about the type of a variable or expression. This is known as type inference. It can be a useful tool for making sure that the code is valid and consistent. Think of it like a teacher making an educated guess about the answer to a question based on the student’s previous answers.

Despite its complexity, type checking is an essential part of the compiler’s job. Without it, the compiler cannot guarantee that the code is valid and accurate. Think of it like a carpenter using the wrong tools to build a house. Without the right tools, the house may not be built correctly.

To illustrate, here is an example of type checking in action. Let’s say we have a function that takes two numbers as arguments and returns their sum:

def add(num1, num2):
    return num1 + num2

When a compiler (or, in Python’s case, a static type checker working from type annotations) encounters this function, it will check that the two arguments are both numbers. If either of them is not a number, it will report an error. Similarly, it will check that the value being returned is also a number. If it is not, it will report an error.
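For instance, if we add type annotations, a checker such as mypy can flag a bad call before the program ever runs (a small sketch; the exact wording of the error depends on the tool):

def add(num1: int, num2: int) -> int:
    return num1 + num2

add(2, 3)        # fine: both arguments are ints
add(2, "three")  # a static type checker would reject this call before the program runs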

Type checking is a critical part of the compiler’s job. It ensures that the code is valid and accurate, and it can even help with type inference. Without it, the code may not run as expected.
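To show what this looks like from the compiler’s side, here is a tiny, hand-wavy type checker that walks a tuple AST much like the one built by the parsing sketch earlier. The 'int'/'str' type names and the check_types function are invented for illustration; real type systems are far richer.

# A toy type checker over a tuple AST (purely illustrative).
# Leaves are ('NUMBER', lexeme) or ('STRING', lexeme); inner nodes are (op, left, right).

def check_types(node):
    kind = node[0]
    if kind == 'NUMBER':
        return 'int'
    if kind == 'STRING':
        return 'str'
    # Arithmetic operators require both operands to be ints
    op, left, right = node
    left_type = check_types(left)
    right_type = check_types(right)
    if left_type != 'int' or right_type != 'int':
        raise TypeError(f"{op} expects int operands, got {left_type} and {right_type}")
    return 'int'

print(check_types(('PLUS', ('NUMBER', '2'), ('NUMBER', '3'))))      # int

try:
    check_types(('PLUS', ('NUMBER', '2'), ('STRING', '"three"')))
except TypeError as error:
    print(error)  # PLUS expects int operands, got int and str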

Intermediate Code Generation

Photo by Museums Victoria on Unsplash

Intermediate code generation is an important component of compiler design, as it bridges the gap between source code and the machine code generated for execution. It is used to create a representation of the source code that is easier for the compiler to process. This representation is known as intermediate code and can be thought of as a bridge between the high-level source code and the machine code.

Intermediate code generation is used to create a platform-independent representation of the source code, which can then be tailored to the specific machine code needed for different target systems. The intermediate code often starts out as an abstract syntax tree (AST), a hierarchical tree-like structure of the source code’s syntactic components, which is then lowered into a flatter, instruction-like form. Both representations preserve the essential structure of the source code and are easy to manipulate, making them ideal candidates for optimization.

Intermediate code generation can be thought of as an assembly line in a factory, where parts are assembled into a bigger product. The source code is like the raw materials that feed into the assembly line and the intermediate code is like the product that comes out of the assembly line. The machine code is the final product that is ready to be used.

For example, consider the following snippet of C code:


int a = 5;
int b = 10;
int c = a + b;

This code can be represented as an AST with the root node representing the assignment to c and the addition below it: the left subtree representing the left operand, and the right subtree representing the right operand. This AST can then be used to generate the intermediate code for the operation.
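As a rough sketch (the node layout here is my own choice, not a fixed format), that AST could be written down in Python like this:

# A possible AST for "int c = a + b;" as nested tuples (illustrative only)
ast = ('assign', 'c', ('+', ('var', 'a'), ('var', 'b')))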

int c;
c = add(a, b);

The generated code is platform-independent and can be tailored to various target systems. In this example, the variable c is assigned the result of the operation add(a,b). This operation can be translated to the appropriate machine code for the target system.
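As a final sketch, here is one way such an AST could be lowered into a linear, three-address-style intermediate form. The new_temp and lower functions are invented for this example; production compilers use established IRs such as LLVM IR or GCC’s GIMPLE instead.

# Lower a tuple AST into three-address-style instructions (illustrative only)

temp_count = 0

def new_temp():
    # Hand out fresh temporary names: t1, t2, ...
    global temp_count
    temp_count += 1
    return f"t{temp_count}"

def lower(node, code):
    kind = node[0]
    if kind == 'var':
        return node[1]
    if kind == 'assign':
        _, name, expr = node
        code.append(f"{name} = {lower(expr, code)}")
        return name
    # Binary operation: lower both operands, then emit one instruction into a fresh temp
    op, left, right = node
    left_val = lower(left, code)
    right_val = lower(right, code)
    temp = new_temp()
    code.append(f"{temp} = {left_val} {op} {right_val}")
    return temp

instructions = []
lower(('assign', 'c', ('+', ('var', 'a'), ('var', 'b'))), instructions)
print(instructions)  # ['t1 = a + b', 'c = t1']

Each line of the resulting list is a single, simple operation, which is exactly the kind of shape that later stages like register allocation and machine code generation want to work with.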

Intermediate code generation is an important component of compiler design that helps bridge the gap between source code and machine code. By creating a platform-independent representation of the source code, it allows for efficient optimization and easy translation to machine code for various target systems.

… To be continued.
