A compiler — under the hood

Kornelije Petak
6 min read · Nov 12, 2023



🔗 This article is part of the series: What do programmers and computers do?
🔗 Previously: Seam — a new programming language

In the last article, we defined the first version of the Seam language. We now want to look into how a compiler translates code written in Seam into something a machine can execute.

For now, it is almost irrelevant what we’ll be translating into; we just want to get to something we can easily translate into something else. That intermediate step in the journey is called the Abstract Syntax Tree.

But let’s get there step by step.

Lexical analysis

Lexical analysis is the first action the compiler will do with your code. But what does it mean to analyze code lexically?

Lexis comes from the Greek language and means ‘word.’

When we talk, our brain quickly cuts the relatively continuous sound into smaller chunks of sounds — words. Even if we hear somebody talk fast, we can discern (at least most of the time) the exact words they say. This is that lexical analysis we’re talking about, done by the brain.

Compilers do the same. Except, it’s not just about words. If we look at a random sample of code:

function addOne(val: number) -> number
{
    return val + 1
}

we can see that the code is not just words. It also contains parentheses, a colon, a plus, a number, and some braces. For the compiler, these also have a meaning, so we’ll use the word tokens for both the words and all other units with meaning (such as a plus sign).

The desired output of the lexical analysis is a sequence of tokens.

In the example above, it would be:

function, addOne, (, val, :, number, ), ->, number, {, return, val, +, 1, and }.

We have various types of tokens here:

  • Keywords: function, number, and return
  • Identifiers: addOne, val
  • Literals: 1
  • Punctuation: (, ), {, }, :, ->
  • Operators: +

If you look at the source code, it is a sequence of characters that needs to be lexically analyzed. The above sequence of tokens is the result of that analysis, and in a way, we can say that we have translated a sequence of characters into a sequence of tokens.
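
To make this step concrete, here is a minimal sketch of such a tokenizer in Python. The token names, the keyword set, and the regular expressions are assumptions I made for this example; this is not Seam’s actual lexer:

import re

# A minimal, illustrative tokenizer for the snippet above (not Seam's real lexer).
# The token names and the keyword set are assumptions made for this sketch.
TOKEN_SPEC = [
    ("ARROW",      r"->"),
    ("NUMBER",     r"\d+"),
    ("IDENT",      r"[A-Za-z_][A-Za-z_0-9]*"),
    ("PUNCT",      r"[(){}:]"),
    ("OP",         r"\+"),
    ("WHITESPACE", r"\s+"),
]
KEYWORDS = {"function", "return", "number"}

def tokenize(source):
    tokens = []
    pos = 0
    while pos < len(source):
        for name, pattern in TOKEN_SPEC:
            match = re.match(pattern, source[pos:])
            if match:
                text = match.group(0)
                pos += len(text)
                if name == "WHITESPACE":
                    break                                 # skip spaces and newlines
                if name == "IDENT" and text in KEYWORDS:
                    name = "KEYWORD"
                tokens.append((name, text))
                break
        else:
            raise SyntaxError(f"Unexpected character: {source[pos]!r}")
    return tokens

code = "function addOne(val: number) -> number { return val + 1 }"
print(tokenize(code))
# [('KEYWORD', 'function'), ('IDENT', 'addOne'), ('PUNCT', '('), ('IDENT', 'val'), ...]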

Now, we can proceed with the syntax analysis.

Syntax analysis

This step in the process is also called parsing. At the core, the main goal of parsers is to determine whether some code (that is, a list of lexical tokens) is valid in a particular grammar.

We mentioned in our recent language discussion that every language has to have vocabulary and grammar. This is the case both for natural as well as programming languages.

So, how do we define our grammar? Isn’t this daunting? What notation should we use to write it down? Feel free to check this Wikipedia article to grasp what formal grammars are and how they are usually written.

I will use a simplified notation in this article, but I will use BNF (or some other more common standard) in the future.

For now, let’s start simple, with a simple sample grammar.

“Simple sample” was supposed to sound funny.

SENTENCE -> NOUN VERB 'a' OBJECT '.'
NOUN -> 'Father'
NOUN -> 'John'
VERB -> 'plays'
VERB -> 'cooks'
OBJECT -> 'violin'
OBJECT -> 'stew'

This sample grammar has seven rules, each consisting of a symbol on the left side of -> and a sequence of symbols on the right. Symbols wholly written in uppercase are called non-terminal symbols, while each string wrapped in single quotes is a terminal symbol.

If we start with the initial non-terminal symbol, SENTENCE (by convention, the one in the first rule of the grammar definition), we can replace it with the right side of that rule. Then we find the next non-terminal symbol remaining in the text and look for a rule that defines what we can replace it with. If multiple rules have the same non-terminal symbol on the left, we can pick and apply whichever we want. We continue the process until all non-terminal symbols are replaced with terminal ones. The result is a piece of text written in the language defined by the grammar.

For example:

SENTENCE                  

[after applying rule #1]
NOUN VERB a OBJECT .

[after applying rule #3]
John VERB a OBJECT .

[after applying rule #4]
John plays a OBJECT .

[after applying rule #7]
John plays a stew .

I intentionally did not explain the difference between the terminal and non-terminal symbols before because I knew it would be apparent from the example. Non-terminal symbols are just placeholders that we can still replace with any of the available choices defined by the grammar rules.

Note, however, that in this grammar, John plays a stew. is a valid string, even though it doesn’t make any sense. Syntax analysis does not care about meaning and making sense (at least not typically). It just needs to know whether the string is in a given language.
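
To see the generation process in code, here is a small sketch that expands non-terminals by picking rules at random until only terminal symbols remain. The dictionary layout and the function name are mine, purely for illustration:

import random

# The toy grammar from above: each non-terminal maps to its possible right-hand sides.
# Terminal symbols are plain strings; non-terminal symbols are the dictionary keys.
GRAMMAR = {
    "SENTENCE": [["NOUN", "VERB", "a", "OBJECT", "."]],
    "NOUN":     [["Father"], ["John"]],
    "VERB":     [["plays"], ["cooks"]],
    "OBJECT":   [["violin"], ["stew"]],
}

def generate(symbol):
    if symbol not in GRAMMAR:                  # terminal symbol: nothing left to replace
        return [symbol]
    rule = random.choice(GRAMMAR[symbol])      # pick any rule with this symbol on the left
    result = []
    for part in rule:
        result.extend(generate(part))
    return result

print(" ".join(generate("SENTENCE")))
# e.g. "John plays a stew ." or "Father cooks a violin ."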

The way we reached John plays a stew. from the non-terminal symbol SENTENCE is quite the opposite of how parsers work. Whereas we generated a string by repeatedly applying production rules to non-terminal symbols, parsers scan the lexical tokens from the code and look for rules they can apply to reduce parts of the code to their corresponding non-terminal symbols. They repeatedly apply rules by replacing their right sides with their left sides (non-terminal symbols), gradually reducing the code to a single non-terminal symbol. If they can reduce the code to the starting non-terminal symbol this way, the code is syntactically valid in that language.

And this is the main thing that parsers do.
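
To illustrate just that recognition part, here is a deliberately naive sketch that keeps replacing right sides with their non-terminals until, hopefully, only SENTENCE remains. Real parsers use systematic algorithms such as recursive descent or LR parsing; this loop merely happens to work for our tiny grammar:

# Each rule is (left side, right side). Reducing means replacing an occurrence
# of a right side with its left side until a single symbol remains.
RULES = [
    ("SENTENCE", ["NOUN", "VERB", "a", "OBJECT", "."]),
    ("NOUN",     ["Father"]),
    ("NOUN",     ["John"]),
    ("VERB",     ["plays"]),
    ("VERB",     ["cooks"]),
    ("OBJECT",   ["violin"]),
    ("OBJECT",   ["stew"]),
]

def recognize(tokens):
    symbols = list(tokens)
    changed = True
    while changed:
        changed = False
        for left, right in RULES:
            for i in range(len(symbols) - len(right) + 1):
                if symbols[i:i + len(right)] == right:
                    symbols[i:i + len(right)] = [left]    # reduce right side to left side
                    changed = True
                    break
            if changed:
                break
    return symbols == ["SENTENCE"]

print(recognize(["John", "plays", "a", "stew", "."]))     # True: valid in the grammar
print(recognize(["John", "stew", "a", "plays", "."]))     # False: no rule sequence fits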

Of course, just having a simple yes/no is not useful for programming languages because we want to make a program out of our code, not just know whether we wrote it correctly. So, the syntax analysis usually also creates this Abstract Syntax Tree we mentioned, which we can use to generate some machine code later, transpile it into another language, or interpret it immediately.

Syntax analysis usually also includes some error checking, reporting, and recovery, but we may deal with that later. Or only in code.

After lexical and syntax analysis, compilers do other steps, such as semantic checks and code generation. But for now, we’ll stop here. What we want at this moment is just the Abstract Syntax Tree.

Let’s refresh our memories of what our example code is.

function addOne(val: number) -> number
{
    return val + 1
}

Without getting into how to achieve this yet, the outcome we desire is a tree that represents the structure of this function.
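
Written down as nested nodes, that tree could look roughly like this; the node kinds and field names are purely illustrative, not necessarily what Seam’s compiler will end up using:

# An illustrative sketch of the AST for addOne, written as nested Python dicts.
# The node kinds and field names are assumptions, not Seam's actual AST shape.
add_one_ast = {
    "kind": "FunctionDeclaration",
    "name": "addOne",
    "returnType": "number",
    "parameters": [
        {"kind": "Parameter", "name": "val", "type": "number"},
    ],
    "body": [
        {
            "kind": "ReturnStatement",
            "expression": {
                "kind": "BinaryExpression",
                "operator": "+",
                "left":  {"kind": "Identifier", "name": "val"},
                "right": {"kind": "NumberLiteral", "value": 1},
            },
        },
    ],
}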

Note that this is not the same as a Concrete Syntax Tree, which matches the structure of the code exactly. Our syntax tree is abstract, but only to the degree that we can still use it later to generate code from it unambiguously.

Note also that some nodes, besides representing a syntax element’s purpose, contain additional data, such as the function’s name and its return type or the argument’s type and name. This metadata is needed during the semantic analysis (such as type-checking) and code generation.

The level of detail and the exact structure of the AST differs among parsers for different languages. Still, all of them should be as detailed as needed for future steps to be unambiguous, correct, and easy to perform.

We have to stop here. We don’t have enough features in the language yet to do semantic analysis, and it would be too much of a leap to jump into code generation.

We first need to see how to get from the code to the Abstract Syntax Tree. And this will be the topic of the next post. Stay tuned.
