Why We Built a Compiler We Won’t Use


By Niels Simonides and Jerre van Veluw


Introduction

Compilers are weird. Full confession: we have never built a compiler before. Yes, we are software engineers, and we have built programs that transform one language into another (graphQLSimpleBindings, CMACC), but those are not full compilers. Maybe you have created tools that transform one text into another and, like us, never built a compiler, even though you are a professional software engineer. Your education might not have been in computer science, or you never took Compilers 101.

Compilers are intriguing though, are they not? They take in one language and, through a process of lexical analysis, emit another language. Moreover, in the case of computer science, the emitted language is a program. Even better, that program could be a new compiler, used to create better languages to create better compilers. Compilers are weird. It is akin to pulling yourself out of a swamp by your own hair. This makes compilers interesting, more interesting than other programs.

To learn how compilation works you could read about it, or listen to a talk. We listened to such a talk at QCon 2019 (a video along with the transcript is linked in the references below). Colin Eberhardt’s talk was interesting and informative: you learn what tokenizing, parsing, and emitting entail. In this particular case, it showed how you can create a compiler in TypeScript that takes in a language of your own design and transforms it into WebAssembly bytecode. So far so good. If you only want to ‘get it’, you are done! Of course, when you want to understand, you need to create.

“What I cannot create I do not understand.” — Richard Feynman

So we decided to create a compiler.

Building a Compiler

In order to understand and build a compiler, we needed to decide a few things upfront: what programming language to build the compiler in, what our compilation target would be, and, most importantly, what language of our own design to compile. Since we love Kotlin, we used Kotlin Native to create our compiler. The compilation target should be low level, and with WebAssembly we could reuse Colin’s insights to speed up the process. But what language would we want to compile? Since we are Dutch and do not expect our compiler to be of any use, we chose to create a language with Dutch keywords. This shows that you can choose any keywords and are not bound to English. In addition, it makes sure that no one in their right mind will use this language. More importantly, it is quite hilarious to come up with Dutch words to describe a program with.

Architecture

At a high level, our compiler consists of three components: the tokenizer, the parser, and the emitter.

Each component transforms one representation into another: first the source code into a list of tokens, then the tokens into an Abstract Syntax Tree (AST), and finally the AST into a byte array.

The Tokenizer

The tokenizer splits the source code into tokens. Technically, a defined regex matches a string of source code, which is mapped to a single token. This implies a one-to-one mapping between the source code and the list of tokens. For example, consider the following source code: “waarde getal wordt 14; druk af getal;” (in English pseudo-code: “val number = 14; print number;”). This statement is broken down into 13 different parts:
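- “waarde” (value keyword)
- “ ” (whitespace)
- “getal” (identifier)
- “ ” (whitespace)
- “wordt” (assignment)
- “ ” (whitespace)
- “14” (number)
- “;” (end of line)
- “ ” (whitespace)
- “druk af” (print keyword)
- “ ” (whitespace)
- “getal” (identifier)
- “;” (end of line)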

The tokenizer matches the first part of the text with a list of “matchers”:
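In Kotlin, such a matcher list could look like this (a sketch; the names and exact regexes are ours):

// Sketch: each matcher pairs an anchored regex with the token type it produces.
enum class TokenType { Whitespace, Print, Value, Assignment, Number, Identifier, EOL }

data class Matcher(val regex: Regex, val type: TokenType)

// Order matters: “druk af” must be tried before the generic Identifier matcher.
val matchers = listOf(
    Matcher("""^\s+""".toRegex(), TokenType.Whitespace),
    Matcher("""^druk af""".toRegex(), TokenType.Print),
    Matcher("""^waarde""".toRegex(), TokenType.Value),
    Matcher("""^wordt""".toRegex(), TokenType.Assignment),
    Matcher("""^[0-9]+""".toRegex(), TokenType.Number),
    Matcher("""^[a-zA-Z]+""".toRegex(), TokenType.Identifier),
    Matcher("""^;""".toRegex(), TokenType.EOL)
)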

When a match is found that string is removed from the source code. The tokenizer continues to match the rest of the source code until it is completely tokenized:
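A minimal sketch of this recursive step, building on the matchers above (our reconstruction, not the original code):

data class Token(val value: String, val type: TokenType)

// Try each matcher in order and return the first token that matches;
// if nothing matches, fail the compilation with an exception.
fun findToken(source: String): Token = matchers
    .mapNotNull { m -> m.regex.find(source)?.let { Token(it.value, m.type) } }
    .firstOrNull()
    ?: throw IllegalStateException("Cannot tokenize: $source")

// Match a token at the start of the source, chop it off, and recurse on the rest.
fun tokenize(source: String): List<Token> =
    if (source.isEmpty()) emptyList()
    else findToken(source).let { token ->
        listOf(token) + tokenize(source.removePrefix(token.value))
    }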

Let’s break this down:

First, findToken tries to match each matcher against the source code and returns the first token for which a match is found. There can be multiple matches: for example, ‘druk’ would also qualify as an ‘Identifier’ token. This is why the order of the matchers matters; we want to create a ‘Print’ token for ‘druk af’ and not an ‘Identifier’ token for ‘druk’. If no match is found we throw an exception to fail the compilation.

The regular expression in a matcher can be defined with a ^ to match the start of a string. In this way, the tokenizer can just remove the matched string from the start of the source and recursively analyse until there is nothing left. In a simple example it looks like this:

tokenize(“druk af 14;”):
- token(value: “druk af”, type: print)
tokenize(“ 14;”):
- token(value: “druk af”, type: print)
- token(value: “ ”, type: whitespace)
tokenize(“14;”):
- token(value: “druk af”, type: print)
- token(value: “ ”, type: whitespace)
- token(value: “14”, type: number)
tokenize(“;”):
- token(value: “druk af”, type: print)
- token(value: “ ”, type: whitespace)
- token(value: “14”, type: number)
- token(value: “;”, type: EOL)

When tokenizing is complete, the source code has been successfully transformed into a list of tokens. In other words: the source has been analysed to be lexically correct. The next step is to analyse whether the source code means anything.

The Parser

The parser transforms the list of tokens into an Abstract Syntax Tree (AST). An AST is a tree of nodes that describes the relations between tokens. Where the tokens are just a flat representation of the source syntax, the AST is a structured representation of what the source means. Since the ‘Whitespace’ tokens do not mean anything to the parser, they can safely be discarded. The parser then expects groups of tokens to appear in a certain order.

A TokenProvider is created to iterate over all the tokens and parse them. In this way the parser groups different tokens that belong to the same statement into a ProgramNode.
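A sketch of what such a TokenProvider could look like (the names are ours):

// Iterates over the tokens; whitespace carries no meaning for the parser,
// so it is filtered out up front.
class TokenProvider(tokens: List<Token>) {
    private val tokens = tokens.filter { it.type != TokenType.Whitespace }
    private var index = 0

    val currentToken: Token
        get() = tokens[index]

    // “Eating” a token returns the current token and advances to the next one.
    fun eatToken(): Token = currentToken.also { index++ }

    fun hasNext(): Boolean = index < tokens.size
}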

Since statements start with a keyword, the first token of a statement is of the Keyword type. If parseStatement finds some other token it throws an exception. In other words: if the source code is syntactically correct but does not start with a keyword, it does not mean anything. Even more so: because the parseStatement function is defined in this way and the TokenProvider only parses statements, the source code must start with a statement, and statements must start with a keyword.

Print and Value are both keyword tokens. The “druk af” Print token triggers the parsePrintStatement function, while “waarde” triggers the parseVariableDeclarationAndAssignmentStatement function.
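A sketch of this dispatch, together with the node types we use below (our reconstruction; for brevity, statements and expressions share one sealed hierarchy):

sealed class ProgramNode
data class NumberNode(val number: Float) : ProgramNode()
data class IdentifierNode(val name: String, val set: Boolean = false) : ProgramNode()
data class VariableAndAssignmentDeclaration(
    val identifier: IdentifierNode,
    val expression: ProgramNode
) : ProgramNode()
data class PrintStatement(val expression: ProgramNode) : ProgramNode()

// A statement must start with a keyword token; anything else is meaningless.
fun parseStatement(provider: TokenProvider): ProgramNode =
    when (provider.currentToken.type) {
        TokenType.Print -> parsePrintStatement(provider)
        TokenType.Value -> parseVariableDeclarationAndAssignmentStatement(provider)
        else -> throw IllegalStateException("Expected a keyword, got: ${provider.currentToken}")
    }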

In our source “waarde getal wordt 14; druk af getal;” we are dealing with a variable assignment and declaration first:
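A sketch of this function (the expect helper is our own shorthand):

// Fail fast when a token is not of the expected type.
fun expect(token: Token, type: TokenType): Token =
    if (token.type == type) token
    else throw IllegalStateException("Expected $type but got $token")

fun parseVariableDeclarationAndAssignmentStatement(provider: TokenProvider): ProgramNode {
    provider.eatToken()                                               // consume “waarde”
    val identifier = IdentifierNode(
        name = expect(provider.eatToken(), TokenType.Identifier).value,
        set = true                                                    // a new value will be stored
    )
    expect(provider.eatToken(), TokenType.Assignment)                 // “wordt”
    val expression = parseExpression(provider)                        // the value, e.g. 14
    expect(provider.eatToken(), TokenType.EOL)                        // “;”
    return VariableAndAssignmentDeclaration(identifier, expression)
}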

First, eatToken is called; this moves the TokenProvider to the next token. Apparently ‘eating a token’ is a common term in compiler construction, so we will just use that here. Moreover, we know to expect tokens in a specific order: the identifier we want to assign to, the assignment itself, the value, and finally an EOL. If any of these is missing, the variable assignment is meaningless. When the statement is meaningful, a VariableAndAssignmentDeclaration is returned holding the identifier and number nodes.

The next statement in our program is the print statement:
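Again a sketch along the same lines:

fun parsePrintStatement(provider: TokenProvider): ProgramNode {
    provider.eatToken()                        // consume “druk af”
    val expression = parseExpression(provider) // the expression to print
    expect(provider.eatToken(), TokenType.EOL) // “;”
    return PrintStatement(expression)
}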

First, eatToken consumes the current token. Subsequently, we expect the expression to print, and finally an end-of-line token. Again, if either is missing, the print statement is meaningless. When the print statement is meaningful, a PrintStatement is returned holding the expression. We have now shown a couple of expressions to parse. To parse a meaningful expression we defined a function like this:
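A sketch:

// Map a Number or Identifier token to its corresponding node and consume the token.
fun parseExpression(provider: TokenProvider): ProgramNode =
    when (provider.currentToken.type) {
        TokenType.Number -> NumberNode(provider.eatToken().value.toFloat())
        TokenType.Identifier -> IdentifierNode(provider.eatToken().value)
        else -> throw IllegalStateException("Cannot parse expression: ${provider.currentToken}")
    }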

Here parseExpression maps certain tokens to corresponding Nodes and consumes the token.

Now all tokens of our “waarde getal wordt 14; druk af getal;” program are consumed. To summarise, here is the list of parsed tokens:

- token(value=“waarde”, type=value)
- token(value=“ ”, type=whitespace)
- token(value=“getal”, type=identifier)
- token(value=“ ”, type=whitespace)
- token(value=“wordt”, type=assignment)
- token(value=“ ”, type=whitespace)
- token(value=“14”, type=number)
- token(value=“;”, type=EOL)
- token(value=“ ”, type=whitespace)
- token(value=“druk af”, type=print)
- token(value=“ ”, type=whitespace)
- token(value=“getal”, type=identifier)
- token(value=“;”, type=EOL)
- token(value=“EOP”, type=EOP)

These are transformed into two ProgramNodes: VariableAndAssignmentDeclaration(identifier=IdentifierNode(getal), expression=NumberNode(number=14)) and PrintStatement(expression=IdentifierNode(getal)).

Schematically the AST looks like this:
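Program
├── VariableAndAssignmentDeclaration
│   ├── IdentifierNode(getal)
│   └── NumberNode(14)
└── PrintStatement
    └── IdentifierNode(getal)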

In this simple example, VariableAndAssignmentDeclaration and PrintStatement are the only program nodes and they have only one layer of child nodes. For a more complicated program, the ‘tree’ is more apparent.

The Emitter

The emitter transforms the AST into a byte array: the module that a WebAssembly instance can understand. Next to the code section, this byte array contains other sections that make it a valid WebAssembly module. Although this blog is not about WebAssembly, we will list these sections for completeness:
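Reconstructing them from the final bytecode at the end of this post, our module uses:

- Type section (SectionCode 1): the function signatures
- Import section (SectionCode 2): the imported “env.print” function
- Function section (SectionCode 3): couples each function to its signature
- Export section (SectionCode 7): the exported “run” function
- Code section (SectionCode 10): the function bodies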

The module starts with the magic number header and the version. The other sections are identified by their SectionCode and appear once. You can read more about the module structure here.

The emitter creates all sections and concatenates them into a single byte array:
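A sketch of that concatenation (the per-section helpers that build each byte array are assumed and not shown here):

// The module starts with the magic number “\0asm” followed by version 1.
val magicModuleHeader = byteArrayOf(0x00, 0x61, 0x73, 0x6d)
val moduleVersion = byteArrayOf(0x01, 0x00, 0x00, 0x00)

// Concatenate the header, the version, and every section into one module.
fun emit(sections: List<ByteArray>): ByteArray =
    sections.fold(magicModuleHeader + moduleVersion) { acc, section -> acc + section }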

The section we are most interested in is the code section. Here we transform the AST to a byte array.

The code section consists of two parts: the locals and the code. The locals part declares the number of slots for variable values; the index of an identifier is the pointer to its value. In other words, if this number is equal to the number of variables the program uses, this is the most efficient use of memory. Variables must be declared before they can be used.

In our example, the identifier “getal” gets an index in the identifiers list and is thus declared as a pointer to the value 14. The identifiers list contains all variables used by the code. If this list is empty, the locals are not encoded (an empty byte array); otherwise the number of variables is emitted together with the ValueType.f32 byte.

It would be a waste of space to encode the full variable name in bytecode; therefore each variable is referenced by an index. The emitter builds an identifier list that holds all variables, and the index in this list is the reference to the variable value:
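A sketch of the identifier list and the locals encoding (0x7d is WebAssembly’s f32 value type; the names are ours):

val identifiers = mutableListOf<String>()

// Look up a variable’s index, registering the variable on first use.
fun indexOf(identifier: String): Int {
    if (identifier !in identifiers) identifiers += identifier
    return identifiers.indexOf(identifier)
}

const val f32: Byte = 0x7d // ValueType.f32

fun emitLocals(): ByteArray =
    if (identifiers.isEmpty()) byteArrayOf()               // no locals to declare
    else byteArrayOf(0x01, identifiers.size.toByte(), f32) // one group of f32 locals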

We have now laid the groundwork so the code can be emitted. This is done by traversing the AST, emitting the nodes and concatenating them into a single byte array:
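A sketch:

// Traverse the AST: emit every statement and concatenate the bytecode.
fun emitCode(ast: List<ProgramNode>): ByteArray =
    ast.fold(byteArrayOf()) { acc, node -> acc + emitStatement(node) }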

First, the emitter checks the kind of statement. Each statement translates into a specific sequence of WebAssembly instructions. For “waarde getal wordt 14;” and “druk af getal;” these functions are:
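Sketched here as one function with a branch per statement (0x10 is the call opcode; emitExpression follows below):

fun emitStatement(node: ProgramNode): ByteArray = when (node) {
    is VariableAndAssignmentDeclaration ->
        emitExpression(node.expression) +   // push the value, e.g. f32.const 14
            emitExpression(node.identifier) // set_local <index>
    is PrintStatement ->
        emitExpression(node.expression) +   // push the value to print
            byteArrayOf(0x10, 0x00)         // call the imported function at index 0
    else -> throw IllegalStateException("Cannot emit statement: $node")
}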

Note that the order is reversed: for the print statement, the value to print is emitted first, then a call instruction follows, and finally a reference to the function is emitted. In this case it is the imported function at index 0, that is, the binding to console.log. The reversed order is a direct consequence of WebAssembly being stack-based. When this statement is executed the top of the stack will look like this:
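top of stack → 14.0 (pushed by get_local, popped by the call as the argument to print)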

Since the signature of the called function specifies how many parameters it takes, WebAssembly knows how many values it needs to pop off the stack in order to execute the complete statement.

In both statements, an expression is emitted. The function we wrote for this is:
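A sketch of both branches, including the little-endian float encoding (0x43 is f32.const, 0x21 set_local, 0x20 get_local):

// The IEEE-754 bit pattern of the float, least significant byte first.
fun Float.toLittleEndianBytes(): ByteArray {
    val bits = toRawBits()
    return byteArrayOf(
        (bits and 0xFF).toByte(),
        ((bits shr 8) and 0xFF).toByte(),
        ((bits shr 16) and 0xFF).toByte(),
        ((bits shr 24) and 0xFF).toByte()
    )
}

fun emitExpression(node: ProgramNode): ByteArray = when (node) {
    is NumberNode -> byteArrayOf(0x43) + node.number.toLittleEndianBytes() // f32.const <value>
    is IdentifierNode ->
        if (node.set) byteArrayOf(0x21, indexOf(node.name).toByte())       // set_local <index>
        else byteArrayOf(0x20, indexOf(node.name).toByte())                // get_local <index>
    else -> throw IllegalStateException("Cannot emit expression: $node")
}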

An expression can be a NumberNode or an IdentifierNode. In the case of an IdentifierNode there are two branches, based on its set property. When set is true, the identifier is stored in the identifier list, yielding the index at which the new value is stored. When set is false, the identifier index is retrieved so the value can be read. For example, the statement “waarde getal wordt 14” needs set_local, while “druk af getal” needs get_local.

When our program is emitted the final bytecode looks like:

[0,97,115,109,1,0,0,0,1,8,2,96,1,125,0,96,0,0,2,13,1,3,101,110,118,5,112,114,105,110,116,0,0,3,2,1,1,7,7,1,3,114,117,110,0,1,10,17,1,15,1,1,125,67,0,0,96,65,33,0,32,0,16,0,11]

The bytecode can be fed into a WebAssembly instance and executed by calling the exported run function, printing “14” to the console.

What Did We Learn

As developers, we use these “black boxes” called compilers every day. We started this project out of curiosity to learn how they work. At a glance, a compiler can look quite intimidating. It looks like a complicated mechanism to transform source code into executable bytecode.

We started with Colin’s blog and his TypeScript compiler codebase. This gave us a high-level understanding, and we began with the tokenizer. Luckily this turned out to be easier than we thought: using a recursive approach we were able to tokenize our program in only a few lines of code.

The parser, too, turned out to be less complicated than we thought: it just expects tokens to be in a certain order and groups them into statements. The hardest part was handling the eatToken calls. Eating a token is a side effect that modifies a mutable property, the current token. We placed this code in one place, the TokenProvider, to keep it manageable and as clear as possible.

Is the emitter where the complexity lies? Indeed, the emitter took most of our time and was complicated, but not in terms of logic: it just transforms the program nodes into bytecode in a logical way.

The hard part was getting the encoding right. Since we are using Kotlin Native we could not use existing Java libraries to encode the different types, so we had to build some methods using bit shifting ourselves. In addition, debugging the output of the emitter functions was hard. Sometimes, for example, it emitted [0, 2, 13, 1, 3, 101] instead of [0, 3, 13, 1, 3, 102]. Good luck finding the problem! To us humans it is just an array of numbers, and it is hard to deduce the meaning of every single number.
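As an illustration of the kind of bit shifting involved, here is a sketch of unsigned LEB128, the variable-length integer encoding WebAssembly uses for sizes and indices (our version, not necessarily the original code):

// Seven bits per byte, least significant group first;
// the high bit signals that more bytes follow.
fun Int.toUnsignedLeb128(): ByteArray {
    var remaining = this
    val bytes = mutableListOf<Byte>()
    do {
        var byte = remaining and 0x7F
        remaining = remaining ushr 7
        if (remaining != 0) byte = byte or 0x80
        bytes += byte.toByte()
    } while (remaining != 0)
    return bytes.toByteArray()
}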

The three components (tokenizer, parser, emitter) show the power of defining the right abstractions. This is most apparent in the parser: it turns a complex process into understandable, digestible, and independent steps. Choosing the right abstraction applies not only to compilers, but to all software.

We were grateful that Colin had already gone through the WebAssembly specification. Even better, we could inspect his compiler code to see how he structured the sections and, more importantly, which opcodes are needed.

Looking at our codebase, we are happy to have chosen Kotlin Native. The standard library did not let us down. Unfortunately we could not use any Java libraries for encoding, but those functions were easily replicated.

Kotlin worked out really well. Many of the types, like tokens and nodes, are modelled as sealed classes, so processing happens in exhaustive ‘when’ clauses. With this approach the Kotlin compiler helped us handle all cases. As a result it is trivial to extend DASM with features: if you add one, the compiler tells you where to add additional code.
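A toy illustration of the pattern:

sealed class Expression
data class Number(val value: Float) : Expression()
data class Identifier(val name: String) : Expression()

// No `else` branch: when a new Expression subtype is added, this `when`
// no longer compiles until the new case is handled.
fun describe(expression: Expression): String = when (expression) {
    is Number -> "number ${expression.value}"
    is Identifier -> "identifier ${expression.name}"
}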

Extension functions help with creating concise, eloquent statements. By overloading the ‘+’ operator, for example, byte array concatenation is as fluent as possible.

Even though writing a compiler might not be one of your life goals, we encourage software developers to give it a try. A compiler is probably a very different piece of software from what you write on a day-to-day basis. It requires you to think in different data structures and abstractions. Taking this perspective will help you broaden your knowledge and gain new insights to use in your daily work.

References

Colin Eberhardt:

Blog: https://blog.scottlogic.com/2019/05/17/webassembly-compiler.html

Talk: https://www.infoq.com/presentations/webassembly-compiler
