Building the Super Tiny Compiler with Reason (part 1)
If you are here, I suppose you think Reason is great! Or at least you are curious to find out whether it is. In any case, if you are a (JavaScript) developer interested in this new language, this post is for you.
What is this about
We will migrate the-super-tiny-compiler (by James Kyle) to Reason. Thanks to James for creating such an inspiring project! Most of the content has been adapted from and heavily inspired by it, so I recommend checking it out before going through this post, or watching this video where he explains how it works.
There will be no need to create bindings, import libraries, or interact in any other way with the “outer world”. The goal is to get an overview of some of the tools available in the Reason toolbox, without getting too distracted by the meta-language (© Cheng Lou). In particular, we will discuss variants, records, and pattern matching a bit.
Finally, and thanks to how concise the Reason syntax is, the code produced will be quite compact, and the project will end up having even fewer lines than its JavaScript counterpart, so we could rename it to the-super-ultra-teeny-tiny-compiler. 🤖
Very short intro to compilers
Compilers are like sausage factories, with the difference that you know what goes into it. 🙈
The text input passed to a compiler –the program– goes through a series of steps where it gets processed, until we finally get whatever output we are expecting at the end: object code, bytecode, a translation into another high-level language…
These steps are generally defined around 3 stages:
- Parsing: comprised of lexical and syntactical analysis
- Transforming: there can be multiple transformation steps
- Code generation: the output can take multiple different forms
In our case, the compiler will take Lisp-like code and convert it to C. In this first part of the tutorial, we will go through the first stage of parsing: the lexical analyzer, or lexer (aka tokenizer). The lexer takes raw text and transforms it into tokens, which can then be classified into different categories: numbers, operators, strings, etc.
Variant types
Before we start building, let’s go through one of the many features Reason inherits from OCaml: variant types. When I started learning about Reason, variant types piqued my interest: they have some similarities with Flow union types. But it turns out there are some important differences between them, as we will see.
One example:
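A minimal sketch of what such a variant could look like (the constructor and variable names here are illustrative):

```reason
/* The constructors John, Paul, George and Ringo are defined
   as part of the variant itself */
type beatle =
  | John
  | Paul
  | George
  | Ringo;

/* No annotation needed: drummer is inferred to be of type beatle */
let drummer = Ringo;
```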
Variants allow us to express “this variable can have this or this other type”. The options themselves (`John`, `Paul`, …) are called constructors. A couple of notes:
- `John` is not defined beforehand, it is part of the variant itself (unlike Flow union types).
- There is no need for annotations: the compiler will conclude that `drummer` is of type `beatle`. Type inference ftw!
Let’s see another example:
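One possible example of constructors that carry data (a sketch; the type and names here are illustrative):

```reason
/* A constructor can carry data of any type, including the variant
   itself, which makes recursive structures like trees possible */
type tree =
  | Leaf string
  | Node tree tree;

let t = Node (Leaf "tiny") (Node (Leaf "compiler") (Leaf "!"));
```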
Constructors can exist just on their own, or also define the parameterized types they are allowed to “contain”. These types can be strings, functions, parametric types, records, or any other Reason type. They can even be the variant type itself, allowing us to define recursive data structures like trees (hint: we will make use of tree-shaped variants later, when building the parser and transformer).
This flexibility, combined with pattern matching, is one of the most powerful features of Reason.
Setting up the environment
First things first. If you don’t have a Reason environment ready, go to the site and follow the Quick Start guide. Make sure that after you create an app with:
bsb -init the-super-tiny-compiler -theme basic-reason
it can be compiled with
npm run start
If you find problems while installing, feel free to reach out in the Discord channel, the community is amazing and very welcoming.
Getting started
As mentioned above, our tiny compiler transforms Lisp code into C code. So if we feed it something like:
(add 2 (subtract 4 2))
it will give back
add(2, subtract(4, 2))
Our first function, `lexer`, will have a signature like:
/* string => list token */
let lexer input => {...};
And we will be calling it in Reason like:
lexer "(add 2 (subtract 4 2))"
These are the types of tokens that will be detected:
- Numbers
- Strings
- Names: like the function names `add` or `subtract` above
- Open parenthesis
- Close parenthesis
Now for a slightly more complex task: how do we build a system that “knows” what to parse next? To design the lexer, we are going to take advantage of a mathematical model that is slightly more advanced than state machines: pushdown automata. If you have any experience with state machines, think of pushdown automata as state machines plus a companion stack. If you don’t have any experience with state machines, it doesn’t matter; the image below will give you an idea of how they work.
Here is what a pushdown automaton implementing the lexer behavior could look like. The transitions between states depend on the value of the element that sits at the top of the stack:
We can also draw it as a table:
We are leaving out multiple other potential cases, but for the purposes of our super tiny compiler, it is enough.
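Reconstructed from the lexer's pattern-match cases further below, the transitions look roughly like this (one possible rendering; the original table may differ):

```
current state | input         | action                      | next state
--------------+---------------+-----------------------------+-----------
None          | '('           | output OpenParen            | None
None          | ')'           | output CloseParen           | None
None          | whitespace    | (skip)                      | None
None          | '"'           | start a string              | String ""
None          | '0'..'9'      | start a number              | Number
None          | 'a'..'z'      | start a name                | Name
String c      | '"'           | output String c             | None
String c      | anything else | append char to c            | String c
Number c      | '0'..'9'      | append char to c            | Number c
Number c      | ' '           | output Number c             | None
Number c      | ')'           | output Number c, CloseParen | None
Name c        | 'a'..'z'      | append char to c            | Name c
Name c        | ' '           | output Name c               | None
Name c        | ')'           | output Name c, CloseParen   | None
```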
Types
Now we can finally start writing some Reason code! First, the `token` type that will be used by the `lexer`:
Variants are an excellent match to model tokens, as we can optionally include a type to store data for the tokens that require it.
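Based on how the tokens are used later in the lexer, the `token` type looks like this (a sketch; the payloads are plain strings):

```reason
type token =
  | OpenParen
  | CloseParen
  | Number string
  | String string
  | Name string;
```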
Recursive functions and tail calls
Before writing the `lexer` function, we need one more thing: a function that takes a `string` in and splits it into a list of `char`s. Characters can be pattern matched against ranges –like `'0'..'9'` or `'a'..'z'`– which will come in handy later on for the lexer:
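A tail-recursive version matching the description below (a sketch; the original implementation may differ slightly):

```reason
/* Turns a string into a list of chars, e.g. "abc" => ['a', 'b', 'c'] */
let explode s => {
  /* exp walks the string backwards, prepending each char to the
     accumulator l; the recursive call is in tail position */
  let rec exp i l => i < 0 ? l : exp (i - 1) [s.[i], ...l];
  exp (String.length s - 1) []
};
```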
There is something clearly different from JavaScript functions: Reason and OCaml are expression-oriented languages, so the return value of a function is the value of its last expression. In other words, there is no need to explicitly use `return`.
The pattern used in `explode` is very common in Reason: adding an internal recursive function that takes care of most of the work. In this case, the recursive function `exp` takes a couple of extra parameters, `i` and `l`, to perform the calculations. Adding the accumulator `l` allows us to implement the recursive branch with a tail call (i.e. the recursive call is the last operation performed, with nothing else pending). Tail calls can be optimized by the BuckleScript compiler, which is essential to avoid stack overflow errors when working with large data sets. If you are curious, here are the two versions compared –with and without the tail call–.
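For illustration, a version without the tail call could look like this (a sketch, not the original comparison code):

```reason
/* Non-tail-recursive sketch: the cons [s.[i], ...] happens after
   the recursive call returns, so every pending call keeps a stack
   frame alive until the end of the string is reached */
let explodeNoTail s => {
  let rec exp i =>
    i >= String.length s ? [] : [s.[i], ...exp (i + 1)];
  exp 0
};
```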
The lexer
We are ready now to build the lexer. We will use a combination of everything we have seen above:
- Variant types for the tokens
- Pattern matching to use against the tokens and the current character being analyzed
- Recursive functions with tail call
Let’s take a closer look:
let tokenizer input => {
This is the main function. It receives a string, and will output a `list` of elements of the `token` type we defined above.
let rec tok input current tokens =>
The internal recursive function. It follows the same pattern we saw with `explode` above. It will receive:
- `input`: a list of `char`s containing the remaining data to process.
- `current`: a param of type `option token` (read more about option types) with the current type of token being processed. It will be `None` initially, and also right after finishing processing a specific token.
- `tokens`: the accumulator with the list of tokens that will be sent as output.
switch input {
| [] => List.rev tokens
In case there is no more input to process, we return the accumulated list `tokens`. But we have to reverse it first, as we are adding new tokens to the head of the list (Reason lists and the spread operator are built for prepending).
| _ =>
let head = List.hd input;
let tail = List.tl input;
let next = tok tail;
This is the second case of the `switch` statement, which does all the heavy lifting. We prepare all the data for the inner switch that we will see below: `head` is the char being processed, `tail` the remainder, and `next` is a partial application of `tok`, since we will always need to pass `tail` to it.
switch (head, current, tokens) {
/* State: None */
| ('(', None, t) => next None [OpenParen, ...t]
| (')', None, t) => next None [CloseParen, ...t]
| (' ' | '\t' | '\r' | '\n', None, t) => next None t
| ('"', None, t) => next (Some (String "")) t
| ('0'..'9' as i, None, t) => next (Some (Number (String.make 1 i))) t
| ('a'..'z' as i, None, t) => next (Some (Name (String.make 1 i))) t
We create a tuple `(head, current, tokens)` and start matching patterns against it.
First we cover all cases that can happen when `current` is `None`, and transition to a new state in case we find a double quote (`String`), a digit from 0 to 9 (`Number`), or a lowercase letter (`Name`).
On the right side, you can see the token being added to `tokens` in the case of an `OpenParen` or a `CloseParen`, while the other cases don’t add anything to the output.
/* State: String */
| ('"', Some (String c), t) => next None [String c, ...t]
| (i, Some (String c), t) => next (Some (String (c ^ String.make 1 i))) t
The case for strings is easy: if `current` is in the `String` state and we receive a double quote, we store the processed string (`c`) in the output and go back to the `None` state. If we receive any other character, we just accumulate it on `current`.
/* State: Number */
| ('0'..'9' as i, Some (Number c), t) => next (Some (Number (c ^ String.make 1 i))) t
| (')', Some (Number c), t) => next None [CloseParen, Number c, ...t]
| (' ', Some (Number c), t) => next None [Number c, ...t]

/* State: Name */
| ('a'..'z' as i, Some (Name c), t) => next (Some (Name (c ^ String.make 1 i))) t
| (')', Some (Name c), t) => next None [CloseParen, Name c, ...t]
| (' ', Some (Name c), t) => next None [Name c, ...t]
Here are the remaining cases, for `Number` and `Name`. They are very similar to `String`, but in these cases we also consider the possibility that a closing parenthesis brings the state back to `None`.
Note also the usage of character ranges (like `'0'..'9'`). This expressive syntax allows us to cover many cases with very little code.
tok (explode input) None []
Finally, the initial call to `tok`, passing the original input as a list of `char`s after processing the string with `explode`, and setting `current` and `tokens` to their default values (`None` and `[]`).
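Assembling the snippets above, the complete function reads as follows (the final catch-all case is an addition here, so that unhandled characters fail loudly; the original may handle them differently):

```reason
let tokenizer input => {
  let rec tok input current tokens =>
    switch input {
    | [] => List.rev tokens
    | _ =>
      let head = List.hd input;
      let tail = List.tl input;
      let next = tok tail;
      switch (head, current, tokens) {
      /* State: None */
      | ('(', None, t) => next None [OpenParen, ...t]
      | (')', None, t) => next None [CloseParen, ...t]
      | (' ' | '\t' | '\r' | '\n', None, t) => next None t
      | ('"', None, t) => next (Some (String "")) t
      | ('0'..'9' as i, None, t) => next (Some (Number (String.make 1 i))) t
      | ('a'..'z' as i, None, t) => next (Some (Name (String.make 1 i))) t
      /* State: String */
      | ('"', Some (String c), t) => next None [String c, ...t]
      | (i, Some (String c), t) => next (Some (String (c ^ String.make 1 i))) t
      /* State: Number */
      | ('0'..'9' as i, Some (Number c), t) => next (Some (Number (c ^ String.make 1 i))) t
      | (')', Some (Number c), t) => next None [CloseParen, Number c, ...t]
      | (' ', Some (Number c), t) => next None [Number c, ...t]
      /* State: Name */
      | ('a'..'z' as i, Some (Name c), t) => next (Some (Name (c ^ String.make 1 i))) t
      | (')', Some (Name c), t) => next None [CloseParen, Name c, ...t]
      | (' ', Some (Name c), t) => next None [Name c, ...t]
      /* Catch-all added for illustration, not part of the original snippets */
      | _ => failwith "Unexpected character"
      }
    };
  tok (explode input) None []
};
```

Calling `tokenizer "(add 2 (subtract 4 2))"` yields `[OpenParen, Name "add", Number "2", OpenParen, Name "subtract", Number "4", Number "2", CloseParen, CloseParen]`.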
Exhaustiveness checking
If you start tinkering with the `tokenizer` function above, adding and removing cases from the `switch` statement, you will see how quickly the compiler starts warning you about cases that have not been covered. One example:
You forgot to handle a possible value here, for example:('A', None, _)
These exhaustiveness checks are really helpful while writing new functions, as the compiler works with you, pointing out which cases you have to cover next in order to create an exhaustive `switch` statement. This exhaustiveness makes bugs and unexpected behaviors far less likely once the code goes live.
The end
That’s it! You created a lexer in less than 60 lines of code. You can see the code for this part of the tutorial, and add comments, in this pull request. There will be a second part covering the rest of the compiler: the parser, the transformer, and the code generator. Stay tuned!
Thanks for reading, if you have any questions, suggestions or want to discuss other topics you would like to see covered, let me know in the comments or on Twitter. ✌️
Thanks to Dustan Kasten and iclanzan for reviewing this article and the companion code.