Building the Super Tiny Compiler with Reason (part 1)

Javier Chávarri
9 min read · Oct 4, 2017

If you are here, I suppose you think Reason is great! Or at least that you are curious to know if Reason is great. In any case, if you are a (JavaScript) developer interested in this new language, this post is for you.

What is this about

We will migrate the-super-tiny-compiler (by James Kyle) to Reason. Thanks to James for creating such an inspiring project! Most of the content has been adapted from it and is heavily inspired by it, so I recommend checking it out before going through this post, or watching this video where he explains how it works.

There will be no need to create bindings, import libraries, or interact in any other way with the “outer world”. The goal is to get an overview of some of the tools available in the Reason toolbox, while not getting too distracted by the meta-language (a topic Cheng Lou has covered in his talk “Taming the Meta Language”). In particular, we will discuss a bit about variants, records, and pattern matching.

Finally, and thanks to how concise the Reason syntax is, the code produced will be quite compact, and the project will end up having even fewer lines than the JavaScript counterpart, so we could rename it to the-super-ultra-teeny-tiny-compiler. 🤖

Very short intro to compilers

Compilers are like sausage factories, with the difference that you know what goes into them. 🙈

The text input passed to a compiler –the program– goes through a series of processing steps that finally produce whatever output we expect at the end: object code, bytecode, a translation into another high-level language…

These steps are generally defined around 3 stages:

  • Parsing: comprised of lexical and syntactical analysis
  • Transforming: there can be multiple transformation steps
  • Code generation: can output code in multiple different forms

Overview of the tiny Lisp-to-C compiler

In our case, the compiler will take Lisp-like code and convert it to C. In this first part of the tutorial, we will go through the first step of the parsing stage: the lexical analyzer or lexer (aka tokenizer). The lexer takes raw text and transforms it into tokens, which are already classified into different categories: numbers, operators, strings, etc.

Variant types

Before we start building, let’s go through one of the many features that Reason inherits from OCaml: variant types. When I started learning about Reason, variant types piqued my interest because they had some similarities with Flow union types. But it turns out there are some important differences between them, as we will see.

One example:
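
A minimal sketch of what it could look like, using the names mentioned in the notes below:

type beatle =
  | John
  | Paul
  | George
  | Ringo;

let drummer = Ringo;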

Variants allow us to express “this variable can have this or this other type”. The options themselves (John, Paul, …) are called constructors. A couple of notes:

  • John is not defined beforehand, it is part of the variant itself (unlike Flow union types).
  • There is no need for annotations: the compiler will infer that drummer is of type beatle. Type inference ftw!

Let’s see another example:
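
A sketch of constructors carrying data (the account and tree types here are illustrative, not taken from the original snippet):

type account =
  | Anonymous
  | Email string
  | Facebook string int;

/* a type can refer to itself, giving us a recursive data structure */
type tree =
  | Leaf int
  | Node tree tree;

let myTree = Node (Leaf 1) (Node (Leaf 2) (Leaf 3));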

Constructors can exist just on their own, or declare the types of the data they are allowed to “contain”. These types can be strings, functions, parametric types, records, or any other Reason type. They can even be the variant type itself, allowing us to define recursive data structures like trees (hint: we will make use of tree-shaped variants later, when building the parser and transformer).

This flexibility, combined with pattern matching, is one of the most powerful features of Reason.
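
As a small taste of that combination, here is an illustrative sketch of pattern matching over the tree type above:

let rec sum tree =>
  switch tree {
  | Leaf value => value
  | Node left right => sum left + sum right
  };

let total = sum myTree; /* 6 */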

Setting up the environment

First things first. If you don’t have a Reason environment ready, go to the site and follow the Quick Start guide. Make sure that after you create an app with:

bsb -init the-super-tiny-compiler -theme basic-reason

it can be compiled with

npm run start

If you find problems while installing, feel free to reach out in the Discord channel; the community is amazing and very welcoming.

Getting started

As mentioned above, our tiny compiler transforms Lisp code into C code. So if we feed it something like:

(add 2 (subtract 4 2))

it will give back

add(2, subtract(4, 2))

Our first function, lexer, will have a signature like:

/* string => list token */
let lexer input => {...};

And we will call it from Reason like this:

lexer "(add 2 (subtract 4 2))"

These will be the types of tokens that will be detected:

  • Numbers
  • Strings
  • Names: like the function names add or subtract above
  • Open parenthesis
  • Close parenthesis
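
For the sample input above, the lexer output would be a list looking roughly like this (using the token type we will define in a moment):

[OpenParen, Name "add", Number "2", OpenParen, Name "subtract",
 Number "4", Number "2", CloseParen, CloseParen]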

Now for a slightly more complex task: how do we build a system that “knows” what to parse next? To design the lexer, we are going to take advantage of a mathematical model that is slightly more advanced than state machines: pushdown automata. If you have any experience with state machines, think of pushdown automata as state machines plus a companion stack. If you don’t have any experience with state machines, it doesn’t matter; the image below will give you an idea of how they work.

Here is what a pushdown automaton that implements the lexer behavior could look like. The transitions between states depend on the value of the element that sits at the top of the stack:

The lexer, modeled as a pushdown automaton

We can also draw it as a table:
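
Roughly, and reconstructing it from the switch cases we will see later, the transitions are:

  • None: ( and ) emit OpenParen and CloseParen; whitespace is skipped; " moves to the String state; '0'..'9' moves to Number; 'a'..'z' moves to Name
  • String: " emits the accumulated String token and goes back to None; any other character is appended to the current string
  • Number: '0'..'9' is appended; a space emits the Number token; ) emits the Number token plus a CloseParen; both go back to None
  • Name: 'a'..'z' is appended; a space emits the Name token; ) emits the Name token plus a CloseParen; both go back to None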

We are leaving out multiple other potential cases, but for the purposes of our super tiny compiler, it is enough.

Types

Now we can finally start writing some Reason code! First, the token type that will be used by the lexer:
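
A sketch of how that type could look, judging from the constructors used later in the lexer:

type token =
  | OpenParen
  | CloseParen
  | Number string
  | String string
  | Name string;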

Variants are an excellent fit for modeling tokens, as we can optionally attach data to the constructors that require it.

Recursive functions and tail calls

Before writing the lexer function, we need one more thing: a function that takes a String and splits it into a list of Chars. Characters can be pattern matched against ranges –like '0'..'9' or 'a'..'z'– which will come in handy later on for the lexer:
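
A possible implementation (a sketch): walk the string from the end, prepending each char to an accumulator:

let explode input => {
  let rec exp i l =>
    if (i < 0) {
      l
    } else {
      exp (i - 1) [String.get input i, ...l]
    };
  exp (String.length input - 1) [];
};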

There is something clearly different from JavaScript functions: Reason and OCaml are expression-oriented languages, so the value of a block or a function is the value of its last expression. In other words, there is no need to explicitly use return.

The pattern used in explode is very common in Reason: adding an internal recursive function that takes care of most of the work. In this case, the recursive function exp takes a couple of parameters, i and l, to perform the calculations. Adding the accumulator l allows the recursive branch to be implemented with a tail call (i.e. just the recursive call is made and no other operations are performed afterwards). Tail calls can be optimized by the BuckleScript compiler, which is essential to avoid stack overflow errors when working with large data sets. If you are curious, here are the two versions compared –with and without the tail call–.
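
For illustration, a non-tail-call version could look like the sketch below: the cons happens after the recursive call returns, so every call keeps its stack frame alive until the whole list is built.

let explodeNaive input => {
  let len = String.length input;
  let rec exp i =>
    if (i >= len) {
      []
    } else {
      let rest = exp (i + 1);
      [String.get input i, ...rest]
    };
  exp 0;
};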

The lexer

We are ready now to build the lexer. We will use a combination of everything we have seen above:

  • Variant types for the tokens
  • Pattern matching to use against the tokens and the current character being analyzed
  • Recursive functions with tail call

Let’s take a closer look:

let tokenizer input => {

This is the main function. It receives a string, and will output a list of elements of the type token we defined above.

let rec tok input current tokens =>

The internal recursive function. It follows the same pattern as we saw with explode above. It will receive:

  • input: a list of chars containing the remaining data to process.
  • current: a param of type option token (read more about option types) with the current type of token being processed. It will be None initially or after finishing processing a specific token.
  • tokens: the accumulator with the list of tokens that will be sent as output

switch input {
| [] => List.rev tokens

When there is no more input left to process, we return the accumulated list tokens. But we have to reverse it first, as we are adding new tokens to the head of the list (Reason lists and the spread operator are built for prepending).

| _ =>
let head = List.hd input;
let tail = List.tl input;
let next = tok tail;

This is the second case of the switch expression, which does all the heavy lifting. We prepare all the data for the inner switch that we will see below. head is the char being processed, tail is the remainder, and next is a partial application of tok, because we will always need to pass tail to it.
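
(A quick illustrative aside on partial application: supplying only some of a function’s arguments returns a new function that waits for the rest.)

let add x y => x + y;
let addTwo = add 2;   /* a function of type int => int */
let five = addTwo 3;  /* 5 */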

switch (head, current, tokens) {
/* State: None */
| ('(', None, t) => next None [OpenParen, ...t]
| (')', None, t) => next None [CloseParen, ...t]
| (' ' | '\t' | '\r' | '\n', None, t) => next None t
| ('"', None, t) => next (Some (String "")) t
| ('0'..'9' as i, None, t) => next (Some (Number (String.make 1 i))) t
| ('a'..'z' as i, None, t) => next (Some (Name (String.make 1 i))) t

We create a tuple (head, current, tokens) and start matching patterns against it.

First we cover all the cases that can happen when current is None, and transition to a new state when we find a double quote (String), a digit from 0 to 9 (Number), or a lowercase letter (Name).

On the right side, you can see the token being added to tokens in the case of an OpenParen or a CloseParen, while other cases don’t add anything to the output.

/* State: String */
| ('"', Some (String c), t) => next None [String c, ...t]
| (i, Some (String c), t) => next (Some (String (c ^ String.make 1 i))) t

The case for strings is easy: if current is in the String state and we receive a double quote, we store the processed string (c) in the output and go back to the None state. If we receive any other character, we just accumulate it on current.

/* State: Number */
| ('0'..'9' as i, Some (Number c), t) => next (Some (Number (c ^ String.make 1 i))) t
| (')', Some (Number c), t) => next None [CloseParen, Number c, ...t]
| (' ', Some (Number c), t) => next None [Number c, ...t]
/* State: Name */
| ('a'..'z' as i, Some (Name c), t) => next (Some (Name (c ^ String.make 1 i))) t
| (')', Some (Name c), t) => next None [CloseParen, Name c, ...t]
| (' ', Some (Name c), t) => next None [Name c, ...t]

Here are the remaining cases, for Number and Name. They are very similar to String, but here we also handle the case where a closing parenthesis brings the state back to None.

Note also the usage of character ranges (like '0'..'9'), an expressive syntax that allows us to cover a whole group of characters with a single pattern.

tok (explode input) None []

Finally, the initial call to tok, passing the original input as a list of chars after processing the string with explode, and setting current and tokens to their default values (None and []).

Exhaustiveness checking

If you start tinkering with the tokenizer function above and add or remove cases from the switch expression, you will see how quickly the compiler starts warning you about cases that have not been covered. One example:

You forgot to handle a possible value here, for example: ('A', None, _)

These exhaustiveness checks are really helpful while writing new functions, as the compiler will work with you, pointing out which cases you still have to cover in order to make your switch expressions exhaustive. This exhaustiveness makes bugs and unexpected behaviors way less likely when the code goes live.
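
To see it in action, here is a small sketch using the beatle type from earlier: leaving one constructor out makes the compiler warn that the switch is not exhaustive.

let role beatle =>
  switch beatle {
  | John => "rhythm guitar"
  | Paul => "bass"
  | George => "lead guitar"
  /* Ringo is not handled, so the compiler warns about this switch */
  };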

The end

That’s it! You created a lexer in less than 60 lines of code. You can see the code for this part of the tutorial, and add comments, in this pull request. There will be a second part covering the rest of the compiler: the parser, the transformer and the code generator. Stay tuned!

Thanks for reading, if you have any questions, suggestions or want to discuss other topics you would like to see covered, let me know in the comments or on Twitter. ✌️

Thanks to Dustan Kasten and iclanzan for reviewing this article and the companion code.
