Writing a compiler that compiles itself

A couple of years ago I wrote a simple compiler with the sole purpose of being able to compile itself. Not particularly useful, but fun and educational. I wrote the initial version in javascript for quick prototyping, using jison for lexing and parsing, and recast for generating javascript code (resulting in pretty readable code as a nice side effect).

Meeting "Stop"

First things first. Since Google took the name “go” I decided to call my toy language for “stop”. Since it was meant for compiling itself, I wanted it to have the two most useful constructs when it comes to compiler writing: algebraic datatypes and pattern matching … or at least something close enough. For simplicity I implemented everything as expressions, so parsing an entire program would be the same as parsing a single expression. Instead of algebraic datatypes I ended up with record types that are easy to define with a type keyword having all the record fields as arguments.

Person = type(name, age);
Animal = type(owner);

I managed to implement some simple pattern matching over all the types, including the record types, allowing me to write things like

match x of
Person(name=’foo’, age=42) -> “Hurray”
| Animal(owner=Person(name="bar")) -> “Woohoo”
| _ = “Meh”
end;

Generating code for pattern matching usually involves goto-statements (yes, I wrote the g-word), but since javascript is laking these, I found a clever alternative. It turns out that you can jump to the end of a labelled code block using named break statements (which is enough for my use case). This allows for using nested if-statements like the following so you avoid evaluating expressions multiple times:

label: {
var e = the_current_expression;

if (e instanceof IdPattern) {
var v = e.value;
if (v === “true”) {
// Handle this case
break label;
}
}
  // ...
}

Leaky abstractions to the rescue

With all this in place I decided to make the switch and rewrite the compiler from javascript to stop. First hurdle was how to do lexical analysis and parsing. Since I was missing a parser generator (and didn’t really feel like writing my own), I took the easy road and went for using jison here as well. Since my little toy language is basically an abstraction of javascript everything defined by javascript leaks through to my program, and since I never got around to implement a type checker, there was nothing to stop me from utilizing this. This means that even though the require function was never defined by stop, I could still call it to import the parser I got from jison. I only needed to carefully ensure that calling functions in my language was compatible with calling javascript functions, and also that the usage of datatypes defined in my language was usable in the jison parser. In addition to require I was also able to reuse the entire standard library of javascript, meaning that I got access to e.g. an array implementation for free (that became pretty handy).

What did I learn?

I think that writing a compiler that is able to compile itself is pretty awesome, so I can only recommend it if you have the time (and courage). If I was to do it again, there is definitely some things that I would do differently.

  • Make sure to have a language agnostic test suite covering all the weird edge cases as well as all the obvious things that shouldn’t start breaking when playing around with things.
  • Using a tool like jison is nice for prototyping, but before a compiler really compiles itself, it should also handle lexing and parsing by itself, and here it would be easier with a handwritten lexer and recursive descent parser (plus you avoid the hurdle of figuring out why you get shift/reduce conflicts all the time).
  • Type checking and tooling is not something that should be postponed until after making the switch (I really missed better error messages at compile time).

Go take a look at the source, the generated javascript, and the examples if you haven’t already.