Compiler Democratisation In Practice

Published in Sempiler, Apr 2, 2019

One of the key tenets of the Sempiler project is democratising compilation.

Democratisation is a bit of a buzzword, one I admittedly stole from Unity’s slogan ‘democratising development’.

In this context it means providing compiler-inspired infrastructure that can easily be extended, customised and plugged into, whatever the requirements of a given project.

Below are some examples of how the compiler is designed to support this vision.

Programmable Pipeline

Sempiler is a misnomer when it comes to the actual compiler pipeline.

If you’ve visited the website you’ll know that the underlying goal of Sempiler is to alleviate the syntax brain tax — namely, allowing the author to express ideas in a familiar syntax, and have that code semantically transpiled to functionally equivalent code in a different syntax.

However, the pipeline itself — parsing, transformation, emission, consumption — is agnostic of this goal.

Sempilation describes a transpiler, but a pipeline set up to emit something lower level (or an intermediate program representation) could reasonably be considered a compiler.

Moreover, as you start to introduce a sequence of code transformations, code generation and other intermediate plugins you essentially create different tools. The only commonality between them is the underlying pipeline that powers them.
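To make the four stages concrete, here is a minimal TypeScript sketch of such a pipeline. The interface names mirror the plugin interfaces named in this post, but the method names, the toy AST shape and the runPipeline helper are assumptions made for illustration, not Sempiler's actual API.

```typescript
// Hypothetical shapes: a toy AST and one interface per pipeline stage.
interface AstNode { kind: string; text: string; }

interface IParser { parse(source: string): AstNode[]; }
interface ITransformer { transform(ast: AstNode[]): AstNode[]; }
interface IEmitter { emit(ast: AstNode[]): string; }
interface IConsumer { consume(output: string): void; }

interface Pipeline {
  parser: IParser;
  transformers: ITransformer[];
  emitter: IEmitter;
  consumers: IConsumer[];
}

// Runs parse -> transform(s) -> emit -> consume(s) and returns the emitted text.
function runPipeline(pipeline: Pipeline, source: string): string {
  let ast = pipeline.parser.parse(source);
  for (const transformer of pipeline.transformers) {
    ast = transformer.transform(ast);
  }
  const output = pipeline.emitter.emit(ast);
  for (const consumer of pipeline.consumers) {
    consumer.consume(output);
  }
  return output;
}

// Toy plugins: split into words, upper-case them, join them back together.
const wordParser: IParser = {
  parse: (source) =>
    source.split(/\s+/).filter((w) => w.length > 0).map((w) => ({ kind: "word", text: w })),
};
const upperCase: ITransformer = {
  transform: (ast) => ast.map((n) => ({ kind: n.kind, text: n.text.toUpperCase() })),
};
const joinEmitter: IEmitter = {
  emit: (ast) => ast.map((n) => n.text).join(" "),
};
const consumedOutputs: string[] = [];
const collector: IConsumer = { consume: (output) => { consumedOutputs.push(output); } };

const emittedText = runPipeline(
  { parser: wordParser, transformers: [upperCase], emitter: joinEmitter, consumers: [collector] },
  "hello  pipeline",
);
```

Swapping any one of the four plugins changes what the tool is, while runPipeline stays the same. That is the sense in which the pipeline is the only commonality.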

And whilst this pipeline is currently configured via a config file, I’m in the midst of implementing an API, directives and compile time execution to control the pipeline — with configuration information expressed inline in the source files themselves.

Bootstrapping

The codebase for the compiler is written in C#, and although it’s extensible, naïvely writing a plugin requires:

Creating a new .csproj or .sln (dotnet new classlib -n FooEmitter)
Providing a class declaration that implements the relevant functional interface (IParser, ITransformer, IEmitter, IConsumer)
Compiling the project to get a DLL
Writing the path to that DLL in the relevant section of the Sempiler config

When the compiler sees a DLL path it will attempt to load the assembly, and extract the class from it with the same name as the DLL file name (a limitation for now!).
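The naming convention can be illustrated with a small sketch. The helper name classNameFromDllPath and the example paths are hypothetical. The real lookup happens inside the C# core when it loads the assembly.

```typescript
// Assumption: the class to extract is named after the DLL file, minus its extension.
function classNameFromDllPath(dllPath: string): string {
  // Take the last path segment, accepting both / and \ separators.
  const fileName = dllPath.split(/[\\/]/).pop() ?? "";
  // Strip a trailing ".dll" (case-insensitively), if present.
  return fileName.replace(/\.dll$/i, "");
}

const emitterClassName = classNameFromDllPath("plugins/bin/FooEmitter.dll");
```

Because the class name is derived from the file name alone, one DLL can only ever expose one plugin class under this scheme.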

This process works but is not conducive to an easy-to-extend compiler:

It’s laborious and involved
It forces the author to write their contribution in C#
It means we need a separate DLL for every plugin

So with the scene set we have plenty of motivation to devise a better way.

In the context of compilers, the term bootstrapping describes a situation where much of the source code (beyond the initial core) is written in the same language that will be compiled.

We can apply this idea to Sempiler by considering that although the core engine is written in C#, beyond that we should be able to write plugins in any language, as long as the language used is one we can parse.

And whether the language is one we can parse simply depends on whether we have an IParser implementation for it (and if we don’t, you can write one!).

Once the code has been parsed, we can simply emit it to C# and use that as the basis for creating a DLL.

So we’re part way to resolving the constraint on implementation language, but behind-the-scenes Sempiler still needs to be fed a DLL — so now the question becomes, how is that generated?

Scripted Plugins

As mentioned above, Sempiler does not need to care what language a plugin was originally authored in, but it does care that every plugin is provided as a DLL.

We use the term scripted plugin to describe a plugin that has been written as a script in another syntax (eg. TypeScript), and then sempiled to C#.

We have a choice whether to load these scripted plugins lazily or eagerly.

Eagerly would mean that Sempiler loads/creates the DLLs as soon as it detects a path to a scripted plugin.

̶T̶h̶i̶s̶ ̶i̶s̶ ̶d̶e̶t̶r̶i̶m̶e̶n̶t̶a̶l̶ ̶t̶o̶ ̶f̶a̶s̶t̶ ̶f̶e̶e̶d̶b̶a̶c̶k̶ ̶i̶f̶ ̶c̶o̶m̶p̶i̶l̶a̶t̶i̶o̶n̶ ̶f̶a̶i̶l̶s̶ ̶b̶e̶f̶o̶r̶e̶ ̶t̶h̶a̶t̶ ̶p̶l̶u̶g̶i̶n̶ ̶i̶s̶ ̶h̶i̶t̶ ̶i̶n̶ ̶t̶h̶e̶ ̶p̶i̶p̶e̶l̶i̶n̶e̶.̶ ̶T̶h̶e̶ ̶t̶i̶m̶e̶ ̶i̶n̶v̶e̶s̶t̶e̶d̶ ̶i̶n̶ ̶p̶a̶r̶s̶i̶n̶g̶ ̶a̶n̶d̶ ̶p̶a̶c̶k̶a̶g̶i̶n̶g̶ ̶(̶P̶&̶P̶!̶)̶ ̶t̶h̶e̶ ̶p̶l̶u̶g̶i̶n̶ ̶h̶a̶s̶ ̶a̶r̶g̶u̶a̶b̶l̶y̶ ̶b̶e̶e̶n̶ ̶f̶o̶r̶ ̶z̶e̶r̶o̶ ̶v̶a̶l̶u̶e̶ ̶i̶n̶ ̶t̶h̶e̶ ̶c̶o̶n̶t̶e̶x̶t̶ ̶o̶f̶ ̶t̶h̶e̶ ̶c̶u̶r̶r̶e̶n̶t̶ ̶r̶u̶n̶ ̶(̶a̶l̶b̶e̶i̶t̶ ̶c̶a̶c̶h̶e̶a̶b̶l̶e̶ ̶f̶o̶r̶ ̶t̶h̶e̶ ̶n̶e̶x̶t̶ ̶t̶i̶m̶e̶)̶.̶

̶B̶u̶t̶ ̶w̶i̶t̶h̶ ̶l̶a̶z̶i̶l̶y̶ ̶l̶o̶a̶d̶i̶n̶g̶ ̶p̶l̶u̶g̶i̶n̶s̶ ̶w̶e̶ ̶o̶n̶l̶y̶ ̶p̶a̶y̶ ̶t̶h̶e̶ ̶c̶o̶s̶t̶ ̶f̶o̶r̶ ̶t̶h̶e̶ ̶P̶&̶P̶ ̶w̶h̶e̶n̶ ̶w̶e̶ ̶n̶e̶e̶d̶ ̶t̶o̶ ̶a̶c̶t̶u̶a̶l̶l̶y̶ ̶u̶s̶e̶ ̶t̶h̶e̶ ̶p̶l̶u̶g̶i̶n̶ ̶i̶n̶ ̶t̶h̶e̶ ̶p̶i̶p̶e̶l̶i̶n̶e̶ ̶-̶ ̶a̶n̶d̶ ̶t̶h̶e̶ ̶r̶e̶s̶u̶l̶t̶ ̶(̶g̶e̶n̶e̶r̶a̶t̶e̶d̶ ̶D̶L̶L̶)̶ ̶i̶s̶ ̶e̶q̶u̶a̶l̶l̶y̶ ̶c̶a̶c̶h̶e̶a̶b̶l̶e̶.̶

Eagerly loading a plugin is beneficial because the plugin code should be treated like application code in terms of importance.

You want to know about any problems with any of the code you’re reliant on as soon as possible, even if the problematic code is in a plugin that was not reached (ie. because of errors at a previous pipeline step).

To load a plugin we need to wrap the path to the plugin script inside a ScriptedParser, ScriptedTransformer, ScriptedEmitter or ScriptedConsumer instance.

When called upon it will:

Check whether a valid cached DLL exists for the plugin path, and attempt to load it if so
Fall back to performing the P&P (described below), caching the resultant DLL for next time
Act as a proxy to the generated plugin DLL, delegating to it and propagating the result from it back to the pipeline
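The three responsibilities above can be sketched as a proxy class. This is a simplified illustration, not the real implementation: the constructor shape, the in-memory PluginCache and the Packager callback are all assumptions, and cache validation is left out.

```typescript
// Hypothetical plugin shape; Sempiler's real ITransformer is a C# interface.
interface ITransformer { transform(input: string): string; }

// Stand-in for the DLL cache, keyed by plugin script path.
interface PluginCache { [scriptPath: string]: ITransformer | undefined; }

// Stand-in for parsing and packaging a script into a loadable plugin.
type Packager = (scriptPath: string) => ITransformer;

class ScriptedTransformer implements ITransformer {
  private plugin: ITransformer | undefined;

  constructor(
    private readonly scriptPath: string,
    private readonly cache: PluginCache,
    private readonly packageScript: Packager,
  ) {}

  // Can be called eagerly while the pipeline is assembled; later calls reuse the result.
  load(): ITransformer {
    if (this.plugin === undefined) {
      // 1. Prefer a cached plugin for this script path.
      let resolved = this.cache[this.scriptPath];
      if (resolved === undefined) {
        // 2. Otherwise parse and package, caching the result for next time.
        resolved = this.packageScript(this.scriptPath);
        this.cache[this.scriptPath] = resolved;
      }
      this.plugin = resolved;
    }
    return this.plugin;
  }

  // 3. Act as a proxy: delegate to the real plugin and hand its result back.
  transform(input: string): string {
    return this.load().transform(input);
  }
}

// Usage with a fake packager that counts how often it runs.
let packageRuns = 0;
const fakePackager: Packager = () => {
  packageRuns += 1;
  return { transform: (input) => input.split("").reverse().join("") };
};
const sharedCache: PluginCache = {};
const firstProxy = new ScriptedTransformer("plugins/reverse.ts", sharedCache, fakePackager);
const secondProxy = new ScriptedTransformer("plugins/reverse.ts", sharedCache, fakePackager);
const proxyResults = [
  firstProxy.transform("abc"),
  firstProxy.transform("xyz"),
  secondProxy.transform("123"),
];
```

Exposing load separately from transform leaves the eager-versus-lazy decision to whoever assembles the pipeline.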

Parsing and Packaging

Parse the script to an AST (abstract syntax tree, the internal representation of the source code)
Emit the parsed AST to C# code
Use a consumer that wraps a call to dotnet to generate the DLL
Store the DLL in some file system location (the cache)
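As a rough sketch, the four P&P steps chain together as below. Every stage signature is hypothetical, and the fake stages only record the order in which they run. In the real compiler the build step would wrap a call to dotnet.

```typescript
// Hypothetical stage signatures; each one stands in for a real plugin or tool.
interface PackagingStages {
  parse(scriptSource: string): string[];              // script text -> toy AST
  emitCSharp(ast: string[]): string;                  // AST -> C# source text
  buildDll(csharpSource: string): string;             // C# source -> name of the built DLL
  store(cacheKey: string, dllName: string): string;   // copy into the cache, return cached path
}

// Runs the four steps in order and returns where the cached DLL ended up.
function parseAndPackage(
  stages: PackagingStages,
  cacheKey: string,
  scriptSource: string,
): string {
  const ast = stages.parse(scriptSource);
  const csharpSource = stages.emitCSharp(ast);
  const builtDllName = stages.buildDll(csharpSource);
  return stages.store(cacheKey, builtDllName);
}

// Fake stages that log their execution order instead of doing real work.
const stageLog: string[] = [];
const fakeStages: PackagingStages = {
  parse: (scriptSource) => { stageLog.push("parse"); return scriptSource.split("\n"); },
  emitCSharp: (ast) => { stageLog.push("emit"); return "// " + ast.length + " node(s) emitted"; },
  buildDll: (csharpSource) => { stageLog.push("build:" + csharpSource.length); return "Plugin.dll"; },
  store: (cacheKey, dllName) => { stageLog.push("store"); return ".cache/" + cacheKey + "/" + dllName; },
};

const cachedDllPath = parseAndPackage(fakeStages, "prune", "line one\nline two");
```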

Caching Strategy

If a DLL with the same GUID does not exist in the cache folder, then create a fresh version of the DLL and store it in the cache folder
If a DLL with the same GUID does exist in the cache folder and the modified timestamp of the DLL is earlier than that of at least one contributory plugin file, create a fresh version of the DLL and overwrite the stale version in the cache folder
If a DLL with the same GUID does exist in the cache folder and the modified timestamp of the DLL is later than all contributory plugin files, use the cached DLL without modification
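These three rules reduce to a small decision function. The sketch below is an illustration with hypothetical names. It also treats a DLL whose timestamp equals that of a source file as stale, a case the rules above leave open.

```typescript
type CacheAction = "create" | "regenerate" | "reuse";

// dllModified is undefined when no DLL with the plugin's GUID exists in the cache.
// Timestamps are plain numbers here, for example milliseconds since the epoch.
function decideCacheAction(
  dllModified: number | undefined,
  sourceModified: number[],
): CacheAction {
  // Rule 1: nothing cached yet, so build a fresh DLL.
  if (dllModified === undefined) {
    return "create";
  }
  // Rule 2: at least one contributory file is as new as, or newer than, the DLL.
  // Assumption: equal timestamps count as stale, to err on the side of rebuilding.
  const isStale = sourceModified.some((modified) => modified >= dllModified);
  if (isStale) {
    return "regenerate";
  }
  // Rule 3: the DLL is newer than every contributory file.
  return "reuse";
}

const cacheActions = [
  decideCacheAction(undefined, [100]),
  decideCacheAction(50, [10, 60]),
  decideCacheAction(100, [10, 60]),
];
```

Passing every contributory file's timestamp means a single changed file is enough to invalidate the cached DLL.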

Author Workflow

The author workflow becomes:

Write a plugin script in a language we can parse
Put the path to the plugin script in the relevant section of your Sempiler config (relative to the config file location)

Compiler Behaviour

When the compiler is run it needs to:

Detect when a plugin path actually points to a raw script (this could be as simple as examining the file extension)
Create the relevant Scripted instance wrapping the path
Put the Scripted instance in the pipeline*
Execute the pipeline as normal

*Note that beyond this point, the rest of the compiler session does not need to know or care that a scripted plugin was involved. It will just encounter plugins, wherever they originated.
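A sketch of the detection and wrapping steps follows. The Scripted wrapper names come from this post, but the config shape, the helper names and the rule that anything not ending in .dll counts as a raw script are assumptions.

```typescript
type PluginKind = "parser" | "transformer" | "emitter" | "consumer";

// Wrapper names are taken from the post; how they are constructed is assumed.
const SCRIPTED_WRAPPERS: Record<PluginKind, string> = {
  parser: "ScriptedParser",
  transformer: "ScriptedTransformer",
  emitter: "ScriptedEmitter",
  consumer: "ScriptedConsumer",
};

interface PipelineEntry { kind: PluginKind; path: string; loadVia: string; }

// Assumption: a path ending in .dll is a compiled plugin, anything else is a raw script.
function isRawScript(pluginPath: string): boolean {
  return !/\.dll$/i.test(pluginPath);
}

// Decides how a configured plugin path should be brought into the pipeline.
function toPipelineEntry(kind: PluginKind, pluginPath: string): PipelineEntry {
  const loadVia = isRawScript(pluginPath) ? SCRIPTED_WRAPPERS[kind] : "direct DLL load";
  return { kind: kind, path: pluginPath, loadVia: loadVia };
}

// Hypothetical config: one section per plugin kind, listing plugin paths.
const exampleConfig: Record<PluginKind, string[]> = {
  parser: ["plugins/TypeScriptParser.dll"],
  transformer: ["plugins/prune.ts"],
  emitter: ["plugins/FooEmitter.dll"],
  consumer: ["plugins/write-files.ts"],
};

const kindsInOrder: PluginKind[] = ["parser", "transformer", "emitter", "consumer"];
const pipelineEntries: PipelineEntry[] = [];
for (const kind of kindsInOrder) {
  for (const pluginPath of exampleConfig[kind]) {
    pipelineEntries.push(toPipelineEntry(kind, pluginPath));
  }
}
```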

Limitations

We still have exactly the same underlying interface to the compiler (DLLs), but this is now almost entirely abstracted away from the end user (with minimal performance impact).

However, going from the flexibility of a whole C# project for a plugin implementation, to a single file format implies several limitations:

The entire implementation is limited to a single file(!)
Transformations on the plugin code are not configurable anywhere (ie. there is no configuration file to control how the plugin code gets compiled, unlike the wider project)
Because this approach subcompiles the plugin in a separate pipeline, it may prove tricky to support interactive debugging and stepping through the original plugin script
Dependency on a C# compiler being available on the host system (eg. Mono or CSC)

Moreover, the abstraction is still pretty heavy handed under the hood. And we are still confined to a single DLL per plugin.

Improvements

The aforementioned method goes a long way to making the experience of extending/customising the compiler easier.

However, the Author Workflow still involves touching files beyond your plugin in order to wire it up.

After writing the draft for the Scripted Plugins section two things happened.

Compiler as a Service

First, I discovered Mono’s C# compiler as a service, which lets you evaluate literal source text fragments directly through an API to the Mono compiler.

We currently build Sempiler with .NET Core, and fortunately this supports a similar feature through Microsoft’s Roslyn compiler APIs (Microsoft.CodeAnalysis).

With this in mind, we may still need to transpile the author’s plugin scripts to C# first, but now we do not need to necessarily write that emission to disk in order to load and use it.

We could still serialize something to disk for the sake of caching though, but in the absence of a DLL this could just be the transpiled C# source text.

So now the P&P behaviour becomes (only the final step differs, replacing DLL generation and storage):

Parse the script to an AST
Emit the parsed AST to C# code
Evaluate the emitted C# with a Compiler as a Service API

Note this new approach does not stop you being able to share your plugins — it’s purely an implementation change to the compiler with how plugin scripts are ingested and evaluated.
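The idea of evaluating emitted source text in memory can be shown by analogy. TypeScript has no counterpart to a C# Compiler as a Service API, so the sketch below swaps in JavaScript's Function constructor as a stand-in. The names and the shape of the emitted text are hypothetical.

```typescript
type PluginFn = (input: string) => string;

// Assumption: the emitted plugin is the body of a function taking one `input` argument.
function evaluateInMemory(emittedSource: string): PluginFn {
  // The Function constructor compiles source text at runtime; nothing is written to disk.
  return new Function("input", emittedSource) as PluginFn;
}

// The emitted text itself is what could be cached, in place of a DLL.
const emittedPluginSource = "return input.toUpperCase() + '!';";
const shoutPlugin = evaluateInMemory(emittedPluginSource);
const shouted = shoutPlugin("ok");
```

The pipeline only ever sees the resulting callable, so it does not matter whether it came from a DLL or from evaluated text.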

Inspiration from JAI

Secondly, I have mentioned being inspired by Jonathan Blow et al’s work on JAI before, and in rereading the JAI primer I was reminded that:

Any function of the program can be made to run at compile time with #run

The entrypoint to any plugin in Sempiler is purely a function call.

So that got me thinking about a better Author Workflow:

The author could write a plugin function (ie. transformer) inline with the rest of the source code
The author adds a #transformer directive to the function
During compilation, each function with a #transformer directive is parsed to an AST
The resulting AST is nested inside a larger built-in AST template to form a legal implementation of the appropriate functional interface (IParser, ITransformer, IEmitter, IConsumer)
The complete AST is then emitted to C# and evaluated as appropriate, as if the author had written a fully fledged C# plugin class

[Figure: plugin function expressed inline and invoked using directives]

Note that supporting compile time directives does not tie us in to any particular evaluation strategy — namely, we can use Compiler as a Service or DLL generation.

This feature will almost certainly be explored further in a future post, because it has the potential to support incredibly powerful concepts, like those illustrated with JAI.

I also need to improve how plugins are loaded, specifically by removing the requirement that the DLL name matches the class name. This currently ties us into a 1:1 ratio of plugin to DLL.

By adding an additional way to express the target class name we can begin to bundle related plugins together in the same DLL where it makes sense.

The trade off with bundling plugins is that DLLs may be larger, and you may end up regenerating them more often, because they may be dependent on the contents of more than one constituent source file. And if any one source file changes, we have to regenerate the cached DLL entirely.

Closing Remarks

In closing, hopefully this article has presented enough to substantiate the idea that compiler democratisation is not just a buzz phrase.

Your control over the code you write shouldn’t end the moment you pass it off to the compiler.

You should be able to easily do things like generate boilerplate code, prune unused/dead code (eg. tree shaking), or check switch statements are exhaustive, without the need for separate tools or tests.

Moreover, the tool you use shouldn’t dictate what syntax you write the code in.

These things are what motivate me to build the compiler, beyond the original sempilation use case. And I truly hope people find it a useful tool once it’s out there (mid 2019).