Relaxed Parser

Darius · Published in Sempiler · Mar 19, 2019

The first phase of a Sempiler pipeline is parsing, the task of taking some source input representing a program, and creating an agnostic semantic tree from it that the rest of the pipeline can reason about and manipulate.

This post will give a brief overview of the mechanics involved, and then cover why RelaxedParser is particularly useful in the case of Sempiler.

IParser

Any parser can be plugged into the compiler, as long as it satisfies the following simple interface:

Parser interface
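The original snippet isn't reproduced here, so the following is a minimal sketch based on the description below. The type and member names (Session, AST, ISource, Node, Parse) are stand-ins, not the exact Sempiler API:

using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// stand-ins for the real Sempiler types
public class Session { }
public class AST { }
public interface ISource { }
public class Node { }

public interface IParser
{
    // given a session, an (initially empty) AST and some source, produce
    // the parsed nodes asynchronously, honouring the cancellation token
    Task<IEnumerable<Node>> Parse(Session session, AST ast, ISource source, CancellationToken token);
}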

Given a session, an initially empty AST, some source and the cancellation token, it is expected to produce a collection of nodes.

Note also that we use the Task wrapper, allowing any parser to complete its workload asynchronously. This is useful for expensive operations like file I/O or network requests.

Node

Throughout the code we use the base class Node to describe a node, and then cast it appropriately to interrogate it.

When designing a data structure that has to match any source syntax/target domain pairing, it is useful to not hardcode any constraints. For example, if we said all field names were actually Identifier (as they are in C#), we would be stuck when it came to parsing TypeScript, where field names may be expressions (computed properties):

Computed property names
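For example (an illustrative TypeScript snippet, not the original one):

const key = "greeting";

const translations = {
    // the property name is a computed expression, not a plain Identifier
    [key.toUpperCase()]: "hello"
};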

Lexer

We use a Lexer, whose job it is to walk over a source string and return tokens — the recognised lexemes that make up the source syntax.

The Lexer interface is a bit more involved in practice, but essentially boils down to:

Lexer interface
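In sketch form (the member names are assumed, and Token stands in for the real token type):

public class Lexer
{
    public string SourceText { get; }

    // the position of where we are in the source string
    public int Pos { get; set; }

    public Lexer(string sourceText, int pos = 0)
    {
        SourceText = sourceText;
        Pos = pos;
    }

    // read on from Pos and return the first lexeme encountered as a Token
    public Token NextToken()
    {
        // ...scan SourceText from Pos, advance Pos past the lexeme...
        throw new System.NotImplementedException();
    }
}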

Namely: where we are in the source string, and a way to ask the Lexer to read on and return the first lexeme it encounters.

This design also makes cloning the Lexer cheap and easy, by rolling a new instance with the same source string and same position.

Cloning the Lexer
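As a method on the Lexer sketched above, that could be as simple as:

// a clone is just a new instance over the same source, at the same position
public Lexer Clone() => new Lexer(SourceText, Pos);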

This is useful when the Parser wishes to look ahead in the token stream, and only affect the original Lexer’s position if it finds a certain token sequence.

Looking ahead in the token stream
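A hypothetical look-ahead, assuming the token exposes a Kind property and a TokenKind enum exists:

var lookahead = lexer.Clone();
var next = lookahead.NextToken();

if (next.Kind == TokenKind.OpenParen)
{
    // the sequence matched: catch the original Lexer up to the clone
    lexer.Pos = lookahead.Pos;
}
// otherwise the original Lexer's position is untouched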

Note, the Lexer is not making any decisions about the validity of the sequence of lexemes (whether a particular lexeme can follow another legally). That is the job of the Parser.

The Lexer I’ve written for this purpose is heavily based on the official TypeScript scanner.

Parser

All the functions for parsing the different constructs have the same signature:

Parse delegate
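Roughly the following, where the return type is a guess based on the surrounding description and Context is another stand-in type:

public delegate Node ParseDelegate(Token token, Lexer lexer, Context context, CancellationToken ct);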

The delegate is passed:

  • the token containing a lexeme obtained from the Lexer
  • the Lexer such that this delegate can query for more tokens, as it parses a particular construct
  • the Context, which describes where the token was found (such as in a parameter list) — this is useful for resolving ambiguity, where the lexeme alone is not enough. The context will determine how to proceed based on what is syntactically legal/expected
  • the CancellationToken — a mechanism for terminating in-flight compilation

Note that this means for any invocation we have complete control over which token we pass in, and which Lexer instance.

If we wanted to ‘try parse’ a construct, we could clone the Lexer and pass a clone with the token to a ParseDelegate.

If successful, we can set the position of the original Lexer to match the clone (catch up). If unsuccessful, no problem! — we have not polluted the position of the original Lexer and can continue without issue.
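Under the same assumed sketch, the ‘try parse’ pattern might look like:

var clone = lexer.Clone();
var node = parseConstruct(token, clone, context, ct); // parseConstruct is some ParseDelegate

if (node != null)
{
    // success: catch the original Lexer up to the clone's position
    lexer.Pos = clone.Pos;
}
// on failure the original Lexer's position is untouched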

ParseDelegate makes it incredibly easy to write reusable helper functions, like ParseList.

ParseList applies the given delegate (in this case, ParseStatement) to the initial lexeme of each list element detected:

Parsing the statements of a function body
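The call site might look something like this (the parameter order and names are assumed):

// parse the statements that comprise a function body
var body = ParseList(ParseStatement, token, lexer, context, ct);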

In the above example we expect to encounter a list of statements that together comprise a function body.

Similarly though, we might reuse the same function to parse a list of parameters (ParseParameter):

Parsing the parameters of a function signature
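Under the same assumed signature, the reuse is just a different delegate:

// the same helper, applied to a parameter list instead
var parameters = ParseList(ParseParameter, token, lexer, context, ct);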

The Lexer and Parser as described above are essentially all it takes mechanically to understand how parsing works in Sempiler.

Next let’s look at what makes RelaxedParser particularly useful.

Deviating from TypeScript

I was initially satisfied writing a standard TypeScript parser, but of course, the point of Sempiler is to use a familiar syntax (TypeScript) to express intent for other platforms and domains.

It struck me that some of the rules imposed by the official TypeScript parser are based on the assumption that the code will run on the web or in Node.js. That is simply not the case in Sempiler.

The goal of these changes is to provide the quickest and easiest way for authors to be expressive in their code, whilst still writing legal TypeScript syntax (agnostic of semantics).

Modifiers

The set of supported modifiers is now open ended.

Take the DNA of a method declaration in TypeScript:

<modifiers> <name><template>(<parameters>) : <type> { <body> }

We remain faithful to this construct syntactically, regardless of whether the modifiers we write are private static or something else like foo bar!
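To illustrate (the second declaration would be rejected by the standard TypeScript parser, but it follows exactly the same syntactic shape):

class Service {
    // conventional TypeScript modifiers
    private static start(): void { /* ... */ }

    // arbitrary words in the modifier position: same shape, different words
    foo bar stop(): void { /* ... */ }
}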

And when you consider that different target domains have different modifier keywords, it makes sense that Sempiler lets you use those words directly inline where modifiers live.

In fact, we extend this idea to all symbols. Whenever we encounter an identifier, we ask for its role:

Deciding the role of a symbol
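A sketch of that decision; aside from Modifier and Identifier, the role names and the exact word list are invented for illustration:

public enum SymbolRole { Modifier, Identifier /* ...one member per role... */ }

public SymbolRole GetSymbolRole(string lexeme)
{
    switch (lexeme)
    {
        // any word that may appear in the modifier position of a declaration
        case "private":
        case "static":
        case "public":
            return SymbolRole.Modifier;

        // ...a case for every other keyword in the lexicon...

        // anything unrecognised is just an identifier, like a variable name
        default:
            return SymbolRole.Identifier;
    }
}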

As you can see, we return SymbolRole.Modifier for those words that can appear in the modifier position of a declaration.

But it doesn’t stop there. Each of the keywords that make up the lexicon has a corresponding case and an associated SymbolRole. This helps the parser understand what is a keyword, and what is just an identifier (like a variable name).

This is simple to implement, but means the parser is completely unopinionated about the actual word used. This should make it easier to reuse the parser on different source syntaxes, albeit with some modification of the parsing delegates.

Imports

The first change to imports is to allow for this compact syntax:

Compact imports
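The original snippet isn't reproduced here; based on the description, a plausible reading is a bare package import, analogous to a C# using directive:

// import purely for side effects: the package's symbols enter scope
import Foundation;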

This is useful for importing for side effects, specifically in target domains where the symbols from the imported package are added into scope (like C#).

Secondly, the standard TypeScript parser only expects to find a string literal or require(…) expression on the right-hand side.

We allow any expression here syntactically:

import Foo = <any expression>;
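For example (Grab and its arguments are entirely hypothetical):

import Foo = Grab("SomePackage", "1.2.0");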

The constraint is thereby removed from the source, and the consideration for the author becomes which expressions are legal in the target domain.

Catch Clauses

You can specify multiple catch clauses after a try block, and the pattern is optional:

Multiple catch clauses
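Illustratively (standard TypeScript allows only a single catch clause):

try {
    risky();
}
catch (err) {
    // a clause with a pattern to bind the error
}
catch {
    // the pattern is optional: a bare catch-all clause
}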

Templating

You can have templated enums and namespaces:

Templates
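For instance (neither of these is accepted by the standard TypeScript parser):

// a templated enum
enum Wrapped<T> {
    Value
}

// a templated namespace
namespace Cache<K, V> {
    // ...
}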

Ornamentation

Ornamentation is a general term given to the mixing of annotations and modifiers in a declaration.

The parser supports adorning variable and parameter declarations with ornamentation:

Ornamentation
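For example (the annotation and modifier names here are invented, echoing concepts from a target domain like Swift):

// a variable declaration adorned with an annotation and a modifier
@MainActor weak var view: View;

// an ornamented parameter declaration
function bind(@Inject lazy service: Service): void { /* ... */ }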

This is perhaps the most surprising departure, but it is faithful to where ornamentation appears in a function-style declaration.

And again, it is motivated by allowing code that is expressive enough to talk about concepts in the target domain without needing to write transformers to achieve them (though that is a valid alternative!).
