Syntax: language agnostic parser generator

In this post we’ll briefly talk about the new language agnostic parser generator, called Syntax.

Note: we don’t describe here parsers algorithms in detail; the target audience is supposed to be familiar with parsers terminology, and basic parsing techniques.

Syntax tool is implemented as a Node.js application. It’s available as an npm module, and can be installed as:

npm install -g syntax-cli
syntax-cli --help

Yet another parser generator?

There is a plenty amount of parser generators available these days, so why “yet another” one?

Initially, education.

The tool was designed to be applied in classes on parsers, and compilers — I was needed a working solution specifically created for teaching, which can be used to describe different parsing techniques, starting from manual recursive descent parsers, and going deeper to automatic LL- and LR-parsers.

Thus, the goal was not just to show how to use e.g. Bison (another famous parsers tool used on practice), but instead to understand how to implement one. Since only implementing something, i.e. practically applying a theory behind it, we can make our understanding more complete.

For pure practical JavaScript parsers one could use e.g. Jison (port of Bison) — a very good tool supporting state-based tokenizers, EBNF format for grammars, etc. However, it didn’t cover my needs for educational tooling, which Syntax initially aimed to implement. From practical perspective though, the generated parser code is now even faster in Syntax, than in Jison.


The implementation was started from this picture I had after digging LR-parsing, and its related terminology, such as “canonical collection or LR-items”, “closure” (don’t confuse with JS closures!), etc.

And even though I only barely worked on this project at my spare time, I was eventually able “to put this rock-painting to actual code”, and the Syntax tool was born.

What’s the audience?

That’s said, Syntax allows not only to generate (fully working with production ready algorithms) parsers, but to describe the parsing process, and the algorithms used behind.

The tool will be useful for computer science students, which take a compilers class, for studying different parsing algorithms. It can be used to check the results of your assignments, comparing parsing tables, showing parsing states, analyzing parsing conflicts, etc.

For example, here is how a “shift-reduce” conflict is shown in the parsing table, when an SLR(1) grammar is trying to be applied in LR(0) mode:

Shift-reduce conflict in LR parsing

Besides the parsing tables, Syntax allows analyzing and showing the First-, Follow-, and Predict-sets, used for building LL tables, as well as the canonical collection of LR-items — the state machine used for building LR parsing tables. See the help command to discover more options for education.

However, the education was not the only purpose of the tool. Syntax evolved into an actual practical parsers generator tool, which can be used for automatic parsers generation suitable for production. Let’s take a look.

Practical application

Syntax implements several LR-algorithms, such as LR(0), SLR(1), LALR(1), and the CLR(1). In addition it supports LL(1) parsing.

Some of them, such as LR(0) are considered mostly educational, others, such as LALR(1) — the most practical ones.

As an example, take a look at the MIPS Assembly parser or Regular Expressions parser implemented using Syntax.

Example grammars

We can start from a simple “calculator” grammar, and will show how to use Syntax for building parsers or even direct interpreters.

Grammar format

As you can see, the grammar is a JSON-like structure for a BNF. It allows defining lexical (lex), and syntactic (bnf) grammars, operators associativity, and precedence, and other options.

Precedence and associativity

The later allows building very elegant grammars, using fewer non-terminal symbols (and therefore, building smaller parsing tables). E.g. we use the same E symbol to describe all kinds of expressions, which might be either an addition (E + E), or a subtraction (E — E), or a simple number, or … etc. Without it a grammar had to be restructured using more symbols, and rules, or would have a “shift-reduce” conflict.

The ["left", "+"] in the operators sections says, that we want + symbol to be left-associative, i.e. 2 + 2 + 2 should be (2 + 2) + 2, and not 2 + (2 + 2). And the fact, that * goes after + in the list, says it has a higher precedence, so the 2 + 2 * 2 is correctly parsed as 2 + (2 * 2), and not as (2 + 2) * 2.

To see how this grammar would look like without operators precedence and associativity, check this example. The grammar does absolutely the same, but has more non-terminal symbols (E, F, T), and its parsing table is larger. Sometimes though it might be more clear to describe a grammar with more symbols.

Alternative grammar syntax (BISON-style)

In the example below we use an alternative syntax to define grammars, aka Bison/YACC-style:

/**
* Calculator grammar.
*
* syntax-cli --grammar calc.g --mode LALR1 --table -p '2 + 2 * 2'
*/
%lex

%%

\s+ /* skip whitespace */
\d+ return 'NUMBER'

/lex
%%
Additive
: Additive '+' Multiplicative { $$ = $1 + $3 }
| Multiplicative { $$ = $1 }
;
Multiplicative
: Multiplicative '*' Primary { $$ = $1 * $3 }
| Primary { $$ = $1 }
;
Primary
: NUMBER { $$ = int($1) }
| '(' E ')' { $$ = $2 }
;

Version with operator precedence and associativity works too:

/**
* Calculator grammar.
*
* syntax-cli -g calc.g -m lalr1 -t -p '2 + 2 * 2'
*/
%lex

%%

\s+ /* skip whitespace */
\d+ return 'NUMBER'

/lex
%left '+'
%left '*'
%%
Expression
: Expression '+' Expression { $$ = $1 + $3 }
|
Expression '*' Expression { $$ = $1 * $3 }
|
NUMBER { $$ = int($1) }
| '(' Expression ')' { $$ = $2 }
;

This is possible because Syntax uses BNF-grammar parser in case if a grammar is passed as a string, rather than a JSON structure (the string is parsed into the appropriate JSON representation in this case). The parser is also implemented just using Syntax.

Language agnostic target

In the example above we chose Python as a target language, however Syntax in general is language agnostic, and allows choosing different target languages — we will see shortly how we can reuse the same grammar for building parsers for several languages.

At the moment the following targets are supported:

And adding a new plugin for any language is relatively straightforward.

See this instruction how to implement a new plugin.

Once the parser is generated, it can normally be used as a Python module:

SDT for any kind of actions

Syntax-directed translation (SDT) allows using the parsing actions for any kind of purposes. In fact, in the Python example above we implemented more of an interpreter (directly evaluating results of the expressions), rather than a parser, which practical purpose is usually to construct an AST (abstract syntax tree). Let’s see how we can use SDT to actually build the parse trees — it’s just few changes we’ll need to make to our grammar.

Building an AST

Now let’s use JavaScript as a target, and build the AST for the calculator grammar using SDT with semantic action handlers:

All we did is just changed the return value (stored in the $$ parse variable) from an actual calculation to the tree nodes construction. And again, automatic operator precedence allowed us correctly build AST for expressions like 2 + 2 * 2 treating them as in math 2 + (2 * 2), without explicit grammar restructuring or parenthesis. Explicit parenthesis can be used though for cases like (2 + 2) * 2. The usage is trivial:

Module includes, and parser events

Using inline objects for ASTs might be good, however in a bigger project we could also implement them as separate classes, and include to the parsing file. Let’s take PHP as a target at this time:

And again from PHP it can be executed as just:

Design notes

  • Syntax uses plugin-based architecture, which allows easily adding a new target language. This separates tables calculation from the actual parser code generation, making the tool target language agnostic.
  • Syntax has a built-in tokenizer support which uses regexp system of the target language, and also supports stateful tokenization. It is possible as well to use a custom tokenizer if the built-in isn’t sufficient.
  • LL-parsing: building and analysis of the First-, Follow-, and Predict-sets. Analyzing conflicts (“first/first”, “first/follow”) e.g. for left-recursive grammars, and showing them in the parsing table.
  • LR-parsing: building and analysis of the canonical collection of LR-items (which consists of “closured” states). Analyzing and automatic resolution of the conflicts (“shift-reduce”, “reduce-reduce”) and showing them in the parsing table. Automatic left-recursion support, which is in the nature of LR parsers.
  • Support of generating a skeleton for a recursive descent might be added.

Source code and documentation

The source code and more detailed documentation can be found in the corresponding repository.

If you have any questions, suggestions, or comments — as always, I’ll be glad to discuss them in comments below.

Have fun with parsers!

Show your support

Clapping shows how much you appreciated Dmitry Soshnikov’s story.