Building a Domain-Specific Language with Chevrotain

Jeremy Sher
7 min read · Jun 30, 2019

Chevrotain, the open source parser building library, is an excellent tool for building many kinds of parsers in JavaScript without a separate code generation step: the grammar is written as ordinary JavaScript. In this article, I will walk through using Chevrotain to build a simple domain-specific language (DSL) that controls an imaginary smart lightbulb platform. By the end of this article, we will have built a parser and interpreter that can run a program like the following to adjust our simulated lightbulb state:

alias green #00ff00 
alias blue #0000ff
alias delay 600
brightness 75
on
wait delay
color green
wait delay
color blue
wait delay
off

If you just want to skip ahead to the source code, here’s a gist.

A language implementation in Chevrotain consists of three components (although two can be combined): a lexer, responsible for taking an input string and converting it into a list of tokens; a parser, which defines the rules for how valid statements may be constructed from the tokens generated by the lexer and applies those rules to create a syntax tree of the input program; and the interpreter/visitor, which walks the generated syntax tree to execute the program.

While the functionality of the visitor can be embedded into the parser for enhanced performance, this article will keep those concerns separate and use Chevrotain’s built-in CST Visitor functionality. For the purposes of this article, we’ll be building our DSL in a single file. For maintainability, larger Chevrotain projects should probably be split into multiple files.

Getting Started

Chevrotain is available as an npm module and can be installed with npm install chevrotain in your project directory. We’ll assume that our source file is in a directory with Chevrotain installed. First, we need to import chevrotain, as well as two modules from that package:

Lexer

Next, we need to set up our lexer. Remember that the job of the lexer is to take the input and split it up into tokens that can be consumed by the parser. To do this, we need to create a token definition for every string that is allowed in the language. We'll start by creating an array to hold our token types (which will eventually be given to the lexer's constructor) and a helper function that creates a new token from options and pushes it into our tokens array.
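As a rough illustration of that setup, here is a dependency-free sketch. The names allTokens and addToken are hypothetical, and the helper builds a plain object so the snippet runs on its own; in a Chevrotain project the helper would wrap the library's createToken function instead.

```javascript
// Ordered list of token types. Order matters: the lexer tries them first to last.
const allTokens = [];

// Hypothetical helper: build a token type from options and register it.
// With Chevrotain, the object would come from createToken(options) instead.
function addToken(options) {
  const tokenType = { ...options };
  allTokens.push(tokenType);
  return tokenType;
}

// Keeping the return value lets the parser refer to the token type later.
const On = addToken({ name: "On", pattern: /on/ });
const Off = addToken({ name: "Off", pattern: /off/ });
```

Returning the token type from the helper is the useful part of this design: each token is registered once, in order, and the same reference can be reused when writing parser rules.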

Now we can define all of the symbols that are allowed in our language. For the purposes of our DSL, whitespace is not significant. Because tokens are processed by the lexer in order, adding a token to identify and skip whitespace first will make our lexer more efficient.

You can see that a token definition is an object with a name, a regex pattern, and (optionally) a group. Following the conventions in Chevrotain’s examples, keyword tokens are named in UpperCamelCase and literal values are named in UPPERCASE. Lexer.SKIPPED is a special group that causes a token to be skipped and not get processed. We will not be grouping any other tokens in this example.
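To make the effect of the skipped group concrete, here is a small plain-JavaScript sketch of an ordered, regex-based tokenizer. It is not Chevrotain's lexer: SKIPPED is a stand-in for Lexer.SKIPPED, and tokenize and the token shape are simplified, hypothetical versions written for this example.

```javascript
const SKIPPED = "SKIPPED"; // stand-in for Chevrotain's Lexer.SKIPPED

// Token types are tried in order; whitespace comes first and is discarded.
const tokenTypes = [
  { name: "WhiteSpace", pattern: /\s+/, group: SKIPPED },
  { name: "Off", pattern: /off/ },
  { name: "On", pattern: /on/ },
];

function tokenize(text) {
  const tokens = [];
  let offset = 0;
  while (offset < text.length) {
    let matched = false;
    for (const type of tokenTypes) {
      // The sticky flag anchors the match at the current offset.
      const regex = new RegExp(type.pattern.source, "y");
      regex.lastIndex = offset;
      const match = regex.exec(text);
      if (match === null) continue;
      // Skipped tokens advance the offset but never reach the parser.
      if (type.group !== SKIPPED) {
        tokens.push({ type: type.name, image: match[0], startOffset: offset });
      }
      offset += match[0].length;
      matched = true;
      break;
    }
    if (!matched) throw new Error(`Unexpected character at offset ${offset}`);
  }
  return tokens;
}
```

Calling tokenize("on \n off") yields two tokens, On and Off; the spaces and the newline between them are consumed but produce nothing.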

Next, we can define all of the keywords in our language:

So we can recognize literal values in the program, we add token definitions for numbers and hex colors, as well as an identifier token for use with the alias keyword:
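The exact patterns from the original code are not shown here, so the following is only a sketch under assumed regexes: six hex digits after a # for colors, a run of digits for numbers, and a letter or underscore followed by word characters for identifiers. The classify function is hypothetical and works on whole words, which is simpler than a real lexer; it mainly shows why keywords need to be checked before the identifier pattern, since a word like on would otherwise match both.

```javascript
// Assumed patterns, anchored so they must match the whole word.
const literalTypes = [
  { name: "HEX_COLOR", pattern: /^#[0-9a-fA-F]{6}$/ },
  { name: "NUMBER", pattern: /^\d+$/ },
  { name: "IDENTIFIER", pattern: /^[a-zA-Z_]\w*$/ },
];

// Keywords of the lightbulb language, taken from the sample program.
const keywords = ["alias", "brightness", "color", "on", "off", "wait"];

// Hypothetical helper: keywords win over identifiers because they are checked first.
function classify(word) {
  if (keywords.includes(word)) return "Keyword";
  const hit = literalTypes.find((type) => type.pattern.test(word));
  return hit ? hit.name : "Unknown";
}
```

With these assumptions, classify("green") gives IDENTIFIER, classify("#00ff00") gives HEX_COLOR, classify("600") gives NUMBER, and classify("on") gives Keyword rather than IDENTIFIER.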

Finally, we can instantiate our new lexer from the base class provided by Chevrotain using our list of tokens.

Parser

The next component we need to build is the parser. Chevrotain provides almost everything we need for this in the Parser class, so all we need to do to get a working parser is provide a set of rules defining the syntax of valid statements in our language. Everything that we need to do in the parser happens in the constructor, so we’ll start by extending the Parser class and adding some boilerplate to its constructor.

Because parser rules are defined by calling class methods on this in the constructor, assigning this to $ is a convention recommended in the Chevrotain documentation to avoid typing this over and over. It is purely personal preference, and I recommend it. The call to $.performSelfAnalysis() (this.performSelfAnalysis()) analyzes the rules we will define and is required for the parser to work, so it must come after all of them.

Let’s consider the grammar of our language and define the first rule for our parser. What is the highest-level unit that an input should be understood as? One answer would be a program, the total of all the parts of the input. What is a program made of? A program can be said to be made of a series of statements. In our Chevrotain parser definition, this can be expressed like this (all example code showing parser rules should be understood to be in the constructor of our new LightParser class, and must be defined before $.performSelfAnalysis() is called):

Here, we are defining a rule called program, which contains many (zero or more) statement subrules. What is a statement? Our language can be understood as a series of commands, so let’s say that a statement can be any one of them:

Note that $.OR takes an array of objects rather than a name and a function like $.RULE. The ALT property of the object passed to $.OR is a normal parser rule block and can contain any number of parser class method calls, although in this case we’re just calling a number of exclusive subrules.
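To show what these two rules mean operationally, here is a hand-written, dependency-free sketch of the same structure. It is not Chevrotain code: parseProgram and its inner functions are hypothetical, only the on and off alternatives are included, and the loop and the lookup table play the roles that $.MANY and $.OR play in the real parser. One difference worth noting is that this sketch always creates the statement array, even when it is empty.

```javascript
// Hypothetical mini-parser over tokens shaped like { type, image }.
function parseProgram(tokens) {
  let position = 0;

  // Leaf rule: wraps the single keyword token it consumes.
  function keywordStatement(ruleName, tokenType) {
    const token = tokens[position];
    position += 1;
    return { name: ruleName, children: { [tokenType]: [token] } };
  }

  // statement: exactly one alternative, chosen by looking at the next token.
  const alternatives = new Map([
    ["On", "onStatement"],
    ["Off", "offStatement"],
  ]);
  function statement() {
    const next = tokens[position];
    const ruleName = alternatives.get(next.type);
    if (ruleName === undefined) {
      throw new Error(`No statement starts with ${next.type}`);
    }
    const child = keywordStatement(ruleName, next.type);
    return { name: "statement", children: { [ruleName]: [child] } };
  }

  // program: zero or more statements, until the input is used up.
  function program() {
    const statements = [];
    while (position < tokens.length) statements.push(statement());
    return { name: "program", children: { statement: statements } };
  }

  return program();
}
```

The resulting tree stores children in arrays keyed by rule or token name, which is the same general shape the visitor later reads from.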

Now we need to define rules for each of the subrules we’ve declared for our statement rule. The first two, on and off, are very simple:

Here we introduce the $.CONSUME method for the first time. $.CONSUME indicates that this rule expects to encounter a particular token. Our previous two rules (program and statement) were non-terminal rules, since they each call subrules. onStatement and offStatement (and the rest of the statements in our grammar) are terminal rules, since they consume tokens without calling any additional subrules (and, in the resulting syntax tree, their only children are tokens, with no nested rule nodes). The remaining rules are similar to onStatement and offStatement, but each consumes several tokens in a row and, for some of them, allows the token to be one of several types:
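As a sketch of what "consume a token that may be one of several types" means, here is a plain-JavaScript version of a color rule. The names makeConsumer and colorStatement and the token type names Color, HEX_COLOR and IDENTIFIER are assumptions for this example; in Chevrotain the same choice would typically be written as an $.OR whose alternatives each call $.CONSUME.

```javascript
// Hypothetical consume helper over tokens shaped like { type, image }.
function makeConsumer(tokens) {
  let position = 0;
  return function consume(...allowedTypes) {
    const token = tokens[position];
    if (token === undefined || !allowedTypes.includes(token.type)) {
      const found = token ? token.type : "end of input";
      throw new Error(`Expected ${allowedTypes.join(" or ")} but found ${found}`);
    }
    position += 1;
    return token;
  };
}

// color <HEX_COLOR | IDENTIFIER>
function colorStatement(consume) {
  const children = {};
  children.Color = [consume("Color")];
  // The argument may be a literal color or an alias name.
  const value = consume("HEX_COLOR", "IDENTIFIER");
  children[value.type] = [value];
  return { name: "colorStatement", children };
}
```

Only the key for the token type that actually matched appears in children, so a later step can tell a literal from an alias by checking which key is present.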

Finally, once we've added all of our rules and finished extending the Chevrotain Parser class, we'll create an instance of our parser and give it the tokens our lexer produces from an input. Remember that whitespace is not significant in our language, so the input can be a single-line string. We then create a syntax tree by calling an entry point to the parser, which in this case is program, and do some basic error handling:
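The control flow of that step can be sketched as follows. The function parseInput is hypothetical. It assumes a lexer whose tokenize method returns an object with tokens and errors, and a parser that exposes an input property, an errors array and a program() entry point, which is the shape Chevrotain's lexer and parser instances have. The fakeLexer and fakeParser objects are throwaway stand-ins so the snippet runs without the library.

```javascript
// Lex, parse, and fail loudly if either stage reported errors.
function parseInput(text, lexer, parser) {
  const lexResult = lexer.tokenize(text);
  if (lexResult.errors.length > 0) {
    throw new Error(`Lexing failed: ${lexResult.errors[0].message}`);
  }

  parser.input = lexResult.tokens;
  const cst = parser.program(); // entry point rule

  if (parser.errors.length > 0) {
    throw new Error(`Parsing failed: ${parser.errors[0].message}`);
  }
  return cst;
}

// Stand-ins with the assumed shape (not real Chevrotain objects).
const fakeLexer = {
  tokenize: (text) => ({
    tokens: text.split(/\s+/).filter(Boolean).map((image) => ({ image })),
    errors: [],
  }),
};
const fakeParser = {
  input: [],
  errors: [],
  program() {
    return { name: "program", children: { statement: this.input } };
  },
};

const cst = parseInput("on off", fakeLexer, fakeParser);
```

Checking the lexer's errors before parsing, and the parser's errors before using the tree, keeps a malformed program from reaching the interpreter.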

Visitor/Interpreter

Now that we have a syntax tree representing the structure of our program, we want to be able to do something with it. Chevrotain provides a BaseCstVisitor class that we can extend to create a new visitor class specific to our parser. Chevrotain also provides a BaseCstVisitorWithDefaults class, but since we want to implement every rule anyway we’ll start with a blank slate. We start by getting a base class from our parser and extending it; we’ll also set up the state for our simulated lightbulb in its constructor.

As with the parser, we must call super() at the top of our constructor, and we also need to call this.validateVisitor(). Our lightbulb has a few states, and we also add a scope property to the instance to store any values created with our alias keyword. Our interpreter class will have an instance method for each rule that we defined in our parser. Each method takes the current context of the syntax tree node it is on, and can return a value (although our implementation does not have interpreter methods returning values). As with the parser, we can start with program and statement:

The grammar of our parser states that a program contains many statements, so our program visitor iterates over its child statements (from context.statement) and visits each one (calls the appropriate visitor method on each child node). Note that any child of context is always an array if it's present (and undefined otherwise); even if the grammar defined program as containing a single statement, the context.statement of program would still be an array with a single element. Since our grammar says that a statement is exactly one of several possible alternatives, we can check which one is present and visit that one. When this.visit is called on an array of nodes, it will visit the first one.
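Here is a dependency-free sketch of that dispatch. The class LightInterpreter and its visit method are hand-written stand-ins for what Chevrotain's base visitor provides, and only the on and off alternatives are handled. The initial state (off, brightness 100, color #ffffff) is inferred from the sample output at the end of the article.

```javascript
class LightInterpreter {
  constructor() {
    // Initial state inferred from the article's sample output.
    this.state = { on: false, brightness: 100, color: "#ffffff" };
    this.scope = {}; // values created with the alias keyword
  }

  // Accepts a node or an array of nodes; for an array, only the first is visited.
  visit(nodeOrNodes) {
    const node = Array.isArray(nodeOrNodes) ? nodeOrNodes[0] : nodeOrNodes;
    if (node === undefined) return undefined;
    // Dispatch to the method named after the rule that produced the node.
    return this[node.name](node.children);
  }

  program(context) {
    // The key is missing entirely when the program has no statements.
    (context.statement || []).forEach((statement) => this.visit(statement));
  }

  statement(context) {
    // Exactly one alternative is present; visit whichever it is.
    if (context.onStatement) this.visit(context.onStatement);
    else if (context.offStatement) this.visit(context.offStatement);
  }

  onStatement() {
    this.state.on = true;
  }

  offStatement() {
    this.state.on = false;
  }
}
```

Note that program visits each statement individually, because passing the whole array to visit would only handle the first one.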

The next statements we'll define in our visitor, onStatement and offStatement, are simple. When these statements are reached, we want to change the state of the lightbulb appropriately. For all statements that change the state of the lightbulb, we'll also log some messages to see what happened:
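A minimal sketch of those two handlers, written as a standalone class so it runs without Chevrotain. The class name Lightbulb and the injectable log function are assumptions made for this example; the message wording follows the sample output shown at the end of the article.

```javascript
class Lightbulb {
  // The log function is injectable so the messages can be captured in tests.
  constructor(log = console.log) {
    this.on = false;
    this.log = log;
  }

  describe() {
    return this.on ? "on" : "off";
  }

  onStatement() {
    // Report the previous state before changing it.
    this.log(`turning light on (light was ${this.describe()})`);
    this.on = true;
  }

  offStatement() {
    this.log(`turning light off (light was ${this.describe()})`);
    this.on = false;
  }
}
```

Logging before the assignment is what lets each message report the previous state alongside the new one.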

The next statements also affect the state of the lightbulb, but as defined in our grammar they take an argument that can be either a literal value or an identifier created with alias. Because we know that each statement will only have one or the other, we can check for the existence of an identifier and use the value stored in this.scope at that key if it exists, or otherwise use the literal value passed in. Note that each token in the context has an image property that contains the literal string that was matched to produce the token.
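The lookup described above can be sketched as a small helper. The function name resolveValue and the token keys IDENTIFIER and HEX_COLOR are assumptions for this example, and throwing on an unknown alias is a design choice of the sketch rather than something the article specifies.

```javascript
// Returns the aliased value if an identifier was used, else the literal's text.
function resolveValue(context, literalKey, scope) {
  if (context.IDENTIFIER) {
    const name = context.IDENTIFIER[0].image;
    // Own-property check so names like "toString" are not found on the prototype.
    if (!Object.prototype.hasOwnProperty.call(scope, name)) {
      throw new Error(`Unknown alias: ${name}`);
    }
    return scope[name];
  }
  return context[literalKey][0].image;
}

const scope = { green: "#00ff00" };
const viaAlias = resolveValue({ IDENTIFIER: [{ image: "green" }] }, "HEX_COLOR", scope);
const viaLiteral = resolveValue({ HEX_COLOR: [{ image: "#0000ff" }] }, "HEX_COLOR", scope);
```

Since image is always a string, a numeric argument such as a brightness or a delay would need an explicit conversion (for example with Number) if the interpreter wants to do arithmetic on it.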

The final statement we need to define is aliasStatement, which behaves very similarly to the others but sets a property on this.scope instead of retrieving it:
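A sketch of that handler as a standalone function, assuming the alias value is either a hex color or a number token, as in the sample program. The function signature (context, scope, log) is hypothetical; in the visitor class this would be a method using this.scope. The message format follows the sample output.

```javascript
// alias <IDENTIFIER> <HEX_COLOR | NUMBER>
function aliasStatement(context, scope, log = console.log) {
  const name = context.IDENTIFIER[0].image;
  // Exactly one of the two literal kinds is present.
  const valueToken = (context.HEX_COLOR || context.NUMBER)[0];
  log(`setting ${name} to ${valueToken.image}`);
  scope[name] = valueToken.image;
}

const aliases = {};
const aliasLog = [];
aliasStatement(
  { IDENTIFIER: [{ image: "green" }], HEX_COLOR: [{ image: "#00ff00" }] },
  aliases,
  (message) => aliasLog.push(message)
);
```

After this call, aliases.green holds the string #00ff00, ready to be looked up when a later statement uses green as its argument.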

Finally, we can create a new instance of our interpreter and have it visit the syntax tree we created in the previous section:

We’re done! If we run the completed file with node, we should see our program execute:

$ node demo.js 
setting green to #00ff00
setting blue to #0000ff
setting delay to 600
brightness was 100
setting brightness to 75
turning light on (light was off)
waiting 600
setting color to #00ff00 (color was #ffffff)
waiting 600
setting color to #0000ff (color was #00ff00)
waiting 600
turning light off (light was on)

Conclusion

Hopefully this has demonstrated that Chevrotain is an extremely powerful tool for creating languages that is also surprisingly easy to get up and running with. If this article has inspired you to do something interesting with Chevrotain, please let me know at jeremy at knack dot com or at https://twitter.com/Overlapping!
