An Introduction to ANTLR (with TypeScript)
What is ANTLR?
ANTLR (ANother Tool for Language Recognition) is a parser generator. Its input is a grammar definition (in a notation based on EBNF, more on that later) of any language you want, and its output is a complete parser for that language. ANTLR can generate the parser in many different target languages, among them JavaScript, TypeScript, Python, C++, C#, Swift, PHP, Go, Dart, and more.
What you do with the parser is up to you, and the possibilities are endless.
A few examples of real-world applications: Twitter uses ANTLR to parse the search terms of over 2 billion queries a day, Oracle uses ANTLR in its SQL Developer IDE, MySQL Workbench uses it to parse SQL statements, and the NetBeans IDE parses C++ with ANTLR.
Why would you want to build a parser in Typescript?
You might be working on a product that lets users input some custom configuration or even a custom script. You might want to evaluate simple expressions and do some linting or error checking. It might be a new no-code or low-code product with limited scripting capabilities, or you might even be building a full-blown code editor in JavaScript/TypeScript, like VS Code.
In any of these scenarios, ANTLR can build a parser for you that generates a full syntax tree from the piece of code you give it. With ANTLR, you can easily iterate over this syntax tree for quick static analysis, error checking, code suggestions, and even evaluation. All of this can run inside the browser or in a Node.js process, and it is pretty fast.
What we use it for
At Palo Alto Networks, one of our products enables users to query numerous data sources ingested in a variety of ways and formats. Users can easily accomplish this using various widgets, dashboards, and visual query builders. However, for power users seeking full control over the query and wanting to utilize all the advanced features we offer, they can use our proprietary query language, XQL, to build queries directly in the browser.
We employ ANTLR on the frontend to provide users with immediate code suggestions (similar to ‘IntelliSense’) and to perform quick checks for syntax and even context errors. We also use ANTLR on the backend to evaluate these queries and run them against the user’s various data sources.
The Basic Mechanics of ANTLR
ANTLR’s output is a Lexer and a Parser that match the exact grammar you defined. After generating the Lexer and Parser, you can feed them a string to parse. ANTLR’s first step is to run the generated Lexer on the given string, producing a stream of Tokens (this step is called tokenization). This stream of tokens carries some extra information as well: each token is given a specific code and type (e.g., string literal, keyword, operator).
Then, the parser runs on this stream of tokens, constructs rules from groups of tokens, and builds the complete syntax tree representing the given string.
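To make this concrete, here is a rough sketch of inspecting such a token stream with antlr4ts (the package we use below), assuming the lexer class cool_langLexer that we will generate later in this article:
import { CharStreams, CommonTokenStream } from 'antlr4ts';
import { cool_langLexer } from './generated/cool_langLexer';

// Tokenize a string and print each token's type name and text
const lexer = new cool_langLexer(CharStreams.fromString('three = one + two;'));
const tokenStream = new CommonTokenStream(lexer);
tokenStream.fill(); // force the lexer to consume the whole input
for (const token of tokenStream.getTokens()) {
  console.log(cool_langLexer.VOCABULARY.getSymbolicName(token.type), token.text);
}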
ANTLR is written in Java, but the lexer and parser it outputs can be in any of the target languages that ANTLR supports.
Once all this is done, you can run the generated code as part of your own.
You can then implement a Listener or Visitor object (more on this later) that ANTLR will invoke while iterating over the syntax tree. This is where you can collect information, check for errors, give user suggestions, or evaluate code.
Getting Started
Defining your grammar
The first step is defining your grammar. Grammar files (for ANTLR 4) are saved as *.g4 files.
These files define the tokens and rules of your language.
Tokens are vocabulary symbols — they can be a specific identifier, operator, or keyword.
Rules combine different tokens (and other rules) into larger structures, with various options that can be applied to the tokens within the rule. As in regular expressions, we can define groups of tokens, quantifiers, and even recursion; a tiny sketch follows.
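For instance, a hypothetical rule for a comma-separated list might look like this (the rule and token names are made up for illustration):
// one ITEM, then zero or more ', ITEM' pairs
list: ITEM (',' ITEM)*;
ITEM: [a-z]+;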
The basic structure of a grammar file will look like this:
grammar my_lang;
<rule name>: <rule definition>;
<token name>: <token definition>;
Your grammar can contain as many tokens and as many rules as you need.
A very basic language might look like this:
grammar cool_lang;
file: (expr ';' | declaration ';' | assignment ';' | if_statement)+ EOF;
declaration: VAR IDENTIFIER;
expr: IDENTIFIER (OP IDENTIFIER)?;
assignment: IDENTIFIER '=' (NUMBER | expr);
if_statement: IF '(' expr ')' THEN '{' command '}' (ELSE '{' command '}')?;
command: STOP | CONTINUE | if_statement | assignment;
// Keyword tokens must be defined before IDENTIFIER; when two lexer rules
// match the same text, ANTLR picks the one defined first.
IF: 'if';
THEN: 'then';
ELSE: 'else';
STOP: 'stop';
CONTINUE: 'continue';
VAR: 'var';
OP: '+' | '-' | '*' | '/';
IDENTIFIER: WORD (NUMBER | WORD)*;
NUMBER: [0-9]+;
WORD: [a-z]+;
SPACES: [\t\r\n ]+ -> skip;
A few things you should notice here:
- The tokens ‘if’, ‘else’, ‘stop’ and ‘continue’ are case sensitive, so typing ‘IF’ won’t be recognized as a keyword. We can easily make the ‘if’ keyword case insensitive in our language by defining the token as `(i|I)(f|F)` (see the sketch after this list).
- The ‘pipe’ is used to create different options — so the ‘OP’ token can be any one of the algebraic operators +,-,*,/ (and only one of them).
- Identifiers in our language must start with a letter but can contain numbers after the first letter. (Note that WORD is defined as only lowercase letters.)
- A file is built from at least one expression, declaration, or assignment, each followed by a semicolon, or an if statement. The file can contain as many of these as needed.
- `expr` is defined as a single identifier or two identifiers with an operation between them.
- The `if_statement` uses the `command` rule, which recursively can be another `if_statement`.
- We have a special token definition for whitespace that covers tabs, spaces, newlines, and carriage returns, and it invokes a special ANTLR action, ‘skip’, which means ANTLR will throw these tokens away and completely ignore whitespace when building the syntax tree.
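As an aside, a common ANTLR idiom for case-insensitive keywords uses lexer fragments; a minimal sketch for the case-insensitive ‘if’ mentioned above:
fragment I: [iI];
fragment F: [fF];
IF: I F; // matches 'if', 'IF', 'If', and 'iF'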
The language we constructed here is very limited and probably useless, but it demonstrates a few important features of grammar definitions.
In this language, a file like this will be considered syntactically valid:
var one;
var two;
var three;
three = one + two;
if (two + two) then {
  stop
} else {
  if (one * one) then { continue }
}
When constructing a real language, your grammar definition file tends to get pretty long quite fast. You can use an ‘import’ statement at the top of the file to split the grammar into smaller files, both to keep organized and to share common pieces of grammar among different subsets/supersets of the language, as sketched below.
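For example, common token definitions could live in their own grammar file and be pulled in like this (the common_tokens file name is hypothetical):
grammar cool_lang;
import common_tokens; // tokens defined in common_tokens.g4

file: (expr ';' | declaration ';' | assignment ';' | if_statement)+ EOF;
// ... the rest of the rules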
If you’re curious, you can see the official grammar definition of the Python language in their own documentation (using mostly the EBNF format): https://docs.python.org/3/reference/grammar.html
Or take a look at the ANTLR grammar files that MySQL Workbench uses to parse SQL statements: https://github.com/mysql/mysql-workbench/tree/8.0/library/parsers/grammars
Generating Lexers and Parsers
Now that we have a full grammar definition for our language, we need ANTLR to generate our parser for it.
We do this using the antlr4ts-cli package, so you should install that first:
npm install -g antlr4ts-cli
Note: Since ANTLR itself is written in Java, you will need to have Java installed (JRE >= 1.6) for it to work.
Now you can run ANTLR by running the following command:
antlr4ts -visitor -o ./generated ./cool_lang.g4
This assumes you are running the command from the same directory as the grammar file we created earlier; the generated files will be output into the ‘./generated’ directory.
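Since you will regenerate the parser whenever the grammar changes, it can be convenient to put the command in an npm script (a suggestion, not something this setup requires):
{
  "scripts": {
    "generate-parser": "antlr4ts -visitor -o ./generated ./cool_lang.g4"
  }
}
Then `npm run generate-parser` rebuilds the lexer and parser.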
Using Listeners and Visitors
Looking in the ‘./generated’ directory ANTLR created, we should see a few files. The files with the ‘.interp’ and ‘.tokens’ extensions can be ignored — we won’t need to use them at all.
The file `cool_langLexer.ts` holds the token names, rule names, the codes given to them, and a serialized ATN (ANTLR’s internal state-machine representation, which we don’t need to understand at the moment).
The file `cool_langParser.ts` is the parser itself: given a stream of tokens, it constructs the syntax tree.
Besides these files, we have two generated interfaces, `cool_langListener.ts` and `cool_langVisitor.ts`, which we can implement in order to create our Listener or Visitor.
The Listener interface has an enter and exit method for each rule defined in our grammar. These methods are called as ANTLR enters and exits each rule while walking the syntax tree it created. This is exactly where we can implement any code we want to run.
Each listener method receives a Context object that holds the information relevant to that specific node in the syntax tree.
So if we want to implement a simple Listener that evaluates expressions, we can do something like this (some code omitted for brevity):
import { cool_langListener } from './generated/cool_langListener';
import { ExprContext } from './generated/cool_langParser';

export class CoolLangListener implements cool_langListener {
  // Called by ANTLR when it leaves an 'expr' node in the tree
  exitExpr(ctx: ExprContext): void {
    if (ctx.IDENTIFIER().length === 2) {
      if (ctx.OP()?.text === '+') {
        console.log(`${ctx.IDENTIFIER(0).text} + ${ctx.IDENTIFIER(1).text}`);
      }
    }
  }
}
ANTLR will trigger all the methods we decided to implement in our listener, so we will see something printed to the console for every `expr` that has two identifiers joined by the ‘+’ operation.
Using Visitors is very similar, except that ANTLR triggers only the visit method of the node we choose, and then, in our visitor implementation, it’s up to us to trigger the other methods to iterate over the whole tree. The advantage is that we get more control over the tree iteration, so we can decide to skip over certain nodes if we wish.
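A minimal sketch of what that might look like, assuming the generated `cool_langVisitor` interface and the `AbstractParseTreeVisitor` base class that antlr4ts provides:
import { AbstractParseTreeVisitor } from 'antlr4ts/tree/AbstractParseTreeVisitor';
import { cool_langVisitor } from './generated/cool_langVisitor';
import { ExprContext } from './generated/cool_langParser';

export class CoolLangVisitor extends AbstractParseTreeVisitor<void> implements cool_langVisitor<void> {
  // Result returned for nodes we don't explicitly visit
  protected defaultResult(): void {
    return;
  }

  visitExpr(ctx: ExprContext): void {
    console.log(`visiting expression: ${ctx.text}`);
    // Unlike a listener, descending into child nodes is our choice:
    this.visitChildren(ctx);
  }
}
You would kick it off yourself with `new CoolLangVisitor().visit(ruleContext)` instead of using the tree walker shown below.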
Using our listener would look something like this:
import { CharStreams, CommonTokenStream } from 'antlr4ts';
import { ParseTreeWalker } from 'antlr4ts/tree/ParseTreeWalker';
import { cool_langLexer } from './generated/cool_langLexer';
import { cool_langParser } from './generated/cool_langParser';
import { CoolLangListener } from './CoolLangListener'; // wherever you defined it

// Give the lexer our input as a stream of characters
const charStream = CharStreams.fromString(usersCode);
const lexer = new cool_langLexer(charStream);

// Create a stream of tokens and give it to the parser
const tokenStream = new CommonTokenStream(lexer);
const parser = new cool_langParser(tokenStream);

// Walk the syntax tree, starting from the 'file' rule
const ruleContext = parser.file();
ParseTreeWalker.DEFAULT.walk(new CoolLangListener(), ruleContext);
We haven’t done much here, but you should see the messages we logged in our CoolLangListener class printed to the console.
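As one last example of where you can take this, ANTLR reports syntax errors to error listeners, and by default it simply prints them to the console. Here is a rough sketch of collecting them instead, set up before calling `parser.file()` (the shape of the `errors` array is our own invention; `removeErrorListeners` and `addErrorListener` are part of the antlr4ts API):
const errors: { line: number; column: number; message: string }[] = [];

// Replace the default console-printing listener with our own collector
parser.removeErrorListeners();
parser.addErrorListener({
  syntaxError: (_recognizer, _offendingSymbol, line, charPositionInLine, message) => {
    errors.push({ line, column: charPositionInLine, message });
  },
});
After parsing, the `errors` array holds every syntax error ANTLR found, ready to be shown to the user.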