Debugging ANTLR 4 grammars

Bart Kiers
Aug 23, 2017 · 3 min read

A first aid Bash script

When I’m writing an ANTLR grammar and the input isn’t being parsed the way I expected, I always start by dumping all the tokens and the parse tree for a small input sample to my console.

If you do this more than a couple of times, you need a Bash script of course!

Below is a script I use, available via a public Gist:

Yeah, that’s quite a bit of Bash voodoo. Let’s break it down.

The script:

  • checks if java and javac are installed
  • checks if there are exactly 3 parameters provided
  • checks if the 1st parameter (the ANTLR grammar file name) really exists
  • checks if the ANTLR jar is in the same folder as antlr4-tester.sh and downloads it when it’s not there
  • generates lexer and parser classes based on the provided ANTLR grammar file name
  • generates a class with a public static void main method that uses the generated lexer and parser
  • runs the class with the public static void main method
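
The first three checks in this list can be sketched in a few lines of Bash. This is a minimal sketch and not the actual script: the function names require_command and check_args, the usage text and the return codes are all hypothetical and only serve as an illustration.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the pre-flight checks; not the code from the Gist.

# Succeeds when the named command can be found on the PATH.
require_command() {
  command -v "$1" >/dev/null 2>&1
}

# Validates the script's parameters.
# Returns 1 when there are not exactly 3 parameters,
# 2 when the grammar file (1st parameter) does not exist, 0 otherwise.
check_args() {
  if [ "$#" -ne 3 ]; then
    echo "usage: antlr4-tester.sh <grammar-file> <source-or-file> <rule-name>" >&2
    return 1
  fi
  if [ ! -f "$1" ]; then
    echo "grammar file not found: $1" >&2
    return 2
  fi
  return 0
}

# Possible use at the top of a script (left commented out here):
#   require_command java  || { echo "java is not installed" >&2; exit 1; }
#   require_command javac || { echo "javac is not installed" >&2; exit 1; }
#   check_args "$@" || exit $?
```

Keeping each check in its own function makes it easy to try them one by one from an interactive shell.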

Example

Let’s say we want to debug the following grammar:

grammar T;
parse
: expr EOF
;
expr
: expr ( '*' | '/' ) expr
| expr ( '+' | '-' ) expr
| '(' expr ')'
| NUMBER
| PI
;
MUL : '*';
DIV : '/';
ADD : '+';
MIN : '-';
NUMBER
: [0-9]+ ( '.' [0-9]+ )?
;
ID
: [a-zA-Z]+
;
PI
: 'PI'
;
SPACE
: [ \t\r\n] -> skip
;

First, add antlr4-tester.sh to your PATH. Then save the grammar above in a file called T.g4, open a terminal, and navigate to the folder containing that file.

If we’d now like to parse the source (1 + 42) * PI by invoking the parser’s parse rule, execute the script as follows:

antlr4-tester.sh T.g4 "(1 + 42) * PI" parse

You’ll see the following being printed to your console:

[TOKENS]
'(' '('
NUMBER '1'
ADD '+'
NUMBER '42'
')' ')'
MUL '*'
ID 'PI'
EOF '<EOF>'
[PARSE-TREE]
line 1:11 mismatched input 'PI' expecting {'(', NUMBER, 'PI'}
(parse
(expr
(expr
(
(expr
(expr 1) +
(expr 42)) )) *
(expr PI)) <EOF>)

Note the error the parser emits:

line 1:11 mismatched input 'PI' expecting {'(', NUMBER, 'PI'}

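This might seem a bit odd: the parser tells us it couldn’t match the input 'PI' and that it was expecting a 'PI' instead. However, if you look at the [TOKENS] on the console, you’ll see that the input PI was not tokenized by the lexer rule PI, but by the lexer rule ID. Both rules match the same two characters, and when two lexer rules match the same amount of input, ANTLR picks the rule that is defined first in the grammar, which here is ID.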
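
To see why the order of the two rules matters, here is a toy model of how a lexer chooses between competing rules: the rule that matches the longest prefix of the input wins, and on a tie the rule listed first wins. It is a sketch in plain Bash, not ANTLR code. The function pick_rule and its NAME=REGEX argument format are hypothetical and exist only for this illustration.

```shell
#!/usr/bin/env bash
# Toy model of lexer rule selection; hypothetical helper, not part of ANTLR.
# Usage: pick_rule TEXT NAME=REGEX [NAME=REGEX ...]
# Prints the name of the rule that would produce the next token.
pick_rule() {
  local text=$1
  shift
  local best_name="" best_len=0
  local spec name regex pattern len
  for spec in "$@"; do
    name=${spec%%=*}   # everything before the first '='
    regex=${spec#*=}   # everything after the first '='
    # Anchor at the start so that only a prefix of the input can match.
    pattern="^($regex)"
    if [[ $text =~ $pattern ]]; then
      len=${#BASH_REMATCH[1]}
      # Strictly greater: on a tie, the rule listed earlier keeps the token.
      if (( len > best_len )); then
        best_len=$len
        best_name=$name
      fi
    fi
  done
  printf '%s\n' "$best_name"
}

pick_rule "PI" 'ID=[a-zA-Z]+' 'PI=PI'    # ID listed first: prints ID
pick_rule "PI" 'PI=PI' 'ID=[a-zA-Z]+'    # PI listed first: prints PI
pick_rule "PIE" 'PI=PI' 'ID=[a-zA-Z]+'   # longest match wins: prints ID
```

The last call shows that putting PI first does not break identifiers that merely start with PI: the longer ID match still wins.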

We can remedy this by moving the lexer rule PI above the ID rule:

...
PI
: 'PI'
;
ID
: [a-zA-Z]+
;
...

If we now run antlr4-tester.sh T.g4 "(1 + 42) * PI" parse again, everything goes as planned!

Note that the second parameter, "(1 + 42) * PI", can also be a file name.
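
One way a script could support both forms is to check whether the argument names an existing file. The following is a minimal sketch assuming that behaviour. resolve_source is a hypothetical helper, and the actual script may handle this differently.

```shell
#!/usr/bin/env bash
# Hypothetical helper; not the code from the Gist.
# Prints the text to parse: the contents of the file when the argument
# names an existing file, otherwise the argument itself.
resolve_source() {
  if [ -f "$1" ]; then
    cat -- "$1"
  else
    printf '%s' "$1"
  fi
}

resolve_source "(1 + 42) * PI"   # no such file, so the string itself is printed
```

A side effect of this approach is that a source string which happens to equal an existing file name is read as a file.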

Errors produced by the lexer or parser are quite often caused by unexpected tokens being created, and these are relatively easy to trace by closely examining the token types printed on your console.

Hope it helps someone!
