Parsing multi-line (triple quoted) strings with Antlr4

Jan 15, 2024

Introduction

Triple quoted strings, found in programming languages such as Scala and Python, let users write strings without escaping quotes or line breaks. This is particularly useful for long texts, because users can lay out the text just as they would in a standard text editor, which makes large blocks of text much easier to define.

Parsing triple quoted strings, however, can be quite a challenge.

For example:

""" "Hello"
World!
"""

The above is expected to be parsed as a single string literal with the following content:

" \"Hello\"\nWorld!\n"

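As a quick sanity check of that expectation, here is a small Python sketch (Python being one of the languages with triple quoted strings). It only shows that the two spellings denote the same value; the variable names are made up for illustration.

```python
# The triple quoted form: quotes and line breaks are written as-is.
triple_quoted = """ "Hello"
World!
"""

# The same value as an ordinary literal, with the quotes and line breaks escaped.
escaped = " \"Hello\"\nWorld!\n"

print(triple_quoted == escaped)  # prints True
```
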
Antlr4

ANTLR4, standing for “Another Tool for Language Recognition”, is a highly efficient parser generator widely used in software development. It excels in processing complex grammars and is versatile in generating parsers for various programming languages including Java, C#, and Python. With its user-friendly grammar syntax, ANTLR4 significantly eases the creation of compilers and interpreters. Its robust features and improved performance make it a go-to tool for advanced language processing tasks.

Assumptions

Most programming languages ignore whitespace and line breaks. In this example we assume an existing rule:

WS : [ \t\r\n]+ -> skip;

This rule skips whitespace, including line breaks.

Naive approach

The naive approach to defining an Antlr4 grammar for triple quoted strings is:

TRIPLE_QUOTED_STRING: '"""' TRIPLE_QUOTED_STRING_CONTENT* '"""';
TRIPLE_QUOTED_STRING_CONTENT : '"' '"'? ~["]  // Match one or two quotes followed by a non-quote
                             | ~["]           // Match any character that is not a quote
                             ;

This is the most obvious approach for triple quoted strings, and it seems quite intuitive, but it has two problems.

First, it is a lexer rule, so the whole TRIPLE_QUOTED_STRING is parsed as a single token, and that token is handled as a single-line unit. The lack of multi-line support is a problem because parsing errors are then reported at the wrong positions. Moving the rule to the parser grammar does not help either, because line breaks are skipped by the WS rule assumed above.
Second, this definition cannot parse a double quote enclosed by triple quotes:

"""""""

If we use a non-greedy approach, the first 6 double quotes are parsed as an empty string and the last one is an error. If we use a greedy approach, then 12 double quotes in a row are parsed as a single string literal containing 6 double quotes, which is not what one would normally expect (the expectation would be 2 empty strings). Therefore, some kind of look-ahead is needed.
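
The greedy versus non-greedy trade-off can be illustrated with ordinary regular expressions. The sketch below uses Python's re module as an analogy only; it is not how ANTLR4 matches lexer rules, and the pattern and variable names are assumptions made for this example.

```python
import re

# Two simplistic patterns for a triple quoted string (analogy, not ANTLR4).
NON_GREEDY = re.compile(r'"""(.*?)"""', re.DOTALL)
GREEDY = re.compile(r'"""(.*)"""', re.DOTALL)

seven = '"' * 7    # one double quote enclosed by triple quotes
twelve = '"' * 12  # intended as two empty strings

# Non-greedy: stops at the first possible closing delimiter,
# so only 6 quotes are consumed and one quote is left over.
non_greedy_match = NON_GREEDY.match(seven)
leftover = seven[non_greedy_match.end():]

# Greedy: runs to the last possible closing delimiter,
# so all 12 quotes become one string whose content is 6 quotes.
greedy_match = GREEDY.match(twelve)
greedy_content = greedy_match.group(1)

print(repr(leftover))        # prints '"'
print(len(greedy_content))   # prints 6
```

Neither pattern handles both inputs correctly, which is the reason a fixed greedy or non-greedy choice is not enough.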

Solution

A solution could involve semantic predicates, but that approach has the downside of tying the grammar to a specific target language, since predicates are written in the language of the generated parser. Instead, the proposed solution relies purely on grammar.

This solution was developed for Snapi, a query language designed to query multiple data sources simultaneously and to combine, transform, filter, and aggregate the results.

Lexer rules:

lexer grammar SnapiLexer;

...
other rules
...

START_TRIPLE_QUOTE: '"""' -> pushMode(INSIDE_TRIPLE_QUOTE);

mode INSIDE_TRIPLE_QUOTE;
TRIPLE_QUOTED_STRING_CONTENT : '"' '"'? ~["]  // Match one or two quotes followed by a non-quote
                             | ~["]           // Match any character that is not a quote
                             ;
TRIPLE_QUOTE_END_2: '"""""' -> popMode;
TRIPLE_QUOTE_END_1: '""""' -> popMode;
TRIPLE_QUOTE_END_0: '"""' -> popMode;

Parser rules:

parser grammar SnapiParser;
options { tokenVocab=SnapiLexer; }

... other rules ...

triple_string_literal: START_TRIPLE_QUOTE (TRIPLE_QUOTED_STRING_CONTENT)*
                              (TRIPLE_QUOTE_END_2
                              | TRIPLE_QUOTE_END_1
                              | TRIPLE_QUOTE_END_0);

... other rules ...

Breaking down the solution

The first problem of the naive solution is solved by using lexer modes, which allow switching from one set of lexer rules to another. The INSIDE_TRIPLE_QUOTE mode has no rule that skips line breaks or tabs, so the triple quoted string is parsed “as is”, with correct token positions.
There is one rule that enters the mode and three rules that exit it, in order to cope with the second problem. Antlr4 will try to match the longest rule it can, so when the string ends with " or "" enclosed in the triple quotes, the TRIPLE_QUOTE_END_1 or TRIPLE_QUOTE_END_2 exit rule is applied, respectively. Please note that we don’t need to handle """ inside triple quotes, because it always closes the opening """.
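
To see how longest-match makes the three exit rules work, here is a minimal Python sketch that mimics the INSIDE_TRIPLE_QUOTE mode with regular expressions. It is not code generated by Antlr4: the tokenize helper and its longest-match loop are assumptions written for illustration, and every other rule of the real lexer is ignored.

```python
import re

# Rules of the INSIDE_TRIPLE_QUOTE mode, in the order they are defined.
INSIDE_RULES = [
    ("TRIPLE_QUOTED_STRING_CONTENT", re.compile(r'""?[^"]|[^"]')),
    ("TRIPLE_QUOTE_END_2", re.compile(r'"""""')),
    ("TRIPLE_QUOTE_END_1", re.compile(r'""""')),
    ("TRIPLE_QUOTE_END_0", re.compile(r'"""')),
]
START_TRIPLE_QUOTE = re.compile(r'"""')
WS = re.compile(r'[ \t\r\n]+')


def tokenize(text):
    """Hypothetical helper: return (rule name, matched text) pairs."""
    tokens = []
    pos = 0
    inside = False  # True while the simulated lexer is in INSIDE_TRIPLE_QUOTE
    while pos < len(text):
        if not inside:
            # Default mode: skip whitespace, then expect an opening """.
            ws = WS.match(text, pos)
            if ws:
                pos = ws.end()
                continue
            start = START_TRIPLE_QUOTE.match(text, pos)
            if start is None:
                raise ValueError(f"unexpected character at offset {pos}")
            tokens.append(("START_TRIPLE_QUOTE", start.group()))
            pos = start.end()
            inside = True  # pushMode(INSIDE_TRIPLE_QUOTE)
            continue
        # Inside the mode: the longest match wins, ties go to the earlier rule.
        best_name, best = None, None
        for name, pattern in INSIDE_RULES:
            m = pattern.match(text, pos)
            if m and (best is None or len(m.group()) > len(best.group())):
                best_name, best = name, m
        if best is None:
            raise ValueError(f"unexpected character at offset {pos}")
        tokens.append((best_name, best.group()))
        pos = best.end()
        if best_name.startswith("TRIPLE_QUOTE_END"):
            inside = False  # popMode
    return tokens
```

With this sketch, tokenize('"' * 7) yields START_TRIPLE_QUOTE followed by TRIPLE_QUOTE_END_1: the enclosed double quote is absorbed by the four-quote exit rule instead of being reported as an error.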

Then, to obtain the actual string literal, we can strip the three opening and three closing quotes in the Visitor/Listener like this:

TripleQuotedStringConst(ctx.getText().substring(3, ctx.getText().length() - 3))
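
The same slicing can be sketched in Python. Because the extra quotes captured by TRIPLE_QUOTE_END_1 and TRIPLE_QUOTE_END_2 are part of the matched text, dropping exactly three characters from each end keeps them in the value. The function name below is hypothetical and stands in for whatever the visitor does with the text.

```python
def triple_quoted_value(raw_text: str) -> str:
    # Drop the opening and closing triple quotes from the matched text.
    delimiter = '"' * 3
    if len(raw_text) < 6 or not (
        raw_text.startswith(delimiter) and raw_text.endswith(delimiter)
    ):
        raise ValueError("not a triple quoted string")
    return raw_text[3:-3]


print(repr(triple_quoted_value('"' * 7)))  # prints '"'
```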

More on Snapi language

Snapi grammar: https://github.com/raw-labs/snapi/tree/main/snapi-parser/src/main/java/raw/parser/grammar

Snapi documentation: https://docs.raw-labs.com/snapi/overview/

Snapi Github repository: https://github.com/raw-labs/snapi