ANTLR: an informal introduction

8 min readJul 13, 2019

Hi, my name is Vladimir Kozhaev and I am an expert in parsers, compilers, parsers, translators, etc
My contacts are emali:vkozhaev@gmail.com skype:vladimir.kozhaev website:www.newlanguageservice.com in case you need to implement the domain specific language, or parser, or translator from one programming language to another, or Eclipse IntelliJ Idea VS code plug-in do not hesitate to ask me

In this article, I am going to introduce you to ANTLR powerful framework. Equipped with this framework we will also write a relatively simple language that would coordinate the process of shearing a metal sheet (or any other sheet). The first steps in the language are going to be relatively simple but with the next articles more and more details will emerge and in the end, we will have something fully-fledged and effectively functioning. If you want to get a quick picture of how ANTLR works you will like this article.

What’s ANTLR and how you can use it

ANTLR (ANother Tool for Language Recognition) is a powerful parser generator for reading, processing, executing or translating structured text or binary files. It’s widely used to build languages, tools, and frameworks. From grammar, ANTLR generates a parser that can build and walk parse trees.

Terence Parr

As it follows from the quotation, ANTLR is used to generate text parsers. They allow taking a text to pieces and we understand whether this text holds up to the predetermined rules; if it does then we do the computations or proceed to another task.

Initially ANTLR was elaborated for Java and written using Java, however, the up-to-date version 4.7.2 allows parser generation for С#, Python 2 and 3, JavaScript, Go, C++ and Swift. Meaning, a parser can be generated for these languages as well. In this particular article, we will refer to Java but everything mentioned below can be easily applied to other languages.

So, what can you actually do with ANTLR?

Write interpreters and compilers for new programming languages.
Analyze logs.
Create utilities for generating images from text descriptions (this is exactly what we are going to do in this article).

Text analysis stages

How do we analyze a structured text? Why structured? It seems like to understand any text is an open problem. We assume that we deal with a subset which complies with a number of specific rules known as grammar. I will dwell on that in my next articles: I will give an extensive definition and share exhaustive information for its practical application. Now, it is enough to know that in the very beginning we need to write the grammar of a language (or borrow a ready one) which we will develop.

Here is what we are going to do with grammar:

We take the text to pieces or constituents that are called «lexemes». In other words, we will perform the lexical analysis. Lexemes are the words in a text. It will also allow us to see whether our systems comply with the rules described.
We use lexemes to form more complex structures, and again we should check whether they comply with the rules.
What we obtain is a grammar parse tree. This is the tree where the main rule lies in the root, interbedded rules are in the nodes, and specific lexemes are found in the leaves.
We can also check (optional) whether our text complies with the rules that can not be complied with even using grammar. Its what we call semantic analysis-parser.
If we detect mistakes we process them.
If there are no mistakes or mistakes are not critical we analyze the parse tree.

How ANTLR actually works

You must have got the picture — text analysis is not a piece of cake. You’ve got to have proper tools. The Dragon Book (‘Compilers: Principles, Techniques, and Tools’) is considered a detailed go-to guide. The thing is its almost 1200 pages! You’ll spend years of your life studying it. And this is where ANTLR comes to the rescue. In terms of grammar it works magic:

It generates a class that takes a text to lexemes.
It generates syntax parser.
It detects mistakes (if any) in lexemes or syntax of the text.
It helps to make a circuit of the parse tree containing the syntactic analysis and it generates visitor and listener classes for this. Meantime we redefine the useful methods.

Well, we are almost done! On a bigger scale, the most critical aspect in ANTLR lies in creating a grammar for the language you choose to work with, the framework does the rest of the things. Yet, as ANTLR has a huge community…

Let’s build grammar

Early in the game, the language is simple. A laser burner (or any other) has only two basic options:

move to point A when off
cut a straight line segment in the sheet.

These two simple operations allow cutting a model of any complexity out of a metal piece because the contour of any curve can be adjusted with the help of whatever number of straight line segments.

Thus, we need only two functions, MoveTo(x, y) and LineTo(x, y), where x and y are the coordinates of the corresponding point. Below you can find the code of our grammar, let’s analyze it.

grammar CuttingLanguage;

actions:action+;

action:moveTo|lineTo;

moveTo: moveToName=’MoveTo’ ‘(‘ x=INT ‘,’ y=INT ‘)’;

lineTo: lineToName=’LineTo’ ‘(‘ x=INT ‘,’ y=INT ‘)’;

INT :’-’? DIGIT+ ;

fragment DIGIT : [0–9];

WS : [ \n\r\t] -> skip ;

Code always begins with the grammar name. Then rules follow. Grammar has no sense without rules, therefore it is impossible.

A rule can consist of two rules and symbols connecting them. For example, «+» symbol in the rule «actions: action+;» means that in actions rule the action rule repeats one or more times. If it was «*» instead of «+» it would mean that the rule is repeated 0 or more times. In the given situation, it makes no sense.

Let’s now analyze the action rule. The vertical line means «or». In other words, the action rule — is either moveTo, or lineTo.

The moveTo rule consists of the keyword MoveTo, its braces mark and variables x and y of INT type. The INT rule begins with a capital letter and differs from the rules mentioned above. This is a lexeme. We mark this and will later return to the analysis of the differences between the lexemes and the rules. INT consists of the option sign «-» signified by the question mark that follows it and one or more DIGIT rules.

We have two curious techniques in the case with DIGIT. First, it is an enumeration in the brackets, which means there can be any symbol from 0 to 9. Second, DIGIT is a fragment. It can be a part of the rule but not the rule itself.
Finally, the last rule — WS; the arrow and skip mean the we redirect the symbols to a certain channel. We will speak about it in detail later on.

Now it is important to remember that the gap, tabulation and line folding are ignored in the given grammar.

Visitor

Grammar alone is not enough. The program text has to be processed, and we have two ways of doing it:

We can compute (handle) rules on entry and on exit. This method is called a listener.
We can handle the rule on entry and return the value that will be processed by the rule higher in the hierarchy. Finally, the handler of the main rule will return the search value.

In this paragraph, we are going to describe the implementation of the second rule, while in the next paragraph we will work with the first one.

Let’s take a look at the code.

The visitor is derived from the class generated by a parser-generator, in our case it is parameterized class CuttingLanguageBaseVisitor. The constructor assumes the initial position of a point and GraphicsContext, which we will draw with; I used JavaFX for this. There is a code in the GitHub project, yet I have to warn that this article is not about JavaFX. «Please do not shoot the pianist.

He is doing his best.» ©.

There are only two methods in this class: visitMoveTo and visitLineTo. The first receives MoveToContext, which is also the class generated by the parser. It has fields x and у. We have stipulated it in grammar: x = INT; y = INT. x and y are type-tags. The same time they are the respective lexemes and their text is returned with the self-evident method getText(), and the code is handled by Integer.parseInt method.

Next, we move the burning point of the cutter into the point (X, Y) and save the current position using gs.save(). Now we return the cutter position point with the renewed coordinates.

lineTo method works in a similar way but draws a straight line in the point (X, Y) before saving.

package org.newlanguageservice.antlrtutorial;

import org.newlanguageservice.ch1.CuttingLanguageBaseVisitor;

import org.newlanguageservice.ch1.CuttingLanguageParser.LineToContext;

import org.newlanguageservice.ch1.CuttingLanguageParser.MoveToContext;

import javafx.scene.canvas.GraphicsContext;

public class CuttingLanguageVisitor extends CuttingLanguageBaseVisitor<Point> {

private Point point;

private GraphicsContext gc;

public CuttingLanguageVisitor(Point point, GraphicsContext gc) {

this.point = point;

this.gc=gc;

}

@Override

public Point visitMoveTo(MoveToContext ctx) {

int x = Integer.parseInt(ctx.x.getText());

int y = Integer.parseInt(ctx.y.getText());

gc.moveTo(x, y);

gc.save();

return point.setX(x).setY(y);

}

@Override

public Point visitLineTo(LineToContext ctx) {

int x = Integer.parseInt(ctx.x.getText());

int y = Integer.parseInt(ctx.y.getText());

gc.strokeLine(point.getX(), point.getY(), x, y);

gc.save();

return point.setX(x).setY(y);

}

Now let’s take a look at how to make it work. Here is the code transmitted to the start button listener:

CuttingLanguageLexer lexer

= new CuttingLanguageLexer(new ANTLRInputStream(textArea.getText()));

lexer.removeErrorListeners();

CuttingLanguageParser parser

= new CuttingLanguageParser(new CommonTokenStream(lexer));

parser.removeErrorListeners();

lexer.addErrorListener(new CuttingErrorListener());

visitor = new CuttingLanguageVisitor(new Point(0, 0), gc);

ParseTree tree = parser.actions();

Point point = visitor.visit(tree);

System.out.println(point);

First, we create the lexer. This is the class that divides the program test into lexemes (I talked about them briefly above), while creating the parser out of lexemes in the second line.

Next, we create a visitor.

We derive the root of the parse tree, which is the entry point into the decision tree — the actions rule. We could take any other rule of the grammar (not a lexeme), and the analysis would start from this rule. We use the listener to go around the entire tree so that in the outcome we get the point where the cutter will come when it finishes the cutting procedure. As you see, it’s not rocket science.

Listener

You don’t need to do computations every time you analyze the parse tree. For instance, if you want to color the keywords it is enough to obtain their coordinates. Let’s use a listener for this. The work of this class somehow resembles XML round check with the help of SAX: the corresponding class method responds on entry and on exit from the rule.

The code class is below:

package org.newlanguageservice.antlrtutorial;

import java.util.ArrayList;

import java.util.List;

import org.newlanguageservice.ch1.CuttingLanguageBaseListener;

import org.newlanguageservice.ch1.CuttingLanguageParser.LineToContext;

import org.newlanguageservice.ch1.CuttingLanguageParser.MoveToContext;

public class CuttingLanguageListener extends CuttingLanguageBaseListener {

List<LexemeWithCoords> lexemes=new ArrayList<>();

@Override

public void enterLineTo(LineToContext ctx) {

lexemes.add(new LexemeWithCoords(

new Point(ctx.lineToName.getStartIndex(),

ctx.lineToName.getStopIndex()+1),

“LineTo”));

}

@Override

public void enterMoveTo(MoveToContext ctx) {

lexemes.add(

new LexemeWithCoords(

new Point(ctx.moveToName.getStartIndex(),

ctx.moveToName.getStopIndex()+1), “MoveTo”));

}

public List<LexemeWithCoords> getLexemes() {

return lexemes;

}

As in the previous case, we succeed from the generated ANTLR of the base class and override the two methods: enterLineTo and enterMoveTo, where we receive the coordinates of the keywords in a text.

Then, already in the editor, we color it the following way:

List<LexemeWithCoords> lexemes = listener.getLexemes();

lexemes.forEach(lexeme->textArea.setStyleClass(lexeme.getCoords().getX(),

lexeme.getCoords().getY(), “keywords”));

Handling errors

Sometimes code may contain errors, and in such cases, ANTLR can help detect those lexical and syntax ones. When the code is grammatically correct but does not produce the desired outcome, it means there are logic/semantic errors which can not be detected: framework can’t know what you want. Yet:)

Lexer and parser that we generated produce the error message. All we need to do is to attach the listener to them that will simply send the message onto the console. The listener realizes the interface ANTLRErrorListener. It has four methods, but in our case, we only need one — syntax error.

It looks the following way:

public class CuttingErrorListener implements ANTLRErrorListener {

@Override

public void syntaxError(Recognizer<?, ?> recognizer, Object offendingSymbol, int line, int charPositionInLine,

String msg, RecognitionException e) {

System.out.println(msg);

}

The complete text of the program is available here: http://github.com/vladimirkozhaev/antlrtutorial

I can’t wait to work on my next article where I will share an extended description of the theory essential for understanding the ANTLR operation principle.

ANTLR: an informal introduction

What’s ANTLR and how you can use it

So, what can you actually do with ANTLR?

Text analysis stages

How ANTLR actually works

Let’s build grammar

Visitor

Listener

Handling errors

Written by Vladimir Kozhaev