ANTLR Magic — Developing Mainframe Language Applications Using Language Recognizer

Published in CodeX · 15 min read · Aug 12, 2021

Have you ever wanted to write your own language application and integrate it into an IDE?

Every developer has a favorite IDE that makes application development easier. So let’s rephrase the question: have you ever wondered how these IDEs deliver modern coding and debugging experiences? The answer is not straightforward, so this article breaks it down with practical examples: how such support can be implemented, which technology (e.g. Java, ANTLR, VSCode, TypeScript, LSP, DAP) supports which component or feature, and how those components cooperate.

Before you start developing a language application, you need to know the grammar of the language, and you need a language-recognizer tool (e.g. ANTLR [1]) that generates parsers from pre-defined grammars. To integrate the language application with IDEs and provide modern coding and debugging experiences, you also need to implement a language server and a client that conform to the LSP [2] and DAP [3] specifications.

In my earlier blogs, I described the role of these specifications: LSP brings the magic of modern editing capabilities, whereas DAP provides the modern debugging experience. In this article I will show you the magic that ANTLR brings and its involvement in building language applications (e.g. COBOL Language Support, JCL, REXX).

As a case study for learning the technical aspects of ANTLR, you will create a sample Java application that lets you write the shortest COBOL program. To get started, first try ANTLR from the command line with a manually downloaded jar file, then switch to Maven with the ANTLR plugins in different IDEs, and then apply the ANTLR features step by step.

Keep in mind that many of the code snippets you see here aren’t complete so just refer to the GitHub repo for the whole example. Source code for the sample-application is publicly available here.

Getting Started with ANTLR v4

In 1989 Professor Terence Parr took an important step in computer-based language recognition by creating a recursive-descent parser tool that became ANTLR. ANTLR (ANother Tool for Language Recognition) is a powerful parser generator for reading, processing, executing or translating structured text or binary files [4]. ANTLR 4 performs grammar analysis dynamically at runtime, while the generated parser executes, rather than statically beforehand. It also simplifies many of the steps involved in creating language recognizers and parsers. ANTLR is widely used to write interpreters and compilers for new languages, to build tools and frameworks, or to analyze logs. For example, Hibernate uses ANTLR for parsing and processing HQL queries and Elasticsearch uses it for Painless. There are also other usages of ANTLR in Groovy, Ruby, Go, Swift, Python, Apache Spark, IntelliJ IDEA, WebLogic, JBoss Rules, etc.

Installing ANTLR

ANTLR is written in Java (so you need at least Java 7 installed), and it consists of two main parts: the ANTLR tool (a command-line tool), which translates your grammar into a parser/lexer in Java (or another target language such as JavaScript, Python, C#, C++, etc.), and the runtime, which is needed to run the generated parsers/lexers.

To install ANTLR manually, download the latest “antlr-4.x.x-complete.jar” and add it to the CLASSPATH. The jar file contains all the dependencies necessary to run the ANTLR tool, and the runtime library needed to compile and execute recognizers generated by ANTLR. To make the antlr4 and grun commands work properly, follow the instructions for your operating system. Below are the installation steps on OSX[4]

Alternatively, to install ANTLR via build tools, use either Maven or Gradle, and also integrate the ANTLR plugins into your favorite IDEs (See the section: Integrating a Generated Parser into a Project).

Introducing ANTLR4 grammars

A language is specified using a context-free grammar expressed using Extended Backus–Naur Form (EBNF). Programs that recognize languages are called parsers or syntax analyzers. ANTLR allows you to define the “grammar” of your computer language and it provides a convenient and developer friendly way of defining language syntax via a set of rules. Rules consist of a sequence of tokens and actions that define how a statement should be written in your source language so that it can be identified and parsed correctly. ANTLR grammar structure has the general form shown below:

The file name containing the grammar X must be called X.g4. You can specify the elements (options, imports, token specifications, and actions) in any order. There can be at most one each for options, imports, and token specifications. All of those elements are optional except the grammar header (grammar name) and rules. There must be at least one rule defined, otherwise a grammar without any rules makes no sense. Rules take this basic form: ruleName : alternative1 | … | alternativeN;

To define a grammar, there are two approaches: top-down and bottom-up.

The top-down approach starts from the general organization of a file written in your language. In Java (or other existing languages such as COBOL or C++) that means package declaration, imports, type declarations, constructors, methods, etc. This approach works well when you have solid knowledge of the language or of the grammar syntax itself. You start by defining the rule that represents the whole file and then define the subrules it includes.

In the bottom-up approach, you focus on the small elements first: defining how the tokens are captured and how the basic expressions are defined. Then you move on to the next higher-level element and incrementally build up to the rule for the whole file. This approach helps you focus on each detail, but you usually start with less knowledge of how the overall language will look in the end.

Let’s define a combined grammar (parser and lexer defined in one grammar) that lets us write the shortest program (consider the code below) in the COBOL programming language.

So you need to provide the rules that describe the language. The input could be a data format, a diagram, or any kind of structure that is represented as text. Rule elements specify what the parser should do at a given moment, just like statements in a programming language. An element can be a rule reference, a token, or a string literal. Parser rule names must start with a lowercase letter and lexer rule names must start with an uppercase letter. For more about rules, refer to the ANTLR4 documentation.

Executing ANTLR and Testing Recognizers

Once you have installed ANTLR and set the CLASSPATH correctly, you can run the ANTLR Tool and TestRig (a utility shipped in the ANTLR jar for testing grammars). For quick startup of the tools, it is best to create a convenient alias or batch file for each: see the batch files run1, run2 and run3, which are for the CLASSPATH, the Tool and TestRig respectively.

> doskey antlr4=java org.antlr.v4.Tool $*

> doskey grun=java org.antlr.v4.gui.TestRig $*

Then you use the antlr4 command to convert grammars into programs that can recognize sentences in the language described by the grammar. See how-to-run file for detailed execution steps.

>antlr4 ShortestCobolGrammar.g4 // generate antlr artifacts, translates grammars to executable Java code

Now you can see some generated files with names ShortestCobolGrammarLexer.java, ShortestCobolGrammarParser.java, … and also *.tokens files.

If you invoke the ANTLR tool without any command-line arguments (>antlr4), you get a help message that lists the options you can specify. For example, you can specify the target language to generate a parser in JS/TS, Python, C#, C++, or others.

>antlr4 -Dlanguage=JavaScript ShortestCobolGrammar.g4

>antlr4 -Dlanguage=Python3 ShortestCobolGrammar.g4

Another very handy option lets you generate visitors or exclude listeners. ANTLR generates a parser from a grammar; that parser can build parse trees, and ANTLR can also generate a listener (or visitor) that makes it easy to respond to the recognition of phrases of interest.

> antlr4 -visitor ShortestCobolGrammar.g4 //creates both, listener & visitor

> antlr4 -visitor -no-listener ShortestCobolGrammar.g4 //just creates visitor

To test your grammar, use the grun (TestRig) program with the real input. TestRig uses Java reflection to execute compiled recognizers. That is why you have to compile the generated Java source files.

> javac ShortestCobolGrammar*.java
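To see why the compiled classes must exist before the test rig can run, here is a minimal plain-Java sketch of the reflection idea. It is not TestRig's actual source: the classes ReflectionRigSketch and TinyRecognizer are hypothetical stand-ins, and the only point is that a class and a start-rule method are looked up by name at runtime.

```java
import java.lang.reflect.Method;

final class ReflectionRigSketch {
    // Hypothetical stand-in for a compiled recognizer; ANTLR does not generate this class.
    public static final class TinyRecognizer {
        private final String input;

        public TinyRecognizer(String input) {
            this.input = input;
        }

        // Plays the role of a start-rule method such as cobolProgram().
        public String cobolProgram() {
            return "parsed " + input.length() + " chars";
        }
    }

    // Finds the class and the no-argument rule method by name, then invokes the method.
    static Object invokeStartRule(String className, String ruleName, String input) {
        try {
            Class<?> recognizerClass = Class.forName(className); // fails if the class was never compiled
            Object recognizer = recognizerClass.getConstructor(String.class).newInstance(input);
            Method startRule = recognizerClass.getMethod(ruleName);
            return startRule.invoke(recognizer);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot run rule " + ruleName + " on " + className, e);
        }
    }

    public static void main(String[] args) {
        String className = TinyRecognizer.class.getName();
        System.out.println(invokeStartRule(className, "cobolProgram", "STOP RUN."));
    }
}
```

If the named class is missing from the CLASSPATH, the lookup fails before any parsing starts, which mirrors the reason for compiling the generated sources first.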

The TestRig takes a grammar name, a starting rule name (e.g. cobolProgram) kind of like a main() method, and various options (e.g. -tokens, to show the tokens detected) that dictate the output you want. To test the grammar, start up grun as follows:

> grun ShortestCobolGrammar cobolProgram -tokens //For: EOF # type ctrl-D (^D) on Unix or Ctrl+Z on Windows

Grun (>grun) also has a few useful options: -tree, -tokens, -gui, -ps, -diagnostics, etc.

To visualize the parse tree of an input, use the -gui option. A parser takes a piece of text and transforms it into an organized structure: a parse tree, which is often loosely referred to as an Abstract Syntax Tree (AST).

> grun ShortestCobolGrammar cobolProgram -gui

REMARK: If you change the grammar file, you need to re-generate the files. Therefore don’t put your own code into the generated files; instead extend them with your own classes, e.g. for listeners and visitors.

How ANTLR actually works

ANTLR automatically generates the lexical analyzer and parser for you by analyzing the grammar you provide (e.g. ShortestCobolGrammar.g4) or one taken from existing grammars. By default ANTLR reads a grammar and generates a recognizer for the language defined by that grammar. For example, the command below reads an input stream and reports an error if the input does not conform to the syntax specified by the grammar.

C:\workspace-eclipse\antlr-magic\using-antlr-cmd>grun ShortestCobolGrammar cobolProgram

If there are no syntax errors, then default action is to simply exit without printing any message.

ANTLR can generate lexers, parsers, tree parsers, and combined lexer-parsers. Parsers can automatically generate parse trees or abstract syntax trees, which can be further processed with tree parsers[7].

Lexers and Parsers

A token is a sequence of characters that represents a meaningful piece of input: typically a word or punctuation mark, separated by a lexical analyzer and passed to a parser. The process of grouping characters into words or symbols (tokens) is called lexical analysis or simply tokenizing. You call a program that tokenizes (performs lexical analysis) the input a lexer (or tokenizer). Programs that recognize languages are called parsers or syntax analyzers. By default, ANTLR-generated parsers build a data structure called a parse tree or syntax tree (representing how a grammar matches the input) that records how the parser recognized the structure of the input sentence and its component phrases.
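To make the idea of tokenizing concrete, here is a minimal hand-written sketch in plain Java. It is not the lexer that ANTLR generates: the class name TinyCobolLexerSketch, the three token types (WORD, STRING, DOT) and the sample input are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TinyCobolLexerSketch {
    // One named group per token type; whitespace is matched but never emitted.
    private static final Pattern TOKEN = Pattern.compile(
        "(?<WS>\\s+)|(?<STRING>'[^']*')|(?<WORD>[A-Za-z][A-Za-z0-9-]*)|(?<DOT>\\.)");

    // Groups the characters of the input into "TYPE:text" tokens.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(input);
        int pos = 0;
        while (pos < input.length()) {
            // lookingAt() anchors the match at pos, so no character is skipped silently.
            if (!matcher.region(pos, input.length()).lookingAt()) {
                throw new IllegalArgumentException("token recognition error at index " + pos);
            }
            if (matcher.group("STRING") != null) {
                tokens.add("STRING:" + matcher.group());
            } else if (matcher.group("WORD") != null) {
                tokens.add("WORD:" + matcher.group());
            } else if (matcher.group("DOT") != null) {
                tokens.add("DOT:" + matcher.group());
            }
            pos = matcher.end();
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Hypothetical one-line input, not taken from the sample repository.
        tokenize("DISPLAY 'Hello World!'.").forEach(System.out::println);
    }
}
```

A generated lexer does the same job from the lexer rules in the grammar, so you normally never write this loop by hand.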

Parsing (Syntax analysis) is the process of analysing a string of symbols, either in natural language, computer languages or data structures, conforming to the rules of a formal grammar. So, as you see from the diagram below, the lexer runs first and splits the input into tokens. Then a stream of tokens is passed to the parser which does all the processing. ANTLR generates a ParseTree for you which you can then process with a ParseTreeWalker.

Figure-4: diagram illustrates the basic data flow of a language recognizer[4]
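As a rough illustration of the parser side of this data flow, the sketch below is a tiny hand-written recursive-descent recognizer in plain Java. The two rules shown in the comments (program and statement) are hypothetical and much simpler than the grammar in the sample repository; the input is assumed to be already split into tokens.

```java
import java.util.Arrays;
import java.util.List;

final class TinyCobolParserSketch {
    private final List<String> tokens;
    private int pos = 0;
    private int statementCount = 0;

    TinyCobolParserSketch(List<String> tokens) {
        this.tokens = tokens;
    }

    // program : 'PROGRAM-ID' '.' NAME '.' statement* EOF ;   (hypothetical rule)
    int program() {
        match("PROGRAM-ID");
        match(".");
        next(); // NAME: any single token is accepted to keep the sketch short
        match(".");
        while (pos < tokens.size()) {
            statement();
        }
        return statementCount;
    }

    // statement : 'DISPLAY' STRING '.' | 'STOP' 'RUN' '.' ;   (hypothetical rule)
    private void statement() {
        String lookahead = tokens.get(pos);
        if (lookahead.equals("DISPLAY")) {
            match("DISPLAY");
            String literal = next();
            if (!literal.startsWith("'")) {
                throw error("a string literal", literal);
            }
            match(".");
        } else if (lookahead.equals("STOP")) {
            match("STOP");
            match("RUN");
            match(".");
        } else {
            throw error("DISPLAY or STOP", lookahead);
        }
        statementCount++;
    }

    private String next() {
        if (pos >= tokens.size()) {
            throw error("more input", "<EOF>");
        }
        return tokens.get(pos++);
    }

    private void match(String expected) {
        String actual = next();
        if (!actual.equals(expected)) {
            throw error("'" + expected + "'", actual);
        }
    }

    private IllegalStateException error(String expected, String actual) {
        return new IllegalStateException("syntax error: expected " + expected + " but found " + actual);
    }

    // Returns the number of statements recognized, or throws on the first syntax error.
    static int parse(String... tokens) {
        return new TinyCobolParserSketch(Arrays.asList(tokens)).program();
    }

    public static void main(String[] args) {
        System.out.println(parse("PROGRAM-ID", ".", "HELLO", ".", "DISPLAY", "'Hello'", ".", "STOP", "RUN", "."));
    }
}
```

Each rule becomes one method and each alternative becomes one branch chosen by looking at the next token, which is the general shape of a recursive-descent parser; ANTLR's generated code is considerably more capable than this sketch.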

Integrating a Generated Parser into a Project

The ANTLR Tool and TestRig are useful when working on the first draft of your grammar. Once you have a good start on the grammar, you can integrate the ANTLR-generated code into larger applications. To integrate it into your project, you can use Maven or Gradle to build the application and use the ANTLR development tools (plugins for IDEs) while coding.

First of all, let’s add the antlr4-runtime dependency and the antlr4-maven-plugin to pom.xml. With the ANTLR Maven plugin, you follow the same steps (prepare the grammar, generate sources and create listeners) as with the command-line approach. You put your grammars under src/main/antlr4/ and, thanks to this configuration, Maven makes sure that the lexer and parser are generated in the directory corresponding to their package when you run: >mvn clean install

ANTLR Development Tools

There are ANTLR plug-ins for several IDEs: IntelliJ, Eclipse, VS Code, and many others. These plug-ins are very handy during ANTLR development. You open the grammar file in the plug-in’s editor and you get quick fixes, syntax coloring, code completion, syntax and semantic error checking, code navigation, go-to-declaration and more. You can refactor, test your grammar, generate a parse tree, and save it as an image in different formats. Railroad diagrams and ATN graphs can also be very helpful for visualizing the rule types (parser, lexer, fragment).

Eclipse ANTLR 4 IDE

Vscode extension: ANTLR4 grammar syntax support

IntelliJ ANTLR v4

What more can ANTLR do?

Continuing with the ShortestCobolGrammar example, the next goal is to learn what more ANTLR can do (listeners, visitors, error handling, etc.).

But before doing that, let’s split the combined grammar into two separate grammars, one for the lexer and one for the parser, and then re-use the lexer grammar by referencing it from the parser grammar (in ANTLR 4 a parser grammar typically does this through the tokenVocab option). This approach promotes good software design (code reuse and the single-responsibility principle), especially when you have large grammars and the application has to support multiple languages in parallel.

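To make a parser grammar that only allows parser rules (ShortestCobolParser.g4) or a lexer grammar that only allows lexer rules (ShortestCobolLexer.g4), start the file with the header “parser grammar ShortestCobolParser;” or “lexer grammar ShortestCobolLexer;” respectively.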

To learn more about the rule definitions, refer to lexer-rules and parser-rules.

Grammar alone is not enough — Parse Tree Walkers

You have to remember that the parser cannot check for semantics. The parser should only check the syntax. ANTLR v4 encourages you to keep grammars clean and motivates you to use parse-tree walkers (listeners or visitors) to implement application-specific code e.g. to perform semantic checks (validation code, or adding specific logic check, etc.). Listeners and visitors are very handy mechanisms because they keep the application-logic out of the grammar, and keep the grammar application independent and programming language agnostic. So, you can use the same grammar to generate the language-recognizers for any target language (Java, JS, C++, etc.).

There are two tree-walking mechanisms supported by ANTLR in its runtime library: Parse-Tree Listeners (by default) and Parse-Tree Visitors. ANTLR automatically generates these tree walkers which can be used to visit the nodes of the trees to execute the application-logic.

Parse-Tree Listeners

By default, ANTLR generates a parse-tree listener interface together with a base-listener class, which is an empty implementation of that interface. The beauty of the listener mechanism is that you don’t have to do any tree walking yourself: the ANTLR-provided walker object visits the children implicitly and calls the listener methods automatically.

Let’s use the listener mechanism to print an error message when the DISPLAY text does not start with the word “Hello”. To write a program that reacts to the input through callback methods, all you have to do is implement a few methods that handle the rules on entry and on exit (e.g. enterDisplayStatement and exitDisplayStatement) in a new subclass ShortestCobolParserListenerImpl, which extends ShortestCobolParserBaseListener.
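The callback style can be sketched without the ANTLR runtime. The code below is a plain-Java stand-in, not the implementation from the sample repository: Node, Listener, HelloCheck and walk are hypothetical names, whereas real generated code would give you typed context objects and ShortestCobolParserBaseListener.

```java
import java.util.ArrayList;
import java.util.List;

final class ListenerSketch {
    // Minimal stand-in for a parse-tree node; ANTLR generates typed context classes instead.
    static final class Node {
        final String rule;
        final String text;
        final List<Node> children = new ArrayList<>();

        Node(String rule, String text) {
            this.rule = rule;
            this.text = text;
        }
    }

    interface Listener {
        void enter(Node node);
        void exit(Node node);
    }

    // The walker owns the traversal: enter the node, descend into its children, then exit.
    static void walk(Listener listener, Node node) {
        listener.enter(node);
        for (Node child : node.children) {
            walk(listener, child);
        }
        listener.exit(node);
    }

    // Callbacks return void, so findings are collected in a mutable field.
    static final class HelloCheck implements Listener {
        final List<String> errors = new ArrayList<>();

        @Override
        public void enter(Node node) {
            if (node.rule.equals("displayStatement") && !node.text.startsWith("Hello")) {
                errors.add("DISPLAY text should start with Hello: " + node.text);
            }
        }

        @Override
        public void exit(Node node) {
            // Nothing to do on exit in this sketch.
        }
    }

    // Builds a flat tree with one displayStatement per text and runs the check over it.
    static List<String> check(String... displayTexts) {
        Node program = new Node("cobolProgram", "");
        for (String text : displayTexts) {
            program.children.add(new Node("displayStatement", text));
        }
        HelloCheck listener = new HelloCheck();
        walk(listener, program);
        return listener.errors;
    }

    public static void main(String[] args) {
        check("Hello World", "Goodbye").forEach(System.out::println);
    }
}
```

Note that HelloCheck never asks for its children; it only reacts when the walker reaches a node.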

Just build the Maven project and run the program CobolProgListenerDemo to see the error message.

C:\workspace-eclipse\antlr-magic>mvn clean install

Exercise: To solidify your knowledge define an error message for COBOL AreaA & AreaB positions.

Visitor Tree Walkers

As you have seen, the listener mechanism is automatic, but there are situations where you want to control the walk manually and call the methods explicitly to visit the children. The alternative to creating a listener is creating a visitor. With the visitor mechanism you control the flow explicitly and return whatever value you want. To generate visitor tree walkers, use the -visitor option; ANTLR then generates a visitor interface together with a base-visitor class, which is an empty implementation of that interface.

Let’s use the visitor mechanism to adjust the DISPLAY text and add some logic (e.g. compute a hash of the input message). To implement this requirement, create a new subclass ShortestCobolParserVisitorImpl, which extends ShortestCobolParserBaseVisitor<T> (T is the return type), and implement the methods needed to satisfy the business logic; invoking visit() on the node’s children is all that is needed to continue the walk.
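A plain-Java sketch of the visitor idea follows. It does not use the generated ShortestCobolParserBaseVisitor<T>: Node, Visitor, HashingVisitor and accept are hypothetical names, and String.hashCode() is used only as a simple stand-in for whatever hash the real example computes.

```java
import java.util.ArrayList;
import java.util.List;

final class VisitorSketch {
    // Minimal stand-in for a parse-tree node.
    static final class Node {
        final String rule;
        final String text;
        final List<Node> children = new ArrayList<>();

        Node(String rule, String text) {
            this.rule = rule;
            this.text = text;
        }
    }

    // T is the value each visit method returns, mirroring the generic base visitor.
    interface Visitor<T> {
        T visitProgram(Node node);
        T visitDisplay(Node node);
    }

    // Dispatches to the visit method that matches the node's rule.
    static <T> T accept(Node node, Visitor<T> visitor) {
        return node.rule.equals("displayStatement")
            ? visitor.visitDisplay(node)
            : visitor.visitProgram(node);
    }

    static final class HashingVisitor implements Visitor<String> {
        @Override
        public String visitProgram(Node node) {
            StringBuilder out = new StringBuilder();
            for (Node child : node.children) {
                // The visitor drives the walk itself: leaving out this call skips the subtree.
                out.append(accept(child, this)).append('\n');
            }
            return out.toString();
        }

        @Override
        public String visitDisplay(Node node) {
            String adjusted = node.text.trim(); // the "adjustment" in this sketch is just trimming
            return adjusted + " [hash=" + Integer.toHexString(adjusted.hashCode()) + "]";
        }
    }

    // Builds a flat tree with one displayStatement per text and returns the visitor's result.
    static String render(String... displayTexts) {
        Node program = new Node("cobolProgram", "");
        for (String text : displayTexts) {
            program.children.add(new Node("displayStatement", text));
        }
        return accept(program, new HashingVisitor());
    }

    public static void main(String[] args) {
        System.out.print(render("  Hello World  ", "Goodbye"));
    }
}
```

Because every visit method returns a value, the result flows back up the tree through return values instead of through shared mutable state.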

C:\workspace-eclipse\antlr-magic>mvn clean install

The difference between tree-walker mechanisms

The main differences between tree-walker mechanisms are:

Unlike listeners, users explicitly call visitors on child nodes. Forgetting to invoke visit() on a node’s children means those sub-trees don’t get visited. Simply, in the visitor you can walk on the tree whereas in the listener you only react to the tree walker.
Listener methods (void) can’t return a value, whereas visitor methods can return any custom type. With a listener, you will have to use mutable variables to store values, whereas with visitors there is no such need.
A visitor uses the call stack to manage the tree traversal, so very deeply nested trees can cause a StackOverflowError. Note that the default ParseTreeWalker in the Java runtime also walks recursively; to traverse with an explicit stack allocated on the heap, the runtime provides IterativeParseTreeWalker, which lets a listener process deeply nested trees without that problem.
Both tree-walker mechanisms use depth-first (when a node is visited, its children will be visited) types of search.
The biggest difference between a visitor and a listener is that a visitor doesn’t need a ParseTreeWalker, e.g. compare the code of CobolProgListenerDemo and CobolProgVisitorDemo.
Moreover, listeners use the walker algorithm in ParseTreeWalker and visitors use the algorithm in AbstractParseTreeVisitor. Both consider all nodes, which is why the performance difference is small, even though the listener mechanism seems faster. Other than the implementation differences, visitor calls involve the overhead of generic return-type processing. This should have a negligible impact on performance in any modern JVM.
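The stack-usage point in the list above can be illustrated with a plain-Java sketch that contrasts a recursive depth-first walk with one that keeps its pending nodes in a heap-allocated deque. The names WalkStrategiesSketch, Node and chain are hypothetical, and only pre-order "enter" events are recorded to keep the example short.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

final class WalkStrategiesSketch {
    static final class Node {
        final String name;
        final List<Node> children = new ArrayList<>();

        Node(String name) {
            this.name = name;
        }
    }

    // Recursive depth-first walk: one JVM stack frame per tree level.
    static void walkRecursive(Node node, List<String> out) {
        out.add(node.name);
        for (Node child : node.children) {
            walkRecursive(child, out);
        }
    }

    // Iterative depth-first walk: pending nodes live in a deque on the heap.
    static List<String> walkIterative(Node root) {
        List<String> out = new ArrayList<>();
        Deque<Node> pending = new ArrayDeque<>();
        pending.push(root);
        while (!pending.isEmpty()) {
            Node node = pending.pop();
            out.add(node.name);
            // Push children right to left so the leftmost child is popped first.
            for (int i = node.children.size() - 1; i >= 0; i--) {
                pending.push(node.children.get(i));
            }
        }
        return out;
    }

    // Builds a degenerate tree: a single path with the given number of nodes.
    static Node chain(int depth) {
        Node root = new Node("n0");
        Node current = root;
        for (int i = 1; i < depth; i++) {
            Node next = new Node("n" + i);
            current.children.add(next);
            current = next;
        }
        return root;
    }

    public static void main(String[] args) {
        // A very deep tree is handled by the iterative walk without growing the call stack.
        System.out.println(walkIterative(chain(100_000)).size());
    }
}
```

Both walks visit the nodes in the same depth-first order; they differ only in where the bookkeeping for unfinished nodes is stored.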

For more ANTLR features or capabilities (error handling mechanism, actions and semantic predicates, etc), please refer to the book ‘The Definitive ANTLR 4 Reference’ by Terence Parr.

Building a Language Application

To enable the IDEs to support modern coding and debugging experiences, you need to implement a language server and a client based on the LSP[8] and DAP[9] specifications and integrate them into the IDE. To become a client, the code editor (or IDE) adds a small extension, which provides a language-agnostic, front-end editing capability without any awareness of the semantics of the language [see Figure-3].

Figure-3: LSP Architecture offers us flexibility of choosing any technology combination (can be TS/Python, or TS/C++, or TS/TS) to implement client and language-server independently

The language server provides the language semantics and serves a single client at a time. It listens to the meta-state of the editor and, based on syntax or semantic analysis using ANTLR, returns a set of actions. This communication follows rules defined by an extended version of JSON-RPC v2.0. To extend the consistent experience to debugging, DAP should be integrated as well [9].
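To give a feel for what travels between server and client, the sketch below builds one diagnostics notification by hand and frames it the way the LSP base protocol describes: a Content-Length header, a blank line, then the JSON body. The class and method names, the file URI and the message text are assumptions for illustration; a real server would use a JSON or LSP library and would escape strings properly.

```java
import java.nio.charset.StandardCharsets;

final class LspFramingSketch {
    // Builds a JSON-RPC 2.0 notification body by string concatenation.
    // Assumption: uri and message contain no characters that need JSON escaping.
    static String diagnosticsNotification(String uri, int line, String message) {
        return "{\"jsonrpc\":\"2.0\",\"method\":\"textDocument/publishDiagnostics\","
            + "\"params\":{\"uri\":\"" + uri + "\",\"diagnostics\":[{"
            + "\"range\":{\"start\":{\"line\":" + line + ",\"character\":0},"
            + "\"end\":{\"line\":" + line + ",\"character\":1}},"
            + "\"severity\":1,\"message\":\"" + message + "\"}]}}";
    }

    // Prefixes the body with a Content-Length header that counts its UTF-8 bytes.
    static String frame(String body) {
        int length = body.getBytes(StandardCharsets.UTF_8).length;
        return "Content-Length: " + length + "\r\n\r\n" + body;
    }

    public static void main(String[] args) {
        String body = diagnosticsNotification("file:///demo.cbl", 3, "DISPLAY text should start with Hello");
        System.out.println(frame(body));
    }
}
```

The header counts bytes rather than characters, which matters as soon as a message contains non-ASCII text.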

LSP support for Mainframe Programming Languages

Based on the above LSP architecture, Broadcom provides multiple extensions dedicated to specific languages, such as COBOL Language Support, HLASM Language Support, JCL Language Support and REXX. To install the extensions all at once, the mainframe Code4z extension is freely available for the VS Code editor in the Marketplace, and it can also be used in the Eclipse Theia cloud editor.

LSP Support for COBOL is an open-source project (using the technologies Java, ANTLR, VSCode, TypeScript, LSP, DAP, etc.) under the Eclipse license: che-che4z-lsp-for-COBOL. The COBOL Language Support extension provides language-aware features such as autocomplete, highlighting, diagnostics, and copybook support for files with the extensions .cob and .cbl. It can also connect to a mainframe using a Zowe CLI z/OSMF profile to download dependent copybooks [8].

Figure-4: COBOL Language Support Use-Case Diagram. Michelle is a modern mainframe developer who uses VSCode to develop COBOL applications.

Using the IBM language references for COBOL, CICS and SQL for z/OS, the Code4z team has built the grammars for COBOL Language Support and applies syntax and semantic analysis using ANTLR v4, which provides the diagnostics to the language server [see Figure-4] for the related use cases (all features supported by the LSP language server can be found here).

Summary

This article tackled the topic of ‘Developing Mainframe Language Applications Using Language Recognizer’ beginning with formulating a sample application. Then in following sections ANTLR features and capabilities are described and practical examples are implemented employing those features. As a result you have learned how modern coding and debugging experiences are developed and integrated into IDEs based on ANTLR (a parser generator to process structured text), LSP (provides language smartness), DAP (provides debugging) and VSCode technologies.

For further questions you may contact us on Slack: che4z.slack.com

References

[1] https://github.com/antlr/antlr4/blob/master/doc/getting-started.md

[2] https://microsoft.github.io/language-server-protocol/specifications/specification-current/

[3] https://microsoft.github.io/debug-adapter-protocol/

[4] https://pragprog.com/titles/tpantlr2/the-definitive-antlr-4-reference/

[5] https://www.antlr.org/tools.html

[6] https://tomassetti.me/antlr-mega-tutorial/

[7] https://en.wikipedia.org/wiki/ANTLR

[8] https://medium.com/modern-mainframe/lsp-magic-mainframe-language-support-in-modern-ides-4ea3d81259b3

[9] https://medium.com/modern-mainframe/dap-magic-modern-debugging-experience-for-mainframe-software-deecb40df4c8

Open Source repository: https://github.com/eclipse/che-che4z