Creating a parser for a programming language (xdlang part 3)

Mateusz Bednarski
10 min readJun 1, 2022

--

Welcome to the next part of creating compiler series. Today we are going to create a parser for xdlang! For reference — here is the table of contents:

  1. Create a programming language and compiler using Python and LLVM
  2. Creating a programming language — part 2
  3. Creating a parser for a programming language (xdlang part 3)you are here
  4. Visitor pattern in Python — in progress…

Enough talking; start implementing! First, let's initialize the project (I use poetry):

The first code we are going to write is the simplest xd program. Create a file xd_programs/hello0.xd with the following content:

A regular compiler usually is quiet — when there is no error, it does not print anything to the console. In our case, we would like to see much information from the processing. That's why we start with a xd run command. It will take a *xd file, compile it, run it, and grab the output. For 99% of the time, it is what we need.

Parser

Our approach will be bottom-up. We start with small pieces and then combine them into a working application. The first component to implement is the parser. Create a file grammar.lark:

%import common.WS
%import common.CNAME
%import common.NUMBER
LITERAL: NUMBER
IDENTIFIER: CNAME
!return_stmt: "return" expr ";"
func_def: "fn" IDENTIFIER "(" (IDENTIFIER IDENTIFIER)? ("," IDENTIFIER IDENTIFIER)* ")" ":" IDENTIFIER block
!expr: LITERAL?statement: return_stmt
!block: "{" statement* "}"
program: func_def+%ignore WS

I will not explain EBNF syntax in detail — it was already done in lark's tutorial. First, we define a terminal symbol LITERAL. For now, a literal can only be a NUMBER. It will change with new types- true/false for booleans, "asdf” for string, etc. Second terminal IDENTIFIER represents a valid name for variable/type/function. Lark already has a perfect built-in CNAME. It is basically an alphanumeric string without spaces and all the naming rules you expect from a programming language.

The first and only statement is return. Grammar tells us it is built from return keyword, expr and a semicolon. An expr is just a literal (for now!).

Even though there is only a single type of statement, we still define a rule "statement" — when adding new statements, we still want to refer to them as a generic ones. It is very similar to inheritance.

Block is zero or more statements surrounded by curly braces.

The program consists of one or more function definitions. We know that the main function must exist, but checking their presence is not a task for now. Remember — one transformation at a time. In the meantime, we tell lark to ignore whitespaces.

The last piece is a function definition. It starts with a fn keyword, followed by a valid identifier and a list of arguments in parenthesis. Next, there is a colon after which return type is defined. Lastly, we need a function body — block.

OK, but what are !? before rule names? They influence how a parse tree is formed. For more details — refer to docs. Anyway, we will see how it works in examples.

Having the grammar, we can start creating a parser. In the file xdlang/pipeline/parser.py we write:

The parser takes a path to the grammar file and a start symbol. The start symbol is a place from where lark is supposed to parse code. Normally it will be program as we want to parse the whole one, but for testing it is sometimes helpful to parse smaller units (like an expression). We also set ambiguity=" explicit" to see any ambiguous parts of our grammar. Lark can resolve some of them autmatically but I believe we should not rely on it for learning.

Let's try to run it on our simple program!

if __name__ == '__main__':
parser =Parser()
with Path('xd_programs/hello0.xd').open("rt") as f:
tree = parser.parse_text(f.read())
print(tree)

It works and prints the following:

Tree(Token('RULE', 'program'), [Tree(Token('RULE', 'func_def'), [Token('IDENTIFIER', 'main'), Token('IDENTIFIER', 'int'), Tree(Token('RULE', 'block'), [Token('LBRACE', '{'), Tree(Token('RULE', 'return_stmt'), [Token('RETURN', 'return'), Tree(Token('RULE', 'expr'), [Token('LITERAL', '0')]), Token('SEMICOLON', ';')]), Token('RBRACE', '}')])])])

We can conclude it is kind of a tree structure, but such formatting makes it difficult to track. But we can do better:

import rich
# ...
rich.print(tree)
Parse tree of our first program!

Now we clearly see a tree structure. The program consists of a single func_def with three descendants — name, return type, and body — a block. Recursively, a block consists of {} a single statement(there could be more, of course).

Recall that return_stmt is a statement but there is no statement node in the tree. This is because we defined it with a question mark:

?statement: return_stmt

If we do it without ? the tree would look like this:

In this case a statement would always be redundant, so we can get rid of it right now to simplify further processing.

Also, consider return_stmt In the grammar it was defined with explanation mark. What will hapen if we remove it (return_stmt: "return" expr ";" instead of !return_stmt: "return" expr ";")?

With ! at the left, without at the right side.

The exclamation mark means "keep all tokens". For some reason, we will want this behavior for some nodes.

We are done with the parser for now. Here is a complete file:

AST

Given the parse tree, we can construct the abstract syntax tree. Before we code, let's list what nodes we need:

  1. A literal node. It may turn out that we want other types of nodes (IntLiteralNode, BoolLiteralNode, StringLiteralNode), but let's keep the single one for now.
  2. A literal is also an expression. So there will be ExprNode, and LiteralNode with inherit from it.
  3. A ReturnStmtNode. Which is a:
  4. StmtNode — generic one for all types of statements
  5. BlockNode — it is NOT a StmtNode in xdlang (in contrast to many languages where a block of statements is equivalet-ish to a single statement)
  6. FuncDefNode. it is also not a statement (in many languages, it is)
  7. ProgramNode
  8. A generic Node — parent class for all nodes.
Nodes hierarchy

Node

Nodes are defined in the file xdlang/pipeline/nodes.py. We start with a top-level Node:

import abcclass Node(abc.ABC):
def __init__(self, line:int, column:int) -> None:
self.line = line
self.column = column

It is an abstract class (it makes no sense to instantiate generic node) and takes two arguments: line and column. They reflect the location in xd source file, allowing us to provide better error messages.

Expression and statement

Next, we have nodes for expression and statement:

class ExprNode(Node):
def __init__(self, line: int, column: int) -> None:
super().__init__(line, column)

class StmtNode(Node):
def __init__(self, line: int, column: int) -> None:
super().__init__(line, column)

Literal

The first "real one" is the literal node. It takes an additional argument with the value of the literal. For now, only int is allowed. Note that it inherits from ExprNode

class LiteralExprNode(ExprNode):
def __init__(self, line: int, column: int, value:int) -> None:
super().__init__(line, column)
self.value = value

Return

The return statement contains one child: expression to return. There are a lot of places where we can make a mistake, so let's assert it is, in fact of, type ExprNode

class ReturnStmtNode(StmtNode):
def __init__(self, line: int, column: int, expr:ExprNode) -> None:
super().__init__(line, column)
assert isinstance(expr, ExprNode)
self.expr = expr

Block

Block takes a list of statements:

class BlockNode(Node):
def __init__(self, line: int, column: int, stmts:list[StmtNode]) -> None:
super().__init__(line, column)
assert all([isinstance(n, StmtNode) for n in stmts])
self.stmts = stmts

Function definition

The definition of a function is a bit bigger. It has to remember function name, return type, and body. Note that the type is represented as a string. The parser does not know what kind of types are allowed, and this is OK:

class FuncDefNode(Node):
def __init__(self, line: int, column: int, type:str, identifier:str, body:BlockNode) -> None:
super().__init__(line, column)
self.type = type
self.identifier = identifier
assert isinstance(body, BlockNode)
self.body = body

Program

Last but not least: a program node. It is very similar to block, but there is a list of function definitions instead of statements.

class ProgramNode(Node):
def __init__(self, line: int, column: int, func_defs: list[FuncDefNode]) -> None:
super().__init__(line, column)
assert all([isinstance(n, FuncDefNode) for n in func_defs])
self.func_defs = func_defs

AST creation

We have all nodes for the AST. Now we need to transform the parse tree into it. Lark contains a built-in transformer which is a perfect match for us.

The transform takes the parse tree and applies a method defined by us for each type of node. The return value is used to replace the node in the tree. The whole process starts at the bottom; therefore, all descendant nodes were already visited when visiting a specific node.

Order of transforming parse tree

To implement a Transformer, we subclass lark.transformer and provide methods with names with respect to parse tree elements.

In the file, xdlang/pipeline/ast_builder.py we have:

from typing import Any
from lark import Token, Transformer
import xdlang.pipeline.nodes as xdnodesclass AstTransformer(Transformer):
def LITERAL(self, item:Token) -> xdnodes.LiteralExprNode:
print('literal')
return xdnodes.LiteralExprNode(item.line, item.column, int(item.value))

Recall that in the grammar file we had a LITERAL terminal. That's why we also need to have LITERAL method in the transformer. The single argument provided by lark is an object representing matched token. We are interested in the token line, column, and value. Having them we can easily build the LiteralExprNode.

Expr

In grammar, expr is defined as a rule, not terminal. Therefore lark provides us with a sequence of possible values. Not that the values in items not necessarily are Token.Why? Because non-terminals can have children that were already processed and are Nodes. For expr we know that it contains only a single literal node, so we are free just to return int

def expr(self, items:list[Any]) -> xdnodes.ExprNode:
return items[0]

Return

Return is a bit more tricky. In the grammar, we have: !return_stmt: “return” expr “;” It means that items will be a list of three elements, the return token, an expression (node, not token — as it was already transformed) and ";" token. So, in general, items can contain either tokens or Nodes that were already parsed.

def return_stmt(self, items: list[Any]) -> xdnodes.ReturnStmtNode:
expr = items[1]
return xdnodes.ReturnStmtNode(expr.line, expr.column, expr)

Block

With this knowledge, things are easier. Block is created from a list where curly braces are the first and last elements. Everything inside is a statement.

def block(self, items:list[Any]) -> xdnodes.BlockNode:
statements = items[1:-1]
return xdnodes.BlockNode(items[0].line, items[0].column, statements)

Function definition

This one is a bit different once again 😅 Rule for func_def is not prepended with an exclamation mark. Because of it, items will not contain the keyword fn as well as parenthesis. Therefore items contain an identifier, arguments, return type, and block. For now, we assume that the function cannot have arguments, so the code for transformation is following:

def func_def(self, items:list[Any]) -> xdnodes.FuncDefNode:
type = items[0].value
identifier = items[1].value
body = items[2]
return xdnodes.FuncDefNode(items[0].line, items[0].column, type, identifier, body)

Program

Program is very similar to function definition—just a list of child nodes.

def program(self, items:list[Any]) -> xdnodes.ProgramNode:
return xdnodes.ProgramNode(0,0,items)

We finished writing the transformer. Let's take a look at the results.

XD run

Now is a good time to create a command-line interface for testing the whole parser. I use typer but if you prefer click or argparse you are free to use them. Create a file xdlang/cli.py

We see familiar source file parsing and printing the parse tree. In addition, lark allows us to show a graphical representation (you need to install pydot). But we are primarily interested in viewing the AST tree (line 31). But unfortunately, we see :

<xdlang.pipeline.nodes.ProgramNode object at 0x7f2065e35de0>

Such a shame. It is a ProgramNode as we expect, and because of a lack of__str__ overload it is printed this way. We could do this, but there is a better way! For now, you have to trust me or check in the debugger ;)

In next episode

The solution to this problem is a visitor pattern. We will use it heavily in the later stages of the pipeline, so I decided to explain this separately in the next part. Mainly because it is not commonly used and is a bit more complicated.

Exercises/questions

  1. Try to add/remove !? from the grammar file and check how the tree is shaped
  2. Check what happens if xd source code is invalid. What is the error? Is it helpful? What would be a good error message?

Source code

The source code at the end of this article is available here: https://github.com/mbednarski/medium-xd

A favor for me

Please do me a favor and comment on the preferred way to embed source code. In this post, I used three of them:

  1. Picture from https://carbon.now.sh/
  2. Embedded GitHub gist
  3. Directly in the post

Which one do you believe is best for readers?

Psss, do you know you can buy me a coffee if you like my articles?

--

--

Mateusz Bednarski

AI enthusiast. Focused mostly on NLP and good software engineering practices for machine learning projects. Currently working at Roche.