Creating a programming language — part 2

Mateusz Bednarski
May 10, 2022


Before proceeding with the compiler itself, we need to make a few design decisions about xdlang. We have a rough idea of what we want to support, but let us make it more specific.

This article is part of a series:

  1. Create a programming language and compiler using Python and LLVM
  2. Creating a programming language — part 2 (you are here)
  3. Creating a parser for a programming language (xdlang part 3)

Rough xdlang specification

First, a program in xdlang will be contained in a single file; this will make compilation easier.

The source file contains a bunch of functions. Each function can take multiple arguments and returns a single value (or nothing). One of them is special: main is the entry point for the program, where execution starts.

Each function is composed of different statements. Some of them:

  1. return — exits the function, optionally returning a value. For example: return 0;
  2. let — creates a variable. For example: let float pi = 3.14;
  3. mut — changes the value of an existing variable. For example: mut counter = counter + 1;
  4. noop — empty instruction, the equivalent of Python's pass.
  5. print — prints variables to the console. I/O will be more tricky than it may seem!
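Put together, a few of these statements might look like this (the exact syntax is still tentative and will be pinned down as we implement each feature):

```
let int counter = 0;
mut counter = counter + 1;
print counter;
return 0;
```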

A special type of statement is an expression — which can be defined as “something that can be evaluated for a value”.

  1. The simplest one is literal — it can be a single number, bool value, or character/string.
  2. Another one is a variable (as they have value, they are expressions).
  3. Expressions can be modified using operators. Some are unary: ! for logical negation and - for unary minus, e.g. -5. Others are binary, and there will be a lot of them: + - * / % == != <= >= < > and or. We are not going to implement the ternary ?: (if you do not know it, it is equivalent to Python's x = 1 if y >= 0 else -1), at least not for now. Operators have their priority and associativity.
  4. Expressions, of course, can be grouped in parentheses to change their precedence — more on that later.
  5. A special expression is casting: changing the type of an expression to another type. For example: cast<int>(78.5).
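As a taste of how these expressions might combine (again, the syntax is tentative):

```
let bool positive = (counter > 0) and !done;
let float half = cast<float>(1) / 2.0;
```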

A bunch of statements can be enclosed in a block by placing them inside {} .

To create something interesting, we need control flow structures. For now, it will be:

  1. If-else or if-elif-else — here, we learn what “dangling else” means.
  2. While loop. We may add for and do-while later on.
  3. break/continue for loops.
  4. In fact, return is also a control flow statement.
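A small sketch combining these control flow structures (tentative syntax; details like required parentheses will be settled when we implement them):

```
let int n = 10;
while (n > 0) {
    if (n % 2 == 0) {
        print n;
    }
    mut n = n - 1;
}
```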

Last but not least, let’s consider data types.

  1. A 64-bit signed int
  2. A 32-bit floating-point float
  3. An 8-bit character char — sorry, no Unicode for now!
  4. A 1-bit logical value bool
  5. An “empty” type void.

There will also be string and array types, but let's put them on hold. Later we will also support user-defined types.
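Looking ahead to code generation, these primitive types will eventually have to be mapped onto LLVM IR types. Here is a minimal sketch of such a mapping in Python; the names and the exact scheme are assumptions, and the real compiler may organize this differently:

```python
# Hypothetical mapping from xdlang primitive types to LLVM IR type names.
XD_TO_LLVM = {
    "int": "i64",      # 64-bit signed integer
    "float": "float",  # 32-bit IEEE float (LLVM's 'float')
    "char": "i8",      # 8-bit character, no Unicode
    "bool": "i1",      # 1-bit logical value
    "void": "void",    # "empty" type, no value
}

def llvm_type(xd_type: str) -> str:
    """Look up the LLVM IR type name for an xdlang type."""
    try:
        return XD_TO_LLVM[xd_type]
    except KeyError:
        raise TypeError(f"unknown xdlang type: {xd_type}") from None

print(llvm_type("int"))  # i64
```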

Our language is quite simple, but still, there are many things to implement. Therefore, the specific syntax for all features above will be introduced as we implement them.

Compiler architecture

Ok, we now know more or less what we want to build; let’s think about how to do it.

Our compiler will be a multi-pass one. That means we will go through the program multiple times, and during each pass different operations will be performed. There are also single-pass compilers that do everything at once. Single-pass requires less memory, but multi-pass is easier to implement (IMHO). In real compilers, performance and memory usage can be a concern, but for our fun project, they are not.

Compilation passes

So what will the passes be?

  1. Lexical analysis and parsing — for this, we will use the lark package. It will create a tree version of the program code. The parse tree will then be transformed into an Abstract Syntax Tree (once again with the help of lark). The difference between them is that the AST reflects the semantic structure of the program, where each node is a specific piece of the program — for example, a function definition, a let statement, a while loop, a binary operation, etc. In contrast, the parse tree is intended to reflect the source text precisely.
    Our compiler will allow us to print human-readable parse and AST trees to help understand what is happening.
  2. Symbol table creation — with the trees in hand, we need to create a structure that remembers defined functions, types of variables, etc. This is because if we read a variable x, we need to know whether it exists. The same applies when invoking a function. Code like add(x, y, z); may be correct in terms of syntax but not semantically: the used variables may not be defined, there might be no function add, and many more issues are possible.
  3. When we have symbols, we need to perform type-checking. For each expression, the compiler will check the involved types and whether they are legal.
  4. Additional analysis. At this step we already know that the program has correct syntax, is not calling non-existing functions, and types are ok. But still, we need to check some additional things. The most important is the presence of a main function. Unfortunately, I do not have a good name for this step, so for now let's stick with "additional analysis".
  5. Code generation. Now we are ready to generate llvm IR code. We will traverse the whole program and translate it into an intermediate language. We could emit a string to a file, but we are going to use helpers from llvmlite package to make our life easier.
  6. Optional optimization. The IR code can be optimized. And LLVM is damn good at optimizing. This step will be optional because often, we would like to see IR as it was generated. Aggressive optimization may remove a significant fraction of instructions, so it may be challenging to track what was translated to what.
  7. Executable generation. Lastly, we need to generate an executable binary that can run. For this, we rely solely on LLVM.
Programs will be represented in many different ways during the compilation process.
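The passes above can be sketched as a chain of functions. Everything in this sketch is a hypothetical stand-in: the stubs just thread data through, while the real passes are built later in the series.

```python
# Illustrative pipeline sketch; every function below is a placeholder for a
# real pass. The data shapes (tuples, dicts) are simplifications.

def parse(source):            return ("parse_tree", source)
def to_ast(tree):             return ("ast", tree)
def build_symbols(ast):       return {"main": "fn () -> int"}
def type_check(ast, symbols): return ast  # would raise on type errors
def check_main(symbols):      assert "main" in symbols, "no main function"
def codegen(ast, symbols):    return "define i64 @main() { ret i64 0 }"

def compile_xd(source):
    tree = parse(source)             # 1. lexing + parsing (lark)
    ast = to_ast(tree)               #    parse tree -> AST
    symbols = build_symbols(ast)     # 2. symbol table creation
    ast = type_check(ast, symbols)   # 3. type checking
    check_main(symbols)              # 4. additional analysis
    return codegen(ast, symbols)     # 5. LLVM IR generation (llvmlite)

print(compile_xd("fn main(): int { return 0; }"))  # prints the IR string
```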

For implementing specific passes, we will make extensive use of the visitor pattern. If you do not know it, no worries! It will be explained.
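As a preview, here is a minimal sketch of the visitor pattern on a toy two-node AST. The node classes and the visit_* naming convention are assumptions for illustration only; the real compiler's nodes will differ.

```python
# Toy AST nodes (hypothetical; the real AST has many more node types).
class Num:
    def __init__(self, value):
        self.value = value

class Add:
    def __init__(self, left, right):
        self.left, self.right = left, right

class Evaluator:
    """Visitor that dispatches on the node's class name: visit_Num, visit_Add."""
    def visit(self, node):
        method = getattr(self, f"visit_{type(node).__name__}")
        return method(node)

    def visit_Num(self, node):
        return node.value

    def visit_Add(self, node):
        return self.visit(node.left) + self.visit(node.right)

tree = Add(Num(2), Add(Num(3), Num(4)))
print(Evaluator().visit(tree))  # 9
```

The nice property is that each pass (type checking, code generation, …) can be a separate visitor class while the node classes stay untouched.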

How to proceed

There are many things to do, and we need to have a plan on how to execute them. Our first goal will be to implement the whole pipeline for the most straightforward program in xd:

fn main():int
{
return 0;
}
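For reference, the LLVM IR for such a program could look roughly like this, assuming int maps to i64 (the actual output of our codegen pass may differ):

```llvm
define i64 @main() {
entry:
  ret i64 0
}
```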

We will need to implement all the mentioned steps for this minimal subset of xdlang, covering the following language features:

  1. Functions: fn main():int
  2. Blocks: {}
  3. Integer literal: 0
  4. Return statement

Also, we need a convenient way to compile and run our programs. The compiler should generally take a *.xd file and output a binary. But during development, we will run the compiler a lot and inspect what changes between passes. Therefore, we will create a command-line utility that takes the xd file, goes through the pipeline, prints all debug info, builds an executable, and immediately executes it.

Technically, we will use poetry as a package manager, with the following packages:

  • lark — for parsing
  • rich — for pretty-printing data structures
  • llvmlite — for IR code generation
  • typer — for command-line interface
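For reference, the dependency section of the project's pyproject.toml might look roughly like this (versions are placeholders, not necessarily the ones actually used):

```toml
[tool.poetry.dependencies]
python = "^3.10"
lark = "*"
rich = "*"
llvmlite = "*"
typer = "*"
```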

With all this information, we are ready to write the XDLANG compiler starting from the next post ;)

Stay tuned!


Mateusz Bednarski

AI enthusiast. Focused mostly on NLP and good software engineering practices for machine learning projects. Currently working at Roche.