Create a programing language and compiler using Python and LLVM

Mateusz Bednarski
5 min readApr 16, 2022

--

For me, compilers are one of the most exciting and fundamental topics in computer science. I bet you at least once were thinking about how those magical tools convert code into an executable file. The good news is that creating a compiler for a simple language is not as difficult as it might seem.

Okay, but why do it? There are thousands of programming languages and even more compilers. The answer is: fun and knowledge. I found this journey to be fascinating as hell, and there are many things you can learn. That’s why I want to share it with you.

DISCLAIMER

I am not an expert in compilers, so please understand that there might be some mistakes! It’s more like joining me in a journey to build a programming language, not a reference to how it should be done in the industry. I would like to help you better understand how computers work, especially if you know only high-level languages like Python or JavaScript.

LLVM logo (wyvern)
We are going to use this monster!

The language

The language we are going to implement is called XDLANG. We will write the compiler using Python with a few additional tools. But before we go into tooling, let’s discuss a bit the language itself.

XDLANG is a simple, imperative language providing only the basic functionality. Of course, we can further extend it, but I want to focus on essentials. Let me know in a comment if you would like to see a specific functionality ;). It is statically and strongly typed with four basic types: 64bit integer, 32-bit float, bool, and char. There will be support for string but at this moment I don’t know if it will be a dedicated type or an array of chars or sth else. If you are not familiar with languages like C/C++, you may not know why it is such a big deal for strings — it is okay for now.

Control structures will be pretty simple — if/else statement and a while loop with break/continue. XDLANG also allows defining custom functions with multiple arguments and a single return value.

As of now, I’m not sure about supporting array/list types. We will check it out later ;)

For I/O, xdlang will contain built-in functions to read/write to the terminal. This topic will be very interesting to implement!

Program in xdlang is contained in a single file with *.xd extension compiled to executable *.out file (native for Linux). I’m not sure if we are going to support Windows.

We will likely add additional features; xdlang does not have a formal specification — we can improvise on the fly!

The compiler

There is a wide variety of compilers, their properties, purposes, and architectures. But in general, they follow similar steps:

  1. Lexical analysis: converting plain text into a sequence of tokens. A token is the smallest unit “understood” by the compiler, for example, a keyword, variable name, number, bracket, etc.
  2. Parsing: converting a sequence of tokens into a tree-like structure called a parse tree. Additionally, it usually creates an Abstract Syntax Tree - a “cleaned” version.
  3. Semantic analysis: The AST is validated in various ways — checking if types match, the number of function parameters is correct, variables are not used before definition, etc.
  4. IR generation: AST is transformed into an intermediate representation. It is a special language that is lower level than the source one. It can be C (yes, some languages are transformed into C programs internally) or a dedicated one (CPython byte-code, CIL for .net, java byte-code, etc.). Why do we perform such a step? Our program can target multiple operating systems (Linux, windows, ios, android) and hardware (x86, x86_64, ARM, STM). Without IR, we would need to create multiple pipelines for different targets. And each of them needs to be changed every time we change something in the language. Also, performing optimization is much easier on IR than on machine language — because it is at a higher level of abstraction and does not deal with target system internals.
  5. IR optimization: Once we have IR, we can optimize it — by removing dead code, simplifying expressions, unrolling loops, changing the order of instructions, and many more. Without IR, we would have to create an optimizer for every single target — that would be a lot of redundant work!
  6. Code generation: Finally, when we have an optimized version of the program stored in the form of IR, we can translate it to the target machine code.

Those steps are divided into two groups. Compiler front-end (1–4) and back-end (5–6). Such design allows using a single back-end for multiple languages! Take a look — as long as we can produce IR code, the back-end does not care what the source was.

It seems that there are a lot of things to do! But we will not implement everything from scratch. For the back-end, we will use LLVM Compiler Infrastructure. LLVM is very popular — some notable users are C/C++/Swift/Haskell/Julia and more.

We will use lark for the lexer and parser — a python package dedicated to it. That means we will not implement them by hand. Instead, we will provide our language grammar in a special notation (used by many tools). Semantic analysis and code generation will be our primary tasks.

You may hear of lex/flex and yacc/bison -we will not use them.

Oh, and we will use the Visitor pattern a lot. If you have no idea what is it, no worries — compiler is such a great opportunity to learn :)

Prerequisites

There are no specific requirements other than general Python programming to understand the following content. I will introduce all concepts needed simply and enjoyably (that also means I will not go too deep).

I believe this mini-course will be especially beneficial if Python is your only language. Python is great, but it hides a lot of complexity. Making a compiler will reveal those to you :D

Ending and feedback

That’s all for today. In the next part, we will start defining our language. We will begin with some minimal scope to make things “work” as soon as possible and then extend the language until we decide it is enough.

In the very end, I will try to build an AI-based compiler for xdlang. Wonder if it makes any sense ;)

If there are concepts/functionalities, you would like to see, drop a comment! Let’s make this journey together :D

Next parts:

--

--

Mateusz Bednarski

AI enthusiast. Focused mostly on NLP and good software engineering practices for machine learning projects. Currently working at Roche.