Ballerina Compilation Process

Rajith Vitharana
Ballerina Swan Lake Tech Blog
Dec 17, 2019

This is not a how-to post; rather, it describes the technical concepts behind the language implementation.

When we started developing the language, we were newbies in language development and had to research everything from scratch. In fact, we took two weeks of leave and studied compiler theory first :)

So how is our compiler designed? The current design mainly consists of two parts: the compiler front end and the compiler back end.

High level design

When it comes to compilers, they are usually written in the language itself, which is called bootstrapping the compiler. We have done that for our compiler back end. So as of the current version (1.1.0), the compiler front end is written in Java and the back ends are written in Ballerina itself. (The BVM back end mentioned in the above diagram no longer exists; I have added it just for clarity, so that people can understand that we can plug in as many back ends as necessary.)

In Ballerina we have a concept called a module, which is the top-level language construct built from source files. It is also the shareable artifact of the language, so people can publish their own modules, import others’ modules, etc. In doing so, we have to share those modules somehow, but sharing the full source of a module is pointless (which we had done up to some point). So the idea is to compile the module sources into a shareable form that is platform neutral and generic. We call this shareable intermediate state BIR (Ballerina Intermediate Representation). Essentially, the compiler front end creates this BIR: it compiles Ballerina source files down to the BIR format, which then gets serialized and fed into the back end implementations. A back end reads this serialized byte stream and compiles it down to the relevant target platform (JVM bytecode, machine code, etc.). This separation has given us the flexibility to bootstrap our back end implementations, as each back end is a separate entity that reads a byte stream and works on top of its own tree structure.
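To make that separation concrete, here is a minimal sketch in Java of what the contract between the two halves looks like. The names here are hypothetical, not the real compiler’s interfaces:

// Hypothetical sketch of the front end / back end contract.
interface CompilerBackend {
    // A back end never sees source files, only the serialized BIR.
    byte[] codeGen(byte[] serializedBir);
}

class Compiler {
    void compile(java.nio.file.Path[] balSources, CompilerBackend backend) {
        byte[] bir = frontEndCompileToBir(balSources); // parse, analyze, BIRGen, serialize
        byte[] target = backend.codeGen(bir);          // e.g. JVM bytecode or machine code
        writeToDisk(target);
    }

    byte[] frontEndCompileToBir(java.nio.file.Path[] sources) { /* ... */ return new byte[0]; }
    void writeToDisk(byte[] target) { /* ... */ }
}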

So let’s dig deep into the compiler front end first.

The compiler front end has two types of inputs: *.bal source files and *.bir files (when we import already compiled modules). But the semantic analysis phases should only run for the *.bal source input; there is no need to run them on top of BIR, as they already ran when the BIR was created. So what do we need from BIR in the compiler front end, and how have we achieved that?

The answer is that from BIR the compiler front end only needs details of the top-level constructs which are accessible from other modules: specifically, the top-level symbols and their types. The same BIR is also fed into the compiler back end, and there we need more from the BIR than just top-level symbols: method bodies, type bodies, etc.

public type Person object {
    int age;
};

public function pubFunc() returns int {
    int localVar = 5;
    return localVar;
}

Let me explain with the above example, which has a type definition and a function. What we need at the compiler front end are the top-level type and the public function signature, because they can be accessed from other modules which depend on this module. The front end doesn’t need to know about the local variable “localVar”, as it cannot be accessed from outside this module (outside “pubFunc” to be exact). But when it comes to back end code generation, the back end needs to know about the type definition, the function, and the body of the function.

This means we have to read BIR files to achieve two goals: to generate the relevant bytecode at the compiler back end (which needs to read the full BIR), and to recreate the required top-level symbols at the compiler front end. So we have optimized the serialized BIR with these goals in mind. (That is a separate discussion entirely; for the moment, what we need to know is that with BIR we can recreate the relevant symbols at the front end and generate bytecode at the back end very efficiently.)

So let’s look at how the compiler front end does its magic.

The compiler front end is again broken into multiple phases, and we have used the visitor pattern to implement them, which makes it possible to keep all the logic of each phase in a single place instead of scattering it across each node. Below is a diagram depicting those phases in the front end.

Compiler front end breakdown (Note: this only includes the major phases)
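Since each phase is a visitor, all of a phase’s logic lives in one class. A minimal sketch of the pattern in Java (with hypothetical node and visitor names, not the real compiler’s classes):

// Minimal visitor-pattern sketch (hypothetical names).
interface Visitor {
    void visit(FunctionNode node);
    void visit(VariableDefNode node);
}

abstract class Node {
    abstract void accept(Visitor v);
}

class FunctionNode extends Node {
    java.util.List<Node> body = new java.util.ArrayList<>();
    void accept(Visitor v) { v.visit(this); }
}

class VariableDefNode extends Node {
    void accept(Visitor v) { v.visit(this); }
}

// One phase = one visitor; its logic is not scattered across node classes.
class SemanticAnalyzer implements Visitor {
    public void visit(FunctionNode node) {
        for (Node stmt : node.body) {
            stmt.accept(this); // double dispatch back into this phase
        }
    }
    public void visit(VariableDefNode node) {
        // type check the definition ...
    }
}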

I’m going to give brief explanations about phases mentioned in the above diagram.

The first phase is the ANTLR parser, which we use to parse the BAL file and generate an AST (abstract syntax tree). To do this we use a lexer and a parser generated by ANTLR, but the AST is our own (not an ANTLR-generated generic tree), which we build by listening to ANTLR parse events. The topmost node in this AST is called the module, and a module has child nodes called top-level nodes.
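As a rough illustration, assuming an ANTLR-generated BallerinaParserBaseListener (the rule and class names here are hypothetical), building our own AST node from a parse event looks something like this:

// Hypothetical sketch: building our own AST from ANTLR parse events.
// BallerinaParserBaseListener and the *Context classes would be generated
// by ANTLR from the grammar; ModuleNode and FunctionNode are our own AST.
class ASTBuilderListener extends BallerinaParserBaseListener {
    final ModuleNode module = new ModuleNode();

    @Override
    public void exitFunctionDefinition(BallerinaParser.FunctionDefinitionContext ctx) {
        FunctionNode func = new FunctionNode();
        func.name = ctx.Identifier().getText();
        module.addTopLevelNode(func); // goes into our tree, not ANTLR's parse tree
    }
}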

int globalVar = 9;

public function testFunc() returns int {
    int localVar = 5;
    return localVar;
}

So in the above example code, which has a global variable definition and a function, the function “testFunc” and the global variable “globalVar” are top-level nodes, whereas “localVar” is just a local variable of the function.

For each relevant node we create a symbol (a lightweight representation of the node which contains only the details relevant for type checking and linking, basically the name and type information of the node) and define it in the relevant scope. To explain this a bit more: we have a scope for each block level, so a module has its own scope, a function has its own scope, an if body has its own scope, and so on. These scopes are chained, so every child scope has a link to its parent scope. When looking up a symbol, we search through the scope chain starting from the current scope and go up until the symbol is found.
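A minimal sketch of those chained scopes and the lookup (hypothetical names, and with the type as a plain string for brevity):

// Minimal sketch of chained scopes (hypothetical names).
class Symbol {
    final String name;
    final String type; // the real compiler stores a type object, not a string
    Symbol(String name, String type) { this.name = name; this.type = type; }
}

class Scope {
    private final Scope parent; // null for the module scope
    private final java.util.Map<String, Symbol> entries = new java.util.HashMap<>();

    Scope(Scope parent) { this.parent = parent; }

    void define(Symbol sym) { entries.put(sym.name, sym); }

    // Walk the scope chain from the current scope up to the module scope.
    Symbol lookup(String name) {
        for (Scope s = this; s != null; s = s.parent) {
            Symbol sym = s.entries.get(name);
            if (sym != null) return sym;
        }
        return null; // "symbol not found"
    }
}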

Let’s take an example to explain this compilation flow. Look at the following sample BAL code.

int a = 9;

function test() {
    int b = 6;
    int c = b + a;
}

The parser will generate an AST as follows for the above code.

AST (Note: this only shows the high-level tree, for clarity)

Then this tree goes into the symbol enter phase, where we iterate through the top-level nodes and define their symbols in the module scope. In this example we have two such symbols: a global variable named “a” and a function named “test”. In this phase we don’t do any validation; we just define the symbols in the scope. The idea behind that is to allow cyclic references and to fix ordering issues. An example of an ordering issue is as follows.

int a = getA();

function getA() returns int {
    return 8;
}

As we iterate through the top-level nodes, at the time we reach the global variable definition statement, the “getA” function is not yet defined in the scope, so if we tried to validate the statement we would face a “symbol not found” issue for the “getA” function. Hence this phase does symbol entering only.
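A simplified sketch of the symbol enter phase, reusing the Scope sketch from above (hypothetical names):

// Sketch of the symbol enter phase: define every top-level symbol first,
// validate nothing yet (hypothetical names).
void defineTopLevelNodes(ModuleNode module, Scope moduleScope) {
    for (Node node : module.topLevelNodes) {
        // Only create and define the symbol; initializers and bodies are
        // analyzed in later phases, so "int a = getA();" can safely refer
        // to a function defined further down the file.
        moduleScope.define(createSymbol(node));
    }
}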

The next phase is the semantic analyzer, where we analyze statement by statement. Almost all statement nodes have two parts, an LHS and an RHS, separated by the equals sign.

So when analyzing the global variable definition, we first analyze its LHS and find the relevant type (in this case it is derived from the type node on the LHS to be “int”). Then we analyze the RHS expression and get its type, and type check whether the RHS type is assignable to the LHS type. That is the whole point of our semantic analyzer. (Of course there are more complex scenarios like type inference, symbol linking, etc., but essentially this is what we do.)
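The core of the phase boils down to a check like this (a simplified sketch with hypothetical names; the real type checker also handles inference, unions, and so on):

// Simplified sketch of statement-level type checking (hypothetical names).
void analyzeVariableDef(VariableDefNode node, Scope scope) {
    Type lhsType = resolveType(node.typeNode, scope); // e.g. "int"
    Type rhsType = analyzeExpr(node.rhsExpr, scope);  // type of the RHS expression
    if (!isAssignable(rhsType, lhsType)) {
        reportError(node.pos, "incompatible types: expected '" + lhsType
                + "', found '" + rhsType + "'");
    }
}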

Now comes the code analyzer phase. In this phase, we do several validations, such as the following:

  • Loop continuation statement validation
  • Function return validation / unreachable code validation
  • Worker send / receive validation
  • Experimental feature usage validation

I will explain here how we do return validation for the following code.

function test(boolean a, boolean b) returns int {
    if a { // 1st if
        return 9;
    } else if b { // 2nd else if
        int c = 8;
    } else { // 3rd else
        return 6;
    }
}

First of all, to get a better understanding, I have to explain how we model this if-else ladder in our AST. Our model only has if and else parts, so the AST we have here looks something like the below.

AST for if else code block

So the logic is: for an if-else ladder to be considered as returning on all paths, both the if and else branches should return.

We use a boolean (stmtReturns) to keep track of whether each path returns. At the beginning of the function, stmtReturns is false. We then analyze statement by statement until we reach the if-else statement. There, we analyze the if body and mark stmtReturns accordingly, save the current state of stmtReturns, reset the variable, and analyze the else body. Now we have the states of both the if path and the else path, so we set the final value of stmtReturns accordingly (if both paths are true, the final value is true; otherwise it is false). This goes on recursively, depending on the depth of the if-else ladder, and iteratively for every statement in the function body. With this, at the end of each function, we can determine whether all paths have return statements simply by looking at the value of stmtReturns.
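In Java-like pseudocode, the if-else handling looks roughly like this (a simplified sketch with hypothetical node classes):

// Simplified sketch of return-path analysis for an if-else node.
boolean stmtReturns = false;

void analyzeIfNode(IfNode node) {
    analyzeBlock(node.ifBody);           // sets stmtReturns for the if path
    boolean ifPathReturns = stmtReturns;

    stmtReturns = false;                 // reset before analyzing the else path
    if (node.elseBody != null) {
        analyzeBlock(node.elseBody);     // may recurse into a nested IfNode
    }
    boolean elsePathReturns = stmtReturns;

    // All paths return only if both the if path and the else path return.
    stmtReturns = ifPathReturns && elsePathReturns;
}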

Next comes the data flow analyzer, which is essentially responsible for detecting uses of uninitialized variables. Here we keep track of each variable (variable symbol) and its state (initialized or not) in a hashmap. At this point, variable definitions and usages are already linked through the variable symbol, so what we have to do is simple. We iterate through the statement list; when we find a variable definition, we put the variable symbol into the hashmap along with its state (if the definition has an RHS expression, the variable is initialized). When we find a variable reference, we look up the symbol in the hashmap and check its state. If it is an RHS reference and the current state is uninitialized, that is a semantic failure. If it is an LHS reference, we can update the variable’s state in the hashmap to initialized. This becomes a bit more complex with if-else ladders: there, we keep a separate hashmap for each block and merge them accordingly at the end (if one path already returns, there is no need to merge; we simply take the other path, else we merge).
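A simplified sketch of that bookkeeping, including the merge after an if-else (hypothetical names):

// Simplified sketch of uninitialized-variable tracking (hypothetical names).
enum VarState { UNINITIALIZED, INITIALIZED }

java.util.Map<Symbol, VarState> varStates = new java.util.HashMap<>();

void onVariableDef(VariableDefNode node) {
    varStates.put(node.symbol,
            node.rhsExpr != null ? VarState.INITIALIZED : VarState.UNINITIALIZED);
}

void onVarReference(VarRefNode ref, boolean isLhs) {
    if (isLhs) {
        varStates.put(ref.symbol, VarState.INITIALIZED); // assignment initializes
    } else if (varStates.get(ref.symbol) == VarState.UNINITIALIZED) {
        reportError(ref.pos, "variable '" + ref.symbol.name + "' may be uninitialized");
    }
}

// Merge the per-branch maps after an if-else: a variable counts as
// initialized only if it is initialized on every path that falls through.
VarState merge(VarState ifState, VarState elseState) {
    return (ifState == VarState.INITIALIZED && elseState == VarState.INITIALIZED)
            ? VarState.INITIALIZED : VarState.UNINITIALIZED;
}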

The next phase is the taint analyzer. I think it deserves a topic of its own, so I’m going to skip it in this post. (I will try to do another post on what the taint analyzer does and how it does it.) For now, just think of it as another phase which does some more validations on the AST.

The next phase is the desugar phase. This is an important phase, as we change the original AST here. It is the first place in the compiler front end where we rewrite the AST. The reason is that the AST just before this phase is the one used by our language server plugins. Those plugins need a semantically validated AST (so that they can show semantic errors as well), and that AST should accurately represent the BAL code, as the plugins need to be able to regenerate the BAL code from the AST.

For example if we have the below tuple destructure statement in the BAL file.

((a, b), c) = tupleVal; // (a - string, b - float, c - int)

We rewrite the above statement into something like the below.

any[] x = tupleVal;
string a = x[0][0];
float b = x[0][1];
int c = x[1];

What we do here is create new AST nodes and replace the previous nodes with them.
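Conceptually, desugar is another visitor, except that it returns a replacement node rather than just inspecting the one it visits. A simplified sketch (hypothetical names; the real rewrite handles nested tuples, types, and much more):

// Simplified sketch: desugar as a rewriting pass (hypothetical names).
Node rewrite(Node node) {
    if (node instanceof TupleDestructureNode) {
        TupleDestructureNode td = (TupleDestructureNode) node;
        // Build a block of simpler statements: a temp variable holding the
        // tuple value, plus one indexed-access assignment per member.
        BlockNode block = new BlockNode();
        block.add(tempVarDefFor(td.rhsExpr));
        for (int i = 0; i < td.variables.size(); i++) {
            block.add(indexedAssignment(td.variables.get(i), i));
        }
        return block; // this block replaces the original node in the tree
    }
    return node;
}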

If I take another example, with the lock statement, before desugaring it looks as follows.

lock {
    a = a + 7;
}

After desugaring, it becomes something like the below.

lock a;
var res = trap a = a + 7;
unlock a;
if (res is error) {
    panic res;
}

As you can see, we even create nodes which are not possible with the actual language syntax.

Now comes another important phase called the BIRGen phase, where we convert the current AST into a BIR tree. So what is the difference between the AST and the BIR?

For clarity, I will explain with a sample.

function test(int a) returns int {
    int retVal = 0;
    if a > 10 {
        retVal = 10;
    } else {
        retVal = 8;
    }
    return retVal;
}

For the above BAL code, the AST would look something like the following.

AST for a func with if else

The BIR of that would look as follows.

BIR representation of above tree

As you can see, the BIR tree is friendlier in the sense of data flow. Where the AST is just a node tree, the BIR tree can easily be used for data flow analysis (we had plans to move our current data flow analyzer on top of the BIR tree, however we haven’t done that yet). The BIR tree is basically a tree of basic blocks (marked BB0, BB1, etc.). Within a basic block, what we have is a list of instructions and one terminator. Even though the terminator is also an instruction, it is the terminating instruction of the basic block (the last instruction of the block). For example, CONSTANT_LOAD is a normal instruction, whereas RETURN and BRANCH instructions are terminators.
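In terms of data structures, a basic block looks roughly like this (a simplified sketch with hypothetical names, not the real BIR model):

// Simplified sketch of the BIR structures (hypothetical names).
class BirBasicBlock {
    final String id;                       // e.g. "BB0"
    final java.util.List<BirInstruction> instructions = new java.util.ArrayList<>();
    BirTerminator terminator;              // exactly one, always last
    BirBasicBlock(String id) { this.id = id; }
}

abstract class BirInstruction { }          // e.g. CONSTANT_LOAD, MOVE

abstract class BirTerminator extends BirInstruction {
    // A terminator ends the block and names its successors, which is what
    // makes control flow explicit and data flow analysis straightforward.
    java.util.List<BirBasicBlock> successors = new java.util.ArrayList<>();
}

class Branch extends BirTerminator { }     // conditional jump: two successors
class Return extends BirTerminator { }     // no successors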

The next phase is the BIR optimizer, which optimizes away unwanted instructions and temporary variables. The BIR we have at the end is platform neutral and generic. We then serialize this BIR with our own serializer and feed the serialized byte array to the back end code generators. I’m not going to explain back end code generation in detail because the post is already too long :)

But to give you a general idea: we read the serialized byte array and build the BIR model again. Then we feed that to a code generator (at the moment we have two code generators, one generating machine code directly and the other generating Java bytecode; here I’m referring to the JVM bytecode generator, as the machine code generator is not yet in a releasable state), and the code generator emits JVM bytecode by walking the BIR tree. This whole phase is written in the Ballerina language itself, which in turn is a good indicator that the language is stable enough to write complex code in.

I hope this is enough for a single post, and that it is helpful for people who are eager to learn about compilers and particularly interested in Ballerina. :)
