Ballerina Compiler — Design
Ballerina is an open source programming language engineered from day one for developing today’s microservices and other solutions that perform more and more networked interactions.
Ballerina understands JSON, XML, SQL, etc at the type system level allowing programs to easily manipulate these data formats. Ballerina comes with a powerful set of language constructs, types, and built-in libraries to help you with building network resilient or chaos ready programs.
In this post, I will share my experience in developing the Ballerina compiler. It is the output of a collective effort of an awesome team at WSO2 :) Before I dive into the Ballerina compiler design specifics, let me briefly introduce compilers.
What is a compiler
It is a software program that translates source code written in one programming language(source) to another programming language(target). Refer Wikipedia article for more information.
Ballerina compiler
Ballerina compiler transforms source code written in Ballerina to a platform-independent intermediate representation called Ballerina bytecode.
Single-pass vs. multi-pass compilers
A single-pass compiler traverses through the source code or AST of a program only once, whereas a multi-pass compiler traverses through the AST multiple times. There are plenty of articles that you can refer to get more details of these compilers.
At the start of this project, Ballerina language was relatively simpler. We could implement the compiler with just two phases: the first phase builds the abstract syntax tree (AST) with the help of the generated parser, the second phase analyzes the semantic properties of the program by traversing the AST once. Life was simple those days :)
Ballerina language team, lead by Dr. Sanjiva Weerawarna, added features to the language one after the other and made the language to grow in complexity. The two-phase design was no longer a practical one. These two phases became two monolithic big pieces of the compiler, and it was harder to maintain or refactor to add new features. Back in mid-August 2017, we started redesigning the Ballerina compiler to overcome the current limitations. It took more than two months for us to complete this effort. Releases got dragged, other developers were blocked on us, and we were under great pressure to deliver the new compiler before the WSO2Con 2017. Finally, we completed it a few weeks before the conference.
We split compiler’s job into multiple independent phases. Each phase is responsible for one or more related functions. You can find the list of phases below.
- Lexical analysis, parsing —Check for syntactic errors. Builds the AST and define the package-level symbols in the package scope.
- Semantic analysis —Define and resolve symbols, check for type errors, etc.
- Code analysis — Perform basic flow analysis.
- Desugar — Rewrite the AST node (if required) to remove syntactic sugar.
- Code generation — Generate Ballerina bytecode from AST
1. Lexical analysis and parsing
This phase is responsible for Checking syntax errors and building the AST while defining package-level symbols.
The first task is to write down your language grammar which is a set of rules written in a formal language. Ballerina grammar describes how to form Ballerina language constructs according its syntax. These rules do not describe the meaning or the semantic properties of the language constructs. We used a tool called ANTLR to generate the lexical analyzer and the parser code from the grammar. ANTLR is a well-known parser generator and it is used widely over other similar tools.
This generated parser takes the Ballerina source code as the input and fires events as the output. There are other variations as well, but I am not going to talk about them here. So we listen to these events and build the abstract syntax tree gradually. Following image shows how we build the struct tree nodes. You can find the complete class here.
Then Compiler parses and builds the AST of the entry package which is the package in which entry points are defined. Ballerina has two entry points: main function and services. I.e, a Ballerina program can get the execution control either via the main function or services. Once the compiler builds the AST of the entry package, it starts resolving the imports of the entry package one by one. During this phase, the compiler builds the complete AST of the program with imported packages. The imported package tree is a DAG with the entry package being the root node.
All subsequent phases of the compiler visit the complete AST starting from the entry package. Each phase performs a depth-first traversal of the tree.
2. Semantic analysis
This phase analyzes the complete program for semantic errors. Here are some of the tasks performed by the Semantic analyzer.
Define symbols in the corresponding scopes.
Usually, compilers use the term ‘symbol“ for identifiers such as variables, functions, structs. Symbols carry information such as the name, type and the location of the particular entity. E.g. variable symbol contains the variable name, the type of the variable, it’s scope, etc.
Package-level variables are defined in the package scope. Local variables are defined in the corresponding block or function scope. e.g. A variable defined inside an “if” block is defined in the block scope of “if”. Semantic analyzer reports an error if it finds a duplicate variable definition.
Associate referred symbols with their definitions.
Same as symbol resolving. Consider the following example.
function test () {
int a = 10;
int b = a + 1;
}
Here, the semantic analyzer defines the variable “a” in the function scope. The next statement refers the variable “a” . Now the semantic analyzer binds the variable definition of “a” with this variable reference.
Type checking
Type checking is the process of the verifying and enforcing the constraints on types defined in a language. I’ll take the following example to explain this.
int a = "sam";
Here, we are trying to assign a string value to a variable of type “int”. The type checker is responsible for detecting and reporting such errors.
3. Code analysis
Following is a subset of errors that this phase detects. We are planning to implement data-flow analysis in the future.
- Unused imports
- Import package cycles
- Unreachable code blocks
- Functions that do not return always
Consider the following sample. You would notice that this function does not always return.
function getFullname(string fname, string lname) {
if (fname != "" && lname != "" ) {
return fname + " " + lname;
} else if (fname != "" ) {
} else {
return lname;
}
}
4. Desugar
This phase removes the syntactic sugar from the AST by rewriting certain parts of the tree. We could also perform code optimization during this phase. We have plans to work on code optimization in the future. I’ll try to explain the functionality of this phase with the following sample.
Consider the following JSON example. You can see that there are two ways to add a name/value to a JSON object in Ballerina.
json j = {};
j.fname = "Sameera";
j["lname"] = "Jayasoma"
Desugar phase detects such scenarios and converts them to a single format.
5. Code generation
This is the last phase of the Ballerina compiler. This phase transforms the AST to another intermediate language understood by the Ballerina virtual machine(BVM). This intermediate language is called Ballerina bytecode. You can find more details of Ballerina bytecode and BVM in this article.
This phase produces an executable Ballerina program (balx file).
You can engage with the Ballerina community in following ways.
- Ballerina home page
- Ballerina by examples
- Fork Ballerina on GitHub
- Slack channel — Request an invite
- Ballerina user group
- Stack Overflow