Peering into the Ballerina Intermediate Representation

Thushara Piyasekara
Ballerina Swan Lake Tech Blog
10 min read · Jun 29, 2024

This article is written for Ballerina Swan Lake Update 9 (version 2201.9.0) of the jballerina compiler. While the core functionality described here may remain applicable in future versions, some internal implementations might change.

What is an intermediate representation?

In compiler theory, an intermediate representation (IR) is any representation that sits between the source language and the target language. Because an IR is simpler than the source language, it makes compiler optimizations easier to implement. Furthermore, one IR can be used for code generation into multiple targets. This allows compiler developers to do general optimizations at the IR level and target-specific optimizations at the code generation level. A given compiler design can have one or more intermediate representations.
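To make the "one IR, several targets" idea concrete, here is a minimal Python sketch. It is not related to how BIR is implemented: the three-address tuple format, the `fold_constants` pass, and both emitters are hypothetical names invented for this illustration. It only shows that an optimization written once at the IR level benefits every backend.

```python
from typing import List, Tuple, Union

# Toy three-address IR (an assumption for this sketch, not BIR):
# each instruction is (dest, op, left, right).
Operand = Union[int, str]  # integer literal or variable name
Instr = Tuple[str, str, Operand, Operand]


def fold_constants(ir: List[Instr]) -> List[Instr]:
    """Target-independent pass: replace 'add' of two literals with a constant."""
    folded = []
    for dest, op, left, right in ir:
        if op == "add" and isinstance(left, int) and isinstance(right, int):
            folded.append((dest, "const", left + right, 0))
        else:
            folded.append((dest, op, left, right))
    return folded


def emit_stack_code(ir: List[Instr]) -> List[str]:
    """Backend 1: emit code for an imaginary stack machine."""
    def push(operand: Operand) -> str:
        return f"push {operand}" if isinstance(operand, int) else f"load {operand}"

    lines = []
    for dest, op, left, right in ir:
        if op == "const":
            lines += [push(left), f"store {dest}"]
        elif op == "add":
            lines += [push(left), push(right), "add", f"store {dest}"]
        else:
            raise ValueError(f"unknown op: {op}")
    return lines


def emit_c_like(ir: List[Instr]) -> List[str]:
    """Backend 2: emit C-like assignment statements."""
    lines = []
    for dest, op, left, right in ir:
        if op == "const":
            lines.append(f"{dest} = {left};")
        elif op == "add":
            lines.append(f"{dest} = {left} + {right};")
        else:
            raise ValueError(f"unknown op: {op}")
    return lines


program = [("t0", "add", 1, 2), ("t1", "add", "t0", "x")]
optimized = fold_constants(program)  # runs once, before either backend
```

Both `emit_stack_code(optimized)` and `emit_c_like(optimized)` see the already folded constant, which is the benefit the paragraph above describes.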

What is Ballerina Intermediate Representation (BIR)?

Ballerina Intermediate Representation (BIR) is the intermediate representation used in the jballerina compiler (which compiles Ballerina source into JVM bytecode) and the nballerina compiler (which compiles Ballerina source into LLVM IR). Even though the BIR formats used by these two compilers are different, the core concept remains the same.

This article will peer into the architecture and internal implementation of the BIR used by the jballerina compiler.

(The BIR phase is the very last phase of the Ballerina frontend and serves as the input for the backend)

BIR is generated from the desugared abstract syntax tree (known as BLangPackage inside the jballerina compiler). In the desugar phase, the compiler simplifies high-level constructs into low-level ones. It even generates new functions, type definitions, and certain boilerplate code snippets used for code generation. In other words, it acts as the foundation for the bridge that connects Ballerina source code and JVM bytecode. Because of newly introduced constructs such as anonymous types and lambda functions, BIR is vastly different from the initial Ballerina source code.
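As a rough analogy of what desugaring does, the Python sketch below rewrites a foreach-style loop into explicit iterator calls. It mirrors the shape of the addAll function in the BIR dump later in this article (a call to `next`, a type test on the result, then a read of the `value` field), but the `ArrayIterator` class and both function names are hypothetical stand-ins, not the compiler's actual output.

```python
from typing import Dict, List, Optional


class ArrayIterator:
    """Hypothetical stand-in for the iterator object returned by iterator()."""

    def __init__(self, items: List[int]) -> None:
        self._items = items
        self._index = 0

    def next(self) -> Optional[Dict[str, int]]:
        # Returns a record with a "value" field, or None (Ballerina's nil)
        # when the sequence is exhausted.
        if self._index >= len(self._items):
            return None
        value = self._items[self._index]
        self._index += 1
        return {"value": value}


def add_all_high_level(nums: List[int]) -> int:
    """The loop as the user writes it."""
    total = 0
    for num in nums:
        total += num
    return total


def add_all_desugared(nums: List[int]) -> int:
    """The same loop with the iteration protocol spelled out."""
    total = 0
    iterator = ArrayIterator(nums)
    while True:
        result = iterator.next()   # like the next(...) call in bb2
        if result is None:         # like the "is" type test in bb3
            break
        num = result["value"]      # like the field access in bb4
        total += num
    return total
```

Both functions return the same sum; the second one simply exposes the low-level steps that the high-level construct hides.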

Let’s see how we can view the BIR in a human-readable format.

Viewing BIR

BIR can be viewed in a human-readable format in two ways:

  1. Using the `bal build --dump-bir` command.
  2. Debugging the compilation process and inspecting the `BIRPackage`.

Using the `bal build --dump-bir` command

See the following example Ballerina source code:

import ballerina/io;

public function main() {
    int a = add(1, 2);
    boolean b = isEven(3);
    int[] intArr = [1, 2, 3, 4];
    int c = addAll(intArr);
    io:print("BIR is cool");
}

function add(int num1, int num2) returns int {
    return num1 + num2;
}

function isEven(int num) returns boolean {
    if (num % 2 == 0) {
        return true;
    } else {
        return false;
    }
}

function addAll(int[] nums) returns int {
    int total = 0;
    foreach int num in nums {
        total += num;
    }
    return total;
}

To view the BIR in textual format, we can use the following command:

bal build --dump-bir

It will emit the following output in the CLI:

================ Emitting Module ================
module thushara_piyasekara/SampleBIR v 0.1.0;

import ballerina/io v 1.6.0;


$annotation_data map<any>;


public .<init> function() -> error{map<ballerina/lang.value:0.0.0:Cloneable>}|() {
%0(RETURN) error|();
%1(TEMP) typeDesc<any|error>;

bb0 {
%1 = newType map<any>;
$annotation_data = NewMap %1{};
%0 = ConstLoad 0;
GOTO bb1;
}
bb1 {
return;
}
}

public .<start> function() -> error|() {
%0(RETURN) error|();

bb0 {
%0 = ConstLoad 0;
GOTO bb1;
}
bb1 {
return;
}
}

public .<stop> function() -> () {
%0(RETURN) ();

bb0 {
%0 = ConstLoad 0;
GOTO bb1;
}
bb1 {
return;
}
}

public main function() -> () {
%0(RETURN) ();
%1(LOCAL) int;
%2(SYNTHETIC) int;
%4(SYNTHETIC) int;
%9(LOCAL) boolean;
%10(SYNTHETIC) int;
%14(LOCAL) int[];
%16(TEMP) int;
%17(TEMP) int;
%18(TEMP) int;
%19(TEMP) int;
%20(TEMP) int;
%21(LOCAL) int;
%22(SYNTHETIC) int[];
%26(TEMP) typeRefDesc<>[];
%28(TEMP) ballerina/io:1.6.0:Printable;
%29(TEMP) ballerina/io:1.6.0:Printable;
%30(TEMP) string;
%31(TEMP) ();

bb0 {
%2 = ConstLoad 1;
%4 = ConstLoad 2;
%1 = add(%2, %4) -> bb1;
}
bb1 {
%10 = ConstLoad 3;
%9 = isEven(%10) -> bb2;
}
bb2 {
%16 = ConstLoad -1;
%17 = ConstLoad 1;
%18 = ConstLoad 2;
%19 = ConstLoad 3;
%20 = ConstLoad 4;
%14 = newArray int[][%16]{%17,%18,%19,%20};
%22 = %14;
%21 = addAll(%22) -> bb3;
}
bb3 {
%16 = ConstLoad -1;
%30 = ConstLoad BIR is cool;
%29 = <ballerina/io:1.6.0:Printable> %30;
%28 = <ballerina/io:1.6.0:Printable> %29;
%26 = newArray typeRefDesc<>[][%16]{%28};
%31 = print(%26) -> bb4;
}
bb4 {
%0 = ConstLoad 0;
GOTO bb5;
}
bb5 {
return;
}
}

add function(int, int) -> int {
%0(RETURN) int;
%1(ARG) int;
%2(ARG) int;

bb0 {
%0 = %1 + %2;
GOTO bb1;
}
bb1 {
return;
}
}

isEven function(int) -> boolean {
%0(RETURN) boolean;
%1(ARG) int;
%3(TEMP) int;
%4(TEMP) int;
%6(TEMP) boolean;

bb0 {
%3 = ConstLoad 2;
%4 = %1 % %3;
%3 = ConstLoad 0;
%6 = %4 == %3;
%6? bb1 : bb2;
}
bb1 {
%0 = ConstLoad true;
GOTO bb3;
}
bb2 {
%0 = ConstLoad false;
GOTO bb3;
}
bb3 {
return;
}
}

addAll function(int[]) -> int {
%0(RETURN) int;
%1(ARG) int[];
%2(LOCAL) int;
%4(SYNTHETIC) int[];
%6(SYNTHETIC) ballerina/lang.array:0.0.0:$anonType$return$iterator$_0;
%7(SYNTHETIC) typeRefDesc<>[];
%12(TEMP) boolean;
%13(SYNTHETIC) ballerina/lang.array:0.0.0:$anonType$return$next$return$iterator$_0|();
%18(LOCAL) int;
%19(TEMP) map<any|error>;
%21(TEMP) string;

bb0 {
%2 = ConstLoad 0;
%4 = %1;
%7 = <typeRefDesc<>[]> %4;
%6 = iterator(%7) -> bb1;
}
bb1 {
GOTO bb2;
}
bb2 {
%13 = $anonType$return$iterator$_0.next(%6, %6) -> bb3;
}
bb3 {
%12 = %13 is ballerina/lang.array:0.0.0:$anonType$return$next$return$iterator$_0;
%12? bb4 : bb5;
}
bb4 {
%19 = <map<any|error>> %13;
%21 = ConstLoad value;
%18 = %19[%21];
%2 = %2 + %18;
GOTO bb2;
}
bb5 {
%0 = %2;
GOTO bb6;
}
bb6 {
return;
}
}

================ Emitting Module ================

At first glance, the output might seem daunting, but it’s simple once you get going. Let’s try to understand it.

The output starts with:

================ Emitting Module ================
module thushara_piyasekara/SampleBIR v 0.1.0;

import ballerina/io v 1.6.0;

Here, the output begins with the name of the module whose BIR is being emitted. Since a single Ballerina project can have multiple modules, this information is needed to distinguish between them.

Note that modules loaded from the cache are not included in the emitted BIR. This is because cached Ballerina packages are packed directly into the final executable JAR to save compilation time. Since no compilation occurs for cached modules, no BIR is generated for them.

If we need to emit the BIR for all modules, we have to clean the cache found inside the <USER_HOME>/.ballerina directory. In our case, only the root module is emitted because the ballerina/io module is already cached, which simplifies the output.

$annotation_data map<any>;

Next, we see the global variables that are present during the BIR phase. The source code has no global variables, but a global variable called $annotation_data of type map<any> is present in the emitted BIR. It is generated by the desugar phase to hold the annotation data related to types.

When working with the BIR, it is common to see such generated constructs (they usually start with the “$” prefix for encoding purposes). It is generally safe to disregard constructs that start with the “$” prefix.

Next, we can see the functions found in the BIR phase. Even though we defined only four functions (main, add, isEven, addAll), we see seven functions in total, including the ones we defined.

public .<init> function() -> error{map<ballerina/lang.value:0.0.0:Cloneable>}|() {
....
}

public .<start> function() -> error|() {
....
}

public .<stop> function() -> () {
....
}

public main function() -> () {
....
}

add function(int, int) -> int {
....
}

isEven function(int) -> boolean {
....
}

addAll function(int[]) -> int {
....
}

The reason is the same as before: the additional functions are generated by the compiler. As their names suggest, they are used for module initialization, start, and stop at runtime.

Let’s look into the content of a function we defined, the main() function.

At the top of the function body, we can see the local variables. The variable names have been replaced with numbers carrying the “%” prefix.

The reason is that at the BIR level there is no need to keep the user-defined variable names, so they can be replaced with more compact integers. Each variable is marked with its BIR VarKind and its BType. For instance, the VarKind of %1 is LOCAL and its BType is int. VarKind reflects the nature of the variable. SYNTHETIC and TEMP variables are generated by the compiler during the desugar phase, while LOCAL variables are the ones defined by the user. As the dump shows, ARG marks function parameters and RETURN marks the slot that holds the return value.
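The declaration lines are regular enough to be picked apart mechanically. The following Python sketch parses one declaration into its index, VarKind, and type. The line format is inferred only from the dump shown in this article, so treat the regular expression as an assumption rather than a specification of the emitter's output; `VarDecl` and `parse_var_decl` are names made up for this example.

```python
import re
from typing import List, NamedTuple


class VarDecl(NamedTuple):
    index: int       # the number after "%"
    kind: str        # VarKind, e.g. LOCAL, TEMP, SYNTHETIC, ARG, RETURN
    type_name: str   # the type as printed in the dump


# Assumed shape: %<number>(<KIND>) <type>;
_DECL_PATTERN = re.compile(r"^%(\d+)\((\w+)\)\s+(.+);$")


def parse_var_decl(line: str) -> VarDecl:
    """Parse a line such as '%1(LOCAL) int;' into a VarDecl."""
    match = _DECL_PATTERN.match(line.strip())
    if match is None:
        raise ValueError(f"not a variable declaration: {line!r}")
    return VarDecl(int(match.group(1)), match.group(2), match.group(3))


def user_defined(decls: List[VarDecl]) -> List[VarDecl]:
    """Keep only the variables the user wrote (VarKind LOCAL)."""
    return [decl for decl in decls if decl.kind == "LOCAL"]


# A few declarations copied from the main function in the dump.
main_decls = [
    parse_var_decl(line)
    for line in [
        "%0(RETURN) ();",
        "%1(LOCAL) int;",
        "%2(SYNTHETIC) int;",
        "%14(LOCAL) int[];",
        "%28(TEMP) ballerina/io:1.6.0:Printable;",
    ]
]
```

Filtering with `user_defined(main_decls)` leaves only %1 and %14, which correspond to the variables `a` and `intArr` in the source.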

Next, we have the basic blocks. A basic block is a sequence of instructions with no branches except at its end. Basic blocks are chained with GOTO instructions, and a function call also names the block where execution continues (for example, `-> bb1`). For branching, a condition determines which basic block is chosen next.

We can see branching in action in the isEven function:

isEven function(int) -> boolean {
    %0(RETURN) boolean;
    %1(ARG) int;
    %3(TEMP) int;
    %4(TEMP) int;
    %6(TEMP) boolean;

    bb0 {
        %3 = ConstLoad 2;
        %4 = %1 % %3;
        %3 = ConstLoad 0;
        %6 = %4 == %3;
        %6? bb1 : bb2;
    }
    bb1 {
        %0 = ConstLoad true;
        GOTO bb3;
    }
    bb2 {
        %0 = ConstLoad false;
        GOTO bb3;
    }
    bb3 {
        return;
    }
}

In bb0 of the above function, we evaluate a boolean and, based on its value, jump to either bb1 or bb2 in ternary operator style.

%6? bb1 : bb2;

Both bb1 and bb2 then jump to bb3, which returns the true or false value stored in %0.

    bb1 {
        %0 = ConstLoad true;
        GOTO bb3;
    }
    bb2 {
        %0 = ConstLoad false;
        GOTO bb3;
    }
    bb3 {
        return;
    }
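To see how control moves between basic blocks, here is a toy Python interpreter for the isEven blocks above. The tuple encoding of instructions and terminators is invented for this sketch and is not how jballerina stores BIR; only the block contents are transcribed from the dump.

```python
from typing import Any, Dict, List, Tuple

# Assumed encoding for this sketch:
#   instructions: ("const", dest, value), ("mod", dest, a, b), ("eq", dest, a, b)
#   terminators:  ("goto", target), ("branch", cond, if_true, if_false), ("return",)
Block = Tuple[List[tuple], tuple]

IS_EVEN: Dict[str, Block] = {
    "bb0": (
        [("const", 3, 2), ("mod", 4, 1, 3), ("const", 3, 0), ("eq", 6, 4, 3)],
        ("branch", 6, "bb1", "bb2"),
    ),
    "bb1": ([("const", 0, True)], ("goto", "bb3")),
    "bb2": ([("const", 0, False)], ("goto", "bb3")),
    "bb3": ([], ("return",)),
}


def run(blocks: Dict[str, Block], args: Dict[int, Any]) -> Tuple[Any, List[str]]:
    """Execute blocks starting at bb0; return (%0, list of visited blocks)."""
    regs = dict(args)   # register number -> value, e.g. {1: 3} for %1 = 3
    current = "bb0"
    trace = []
    while True:
        trace.append(current)
        instructions, terminator = blocks[current]
        for ins in instructions:
            op, dest = ins[0], ins[1]
            if op == "const":
                regs[dest] = ins[2]
            elif op == "mod":
                # Python's % may differ in sign from Ballerina's for negative
                # operands, but a remainder of zero is zero either way.
                regs[dest] = regs[ins[2]] % regs[ins[3]]
            elif op == "eq":
                regs[dest] = regs[ins[2]] == regs[ins[3]]
            else:
                raise ValueError(f"unknown instruction: {op}")
        if terminator[0] == "goto":
            current = terminator[1]
        elif terminator[0] == "branch":
            current = terminator[2] if regs[terminator[1]] else terminator[3]
        elif terminator[0] == "return":
            return regs[0], trace
        else:
            raise ValueError(f"unknown terminator: {terminator[0]}")
```

Calling `run(IS_EVEN, {1: 3})` walks bb0, bb2, bb3, while `run(IS_EVEN, {1: 4})` walks bb0, bb1, bb3. In both cases the result is whatever was last stored in %0.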

Debugging the compilation process and inspecting the `BIRPackage`

Trying to understand the BIR through the emitted text works for simple instructions. But for complex scenarios, a compiler developer’s best friend is the debugger. We can remote debug the compilation against the ballerina-lang codebase by setting the BAL_JAVA_DEBUG=<port number> environment variable, and then visualize the BIR through the debugger’s value inspection.

We will put a breakpoint at the following location inside the CompilerPhaseRunner class:

(The birGen phase generates the BIR using the AST)

We can start the remote debug session by running the bal build command and then attaching to it from the IDE. When the breakpoint is hit, we can evaluate the following object to access the BIRPackage:

(Inspecting the BIRNode.BIRPackage object of the root module)

Since BIR is designed to be a graph-based IR, when we emit it using the --dump-bir option, the compiler converts the graph-based data structure into a linear textual representation. Certain information is lost during this conversion. Therefore, it is better to use the debugger to inspect the BIR.
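In the textual dump, the graph edges survive only implicitly, in the last line of each basic block. The Python sketch below recovers the control-flow edges of isEven from those lines. It handles just the terminator forms that appear in this article's dump (GOTO, conditional branch, call with `->`, and return); the function names are hypothetical, and the real in-memory structure carries more than this.

```python
import re
from typing import Dict, List


def successors(terminator: str) -> List[str]:
    """Return the blocks a terminator line can jump to."""
    text = terminator.strip().rstrip(";")
    if text == "return":
        return []
    if text.startswith("GOTO "):
        return [text[len("GOTO "):].strip()]
    branch = re.match(r"^%\d+\?\s*(\w+)\s*:\s*(\w+)$", text)
    if branch:
        return [branch.group(1), branch.group(2)]
    call = re.search(r"->\s*(\w+)$", text)
    if call:
        return [call.group(1)]
    raise ValueError(f"unrecognized terminator: {terminator!r}")


def build_cfg(terminators: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each block to the list of blocks it can reach directly."""
    return {block: successors(line) for block, line in terminators.items()}


def predecessors(cfg: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Invert the edges: which blocks can jump into each block?"""
    preds: Dict[str, List[str]] = {block: [] for block in cfg}
    for block, targets in cfg.items():
        for target in targets:
            preds[target].append(block)
    return preds


# Last line of each basic block of isEven, copied from the dump.
IS_EVEN_TERMINATORS = {
    "bb0": "%6? bb1 : bb2;",
    "bb1": "GOTO bb3;",
    "bb2": "GOTO bb3;",
    "bb3": "return;",
}

cfg = build_cfg(IS_EVEN_TERMINATORS)
```

Here `predecessors(cfg)["bb3"]` lists both bb1 and bb2, which is the "branches join again" shape that is easy to see in a graph view and has to be reconstructed from the linear text.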

Usages of BIR

In the jballerina compiler, BIR is used for two tasks:

  1. Bytecode generation in the codegen phase.
  2. Loading the symbols for cached modules.

The first task is straightforward. The jballerina frontend produces the BIR as the input for the backend, where code generation happens. In the backend, bytecode constructs are generated in a nearly 1:1 ratio for each BIR node.

The second task is connected with the Ballerina package caching process. The Ballerina compiler reuses already compiled thin JARs of Ballerina packages. These thin JARs are stored inside the .ballerina folder in the file system. The BIR for each package is stored alongside the thin JARs in binary format, written by the BIRBinaryWriter. We will come back to the usage of these binary BIR files shortly.

When a new compilation happens, the Ballerina compiler first generates a dependency graph using PackageResolution. Next, for each dependency (for example, ballerina/io or ballerina/http), the compiler checks whether a compatible cached thin JAR is available in the .ballerina cache. If so, the compiler does not compile that module. It compiles only the non-cached modules and packs the cached thin JARs when generating the final executable JAR.
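The decision described above can be sketched in a few lines of Python. This is a simplified model, not the logic of PackageResolution: the functions `resolve_order` and `plan_build` are hypothetical, the compatibility check is reduced to set membership, and the dependency of ballerina/io on ballerina/lang.value in the sample graph is assumed purely for illustration.

```python
from typing import Dict, List, Set, Tuple


def resolve_order(graph: Dict[str, List[str]], root: str) -> List[str]:
    """Depth-first walk that lists every module after its dependencies."""
    order: List[str] = []
    seen: Set[str] = set()

    def visit(module: str) -> None:
        if module in seen:
            return
        seen.add(module)
        for dependency in graph.get(module, []):
            visit(dependency)
        order.append(module)

    visit(root)
    return order


def plan_build(order: List[str], cached: Set[str]) -> Tuple[List[str], List[str]]:
    """Split modules into (to_compile, to_reuse) based on the cache."""
    to_compile = [module for module in order if module not in cached]
    to_reuse = [module for module in order if module in cached]
    return to_compile, to_reuse


# Hypothetical dependency graph for the sample project.
graph = {
    "thushara_piyasekara/SampleBIR": ["ballerina/io"],
    "ballerina/io": ["ballerina/lang.value"],
}
cached = {"ballerina/io", "ballerina/lang.value"}

order = resolve_order(graph, "thushara_piyasekara/SampleBIR")
to_compile, to_reuse = plan_build(order, cached)
```

With this input only the root module ends up in `to_compile`, which matches the earlier observation that the dump contained BIR for the root module alone.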

So where does BIR come into play in this process? When packing the executable JAR, we need to link the references from its imported modules. That requires certain information from the previous compilations. This information is available in the BIR phase and, consequently, in the serialized binary files inside the .ballerina folder. The compiler reads back the needed information from the BIR files and uses it to link the cached thin JARs with the rest of the compiled JARs. The logic for reading back the serialized BIR can be found in the BIRPackageSymbolEnter class.

(BIR is used for loading the BSymbols of cached modules)

Future of BIR

As of now, we write only about 60% of the BIR information to the cache and read back only about 20%. Fully serializing and deserializing BIR could be an interesting research project. Full serialization could open new doors for compiler-level optimizations. The Ballerina compiler team has a roadmap for further runtime performance optimization using techniques used in LLVM IR.

Since its first version, BIR has been refactored, optimized, and rewritten throughout the development of Ballerina. This iterative process ensures that BIR remains performant, efficient, and adaptable. We can expect this continuous improvement to be a hallmark of BIR in future versions of Ballerina, solidifying its role as a cornerstone technology for building integrations.
