Peeking into Huawei’s OpenArk Compiler
It’s not fully open-sourced and it lacks (English) documentation, but it’s designed to be the next-generation Common Language Runtime and JVM
Huawei just revealed a new compiler and runtime framework whose mission is similar to what the Java VM, the .NET Common Language Runtime, and many other frameworks such as GraalVM have been trying to accomplish for decades: running multiple programming languages in one place.
Currently only part of the source code has been released: the compiler IR (Intermediate Representation) specs and the implementations of some code optimizations. Unfortunately, most of the documentation so far is written in Chinese (they’re planning to release an English version in the fourth quarter of 2019). So this article gives an overview, in English, of what we know about this project so far, and lists some questions that people might want answered in the future.
Resources
Currently the source code is released here:
https://code.opensource.huaweicloud.com/HarmonyOS/OpenArkCompiler/home
Mirror : https://gitee.com/harmonyos/OpenArkCompiler
Here is the official website (Chinese): https://www.openarkcompiler.cn/home
High Level Architecture
Figure 1 shows the high-level architecture of the compiler (this figure was translated from this image in one of OpenArk’s official documentation pages).
There are primarily two takeaways from this picture:
- As with the JVM and CLR, different programming languages are translated into the same compiler IR. Optimizations and code generation are performed on top of that.
- Unlike the JVM and CLR, OpenArk is designed to emit native executables (with the necessary runtime libraries) by default.
As mentioned earlier, the source code is not available yet for either the frontend, which translates Java bytecode or other programming languages into OpenArk IR, or the backend, which does the native codegen. But their (64-bit Linux) executable binaries are available in the repo.
From another image, we can further divide the compiler into three phases, listed in execution order: the “M2M”, the mid-end, and the backend. Here are the components of each:
The “M2M” Phase (all components are open sourced)
Language-specific Lowering, VTable Generation, Exception Handling, and Class-level analysis[1].
The Mid-end (only bold components are open sourced)
SSA Construction, Reference Counting (RC) Insertion, Alias Analysis, “mplt handling”, RC Optimizations, Partial Redundancy Elimination (PRE), Inlining, Side-effect Analysis, De-virtualization, Null Pointer Elimination, Dead Code Elimination (DCE), Boundary Check Elimination, Escape Analysis, Copy Propagation, “Cross-Language Optimization”, etc.
The Backend (none of the components are open sourced)
Stack Allocation, Control Flow Optimization, “EBO” Optimization, Peephole, Register Allocation (RA), etc.
Okay, we know the backend, and we know most of the mid-end optimizations, but…what exactly is the “M2M” phase?
It turns out that there are actually two tiers of compiler IR: MAPLE IR[2] is used in the M2M (“Maple to Maple” optimization, I guess) phase, and the other IR (they just call it Me) is used for most of the mid-end optimizations. Now let’s dig into them.
Compiler IR Design
As we mentioned, there are two tiers of IR: MAPLE and Me. There is pretty decent documentation and a spec for MAPLE, but basically no documentation for Me.
MAPLE is a high-level IR that represents concepts close to the original source code. There are three important building blocks:
- Leaf Nodes represent constants, or the address or identifier of a “storage” (e.g. a chunk of memory).
- Expression Nodes evaluate new values based on their operands. They will not generate any side-effects.
- Statement Nodes are usually used to represent control flow, such as loops and branches, or modifications on storage units (e.g. assigning values to a storage).
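To make the three kinds of nodes above more concrete, here is a minimal sketch in Python. Every class and function name in it (Const, VarRead, Add, Assign, If, evaluate, execute) is made up for illustration; these are not MAPLE’s actual opcodes or OpenArk’s C++ classes. It only shows the division of labor: leaves and expressions produce values without side effects, while statements change storage or steer control flow.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


# NOTE: all names below are hypothetical; they are not taken from OpenArk.

@dataclass
class Const:
    """Leaf node: a constant value."""
    value: int


@dataclass
class VarRead:
    """Leaf node: identifies a storage location by name."""
    name: str


@dataclass
class Add:
    """Expression node: computes a new value from its operands."""
    lhs: "Expr"
    rhs: "Expr"


Expr = Union[Const, VarRead, Add]


@dataclass
class Assign:
    """Statement node: modifies a storage location."""
    target: str
    value: Expr


@dataclass
class If:
    """Statement node: control flow."""
    cond: Expr
    then_body: List["Stmt"]
    else_body: List["Stmt"]


Stmt = Union[Assign, If]


def evaluate(expr: Expr, env: Dict[str, int]) -> int:
    """Evaluate a leaf or expression node. Never writes to env."""
    if isinstance(expr, Const):
        return expr.value
    if isinstance(expr, VarRead):
        return env[expr.name]
    if isinstance(expr, Add):
        return evaluate(expr.lhs, env) + evaluate(expr.rhs, env)
    raise TypeError(f"not an expression node: {expr!r}")


def execute(stmts: List[Stmt], env: Dict[str, int]) -> Dict[str, int]:
    """Run statement nodes in order. Only statements mutate env."""
    for stmt in stmts:
        if isinstance(stmt, Assign):
            env[stmt.target] = evaluate(stmt.value, env)
        elif isinstance(stmt, If):
            taken = stmt.then_body if evaluate(stmt.cond, env) != 0 else stmt.else_body
            execute(taken, env)
        else:
            raise TypeError(f"not a statement node: {stmt!r}")
    return env


# x = 2; if (x) { y = x + 40 } else { y = 0 }
program = [
    Assign("x", Const(2)),
    If(VarRead("x"),
       [Assign("y", Add(VarRead("x"), Const(40)))],
       [Assign("y", Const(0))]),
]
print(execute(program, {}))
```

Keeping expressions free of side effects is what lets a compiler reorder or remove them freely; only the statement nodes need to be treated with care.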
I think MAPLE tries to lower different programming languages into a “common language” first. It doesn’t lose much information from the original language, so compilers can still perform language-specific optimizations. It also combines common features from different languages into a single representation, so that compilers can perform language-independent optimizations, too.
This reminds me of some similar approaches in other programming languages and frameworks. For example, the GENERIC intermediate language in GCC and Truffle from GraalVM both provide a framework for translating individual programming languages into a common base. Furthermore, HIR and MIR in Rust, SIL in Swift, and even MLIR from TensorFlow all make it possible to perform domain-specific optimizations and analyses before moving on to lower-level optimizations, where much of the high-level information is usually no longer available.
If you’re interested in how they use MAPLE IR, please refer to the src/mpl2mpl (i.e. the “M2M”) folder, which contains some Java/JVM language-specific lowering and analyses.
Moving on to the Me IR. Although there isn’t any documentation for it, we can still learn something from the src/maple_me, src/maple_phase, and src/maple_ipa folders:
- It’s in SSA (Static Single Assignment) form.
- In addition to SSA values (instructions), it also has concepts of Basic Blocks, Functions, and Modules.
- Currently it looks like we still need a reference to MAPLE IR when processing Me IR. In other words, Me IR is more like a facade over MAPLE IR (unlike Clang and LLVM, where LLVM IR holds no reference to the AST).
- It has the concept of a Phase, which is basically the “Pass” concept in many compiler frameworks like GCC and LLVM.
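For readers who haven’t met the “Pass” idea before, the sketch below shows the general shape of it in Python: a manager holds an ordered list of phases and runs each one over every function in a module. It is only an illustration of the concept. PhaseManager, add_phase, and the two toy phases are names I made up, and a function is reduced to a plain list of instruction strings; none of this mirrors the actual code in src/maple_phase.

```python
from typing import Callable, Dict, List, Tuple

# Toy stand-ins (assumptions): a function is a list of instruction strings,
# a module maps function names to bodies, and a phase rewrites one body.
Function = List[str]
Module = Dict[str, Function]
Phase = Callable[[Function], Function]


class PhaseManager:
    """Runs registered phases, in registration order, on every function."""

    def __init__(self) -> None:
        self._phases: List[Tuple[str, Phase]] = []

    def add_phase(self, name: str, phase: Phase) -> None:
        self._phases.append((name, phase))

    def run(self, module: Module) -> Module:
        result: Module = {}
        for func_name, body in module.items():
            for _phase_name, phase in self._phases:
                body = phase(body)  # each phase sees the previous phase's output
            result[func_name] = body
        return result


def strip_nops(body: Function) -> Function:
    """Toy phase: drop instructions that do nothing."""
    return [inst for inst in body if inst != "nop"]


def drop_after_return(body: Function) -> Function:
    """Toy phase: drop everything after the first 'ret'."""
    if "ret" in body:
        return body[: body.index("ret") + 1]
    return body


manager = PhaseManager()
manager.add_phase("strip-nops", strip_nops)
manager.add_phase("drop-after-return", drop_after_return)

module = {"main": ["nop", "load a", "ret", "load b"]}
print(manager.run(module))
```

The point of the design is that each phase stays small and independent, and the order in which they run is decided in one place.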
Other than that, OpenArk basically uses Me IR for the same set of optimizations and analyses you can find in other compiler frameworks: Alias Analysis, Dominator Tree, Dead Code Elimination, and so on.
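As an example of why SSA form makes these optimizations pleasant to write, here is a textbook mark-and-sweep Dead Code Elimination over a straight-line list of SSA instructions, sketched in Python. This is the generic algorithm, not OpenArk’s implementation; the Inst type, the opcode strings, and the set of side-effecting opcodes are all assumptions made for the example. Because every SSA name has exactly one definition, finding the instruction that produces an operand is a single dictionary lookup.

```python
from typing import Dict, List, NamedTuple, Optional, Set, Tuple


class Inst(NamedTuple):
    dest: Optional[str]        # SSA name defined here, or None if there is none
    op: str
    operands: Tuple[str, ...]  # SSA names or literal constants


# Assumption for this sketch: these opcodes must always be kept.
SIDE_EFFECT_OPS = {"store", "call", "ret"}


def eliminate_dead_code(insts: List[Inst]) -> List[Inst]:
    # In SSA each name is defined exactly once, so this map is unambiguous.
    def_index: Dict[str, int] = {
        inst.dest: i for i, inst in enumerate(insts) if inst.dest is not None
    }

    # Mark: start from side-effecting instructions and follow operands backwards.
    live: Set[int] = set()
    worklist = [i for i, inst in enumerate(insts) if inst.op in SIDE_EFFECT_OPS]
    while worklist:
        i = worklist.pop()
        if i in live:
            continue
        live.add(i)
        for name in insts[i].operands:
            j = def_index.get(name)  # literals such as "1" have no definition
            if j is not None and j not in live:
                worklist.append(j)

    # Sweep: keep only marked instructions, preserving their original order.
    return [inst for i, inst in enumerate(insts) if i in live]


code = [
    Inst("a", "const", ("1",)),
    Inst("b", "const", ("2",)),
    Inst("c", "add", ("a", "b")),
    Inst("d", "mul", ("a", "a")),  # never used by anything that matters
    Inst(None, "ret", ("c",)),
]
for inst in eliminate_dead_code(code):
    print(inst)
```

Here the definition of d is removed because nothing reachable from the ret uses it, while a, b, and c survive.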
Summary
OpenArk is a compiler framework that tries to compile different languages into a common intermediate layer and generate native binaries. It adopts a multi-tier IR design, which enables optimizations and analyses at different abstraction levels. More source code and (English) documents will be released soon.
Notes
[1] The original text is “Klass-level analysis”. Klass is basically their internal representation of class in OOP languages.
[2] Canadians will love this.