Isekai-LLVM Integration: Technical Update #1
Extending the set of supported input languages of Isekai to LLVM IR-compiled ones
This is the first article in our series. Next articles: update #2, update #3.
Useful info:
Source code: https://github.com/shdown/isekai
The Isekai project: https://medium.com/sikoba-network/isekai-verifiable-computation-framework-introduction-and-call-for-partners-daea383b1277
LLVM IR: https://llvm.org/docs/LangRef.html
Arithmetic circuit: https://en.wikipedia.org/wiki/Arithmetic_circuit_complexity
The Crystal programming language: https://crystal-lang.org
Fantom Foundation, who are supporting the development of isekai: https://fantom.foundation/
Isekai is a tool for zero-knowledge applications. It currently parses a C program and outputs the arithmetic and/or boolean circuit representing the expression equivalent to the input program. Then isekai uses libsnark to produce a rank-1 constraints system from the arithmetic representation. Isekai can then prove and verify the program execution using libsnark.
An arithmetic circuit is a directed acyclic graph, in which edges represent the variables and vertices represent arithmetic operations (addition or multiplication). For example, the following arithmetic circuit computes (a+5–2*b):
Our (@shdownnine, @glushenkov.ig, Igor Evdokimov and Raed Romanov) aim was to develop an alternative frontend supporting LLVM IR as an input:
Requirements
We were given the following requirements:
- Convert LLVM IR to the Isekai intermediate representation (also called DFG for “Data Flow Graph” internally), which is basically a syntactic tree of an expression; it has no support for any form of loops or jumps (see #3).
- Benefit from LLVM SSA form and precise variable bit widths.
- Unroll loops; the programmer should have a way to provide a “maximum number of iterations” hint to isekai.
Analysis
What constitutes a loop
To unroll a loop (and apply an unroll hint), we need to know what a loop is. The existing front-end does not have to deal with this problem since it uses libclang to parse the C source, and does not support goto statements.
LLVM IR is basically an arbitrary control flow graph (CFG). Although a simplification, one may think of a CFG as an arbitrary C code with labels and “goto” statements.
Surely, any CFG can be realized as a sequential code with “if” and “while” statements, according to the Böhm-Jacopini theorem. Let us take a look at the existing projects trying to compile LLVM IR to languages with high-level control flow operators only (no goto) — basically, the goal we are trying to accomplish.
First such project is Emscripten, the LLVM IR-to-JavaScript compiler. It uses the “Relooper” algorithm, which is described in the original Emscripten paper (https://github.com/kripken/emscripten/blob/master/docs/paper.pdf?raw=true). It is also discussed in this blog post: http://mozakai.blogspot.com/2012/05/reloop-all-blocks.html. Evidently, the algorithm is pretty much complex, and also introduces a significant performance penalty.
Another one is the LLVM WebAssembly backend with its “Stackifier” algorithm (LLVM discussion: https://reviews.llvm.org/D12735, blog post discussing the algorithm: https://medium.com/leaningtech/solving-the-structured-control-flow-problem-once-and-for-all-5123117b1ee2). Of comparable complexity, it also introduces the same performance penalty in the worst case (if the CFG is “irreducible”). Moreover, it requires multi-level break statements, which we would have to emulate with additional variables.
One important consideration is that, there could be more than one way to reconstruct the loop structure, so the very notion of “loop” becomes flawed.
To add more confusion to this term, if arbitrary jumps in the source are supported (for example, if it is a result of compilation of a C code with goto statements), then any form of loop could be implemented with a jump and a conditional statement.
But, as it turns out, all this stuff is irrelevant because…
The root of all evil
As you may know, the C standard gives the compiler a lot of room for reorganization of the code, as long as the result is semantically equivalent to the original. Over the last decade, the compilers have started to aggressively take advantage of this, and implemented a number of mind-boggling optimizations.
For example, the clang compiler has got a variety of tricks to make the unroll hint invalid; namely, it may:
- perform loop unrolling by itself;
- split a loop into multiple ones (loop fission, loop splitting, loop unswitching);
- merge multiple loops into a single one (loop fusion);
- swap the inner and outer loops (loop interchange);
- rewrite nested loops into a single one (loop skewing, https://en.wikipedia.org/wiki/Polytope_model#Detailed_example);
- add inner loop (loop nest optimization);
- …and even extract a loop into a static function (!!!): https://llvm.org/doxygen/LoopExtractor_8cpp_source.html.
Decision
Because of the aforementioned problems, our solution will only support control-flow graphs that can be obtained by compiling a C99 code without goto, break, continue, and return statements, with clang -O0.
The unroll hint will be available in the form of a fake external function, `extern void _unroll_hint(unsigned)`, which should be declared once and called just before the loop with a constant argument.
LLVM IR and the SSA form
SSA stands for “static single assignment”; it is a property of the IR, which requires that each variable is assigned exactly once. It is useful for many compiler optimizations.
If a variable needs to be changed, it is, instead, allocated on the stack with the “alloca” instruction; see the following example:
Current state of affairs
As of now, our parser is able to simplify the previous example and reduce it to just adding two input variables. It is also able to compile things like this:
The user has to compile the C file to the bitcode first, e.g. with
clang -O0 -emit-llvm -c add_nizk.c -o add_nizk.bc
and then run `isekai` on the bitcode, e.g. with
./isekai --arith=add_nizk.arci add_nizk.bc
Future work
Over the next two weeks, we will be focusing our efforts on adding support for the larger subset of the C programming language, in particular, on supporting more operations as well as arbitrary variable assignment.