Isekai LLVM update #2: conditionals and loops
How to recognize conditional statements and loops in LLVM IR code
We looked at the LLVM IR block patterns that clang generates for “if”, “switch”, “for”, “while” and “do…while” C statements with “-O0”. We notice that, in general (i.e. with the exceptions mentioned below):
- There is always exactly one block that terminates with the “ret” instruction, corresponding to the “return” statement in C. We will call this block “final”.
- The statements map to one of the following patterns, which we call the conditional pattern and the loop pattern:
There are some exceptions to this, as these patterns will not appear in the following cases:
- For conditional statements where the condition expression is constant, there will be no conditional branch and therefore no C vertex in IR.
- Loops of the sort “do … while(0)” compile to “linear” branches, so there will again be no C vertex in IR.
We also note that:
- In the conditional pattern, C has 1 or more outgoing edges, while in the loop pattern, it will always have exactly 2 outgoing edges.
- For infinite loops of the sort “while(1)” and “for(…; 1; …)”, clang seems to generate an IR that corresponds to the loop pattern. However, we will not support such infinite loops.
On the above graphs:
- The vertices represent LLVM basic blocks.
- The C vertex is a block that terminates with a conditional branch.
- The S vertex is a “sink” vertex, it is the “end” of a conditional or loop statement;
- The edges represent the “may jump onto” relation;
- A curved line with a cloud means “a detached graph connected with the outer world only with one single inward edge and one single outward edge”. This may degenerate into a null graph, in which case the curved-line-with-a-cloud becomes a single edge.
- A line ending with a double arrow means “there is a path from here to the final block.”
To distinguish between the conditional and loop patterns we start by building the shortest-path tree (that for unweighted graphs coincides with the BFS tree) on the inverted control-flow graph (with edge directions reversed) from the final block; let us call this tree Θ.
We then apply the following algorithm to any vertex C that terminates with a conditional branch:
1) Calculate the lowest common ancestor in Θ of all the children of C in IR. This will be the S vertex, which we need so we know where the conditional or loop statement ends, which is when we can stop generating conditional code.
2) Decide if it is a conditional statement or a loop:
if C has either one or at least three outgoing edges in IR, then it is a conditional statement;
elsif C is a child of itself, then it is a loop;
else it is a loop if and only if C has a child B, different from C, such that the shortest path between B and S in Θ includes C.
As S is a common ancestor in Θ of all the child blocks of C in IR, this algorithm is obviously correct.
Now let’s look at the complexity of the algorithm. We define LCA(A, B) to be the lowest common ancestor of A and B, and write “A ≤ B” to mean that B is an ancestor of A.
The good news is that the query “does the shortest path from vertex X to vertex Y in the tree include vertex Z” is equivalent to:
(X ≤ Z or Y ≤ Z) and Z ≤ LCA(X, Y),
because the shortest path between X and Y consists of ascending from X to LCA(X, Y) and then descending to Y.
The lowest-common-ancestor and is-ancestor queries on a tree can be answered quickly — with some preprocessing, of course. For example, if the Tarjan’s off-line lowest common ancestor algorithm is used, the complexity is linear in the number of edges and vertices.
Here you can find the links to the implementation: