Bottom-Tested Canonical Loops in LLVM
Survey on impact of using bottom-tested loops (i.e. LoopRotation loops) as canonical loops in LLVM
What is the proper canonical loop form for loop Passes in LLVM?
Question above is one of the earliest discussion topics in the newly established LLVM Loop Optimization Working Group. The rationale behind it is that most of the loop optimization or analysis Passes only accept certain shape of loops. For example, requires to have a pre-header, a header, two eyes and four limbs. In the ideal cases, every loop Passes will agree on one type, or one shape, of loops. Which is called canonical loops.
Unfortunately, that’s not the case now. From the survey I make here, we can find that despite most of the loop Passes agree on LoopSimplify, which is kind of THE canonical loop written in some of the LLVM docs, there are still some Passes only accept LoopRotation form, which is sometimes called bottom-tested loops since LoopRotation is basically turning the conventional for-loop-style loops into do-while-style loops (alone with some sanity checks). Furthermore, the LoopRotation pass itself also use LoopSimplify loops as input. So we’re of course wondering:
Can we use bottom-tested loops as the only canonical loop form?
Given that it might can satisfy both LoopSimplify-using and LoopRotation-using users. In this survey, I will try to prove (or dis-prove) this point by comparing their high-level control flow structure and by providing some quantified experiment results.
Control Flow Structure: LoopSimplify v.s. LoopRotation
Control flow structure is the first thing a LoopSimplify-using Pass will check, since that’s usually the number-one reason why a Pass relies on LoopSimplify. So our goal here is to examine whether LoopRotation loops will pass these sanity checks designed for LoopSimplify loops.
As mentioned in my previous survey enclosed in the link above, a loop confine with LoopSimplify form if:
- It has a pre-header.
- There is only one back-edge.
- Predecessors of exit blocks are guarantee to be in the loop.
We’re particularly interested in the first property, since this property is demanded by many Passes and most of the loops won’t create pre-header at the first place.
On the other hand, a loop will have the following control flow structure after going through LoopRotation:
As seen in the diagram, the most notable thing LoopRotation does is merging the loop header, body and range test into single basic block. It might create one large, bulky basic block, but it will guarantee that instructions within this block would have the same amount of execution counts. Which can make life easier for some tricky transformations like vectorization.
Another important characteristic is the loop guard block. It will make sure that value of induction variable is not out-of-range when it first entering the loop. And in fact, it will be the pre-header for our LoopRotation loop. Furthermore, the guard block will always be generated since we need to follow the default for-loop-style range testing mechanism. So now we’re pretty confident that LoopRotation will not break the pre-header requirement.
Regarding rest of the two requirements. There is no specific process in LoopRotation that will break the number of back edge requirement. And LoopRotation will also try to maintain the exit block predecessor property by splitting critical edges at the end of the transformation (thanks Bardia for pointing out).
In order to give some numbers and providing a preliminary testing on the acceptant rate of using LoopRotation as the canonical loop, I did a simple experiments on the existing LLVM regression tests.
The methodology is simple: For those Passes that requires LoopSimplify loops as input, I inserted a LoopRotation Pass in front of it and run the tests, then see how many of them pass the test. For example:
RUN: opt -loop-fusion -S -o - < %s | FileCheck %s
RUN: opt -loop-rotate -loop-fusion -S -o - < %s | FileCheck %s
Note that despite that many of these loop Passes might also been used in tests for other components, I only extract tests specifically put under these loop Passes’ testing folders.
This approach is definitely not the best way. Since current LLVM regression test is built on string matching and it barely can tell if a generated loop is “effectively equivalent” to the golden output. But the idea is to provide us a quick impression of how LoopRotation interacts with these LoopSimplify users while leveraging the existing testing codebase, which is kind of acting the de facto specification describing the behavior of a Pass. Here is the table of failure rates:
There are total of 54 failures and it’s impossible for me to go through them all by myself. So let’s pick only part of them and see what we’ve got.
The first thing caught our eyes is the high failure number for LoopStrengthReduction(LSR for short) and high failure rate for LoopFusion. I quickly pick two of the failures (by random) in LSR, 2012–03–15-nopreheader.ll and lsr-overflow.ll . These two are failed because the extra LCSSA instruction generated by LoopRotation cause a string mismatch. Where by default LSR won’t actively maintain a LCSSA form. Other than that, the outcome looks good, and extra LCSSA is of course not a problem.
We’re also interested if there are other Passes in our list that do not actively maintain LCSSA form since it might help us cross out some of the failure reasons. Here is the list:
Despite that LCSSA might be the root cause for LoopFusion, its five-out-of-six failure rate just doesn’t look right (I’d even highlighted it in red!). So let’s look at it, picking inner_loops.ll and four_loops.ll, which is representing two of the complex loop forms. Turns out that these two test cases are not caused by LCSSA. To see if it really failed the test, I think the fastest way is dumping the CFG.
LoopFusion try to fuse multiple similar loops together. The inner_loops.ll contains nested loops, so we’re expecting to see our LoopRotation + LoopFusion loops showing nested structure, here is the graph:
Great! looks like we do get nested loops, where
bb11 is the header of outer loop and
bb14 is that of inner loop.
For four_loops.ll it’s testing if LoopFusion can fuse four continuous loops together, so we’re expecting to see, well, only one loop. Here is the graph:
The loop definitely need some work out to loose some weight but I think we can call it a success.
In this survey I presented the possibilities of using LoopRotation as the canonical loop form. The key behind this is to see if LoopRotation breaks some characteristics of LoopSimplify loop, which is kind of the canonical loop form currently used.
By comparing control flow structures of LoopSimplify and LoopRotation loops, there is no outstanding drawbacks showing that LoopRotation loops will not accepted by LoopSimplify loops users.
Finally, I ran a simple experiment to see if we ran LoopRotation before every LoopSimplify loop user Passes in existing LLVM regression testsuite, what is the failure rate. The result showed around 10% failure rate in total. I’ve hand-picked some of these failures and found that the resulting loops are still equivalent to the expecting one. This is just a preliminary experiment, not all of the failures are invested either. But I hope it can pin-pointed places that might not agree with LoopRotation loops.
As a personal verdict, I think from these experiment, migrating canonical loops to LoopRotation loops seem to require little of works on existing loop Passes, I’m pretty positive on this migration. And of course, we definitely need feedback from the community and real-field test results.