Implementing Machine Code Optimizations for RISC-V in LLVM — A Detailed Look

This blog post describes three optimizations — shrink wrapping, save/restore & machine outlining — which I have been working on for the RISC-V backend in LLVM. These optimizations focus on transforming a machine code representation of a program in order to improve performance or decrease the total size of the program.

Background

Basic blocks, prologues and epilogues

Basic blocks are sets of these instructions, which can be considered the building blocks of control flow within the program. For example, the body of an if statement will usually end up as a basic block. These basic blocks can be given a label if they are needed as a target to jump to, and can either end by falling through to the next block, or with branch and/or jump instructions.

The following shows an example of some basic blocks as represented in assembly — note the use of left-justified labels and comments to denote a basic block:

A function will be constructed from a collection of basic blocks, with at least one basic block terminating with a return instruction (except in the rare case that control flow never needs to leave the function).

A function will usually contain code to setup and destroy the frame in the ‘prologue’ and ‘epilogue’ respectively. The frame is the word used to describe the section of the stack used by the current function, and the purpose of the prologue/epilogue code is to manage the use of the frame. This involves adjusting the stack pointer such that it points to a valid location in the frame; adjusting the frame pointer when it is needed such that it references the unchanging location of the frame itself; and saving and restoring state of registers which the caller of the function assumes will be preserved.

The following is an example of prologue/epilogue code in assembly — this simply saves and restores the state of some registers on the stack.

Implementing the optimizations in LLVM

Shrink wrapping

Prior to my patches, if the shrink wrapping optimization was enabled in RISC-V it would cause an assertion failure when the prologue was emitted, due to the following line:

In other words, the assertion triggers if we attempt to insert the prologue in any basic block other than the very first block in the function, which is exactly what the shrink wrapping optimization attempts to do. So my approach was simply to delete this line, and see what other code needed to be changed to get this to work correctly.

The changes that were required were simply to update the logic which determined the insertion point of the epilogue; previously this logic decided that the epilogue would be inserted exactly one instruction before the last instruction in the block, since it could assume the block ends with a return instruction (IE: a terminator). I modified this logic to properly account for situations where this is not the case:

More complex logic was needed once save/restore was implemented, since this complicated the requirements for being able to insert prologue and epilogue code in a given location. Most importantly, we could not move the epilogue, since the routines used for save/restore use a ‘tail return’. A tail return assumes that the function calling it has no more code that needs to be run, so instead of returning control to that function, it returns to the function which called that function instead.

The full implementation of shrink wrapping for RISC-V can be found on Phabricator. Currently this can be enabled by passing -mllvm -enable-shrink-wrap to the Clang compiler.

Save/restore

There is no ‘pre-packed’ logic in LLVM for emitting calls to assembly routines in prologues and epilogues, but we need to make sure that LLVM knows that we know how to handle the reserved registers ourselves, so they don’t need stack locations to be calculated for them:

A slight difficulty with using an assembly routine for saving registers is that we haven’t yet saved the state of the return address register, so we can’t actually use it for the call to the routine itself. The workaround is to use a different register as a temporary return address register:

The full implementation of save/restore for RISC-V can be found on Phabricator. Currently this can be enabled by passing -msave-restore to the Clang compiler.

Machine outlining

Machine outlining is another optimization that is already available in LLVM for targets to utilize. The algorithm can be described as follows:

  • Map each machine code instruction to a number, unique to the exact behaviour of that instruction. Construct a ‘string’ representing the entire program from these.
  • Transform this into a suffix tree — a data structure representing a string as a set of common substrings with different suffixes.
  • Repeatedly query the suffix tree for the next outlining candidate — the longest non-leaf substring with more than one occurence — and replace each occurence with a call to an outlined function.

For more detail , Jessica Paquette’s talk at the 2016 LLVM developers’ meeting gives a fantastic overview of the machine outliner:

For the target implementation, we need to define some hooks to tell the algorithm when it is safe to outline code. For RISC-V this involves checking that outlining wouldn’t override expected linker behaviour:

After the machine outliner has produced outlining candidates, the target can provide information about the candidate sequence and the locations in which it occurs. For RISC-V we also use this opportunity to ensure that we are able to insert a call to an outlined function using the same temporary return address register as the save/restore implementation uses:

Finally, the outlined function is constructed and the required calls to the outlined function are inserted in place of the original sequences. The target has the opportunity to perform any modifications on the outlined function, and must insert the calls to the function. For RISC-V, we use the same method of calling the outlined function as the save/restore implementation uses to call the assembly routines.

The full implementation of machine outlining for RISC-V can be found on Phabricator. Currently this can be enabled by passing -mllvm -enable-machine-outliner to the Clang compiler.

Results

The Embench benchmark suite

Size measurements — dummy libraries

Compiling the benchmark programs with dummy libraries linked instead of the real libraries reduces the impact of the size of the libraries on the absolute code size measurements. In our case this would theoretically make it easier to see small differences between each configuration.

It’s worth noting that I removed the data for the shrink wrapping optimization for the size measurements, since it has no impact on code size. This is not unexpected — despite its name — since the optimization is just moving code around. I have also chosen to report the absolute sizes; Embench by default reports measurements relative to a standard setup to keep the results in context, however for our purposes we are comparing directly to our own baseline — LLVM without our optimizations — so comparing absolute sizes is more appropriate.

Size measurements — real libraries

In this case, each of the configurations are being linked against real libraries. This actually helps us to present a more accurate comparison between the optimizations, since the save/restore optimization requires linking in a small amount of additional library code — the assembly routines — and it can be seen that in a few cases this does outweigh the initial decrease in code size.

Speed measurements

Since two of the three optimizations are focussed on size, it wasn’t as relevant to investigate the speed results as much. However it’s important to show that performance relative to the standard setup remains reasonable. One interesting thing to note here was that on these benchmarks shrink wrapping did not have much of an effect, indicating that the optimization is only expected to make a marginal difference, or even that the type of programs used didn’t present much opportunities for the optimization to have an effect.

Conclusion

Currently work-in-progress implementations are available on Phabricator, and a constantly updated branch exists on Embecosm’s LLVM fork, where you can use the experimental flags described above to experiment with the optimizations.

University of Bath Student, UKESF Scholar, and Software Engineer at Embecosm. Working in embedded AI and LLVM compiler development.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store