Neurosymbolic Assembly Transpilation
This work was published at the 2024 International Conference on Learning Representations (ICLR). Conference paper linked here. Poster linked here.
Motivation
Computers run on 0s and 1s, represented as electrical impulses on increasingly complex hardware chips. Programmers write software in higher-level programming languages, which are analyzed, optimized, and compiled down to assembly code: machine instructions that map directly to the 0s and 1s that machines execute.
Assembly code is specific to the hardware of the machine in question. For example, an instruction written for a MIPS32 machine runs only on MIPS32 hardware. Different assembly languages are defined by Instruction Set Architectures (ISAs).
This means that once software is compiled to assembly/machine code, it cannot be run on a machine with a different architecture. So when major architecture providers make changes that get people talking about switching to different architectures altogether, this leaves open the question: what do we do about all the legacy software written for the old ISA?
If we want to run code natively on a different architecture, we need to translate it to the target architecture’s ISA.
Ideally, we would have the original high-level source code and a compiler for the target architecture, so we could simply recompile the original high-level code to the target ISA. Unfortunately, for much legacy code, the original high-level source is lost. Even when we do have it, many widely used source files contain inlined assembly code, which still requires porting to the new hardware architecture.
We want to perform assembly-to-assembly translation, i.e. transpilation.
The problem with existing methods: engineering effort
Existing methods to handle this issue are:
(1) Emulators, virtual machines, and containerized applications allow users to run software on different host hardware by simulating the architecture that the software was compiled for. However, this option is often unwieldy and does not run the software natively.
(2) Transpilation tools are hand-engineered for a specific source and target ISA, so they scale poorly as new ISAs are introduced or modified.
Enter automated methods
Neural machine learning techniques are a natural fit for this type of task. Language models have demonstrated remarkable performance on a variety of translation tasks and scale well to new tasks given sufficient data. However, they notoriously make errors, most visibly today in mathematical reasoning and hallucination.
Program synthesis methods present another viable approach. Assembly programs can be executed to verify their correctness, and the semantics of each assembly line are strictly defined. Program synthesis methods are symbolic in nature and are built around correctness. However, they quickly become intractable as the search space of potential programs grows.
We argue for a neurosymbolic approach to this task, combining the scalability of language models with the correctness guarantees of program synthesis methods.
The objective
Formally, the objective is as follows: given an input program sequence x in the source ISA, produce an output sequence y in the target ISA such that, for all execution parameters in the domain D, the two programs produce the same execution behavior.
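As a rough formalization (the bracket notation ⟦·⟧ for "the result of executing a program" is mine, not the paper's):

```latex
% Semantic equivalence over the execution domain D: for every
% parameter setting d in D, running the source program x and the
% transpiled program y yields the same result.
\forall d \in D:\quad \llbracket x \rrbracket(d) \;=\; \llbracket y \rrbracket(d)
```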
Guess via a language model (neural)
We use a transformer-based language model to propose candidate transpilations y for a given input x. Since we know that language models make errors, we also need to know where in y those errors may be, and how to correct them.
We identify potential errors using a general heuristic and a domain-specific heuristic. The general heuristic asks whether the model's confidence in a predicted token is lower than some threshold.
The domain-specific heuristic fires when y references some unknown entity. This happens rather often in assembly, where memory and function blocks are defined in one place and referenced elsewhere in the program. If the language model's prediction references an entity that appears nowhere else in the input or output, the generated reference is likely hallucinated and may be erroneous. We refer to these potential errors as E.
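Here is a minimal sketch of both heuristics in Python; the threshold value, the label pattern, and all names are illustrative assumptions, not the paper's implementation:

```python
import re

CONF_THRESHOLD = 0.5  # illustrative; a real threshold would be tuned

def flag_potential_errors(out_tokens, token_probs, in_text, out_text):
    """Return indices of output tokens flagged as potential errors E.

    Heuristic 1 (general): the model's probability for the token is low.
    Heuristic 2 (domain-specific): the token looks like a label reference
    (e.g. ".L3") but appears nowhere else in the input or output, so the
    reference is likely hallucinated.
    """
    flagged = set()
    for i, (tok, prob) in enumerate(zip(out_tokens, token_probs)):
        if prob < CONF_THRESHOLD:                 # general: low confidence
            flagged.add(i)
        if re.fullmatch(r"\.[A-Za-z_]\w*", tok):  # label-like token
            occurrences = in_text.count(tok) + out_text.count(tok)
            if occurrences <= 1:                  # only this reference site
                flagged.add(i)
    return sorted(flagged)
```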
Since this task is translation, we know that the semantics of the input x must match the semantics of our output. We also know that in this setting, smaller parts of the input and output are semantically aligned. Assembly "basic blocks" are sequences of code lines with a single entry point and a single exit point; there are no branching operations within the sequence, so it runs straight through. Basic blocks in the input program generally have a corresponding basic block in the target program. This means that to know the correct semantics of a given output basic block, we only need to know which basic block in the input it corresponds to. To get this, we use the attention matrices of the language model.
Without any external training objective, the attention matrices about three-quarters of the way through the model's layers closely reflect the true alignment of basic blocks. We extract the basic block alignment A by taking, for each output block, the input block with the highest aggregate attention score.
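A sketch of how such an alignment could be extracted, assuming we already have an averaged attention matrix and token spans for each basic block (the function and argument names are my own, not the paper's API):

```python
import numpy as np

def align_blocks(attn, src_spans, tgt_spans):
    """Map each target basic block to its best-aligned source block.

    attn:      (tgt_len, src_len) attention matrix, e.g. averaged over
               heads at a layer ~3/4 of the way through the model.
    src_spans, tgt_spans: (start, end) token index pairs, one per block.
    Returns {target_block_index: source_block_index}, i.e. A.
    """
    alignment = {}
    for j, (ts, te) in enumerate(tgt_spans):
        # Aggregate attention mass from this target block onto each
        # candidate source block; pick the block receiving the most.
        scores = [attn[ts:te, ss:se].sum() for (ss, se) in src_spans]
        alignment[j] = int(np.argmax(scores))
    return alignment
```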
At this point, the language model has provided us a candidate translation y, its places of potential error E, and basic block alignment A to the input sequence.
Search for correct subprograms via Sketching (symbolic)
Sketching is a program synthesis method in which a user provides a correctness criterion and an incomplete program: a general "sketch" of the high-level ideas with a "hole" in lieu of lower-level details. The "sketch" and the correctness criterion are handed off to an SMT solver, which searches the space of programs defined by the "sketch" until it finds one that satisfies the given criterion.
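To make the idea concrete, here is a toy example in Python. A real sketching pipeline hands the sketch to a solver; this stand-in simply enumerates completions of two holes and checks them against a correctness criterion given as input-output tests (everything here is invented for illustration):

```python
from itertools import product

def sketch(hole_op, hole_const, x):
    """An incomplete program: compute (x * 2) ? c, where the operator
    and the constant c are holes left for the solver to fill."""
    return hole_op(x * 2, hole_const)

# Correctness criterion: agree with a reference behavior on test inputs.
reference = lambda x: x * 2 + 3
tests = [0, 1, 5, -2]

# Enumerative stand-in for the solver: search the space of completions.
ops = {"+": lambda a, b: a + b, "-": lambda a, b: a - b}
for (name, op), c in product(ops.items(), range(-4, 5)):
    if all(sketch(op, c, x) == reference(x) for x in tests):
        print(f"completion found: (x * 2) {name} {c}")
        break
```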
With (y, A, E) from the language model, we are perfectly set up for Sketching. The potential errors E point out which parts of the program to replace with Sketching holes. The alignment A from the surrounding basic block in y back to x provides the correctness constraint: the semantics of the corresponding basic block in the input.
After Sketching, the corrected basic blocks replace their erroneous counterparts to produce the final output y'.
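Putting the pieces together, the control flow might look like the following outline; `guess` and `solve` stand in for the neural and symbolic components, and none of these names match the paper's actual interface:

```python
from typing import Callable, Optional

def guess_and_sketch(
    x: str,
    guess: Callable[[str], tuple[list[str], set[int], dict[int, int]]],
    solve: Callable[[str, str], Optional[str]],
) -> list[str]:
    """Illustrative outline: guess(x) returns the candidate transpilation
    as a list of basic blocks, the indices of blocks containing potential
    errors E, and the block alignment A (target index -> source index).
    solve(sketch, src_block) returns a corrected block, or None."""
    src_blocks = x.split("\n\n")        # toy basic-block segmentation
    y_blocks, error_ids, alignment = guess(x)

    for j in error_ids:
        # The aligned source block supplies the correctness constraint:
        # the corrected block must reproduce its semantics.
        src = src_blocks[alignment[j]]
        # Potential errors in the block become holes; the solver searches
        # for a completion equivalent to the aligned source block.
        fixed = solve(y_blocks[j], src)
        if fixed is not None:           # on solver failure, keep the guess
            y_blocks[j] = fixed

    return y_blocks                     # the corrected output y'
```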
In full, this approach can be thought of as either of the following:
- Using program synthesis methods to correct language model predictions.
- Using language models to reduce the search space for program synthesis methods.
Results
We compare our method to several baselines:
- few-shot prompting GPT-4
- an engineered transpiler (we found one by researchers at IBM Haifa for the ARM -> RISC-V direction, but not for the reverse)
- fine-tuning code foundation models (StarCoder 15.5B, CodeLlama 13B)
- training an encoder-decoder model only on the task (340M parameters)
To be clear, the encoder-decoder baseline is the exact same model we use in the neural Guess phase of our approach, so comparing the encoder-decoder to our full method directly measures the value added by the symbolic component.
Our neurosymbolic method outperforms all alternatives.
Some findings in error analysis of the baselines:
- Pre-trained models (GPT-4, StarCoder, CodeLlama) tend to produce assembly instructions from other architectures, specifically the x86 and MIPS ISAs.
- The engineered transpiler fails when the input assembly code includes an operation or pseudo-operation that its hand-written rules do not handle.
- The encoder-decoder model makes few ISA mistakes, but runs into a number of errors in mathematical semantics and out-of-scope references, many of which are resolved by the solver in the neurosymbolic GUESS & SKETCH approach.
While our method is more effective than the baseline approaches, several open challenges remain:
- The symbolic component of our method depends on alignment with the source sequence, but the alignment extracted from attention does not always match the true basic block alignment. When this happens, errors cannot be corrected by our method.
- Memory management issues are still hard for the symbolic solver to resolve. These include reasoning about the values on the stack at any given point in the program, register-choice decisions that are incorrectly propagated through autoregressive generation, and loading memory addresses into registers. A solver can be augmented to support these cases, but doing so significantly enlarges the program search space.
- The best-performing model we tested is a mid-size encoder-decoder, which is strong at pattern matching but likely cannot perform programmatic reasoning. Larger code models could potentially handle some of the symbolic transpilation issues better, if their instruction hallucinations could be reduced.
- All transformer-based language modeling approaches are limited by their context length.
- We have no formal proof of equivalence for the final programs; we only check execution on a small, finite set of inputs.
For more details of training data, training setup, prompting setup, error analysis, examples, and test sets, please see the paper or reach out to me.
Takeaway
A neurosymbolic approach is beneficial for tasks that have strong structure and verification priors, such as coding.
A heartfelt thank you to my collaborators: Abdulrahman Mahmoud, Michal Kurek, Simone Campanoni, David Brooks, Stephen Chong, Gu-Yeon Wei, and my advisor Alexander “Sasha” Rush.