Common subexpression elimination for the Solang compiler

Lucas Ste
Coinmonks
9 min read · Dec 8, 2021



This is my second article about my work as a mentee for the Solang compiler. If you have not seen my first article, please check it here. As a brief introduction: in June 2021, I applied for the Linux Foundation mentorship and was accepted to implement undefined variable detection, unused variable elimination, and common subexpression elimination. In this article, I will explain my algorithm for removing repeated expressions from Solang’s intermediate representation.

The elimination uses two passes over Solang’s control flow graph (CFG). In the first one, we detect repeated expressions using a graph data structure and, during the second one, we remove them. The first pass tracks the expressions available at each CFG instruction. Solang saves expressions inside each instruction. For instance, the Set instruction shown below contains its location (where in the Solidity file the instruction is located), a number that indicates the variable’s index in the function’s symbol table, and an expression, which is the right-hand side of an assignment.
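
A sketch of the variant (field and type names are simplified here, not Solang’s exact definitions):

```rust
// Simplified sketch; names are illustrative, not Solang's exact API.
struct Loc {
    start: usize, // offset in the Solidity source file
    end: usize,
}

enum Expression {
    // Solang's expression tree (arithmetic, variables, literals, ...) elided here.
}

enum Instr {
    // res = expr
    Set {
        loc: Loc,         // where in the Solidity file the instruction is located
        res: usize,       // the variable's index in the function's symbol table
        expr: Expression, // the right-hand side of the assignment
    },
    // ... other instruction kinds (branches, calls, returns, ...)
}
```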

In Solang’s intermediate representation, the order of operations is equivalent to a post-order traversal of the tree that represents each expression of an instruction: operands are evaluated before the operation that consumes them. One way to easily track the expressions we have already calculated is to create a modified representation of the tree, in which we add all expressions we have processed, taking care to avoid duplicates. This representation contains every possible operation calculated so far. We are essentially merging all the expression trees we have processed into a single graph. As an example, consider the following code:
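
A snippet along these lines reproduces the walkthrough (hypothetical code; the variable names match the discussion):

```solidity
function example(int256 a, int256 b, int256 f) public pure returns (int256) {
    int256 x = a + b - 54;
    int256 y = x * f * (a + b); // a+b is repeated here
    x = 9; // reassigning x invalidates every expression built from it
    return x + y;
}
```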

When we parse the first line, which contains the expression ‘a+b-54’, we have the following graph:
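
[Figure: the expression graph for a+b-54. The leaves a, b, and 54 have in-degree zero; the node e1=a+b feeds the subtraction node for x.]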

Variables and constants have an in-degree of zero, as they do not originate from any operation. The graph data structure shows us which expressions are available for the next instruction and allows us to quickly identify repeated expressions. When we process x*f*(a+b), the graph becomes the following:
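
[Figure: the graph after processing x*f*(a+b). The node e1=a+b now has out-degree two, feeding both the subtraction assigned to x and the new multiplication chain built from x and f.]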

The node e1=a+b has an out-degree of two, which means the expression a+b has been used twice throughout the code. Variables may change value when they are reassigned. If this happens, all expressions that contain such a variable must become unavailable. In terms of our graph, this operation translates to deleting the variable node and all its descendants. When we assign another value to x, the graph becomes the following:
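
[Figure: the graph after x is reassigned. The multiplication chain built from x is gone, and the node previously labeled x=e1-e2 now appears as e4=e1-e2.]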

Notice that not only have all expressions that depended on x been removed from the graph, but the node x=e1-e2 has also changed to e4=e1-e2, as x no longer represents the expression e1-e2.

It is worth mentioning that the data structure we created also enables us to track partial expressions. In the given example, we were able to track a+b, which is an expression nested inside a larger one.

Now that we have shown how we can identify available expressions using our graph, there is a new problem we must address: branches. Consider the following Solidity function:
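
A hypothetical reconstruction of such a function:

```solidity
function test(int256 a, int256 b) public pure returns (int256) {
    int256 c;
    if (a > b) {
        c = (a - b) * 2;
    } else {
        c = (a - b) + 5;
    }
    return c + (a - b);
}
```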

The expression a-b becomes available on both branches and is used at the return statement. Our algorithm should be capable of detecting that a-b is available after the branch. To solve this, we create a copy of the graph for each branch (the if and the else), so that new expressions are inserted into each copy independently. After the branch, we must intersect both graphs, but there must be a way to differentiate between expressions available before the branch, common expressions created inside both branches, and expressions created inside a single branch that are not common.

We assign a unique identifier (ID) to each node in the graph. This way, expressions available before the branch have the same ID on both branches. Each new expression added to the graph, regardless of which copy it lands in, receives a new unique ID. We can identify that a-b is available after the branch because the IDs of a and b and the subtract operation coincide on both branches’ graphs when we are intersecting them. We identify each node in the graph by a hash of its operands’ IDs and its operation (a hash of the ExpressionType shown below). Using such logic, we can quickly check whether an operation is available in the graph. The following code snippet shows how we store the graph in Rust.
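
(The snippet is simplified for presentation; the names approximate, rather than reproduce, the actual implementation.)

```rust
use std::cell::RefCell;
use std::collections::HashMap;
use std::rc::Rc;

// A node's identity: its operation plus the global IDs of its operands.
#[derive(Clone, PartialEq, Eq, Hash)]
enum ExpressionType {
    BinaryOperation(usize, usize, Operator), // left ID, right ID, operation
    UnaryOperation(usize, Operator),         // operand ID, operation
    Variable(usize),                         // variable number in the symbol table
    Literal(Vec<u8>),                        // constant value
}

#[derive(Clone, PartialEq, Eq, Hash)]
enum Operator {
    Add,
    Subtract,
    Multiply,
    // ...
}

// A graph node: its identity, its global ID, and edges to its descendants.
struct BasicExpression {
    expr_type: ExpressionType,
    expression_id: usize,
    children: HashMap<usize, Rc<RefCell<BasicExpression>>>,
}

// The graph itself: every node indexed by its global ID, plus a map from
// ExpressionType to ID for constant-time availability queries.
struct AvailableExpressionSet {
    expression_memory: HashMap<usize, Rc<RefCell<BasicExpression>>>,
    expression_map: HashMap<ExpressionType, usize>,
}
```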

The reasoning we described works only for expressions whose operands are available before the branch. If we have the following situation:
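
(Hypothetical code: e is created separately inside each branch.)

```solidity
function test(int256 a, int256 b, bool condition) public pure returns (int256) {
    int256 r;
    if (condition) {
        int256 e = a - b;
        r = e - 1;
    } else {
        int256 e = a - b;
        r = (e - 1) * 2;
    }
    return r;
}
```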

Our code is not capable of knowing that e-1 is available on both branches, because its operands have different identifiers. We also assume that developers will notice when they are using two equivalent operations on all branches and will refactor their code for better readability. We show below what the graph would look like for both branches. Even though a-b has a different ID on each copy, we detect that it is the same expression because the operands’ IDs coincide, as does the subtract operation.
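
[Figure: the branch graphs for the if and the else. The node for a-b has a different ID in each copy, but its operands a and b, and the subtract operation, coincide.]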

We’ve set a series of rules to intersect the graphs seamlessly. Nodes with the same identifier are kept, as well as those whose global identifiers differ but whose ExpressionType is the same (this is the a-b case from the example). In the latter case, we must remove all of the node’s children from memory, because we do not track expressions whose operands were not available before the branch.
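
A sketch of the intersection, assuming the ExpressionType from the earlier snippet and summarizing each branch graph as a map from ExpressionType to global ID (the names are ours, not Solang’s):

```rust
use std::collections::HashMap;

fn intersect(
    left: &HashMap<ExpressionType, usize>,
    right: &HashMap<ExpressionType, usize>,
) -> HashMap<ExpressionType, usize> {
    let mut result = HashMap::new();
    for (expr_type, id) in left {
        match right.get(expr_type) {
            // Same global ID on both sides: the node predates the branch.
            Some(other_id) if other_id == id => {
                result.insert(expr_type.clone(), *id);
            }
            // Different IDs, same ExpressionType: the same operation over
            // operands that predate the branch (the a-b case). The node is
            // kept, but its children must be removed from memory, because we
            // do not track expressions whose operands were unavailable before
            // the branch. (Child pruning is omitted from this sketch.)
            Some(_) => {
                result.insert(expr_type.clone(), *id);
            }
            // Present on only one side: unavailable after the branch.
            None => {}
        }
    }
    result
}
```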

So far, we have explained how we conduct an available-expression analysis. We’ll now explain how we leveraged the graph data structure to eliminate common subexpressions. During the first pass, we can use the graph to identify repeated expressions, as we already explained. We created a new structure to track common subexpressions: as soon as we find a repeated expression, we save it in a hash table (the tracker itself). The hash table’s key is the ExpressionType, which consists of the operands’ IDs and the operation.
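
The tracker can be organized roughly as follows (illustrative names; Expression and ExpressionType are as sketched earlier):

```rust
use std::collections::HashMap;

// One entry per repeated expression found during the first pass.
struct CommonSubexpression {
    expression: Expression, // the repeated expression itself
    block: usize,           // block where the expression should be instantiated
    var_no: Option<usize>,  // local variable already carrying the value, if any
}

// The tracker: a hash table keyed by ExpressionType.
struct CommonSubexpressionTracker {
    common_expressions: HashMap<ExpressionType, CommonSubexpression>,
}
```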

The tracker saves not only the repeated expression itself, but also the block where this expression should be instantiated, and whether there is already a local variable that carries the expression’s value. We need to know the block where the expression should be instantiated because, if we are dealing with expressions made available inside branches and used after them, the expression must be resolved before the conditional branch, in a block that dominates both branches and the code after them. An example of this case is shown below:
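
(Hypothetical reconstruction: a-b is made available inside both branches and used after them, and d holds x*(a+b).)

```solidity
function test(int256 a, int256 b, int256 x) public pure returns (int256) {
    int256 c;
    if (x > 0) {
        c = (a - b) * 2;
    } else {
        c = (a - b) + 1;
    }
    int256 d = x * (a + b);
    return c + (a - b) + d;
}
```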

Expression a-b should be resolved before the if condition, so that there is a variable carrying its value in both branches and at the return statement. The optimized CFG looks like this:
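
(The notation below only approximates Solang’s CFG dump.)

```
block0: # entry
    ty:int256 %temp.1 = (%a - %b)        # a-b hoisted before the branch
    cbr (%x > int256 0), block1, block2

block1: # then
    ty:int256 %c = (%temp.1 * int256 2)
    br block3

block2: # else
    ty:int256 %c = (%temp.1 + int256 1)
    br block3

block3: # endif
    ty:int256 %d = (%x * (%a + %b))
    return ((%c + %temp.1) + %d)
```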

We also track the variables to which an expression has been assigned, so that we avoid creating unnecessary temporaries. In the above example, if x*(a+b) appeared again after the assignment, we would replace it with d.

When our tracker has all the repeated expressions, we perform a second pass over the CFG. We rebuild the graph data structure and, as its construction is deterministic, every node keeps the unique identifier it had during the first pass. Upon processing an expression that is in our tracker, we identify it by a hash of its ExpressionType and, if there is a variable available containing the resolved expression’s value, we can replace the whole expression with that variable. It can be either a temporary or an existing local variable.

Each instruction in Solang contains the expressions that must be evaluated so that the instruction can be executed. For instance, if the instruction is a function call, we must evaluate its arguments before calling the function. Therefore, to perform common subexpression elimination during the second pass, we traverse the expression tree of each instruction. Our traversal recursively regenerates each node if no variable is available, and replaces the node with a variable node if there is one.
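
A sketch of this traversal over a drastically simplified Expression type (the real pass handles Solang’s full expression enum):

```rust
use std::collections::HashMap;

// Minimal illustrative expression tree; Solang's real enum is much richer.
#[derive(Clone, PartialEq, Eq, Hash)]
enum Expression {
    Variable(usize),
    Add(Box<Expression>, Box<Expression>),
    Subtract(Box<Expression>, Box<Expression>),
}

// `available` maps a resolved expression to the number of the variable
// (temporary or local) that carries its value.
fn regenerate(expr: &Expression, available: &HashMap<Expression, usize>) -> Expression {
    if let Some(var_no) = available.get(expr) {
        // The whole (sub)expression already lives in a variable:
        // exchange the node for a variable node.
        return Expression::Variable(*var_no);
    }
    // Otherwise, rebuild the node, rewriting its operands recursively.
    match expr {
        Expression::Add(l, r) => Expression::Add(
            Box::new(regenerate(l, available)),
            Box::new(regenerate(r, available)),
        ),
        Expression::Subtract(l, r) => Expression::Subtract(
            Box::new(regenerate(l, available)),
            Box::new(regenerate(r, available)),
        ),
        other => other.clone(),
    }
}
```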

Temporaries are instantiated in the instructions preceding the first usage of the expressions they represent. In the case of temporaries that must be available inside branches, they are created in the first common ancestor block of the branches’ blocks.

Solang’s reaching-definitions analysis does not yet work for contract variables, struct members, and arrays, so our common subexpression elimination is limited to local variables. Reaching definitions are very important for the algorithm as a whole. Take the following example:
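
(Hypothetical code; the names x, d, and p match the discussion below.)

```solidity
function test(int256 x, int256 d) public pure returns (int256) {
    int256 p = x + d;
    int256 s = 0;
    while (x < 10) {
        s = s + (x + d) * (x + d); // x+d is calculated twice before x changes
        x = x + 1;                 // x is redefined inside the loop
    }
    return p + s;
}
```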

Expression x+d is clearly available at the loop. However, as the value of x keeps changing, we cannot replace x+d with the local variable p. Reaching definitions tell us that a definition from a descendant block reaches the beginning of the while block. This means we must kill x before continuing our analysis, even though we haven’t yet processed an assignment to x. Despite that, the algorithm detects that x+d is calculated twice in the loop before x gets assigned a new value, so the expression is optimized, as shown below:
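
(Loop body only; the notation approximates Solang’s CFG dump.)

```
block1: # while body
    ty:int256 %temp.2 = (%x + %d)             # fresh temporary, local to one iteration
    ty:int256 %s = (%s + (%temp.2 * %temp.2))
    ty:int256 %x = (%x + int256 1)            # %x is redefined: %temp.2 is killed
    br block0                                 # back to the loop condition
```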

The algorithm has proven to work properly for our use cases. Below is a good example of a function and its optimized CFG.
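
(A hypothetical reconstruction, shaped to match the observations that follow; the CFG notation is approximate.)

```solidity
function test(int256 a, int256 b) public pure returns (int256) {
    int256 x = a + b + 5;          // (((arg #0) + (arg #1)) + int256 5)
    int256 y = (a + b + 5) * 2;    // repeated: replaced by x, no temporary needed
    int32 p2 = int32(x % 1000);
    int256 r = int256(p2 + 9);     // int(p2+9)
    int256 s = int256(p2 + 9) + y; // repeated: a temporary holds int(p2+9)
    return x + r + s;
}
```

```
block0: # entry
    ty:int256 %x = (((arg #0) + (arg #1)) + int256 5)
    ty:int256 %y = (%x * int256 2)              # reused %x
    ty:int32 %p2 = int32((%x % int256 1000))
    ty:int256 %temp.3 = int256((%p2 + int32 9))
    ty:int256 %r = %temp.3
    ty:int256 %s = (%temp.3 + %y)               # reused the cast result
    return ((%x + %r) + %s)
```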

From the generated CFG, we notice some nice aspects of the algorithm. First, it did not create a temporary for (((arg #0) + (arg #1)) + int256 5) because x was already available for that. Second, at int(p2+9) it did not create a variable for p2+9, for we only needed int(p2+9) later.

The common subexpression elimination was a completely new pass for Solang. After doing some research online, I couldn’t find any documentation about implementing it, beyond theoretical explanations, so I devised my own approach to the problem. The Linux Foundation Mentorship has been an enriching experience, in which I dove deep into compilers and made new connections.
