Entropy of abstract syntax trees

Viktor Tóth
Published in Mindsoft · 9 min read · Jun 20, 2018

Take an Abstract Syntax Tree (AST), for example the one of the factorial function, which will serve as the running example throughout this post. In code, it corresponds to something like:

def fact(n):
    if n == 0:
        return 1
    else:
        return n*fact(n-1)

A real AST is more complicated and filled with details that are unnecessary at the moment. The point is that the tree representation is more true to the inherent structure of your code than indented lines of strings, and it is easier for a compiler/interpreter, and machines in general, to work with.
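
If you want to see what a real AST looks like, Python's standard ast module will print one for you. This is just a quick peek, not part of the entropy computation itself:

import ast

source = """
def fact(n):
    if n == 0:
        return 1
    else:
        return n*fact(n-1)
"""

# Parse the source and pretty-print the resulting tree
# (the indent argument of ast.dump needs Python 3.9+).
print(ast.dump(ast.parse(source), indent=2))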

There are different measures of code complexity, the simplest being line counting. Other complexity metrics take into account the tree-like structure of a program, like Cyclomatic Complexity, which uses the control flow graph instead of an AST, but let's not get into that now. Complexity measures in general give a proxy for how easy it is to understand, and thus maintain, the measured code base. A 10,000-line C++ program is more difficult to update than a 100-line one. Similarly, going by Cyclomatic Complexity, deciphering 200 distinct execution paths is a heavier job than following through a single one. The entropy metric I came up with is different: in short, it tells you how repetitive the code is. In conjunction with the previously mentioned methods, it can give you a more accurate complexity metric. Here's an extreme example: a 10,000-line code file is not that complicated if the same 1,000-line code segment is repeated 10 times.

Now, here's where entropy comes into the picture. Shannon entropy measures information content: the more “random looking” your code is, or the fewer repeated segments it has, the more information it contains, AKA the more bits you need to encode such a string. Let's take the Shannon entropy of strings as an example. Here are two strings: aaabbb, abcdef. Each string has 6 consecutive symbols. To calculate entropy, or information content, we need the probability of each symbol appearing in the string.

aaabbb: p(a) = 3/6, p(b) = 3/6
H(aaabbb) = −p(a) ⋅ log₂(p(a)) − p(b) ⋅ log₂(p(b)) = 1.00

abcdef: p(a) = p(b) = p(c) = p(d) = p(e) = p(f) = 1/6
H(abcdef) = 6 ⋅ (−1/6 ⋅ log₂(1/6)) = 2.58
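
If you would rather verify these numbers locally than trust an online calculator, here is a minimal Python helper (my own sketch, not part of the original tooling; shannon_entropy is just a name I picked):

from collections import Counter
from math import log2

def shannon_entropy(symbols):
    """Shannon entropy (bits per symbol) of a sequence of symbols."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum(c / total * log2(c / total) for c in counts.values())

print(round(shannon_entropy("aaabbb"), 2))   # 1.0
print(round(shannon_entropy("abcdef"), 2))   # 2.58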

You can follow the exact calculation of Shannon entropy here. A somewhat unorthodox step at this point would be to say that two consecutive characters are taken as a symbol, while also letting these symbols overlap. So, abcdef becomes a string of these symbols: ab, bc, cd, de, ef; similarly, aaabbb becomes aa, aa, ab, bb, bb. This actually makes sense if, e.g., the symbol ab carries more meaning than the sum of the sub-symbols a and b. We could go even further and take overlapping three-character segments as symbols, so abcdf would become: abc, bcd, cdf. Obviously, the entropy measured at different symbol lengths is not necessarily the same.
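
The overlapping-symbol variant is a small extension of the helper above (again my own sketch, with the approximate values worked out by hand):

def ngram_entropy(s, n=2):
    """Entropy where each overlapping n-character window counts as one symbol."""
    grams = [s[i:i + n] for i in range(len(s) - n + 1)]
    return shannon_entropy(grams)

print(round(ngram_entropy("aaabbb"), 2))   # symbols aa, aa, ab, bb, bb -> ~1.52
print(round(ngram_entropy("abcdef"), 2))   # symbols ab, bc, cd, de, ef -> ~2.32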

What happens if we swap strings for trees, particularly ASTs? Symbols become nodes of the tree: operations, literals and so on in the case of syntax trees. Taking the example of the AST shown above, we can compute the probability of each symbol's appearance:

p(if) = p(==) = p(0) = p(*) = p(fact) = p(-) = 1/11,
p(1) = 2/11, p(n) = 3/11

Essentially, we can flatten the tree to a string, for example by traversing the tree inorder: n=0i1n*fn-1. I take only the first character of each symbol here, so I can just pass the string to an online Shannon entropy calculator. The resulting entropy is 2.85.
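
The same number falls out of the shannon_entropy helper from before, if you care to check:

print(round(shannon_entropy("n=0i1n*fn-1"), 2))   # 2.85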

WAT? In this sense, what's the difference between trees and strings anyway? Remember, we want to quantify how much repetition the code has: not just the redundancy of single nodes (operations, in the case of ASTs), but of whole pieces of code, maybe a whole function. So, by doing the previously described unorthodox expansion of symbols, overlapping tree samples of a given size can become a symbol. A tree sample here is defined as a tree inside a tree, i.e. a connected subset of nodes. Here are some overlapping 2-sized tree samples taken from the AST shown above:
if → ==, if → 1, * → n, fact → -, etc.,
where → denotes the parent-child relationship.
3-sized tree samples taken from the same example AST would be:
== ← if → 1, == ← if → *, 1 ← if → *, n ← == → 0, n ← * → fact,
… you can imagine how the symbols or tree samples can be expanded further to arbitrary sizes.

To compute the probability of each symbol appearing in the tree, we “just” need to count how many times each distinct tree sample is repeated and divide that by the number of all possible tree samples. The denominator is the most computationally heavy and least trivial part implementation-wise, but once we have the probabilities for each unique tree sample, the entropy calculation is the same as before.
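
To make this concrete, here is a throwaway sketch that enumerates the 2-sized samples of the factorial AST (modelled as nested (label, children) tuples, my own simplification of a real AST) and computes their entropy:

from collections import Counter
from math import log2

# The factorial AST from above, as (label, children) tuples.
fact_ast = ("if", [
    ("==", [("n", []), ("0", [])]),
    ("1", []),
    ("*", [("n", []),
           ("fact", [("-", [("n", []), ("1", [])])])]),
])

def pair_samples(node):
    """Yield every overlapping 2-sized tree sample as a (parent, child) label pair."""
    label, children = node
    for child in children:
        yield (label, child[0])
        yield from pair_samples(child)

samples = list(pair_samples(fact_ast))
counts = Counter(samples)
total = len(samples)                # #symbols at k=2, which is #nodes - 1 = 10
entropy = -sum(c / total * log2(c / total) for c in counts.values())
print(entropy)                      # ~3.32: no 2-sized sample repeats in this tiny tree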

Maximum entropy can be achieved if all tree samples are unique in the tree. A tree has minimum entropy if all tree samples are the same, e.g. when all nodes are the same.

Algorithm

Mining the AST out of the code is the part that has to be implemented separately for each programming language; the rest is generic. So far, I've coded it for C++ and Python. For the former, I used LLVM to first compile the code, then to spit out the tree. As C++ symbols can be ambiguous, the compilation is necessary, which obviously requires the code to pass the compilation phase without an error. Just think for a sec how many interpretations this C++ snippet might have: x * y;

  • Declaration of y as a pointer to the type x,
  • Multiplication of x by y.

Python, being an interpreted language, is not that needy: the AST can be derived as long as the syntax is not faulty.

As you might have guessed, keeping track of all unique tree samples becomes close to impossible for large enough trees and expanded enough tree samples. Some trickery is needed to deal with the complexity.

Trick #1

To assess tree sample repetition inside the AST, we need an algorithm to keep track of all tree samples and group them according to their structure and node ids. After the grouping, we know what kinds of tree samples (symbols) are present in the tree and how many of each. Essentially, we have the numerator for computing the probability of a symbol's appearance, pₛ = repeatedₛ/#symbols; arriving at #symbols, the number of all overlapping tree samples in the tree, is a separate problem.

The number of tree samples to track increases exponentially as we raise the size of the AST and the size of the tree samples. There is no way to track all symbols of size >10, even in a small-sized AST (~5000 nodes).
But we only have to track the repeating symbols; the rest can be discarded, as each of the non-repeating ones has a repeatedₛ value of 1, so their pₛ = 1/#symbols always. I made a visualization of how repeated tree samples of increasing size look in a small tree. Let's see if you can figure out the trick just by inspecting the animation:

Depiction of repeated overlapping tree samples at different sizes (k).

It's a mess, I know. Nevertheless, one thing is apparent: unique nodes (ones that are not circled at k=1) are never part of any repeating tree sample, whatever k is. One step further, a unique, connected pair of nodes in the tree (k=2) is never part of the repeating tree samples of size k>2. So a general rule emerges: by keeping track of repeating tree samples at increasing sizes, we know that at size k+1 we only need to extend the already found non-unique tree samples of size k. Going even further, when we extend a tree sample (expand it by one node that is connected to it in the original AST), the extension itself should be something that is repeated, to get a potentially repeating k+1 sample. So an easy way of tree sample expansion is to keep all the repeated parent-child pairs (k=2) and try to attach them to the samples at size k in the following way: for each k-sized (repeated) tree sample uᵢ and each 2-sized (repeated) tree sample qⱼ, check if uᵢ contains the node parent(qⱼ) but not child(qⱼ); if that fits, uᵢ can be extended by child(qⱼ) at the proper position to arrive at a likely repeated symbol of size k+1.

After all possible expansions are done, the tree samples are grouped, and the members of groups containing more than one tree sample are kept alive for the next iteration, so that they can be expanded again. The grouping part of the algorithm requires fast comparison of tree samples, which I solved by comparing the hashes of said samples. Also, when extending trees, we need a fast method to check whether a node is part of the sample or not; again, it's straightforward to keep a hashset of nodes for each tree sample.
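
Below is a rough sketch of one expansion-and-filter step, assuming the AST has been flattened into a labels dict (node id → label) and a children dict (node id → list of child ids); it groups candidates by a recursive structural key instead of the hashing described above, so treat it as an illustration rather than the actual implementation:

from collections import defaultdict

def shape_key(root, sample, children, labels):
    """Order-insensitive structural key of the sample rooted at `root`,
    used to group tree-sample instances that look the same."""
    kids = tuple(sorted(shape_key(c, sample, children, labels)
                        for c in children[root] if c in sample))
    return (labels[root], kids)

def expand_repeated(samples, repeated_pairs, children, labels):
    """Grow repeated k-sized samples into candidate (k+1)-sized ones,
    then keep only the candidates that still repeat.

    samples: list of (root id, frozenset of node ids);
    repeated_pairs: set of (parent_label, child_label) pairs that occur
    more than once in the AST."""
    candidates = set()
    for root, sample in samples:
        for nid in sample:
            for c in children[nid]:
                if c in sample:
                    continue
                # Only attach a child whose parent-child pair itself repeats.
                if (labels[nid], labels[c]) in repeated_pairs:
                    candidates.add((root, sample | {c}))
    groups = defaultdict(list)
    for root, sample in candidates:
        groups[shape_key(root, sample, children, labels)].append((root, sample))
    return [inst for grp in groups.values() if len(grp) > 1 for inst in grp]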

Trick #2

To get the probability of each symbol, and thus the entropy of the AST, we still need to compute #symbolsₖ, which is the number of all k-sized tree samples in the given AST. At k=1, #symbols₁ = #nodes. At k=2, #symbols₂ = #nodes − 1, which is the number of child nodes in the tree, i.e. every node except the root.

All 2-sized overlapping tree samples.

Deriving #symbols₃ is still intuitive: at each node vᵢ the possible symbols rooted at vᵢ are counted. If vᵢ has m children, 2 of them, regardless of order, have to be chosen to form a 3-sized tree sample. Another way of constructing a 3-sized tree is to take a branch from vᵢ that is at least 2 nodes long, so that together with vᵢ they make up 3 nodes; so let's say we have the number of children of children of vᵢ, call it m²ᵢ, which gives us the number of 2-long paths starting from vᵢ. Thus, #symbols₃ = Σᵢ ((mᵢ choose 2) + m²ᵢ). When k>3, our hands have to get dirty.
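
Translated to the toy fact_ast tuple from earlier (my representation, not the author's), the closed form looks like this:

from math import comb

def count_size3_samples(node):
    """#symbols_3: at every node, pairs of children plus 2-long downward paths."""
    _, children = node
    grandchildren = sum(len(c[1]) for c in children)
    own = comb(len(children), 2) + grandchildren
    return own + sum(count_size3_samples(c) for c in children)

print(count_size3_samples(fact_ast))   # 13 for the factorial example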

This problem in essence is described on this reddit page. In short, at each node vᵢ, we have to count the number of possible tree samples of size 1, 2, …, k rooted at vᵢ. This is performed in a postorder traversal of the AST, ascending from the leaves upwards. At vᵢ, we should already know the tree sample counts up to size k for all its children. To count all the combinations of k nodes rooted at vᵢ, we have to solve a subset sum problem: take combinations of tree samples rooted at the children of vᵢ, so that the sum of their sizes equals k−1; this is the same as having a set of numbers (the children's tree sample sizes) and trying to find combinations (subsets) of these numbers that sum up to k−1. We need k−1 nodes below vᵢ to end up with k nodes once vᵢ is included. Consult the implementation of the ntreesample and ntreesample_ functions for further details (I have done extra optimization on top in there), but I hope you got the gist of it.
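
As a rough illustration of that postorder counting (again on the nested-tuple toy tree, and without the extra optimizations the real ntreesample has), a knapsack-style combination of the children's counts could look like this:

def count_k_samples(node, k):
    """Return (total k-sized samples in the whole tree, f), where f[j] is the
    number of j-sized samples rooted at this node."""
    _, children = node
    # g[j] = ways to pick j nodes in total from the children processed so far,
    # each child contributing either nothing or one sample rooted at it.
    g = [0] * k
    g[0] = 1
    total = 0
    for c in children:
        sub_total, fc = count_k_samples(c, k)
        total += sub_total
        ng = g[:]                               # option: take nothing from this child
        for used in range(k):
            if g[used]:
                for take in range(1, k - used):
                    ng[used + take] += g[used] * fc[take]
        g = ng
    f = [0] * (k + 1)
    for below in range(k):
        f[below + 1] = g[below]                 # put the node itself on top
    return total + f[k], f

print(count_k_samples(fact_ast, 2)[0])   # 10, i.e. #nodes - 1
print(count_k_samples(fact_ast, 3)[0])   # 13, matching the closed form above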

I've also implemented a fast, approximate version of tree sample counting, because the solution described above has roughly degree-k polynomial complexity, which blows up when k > 10. The approximation is made by randomly generating mini trees that are representative of the degree and structure of nodes and their surroundings in the original AST. After counting the number of tree samples of size k on those mini trees, I average the counts over them, then multiply the average by the number of nodes to get the result. This approximation can be quite far off from reality; nevertheless, an error here becomes log(error) in the entropy value, so it only has to hit the ballpark.

Further development

Both the repeated tree sample expansion and the tree sample counting algorithms lend themselves to parallel implementation. Performing these computations in parallel is pretty much needed, as the reduced running time could make the entropy metric applicable to industry-sized code bases. All in all, tree entropy is ridiculously hard to compute for large enough ASTs, but it's also a delicate measure for assessing the regularity of code segments.

Check out my implementation here, and play around with it! It takes one run to calculate the tree entropy of your C++ or Python code. You might have to kickstart LLVM first, but only for the C++ part!
