Basic understanding of Abstract Syntax Tree (AST)

Jessica López Espejel
5 min readJan 1, 2023

Abstract Syntax Tree (AST) is a tree representation of the source code. Since the tree describes the hierarchy of each node, it makes simple to understand the relation between each node. For example, we have two mathematical expressions: 1) 10 — [(5 / 4 ) + 1] and 2) [10 — (5 / 4)] + 1. Both expressions have the same numbers, but the way of using parentheses and brackets are different. Therefore, figure 1 shows the syntax trees generated by both expressions (left side — expression 1, and right side — expression 2).

Figure 1. Syntax tree of expression 1 and expression 2, respectively.

The syntax tree is composed by nodes and edges, formally we can denote it as T=<N, E>, where N is the list of nodes, and E is the list of edges. The nodes contain the data, and the edges indicate the hierarchy among the nodes. In the nodes of a tree we can identify the root node (red node in figure 1), and the leaf nodes (orange outline nodes). The root node has the highest hierarchy in the tree, and the leaf nodes are the last nodes in the tree.

We can traverse the abstract syntax tree recursively. However, there are already some libraries that make our work easier, for example, tree sitter.

Nodes AST using tree_sitter

Tree sitter is a powerful parser able to generate syntax trees from more than 40 programming languages. For this example, we will use tree_sitter to get the list of nodes from the 10 - [(5 / 4 ) + 1] (expression 1).

from tree_sitter import Language, Parser

# We instanciate the parser, and we indicate the programming language.
parser = Parser()
LANGUAGE = Language('parser/my-languages.so', 'java')
parser.set_language(LANGUAGE)

def tree_to_token_nodes(root_node):
"""
It returns a list of nodes
"""
if (len(root_node.children)==0 or root_node.type=='string') and root_node.type!='comment':
return [root_node]
else:
code_tokens=[]
for child in root_node.children:
code_tokens+=tree_to_token_nodes(child)
return code_tokens

if __name__ == "__main__":
tree = parser.parse(bytes(code,'utf8'))
root_node = tree.root_node
ast_token_nodes = tree_to_token_nodes(root_node)
print(ast_token_nodes)

The generate list of nodes is the following:

[
<Node type=decimal_integer_literal, start_point=(0, 0), end_point=(0, 2)>,
<Node type="-", start_point=(0, 3), end_point=(0, 4)>,
<Node type=identifier, start_point=(0, 4), end_point=(0, 4)>,
<Node type="[", start_point=(0, 5), end_point=(0, 6)>,
<Node type="(", start_point=(0, 6), end_point=(0, 7)>,
<Node type=decimal_integer_literal, start_point=(0, 7), end_point=(0, 8)>,
<Node type="/", start_point=(0, 9), end_point=(0, 10)>,
<Node type=decimal_integer_literal, start_point=(0, 11), end_point=(0, 12)>,
<Node type=")", start_point=(0, 13), end_point=(0, 14)>,
<Node type="+", start_point=(0, 15), end_point=(0, 16)>,
<Node type=decimal_integer_literal, start_point=(0, 17), end_point=(0, 18)>,
<Node type="]", start_point=(0, 18), end_point=(0, 19)>,
<Node type=";", start_point=(0, 19), end_point=(0, 19)>
]

It can be seen that expression 1 generated 13 nodes, i.e., the abstract syntax tree of expression 1 contains 13 nodes. Attention: these are only the nodes that compose the tree, it is not the tree yet. Each of these node has the type of the node, the start_point and the end_point. The type node can be decimal_integer_literal, identifier, character_literal, keyword, etc.

  • decimal_integer_literal — numbers (either integers or decimals)
  • identifier — name assigned by the programmer
  • character_literal —are the strings. For example in System.out.printl(‘helllo’), the character_literal is ‘hello’
  • keyword — own words of the programming language such as public, private, Integer, and so on.

Attention: Each programming language and each tool we use defines differently the properties of each node.

Generating Abstract Syntax Tree using AST explorer

AST explorer is a website so useful to generate AST in a large number of programming languages, such as, Java, JavaScript, Python, etc. Moreover, there are a large number of parser like acorn, @babel/parser, seafox, shift, and so on. Continuing with expression 1 of the previous example, the figure 2 shows the AST generated by the JavaScript programming language and the @babel/parser.

Figure 2. AST generated by AST explorer website

Figure 2 shows the abstract syntax tree. The tree indicates that this is a file that has multiple attributes, e.g., a type, an start, and an end. In turn, there is a program, and within the program is the body of the tree. It is precisely the body of the tree that contains the data we are going to analyze in Figure 3.

Figure 3. Tree body details

Figure 3 shows the formal structure of the tree (Figure 1 — left). The BinaryExpression (-) is highlighted in red (root node). This node is divided into left (blue color) and right (green color). On the left side you can see the value 10 of the node, however, the right side has more data ‘(5/4) +1’, so only a small part is shown.

AST applications

  • Compiler — for a better explanation here
  • Code maintenance
  • Code refactor
  • Clone detection
  • In the last years, abstract syntax trees have been used for artificial intelligence. For example, StructCoder used AST to make the model understand and recognize the syntax of java programming language in the task generation task.

Conclusion

Abstract syntax trees have been used mostly for building compilers, however, they are also useful to see the differences from one version to another in a script. Nowadays, there are many tools to obtain the ASTs of different programming languages, and with different parsers. Recently, they have been used for code generation in artificial intelligence. Much work remains to be done to exploit their properties and apply them in understanding syntax for AI models.

This short blog only presents the introduction of ASTs, and some of their applications. In future blogs I will explain step by step how to traverse an AST, and other practical examples using the tree sitter tool.

References

[1] https://www.newline.co/courses/practical-abstract-syntax-trees/what-is-an-ast

[2] https://www.twilio.com/blog/abstract-syntax-trees

[3] Tipirneni, Sindhu and Chandan, Ming and K. Reddy, Zhu. StructCoder: Structure-Aware Transformer for Code Generation. Proceedings of the 39th International Conference on Machine Learning, 2022.

--

--