Semantic Code Graph — the idea behind Graph Buddy

Krzysztof Borowski
Jun 1 · 10 min read

Graph Buddy is an experimental tool created to visualize code structure as 2D/3D graphs. It aims to support code comprehension [1] processes and speed up reading and browsing the source code. However, that is only one of the many yet-to-come uses of the broader idea of Semantic Code Graphs. The general thinking is — the code structure and semantic dependencies of any program can be represented as a directed graph, precise enough to be valuable in various analyses and visualizations. Let me elaborate on this topic.

You can learn more about Graph Buddy at graphbuddy.virtuslab.com.

Importance of static code structure analysis

It’s hard to estimate how many new lines of code we create each year but there is one thing we can safely assume — the numbers are huge. Sage in his post estimates around 93 billion lines of code are written each year. Last year Github reported over 60 million new repositories created in 2020 and 1.9 billion contributions from over 56 million users. What is more, users’ activity has increased even more due to the current Covid-19 situation. The importance and usefulness of various software solutions have been proven during these hard times and software companies will hopefully continue to deliver. Having said that, the software is probably not going to disappear and active maintenance of huge numbers of existing systems will be crucial in the coming years. This implies that even a small improvement that reduces software maintenance overheads, multiplied by the number of developers and lines of code written each year, will have a significant economic impact and hence is worth pursuing.

One of the visible problems of today’s software systems is maintaining a high code quality which usually easily degrades over time [2]. What I find particularly interesting is the quality degradation of code structure which I have observed over the years in different systems.

Controversial claim — the problem of maintaining a clear project structure and dependencies between code units is one of the main reasons for microservice architecture’s popularity. Microservices work around this issue by dealing with smaller pieces of code and hence making code structure easier to maintain. In many cases, there is nothing wrong with well structured monoliths (or at least a bigger macroservice).

Note: it seems that the importance of this problem is getting wider recognition nowadays and libraries like ArchUnit are quickly gaining popularity.

What is even worse, often you, as a programmer, join an existing project where there was not enough focus or time to keep the software structure and appropriate layers in place. It’s almost impossible to figure out all the twisted dependencies to introduce required changes and it’s even more challenging to systematically improve the project structure quality. Refactoring a particular method or place in the code can be hard, refactoring a spaghetti-like graph of different dependencies is even harder. Speaking of this graph — what is it, can we somehow extract and visualize it?

Code as a graph

Representing code as a graph is not a new concept. There are two main types of graphs used today, Control Flow Graphs (CFG) and Program Dependence Graphs (PDG) [3].

CFGs are about representing program flow. Nodes correspond to the basic blocks (usually a set of statements in the program) and edges set the control path between them (simply speaking the order of statements execution). CFGs are often used for compiler optimization and some static code analysis like computing test coverage.

PDGs are an extension of CFGs. They contain two types of edges, one to represent the control flow path and the second one to represent the statement dependencies.

Let’s consider a simple program:

Compute maximum from two integers and increment by one

The corresponding Program Dependence Graph can look like this:

Example of Program Dependence Graph (edges — control flow path and data dependency)

We can try to construct such a graph not only for a particular program (program here only means a set of instructions) but also include the overall system structure, declarations and dependencies between them — and then it will become a System Dependency Graph.

There are multiple other extensions of these two types [3]. Call Graphs [4] (a type of CFG) are particularly interesting and widely used. Each node represents a procedure and edge from the procedure to indicates that procedure is calling procedure . This concept can be used, for example, to find procedures never called, to detect some anomalies in code execution (like recursion, transitive cyclic dependencies) or to act as documentation to support the program comprehension process for the programmer. Another wide application for program graphs is so-called program slicing [5] — the process of extracting some parts of the program which fulfil provided requirements.

Semantic Code Graph — structural and semantic information linked to your source code

The code graphs presented above tackle some parts of program meta information but none of them can represent the whole system with enough detail to make educated assumptions about the quality of the project structure and help with its comprehension.

Note: when we look at the code as a text file with some language keywords, brackets or significant indention, nested code blocks — we talk about code structural information. Everything that goes beyond that, we treat as semantic information. For example, does this variablereference this declaration? We cannot answer that without looking at the actual meaning of this source code tag (and usually without the help of a compiler or interpreter).

We seek a graph that will address the following requirements:

  • can represent the code structure of the whole system,
  • preserves as many of the semantic details as possible (mostly semantic dependencies between different parts of the program),
  • stores information about the exact place of a given code piece in the source code.

We can try to model the code as a property graph where the nodes represent declarations and definitions, edges model the code structure and other semantic details (like call graphs or semantic dependencies between different modules). Additional characteristics like a location in the source code can be stored as a node and edge properties. This approach is significantly different to PDG where nodes represent statements rather than code declarations and definitions. As programmers, if we browse the source code in the quest for understanding twisted dependencies, we usually go through a set of definitions and method calls and build a mental dependency graph.

The implementation for Scala

To create a graph with this level of details we need two types of information:

  • Information about the code structure (syntax)
  • Information about the code semantics

Code structure

A compiler has full knowledge of the syntax and semantics of the programs. During the compilation, the compiler creates something called an Abstract Syntax Tree (AST). Wikipedia states:

AST is a tree representation of the abstract syntactic structure of source code written in a programming language. Each node of the tree denotes a construct occurring in the source code.

The AST looks different in every programming language. What is more, as it is an abstract concept, there are often multiple AST implementations for the very same language (for example, in Scala we have AST generated by the scalac compiler and separate project scala.tree with its own representation of Scala program structure). But, after all, it’s always some sort of tree and represents at least the code structure. While traversing the particular tree we can easily figure out the node’s structural dependencies; which node is a parent, which one is a child. For constructing our graph, it is particularly useful as we can create correct aggregation nodes (for example make a “DECLARE” edge from the class to its method). You can probably spot that, to some extent, we are constructing a variation of a Program Dependency Graph.

If we are talking about Scala, you can experiment with the AST here. For our use, we can utilize the scala.tree library.

But that is only structural information, usually built per file, meaning that we don’t really know how it connects to other parts of our system or if the code is even semantically correct. For example, in the snippet below, the Scala syntax is correct but there is a missing symbol definition:

Missing `b` definition — semantically incorrect code

Code semantics

The compiler also knows additional details about the code, such as the type, scope, package and symbol references between different compilation units. We can incorporate that data to enhance the graph node with additional properties and also add important edges between previously created nodes. Sometimes the AST is directly augmented with such data partially or even fully (i.e. with types, so-called Typed Abstract Syntax Trees, i.e. in Scala Tasty). Particularly interesting is the possibility to extract the calls between different units and mark them as “CALL” edges — in this way we are constructing a Call Graph of our system.

To obtain all the semantic details of the Scala code we can use the SemanticDB library. SemanticDB extracts all the semantic data during the compilation process and stores it in a structured way as protobuf files.

Examples

Let’s experiment with some examples and see how it works.

Note: provided examples are for Scala language but similar concepts can be applied to other languages as well.

The first one is self-explanatory — class has a method returning type :

Intuitively the Semantic Code Graph could look like this:

Semantic Code Graph visualization for class `A`

In fact, as Scala generates some constructs or connections behind the scenes (in this case: constructor for class , marked as , and reference to class) the full graph for class will look like this:

Semantic Code Graph for class `A` visualized by the Graph Buddy tool

Let’s add some class with the reference to class via its method :

And the Semantic Code Graph for and classes together will look like this:

Class `A` and class `B` represented as Semantic Code Graph in the Graph Buddy tool

There are a few interesting observations you can make on the visualization above. Firstly, has a “CALL” edge to directly, not as maybe you could expect, indirectly via parameter. As it is stated in the “Code as graph” introductory paragraph, a direct edge from parameter to would construct a data flow path (from Program Dependence Graph), not a Call Graph path which we are trying to represent by “CALL” edge. Secondly, there is a "CALL” edge from to . Printing the scalac typed AST with from our example will reveal that the sign in scala is indeed just a method name, in this case, a method name called on type.

$ scalac -Xprint:typer Demo.scala
[[syntax trees at end of typer]] // Demo.scala
package <empty> {
class A extends scala.AnyRef {
def <init>(): A = {
A.super.<init>();
()
};
def methodA(): String = "methodA"
};
class B extends scala.AnyRef {
def <init>(): B = {
B.super.<init>();
()
};
def methodB(a: A): String = "methodB".+(a.methodA())
}
}

Last, but not least, it is worth mentioning that each node and edge can have a variety of properties, like a kind (represented by a particular icon), name, file path, location range, package, visibility scope and possibly others added at later stages (depending on the type of analysis we are doing).

Properties for the class `B` node

Possible Semantic Code Graph applications

There are a few areas where this theoretical model could be utilized:

  • source code visualizations (you can try visually browsing the source code dependencies in the Graph Buddy tool)
  • monitoring source code structure quality (by applying network analysis theory to analyze the Semantic Code Graph properties)
  • implementing semantic search functionality (with advanced features like search by parent type, kind or other node or edge properties)
  • others — (maybe you can name one and leave the idea in the comments section? :))
phototrip class dependencies visualization based on a Semantic Code Graph, created in gephi

Summing up

In this blog post, I wrote about one of the branches of static code analysis — representing the program as a graph. I introduced the new and promising variation of Program Dependence Graphs called Semantic Code Graphs, which focus on the code structure and its semantics. Each node represents declarations or definitions and edges represent different relations between them. There are two unique aspects of this approach: semantically aware universal representation of the program code structure and close mapping between the graph and actual source code of the project. This theoretical model implemented for Scala is a good start for exploring its various applications. One of the possible directions is code visualization pursued in the Graph Buddy tool, others are yet to come. Stay tuned!

Fun with Semantic Code Graphs and gephi tool ;)

Bibliography

  1. Von Mayrhauser and A. M. Vans, “Program comprehension during software maintenance and evolution,” Computer, vol. 28, no. 8, pp. 44–55, 1995. Available: https://doi.org/10.1109/2.402076
  2. S. G. Eick, T. L. Graves, A. F. Karr, J. S. Marron, and A. Mockus, “Does code decay? assessing the evidence from change management data,” IEEE Transactions on Software Engineering, vol. 27, no. 1, pp.1–12, 2001. Available: https://doi.org/10.1109/32.895984
  3. V. Arora, R. Bhatia, and M. Singh, “Evaluation of flow graph and dependence graphs for program representation,” International Journal of Computer Applications, vol. 56, pp. 18–23, 2012. Available: https://doi.org/10.5120/8959-3161
  4. B. G. Ryder, “Constructing the call graph of a program, ” IEEE Transactions on Software Engineering, vol. SE-5, no. 3, pp. 216–226, May 1979. Available: https://doi.org/10.1109/TSE.1979.234183
  5. K. Androutsopoulos, D. Clark, M. Harman, J. Krinke, and L. Tratt, “State-based model slicing: A survey,” ACM Comput. Surv., vol. 45, no. 4, Aug. 2013. [Online]. Available: https://doi.org/10.1145/2501654.2501667

VirtusLab

Virtus Lab company blog

Medium is an open platform where 170 million readers come to find insightful and dynamic thinking. Here, expert and undiscovered voices alike dive into the heart of any topic and bring new ideas to the surface. Learn more

Follow the writers, publications, and topics that matter to you, and you’ll see them on your homepage and in your inbox. Explore

If you have a story to tell, knowledge to share, or a perspective to offer — welcome home. It’s easy and free to post your thinking on any topic. Write on Medium

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store