Exploring the hidden architecture of your codebase

Charles Géry
7 min read · Jul 21, 2021


We often talk about “exploring” a codebase, particularly when joining a new team or project. Instead of exploring by opening files and folders in an IDE, what if we could explore a codebase the way we explore a new city, investigating different neighborhoods and following the connections between them? We’ve built a tool that lets you visualize your codebase and explore it in this way.

In what follows, we present Viseagull, an open-source tool that helps you analyze and visualize your codebase.

Ludic design / The approach

This project does not aim to provide a tool that performs one particular analysis. Inspired by ludic design, it aims to create an interactive and playful visualization, where users can try different things and set their own goals. We hope that it will inspire curiosity, exploration, and reflection.

Given this approach, we would love for you to try the tool on your own codebase and let us know what you find. Please post a comment below telling us how you used the tool and, if you can, a screenshot of how your codebase looks in the visualization. Your feedback will help us better understand how you interact with the visualization, how it can help you understand and navigate a codebase, and so on.

How to try this for yourself!

To try the tool, clone the project’s repository and run it on your own machine. You can analyze any local repository or any online open-source repository. Further instructions on how to install and use the tool are provided in the project’s README.md.

Feel free to ask any questions you might have about the tool in the comments or directly on the GitHub page of the project. Feedback on what you think and how you use the project would be greatly appreciated!

The rationale: couplings between files

The idea for this tool comes from the fact that the files in a repository are coupled. They interact with each other in ways that are not always easy to detect. Different types of couplings between files exist, but in what follows we focus on only two of them:

  • Logical couplings: couplings between files that are modified in the same commits. For instance, a source code file and its related test file will probably often be modified in the same commits: they are logically coupled.
  • Semantic couplings: couplings between files that share a common lexicon. For instance, two source files that are part of the same feature will probably share a part of their lexicon: they are semantically coupled.

Knowing these couplings helps you better understand how your codebase is structured. The goal is then to display them so that developers can easily visualize them.

In what follows we will explain how Viseagull computes the couplings and then displays them.

How does it work?

Viseagull works in two main steps:

  1. It analyzes your repository to find the couplings between the files and to get clusters of strongly coupled files.
  2. It creates a visualization of your repository based on the results of the analysis.

Let’s look at each step in more detail:

1. Analyzing the repository

Viseagull implements two types of couplings, logical and semantic, but the way we get them is similar. We first build a distance matrix containing the distance between each pair of files (the smaller the distance, the stronger the coupling). This distance matrix is then fed into a clustering algorithm to detect groups of strongly coupled files.
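To make the "distance matrix in, clusters out" idea concrete, here is a minimal sketch. We do not name the clustering algorithm in this post, so the threshold-based single-linkage grouping below is only a stand-in chosen for illustration; the function name `cluster_by_threshold` and the `max_distance` parameter are hypothetical.

```python
from typing import Dict, List


def cluster_by_threshold(distances: List[List[float]], max_distance: float) -> List[int]:
    """Group items linked by a chain of pairwise distances <= max_distance.

    Illustrative stand-in for a real clustering algorithm (an assumption).
    Returns one label per item, numbered from 0 in order of first appearance.
    """
    n = len(distances)
    parent = list(range(n))  # union-find forest: every item starts alone

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps trees shallow
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if distances[i][j] <= max_distance:
                parent[find(i)] = find(j)  # merge the two groups

    labels: Dict[int, int] = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(n)]


# Hypothetical distances between app.py, test_app.py and README.md.
example = [
    [0.0, 0.2, 0.9],
    [0.2, 0.0, 0.95],
    [0.9, 0.95, 0.0],
]
print(cluster_by_threshold(example, max_distance=0.5))
```

With these made-up values, the first two files end up in the same cluster and the third one stays alone.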

The precise way the algorithm works for each type of coupling is as follows:

  • For logical couplings, we first build a data frame with the files as rows and the commit hashes as columns. A cell is set to 1 if the file was modified in that commit, and to 0 otherwise. From this data frame, we compute a distance matrix containing the Jaccard distance between each pair of files. Finally, the distance matrix is used for clustering, which gives us clusters of files that were frequently modified in the same commits.
  • For semantic couplings, we preprocess the files, compute the tf-idf vector of each file, and then compute the cosine distance (one minus the cosine similarity) between each pair of files. This cosine distance matrix is then used for the clustering, so files that share a lexicon are grouped together. For now, semantic couplings are only supported for Python files.
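The two distances above can be sketched in a few lines of plain Python. This is an illustration under assumptions, not Viseagull's actual code: each file is represented by the set of commit hashes that modified it (equivalent to one 0/1 row of the data frame), the tokens are assumed to be already preprocessed, and the tf-idf weighting used here is the plain textbook variant. All function names and sample data are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Set


def jaccard_distance(commits_a: Set[str], commits_b: Set[str]) -> float:
    """1 - |A ∩ B| / |A ∪ B| over the commits that modified each file."""
    union = commits_a | commits_b
    if not union:
        return 0.0  # convention chosen here for two files without commits
    return 1.0 - len(commits_a & commits_b) / len(union)


def tfidf_vectors(docs: Dict[str, List[str]]) -> Dict[str, Dict[str, float]]:
    """Textbook tf-idf: term frequency times log(N / document frequency)."""
    n_docs = len(docs)
    doc_freq = Counter(term for tokens in docs.values() for term in set(tokens))
    vectors = {}
    for name, tokens in docs.items():
        counts = Counter(tokens)
        vectors[name] = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return vectors


def cosine_distance(u: Dict[str, float], v: Dict[str, float]) -> float:
    """1 - cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # no usable terms: treat the files as unrelated
    return 1.0 - dot / (norm_u * norm_v)


# Logical coupling: made-up commit hashes per file.
print(jaccard_distance({"c1", "c2", "c3"}, {"c2", "c3"}))

# Semantic coupling: made-up, already tokenized files.
vectors = tfidf_vectors({
    "cart.py": ["cart", "item", "price"],
    "checkout.py": ["cart", "price", "payment"],
    "logger.py": ["log", "level"],
})
print(cosine_distance(vectors["cart.py"], vectors["checkout.py"]))
print(cosine_distance(vectors["cart.py"], vectors["logger.py"]))
```

In this toy example, `cart.py` is closer to `checkout.py` than to `logger.py` because the first two share part of their lexicon.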

2. Visualizing your repository

Once the analysis is over, the tool creates an interactive visualization. Like the Software as Cities approach, this visualization draws an analogy between codebases and cities: codebases are displayed as cities where the files are the buildings.

In practice, our visualization works as follows:

  • Files are represented as buildings, whose shape, color, and other contextual information can encode information about the files (we will go into more detail about that below).
  • The files are grouped in clusters/cities of strongly coupled files, based on the results of the analysis.
  • The height of a building is the number of commits in which the file has been modified (i.e. the taller the building, the more commits the file has been modified in).
  • Roads link clusters/cities if files from both cities have been modified in common commits.
A cluster/city containing several files/buildings. The city is connected to other ones by several roads.
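The building heights and the roads described in the list above can both be derived from the commit history. The sketch below shows one way to do that, as an illustration only. In particular, defining a road's width as the number of commits that touch files in both clusters is our assumption for this example; the names `building_heights`, `road_widths`, and the sample data are hypothetical.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List


def building_heights(commits: Dict[str, List[str]]) -> Counter:
    """Number of commits that modified each file."""
    return Counter(path for files in commits.values() for path in set(files))


def road_widths(commits: Dict[str, List[str]], cluster_of: Dict[str, int]) -> Counter:
    """For each pair of clusters, count the commits touching files in both.

    Using this count as the road width is an assumption of this sketch.
    """
    widths: Counter = Counter()
    for files in commits.values():
        clusters = sorted({cluster_of[path] for path in files if path in cluster_of})
        for pair in combinations(clusters, 2):
            widths[pair] += 1
    return widths


# Made-up history: commit hash -> files modified in that commit.
commits = {
    "c1": ["app.py", "test_app.py"],
    "c2": ["app.py", "README.md"],
    "c3": ["app.py", "test_app.py", "README.md"],
}
cluster_of = {"app.py": 0, "test_app.py": 0, "README.md": 1}

print(building_heights(commits))
print(road_widths(commits, cluster_of))
```

Here `app.py` gets the tallest building, and a single road links cluster 0 to cluster 1 because two commits touch files in both.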

To choose where to place the cities, we run a t-SNE dimensionality reduction on the distance matrix from the analysis above. This gives us a 2D position for each file. We then take the 2D centroid of each cluster of files as the position of that cluster. Finally, we use a spring-layout-like algorithm to avoid overlap between the clusters/cities while preserving the distances between them as much as possible. With this approach, similar clusters end up close to each other: the distance between two clusters should be as close as possible to the Jaccard/cosine distance between the files that compose them.
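The last two steps, computing centroids and removing overlaps, can be sketched as follows. This assumes that the 2D positions of the files are already available (the t-SNE step is not shown) and that every city is drawn as a circle of the same radius, which is a simplification. The `push_apart` function is a much simpler stand-in for a spring-layout-like algorithm: it only separates overlapping cities and has no attraction force. All names and values are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def cluster_centroids(positions: Dict[str, Point], cluster_of: Dict[str, int]) -> Dict[int, Point]:
    """Average the 2D positions of the files in each cluster."""
    sums: Dict[int, List[float]] = {}
    for path, (x, y) in positions.items():
        acc = sums.setdefault(cluster_of[path], [0.0, 0.0, 0.0])
        acc[0] += x
        acc[1] += y
        acc[2] += 1
    return {c: (sx / n, sy / n) for c, (sx, sy, n) in sums.items()}


def push_apart(centers: Dict[int, Point], radius: float,
               max_rounds: int = 100, tol: float = 1e-9) -> Dict[int, Point]:
    """Move overlapping equal-radius circles apart, pair by pair.

    Stops when no pair overlaps any more or after max_rounds passes.
    """
    pts = {c: list(p) for c, p in centers.items()}
    ids = sorted(pts)
    for _ in range(max_rounds):
        moved = False
        for i, a in enumerate(ids):
            for b in ids[i + 1:]:
                dx = pts[b][0] - pts[a][0]
                dy = pts[b][1] - pts[a][1]
                dist = math.hypot(dx, dy)
                overlap = 2 * radius - dist
                if overlap <= tol:
                    continue
                # Unit vector from a to b (arbitrary if the centers coincide).
                ux, uy = (dx / dist, dy / dist) if dist > 0 else (1.0, 0.0)
                pts[a][0] -= ux * overlap / 2
                pts[a][1] -= uy * overlap / 2
                pts[b][0] += ux * overlap / 2
                pts[b][1] += uy * overlap / 2
                moved = True
        if not moved:
            break
    return {c: (x, y) for c, (x, y) in pts.items()}


# Made-up 2D positions, as a dimensionality reduction could produce them.
positions = {"a.py": (0.0, 0.0), "b.py": (2.0, 0.0), "c.py": (2.0, 0.0)}
cluster_of = {"a.py": 0, "b.py": 0, "c.py": 1}

centers = cluster_centroids(positions, cluster_of)
print(centers)
print(push_apart(centers, radius=1.5))
```

Pushing each pair apart along the line between their centers moves the cities as little as needed, which keeps the layout close to the original distances.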

The visualization is meant to be interactive: users can scroll, zoom, and hover over the different elements to get more information.

Hovering, Zooming and Scrolling example
Clicking on a cluster to see the roads it is connected to.

The visualizations of logical and semantic couplings share similarities, which we detail next. We will also describe the parameters of the visualization you can play with:

Similarities between the visualization of logical and semantic couplings

When you select the type of couplings you want to display, only the way the files are clustered in cities changes in the visualization. The rest of the visualization stays the same, no matter which type of couplings you chose to display:

  • The height of the buildings will always be the number of times a file has been modified in commits (i.e. the higher, the more commits it has been modified in),
  • Roads will still link clusters/cities if files from both cities have been modified in common commits.
Same repository as in the previous images, but visualized with semantic couplings.

Visualization parameters

Several parameters are available in the visualization to help you explore your codebase. These parameters are listed below:

  • Change the color of the buildings based on their last modification date or their creation date. In both cases, the more recently a file has been modified/created, the redder it will be.
The more recently modified files are redder.
  • Highlight the buildings modified in a commit: if you enter the hash of a commit, all the files/buildings modified in it will be colored in blue.
  • Display only the roads with a width bigger than X (where you choose the value of X). This helps to remove noise (i.e. roads with a small width).
Changing the road width threshold.
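Two of these parameters are easy to sketch in code. The example below is an illustration under assumptions, not the tool's implementation: the white-to-red color scale is our own choice to make "more recent means redder" concrete, and the names `recency_to_red` and `visible_roads` are hypothetical.

```python
from datetime import datetime
from typing import Dict, Tuple


def recency_to_red(timestamps: Dict[str, datetime]) -> Dict[str, Tuple[int, int, int]]:
    """Map each file to an RGB color: newest is pure red, oldest is white.

    The white-to-red scale is an assumption made for this sketch.
    """
    if not timestamps:
        return {}
    oldest = min(timestamps.values())
    newest = max(timestamps.values())
    span = (newest - oldest).total_seconds()
    colors = {}
    for path, when in timestamps.items():
        recency = (when - oldest).total_seconds() / span if span else 1.0
        fade = round(255 * (1.0 - recency))  # green and blue drop as recency rises
        colors[path] = (255, fade, fade)
    return colors


def visible_roads(widths: Dict[Tuple[int, int], int], min_width: int) -> Dict[Tuple[int, int], int]:
    """Keep only the roads whose width is strictly bigger than min_width."""
    return {pair: width for pair, width in widths.items() if width > min_width}


# Made-up last-modification dates and road widths.
print(recency_to_red({
    "old.py": datetime(2020, 1, 1),
    "new.py": datetime(2021, 1, 1),
}))
print(visible_roads({(0, 1): 5, (0, 2): 1}, min_width=2))
```

Raising `min_width` hides the thin roads first, which is what removes the noise from the picture.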

Visualization interpretations

The visualization is ludic by nature and has no specific goal. Depending on the repository you are looking at, the interpretation of the visualization might vary. Moreover, the quality of your commits (for logical couplings) and of your naming (for semantic couplings) can affect the quality of the results of the analysis. Still, here are a few general leads for interpretation that might apply to your repository.

You should aim for a “sustainable city”. As in real cities, you don’t want cities that are too spread out (lots of roads, lots of small buildings that are not strongly correlated). Such sprawl means that no component of your codebase is strongly correlated with another, yet each one is loosely correlated with a lot of other files.

You also want to avoid having a single cluster that contains all your files. It means that all the files are strongly correlated and that there is no file specialization: the files are in charge of a lot of different things.

Instead, aim for a balanced city, with medium-sized clusters and as few roads as possible. The goal is to have clusters of specialized files (for a specific feature, for instance) that are coupled only with the other clusters they really need.
