Imagine that you’re a software engineer who has just joined a new company, or moved internally, and you’re confronted with a codebase (or a part of one) that you haven’t encountered before. To get a jump start on the new assignment, you may decide to spend some of your free time reading the existing sources. Maybe you have a mentor who can point you to exactly which files to read, and in which order. If you’re left to your own devices, however, it can be difficult to get the most out of your source-reading time. This article addresses that scenario by providing an approach to identifying the most important files in a Git repository. After delineating the steps, it looks at the results of running such a search on some of the more popular open-source repositories on GitHub. Lastly, it discusses potential improvements and ramifications.
To start, let’s convert the Git history into a weighted undirected graph. The files in the repository (or in a sub-folder of the repository) form the vertex set. The weight of the edge between any two vertices equals the number of times those two files have appeared in the same commit.
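As a concrete sketch (assuming Python and the `git` command-line tool), the per-commit file lists can be recovered from the output of `git log --name-only --pretty=format:%H`. The hash-detection heuristic below would misread a filename that happens to be exactly 40 hex characters, so a real implementation might use a sentinel separator instead:

```python
def parse_git_log(log_text):
    """Parse `git log --name-only --pretty=format:%H` output into a list of
    (commit_hash, [touched_files]) pairs.

    The output interleaves 40-character commit hashes with the file paths
    each commit touched, separated by blank lines.
    """
    commits = []
    current_hash, files = None, []
    for line in log_text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Heuristic: a 40-character lowercase-hex line is a commit hash.
        if len(line) == 40 and all(c in "0123456789abcdef" for c in line):
            if current_hash is not None:
                commits.append((current_hash, files))
            current_hash, files = line, []
        else:
            files.append(line)
    if current_hash is not None:
        commits.append((current_hash, files))
    return commits
```

The log text itself can be produced with, for example, `subprocess.run(["git", "log", "--name-only", "--pretty=format:%H"], capture_output=True, text=True)`.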
For example, in a dummy repository which has one commit containing two empty text files, we would see a graph with two nodes and one edge between them with a weight of one.
After determining how often files were committed together, each edge weight x should be replaced with a value inversely proportional to it (e.g. 1000/x), so that a small edge weight means the two files were committed together often. Treating weight as “distance” in this way lends itself to the next step.
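A minimal sketch of both steps, counting co-occurrences with `itertools.combinations` and then inverting the counts (the numerator of 1000 is arbitrary; any positive constant preserves the ordering):

```python
from collections import Counter
from itertools import combinations

def build_cochange_graph(commits, scale=1000.0):
    """Build a weighted undirected graph from per-commit file lists.

    commits: an iterable of lists of file paths, one list per commit.
    Returns a dict mapping sorted (file_a, file_b) pairs to edge weights,
    where frequently co-committed pairs get *small* weights.
    """
    counts = Counter()
    for files in commits:
        # Every unordered pair of files in a commit strengthens an edge.
        for pair in combinations(sorted(set(files)), 2):
            counts[pair] += 1
    # Invert (x -> scale / x) so that "committed together often" reads as
    # "close", which suits the shortest-path computation that follows.
    return {pair: scale / count for pair, count in counts.items()}
```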
Once the repository is modeled as a graph, it becomes amenable to standard graph algorithms. For this approach, we’ll turn to a popular algorithm for identifying the “key” elements of a graph: betweenness centrality. It assigns a score to each vertex; the higher the score, the more “central” the vertex. Roughly speaking, a vertex’s score grows each time it falls on a shortest path between two other vertices in the graph.
After calculating the vertex scores, it is reasonable to consider the files with the highest scores the most important files in the repository (or subset) in question.
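As a from-scratch sketch, weighted betweenness centrality can be computed with Brandes’ algorithm (in practice a graph library such as NetworkX provides this out of the box). The adjacency-dict representation and the `top_files` helper are assumptions of this sketch, not part of the prototype described here:

```python
import heapq

def betweenness(adj):
    """Brandes' algorithm for betweenness centrality on a weighted,
    undirected graph given as {vertex: {neighbor: positive_weight}}."""
    bc = dict.fromkeys(adj, 0.0)
    for s in adj:
        # Dijkstra from s, recording settlement order, shortest-path
        # counts (sigma), and shortest-path predecessors.
        dist = dict.fromkeys(adj, float("inf"))
        sigma = dict.fromkeys(adj, 0.0)
        preds = {v: [] for v in adj}
        dist[s], sigma[s] = 0.0, 1.0
        order, settled = [], set()
        heap = [(0.0, s)]
        while heap:
            d, v = heapq.heappop(heap)
            if v in settled:
                continue
            settled.add(v)
            order.append(v)
            for w, weight in adj[v].items():
                nd = d + weight
                if nd < dist[w]:
                    dist[w], sigma[w], preds[w] = nd, sigma[v], [v]
                    heapq.heappush(heap, (nd, w))
                elif nd == dist[w]:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate each vertex's dependency in reverse settlement order.
        delta = dict.fromkeys(adj, 0.0)
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each unordered pair was counted from both endpoints; halve the totals.
    return {v: score / 2.0 for v, score in bc.items()}

def top_files(adj, k=5):
    """Return the k highest-scoring files."""
    scores = betweenness(adj)
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

Given the pair-to-weight dict from the previous step, the adjacency dict can be built by inserting each pair in both directions.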
In addition to Git, this approach will work for any version control software whose basic unit of contribution is a commit (or a similar concept, e.g. a changelist).
This section consists of listings, and so it is a bit dry, but it may be of particular interest to those who use one or more of these libraries. The libraries profiled are: Google Guava, ZooKeeper, Git, Spark SQL, and GraphX.
According to the above approach, these are the five most important files in the Google Guava repository:
According to the above approach, these are the five most important files in the ZooKeeper repository:
The Git project is itself managed with Git. According to the above approach, these are the five most important files in the Git repository:
According to the above approach, here are the five most important files in the Spark SQL subset of the Spark repository:
According to the above approach, these are the five most important files in the GraphX subset of the Spark repository:
There is no benchmark that quantitatively determines the validity of these results; after all, file importance is a subjective attribute. However, in my experience, showing these results to people with relevant experience generally elicits head nods. Unfortunately, it sometimes also elicits groans, as the approach is particularly adept at identifying certain anti-patterns (e.g. the God object anti-pattern).
One implementation challenge is accounting for refactors that split one file into many, or that rename a file. Unfortunately, the prototype built for this article does not handle these cases, and so some of the results include files which are no longer part of their respective repositories. This could be frustrating to a newcomer, but it could also help them understand a little about the history of the project.
In addition to the use outlined in the introduction, this type of analysis could also be useful to white-box QA engineers looking for errors or vulnerabilities in a codebase.
In some cases, repository histories span decades, and distilling that volume of data into a prioritized shortlist can be a challenging but worthwhile endeavor. If you’re interested in running this on your own repository, please send a note to email@example.com.