Analysing source control history with Rust

Kev Jackson
THG Tech Blog
Apr 7, 2020

A software project’s source control history contains a vast swathe of information that can be useful beyond merely debugging who made a breaking change via git blame.

Catalhoyuk Archeological Site, Konya, Turkey (photo: Hulki Okan Tabak)

There is a wealth of data that can be derived from combing through a commit log. Taking the THG WMS project as an example of what can be gleaned from scanning through the commits, we should be able to see:

  • Areas of the code which exhibit significant churn
  • Number of changes per component / source folder
  • Languages in use and number of changes by language
  • Number of changes that reference a pull request vs number of changes without a pull request

My hypothesis:

By gathering this data from our project commit history, we can gain a better understanding of the project health and how we can improve the codebase and/or our software development practices.

This exercise feeds into our desire as a team to look for means to improve our project and software engineering processes.

Aside: Project Visualisation

A common (and rather mesmerising) tool for visualising a project’s commit log, gource allows us to see how the project has progressed over time.

Example of Gource visualisation for YUM

Having a visualisation of the progress is good for presentations, but it’s less useful in the context of a continuous integration environment, although it is possible to constrain gource to only produce a video of the changes between two tags:

  • first pipe the output of git log to a file
  • then pass the log file to gource via --log-format git <filename>

Using this it’s possible to create a gource webm or mp4 file for each release as part of our standard build pipeline in Jenkins. If nothing else this fairly hypnotic visualisation looks great.

Extracting the data

Following on from understanding our git branch timeliness, we again make use of the libgit2 library for Rust. An alternative would have been PyDriller, a python library that allows you to extract similar information.

Before digging into the code, let’s look at the data we’re interested in (or rather, the data we can actually get our hands on from the git repository).

Churn rates

One of the first questions we wanted to investigate was “Which modules in the project are changing the most?”

Knowing which areas of code are frequently changed could point to a misunderstanding of the requirements, or to poor quality code being released to production (requiring significant bug fixes and changes to make the system stable).

On the other hand, an area of code that has been unchanged for a long period of time could be an area of the codebase that is poorly understood (particularly by new starters). Conversely, this static code could be awesome and simply never need to change. The change rate of code is just a hint of its quality, but having this data available would allow us to make strategic decisions around prioritising technical work.

Famously, Google has no qualms about rewriting their software every few years, despite rewriting working software having no immediate or apparent financial upside:

Rewriting code cuts away all the unnecessary accumulated complexity that was addressing requirements which are no longer so important. In addition, rewriting code is a way of transferring knowledge and a sense of ownership to newer team members.

Programming Languages analysis

The THG WMS system is made up of various sub-systems, using a variety of programming languages. The majority of the code is java or scala, but there is also a significant amount of python, shell and javascript in use along with a few other languages.

Given the mix of statically-typed vs dynamically-typed languages in play, it would be interesting to see if one language shows significantly more churn (especially when normalised against the number of lines of code, as there are significantly more lines of java than, say, clojure).

As a team we’re considering rewriting some of the subsystems, and knowing which languages correlate more strongly with churn could be a hint that those languages have difficulty expressing solutions to the problem domain we’re working with, and that perhaps a higher-level, more strictly-typed or more dynamic language would be a better fit.

Missing Pull Requests

In some prior work, we looked at the commits between tags to get the number of changes in a release that didn’t have an associated Pull Request on github. As we use a standard github-based workflow, including code review via Pull Request, a commit to master which doesn’t reference a pull request is a sign that the process the team have agreed to use is not always being followed.

Most Productive Engineering Day

On a lighter note, we can also find out which day of the week has the most commits to master. This could be useful for the management team so that they avoid organising meetings etc. on the most productive days, but mainly I was just curious which day of the week the team seemed to be most productive.

Parsing git commit logs

Once again making use of libgit2, we need to set up a walk through the entire history of the project:

Generic function to walk git history

We start with a simple function that creates a Diff for every commit in the history of the project and returns this set as a Vec<Diff>.
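As a rough sketch, walking the history with the git2 crate looks something like this (the function name and error handling here are illustrative rather than the project’s exact code):

use git2::{Diff, Repository};

// Walk the full history from HEAD and build a Diff of every commit against
// its first parent (the root commit is skipped as it has nothing to diff against).
fn collect_diffs(repo: &Repository) -> Result<Vec<Diff<'_>>, git2::Error> {
    let mut revwalk = repo.revwalk()?;
    revwalk.push_head()?;

    let mut diffs = Vec::new();
    for oid in revwalk {
        let commit = repo.find_commit(oid?)?;
        if commit.parent_count() == 0 {
            continue;
        }
        let parent_tree = commit.parent(0)?.tree()?;
        let tree = commit.tree()?;
        diffs.push(repo.diff_tree_to_tree(Some(&parent_tree), Some(&tree), None)?);
    }
    Ok(diffs)
}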

Now that we have this collection of Diffs that we’re interested in, we can create functions to extract the information we want.

Counting commits with Pull Requests

One of the simplest tasks is to count the number of commits which are missing a reference to a Pull Request (PR). To do this we need to be able to identify a PR from a commit message.

This function simply applies a regex to the commit_message. Normally there would be little to write about this sort of code, however with such a large volume of commit messages to parse, we needed to improve the performance. As the regex is identical for every iteration, it is sensible to instantiate it only once.

To achieve this we make use of the lazy_static crate, which provides a macro to allow the creation of a statically defined Regex. In python, ruby etc. we would be able to simply define a global variable, however:

In Rust, there is really no concept of life-before-main, and therefore, one cannot utter this:

static MY_REGEX: Regex = Regex::new("...").unwrap();
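A sketch of the shape of this check with lazy_static (the actual pattern used to match PR references in our commit messages is a placeholder here):

use lazy_static::lazy_static;
use regex::Regex;

lazy_static! {
    // Compiled once, on first use, rather than once per commit message.
    // The pattern is a placeholder for the real PR reference format.
    static ref PR_RE: Regex = Regex::new(r"#\d+").unwrap();
}

// Returns true when the commit message references a pull request.
fn has_pr_reference(commit_message: &str) -> bool {
    PR_RE.is_match(commit_message)
}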

Now that we have a solution that should be faster, we need to actually test our hypothesis, that recreating the Regex for every iteration is a performance issue in the code.

Benchmarking Rust code

Rust has a robust benchmarking solution, criterion, based on the haskell tool of the same name. To use criterion, first we need to specify it in our Cargo.toml file:

[dev-dependencies]
criterion = "0.2"

[[bench]]
name = "my_benchmarks"
harness = false

We don’t need criterion in the final binary so it is only specified in the dev-dependencies section. We then add a [[bench]] section to register the benchmark target and to disable the default test harness so that criterion can provide its own.

The benchmarking code lives under benches/my_benchmarks.rs at the root level of the project, alongside src and tests. The benchmarks are then run with the standard cargo bench command.

Finally we need to define the benchmarks for assessing whether our use of lazy_static has improved the performance of the code or not.

For our initial benchmarks, we have placed the function to be tested and the benchmark code in the same file for readability — in future iterations of the code we’ll properly separate these and import the function under test.
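As an illustration, benches/my_benchmarks.rs might look something like the following; the regex, the sample commit message and the function names are placeholders rather than the project’s actual code:

use criterion::{black_box, criterion_group, criterion_main, Criterion};
use lazy_static::lazy_static;
use regex::Regex;

lazy_static! {
    static ref PR_RE: Regex = Regex::new(r"#\d+").unwrap();
}

// The two variants under comparison: rebuilding the Regex on every call vs lazy_static.
fn extract_pr_naive(msg: &str) -> bool {
    Regex::new(r"#\d+").unwrap().is_match(msg)
}

fn extract_pr_lazy(msg: &str) -> bool {
    PR_RE.is_match(msg)
}

fn bench_pr_extraction(c: &mut Criterion) {
    let msg = "Merge pull request #1234 from feature/some-branch";
    c.bench_function("extract pr from commit message", |b| {
        b.iter(|| extract_pr_naive(black_box(msg)))
    });
    c.bench_function("extract pr from commit message with lazy_static", |b| {
        b.iter(|| extract_pr_lazy(black_box(msg)))
    });
}

criterion_group!(benches, bench_pr_extraction);
criterion_main!(benches);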

After a quick cargo bench we can see the following output:

extract pr from commit message with lazy_static
time: [300.84 ns 302.80 ns 304.60 ns]
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high severe
extract pr from commit message
time: [60.213 us 60.858 us 61.575 us]
Found 2 outliers among 100 measurements (2.00%)
2 (2.00%) high mild

So using lazy_static instead of instantiating a new Regex for each iteration is giving us a decent performance boost — approximately 300ns compared to 60μs for the original code.

We’ll return to the extremely useful criterion again later.

Component & Language Changes

To get a sense of the “churn rate” for each of the components we need to count changes per component. This is essentially the same code as the previous function: a simple Regex to select the component name from the path of the source file (or files) changed in the commit.

For source file languages we have almost the same code, but keyed on the file extension of the source file(s) changed in the commit rather than on the component name.

In both of these cases, using the lazy_static crate gives us the performance we need to process the many thousands of commits in the project in a reasonable time.
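For illustration, the two helpers could look something like this (the path layout, and therefore the regex, is an assumption about how components are organised in the repository):

use lazy_static::lazy_static;
use regex::Regex;
use std::path::Path;

lazy_static! {
    // Assumes the first path segment of a changed file is the component name,
    // e.g. "stock-service/src/main/scala/...". The real layout may differ.
    static ref COMPONENT_RE: Regex = Regex::new(r"^([^/]+)/").unwrap();
}

// Component name for a changed file, if the path matches the expected layout.
fn component_of(path: &str) -> Option<&str> {
    COMPONENT_RE
        .captures(path)
        .and_then(|caps| caps.get(1))
        .map(|m| m.as_str())
}

// Language is approximated by the file extension of the changed file.
fn language_of(path: &str) -> Option<&str> {
    Path::new(path).extension().and_then(|ext| ext.to_str())
}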

Walking the walk

Initially the code created the set of Diffs to evaluate and then iterated over this list for each of the set of stats to extract. This strategy works fine for counting the number of Pull Requests, component and language changes. However it was far too slow for calculating the number of commits by month or day of week.

To address this, these calculations were moved in-line to reduce the number of iterations required over the set. This change makes the function more complex:
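As a simplified sketch of the idea (the struct, field names and pre-extracted tuple are purely illustrative), several statistics are now accumulated in a single pass rather than one pass per statistic:

use std::collections::BTreeMap;

#[derive(Default)]
struct Stats {
    commits_by_weekday: BTreeMap<String, u32>,
    commits_by_month: BTreeMap<String, u32>,
    missing_pr: u32,
}

// One loop over the commits, updating every counter as we go, instead of
// re-walking the history once for each statistic we want to extract.
// Each tuple is (weekday, month, has_pr_reference), extracted elsewhere.
fn accumulate(commits: &[(String, String, bool)]) -> Stats {
    let mut stats = Stats::default();
    for (weekday, month, has_pr) in commits {
        *stats.commits_by_weekday.entry(weekday.clone()).or_insert(0) += 1;
        *stats.commits_by_month.entry(month.clone()).or_insert(0) += 1;
        if !*has_pr {
            stats.missing_pr += 1;
        }
    }
    stats
}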

Even with this small optimisation, processing all the commits to extract all the data required still takes a significant amount of time. It’s time to look at concurrency.

Data Parallelism

Taking a look at the set of Diffs we’re interested in, there is no interdependency from one to the next: the order of processing doesn’t matter and each one can be processed entirely independently of the others. This makes them a good fit for parallel processing.

Rust provides a decent set of concurrency primitives baked into the language and stdlib. However there are higher-level abstractions built on top of these primitives which are significantly easier to use. It seems the pre-eminent library or “crate” for this is Rayon.

Rayon

With this library, we can replace simple iterators with parallel iterators. However not all of the code in the core function that processes the commits in the git tree can easily be replaced in this fashion. Good old-fashioned threads and mutable data structures work here to give us the performance required when processing such large amounts of data.

After creating the Vec of component_names using Rayon, we pass this Vec to a thread to count the occurrences and store the results in a BTreeMap for lookup later.
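A sketch of that shape (not the project’s actual function; the component extraction here is simplified to taking the first path segment of each changed file):

use rayon::prelude::*;
use std::collections::BTreeMap;
use std::thread;

// Extract a component name per changed path in parallel with Rayon, then hand
// the resulting Vec<String> to a background thread that counts the occurrences.
fn count_components(paths: Vec<String>) -> thread::JoinHandle<BTreeMap<String, u32>> {
    let component_names: Vec<String> = paths
        .par_iter()
        .filter_map(|path| path.split('/').next().map(str::to_owned))
        .collect();

    thread::spawn(move || {
        let mut counts = BTreeMap::new();
        for name in component_names {
            *counts.entry(name).or_insert(0) += 1;
        }
        counts
    })
}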

It may seem counter-intuitive to convert the Vec<Diff> into a Vec<String> before counting the occurrences of each component (essentially looping twice when we could just loop once): surely it would be faster to iterate over the data only once, O(n) vs O(2n)?

Back to Criterion to validate this assumption:

extract names and sum v1
time: [79.810 us 80.367 us 81.019 us]
Found 12 outliers among 100 measurements (12.00%)
6 (6.00%) high mild
6 (6.00%) high severe
extract names and sum v2
time: [40.291 us 40.573 us 40.883 us]
Found 2 outliers among 100 measurements (2.00%)
1 (1.00%) high mild
1 (1.00%) high severe

The extract names and sum v1 uses the par_iter and then counts the occurrences in a background thread. The extract names and sum v2 performs all the processing in a single loop in a background thread. Intuitively it makes sense that the second version is more efficient (nearly double the performance according to this benchmark).

However when testing with the real (very large) dataset (the full git repo), counting the occurrences of each component in a single thread looping over the Vec<Diff> was slower than first processing the Vec<Diff> in parallel with par_iter and then re-looping over the output in a separate thread. The two-pass approach cut more than 100 seconds off the total processing time, despite looking inefficient on the surface (and in the initial benchmarks).

This suggested to me that the large number of strings extracted from the git diffs does not fit in RAM when processing the target git repo using the in-line (single-iteration) strategy. Splitting the work into two discrete iterations applies less memory pressure, ensuring that the code completes faster despite the micro-benchmarking suggesting otherwise.

To quickly check this hypothesis I ran the single-background-thread-loop variant and followed with the par_iter + secondary-loop version:

Swap used for the single-background-thread version of the code — peaked at 7.1GB but sustained > 6.5GB of swap
Swap used for par_iter + secondary-loop version — peaked at 5.2GB but sustained ~ 4GB of swap

Peak Performance

Based on the criterion benchmarks, we now have a new hypothesis: we should be able to halve the processing time if we can avoid running out of memory.

With the hypothesis that memory is the main constraint holding back the performance of the code, the structure can be adjusted to try to reduce memory consumption.

Changes to the code to reduce memory consumption and to avoid repeatedly processing the same items (and finally creating a release build instead of a debug build…) led to radically better performance.

Significantly better memory consumption!

After refactoring the code to get the total time down to approximately two and a half minutes to process the entire project git repository, it was time to move on from micro-tuning to the results.

Results & Analysis

The first interesting set of information is the distribution of commits (to master) by day of week:

Wednesday is the most productive day

We can see that since the start of the project the team commits the most changes on Wednesdays, with Monday and Friday being the slowest days of the working week. We expect weekends to be much less productive and this is borne out in the data, with 565 commits on Saturdays and just 223 commits on Sundays. Based on this data it may be fair to say that the team should schedule team-wide meetings on Mondays or Fridays to avoid disrupting their most productive days.

When we consider the commits and how they are distributed over the months of the project we get a different picture:

Here we can see that in June/July 2016 there was a drastic change in commit frequency. This was in fact a process change: a move from direct commits to master, to development branches with commits squashed after a successful pull request. Let’s look at the data from mid-2016 onwards to exclude the early days of the project:

In this sample of data we can see that over time there has been a reduction in the number of commits to master. The sharp drop-off in commits towards the end of each year is also quite noticeable. This drop corresponds to the days and weeks prior to Black Friday where the focus changes from feature development to minor fixes and changes to ensure stability during the peak period for sales and fulfilment of orders via the WMS.

Splitting the mono-repo

There’s a steady decline in commits over time. Again this fits with the fact that the system has been deployed to production and in daily use for the past four years: the need to make larger changes (many commits) has reduced, and although development work is still ongoing to add new features, there is simply less ‘work’ to do compared to when the project was being driven by the initial development effort.

A further aspect that helps explain the decline in commits is the move to separate projects and repositories. At the end of 2018 the decision was made to develop new features and sub-systems in separate repositories. As these new sub-systems see lots of commits as features are developed, their separation from the main repository gives a false impression of project-related commit activity if we only view the main repository in isolation.

Hybrid Cloud

Further strategic choices in the same time frame around the use of public and hybrid clouds led to “infrastructure as code” folders of the codebase being moved out of the main repository as they were migrated to use newer platforms and technologies.

Obviously this decision drove a large number of commits (to another separate repository) that are not captured in this dataset — leading to a skewed view of development productivity or velocity compared to the previous year.

Ĉu vi parolas Java?

Language stats

The stats for the breakdown of the commits by language are mostly as expected. The vast majority of the files in the repository are the common languages you would find in an enterprise java application: java, properties, yaml, css, js.

There is a single component written in clojure that has only had 17 commits to master in total — this component was written in late 2017 and hasn’t needed any maintenance since; a low-churn subsystem.

This has pros and cons. On the one hand, it is a success as it has performed correctly for a long period with little modification. On the other hand, the language didn’t proliferate into other components and wasn’t considered a suitable tool choice by other engineers in the team: lisp == Lost In a Sea of Parentheses indeed 😒

In comparison there is a group of scala components that handle stock keeping. These are obviously in the firing line for every change to how the business wants to record or track stock in the warehouse, yet 2733 commits to master referencing a scala change is still quite good.

This was initially a single component; over time the use of scala has expanded to three components, so the java engineers were more eager to adopt scala than clojure as a replacement JVM language. (The eagle-eyed will have spotted a small number of kt files: yes, there is now a kotlin component, which was added and later moved into a separate repository.)

The python files are predominantly scripts used for maintenance tasks, test automation and build/development tasks along with some simple web hooks for monitoring purposes.

The yaml/yml files are a mix of “infrastructure as code” (ansible, AWS cloudformation & GCP deployment manager) and Springboot configuration files.

The real outliers are the properties and sh file types, which suggests that a script is modifying these files on every commit. This is something to be investigated, as the core java properties and wrapper shell scripts don’t change that frequently. Beyond the other information gleaned from the analysis, this points to some form of defect in the “merge to master” CI pipeline.

Commit inconsistencies

Of the total commits to master, 66437 were missing Pull Request messages linked to Jira issues. However, this total includes the period when direct commits to master without a Pull Request were the standard process. If we subtract those commits (65770) we are left with just 667 that didn’t follow the correct process, which is just under 10% of the commits made since the process was introduced.

Commits & LOC per language

Using the excellent Tokei to get a breakdown of the lines of code for the various languages in the repository, we can compare the number of commits by language against the lines of code in each language.

In this repo the java code comprises 53% of the lines of code and about 14% of all commits contain at least one java change. In comparison, looking at the scala code that comprises the stock system, scala is just over 3% of the source code and around 2% of all commits contain at least one scala change.

Given the discrepancy between the number of components written in java and the number in scala it isn’t surprising that the scala code is changed at a reduced frequency. We can perform the same analysis for the other languages in the repo.

Wrapping Up

The next steps are to make the analysis code more configurable so that the engineering managers can run a commit analysis restricted by tag, team member or component. There was also a request to add time ranges as an option rather than processing the entire log from the beginning of the project.

One final feature enhancement would be to process each commit and store the resulting data in a relational database. Instead of processing and analysing in place, the processing and analysis would become two discrete steps; classic ETL, effectively.

A further optimisation of this approach, to allow for significantly more expensive data transformations, would be to execute the analysis after every merge via the CI tools (over the range of new commits, rather than across the entire git history). Again the output would be inserted as records into a relational database.

This would give major benefits in enabling the engineering managers to perform ad-hoc queries against the data extracted. (From my point of view I would be “forced” to modify the code to use rust’s database libraries — oh dear what a shame 😉)

The current rust code to generate these statistics from the WMS team’s git repository is available for study here. It’s (highly) likely the requested features from the engineering managers will be added to this codebase over time.

We’re recruiting

Find out about the exciting opportunities at THG here:
