Measure the Cadence of Commits in Your Git History

How to mine data from the git log in order to construct test oracles

Noah Sussman
Jan 29

The Git Log as a Source of Oracles

In order to properly test the source code in a Git repository for any abstract definition of “quality,” we need to first establish some baseline measurements of the source code itself. In software QA these baseline measurements are generally called oracles.

In software QA jargon, “oracle” is the general term for the methods we use to evaluate the correctness of a test or suite of tests. When someone looks at your test results and asks “how much confidence do you have in those test results?”, whatever you answer them with is an oracle. In terms of software testing theory, an oracle is a necessary component of any complete test.

What oracles can we look to when testing the quality of source code in a Git repository? One is the cadence of commits in Git history: How often is code committed? Knowing this, it then becomes possible to test assumptions about (for instance) how often new code is put into production, or how often existing code is refactored.

Constructing an Oracle: “Cadence of Commits”

In the rest of this piece I’m going to build up an increasingly granular measurement of cadence of commits, starting with the absolute simplest use case for counting commits in Git history. I hope this will be practically useful (the code shown here will work with any Git repo) and that it will also help to show how to construct oracles from data generally.
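A minimal sketch of that simplest case, assuming `git rev-list --count HEAD` as the counting command (piping `git log --oneline` into `wc -l` should give the same number). The function name `count_commits` is only an illustrative label:

```shell
# Count the commits reachable from the tip of the current branch (HEAD).
# count_commits is a hypothetical helper name chosen for this sketch.
count_commits() {
  git rev-list --count HEAD
}

# Usage, from inside any Git working tree:
#   count_commits
```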

The one-line shell script above prints out the number of commits in the Git log (for the current branch). What does the count mean? It’s an oracle, so its exact meaning is contextual.

However, the raw number of commits in a repo is a useful data point on the frequency with which a codebase gets touched by its authors. For instance, relatively few commits in a longstanding repo would indicate that development has historically been performed in large batches, rather than the preferred pattern of many small commits over time.

In order to determine the age of a Git repo, use the following one-line shell script:
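A minimal sketch, assuming the age is taken from the oldest commit reachable from HEAD and rendered with Git's relative committer date (`%cr`). The name `repo_age` is illustrative:

```shell
# Print the relative age of the oldest commit reachable from HEAD.
# --reverse lists commits oldest-first; head keeps only the first line.
# repo_age is a hypothetical helper name chosen for this sketch.
repo_age() {
  git log --reverse --format=%cr | head -n 1
}

# Usage, from inside any Git working tree:
#   repo_age
```

Git picks the unit of the relative date itself, so with this sketch an older repo may be reported in years (for example `3 years ago`) rather than in months.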

This will print out the relative age of the repo, which will be a string something like 36 months ago to indicate that your repo was created 36 months ago.

In order to get a general idea of the cadence of commits in any Git repo, you could now write a three-line shell script, like the one below, which should print something like 360 commits since 36 months ago:
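One way such a script might look, as a sketch that combines a commit count from `git rev-list --count HEAD` with the relative date of the oldest commit. The function and variable names are assumptions made for illustration:

```shell
# Summarize cadence at the coarsest level: total commits plus repo age.
# cadence_summary, commits and age are hypothetical names for this sketch.
cadence_summary() {
  commits=$(git rev-list --count HEAD)
  age=$(git log --reverse --format=%cr | head -n 1)
  echo "$commits commits since $age"
}

# Usage, from inside any Git working tree:
#   cadence_summary
```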

A maximally high-level view of the cadence of commits in the Git log.

Once you have the age of a Git repository and a count of the commits in its history, you have a metric composed of more than one variable, something that’s all too rare when it comes to measurement!

However, the shell script above does not really get at “cadence” because we cannot use its output to invalidate assumptions about how code is committed over time.

Cadence of Commits Over Time

The Git log contains entries with timestamps but it’s not a time series. Rather, the git log command is a client for running queries against a complex data structure stored in the .git directory.

The most direct way to build a time series out of the timestamps of commits in the Git log is to extract all the timestamps of all the commits and then count how many commits fall into a certain period of time.

For instance, how many commits per month, over the lifetime of the repo? It turns out this question can be answered with relatively few lines of code:
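A sketch of one way to do it, assuming a Git version recent enough to support `--date=format-local:` and using the committer date (`%cd`). The column order (month first, then count) and the name `commits_per_month` are choices made for this sketch:

```shell
# Print "YYYY-MM<TAB>count" for every month that has at least one commit.
# TZ sets the time zone in which month boundaries are computed.
# commits_per_month is a hypothetical helper name chosen for this sketch.
commits_per_month() {
  TZ=UTC git log --date=format-local:'%Y-%m' --format='%cd' |
    sort | uniq -c | awk '{ print $2 "\t" $1 }'
}

# Usage, from inside any Git working tree:
#   commits_per_month > commits.tsv
```

Months with no commits at all do not appear in this output, which is worth remembering before averaging over the rows.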

A one-line shell script that extracts the number of commits per month from a Git repo.

The script above is admittedly terse and dense. It should print one row per month, pairing each month with the number of commits made in it.

The output is tab-delimited, meaning that if you redirect it to a file and give that file a .tsv extension, you can import it into Excel or Google Sheets and immediately make a visualization of commits-per-month in your repository. You can also perform standard data analysis tasks such as finding the average, median, and standard deviation of commits-per-month in your Git repository.
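If you would rather stay in the shell, the same summary statistics can be sketched with `sort` and `awk`. This assumes two-column TSV input with the count in the second column, and it computes the population (not sample) standard deviation. The name `cadence_stats` is illustrative:

```shell
# Read "label<TAB>count" rows on stdin and print the mean, median and
# population standard deviation of the count column.
# cadence_stats is a hypothetical helper name chosen for this sketch.
cadence_stats() {
  sort -t "$(printf '\t')" -k2,2n |
    awk -F '\t' '
      { v[NR] = $2; sum += $2; sumsq += $2 * $2 }
      END {
        if (NR == 0) exit 1
        mean = sum / NR
        # Rows arrive sorted by count, so the median is the middle value(s).
        median = (NR % 2) ? v[(NR + 1) / 2] : (v[NR / 2] + v[NR / 2 + 1]) / 2
        var = sumsq / NR - mean * mean
        if (var < 0) var = 0   # guard against floating-point rounding
        printf "mean\t%.2f\nmedian\t%.2f\nstddev\t%.2f\n", mean, median, sqrt(var)
      }'
}

# Usage, with a hypothetical file of saved output:
#   cadence_stats < commits.tsv
```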

Note that months are calculated according to your current time zone. If you aren’t in your office’s time zone, you can change the value of the TZ variable at the beginning of the script to reflect the time zone that most of your work happens in, e.g. TZ=America/New_York. Alternatively, you could set TZ=UTC if your team is globally distributed.

Weekly Cadence Of Commits

Although bucketing by month is the most direct way to get a time-series measurement out of a Git repo, the weekly cadence of activity is more useful, since for the most part software teams plan in terms of (blocks of) weeks.

The weekly cadence of commits can be calculated using dateround from the dateutils package:

Adding the dateutils command to the recipe above enables bucketing commits by week instead of by month.
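For readers who don’t have dateutils installed, weekly bucketing can also be approximated without it. The sketch below is a swapped-in technique rather than the dateround recipe: it labels each commit with its ISO 8601 week using the `%G` and `%V` strftime specifiers, assuming a Git version that supports `--date=format-local:` and a C library whose strftime understands those specifiers. The name `commits_per_week` is illustrative:

```shell
# Print "YYYY-Www<TAB>count" for every ISO week with at least one commit.
# %G is the ISO week-based year and %V the ISO week number (01-53).
# commits_per_week is a hypothetical helper name chosen for this sketch.
commits_per_week() {
  TZ=UTC git log --date=format-local:'%G-W%V' --format='%cd' |
    sort | uniq -c | awk '{ print $2 "\t" $1 }'
}

# Usage, from inside any Git working tree:
#   commits_per_week > weekly.tsv
```

ISO weeks start on Monday, and the week-based year can differ from the calendar year around New Year, so the labels sort correctly but do not always match the calendar year of the commit.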

This will again output tab-separated values (TSV) that you can save directly to a file with the .tsv extension and then load into a data visualization tool.

I will now conclude, since having drilled down to the weekly cadence of commits we’ve arrived at a Git measurement that’s granular enough to be of use when talking about code quality. In other words, we have arrived at a useful oracle!
