The story of the lost commit: how to solve this mystery

Aleksandr Izmailov
Bumble Tech
10 min read · Mar 9, 2021

It was already evening by the time the developer contacted me. The patch from the deadbeef commit had disappeared from the master branch.

They showed me the evidence: the output from two commands. The first one was:

git show deadbeef

This showed changes to the file: let’s call it Page.php. The canBeEdited method and its usage had been added to it.

And the output from the second command

git log -p Page.php

did not contain the deadbeef commit. Nor did the current version of the Page.php file have the canBeEdited method.

Not having found the solution quickly, we made another patch to the master branch, uploaded the changes, and I could then return to the problem with a clear mind.

Note: I have tried to make this article as accessible as possible for everybody, even for those who don’t have much experience with Git. If you’re one of the Git codebase contributors, or you’ve written your own Git book already, you may spot the problem quicker than usual.

Had someone done it on purpose? Had the file been renamed?

My starting point in looking for the problem was a request for help published on the release engineer team chat. Among other things, they are responsible for repository hosting, and automated processes related to Git. Frankly speaking, they could have been the ones who had deleted the patch, but had they done so they would have left no trace.

One of the release engineers suggested running git log with the --follow option. Maybe the file had been renamed, which would mean Git wasn’t showing some of the changes.

--follow

Continue listing the history of a file beyond renames (works only for a single file).

We found deadbeef in the git log --follow Page.php output, but no deletions or renamings of the file. What’s more, there was no sign of the canBeEdited method having been deleted anywhere. It seemed that the --follow option had played some part in this story, but it still wasn’t clear where the changes had gone.

Unfortunately, the repository under consideration is one of our biggest ones. From the moment the first patch was added, until it disappeared, there had been no fewer than 21,000 commits. Having said that, we were also fortunate that the relevant file had only been edited by ten of them. I looked at them all but found nothing interesting.

Call for witnesses! We need livebear

Hang on a minute! I thought it was deadbeef we were looking for just now. Let’s consider this logically: there has to be a commit, let’s call it livebear, after which deadbeef stopped displaying in the file history. Maybe this wouldn’t come to anything but it was giving me food for thought.

The git bisect command exists for searching the Git history. The documentation says it enables you to find the commit where a bug first appeared. In practice, you can use it to search for any moment in history, provided you know how to determine whether or not that moment has occurred. For us, the “bug” was the absence of the changes in the code. I could check for this with another command, git grep: it was enough to know whether the canBeEdited method was present in Page.php. After a little debugging and some reading of the documentation, I found it:

livebear [build]: Merge branch origin/XXX into build_web_yyyy.mm.dd.hh

This looks like an ordinary merge commit: merging the task branch with the release branch. However, this commit allowed me to reproduce the problem:

$ git checkout -b test livebear^1 2>/dev/null
$ grep -c canBeEdited Page.php
2
$ git merge --no-edit --no-stat livebear^2
Removing …

Removing …
Merge made by the 'recursive' strategy.
$ grep -c canBeEdited Page.php
0
$ git log -p Page.php | grep -c canBeEdited
0
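The bisect search described above can be sketched on a throwaway repository. This is only an illustration under stated assumptions: the history here is linear and invented (the tags known-good and culprit, the helper commit_file and the check script are hypothetical names), whereas the real search ran over a history full of merges.

```shell
set -eu
# Throwaway repository; every file, tag and commit below is made up for illustration.
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # fix the branch name regardless of local defaults
git config user.name "Example"
git config user.email "example@example.com"

commit_file() {   # commit_file <path> <content> <message>
    printf '%s\n' "$2" > "$1"
    git add "$1"
    git commit -q -m "$3"
}

commit_file Page.php 'function canBeEdited() {}' 'add canBeEdited'
git tag known-good
commit_file a.txt 1 'unrelated 1'
commit_file b.txt 2 'unrelated 2'
commit_file Page.php '// method removed' 'drop canBeEdited'
git tag culprit
commit_file c.txt 3 'unrelated 3'
commit_file d.txt 4 'unrelated 4'

# The check exits 0 ("good") while the method is present and 1 ("bad") once it is gone.
check=$(mktemp)
printf 'git grep -q canBeEdited -- Page.php\n' > "$check"

# HEAD is the bad end of the range, known-good is the good end.
git bisect start HEAD known-good >/dev/null
git bisect run sh "$check" >/dev/null
first_bad=$(git rev-parse refs/bisect/bad)
git bisect reset >/dev/null 2>&1

git log -1 --format='%h %s' "$first_bad"
```

In this toy history the search ends on the commit tagged culprit. The important design choice is that the check only answers "is the method there?", which is all git bisect needs.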

To be fair, I didn’t find anything interesting in livebear, and its connection to our problem remained unclear. Having thought about it a bit more, I sent the results of my searches to a developer. We agreed that even if we got to the truth, the procedure for reproducing the problem would be too complicated, and we would be unable to protect ourselves against a similar recurrence in future. So, we officially decided to abandon our searches.

However, my curiosity remained unsatisfied.

Stubbornness is no vice

I came back to the problem several times, ran git bisect and found more and more different commits. They were all suspect, all merge commits, but this got me nowhere. There was one commit I seemed to be getting more often than others, but, in the end, I am not certain that this was the guilty party.

Of course, I also tried other search methods. For example, several times I went through the 21,000 commits made during the period in which the problem occurred. This was not particularly exciting work, but I did notice an interesting pattern. I ran the same command over and over again:

git grep -c canBeEdited {commit} -- Page.php
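Running that check over a whole range of commits could be scripted roughly as follows. This is a sketch on an invented four-commit repository (the tag start and the helper commit_file are hypothetical), not the script that was actually used.

```shell
set -eu
# Throwaway repository with a made-up history.
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example"
git config user.email "example@example.com"

commit_file() {   # commit_file <path> <content> <message>
    printf '%s\n' "$2" > "$1"
    git add "$1"
    git commit -q -m "$3"
}

commit_file Page.php 'function canBeEdited() {}' 'add canBeEdited'
git tag start
commit_file a.txt 1 'unrelated 1'
commit_file Page.php '// method removed' 'drop canBeEdited'
commit_file b.txt 2 'unrelated 2'

missing=0
for commit in $(git rev-list --reverse start..HEAD); do
    # Search the file as it exists in that commit, without checking it out.
    if git grep -q canBeEdited "$commit" -- Page.php; then
        state=present
    else
        state=missing
        missing=$((missing + 1))
    fi
    echo "$(git log -1 --format='%h %s' "$commit"): $state"
done
echo "commits without the method: $missing"
```

Here -q replaces -c because only the yes/no answer matters for spotting a pattern across commits.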

It turned out that “bad” commits, with the wrong code, were all on the same branch. And searching that branch I quickly realised why:

changekiller Merge branch 'master' into TICKET-XXX_description

This was also a merge of two branches. When I tried to repeat it locally, there was a conflict in the relevant file, Page.php. Judging by the state of the repository, the developer had kept their own version of the file, discarding the changes from master (and those were the ones which were lost). A long time had passed and the developer could not remember what had actually happened, but in practice the situation could be reproduced with a simple sequence:

git checkout -b test changekiller^1
git merge -s ours changekiller^2
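To see what the ours strategy does to the other side's changes, here is a minimal sketch on a throwaway repository. The branch name task and the file contents are assumptions made for the example; -s ours stands in for whatever the developer actually did while resolving the conflict.

```shell
set -eu
# Throwaway repository; contents are invented for illustration.
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example"
git config user.email "example@example.com"

printf 'original\n' > Page.php
git add Page.php
git commit -q -m "init"

git checkout -q -b task
printf 'task change\n' >> Page.php
git commit -q -am "task: edit Page.php"

git checkout -q master
printf 'function canBeEdited() {}\n' >> Page.php
git commit -q -am "master: add canBeEdited"

git checkout -q task
# The "ours" strategy records master as a second parent but keeps the task tree as it is.
git merge -q -s ours --no-edit master

matches=$(grep -c canBeEdited Page.php || true)
echo "canBeEdited occurrences after the merge: $matches"
```

The merge commit has master among its ancestors, yet the file is byte-for-byte what the task branch had before, so the method from master is gone without any commit that visibly deletes it.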

We still needed to work out how a legitimate sequence of actions could lead to this result. With nothing about this in the documentation, I decided to delve into the source code.

Was Git the murderer?

The documentation stated that the git log command accepts several commits as input parameters and shows the user every commit reachable from them (the commits themselves and their ancestors), apart from those reachable from commits preceded by the ^ symbol. So, git log A ^B is supposed to show the commits which are reachable from A but not from B.
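The exclusion syntax is easy to try out. The sketch below assumes a tiny invented history with a branch called feature; empty commits are used only to keep it short.

```shell
set -eu
# Throwaway repository with empty commits, purely for illustration.
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example"
git config user.email "example@example.com"

git commit -q --allow-empty -m "base"

git checkout -q -b feature
git commit -q --allow-empty -m "feature 1"
git commit -q --allow-empty -m "feature 2"

git checkout -q master
git commit -q --allow-empty -m "master 1"

# Commits reachable from feature, minus everything reachable from master.
git log --format=%s feature ^master

excluded=$(git rev-list --count feature ^master)
dotdot=$(git rev-list --count master..feature)   # shorthand for the same range
echo "feature ^master: $excluded commits, master..feature: $dotdot commits"
```

Only the two feature commits are listed: "base" is reachable from master as well, so it is excluded.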

The command code turned out to be quite complex. There were lots of different optimisations for working with memory, and reading code in C has never been that enjoyable anyway. The main logic can be represented by the following pseudo-code:

// commit is both a type and the name of a variable
commit commit;
rev_info revs;
revs = setup_revisions(revisions_range);
while (commit = get_revision(revs)) {
    log_tree_commit(commit);
}

Here the get_revision function accepts revs, a set of control flags, as an input parameter. Each time it is called, it should return the next commit to process, in the right order (or an empty value once we reach the end). There is also a setup_revisions function, which fills the revs structure, and log_tree_commit, which prints the information to the screen.

I felt I had worked out where to look for the problem. I had passed a specific file (Page.php) to the command, because its changes were the only ones I was interested in. That meant git log had to have some logic for filtering out “superfluous” commits. The setup_revisions and get_revision functions were used in lots of places, so it was very unlikely that they were the problem. That left log_tree_commit.

To my delight, this function did indeed have code which calculated which changes had been made in this or that commit. This is what I thought the shared logic should look like:

void log_tree_commit(commit) {
    if (tree_has_changed(commit, commit->parents)) {
        log_tree_commit_1(commit);
    }
}

But the more I examined the actual code, the more I understood I had been mistaken. This function only output messages. That will teach me to trust my feelings!

I went back to the setup_revisions and get_revision functions. It was difficult to understand the logic of their work because it was obscured by a ‘fog’ of auxiliary functions, some of which were required for working with pointers and memory correctly. It all looked as if the main logic involved a breadth-first search of the commit tree, i.e. a fairly standard algorithm:

rev_info setup_revisions(revisions_range, …) {
    rev_info revs;
    commit commit;

    // seed the list with the commits named on the command line
    while (commit = get_commit_from_range(revisions_range)) {
        revs->commits = commit_list_append(commit, revs->commits);
    }
    return revs;
}
commit get_revision(rev_info revs) {
    commit c;
    commit l;
    c = get_revision_1(revs);
    // queue the parents of the commit we are about to return
    for (l = c->parents; l; l = l->next) {
        commit_list_insert(l, &revs->commits);
    }
    return c;
}
commit get_revision_1(rev_info revs) {
    return pop_commit(revs->commits);
}

A list is generated (revs->commits). The first element (the top one) in the commit tree is added to that list. Commits are then gradually taken from the start of that list, and their parents are added to the end of the list.

As I read the code, I discovered that, in the midst of the ‘fog’ of auxiliary functions, there is a complex logic for filtering commits — just what I had been looking for such a long time. This is what happens in the get_revision_1 function:

commit get_revision_1(rev_info revs) {
    commit commit;
    commit = pop_commit(revs->commits);
    try_to_simplify_commit(revs, commit);
    return commit;
}
void try_to_simplify_commit(rev_info revs, commit commit) {
    for (parent = commit->parents; parent; parent = parent->next) {
        if (rev_compare_tree(revs, parent, commit) == REV_TREE_SAME) {
            // the file is identical in this parent: keep it as the only parent
            parent->next = NULL;
            commit->parents = parent;
            return;
        }
    }
}

In a case where several branches are merged, if the state of the file remains the same as in one of the branches, there is no point considering the other branches. If the state of the file hasn’t changed anywhere, we will only leave the first branch.

Here is an example. Let’s mark commits in which the file has not changed with a “0”, those in which the file has changed with a “1”, and branch merging with an “X”.

In this situation the code will not consider the feature branch — it doesn’t have any changes. If the file there has actually been changed, then the changes have been jettisoned in X, and that means that their story is not very relevant: the code in question is no longer there.
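The comparison behind this rule can be checked by hand for any merge commit. The sketch below builds a small invented history in the spirit of the example (a feature branch that never touches the file, a change on master, then a merge) and compares the file in the merge against each parent. The branch and file names are assumptions, and git diff --quiet is used here only as a stand-in for Git's internal tree comparison.

```shell
set -eu
# Throwaway repository; history is invented for illustration.
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example"
git config user.email "example@example.com"

printf 'v1\n' > Page.php
git add Page.php
git commit -q -m "init"

git checkout -q -b feature
printf 'x\n' > other.txt
git add other.txt
git commit -q -m "feature: unrelated file"      # a "0" commit: Page.php untouched

git checkout -q master
printf 'v2\n' > Page.php
git commit -q -am "master: change Page.php"     # a "1" commit

git merge -q --no-edit feature                  # the merge, "X"

n=0
treesame=""
for parent in $(git rev-list --parents -n 1 HEAD | cut -d' ' -f2-); do
    n=$((n + 1))
    if git diff --quiet "$parent" HEAD -- Page.php; then
        echo "parent $n: Page.php identical to the merge result"
        treesame="$treesame$n"
    else
        echo "parent $n: Page.php differs from the merge result"
    fi
done

shown=$(git rev-list --count HEAD -- Page.php)
echo "commits listed for Page.php: $shown"
```

Only the first parent matches the merge result here, so the file history continues along master and the feature side is never walked.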

Something similar happened in our case as well. Two developers made changes to one file — Page.php. One of them made changes on the master branch, in the deadbeef commit; the other one made changes on the branch relating to their task.

When the second developer sent changes from the master branch to the task branch, a conflict occurred. As the second developer was in the process of resolving this conflict, they simply jettisoned the changes from the master. Time passed, they finished work on the task, and the task branch was sent to the master, thus deleting the changes from the deadbeef commit.

At the same time, the commit remained. However, if you ran git log with the Page.php parameter, you wouldn’t see the deadbeef commit in the output.
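The whole sequence can be replayed in a throwaway repository. This is a sketch under stated assumptions: the branch name task and the file contents are invented, and -s ours stands in for the conflict resolution that discarded master's side.

```shell
set -eu
# Throwaway repository; names and contents are invented for illustration.
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example"
git config user.email "example@example.com"

printf 'line 1\nline 2\n' > Page.php
git add Page.php
git commit -q -m "init"

# Second developer: changes Page.php on the task branch.
git checkout -q -b task
printf 'line 1\ntask change\nline 2\n' > Page.php
git commit -q -am "task: edit Page.php"

# First developer: adds canBeEdited on master (the stand-in for deadbeef).
git checkout -q master
printf 'line 1\nfunction canBeEdited() {}\nline 2\n' > Page.php
git commit -q -am "master: add canBeEdited"
lost=$(git rev-parse HEAD)

# master is merged into task, keeping only the task version of the file.
git checkout -q task
git merge -q -s ours --no-edit master

# The finished task is merged back into master.
git checkout -q master
git merge -q --no-ff --no-edit task

default_hits=$(git rev-list HEAD -- Page.php | grep -c "$lost" || true)
full_hits=$(git rev-list --full-history HEAD -- Page.php | grep -c "$lost" || true)
echo "listed by default: $default_hits"
echo "listed with --full-history: $full_hits"
```

The commit is still an ancestor of master, but the default file history skips it because both merges match the task side for Page.php. With --full-history every parent is walked and the commit shows up again.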

Optimisation is not a good thing

I immediately embarked on a careful study of the rules for submitting patches and bug reports to Git itself. After all, I thought I had identified a serious problem. Just the thought of it: some of the commits would simply disappear from the output, and this was default behaviour! Fortunately, there turned out to be a lot of rules, it was late, and by the following morning my zeal had fizzled out.

I realised that this optimisation really does speed up Git on large repositories such as ours. What’s more, we found it documented in man git-rev-list, and the behaviour is very easy to turn off, for example with the --full-history option.

Incidentally, what was the role of --follow in this story?

In actual fact, there are lots of ways to influence the working of this logic. We found a specific comment, from 13 years earlier, concerning the follow flag in the Git code:

Can’t prune commits with rename following: the paths change.

P.S: Personally speaking, I have been working on the Bumble release engineering team for several years already, and lots of us reckon we know all about Git.

(Original source: xkcd.com/1597)

In this connection, we need to get to the bottom of the problems which occur in this system. Some of them, it seems to me, are quite intriguing, such as for example, the one described in this article. Problems are very often solved quickly. There are a lot of things we have encountered already; some are well described in the documentation. This case was an exception.

In actual fact, the documentation did have a section entitled “History Simplification”, but I missed it.

And, for those who have read to the end of this article, I have a little bonus for you to take home. I have a little repository where the problem in question is replicated:

$ git clone https://github.com/Md-Cake/lost-changes.git
Cloning into 'lost-changes'...
$ git log --oneline test.php
edfd6a4 master: print 3 between 1 and 2
096d4cf init
$ git log --oneline --full-history test.php
afea493 (HEAD -> master, origin/master, origin/HEAD) Merge branch 'changekiller'
57041b8 (origin/changekiller) print 4 between 1 and 2
edfd6a4 master: print 3 between 1 and 2
096d4cf init

Thanks for reading!
