Git surgery to retain history

Mark van Straten
Published in NN Tech · 7 min read · Jan 21, 2021

One of my co-workers at NN Group created a Proof of Concept to transform an old Java codebase into a mavenized setup and split it up into smaller pieces. To do so he started with copy-pasting the original files into a new git repository (🤐) and slowly committing changes to get the maven setup working 🎉 . All the while work was continuing in the original codebase and both had diverged from each other… 🤔 Git surgery to the rescue! 🩺

➡️ This article assumes you are comfortable with day-to-day usage of git, have a working understanding of the CLI and how branching/merging works.

Our situation 😓

The original repository containing multiple applications without maven (simplified):

README.md 
JavaSource
Configuration/application-1
Configuration/application-2
Configuration/application-...n

The JavaSource folder contained generic code which could be retrieved using maven.

The target mavenized repository for application-1 (simplified):

README.md 
pom.xml
mvn/
src/main/configuration/application-1 <-- containing the application-1 Configuration files

Rough steps taken to get it working with maven:

  • The contents of JavaSource were packaged and moved to a maven dependency
  • Application configurations application-{1..n} were split up into separate repositories, with the files copied into them without history
  • Iterate and tweak until a working setup emerged

The plan 🩹🥳

Ideally, you would have changes be based on a branch of your original codebase so you can merge them as a new feature. Since that was not the case we needed to bring in some serious git tools to rewrite history to our liking.

  • We needed to remove the files inside /src/main/configuration/application-1 that were copy-pasted into the mavenized setup, to prevent conflicts and confusion
  • We needed to extract only the history and changes of the application-1 files from our source repository
  • We needed to merge the mavenized changeset over the original files so they become complete again

These steps will result in retaining the full file history of all application-1 files. The downside is that the commits where the mavenizing setup was created are not atomic - the source files are absent. This was taken as an acceptable tradeoff.

Put on your scrubs 👩‍⚕️

First things first: make two local checkouts in a working directory so we do not break anything unrelated. Because git is distributed, we can do all of the following operations on a local copy on our own machine and push the result back to our version control system (GitHub, GitLab, ...) only when we are satisfied.

mkdir git-surgery && cd git-surgery 
git clone git@github.com:crunchie84/blogpost-git-surgery-source.git source
cd source

Let's check our git history:

Extract our application from the source repository

Okay, we need to clean this up so we only have the application-1 part that we need. To do so we can use a very handy git command, git subtree split, which allows us to extract a folder out of our repository and place only those changes in a separate branch:

git subtree split -P Configuration/application-1 -b rewritten-history-application-1

After executing this command the history looks as follows:

The newly created branch rewritten-history-application-1 only contains commits (or the part of a commit) which involved the folder Configuration/application-1. It is noteworthy to observe that the new branch does not share a common ancestor with the original master branch because it is a total rewrite of history.
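To see this behaviour in isolation, here is a minimal sketch on a throwaway repository. The folder layout mirrors the article, but the file name settings.properties, the commit messages and the demo committer identity are made up for illustration.

```shell
# Throwaway demo of `git subtree split`. File names, commit messages and the
# committer identity are invented; only the folder layout follows the article.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

demo_dir=$(mktemp -d)
git init -q -b master "$demo_dir/source"
cd "$demo_dir/source"

mkdir -p Configuration/application-1 Configuration/application-2

# Commit 1 touches only application-1.
echo "a1" > Configuration/application-1/settings.properties
git add . && git commit -q -m "add application-1"

# Commit 2 touches only application-2, so the split should skip it.
echo "a2" > Configuration/application-2/settings.properties
git add . && git commit -q -m "add application-2"

# Commit 3 touches both folders; only its application-1 part should survive.
echo "a1 v2" > Configuration/application-1/settings.properties
echo "a2 v2" > Configuration/application-2/settings.properties
git add . && git commit -q -m "update both applications"

git subtree split -P Configuration/application-1 -b rewritten-history-application-1 > /dev/null

split_commits=$(git rev-list --count rewritten-history-application-1)
split_files=$(git ls-tree -r --name-only rewritten-history-application-1)

# `git merge-base` exits non-zero when two branches have no common ancestor.
if git merge-base master rewritten-history-application-1 > /dev/null; then
  shared_ancestor=yes
else
  shared_ancestor=no
fi

echo "commits on master:       $(git rev-list --count master)"
echo "commits on split branch: $split_commits"
echo "files on split branch:   $split_files"
echo "shared ancestor:         $shared_ancestor"
```

On this toy history the split branch should end up with two of the three commits (the application-2-only commit is dropped), a single settings.properties at the top level, and no ancestor shared with master.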

Prepare our new repository

We are going to take the rewritten history of our source repository and use this as the master branch of our 'new' repository (which we are going to mavenize).

# back to the root directory `/git-surgery` 
cd ..
mkdir mavenized-solution
cd mavenized-solution
git init -b master
git pull ../source rewritten-history-application-1

We have now pulled the branch rewritten-history-application-1 of the local, filesystem-based git repository source into our (empty) master branch. The result is a clean history without any dangling commits.
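As a small illustration of why the history stays clean: pulling into a repository without any commits simply makes master point at the fetched commits, so no merge commit is created. The sketch below uses a stand-in source repository; its file name and commit messages are invented.

```shell
# Throwaway demo: pull one branch of a local repository into an empty one.
# The file name, commit messages and committer identity are invented.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

demo_dir=$(mktemp -d)

# Stand-in for the source repository with an already extracted branch.
git init -q -b rewritten-history-application-1 "$demo_dir/source"
cd "$demo_dir/source"
echo "v1" > settings.properties
git add . && git commit -q -m "first change"
echo "v2" > settings.properties
git add . && git commit -q -m "second change"
source_head=$(git rev-parse rewritten-history-application-1)

# The new, still empty repository.
git init -q -b master "$demo_dir/mavenized-solution"
cd "$demo_dir/mavenized-solution"
git pull -q ../source rewritten-history-application-1

pulled_head=$(git rev-parse master)
pulled_commits=$(git rev-list --count master)
echo "source branch head: $source_head"
echo "new master head:    $pulled_head"
echo "commits on master:  $pulled_commits"
```

Both heads should print the same commit hash, which shows that the commits were taken over unchanged.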

A side effect of subtree split is that all the files in the extracted folder are placed top level in your commits which we are going to address in a bit:

As a final preparation, we are going to move the files as if they have always lived in src/main/configuration/application-1 to make the history a bit more readable. To do so we can use git filter-branch to rewrite where the files have been all their lives:

# we are in the /git-surgery/mavenized-solution folder 
git filter-branch --force --prune-empty --tree-filter '
  dir="src/main/configuration/application-1"
  if [ ! -e "${dir}" ]
  then
    mkdir -p "${dir}"
    git ls-tree --name-only $GIT_COMMIT | xargs -I files mv files "${dir}"
  fi'

This command will iterate over the commits and execute the passed script for every one of those. Files will be moved to the correct place before the commit is rewritten.
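The effect of the tree filter can be tried out safely on a throwaway repository first. The sketch below applies the same filter to two invented commits and then checks that no path outside the target folder is left in any rewritten commit. The file names are made up, and FILTER_BRANCH_SQUELCH_WARNING is set only to skip the warning pause in this demo.

```shell
# Throwaway demo of the tree filter. File names, commit messages and the
# committer identity are invented for illustration.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
export FILTER_BRANCH_SQUELCH_WARNING=1   # skip the warning and its delay

target_dir="src/main/configuration/application-1"
demo_dir=$(mktemp -d)
git init -q -b master "$demo_dir/mavenized-solution"
cd "$demo_dir/mavenized-solution"

# Two commits with files at the top level, as left behind by subtree split.
echo "v1" > settings.properties
git add . && git commit -q -m "first change"
echo "v2" > settings.properties
echo "<extra/>" > extra.xml
git add . && git commit -q -m "second change"

# Same filter as in the article: move every top-level entry into the folder.
git filter-branch --force --prune-empty --tree-filter '
  dir="src/main/configuration/application-1"
  if [ ! -e "${dir}" ]
  then
    mkdir -p "${dir}"
    git ls-tree --name-only $GIT_COMMIT | xargs -I files mv files "${dir}"
  fi' > /dev/null 2>&1

# Collect any path, in any rewritten commit, that is not under the folder.
stray=""
for commit in $(git rev-list master); do
  stray="$stray$(git ls-tree -r --name-only "$commit" | grep -v "^$target_dir/" || true)"
done

rewritten_commits=$(git rev-list --count master)
file_commits=$(git log --oneline -- "$target_dir/settings.properties" | wc -l | tr -d ' ')
echo "stray paths:                         ${stray:-none}"
echo "commits after rewrite:               $rewritten_commits"
echo "commits touching settings.properties: $file_commits"
```

If the filter worked, there are no stray paths and the file's history can be followed at its new location in both commits.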

⚠️ You will get a big warning about the side effects and possible gotchas with filter-branch but for our task, it will suffice:

Now onwards to prepare our mavenized-poc to merge onto our prepared repository!

Cleaning up the mavenized PoC repository

We are going to clone the repository and remove the copy-pasted files of application-1 from the commit history.

cd .. # back to the git-surgery root folder 
git clone git@github.com:crunchie84/blogpost-git-surgery-poc-target.git mavenized-poc

Original history:

We are going to use the community tool git filter-repo, which you can install using brew install git-filter-repo (on macOS), to express what we want in a simpler syntax:

cd mavenized-poc 
git filter-repo --path src/main/configuration/application-1 --invert-paths

We are filtering out all (parts of) commits that have anything to do with the folder containing the copied application-1 source files. The result is a clean history of only the steps to get the mavenized setup working:

The only thing left to do is to apply the maven setup to our extracted application-1.
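Before moving on, one way to double-check that the copied files are really gone is to count the commits that still touch the folder. The helper commits_touching below is hypothetical (it is not part of git or filter-repo), and the demo repository with its file names is invented. In the article's flow you would run it inside mavenized-poc after the filter-repo step and expect 0.

```shell
# Hypothetical helper (not part of git or filter-repo): count the commits on
# any ref of a repository that touch a given path. 0 means the path no longer
# appears anywhere in the history.
commits_touching() {
  git -C "$1" log --all --oneline -- "$2" | wc -l | tr -d ' '
}

# Throwaway demo repository; file names and commit messages are invented.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

demo_dir=$(mktemp -d)
git init -q -b master "$demo_dir/mavenized-poc"
cd "$demo_dir/mavenized-poc"

echo "<project/>" > pom.xml
git add . && git commit -q -m "add pom.xml"

mkdir -p src/main/configuration/application-1
echo "a1" > src/main/configuration/application-1/settings.properties
git add . && git commit -q -m "copy application-1 files"

# A folder that is still in the history versus one that never was.
copied=$(commits_touching . src/main/configuration/application-1)
absent=$(commits_touching . Configuration/application-2)
echo "commits touching the copied folder: $copied"
echo "commits touching an absent folder:  $absent"
```

The demo repository has not been filtered, so the copied folder still shows up in its history, while a path that never existed reports 0.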

Merging our cleaned up mavenized setup onto our application repository

We are going to re-use the git pull trick we have used before to get commits from a different repository into ours. But with a twist:

# working dir = /git-surgery/mavenized-poc 
git checkout -b mavenizing-application
cd ../mavenized-solution
git pull ../mavenized-poc mavenizing-application --allow-unrelated-histories

First, we make a branch in our clean mavenized setup repository because the name will end up in our commit history and this is an important piece of information. The true magic resides in --allow-unrelated-histories. Given that git pull is a shorthand for git fetch followed by git merge, this option allows us to merge unrelated histories. Normally git will always look for a common ancestor when merging, but that does not mean it cannot merge without one!
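The difference the flag makes can be seen on two throwaway repositories that share no commits. In this sketch the file names and commit messages are invented, --no-edit is added only to skip the commit message editor, and --no-rebase is added because newer git versions may otherwise stop and ask how to reconcile divergent branches when no pull strategy is configured.

```shell
# Throwaway demo of merging unrelated histories. File names, commit messages
# and the committer identity are invented for illustration.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

demo_dir=$(mktemp -d)

# Stand-in for the extracted application history.
git init -q -b master "$demo_dir/mavenized-solution"
cd "$demo_dir/mavenized-solution"
mkdir -p src/main/configuration/application-1
echo "a1" > src/main/configuration/application-1/settings.properties
git add . && git commit -q -m "application-1 history"

# Stand-in for the cleaned-up PoC, with no commits in common.
git init -q -b mavenizing-application "$demo_dir/mavenized-poc"
cd "$demo_dir/mavenized-poc"
echo "<project/>" > pom.xml
git add . && git commit -q -m "mavenize"

cd "$demo_dir/mavenized-solution"

# Without the flag git should refuse, because there is no common ancestor.
if git pull -q --no-rebase --no-edit ../mavenized-poc mavenizing-application 2> /dev/null; then
  refused=no
else
  refused=yes
fi

# With the flag the two histories are joined by a merge commit.
git pull -q --no-rebase --no-edit --allow-unrelated-histories ../mavenized-poc mavenizing-application

merge_parents=$(git cat-file -p HEAD | grep -c '^parent ')
total_commits=$(git rev-list --count HEAD)
echo "pull without flag refused: $refused"
echo "parents of merge commit:   $merge_parents"
echo "commits reachable:         $total_commits"
```

The resulting merge commit has two parents, one from each history, and the working tree contains both pom.xml and the application-1 files.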

After invoking the git pull you should be presented with a git merge commit message dialog in which you can add as much extra information as you deem relevant.

Now we can observe in our git commit history what happened:

Two unrelated histories have been merged together retaining the commit history of both. All that is left is to test it out locally. When satisfied you can now (force) push it to your origin as the new repository 🎉

Surgery success, patient dismissed! 🚀

Parting thoughts

  • It would have been easier if the proof of concept had been directly created as a branch on the original repository. When we were at this point that was already water under the bridge 😬
  • Because of the copy-pasted files it proved to be very difficult to re-attach the new repository directly as a branch onto the original repository
  • Taking the original application repository, creating a branch and then copy-pasting the mavenized proof-of-concept codebase over it would also be an (easy) option but then we would lose the history of those changes ⚖️… We opted for a solution that tried to retain both histories

References

Git repositories used as an example in this blogpost

The complete script explained in this article:

Originally published at https://markswanderingthoughts.nl on January 21, 2021.
