Where is “git split”?

 The unexpected agony of splitting a repository


TL;DR: go to the Conclusion, below.

How difficult should “splitting” a repo be? I’m sure the Gitanians are thinking: “It isn’t difficult at all, this guy simply doesn’t know what he’s doing,” and there’s probably some truth to that. As some have put it, Git is one of the most used, yet one of the most misunderstood, version control systems. But that doesn’t excuse an interface that compares poorly with other tools, where the same workflow is significantly simpler and less error-prone.

Some months ago I started a project and created a single Git repository that would contain everything, with the idea that as the project grew I could split sub-projects off into their own repositories as needed. The need for the first split came yesterday because one of the areas, the website, needed a place of its own for deployment purposes.

I started at around 6:00 pm. A few minutes later Google had given me numerous answers, all of which showed a fairly straightforward (yet not ideal) way of moving the directory to its own repository. It seemed I would be done with the task within five minutes. Five minutes became five hours.

Take one

I created a new empty remote repository, made a fresh local clone of the repo I wanted to split, and finally I ran the following command as suggested by Bitbucket’s post “Split a repository in two”:

git filter-branch --index-filter 'git rm --cached -r lildir lildir2' -- --all

where lildir and lildir2 are the directories that would get erased from the history of the repository.

The idea behind this approach is that after removing all of the unwanted directories you are left with a repository that only contains the desired commits and history. Then it’s just a matter of unlinking from the old repository and pushing it to the newly created one. Finally, in the old repository delete the directories that remain in the new one (using the same command above). A bit convoluted and feels like a cheap workaround, but it should be fairly easy, right? This is what I got:

fatal: pathspec ‘’ file did not match any files

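Some more googling solved that. By default git rm fails on any commit where one of the given paths does not exist, which aborts the whole rewrite. Adding --ignore-unmatch (suggested in the filter-branch command’s man page) makes git rm succeed in that case, so I tried: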

git filter-branch --index-filter 'git rm --cached --ignore-unmatch -r lildir lildir2' -- --all

That appeared to be working. It took several minutes to complete, but once it did, the contents of the directories I had specified were gone. I ran it again, this time removing all of the other directories I didn’t want, and waited patiently, again. Another error:

Cannot create new backup. A previous backup already exists in refs/original/

This time the solution was easy: adding -f forces filter-branch to overwrite the backup it saved under refs/original/ during the previous run.

Before pushing to the remote repo I checked the log of the repository I had just created, and to my surprise, every single commit was still showing up in there. Apparently the command doesn’t get rid of empty commits, i.e. those that only changed files removed by the command. Adding --prune-empty did the trick. Empty commits were gone, but I was still seeing a lot of stuff that should have been gone long ago: orphaned branches of files that I had removed from the repository months back. Maybe the repository needs pruning?

git reflog expire --expire=now --all
git gc --aggressive --prune=now

Not much changed. The working copy looked correct, but the log showed thousands of commits and the repository was still over a hundred MB when it should have been about one MB. It was time to start from the beginning.
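
One way to see what is still holding on to the old history is to list every ref together with the number of commits reachable from it. This is only a minimal sketch, assuming a POSIX shell; list_ref_commit_counts is a made-up helper name, not a Git command:

```shell
# Hypothetical helper: print "<commit count> <refname>" for every ref
# (branches, tags, remote-tracking refs, refs/original/ backups, ...).
list_ref_commit_counts() {
    git for-each-ref --format='%(refname)' |
    while read -r ref; do
        # rev-list --count prints how many commits are reachable from the ref
        printf '%s %s\n' "$(git rev-list --count "$ref")" "$ref"
    done
}
```

Piping the output through sort -rn puts the refs that keep the most history alive at the top, and git count-objects -vH reports how much space the repository takes on disk.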

Second attempt

I deleted the test repository, cloned again and tried a slightly different approach. I only needed one directory to become its own repository, so the following command would do the trick:

git filter-branch --subdirectory-filter lildir -- --all

Same thing. The working copy looked as it should, but the log told a completely different story. The previous command had printed some warnings hinting that some tags were not getting updated. (I’m sure there’s a good reason for this behavior, but wouldn’t it be more intuitive to update relevant tags when rewriting history?) So I added --tag-name-filter cat:

git filter-branch --tag-name-filter cat --subdirectory-filter lildir -- --all

After many permutations of the above commands I figured out that the tags had something to do with what I was seeing. Each tag I deleted brought the log closer to containing only the commits of the working copy. Now the problem was that I had to delete many tags. There must be a way to delete several or all of them with one command, right? Well, just as with the simple task of splitting a repository, I didn’t find an easy way of doing it (tag names cannot be given as wildcards when deleting).
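
For what it is worth, git tag -d takes no wildcards, but git tag -l does accept a glob pattern, so the two can be combined. The following is a minimal sketch under that assumption; delete_local_tags is a hypothetical helper name, and it only touches local tags, not the ones on the remote:

```shell
# Hypothetical helper: delete every local tag whose name matches a glob.
# Usage: delete_local_tags 'v1.*'
delete_local_tags() {
    # git tag -l prints one matching tag name per line
    git tag -l "$1" |
    while read -r tag; do
        git tag -d "$tag"
    done
}
```

Quoting the pattern keeps the shell from expanding it against file names before Git sees it. Tags on the remote would still have to be removed separately, for example with git push --delete origin followed by the tag names.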


Conclusion

To actually split the repository I had to clone again, delete the tags one by one, run the filter-branch command, and clean up (prune) behind me. Here are the steps that worked for me, in case someone runs into a similar problem:

# Get a local copy of the repository to be split
git clone gitRepoAddress localDir
# Move into the new local repository
cd localDir
# Remove tags that do not belong to the new repository (repeat for each tag).
# This removes them from the remote, which I needed because they belonged to
# code that had been deleted, yet they somehow made it back into the remote.
git push --delete origin tagname
# Also delete the local copy of the tag, since the clone above already fetched it
git tag -d tagname
# Detach from the old remote repository
git remote rm origin
# Get rid of all the history except for that related to lildir
git filter-branch -f --tag-name-filter cat --prune-empty --subdirectory-filter lildir -- --all
# Cleanup: remove the backup refs that filter-branch leaves in refs/original/
git for-each-ref --format="%(refname)" refs/original/ | xargs -n 1 git update-ref -d
# Cleanup: expire all reflog entries so the old commits become unreachable
git reflog expire --expire=now --all
# Cleanup: garbage-collect the now-unreachable objects
git gc --prune=now
# Link to a newly created repository
git remote add origin gitNewRepoAddress
# Push to the new repository
git push origin master

Wouldn’t it be nice to simply have something like:

git split srcDir newRepoPath

The command would make a new repository at newRepoPath with the contents and history of srcDir, and would remove all related info from the source repository.
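
Until something like that exists, the first half of the wish can be approximated with a small shell function. This is only a sketch built from the filter-branch steps above, not an actual Git feature: git_split is a hypothetical name, it must be run from inside the source repository, and it does not remove anything from the source repository.

```shell
# Hypothetical helper, not a Git command.
# Usage (from inside the source repository): git_split srcDir newRepoPath
git_split() (
    dir=$1
    dest=$2
    src=$(git rev-parse --show-toplevel) || exit 1
    # Work on a fresh clone so the source repository is never rewritten
    git clone -q "$src" "$dest" || exit 1
    cd "$dest" || exit 1
    git remote rm origin
    # Newer Git versions print a deprecation notice and pause; silence it
    export FILTER_BRANCH_SQUELCH_WARNING=1
    # Keep only the history of $dir and make it the root of the new repository
    git filter-branch -f --tag-name-filter cat --prune-empty \
        --subdirectory-filter "$dir" -- --all || exit 1
    # Remove the refs/original/ backups, expire reflogs, and repack
    git for-each-ref --format='%(refname)' refs/original/ |
    while read -r ref; do
        git update-ref -d "$ref"
    done
    git reflog expire --expire=now --all
    git gc -q --prune=now
)
```

The function body runs in a subshell, so the caller’s working directory and variables are left alone. Tags that point at unrelated history would still need to be deleted from the clone before the rewrite, as described in the steps above.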


I hope to make this the beginning of a conversation that will show me, and the many other people running into these issues, how easy it is to use Git, or perhaps that there is a need to revisit Git’s interface (and no, I don’t think Mercurial is much better, although splitting a repository is much simpler).