Want to remove large files/blobs from Git history permanently?

Prankul Garg
May 2, 2018

In our engineering lives we run into plenty of problems, but we only pay attention to them once we get stuck.

My project’s code repository was 1GB, and at times it grew to 2GB or more, so we had to keep asking the Bitbucket support team to run gc (garbage collection) on it. But gc was not a permanent solution.

Ideally, a repository should be somewhere between 100MB and 300MB. To give you some examples: Git itself is 222MB, Mercurial itself is 64MB, and Apache is 225MB.

Why was my repo 1GB?

I was curious to find out. I discovered that the repo contained around 4000 branches, of which 2000 were fully merged, so I tried deleting all the fully merged branches. That reduced the size by just 30MB, which makes sense: branches are nothing but refs.
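To see which branches count as fully merged before deleting anything, something like the following could be used. It is only a sketch: the function name `merged_branches`, the remote name `origin` and the base branch `origin/master` are assumptions for illustration, not my exact setup.

```shell
# Print every ref under a prefix whose tip is already contained in the base branch.
# Assumption: the base branch (e.g. origin/master) and the prefix are placeholders.
merged_branches() {
  base="$1"
  prefix="${2:-refs/remotes/origin}"
  # Drop the symbolic HEAD ref and the base branch itself from the list.
  git for-each-ref --merged="$base" --format='%(refname)' "$prefix" \
    | grep -v -e '/HEAD$' -e "/${base##*/}\$"
}

# Example (review the list before deleting anything):
#   merged_branches origin/master | wc -l
#   git push origin --delete <branch-name>
```

Deleting a branch only removes a ref; the commits it pointed to are still reachable from the branch it was merged into, which is why the size barely moves.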

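So I decided to list all the files in my repo along with their sizes. I ran the command below to get the list, sorted by size in decreasing order. (gcut and gnumfmt are the names Homebrew gives to GNU cut and numfmt on macOS; on Linux, use cut and numfmt.)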

$ git rev-list --objects --all \
| git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' \
| awk '/^blob/ {print substr($0,6)}' \
| sort -r --numeric-sort --key=2 \
| gcut --complement --characters=13-40 \
| gnumfmt --field=2 --to=iec-i --suffix=B --padding=7 --round=nearest

After analysing the list, I found that the repo contained some unwanted files. Ideally such file extensions should be added to .gitignore so they never get tracked; binary and profiling files are the most common offenders. But what if you only realise this after the mistake has been committed?
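For the prevention side, here is a small sketch of adding a pattern to .gitignore without creating duplicate entries. The helper name `ignore_pattern` is made up for this example, and `*.hprof` is simply the extension that caused trouble in my case.

```shell
# Append a pattern to an ignore file only if that exact line is not already there.
# Assumption: the ignore file defaults to .gitignore in the current directory.
ignore_pattern() {
  pattern="$1"
  file="${2:-.gitignore}"
  touch "$file"
  grep -qxF -- "$pattern" "$file" || printf '%s\n' "$pattern" >> "$file"
}

# Example:
#   ignore_pattern '*.hprof'
```

Keep in mind that .gitignore only affects untracked files. A file that is already tracked has to be removed from the index (for example with git rm --cached) before the ignore rule has any effect on it.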

In my case, some folks had committed profiling files (*.hprof) totalling around 700MB, and someone later deleted them (all of this happened 2 years ago). But deleting such files from a particular branch, or even from the main (master) branch, does not help, because Git still keeps them in the repository's history.
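One way to confirm that deleted files still live in history is to ask Git for every path matching a pattern that was ever added on any ref. This is a rough sketch; the function name `paths_ever_added` is hypothetical.

```shell
# List every path matching a pattern that was added in any commit reachable
# from any ref, even if the file has since been deleted from the working tree.
paths_ever_added() {
  git log --all --diff-filter=A --name-only --format= -- "$1" \
    | sed '/^$/d' | sort -u
}

# Example:
#   paths_ever_added '*.hprof'
```

If this prints paths that no longer exist in your checkout, their contents are still stored in the repository and still count towards its size.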

Then I realised we needed to clean the repo. I explored a few approaches and picked BFG Repo-Cleaner, which is much faster than git-filter-branch. BFG is a faster, simpler alternative for removing big files, passwords, credentials and other private data. It rewrites the commit history: the hashes of the affected commits and of every commit after them change, and branch and tag refs are updated to point at the new commits, while commit messages and branch names stay the same. So notify all team members before doing this.

Process to be followed before Cleanup:

1. Stop all activity on the repo during the cleanup.

2. No one should have any uncommitted code in a local working copy, stash or IDE shelf.

3. No pull requests should be open.

4. If someone has branches in a fork that still need to be merged into the original repo, push those branches to the original repo first, because the existing fork cannot be merged into the cleaned repo.

5. Remove everyone's access to the original repo, so that no one can push during the cleanup.
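For point 2, each team member could run a quick check like this in their own clone. It is a minimal sketch with a made-up function name (`has_local_work`), and it only covers what Git itself knows about: IDE shelves are stored outside Git and have to be checked by hand.

```shell
# Report anything in the current clone that would be lost when the clone is discarded.
# Returns 0 (and prints details) when local work exists, 1 when nothing was found.
has_local_work() {
  found=1
  if [ -n "$(git status --porcelain)" ]; then
    echo "uncommitted or untracked changes:"
    git status --short
    found=0
  fi
  if [ -n "$(git stash list)" ]; then
    echo "stashed changes:"
    git stash list
    found=0
  fi
  return "$found"
}

# Example:
#   has_local_work && echo "commit and push this before the cleanup starts"
```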

I would recommend forking the original repo and performing the cleanup in the fork. If all goes well, remove the original repo and give all team members access to the new fork. Because the fork has a new remote URL, each team member is forced to clone from it, which prevents mixing of the old and new repos. Also, make sure no one simply changes the remote URL in an old local clone instead of taking a fresh clone.

Cleanup Steps:

First clone a fresh copy of your repo, using the --mirror flag:

$ git clone --mirror git://example.com/some-big-repo.git

This is a bare repo, which means your normal files won’t be visible, but it is a full copy of the Git database of your repository.

Now you can run the BFG to clean your repository up:

To delete all files larger than 10MB:

$ java -jar bfg.jar --strip-blobs-bigger-than 10M some-big-repo.git

To delete all files named ‘id_rsa’ or ‘id_dsa’ (the pattern is quoted so that the shell passes it to the BFG unexpanded):

$ java -jar bfg.jar --delete-files 'id_{dsa,rsa}' some-big-repo.git

For further command line options, you can run the BFG without any arguments.

The BFG will update your commits and all branches and tags so they are clean, but it doesn’t physically delete the unwanted stuff. Examine the repo to make sure your history has been updated, and then use the standard git gc command to strip out the unwanted dirty data, which Git will now recognise as surplus to requirements:

$ cd some-big-repo.git
$ git reflog expire --expire=now --all && git gc --prune=now --aggressive
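Before pushing, it can be reassuring to check whether any large blobs are still reachable. The sketch below reuses the idea from the listing command earlier in this post; the function names `blobs_at_least` and `large_blobs_in_history` are my own placeholders, and the threshold is given in bytes.

```shell
# Filter `git cat-file --batch-check` output (one "type size path" line per object)
# down to blobs of at least min_bytes, printed as "size path", largest first.
blobs_at_least() {
  awk -v min="$1" '$1 == "blob" && $2 >= min {
    print $2, substr($0, length($1) + length($2) + 3)
  }' | sort -rn
}

# Walk everything reachable from any ref and report the remaining large blobs.
large_blobs_in_history() {
  git rev-list --objects --all \
    | git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' \
    | blobs_at_least "$1"
}

# Example, run inside some-big-repo.git (10485760 bytes = 10 MiB):
#   large_blobs_in_history 10485760
#   git count-objects -vH    # compare "size-pack" before and after the cleanup
```

If something still shows up, check whether it is part of your latest commit: by default the BFG leaves the contents of the current HEAD commit untouched, so a big file that is still checked in there will survive.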

Finally, once you’re happy with the updated state of your repo, push it back up (note that because your clone command used the --mirror flag, this push will update all refs on your remote server):

$ git push

At this point, you’re ready for everyone to ditch their old copies of the repo and do fresh clones of the nice, new pristine data. It’s best to delete all old clones, as they’ll have dirty history that you don’t want to risk pushing back into your newly cleaned repo.

For more detail, or to download the BFG jar, visit the BFG Repo-Cleaner project page.

Moreover, you can install a pre-receive hook (a server-side hook) on the repository that checks the size of each file before a push is accepted on Bitbucket Cloud, printing a warning if a file is 5+ MB and rejecting the push if it is 10+ MB.
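To give an idea of what such a hook might look like, here is a rough sketch. It assumes a Git server where you can install hook scripts yourself; hosted services may only offer their own settings or plugins for this. The function names and messages are invented for the example, and only the 5MB and 10MB thresholds come from what I described above.

```shell
#!/bin/sh
# Sketch of a pre-receive hook: warn at 5 MiB, reject at 10 MiB.
WARN_BYTES=$((5 * 1024 * 1024))
FAIL_BYTES=$((10 * 1024 * 1024))
ZERO=0000000000000000000000000000000000000000   # "no commit" marker in hook input (SHA-1 repos)

# Map a blob size in bytes to ok / warn / reject.
classify_size() {
  if [ "$1" -ge "$FAIL_BYTES" ]; then echo reject
  elif [ "$1" -ge "$WARN_BYTES" ]; then echo warn
  else echo ok
  fi
}

# Read "<old> <new> <ref>" lines from stdin, as Git feeds them to a pre-receive hook.
check_push() {
  status=0
  while read -r old new ref; do
    [ "$new" = "$ZERO" ] && continue       # ref deletion: nothing to scan
    if [ "$old" = "$ZERO" ]; then
      range="$new --not --all"             # new ref: only objects the server does not have yet
    else
      range="$old..$new"
    fi
    # $range is intentionally unquoted so that it splits into separate arguments.
    objects=$(git rev-list --objects $range \
      | git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)')
    while read -r type size path; do
      [ "$type" = blob ] || continue
      case $(classify_size "$size") in
        reject) echo "ERROR: $path ($size bytes) pushed to $ref is 10+ MB" >&2; status=1 ;;
        warn)   echo "WARNING: $path ($size bytes) pushed to $ref is 5+ MB" >&2 ;;
      esac
    done <<EOF
$objects
EOF
  done
  return "$status"
}

# A real hook script would end by calling: check_push
```

A non-zero exit status from a pre-receive hook makes Git refuse the whole push, while anything written to stderr is shown to the person pushing, which is how the warning case gets through.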

Process to be followed after cleanup:

1. Remove your local code repository/folder from your system.

2. Take a fresh clone of the new repo.

3. Delete any existing forks from your account and take a fresh fork.

4. If you couldn’t push your local changes to the old repo earlier, replicate them manually in the new repo.

5. If you created a pull request in the old repo that couldn’t be merged, create a new pull request in the new repo. But take care not to mix old and new repo branches.

6. Add unwanted file extensions to .gitignore, so they can’t get tracked in future.

7. Give the team access to the new repo.

If someone mixes the old and new repos, you will have to do the whole exercise again! So please pay attention here: delete the previous clone and take a fresh one or, if you work from a fork, remove the fork and fork the new repo again.
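A simple way to spot a mixed-up clone is to look for a commit that should no longer exist. This sketch assumes you noted down a commit hash from the old history before the cleanup, for example the old tip of master, which gets a new hash if anything in its history was rewritten. The function name `contains_commit` is hypothetical.

```shell
# Succeed if the given commit exists in the object database of the current clone.
contains_commit() {
  git cat-file -e "$1^{commit}" 2>/dev/null
}

# Example, run inside a clone you are unsure about:
#   contains_commit <old-commit-hash> && echo "this clone still contains old history"
```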

My results before and after cleanup:

Still, it’s not the ideal size, but it is much better than before.

Hope it helps someone!
