Migrating to Git LFS for Developing Deep Learning Applications with Large Files
Machine learning and deep learning projects involve big data and huge files, such as images, audio, or video. Keeping such files in a coding project quickly runs into the limitations of version control. Fortunately, Git Large File Storage (Git LFS) is a simple and effective way to manage huge files more efficiently within Git. And it can be adopted even if your commit history is already bloated with big and relatively big files. That’s what we will address in this post.
Basically, to migrate to Git LFS, the Git history needs to be rewritten so that the huge files are removed and replaced with tiny “file pointers”. These pointers refer to the real huge files, which are stored elsewhere and are thus processed differently, and faster, than Git would process them. Once Git LFS is installed on your workstation, this is seamless: the huge files are still there locally in your working directory, only the small pointers live in the remote Git repository, and the real contents are kept in a separate Git LFS storage area, often called the “LFS Store”. That way, the size of the repository stays small, and Git operations such as git status, push and pull should be fast once the huge files have been moved to Git LFS.
Git LFS works like a charm when things are done in a very precise way. However, in practice, it currently has many undocumented pitfalls, which make it hard to set up at first. In this guide, I’ll explain how to set up Git LFS correctly while making you aware of the pitfalls that could arise. Git LFS might be a bit limited for public, open-source repositories; however, it is a very interesting tool for private Git repositories.
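To make the idea of a “file pointer” concrete, here is a minimal sketch that prints the kind of pointer text Git LFS keeps in the Git repository in place of a large file: a version line, a SHA-256 object id and the size in bytes. The function name lfs_pointer_for and the file name in the usage comment are made up for illustration, and the sketch assumes GNU coreutils (sha256sum); Git LFS itself generates these pointers for you, so this is only meant to show what they contain.

```shell
# Print the three-line pointer text that stands in for a large file.
# Assumption: GNU coreutils are available (sha256sum, wc).
lfs_pointer_for() {
    file=$1
    # The object id is the SHA-256 hash of the file's contents.
    oid=$(sha256sum "$file" | cut -d ' ' -f 1)
    # The size is the number of bytes in the file.
    size=$(wc -c < "$file" | tr -d ' ')
    printf 'version https://git-lfs.github.com/spec/v1\noid sha256:%s\nsize %s\n' "$oid" "$size"
}

# Usage (hypothetical file name):
# lfs_pointer_for model_weights.hdf5
```

However big the original file is, the pointer stays at three short lines, which is why the Git repository itself remains small.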
Let’s get started: convert the repository and rewrite history
Entering maintenance mode
First, rewriting the Git history is not a trivial thing. Your devs will temporarily need to stop working on the code while the maintenance is done, and they will be unable to push their changes since the history will have changed.
I will assume that you are running Linux, but I will also point out where things could differ on other operating systems. We’ll work with Bitbucket for the rest of this post.
Avoiding Git LFS’ login problems
Before reducing the size of the repository, make sure to have backups. The best way to get backups that will also be useful for the rest of the instructions is to clone the repository twice: once normally and once as a mirror. However, you might currently authenticate with a username and password when cloning, pushing and pulling code. That will cause trouble later on, so I would advise you to set up Git SSH authentication now.
Otherwise, Git LFS will keep asking for your password for every file it tracks, and you would need to cache your password to work efficiently.
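If an existing clone was made over HTTPS, its remote also has to point to the SSH address once your SSH key is registered with Bitbucket. The helper below is a small sketch of that conversion. The name to_ssh_url is hypothetical, and it assumes the usual https://[user@]host/team/repo.git layout of HTTPS remotes.

```shell
# Convert an HTTPS remote URL into its SSH form.
# Assumption: the URL looks like https://[user@]host/team/repo.git.
# URLs that are already in SSH form are printed unchanged.
to_ssh_url() {
    printf '%s\n' "$1" | sed -E 's#^https://([^@/]+@)?([^/]+)/(.+)$#git@\2:\3#'
}

# Usage inside an existing clone:
# git remote set-url origin "$(to_ssh_url "$(git remote get-url origin)")"
```

After that, git remote -v should list the git@bitbucket.org:... address for both fetch and push.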
Back up your project
Time to get your backups:
git clone --mirror git@bitbucket.org:voobaninternal/the_git_project.git
git clone git@bitbucket.org:voobaninternal/the_git_project.git
Since we’ll use those folders as a working directory for the migration, you should also zip or compress the two folders elsewhere in case you later need to revert to them. Note that the mirror clone’s folder will be named the_git_project.git and that the normal clone gets its folder named the_git_project.
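One possible way to compress the two clones is sketched below. The function name archive_backups and the destination folder in the usage comment are made up for illustration; any archiving tool would do just as well.

```shell
# Archive each given folder into its own timestamped .tar.gz file
# inside a destination folder, which is created if needed.
archive_backups() {
    dest_dir=$1
    shift
    stamp=$(date +%Y%m%d-%H%M%S)
    mkdir -p "$dest_dir"
    for repo in "$@"; do
        name=$(basename "$repo")
        # -C keeps only the folder name, not its full path, inside the archive.
        tar -czf "$dest_dir/$name-$stamp.tar.gz" -C "$(dirname "$repo")" "$name"
    done
}

# Usage (hypothetical destination folder):
# archive_backups ~/git_lfs_backups the_git_project.git the_git_project
```

Keeping the archives outside the migration folder means that a mistake in the working copies cannot touch them.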
Hitting Bitbucket’s hard limit of 2 GB for repositories
Note that if you have not reached such a limit with Bitbucket, or if you use a storage service other than Bitbucket, you’ll probably want to skip this step and go to the next one.
With Bitbucket, if you are searching for how to convert a repository to Git LFS, you have probably already hit a repository size limit, such as the hard limit of 2 GB, like we did. Bitbucket puts the repository in read-only mode once it is too large, and therefore rejects any new commits except deletions that lighten the repository.
The first step before moving to Git LFS is to lift this restriction, which prevents us from pushing any new commits. Even though rewriting the Git history would lighten the repository and move large files to LFS, Bitbucket would refuse those commits simply because they are new, even if they overwrite all the old ones. Reminder: be careful here and ensure you have a backup, because we’ll temporarily burn things down in the remote repository in order to set up Git LFS afterwards. Most likely, it was the very last commit that pushed your repository over the limit and triggered read-only mode. According to the documentation, you should undo that last commit:
cd the_git_project
git reset --hard HEAD~1
git push --force
To sum up, if you skip that step and instead try to push the cleaned-up history that we will produce below in this post, the push would FAIL, despite the fact that it would effectively reduce the size of the repository.
Don’t forget to commit your changes again, using the backups, once you have moved to Git LFS. We just deleted a commit, and we will need to redo it later in the new version of the repository.
Finally: rewriting the Git history with BFG
Let’s rewrite the past. Automatically. Now that you have backups and write access to your repository, it’s time to get serious and to lighten the repository.
Before proceeding, ensure that you have enough space in your LFS Store, that you have backups, and that you have pushed everything. Your space estimate must include both current and past versions of the files, since BFG will convert all of them to LFS. BFG is a cleaning tool that removes huge files from the Git history so that they can be tracked with Git LFS instead. Java >= 7 is required locally to install and run BFG. You’ll also want to install the git-lfs command-line client.
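To get a rough idea of how much LFS space the migration will need, one option is to add up the size of every version of every matching file found in the history. The sketch below does that with plain Git plumbing commands. The function name lfs_space_estimate is made up for illustration, and the result is the uncompressed size in bytes, so treat it as an estimate rather than the exact figure your hosting service will report.

```shell
# Sum the sizes (in bytes) of all blobs in the whole history whose path
# ends with one of the given extensions. Works on normal and mirror clones.
lfs_space_estimate() {
    repo_dir=$1
    shift
    # Build an alternation such as "pkl|jpg|png" from the arguments.
    exts=$(printf '%s|' "$@")
    exts=${exts%|}
    git -C "$repo_dir" rev-list --objects --all |
        git -C "$repo_dir" cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
        awk -v re='[.]('"$exts"')$' \
            '$1 == "blob" && $0 ~ re { total += $2 } END { print total + 0 }'
}

# Usage, from the folder that contains the mirror clone:
# lfs_space_estimate the_git_project.git pkl jpg png npy hdf5 ipynb
```

Each distinct version of a file is counted once, which matches the fact that past versions are converted along with the current ones.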
Ok, so let’s download and compile BFG. Notice that the compiling instructions are in the “BUILD.md” file. To do that, we’ll first need the Scala Build Tool (SBT) to build BFG. There are many ways to install it. Here, I do it for Ubuntu:
echo "deb https://dl.bintray.com/sbt/debian /" | sudo tee -a /etc/apt/sources.list.d/sbt.list
sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 2EE0EA64E40A89B84B2DF73499E82A75642AC823
sudo apt-get update
sudo apt-get install sbt
We are now ready to build BFG by running these commands (note: the bfg/assembly command must be run within the sbt command-line interface, which is why we’ll input those commands one at a time):
git clone https://github.com/rtyley/bfg-repo-cleaner.git
cd bfg-repo-cleaner
sbt
bfg/assembly
It’ll then compile for a short while, and you should see a success message including the path to the .jar file, which is the Java executable produced by the compilation:
[info] Packaging /home/gui/Documents/git_lfs_migration/the_git_project/bfg-repo-cleaner/bfg/target/bfg-1.12.16-SNAPSHOT-master-97ec208.jar ...
[info] Done packaging.
[success] Total time: 22 s, completed Jun 14, 2017 5:14:20 PM
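Since the name of the .jar file contains a version and a commit hash, it will differ from one build to another. Rather than copying the path by hand, you could look it up, as in this small sketch. The function name find_bfg_jar is made up, and it assumes the bfg/target layout shown in the output above.

```shell
# Print the path of the BFG jar found under <checkout>/bfg/target.
# If several jars are present, the last one in sorted order is printed.
find_bfg_jar() {
    find "$1/bfg/target" -maxdepth 1 -name 'bfg-*.jar' | sort | tail -n 1
}

# Usage, from the folder that contains the bfg-repo-cleaner checkout:
# BFG_JAR=$(find_bfg_jar ./bfg-repo-cleaner)
# echo "$BFG_JAR"
```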
We’re now ready to run BFG on our copy of the repository to move all of these file types to Git LFS: pkl, jpg, png, npy, hdf5, ipynb. We need the mirror copy of the repository, which is located in the folder named after the project with the .git extension. We got it when we cloned the repository earlier with the --mirror argument. Therefore, we’ll run the Java BFG conversion command from the parent folder of that the_git_project.git folder. Adjust the path to the .jar so that it matches the one printed by your own build.
cd ..
java -jar /home/gui/Documents/git_lfs_migration/bfg-repo-cleaner/bfg/target/bfg-1.12.16-SNAPSHOT-master-97ec208.jar --convert-to-git-lfs '*.{pkl,jpg,png,npy,hdf5,ipynb}' --no-blob-protection the_git_project.git
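This might take a while to run depending on the size of your project. The --no-blob-protection flag lets BFG also rewrite the files of the latest commit, which it leaves untouched by default. You’ll then see an output with a list of the files that have changed, along with some other useful information: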
In total, 357 object ids were changed. Full details are logged here:
/home/gui/Documents/git_lfs_migration/the_git_project.git.bfg-report/2017-06-14/17-28-22
BFG run is complete! When ready, run: git reflog expire --expire=now --all && git gc --prune=now --aggressive
Let’s then run Git’s garbage collection as indicated, which might take a while, too:
cd the_git_project.git
git reflog expire --expire=now --all && git gc --prune=now --aggressive
Then, we'll need to install the Git LFS pre-push hook. A hook is a command that is executed before or after certain actions, so Git LFS here is added as a hook to the Git commands. We need this hook to push the Git LFS files to the LFS Store rather than to the standard Git repository, which will only contain small file pointers. Let’s init Git LFS in our repository to install its hooks:
git lfs install
You should see the output:
Updated git hooks.
Git LFS initialized.
Now, let’s push our modifications!
git push --force
The mirror we have now is not a working tree. If you simply git clone the repository again, you should be fine to work with it again. However, you’ll probably want to edit the .gitattributes file to make it more explicit once you clone the repository anew. Right now, the content of that file could look like this:
*.{pkl,jpg,png,npy,hdf5,ipynb} filter=lfs diff=lfs merge=lfs -text
I would recommend editing it like this, with one line per LFS-tracked file extension. This makes things more explicit, and it also matters because Git’s attribute patterns do not expand braces:
*.pkl filter=lfs diff=lfs merge=lfs -text
*.jpg filter=lfs diff=lfs merge=lfs -text
*.png filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.hdf5 filter=lfs diff=lfs merge=lfs -text
*.ipynb filter=lfs diff=lfs merge=lfs -text
And then run:
git add .gitattributes
git commit -m "cleaned .gitattributes"
Then, just to make sure that everything works fine, you may want to run all of the following commands; each one should say that the pattern is already tracked:
git lfs install
git lfs track '*.pkl'
git lfs track '*.jpg'
git lfs track '*.png'
git lfs track '*.npy'
git lfs track '*.hdf5'
git lfs track '*.ipynb'
After doing that, the .gitattributes file should not have changed, since we already track all those file types with LFS. By the way, this .gitattributes file is similar to a .gitignore; however, we use it to tell Git which files to track with Git LFS rather than which files to ignore, as a .gitignore classically does. Checking the repository's status should show that nothing changed:
git status
git lfs status
Then push your edits, and voilà! Your repository should work fine with Git LFS. Don’t forget to tell the others to install Git LFS and then to clone the project again. To install Git LFS, they will need to download and just run the installation executable as you did in the beginning.
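If you track many file types, writing the explicit .gitattributes lines by hand gets tedious. Here is a small sketch that generates one line per extension, in the same form as the lines used in this post. The function name gitattributes_lines is made up for illustration.

```shell
# Print one explicit Git LFS attribute line per file extension.
gitattributes_lines() {
    for ext in "$@"; do
        printf '*.%s filter=lfs diff=lfs merge=lfs -text\n' "$ext"
    done
}

# Usage, from the root of a fresh clone (this overwrites .gitattributes):
# gitattributes_lines pkl jpg png npy hdf5 ipynb > .gitattributes
```

Review the result with git diff before committing, since the redirection replaces whatever the file contained before.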
Conclusion
For the remainder of your work, you’ll mainly have to ensure that Git LFS remains installed for everyone who wants to commit to the repository, and that your LFS-tracked files stay tracked by Git LFS and not by Git. You can check that by looking at the arrows for the changed files listed by the git lfs status command. Ensure that your LFS-stored files do not go to the normal Git repository. For example, moving a file from LFS to Git would contaminate the Git history with those huge files and require rewriting history once more. Likewise, if someone clones the project without having installed Git LFS, they may damage your history with their commits, which would be hard to undo and would require using --force again. I would say that Git LFS is a bit unstable for now, but when you do things right and have the right workflow, things should stay simple. Enjoy!
By the way, you might also be interested in my other article on using Hyperopt to optimize a machine learning model's hyperparameters automatically. For deep learning projects, having a good workflow is important. With the right tools and the right development environment, it is possible to move faster and to achieve better results.
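As an extra safety net, you could check from time to time that what Git actually stores for a given file is an LFS pointer and not the file itself. The sketch below reads content from standard input and tests whether it starts with the version line of a pointer. The function name is_lfs_pointer and the file path in the usage comment are made up for illustration.

```shell
# Succeed if the content on standard input starts with the version line
# of a Git LFS pointer; fail otherwise.
is_lfs_pointer() {
    sig='version https://git-lfs.github.com/spec/v1'
    # Only the first bytes are inspected, so huge inputs are not read entirely.
    head -c "${#sig}" | grep -qxF "$sig"
}

# Usage (hypothetical path): git cat-file -p prints the content exactly as
# stored in Git, which should be the pointer for an LFS-tracked file.
# git cat-file -p HEAD:data/train.npy | is_lfs_pointer && echo "stored in LFS" || echo "stored in Git"
```

A file that should be in LFS but reports "stored in Git" is a sign that someone committed without Git LFS installed.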