The first time I encountered Git LFS was in my third week of data science bootcamp. Some of you might already be thinking — whoa, just git is difficult enough, why learn about another feature of git? Or most importantly…what is Git LFS and why should I continue reading this article right now? Well, you are not alone. I was also asking myself these questions when I first started learning about it.
I guess I should provide some context around how I came to hear about Git LFS to explain why I decided to dive into this. My partner and I were working on a project involving movie data analysis of information from OMDB, Rotten Tomatoes, and IMDB. As we were pushing our work to GitHub, we received an error message while uploading the IMDB datasets.
It was at this time that I learned that GitHub limits the size of the files you can push. Files larger than 100MB are rejected, while files larger than 50MB trigger a warning message but can still be pushed through. Git itself has no such hard limit; the restriction comes from the hosting service.
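One way to avoid the surprise is to look for oversized files before pushing. The snippet below is a minimal sketch that runs in a throwaway directory; the file names and the 60 MiB dummy file are made up for the demo, and the 50MB threshold mirrors the warning limit mentioned above.

```shell
# Throwaway demo directory; point find at your own project folder instead.
demo_dir=$(mktemp -d)
dd if=/dev/zero of="$demo_dir/big_dataset.csv" bs=1048576 count=60 2>/dev/null  # 60 MiB dummy file
printf 'id,title\n1,Example\n' > "$demo_dir/small_dataset.csv"

# List files larger than 50 MiB (51200 KiB), skipping git's own .git folder.
large_files=$(find "$demo_dir" -type f -size +51200k ! -path '*/.git/*')
echo "$large_files"

rm -rf "$demo_dir"
```

Anything this prints is a candidate for Git LFS (or for leaving out of the repository altogether).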
At the time, we were under a deadline to submit our project and of course, as beginners to git, the first thing we thought was: “OK… let’s just try to undo it and push up the files one by one instead.”
Spoiler: it did not work. So, we went with the next option, followed the error message and tried Git Large File Storage.
To understand Git LFS, it is helpful to know about git. So, before I dive into this topic, let’s talk briefly about git.
What is Git?
According to Atlassian’s Bitbucket website:
Git is a distributed version control system, meaning the entire history of the repository is transferred to the client during the cloning process.
Let’s back up a little. What is a version control system? A version control system is a tool which manages the changes to source code, files, and other forms of information. Changes are tracked as commits, which are snapshots of edits at a particular time. A distributed version control system is a type of version control system, which allows for the entire code base including all of its history (all of the changes) to be on each developer’s computer. This allows each developer working on the project to see the entire timeline of edits to the project.
Git was created in 2005 by Linus Torvalds, the creator of Linux, after he and other developers stopped using BitKeeper, a proprietary source control management (SCM) system, when it was no longer available for free use. According to Wikipedia, after trying with no luck to find a free and open system to replace BitKeeper, Linus decided to create a version control system that would be small and fast (patching should “take no more than 3 seconds”), support branching (helpful for team collaboration and software development), be the complete opposite of the Concurrent Versions System (CVS), and include safeguards against data corruption.
Because git was created to be small and fast, it was built primarily for source code, not large files. Remember, it is a distributed version control system, which means the full history of a project is transferred when it is cloned, and every new commit is transferred on each pull. Each commit adds to the history, which increases the size of the repository. When large files change often, every version is kept, and over time the repository can become unwieldy.
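To make the growth concrete, here is a small sketch using plain git only. It commits two versions of a 5 MiB file of random bytes (a made-up stand-in for a dataset, chosen because random data barely compresses) and compares the size of the file you see with the size of the stored history.

```shell
orig_dir=$(pwd)
repo=$(mktemp -d)
cd "$repo"
git init -q

# Identity is passed inline so the demo does not depend on your global config.
commit() { git -c user.name="Demo" -c user.email="demo@example.com" commit -q -m "$1"; }

# Version 1 of the "dataset"
head -c 5242880 /dev/urandom > data.bin
git add data.bin && commit "Add dataset"

# Version 2: entirely new contents under the same file name
head -c 5242880 /dev/urandom > data.bin
git add data.bin && commit "Update dataset"

commit_count=$(git rev-list --count HEAD)
working_kib=$(du -sk data.bin | cut -f1)   # what you see in the folder
history_kib=$(du -sk .git | cut -f1)       # what a clone has to download
echo "commits: $commit_count, working copy: ${working_kib} KiB, history: ${history_kib} KiB"

cd "$orig_dir" && rm -rf "$repo"
```

The folder only ever shows one copy of the file, but the history holds both versions, so it comes out roughly twice as large.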
However, there are many fields that work with large files for their projects, such as those involving music files, picture files, and datasets. What can we do in these situations? That is where Git LFS comes in.
What is Git Large File Storage (LFS)?
Git LFS is a git extension, written in Go, which was created by developers at Atlassian and GitHub as well as other open source collaborators to work around the problems large files cause in git and the file size limits of git hosts. It does this by storing large files in a separate location from your repository and placing a small pointer file in your repository that points to the stored file.
The best way for me to understand how this works was to first forget about GitHub, Bitbucket, GitLab, and remote repositories for a second. Let’s just focus on the local computer, which we can think of as having three sections (“Working Copy”, “Local Repository”, “LFS Cache”).
The local repository is the git repository on your computer, which has been initialized using git init or cloned from a remote repository; it holds the history of commits inside the hidden .git folder. The working copy is the set of files and folders you see and edit in that directory. The LFS cache is separate local storage (inside .git/lfs) where Git LFS keeps the actual contents of large files, while the repository itself holds only their pointer files. Keep these terms in mind, as they will come in handy as we go through the steps on how to work with Git LFS in the next section.
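A pointer file is just a few lines of text: a spec version, the SHA-256 hash of the real file, and its size in bytes. The sketch below builds that text by hand for a tiny made-up file (movies.csv) purely to show the layout; in real use Git LFS generates the pointer itself.

```shell
work=$(mktemp -d)
printf 'id,title\n1,Example Movie\n' > "$work/movies.csv"   # stand-in for a large file

# SHA-256 of the real contents (sha256sum on Linux, shasum on macOS)
if command -v sha256sum >/dev/null 2>&1; then
  oid=$(sha256sum "$work/movies.csv" | cut -d' ' -f1)
else
  oid=$(shasum -a 256 "$work/movies.csv" | cut -d' ' -f1)
fi
size=$(wc -c < "$work/movies.csv" | tr -d ' ')

# Hand-built text in the pointer layout; this is an illustration, not an LFS call.
pointer=$(printf 'version https://git-lfs.github.com/spec/v1\noid sha256:%s\nsize %s\n' "$oid" "$size")
echo "$pointer"

rm -rf "$work"
```

Because the pointer is only three short lines, the repository stays small no matter how big the real file is.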
Working with Git LFS
The great thing about Git LFS is that we can continue to use the usual git commands and workflow we all know and love with it. The only changes are a few additional commands and another storage location to keep in mind.
OK, now that we have some information about git and Git LFS, let’s walk through how to use it. I will go through two possible scenarios, but first, install Git LFS via Homebrew (brew install git-lfs) or from the releases page (https://github.com/git-lfs/git-lfs/releases).
Scenario 1: Using Git LFS after getting an error message with the usual git commands.
Here I have a new repository in which I placed a large data file (1.9GB). I wanted to make sure any changes to the data file are tracked and eventually backed up remotely. First, I go through the usual git commands to stage the file (git add), save a copy of the changes on the local repository (git commit), and push the copy to the remote repository (git push). The push is rejected with an error saying the file exceeds GitHub’s file size limit.
How should I resolve this error? One option is to undo the changes using git reset and either forget about saving the file, zip the file to compress it to a smaller size, or restart with Git LFS. Another option would be to stay where you are and integrate Git LFS so you can continue the process, which is what we will focus on here.
Step 1: Once Git LFS is installed, set it up by running git lfs install.
Although we have installed Git LFS on the computer, it does not do anything until it is hooked into git. A great analogy is a storage company. Storage companies are available throughout the city where we can choose to store our items, but they do not automatically knock on the door and start storing the items. Instead, the first step is to start a relationship with the company by calling and setting up an agreement. It is the same here. To start that relationship, run git lfs install. This registers the Git LFS filters in your git configuration and, when run inside a repository, installs the hooks that repository needs. It only has to be done once per user account, and running it again inside a repository is harmless.
Step 2: Tell Git LFS which files to track with the command: git lfs track "*.file_extension".
Again, we need to tell Git LFS which files or types of files we would like it to track, so the files can be stored at a separate location instead of in git and we avoid getting the error message again. To do so, run git lfs track "*.file_extension". For example, if all csv files need to be tracked, run git lfs track "*.csv", or if all jpeg image files need to be tracked, run git lfs track "*.jpg". The asterisk (*) is a wildcard that matches any file name with that extension. The quotes (" ") are necessary when running this code, and they need to be plain straight quotes. Without them, the shell expands the pattern itself, so only the files that exist right now get tracked instead of the pattern, which causes problems later when new files are added.
Just as a storage company hands over a receipt when an order is placed to start storing an item, Git LFS keeps a record of what it tracks: when we track a pattern, a .gitattributes file is created. If there is already a .gitattributes file, the pattern is added to it as a new line.
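To see why the quotes around the pattern in Step 2 matter, the sketch below counts how many arguments a command receives with and without them. It uses a hypothetical helper, count_args, and two empty made-up csv files; no Git LFS is involved.

```shell
orig_dir=$(pwd)
work=$(mktemp -d)
cd "$work"
touch ratings.csv titles.csv

# Hypothetical helper: prints how many arguments it was given.
count_args() { echo "$#"; }

unquoted=$(count_args *.csv)    # the shell expands the glob to the two existing files
quoted=$(count_args "*.csv")    # the literal pattern *.csv is passed through untouched
echo "unquoted: $unquoted argument(s), quoted: $quoted argument(s)"

cd "$orig_dir" && rm -rf "$work"
```

With the quotes, git lfs track receives the pattern itself, so csv files added next month are covered too.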
Step 3: Git add, commit, & push your .gitattributes file to your repo.
Similar to the .gitignore file, the .gitattributes file lives in the repository, and it is updated automatically each time git lfs track is run. To make sure the changes are being tracked, each time the .gitattributes file is updated, it needs to be staged and committed. Otherwise, collaborators and fresh clones will not know which files belong in Git LFS, and large files may end up committed to git directly.
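Steps 2 and 3 can be checked with plain git. In the sketch below, the .gitattributes line that git lfs track "*.csv" appends is written by hand (so the demo runs even where Git LFS is not installed), and git check-attr reports which paths the rule applies to. The file names are made up.

```shell
orig_dir=$(pwd)
repo=$(mktemp -d)
cd "$repo"
git init -q

# The line `git lfs track "*.csv"` adds, written by hand for this demo.
echo '*.csv filter=lfs diff=lfs merge=lfs -text' >> .gitattributes

# Ask git which filter applies to a path (the files do not need to exist).
csv_filter=$(git check-attr filter -- ratings.csv)
py_filter=$(git check-attr filter -- analysis.py)
echo "$csv_filter"
echo "$py_filter"

# Step 3: stage and commit .gitattributes like any other file.
git add .gitattributes
git -c user.name="Demo" -c user.email="demo@example.com" commit -q -m "Track csv files with Git LFS"
tracked=$(git ls-files)

cd "$orig_dir" && rm -rf "$repo"
```

If a file you expected to be handled by Git LFS shows up as unspecified here, the pattern in .gitattributes does not match it.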
Step 4: Now the real secret in this scenario is to use git lfs migrate to move your commits from git to Git LFS.
What allows us to stay in the current state, without having to undo our commits and restart, is a nifty line of code that lets us move or “migrate” our commits from git to Git LFS. To move our commits, we can run git lfs migrate import --include="*.file_extension". In order to see what file types are in the commits and can be tracked by Git LFS, we can run git lfs migrate info. By moving our commits over, we can continue to the next step: pushing our changes to GitHub. More details in the next step.
Important Note: Moving commits involves rewriting the history, which means the affected commits get new IDs. A tag can be added to prevent overwriting the changes listed in the history, but this will also prevent this line of code from running.
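What "rewriting the history" means can be seen without Git LFS at all. The sketch below does not run git lfs migrate; it uses git commit --amend as a stand-in, since both replace an existing commit with a new one. The file name and contents are made up.

```shell
orig_dir=$(pwd)
repo=$(mktemp -d)
cd "$repo"
git init -q
commit() { git -c user.name="Demo" -c user.email="demo@example.com" commit -q "$@"; }

echo "first draft" > notes.txt
git add notes.txt && commit -m "Add notes"
before=$(git rev-parse HEAD)

# Rewrite the commit: same message, different contents.
echo "second draft" > notes.txt
git add notes.txt && commit --amend --no-edit
after=$(git rev-parse HEAD)
commit_count=$(git rev-list --count HEAD)

echo "before: $before"
echo "after:  $after"
cd "$orig_dir" && rm -rf "$repo"
```

There is still only one commit, but its ID has changed. That is harmless for commits that were never pushed, and it is the reason to be careful with commits other people already have.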
Step 5: Lastly, run git push to push the changes to GitHub and the large commits (i.e. large files) to Git LFS.
After migrating the commits to Git LFS, we have a local git repository that has been updated with a change (in this case, a new data file, represented by a pointer file that points to Git LFS) and a local Git LFS cache, which now stores the data file itself. In this step, we push the changes to GitHub. The contents of the local git repo that fit within the file size limit (i.e. source code and the pointer file) are stored on GitHub, which acts as the Git host, and the contents of the Git LFS cache are uploaded to the Git LFS store in the cloud.
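Put together, the five steps of this scenario can be sketched as one small shell function. The function name migrate_to_lfs and its pattern argument are hypothetical; the commands inside are the ones from the steps above. It assumes Git LFS is installed, the working copy is clean, and the large file has been committed locally but never successfully pushed.

```shell
# Hypothetical helper; usage: migrate_to_lfs "*.csv"
migrate_to_lfs() {
  pattern=$1
  git lfs install &&                               # Step 1: hook Git LFS into git
  git lfs track "$pattern" &&                      # Step 2: record the pattern in .gitattributes
  git add .gitattributes &&                        # Step 3: commit the tracking rule
  git commit -m "Track $pattern with Git LFS" &&
  git lfs migrate import --include="$pattern" &&   # Step 4: rewrite local commits to use pointers
  git push                                         # Step 5: push commits and large files
}
```

The && between commands stops the sequence at the first failure, so a problem in the migration does not lead to a push. Defining the function does nothing on its own; it only runs when called from inside the repository.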
Scenario 2: Using Git LFS from the beginning.
If it is known that there are large files in the repository, we can use Git LFS from the beginning by going through steps 1 to 3 listed above. After going through these steps, return to the usual git commands (git add, git commit) to stage and save the changes in the local repo. Then, complete step 5 listed above to push the changes to GitHub or another Git host and to the remote Git LFS store.
Seems pretty simple, right? Just remember the five steps above and we should be good to go. Pulling down the changes from a remote repository is also straightforward. It is the same set of git commands we typically use: git pull or git fetch and git merge.
Well, it was actually pretty confusing to learn at first. Here are some notes on what I learned as I fumbled my way through this:
- For those uncomfortable with git, this can add another layer of complexity. This was my biggest challenge when learning about Git LFS. Learning more about the git commands, git workflow, and how Git LFS fits in with git was key to learning the steps. This video on Atlassian’s website gave me the “epiphany” I needed to put it all together.
- Even with Git LFS, there is still a per-file size limit set by the host. On GitHub it is 2GB on the free plan, with somewhat higher limits on paid plans. For anything bigger, it is probably time to look into cloud storage.
- Git LFS is an active open source project which is continuously being improved. Its GitHub repository keeps a running list of current issues.
- There are still issues when trying to resolve merge conflicts. It is best to communicate within the team before pushing any changes and merging.
- Larger files can still be a bit slow when being pushed to the remote repo.
Overall, I really enjoyed learning about this resource and it was very helpful in making me more comfortable with using git.
I would not have been able to understand anything about this topic without the knowledge gleaned from the resources below. I recommend checking them out.
- Git LFS Website: https://git-lfs.github.com/
- Atlassian Git LFS Tutorial: https://www.atlassian.com/git/tutorials/git-lfs
- GitLab Git LFS Documentation: https://docs.gitlab.com/ee/workflow/lfs/manage_large_binaries_with_git_lfs.html
- Dzone — What is Git LFS: https://dzone.com/articles/learning-git-what-is-git-lfs
- Oh Shit Git: https://ohshitgit.com/
- Visualizing Git: https://git-school.github.io/visualizing-git/