GitHub and Large Files

S. T. Lanier · The Startup · Mar 21, 2020

Just when you thought you were done with the project. Captured circa 3:00 AM. Image by author.

There’s one challenge anyone working in data analysis will encounter at some point: GitHub’s file size limit, which is exactly 100 MB. Due to the nature of the work, large files are par for the course (I’m looking at you, half-gig CSV files), and GitHub is an industry standard for just about everyone. How are analysts reconciling files that are too large with storage that’s too small? In this post, I’ll cover four of the most common approaches to dealing with the conundrum:

  1. Store the file using Git Large File Storage (Git LFS)
  2. Cut your large file into smaller files
  3. Don’t push the file to GitHub (by use of the .gitignore file)
  4. Access the data without saving it locally

1. Git Large File Storage (LFS)

Your best option for preserving the integrity of your project and your commit history is to make use of Git LFS. It’s easy to declare which files you want tracked, and you can then continue working within the git workflow you’re already familiar with. When you commit and push, Git LFS intercepts the designated files, stores them on the LFS server, and leaves pointers in your GitHub repository in their place. After installing LFS on your local device, you only need three lines to set up LFS in the desired repository and track all CSV files therein, shown below. Execute this in each local repository where you plan to use LFS:

$ git lfs install            # set up Git LFS in this repository
$ git lfs track "*.csv"      # track every CSV file with LFS
$ git add .gitattributes     # stage the tracking rules LFS just wrote

And you’re done. It’s just that easy. Now you can commit and push like you normally would, and your large files will be stored on the LFS server and linked to the repository. You can also check which files are being tracked by LFS using the following command:

$ git lfs ls-files

Lastly, if you’ve already committed the file to your repository, you can use git lfs migrate to move the file into LFS and rewrite it out of your regular git history.
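For example, a minimal sketch (assuming, as above, that oversized CSV files are the culprit; adjust the pattern to match your own files):

$ git lfs migrate import --include="*.csv" --everything

Because this rewrites commits, you’ll need to force push afterward and give any collaborators a heads-up.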

There’s only one potential drawback to this option: LFS itself has a ceiling (a gigabyte of free storage and a gigabyte of monthly bandwidth), which can be raised by paying $5/month per data pack, a data pack being 50 GB of storage and 50 GB of bandwidth, which seems like a pretty good deal for the amount of storage.

2. Make Smaller Files

Another option would be to make smaller files out of your big file. To use a Python example, you could read your data into a Pandas DataFrame (in a separate place from your Jupyter Notebook), cut that DataFrame into smaller ones, and export each of the smaller pieces as its own file, as sketched below. Then you can delete the very large file that was creating the problem in the first place.
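Here’s a minimal sketch of that idea in Python, assuming a hypothetical big_file.csv and a chunk size you’d tune so each piece lands comfortably under GitHub’s 100 MB limit:

import pandas as pd

# Read the oversized file once (big_file.csv is a stand-in name).
df = pd.read_csv("big_file.csv")

# Write it back out in pieces; tune chunk_size so each part stays small.
chunk_size = 500_000
for i, start in enumerate(range(0, len(df), chunk_size)):
    df.iloc[start:start + chunk_size].to_csv(f"big_file_part{i}.csv", index=False)

Reading the data back later is just a pd.concat over pd.read_csv calls on the pieces.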

Effective, but potentially a less elegant solution, as you’ll have more file clutter, and do you really want data that logically belongs together as a group to be split apart? This strategy could also fail to solve your problem: while GitHub’s hard limits are 100 MB per file and 100 GB per repository, it encourages users to keep repositories under 1 GB.

3. Don’t Push The File to GitHub (.gitignore)

If, after some consideration, you decide that you don’t need this large file to be sent up to GitHub, just include the file name in the repository’s .gitignore file. I think I’ve seen this method used more often than any of the other three in this article. You keep a local copy of the data and supply a URL (probably from wherever you originally got the data––Kaggle?) to the data’s source on the project’s README. This, of course, wouldn’t work so well if you had personally scraped and created your dataset, but for data that’s already packaged somewhere on the web, this can work.

There are three situations you might find yourself in, depending on what stage of stage/commit/push you're at: (1) you haven't committed the file, (2) you've committed the file but not pushed, and (3) you've committed and pushed the file to GitHub.

If you find yourself in the first (1) case, simply add the file name to the repository’s .gitignore using whatever editor you like, or just use

$ echo "big_file.sql" >> .gitignore

If, as in case (2), this is a file you’ve already committed to the project but haven’t pushed to the remote repository, you can untrack it by removing it from Git’s index (the cache) and then adding the file to .gitignore:

$ git rm --cached big_file.sql
$ echo "big_file.sql" >> .gitignore

But if you’re in case three (3) and you’ve committed the file and pushed to the remote repository, you’ll need to A) clean up the repository’s git history using the git filter-branch command, B) add the file to .gitignore, and finally C) force push those changes. For example, if you wanted to get rid of big_file.sql located at Users/me/myproject/big_file.sql, you would need to

A) Execute git filter-branch

$ git filter-branch --force --index-filter \
    "git rm --cached --ignore-unmatch Users/me/myproject/big_file.sql" \
    --prune-empty --tag-name-filter cat -- --all

B) Add the file to .gitignore

$ echo "big_file.sql" >> .gitignore $ git add .gitignore $ git commit -m "Add big_file.sql to .gitignore"

C) Force push those changes

$ git push origin --force --all
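An aside that goes beyond the original workflow above: the git filter-branch documentation now points people toward the separate git-filter-repo tool for history rewrites. If you have it installed, the equivalent cleanup (assuming big_file.sql sits at the root of the repository) is a one-liner, after which you re-add your remote and force push as above:

$ git filter-repo --invert-paths --path big_file.sql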

4. Access Data Without Saving It Locally

This option will only be viable for some projects, namely those where the data can be accessed remotely via API/(private?) remote server/conjuring, and ideally it’s data you feel you have some control over, or data that exists in a relatively fixed state. Rather than saving the data you need as a file and then writing your code around that local file, query for the data remotely, in code, inline, and load it straight into your list/array/DataFrame, as sketched below. Don’t save it in a file on your side at all.
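Here’s a minimal sketch of the idea in Python, assuming a hypothetical URL where the data lives; the same pattern works with pd.read_sql and a connection to a remote database:

import pandas as pd

# Stand-in URL for wherever the data actually lives (an API endpoint,
# a hosted CSV, a database you can reach over the network, etc.).
DATA_URL = "https://example.com/big_file.csv"

# Pull the data straight into a DataFrame at runtime; nothing is written
# to disk, so there's no oversized file to keep out of the repository.
df = pd.read_csv(DATA_URL)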

I’m calling this approach the Indiana Jones approach, because it feels very caution-thrown-to-the-wind for a lot of reasons.

For one, if the data is migrated, or if your database undergoes a schema migration (that is, if the way you interact with the database/file to get that data changes), you must rewrite your code to get the same information you started with, no negotiations.

For seconds, if the database/file goes offline, all subsequent code is useless until it comes back online.

For thirds, if the data changes, the data changes, including, potentially, the trends you initially sussed out and regressed and drew conclusions from.

In essence, you are trusting in whatever higher power you believe in that the data will be the way it was when you left it, in every way, shape, and form: the URL is the same, and the API is the same, and it’s still, for instance, a MySQL database, and all the features are the same, and all the data belonging to those features is the same. Maybe you own this external place where the data exists, and then you can make your own bets with destiny and fate and nature, but otherwise, you are asking a question the universe has already answered:

No man ever steps in the same river twice, for it’s not the same river, and he’s not the same man.

— Heraclitus, bemoaning the loss of his beloved, 5 gig DB of stock values

You run the risk of the data not being the same when you come back to it, in which case you will have to significantly alter your code.

Recap

In short: use Git LFS to keep large files versioned alongside your repository, split the big file into smaller ones, keep the file local and .gitignore it, or skip the local file entirely and query the data remotely. Image by author.

