How to: storing large datasets in GitHub (using Git LFS)
If you have some training data, there’s a limit to how large a file GitHub will accept by default (individual files over 100 MB are rejected). However, Git’s LFS (“Large File Storage”) extension, which GitHub supports, makes this possible. Once set up, it works transparently in your normal git workflow. Just bear in mind that large files do take a while to upload.
Worked example
- You want to store comments.csv along with the rest of the repo, plus any other csv files you create (so, *.csv).
- Install git-lfs (Mac: brew install git-lfs; Ubuntu: sudo apt update && sudo apt install git-lfs).
- Activate git-lfs on each machine: git lfs install. This is done once per machine, not per repo.
- In the root of your repo, run git lfs track "*.csv". (Pro tip: you can run it in any subdirectory instead, and the pattern will then apply to that directory and everything below it.)
- This will create a new file, .gitattributes, in the directory where you ran the command.
- Now make sure your next git add includes BOTH the csv files you want to upload and the .gitattributes file.
- git commit -m <message> and git push now work as normal.
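To make the tracking step less of a black box, here is a minimal sketch of what git lfs track "*.csv" records and how far it reaches. It assumes only plain git is available, so the .gitattributes line is written by hand (it is the same line git lfs track writes). The throwaway repo and the other.csv path are hypothetical, used only for illustration.

```shell
# Sketch: see which paths a .gitattributes rule covers, in a throwaway repo.
demo=$(mktemp -d)
cd "$demo"
git init -q .
mkdir -p backend/datasets

# The line `git lfs track "*.csv"` adds; written by hand here so the
# sketch runs even on a machine without git-lfs installed.
printf '%s\n' '*.csv filter=lfs diff=lfs merge=lfs -text' > backend/datasets/.gitattributes

# A csv inside the directory that holds .gitattributes gets the lfs filter...
in_scope=$(git check-attr filter -- backend/datasets/comments.csv)
# ...while a csv elsewhere in the repo (hypothetical path) does not.
out_of_scope=$(git check-attr filter -- other.csv)

echo "$in_scope"      # backend/datasets/comments.csv: filter: lfs
echo "$out_of_scope"  # other.csv: filter: unspecified
```

Running git check-attr filter against your own file is a quick way to confirm a csv will go through LFS before you commit it.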
Result
Here’s an example session on my Mac. In this session, .gitattributes was created inside the datasets folder because that’s the only place I store csv files:
➜ funnynotfunny-mac git:(main) ✗ brew install git-lfs
… various installation messages …
➜ funnynotfunny-mac git:(main) git lfs install
Updated Git hooks.
Git LFS initialized.
➜ funnynotfunny-mac git:(main) cd backend/datasets
➜ datasets git:(main) git lfs track "*.csv"
Tracking "*.csv"
➜ datasets git:(main) ✗ git add .gitattributes
➜ datasets git:(main) ✗ git add comments.csv
➜ datasets git:(main) ✗ git commit -m "lfs + large file commit"
[main 3a862f2] lfs + large file commit
2 files changed, 4 insertions(+)
create mode 100644 backend/datasets/.gitattributes
create mode 100644 backend/datasets/comments.csv
➜ datasets git:(main) ✗ git push
Uploading LFS objects: 100% (1/1), 669 MB | 1.2 MB/s, done.
Enumerating objects: 8, done.
Counting objects: 100% (8/8), done.
Delta compression using up to 8 threads
Compressing objects: 100% (5/5), done.
Writing objects: 100% (6/6), 650 bytes | 650.00 KiB/s, done.
Total 6 (delta 1), reused 0 (delta 0), pack-reused 0
remote: Resolving deltas: 100% (1/1), completed with 1 local object.
To github.com:boxabirds/funnynotfunny.git
ba4856f..3a862f2 main -> main
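In the session above, "2 files changed, 4 insertions(+)" for a 669 MB file is what you would expect if Git itself only stored a small text pointer (a pointer is three lines: version, oid, size) plus the one-line .gitattributes, with the real data sent to LFS separately. With git-lfs installed, git lfs ls-files lists the files handled that way. The sketch below is an alternative check using plain git only; the function name is_lfs_pointer is made up for this example.

```shell
# Sketch: succeed if the blob Git stores for <path> at <rev> is an LFS pointer.
# Usage: is_lfs_pointer <rev> <path-from-repo-root>
is_lfs_pointer() {
  rev=$1
  path=$2
  # cat-file prints the raw stored blob, so LFS does not swap the real
  # file contents back in; a pointer always starts with this version line.
  first_line=$(git cat-file -p "$rev:$path" | head -n 1)
  [ "$first_line" = "version https://git-lfs.github.com/spec/v1" ]
}

# Example call (not run here):
#   is_lfs_pointer HEAD backend/datasets/comments.csv && echo "stored via LFS"
```

If this check fails for a file you meant to track, the file was most likely added before .gitattributes covered it.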
The files didn’t appear!
Sometimes the upload fails silently: the LFS progress line still says 100% and done, but when you look at github.com the commit ID you saw on the command line doesn’t appear. Here’s an example of silent failure:
➜ datasets git:(main) ✗ git push
**Connection to github.com closed by remote host. MB/s**
Uploading LFS objects: 100% (1/1), 669 MB | 0B/s, done.
➜ datasets git:(main) ✗
I solved this by simply running git push again.
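If you would rather not eyeball github.com after every large push, the "push again" fix can be sketched as a small retry loop. This is only a sketch: the function names remote_has_head and push_until_landed, the default of 3 attempts and the 10 second pause are all assumptions for illustration. Because it is unclear whether git push reports an error in the failure shown above, the loop ignores the exit status and asks the remote which commit it actually has.

```shell
# Succeeds when the remote branch points at the same commit as local HEAD.
remote_has_head() {
  remote=$1
  branch=$2
  local_id=$(git rev-parse HEAD)
  # ls-remote prints "<commit id><TAB><ref name>"; keep only the id.
  remote_id=$(git ls-remote "$remote" "refs/heads/$branch" | cut -f1)
  [ "$local_id" = "$remote_id" ]
}

# Push, then check the remote; repeat up to <attempts> times.
# Usage: push_until_landed <remote> <branch> [attempts] [delay-seconds]
push_until_landed() {
  remote=$1
  branch=$2
  attempts=${3:-3}
  delay=${4:-10}
  n=1
  while [ "$n" -le "$attempts" ]; do
    # Ignore push's own exit status and trust the remote check instead.
    git push "$remote" "$branch" || true
    if remote_has_head "$remote" "$branch"; then
      return 0
    fi
    echo "push attempt $n did not land" >&2
    n=$((n + 1))
    if [ "$n" -le "$attempts" ]; then
      sleep "$delay"
    fi
  done
  return 1
}

# Example call (not run here):
#   push_until_landed origin main
```

The remote check only makes sense when the branch you push is the one you have checked out, since it compares against local HEAD.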