How to: storing large datasets in GitHub (using Git LFS)

Julian Harris
Published in TheAIEngineer
2 min read · Mar 19, 2024

If you have some training data to store, there’s a limit to how large a file GitHub will accept by default: a regular push is rejected if any single file is over 100 MiB. However, the Git LFS (Large File Storage) extension, which GitHub supports, makes larger files possible. Once set up, it works transparently within your normal git workflow. Just bear in mind that large files do take a while to upload.
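To see which files in a working tree would run into the limit before you push, a small shell sketch like the one below may help. The function name list_large_files is made up for illustration, and the default threshold of 100 MiB is an assumption based on GitHub's per-file limit for regular pushes.

```shell
#!/bin/sh
# list_large_files is a hypothetical helper, not part of git or git-lfs.
# Prints every file under DIR that is larger than LIMIT_MIB mebibytes,
# skipping anything inside a .git directory.
# The default of 100 assumes GitHub's per-file limit for regular pushes.
list_large_files() {
    dir=${1:-.}
    limit_mib=${2:-100}
    limit_bytes=$((limit_mib * 1024 * 1024))
    # "+Nc" means "more than N bytes", so no rounding is involved.
    find "$dir" -type f -size +"${limit_bytes}"c ! -path '*/.git/*'
}

# Typical use, from the root of the repo:
#   list_large_files . 100
```

Any file this prints is a candidate for an LFS tracking pattern.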


Worked example

  • You want to store comments.csv, plus any other CSV files you create, alongside the rest of the repo (so the pattern is *.csv).
  • Install git-lfs (Mac: brew install git-lfs; Ubuntu: sudo apt update && sudo apt install git-lfs).
  • Activate git-lfs with git lfs install. This is done once per user account on each machine, not per repo.
  • In the root of your repo, run git lfs track "*.csv" (pro tip: you can run it in any directory, and the pattern then applies to that directory and everything below it).
  • This creates a new file, .gitattributes, in the directory where you ran the command (or adds a line to it if the file already exists).
  • Now make sure your next git add includes BOTH the CSV files you want to upload and the .gitattributes file.
  • git commit -m <message> and git push now work as normal.
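The steps above can be sketched as a short shell session. When git lfs track succeeds, it writes a rule of the form `*.csv filter=lfs diff=lfs merge=lfs -text` to .gitattributes. The helper below checks for that rule before anything is staged; its name, lfs_rule_present, is hypothetical and not a git-lfs command.

```shell
#!/bin/sh
# lfs_rule_present is a hypothetical helper, not a git-lfs command.
# Succeeds if DIR/.gitattributes contains a line with PATTERN
# followed by the LFS filter attribute.
lfs_rule_present() {
    pattern=$1
    dir=${2:-.}
    [ -f "$dir/.gitattributes" ] &&
        grep -F -q -- "$pattern filter=lfs" "$dir/.gitattributes"
}

# Typical use, run from the directory that holds the CSV files:
#   git lfs track "*.csv"
#   lfs_rule_present "*.csv" && git add .gitattributes comments.csv
#   git commit -m "lfs + large file commit"
#   git push
```

Checking first guards against the common slip of adding a large file before its tracking rule exists, in which case it would be committed as an ordinary Git object.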

Result

Here’s an example session on my Mac. In this session, .gitattributes was created inside the datasets folder because that’s the only place I store CSV files.

➜ funnynotfunny-mac git:(main) ✗ brew install git-lfs
… various installation messages …

➜ funnynotfunny-mac git:(main) git lfs install
Updated Git hooks.
Git LFS initialized.

➜ funnynotfunny-mac git:(main) cd backend/datasets
➜ datasets git:(main) git lfs track "*.csv"
Tracking "*.csv"

➜ datasets git:(main) ✗ git add .gitattributes

➜ datasets git:(main) ✗ git add comments.csv
➜ datasets git:(main) ✗ git commit -m "lfs + large file commit"
[main 3a862f2] lfs + large file commit
2 files changed, 4 insertions(+)
create mode 100644 backend/datasets/.gitattributes
create mode 100644 backend/datasets/comments.csv

➜ datasets git:(main) ✗ git push
Uploading LFS objects: 100% (1/1), 669 MB | 1.2 MB/s, done.
Enumerating objects: 8, done.
Counting objects: 100% (8/8), done.
Delta compression using up to 8 threads
Compressing objects: 100% (5/5), done.
Writing objects: 100% (6/6), 650 bytes | 650.00 KiB/s, done.
Total 6 (delta 1), reused 0 (delta 0), pack-reused 0
remote: Resolving deltas: 100% (1/1), completed with 1 local object.
To github.com:boxabirds/funnynotfunny.git
ba4856f..3a862f2 main -> main
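The `4 insertions(+)` in the commit output is consistent with LFS having done its job: one line in .gitattributes plus a three-line pointer file that stands in for comments.csv in the Git history, while the 669 MB of real content goes to LFS storage. As a rough check, the sketch below tests whether some content is an LFS pointer by looking at its first line. The name is_lfs_pointer is made up for illustration; the version line is the one that opens a Git LFS pointer file.

```shell
#!/bin/sh
# is_lfs_pointer is a hypothetical helper, not a git-lfs command.
# Reads content on stdin and succeeds if its first line is the
# version line that opens a Git LFS pointer file.
is_lfs_pointer() {
    first_line=
    # Accept a first line even if it lacks a trailing newline.
    IFS= read -r first_line || [ -n "$first_line" ] || return 1
    [ "$first_line" = "version https://git-lfs.github.com/spec/v1" ]
}

# Typical use, inspecting the blob Git itself stored for the file:
#   git cat-file -p HEAD:backend/datasets/comments.csv | is_lfs_pointer \
#       && echo "stored via LFS"
#   git lfs ls-files    # lists the files LFS manages in the current ref
```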

The files didn’t appear!

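Sometimes the upload fails almost silently: the push returns to the prompt, but on github.com the commit ID from your command line never appears. In the example below, the only clue is the “Connection to github.com closed by remote host” message, even though the LFS progress line still ends in “done”: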

➜ datasets git:(main) ✗ git push
**Connection to github.com closed by remote host. MB/s**
Uploading LFS objects: 100% (1/1), 669 MB | 0B/s, done.
➜ datasets git:(main) ✗

I solved this by simply running git push again.
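Rerunning the push by hand can also be wrapped in a small retry loop. This is only a sketch: the retry function is hypothetical, and it assumes that a failed git push exits with a non-zero status, which the session above does not confirm.

```shell
#!/bin/sh
# retry is a hypothetical helper: run a command up to MAX_ATTEMPTS times,
# sleeping DELAY_SECONDS between failed attempts.
# Assumption: a failed `git push` exits non-zero (not confirmed by the source).
retry() {
    max_attempts=$1
    delay_seconds=$2
    shift 2
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        if "$@"; then
            return 0
        fi
        echo "attempt $attempt of $max_attempts failed" >&2
        if [ "$attempt" -lt "$max_attempts" ]; then
            sleep "$delay_seconds"
        fi
        attempt=$((attempt + 1))
    done
    return 1
}

# Typical use:
#   retry 3 10 git push
```

Because the failure can be quiet, it is worth confirming the result as well: after the push, the hash printed by git rev-parse HEAD should match the one that git ls-remote origin main reports for the remote branch.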
