Dealing with Large Data Files and Bad Git Commits
As I dive deeper into my job search, I’ve taken up a new data science project to widen my skill set and better align it with my career goals. The main goal of this project is to work with satellite imagery and Convolutional Neural Networks. More specifically, I retrieved data from Kaggle and will be classifying cloud formation types in order to better understand weather patterns.
I’ve downloaded a dataset that I’ve realized is larger than I am used to working with and have gotten absolutely stuck trying to push it to GitHub!
I have found the issue and thought others might find a written solution helpful on here.
I had just committed the very first changes to my new repo after connecting to the Kaggle API and downloading my dataset, which was over 5 GB, and then tried to push.
My terminal hit me with the following:
remote: error: GH001: Large files detected. You may want to try Git Large File Storage - https://git-lfs.github.com
remote: error: See http://git.io/iEPt8g for more information.
To https://github.com/oac0de/Understanding_Cloud_Formations
 ! [remote rejected] main -> main (pre-receive hook declined)
error: failed to push some refs to 'https://github.com/oac0de/Understanding_Cloud_Formations'
At this point, I remembered that GitHub enforces a firm limit of 100 MB per file, and I was way over that. Git had committed my changes into history, but GitHub was rejecting the push, for the obvious reason of saving space on their servers.
What was I to do here? I thought I could just delete the files and re-commit the changes and then push, right? Nope. I still got the error message.
As I am continually learning, when you ‘git commit’ something, you are literally etching your changes in Git stone, for all eternity. Not really. But you are saving all of your changes into your Git history, which is one of the reasons Git is such robust version control software. It’s also why simply deleting the files didn’t help: the 5 GB still lived inside my earlier commits, and the push sends the whole history. Thanks to trusty StackExchange, there is a workaround.
To put a bandaid on the git push issue, so that you can at least upload your code, first remove your large files from Git’s index (this leaves them on disk):
git rm --cached [your_large_file_name]
Next, you perform what’s known as “squashing” and essentially squash your latest Git commits into one. Here we will squash the last two commits, because I had messed up a second time.
git reset --soft HEAD~2
Squashed! The --soft flag keeps all your changes staged, so now simply enter a message for the combined commit:
git commit -m "New message for the combined commit"
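Putting those steps together, here is a self-contained sketch you can run in a throwaway directory. The file names (train_images.zip, model.py) are made up for illustration, and the demo starts from three commits so that HEAD~2 exists:

```shell
# Self-contained demo of the squash-and-drop workflow in a throwaway repo.
# All file names here are made-up examples.
cd "$(mktemp -d)"
git init -q demo && cd demo
git config user.email "demo@example.com"
git config user.name "Demo"

echo "readme" > README.md
git add README.md && git commit -qm "Initial commit"

# Two commits that accidentally include a huge dataset
head -c 1024 /dev/zero > train_images.zip   # stand-in for a 5 GB file
echo "print('model')" > model.py
git add . && git commit -qm "Add code and dataset (oops)"
echo "# tweak" >> model.py
git add model.py && git commit -qm "Second commit (oops again)"

# The fix: drop the file from the index, squash the two bad commits
git rm --cached -q train_images.zip
git reset --soft HEAD~2
git commit -qm "New message for the combined commit"

git log --oneline     # two commits remain; neither ships the zip
ls train_images.zip   # the file is still on disk, just untracked
```

After this, a regular git push would succeed, because the large file no longer appears anywhere in the history being pushed.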
Voila. After a final git push, our code is up on our remote repository, just without our large data files.
To avoid these problems in the future, it is best practice to add your datasets (if they are larger than 100 MB) to your .gitignore file. To do this, you can create a .gitignore file from your terminal if you don’t have one already:
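A minimal sketch, assuming a Unix-style shell (macOS, Linux, or Git Bash on Windows):

```shell
# Create an empty .gitignore in the project root
touch .gitignore
```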
Then open that same .gitignore file; the command depends on your operating system (for example, ‘open .gitignore’ on macOS), and it will bring the file up in a text editor. There is a good Python .gitignore template hosted on GitHub that takes care of the usual stuff. Simply copy and paste all of it into your opened .gitignore file.
You’ll need to add extra lines for your large data files as well. Just include the path to each file, along with the file name, and you should be good to go!
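For example, you can append the entries straight from the terminal; the paths below are made-up examples, so substitute your own dataset’s location:

```shell
# Append ignore rules for large data files to .gitignore.
# "data/" and "train_images.zip" are example paths -- use your own.
echo "data/" >> .gitignore             # ignore an entire data folder
echo "train_images.zip" >> .gitignore  # or ignore a single large file
```

Note that .gitignore only stops files from being staged in the future; anything already committed has to be removed from history as shown above.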
The more I experiment and research, the more I find that a better knowledge of the inner workings of Git can really save a lot of time. But then again, when an unknown type of problem arises, we can really thank the selfless professional coders who devote their expertise to us noobs on StackExchange and StackOverflow. Lifesavers!