Dealing with Large Data Files and Bad Git Commits

AKA Git Squash

Orin Conn
Feb 16 · 3 min read
Photo by Florian Olivo on Unsplash

As I dive deeper into my job search, I’ve taken up a new data science project to widen my skill set and tie it more closely to my career goals. The main goal of this project is to work with satellite imagery and Convolutional Neural Networks. More specifically, I retrieved data from Kaggle and will be classifying cloud types to better understand weather patterns.

I’ve downloaded a dataset that I’ve realized is larger than I am used to working with and have gotten absolutely stuck trying to push it to GitHub!

I have found the issue and thought others might find a written solution helpful on here.

What happened:

I had just run ‘git commit’ on the very first changes to my new repo, after connecting to the Kaggle API and downloading my dataset, which was over 5 GB.

Git push…

My terminal hit me with an error: GitHub had rejected the push.

At this point, I remembered that GitHub blocks any single file larger than 100 MB! I was way over that. Essentially, Git had recorded my changes in its local history, but GitHub refuses to accept a push that contains files that large.
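A quick way to catch this before pushing is to scan the working tree for oversized files. The snippet below is only a sketch: the function name find_large_files is made up for illustration, and the 100M default simply mirrors GitHub's per-file limit.

```shell
# List files larger than a threshold, skipping Git's own .git directory.
# find_large_files is a hypothetical helper name; 100M mirrors GitHub's per-file limit.
find_large_files() {
  threshold="${1:-100M}"
  find . -path ./.git -prune -o -type f -size "+${threshold}" -print
}

# Run from the root of the repository.
find_large_files 100M
```

Anything this prints is a candidate for .gitignore before the first commit.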

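What was I to do here? I thought I could just delete the files, re-commit the changes, and then push, right? Nope. I still got the error message, because the earlier commit that contained the dataset was still in my history, and a push sends every commit, not just the latest one.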
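To see for yourself that a deleted file is still sitting in history, you can list the largest objects Git knows about. This is a rough sketch built from standard Git plumbing commands; the helper name largest_blobs is hypothetical.

```shell
# Print the N largest blobs reachable from any branch or tag, biggest first.
# Output columns: size in bytes, then path. largest_blobs is a hypothetical helper name.
largest_blobs() {
  git rev-list --objects --all |
    git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
    grep '^blob ' |
    cut -d' ' -f2- |
    sort -n -r |
    head -n "${1:-5}"
}

# Usage, from inside a repository:
#   largest_blobs 5
```

If the dataset shows up here even though it is gone from the working directory, the push will keep failing.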

As I am continuously learning, when you ‘git commit’ something you are literally etching your changes in git stone, for all eternity. Not really. But you are saving every change into your Git history, where later commits do not erase earlier ones. This is one of the reasons Git is such a robust version control system. Thanks to trusty Stack Exchange, there is a workaround.

As a band-aid for the git push issue, so that you can at least upload your code, first remove your large files from the repository.
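One way to do the removal without losing the data on disk is ‘git rm --cached’, which stops tracking a file but leaves it in place. The sketch below runs in a throwaway repository so nothing real is touched; the path data/train.csv and the commit messages are hypothetical.

```shell
# Throwaway repository so nothing real is touched; data/train.csv is a hypothetical path.
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.email "demo@example.com"
git config user.name "Demo"

mkdir data
echo "pretend this is a 5 GB dataset" > data/train.csv
git add data
git commit -q -m "Add dataset by mistake"

# Stop tracking the directory but keep the files on disk.
git rm -r -q --cached data
git commit -q -m "Stop tracking dataset"

git ls-files   # prints nothing: no tracked files remain
ls data        # train.csv is still on disk
```

Note that the first commit still contains the dataset at this point, which is why a further step is needed.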

Next, you perform what is known as “squashing”: you essentially squash your latest git commits into one. Here we will squash the last two commits because I had messed up a second time.

Squashed! Now simply enter a message for the combined commit:
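The squash and the new commit can be sketched as follows. I am assuming the squash is done with ‘git reset --soft’, which is one common approach (an interactive rebase is another). Everything runs in a throwaway repository, and the file names and commit messages are hypothetical.

```shell
# Throwaway repository; file names and commit messages are hypothetical.
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.email "demo@example.com"
git config user.name "Demo"

echo "# clouds" > README.md
git add README.md
git commit -q -m "Initial commit"

echo "pretend this is a 5 GB dataset" > train.csv
echo "print('load data')" > load.py
git add train.csv load.py
git commit -q -m "Add loader and dataset"   # mistake: dataset committed

git rm -q --cached train.csv
git commit -q -m "Remove dataset"           # dataset is still in the previous commit

# Move the branch back two commits but keep the final state staged.
git reset -q --soft HEAD~2

# One new commit replaces the two; it never contained train.csv.
git commit -q -m "Add loader"

git log --oneline                            # two commits remain
git rev-list --objects --all | grep train.csv || echo "train.csv is gone from history"
```

Because the rejected commits never reached GitHub, a plain ‘git push’ should be enough afterwards. Also note that HEAD~2 only resolves when at least one commit exists before the two being squashed.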

Voila. Our code is pushed up to our remote repository, just without our large data files.

To avoid these problems in the future, it is best practice to actually add your datasets (if they are larger than 100MB) to your .gitignore file. To do this, you can create a .gitignore file from your terminal if you don’t have one already:
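A minimal version of that step, assuming you are in the root of your repository:

```shell
# Create an empty .gitignore in the current directory if one does not exist yet.
# touch leaves an existing file's contents alone and only updates its timestamp.
touch .gitignore
```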

Then open that same .gitignore file in a text editor, for example with ‘open .gitignore’ on macOS.

This will bring up your .gitignore in whichever text editor your operating system uses by default. There is a good template hosted on GitHub for Python projects that takes care of the usual stuff. Simply copy and paste all of it into your opened .gitignore file.

You’ll need to add extra lines for your large data files as well. Just include the path to each file, with the file name, and you should be good to go!
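Here is a small sketch of adding those lines and confirming that Git honors them. It runs in a throwaway repository, and the data/ directory, the file name, and the patterns are hypothetical examples.

```shell
# Throwaway repository; the data/ directory and file names are hypothetical.
repo="$(mktemp -d)"
cd "$repo"
git init -q

mkdir data
echo "pretend this is a 5 GB dataset" > data/train.csv

# Ignore everything under data/ plus any zip archive anywhere in the repo.
printf '%s\n' 'data/' '*.zip' >> .gitignore

# check-ignore prints the rule that matched, confirming the pattern works.
git check-ignore -v data/train.csv

git add .
git status --short    # only .gitignore is staged
```

Keep in mind that .gitignore only affects files Git is not tracking yet. A file that has already been committed has to be untracked first, for example with ‘git rm --cached’.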

The more I experiment and research, the more I find that a better knowledge of the inner workings of Git can really save a lot of time. But then again, when an unfamiliar problem arises, we can really thank the selfless professional coders who devote their expertise to us noobs on Stack Exchange and Stack Overflow. Lifesavers!

CodeX

Everything connected with Tech & Code

Written by Orin Conn

I’m a recent Data Science graduate with a B.S. in Environmental Science. Currently seeking job opportunities. Constantly learning!
