It’s no secret that Git’s learning curve is actually a learning wall. Convoluted jargon, a confusing API and unclear inner workings often lead to trembling hands when typing git commands. I wish there were a single definitive way to wrap your head around it, but there isn’t. As with any new subject, the mental load decreases the more you immerse yourself in relevant literature and practice.
However, there is a crucial misunderstanding that is common among newcomers to Git, and I think clarifying it will be the first major paving stone in your future Git fortress. Here is what you probably think:
Git is all about storing differences between files, right?
Wrong. Git does not rely on diffs to do its magic. If it did, looking at a version of a project at some point in time would involve replaying every modification made to every file in the project, from the very first commit up to that specific point. What if that history is 10 years old? I hope you can see how poorly that would scale.
So what is it about then? For that, we will need a quick and painless intro to Git internals:
The Zen of Git (The 3-Line Version):
That is basically it. 99% of your Git work is just creating those objects and manipulating pointers that reference them. If your first commit has the following structure:
Git will store it like this:
What happens if we change libs/base_libs/file.py and commit again? Changing the file means Git has to create a new ‘blob’ object, since its contents have changed. It also means creating a new ‘tree’ object for the base_libs folder, because its content has changed, and new ‘tree’ objects for its parent folder (libs) and for the project root, since each tree references its children by their object IDs. The new commit will look like this:
Notice how the files that weren’t changed are still referenced by Git using the same objects: for everything that didn’t change, the second commit points to the exact same objects as the first. This simple concept is the engine that drives Git. What happens if we change ‘settings.py’ and commit again? Since it’s a file at the root level, changing it only requires creating a new ‘blob’ object along with a new root ‘tree’ object. It has no effect on the ‘libs’ folder, so Git can reuse the existing ‘libs’ objects in the next commit:
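To make the reuse concrete, here is a minimal Python sketch of a content-addressed object store. It is a toy, not Git’s real on-disk format: the `store` dictionary, the `put`, `write_tree` and `commit` helpers, the file contents and the `other_lib/helper.py` file are all assumptions made up for illustration. Only `settings.py` and `libs/base_libs/file.py` come from the example above.

```python
import hashlib

store = {}  # object id -> (kind, payload); stands in for Git's object database

def put(kind, payload):
    """Store an object under the SHA-1 of its kind and payload; return the id."""
    oid = hashlib.sha1(kind.encode() + b"\0" + payload).hexdigest()
    store[oid] = (kind, payload)  # identical content always maps to the same id
    return oid

def write_tree(folder):
    """Store a nested dict (folders) of bytes (files) and return the tree id."""
    entries = []
    for name in sorted(folder):
        child = folder[name]
        if isinstance(child, dict):
            entries.append(f"tree {write_tree(child)} {name}")
        else:
            entries.append(f"blob {put('blob', child)} {name}")
    return put("tree", "\n".join(entries).encode())

def commit(folder, parent=None):
    """Store a commit pointing at the root tree and at its parent commit."""
    return put("commit", f"tree {write_tree(folder)}\nparent {parent}\n".encode())

# Hypothetical project layout and contents.
project = {
    "settings.py": b"DEBUG = True\n",
    "libs": {
        "base_libs": {"file.py": b"print('v1')\n"},
        "other_lib": {"helper.py": b"def helper(): pass\n"},
    },
}

c1 = commit(project)
n1 = len(store)  # 3 blobs + 4 trees + 1 commit

project["libs"]["base_libs"]["file.py"] = b"print('v2')\n"
c2 = commit(project, c1)
n2 = len(store)  # new blob, new base_libs/libs/root trees, new commit

project["settings.py"] = b"DEBUG = False\n"
c3 = commit(project, c2)
n3 = len(store)  # new blob, new root tree, new commit

print(n1, n2 - n1, n3 - n2)  # 8 5 3
```

Because every object is keyed by a hash of its content, storing an unchanged file or folder again simply lands on the ID that already exists. That is why the second commit adds only five objects and the third only three.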
Using this approach, Git doesn’t need to endlessly apply diffs to files to reach some point in your project’s lifetime. A snapshot of your project can be reconstructed by a simple tree traversal, starting from the commit object. This is why Git is not diff-based, but object-based.
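The traversal can be sketched in a few lines of Python. This is an illustration under simplifying assumptions: the `objects` dictionary is a hand-written stand-in for Git’s object database, the IDs are made-up labels instead of real hashes, and `read_tree` and `snapshot` are hypothetical helper names.

```python
# Toy object database: id -> (kind, payload). Ids are made-up labels.
objects = {
    "blob_settings": ("blob", "DEBUG = True\n"),
    "blob_file_v1": ("blob", "print('v1')\n"),
    "blob_file_v2": ("blob", "print('v2')\n"),
    "tree_base_v1": ("tree", {"file.py": "blob_file_v1"}),
    "tree_base_v2": ("tree", {"file.py": "blob_file_v2"}),
    "tree_libs_v1": ("tree", {"base_libs": "tree_base_v1"}),
    "tree_libs_v2": ("tree", {"base_libs": "tree_base_v2"}),
    "tree_root_v1": ("tree", {"settings.py": "blob_settings", "libs": "tree_libs_v1"}),
    "tree_root_v2": ("tree", {"settings.py": "blob_settings", "libs": "tree_libs_v2"}),
    "commit1": ("commit", {"tree": "tree_root_v1", "parent": None}),
    "commit2": ("commit", {"tree": "tree_root_v2", "parent": "commit1"}),
}

def read_tree(tree_id, prefix=""):
    """Walk a tree object recursively and return {path: file content}."""
    _, entries = objects[tree_id]
    files = {}
    for name, child_id in entries.items():
        kind, payload = objects[child_id]
        if kind == "tree":
            files.update(read_tree(child_id, prefix + name + "/"))
        else:
            files[prefix + name] = payload
    return files

def snapshot(commit_id):
    """Rebuild the whole project as it was at the given commit."""
    _, meta = objects[commit_id]
    return read_tree(meta["tree"])

old = snapshot("commit1")
new = snapshot("commit2")
print(old["libs/base_libs/file.py"], new["libs/base_libs/file.py"])
```

Note that `snapshot("commit1")` never looks at `commit2`, and `snapshot("commit2")` follows the `parent` pointer nowhere. Each commit leads directly to a complete tree, no matter how long the history is.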
So Git doesn’t care about diffs at all?
Not exactly. Git tries to be very efficient in storing its objects on disk, since software projects can get bloated very quickly. Git compresses the content of your files (using zlib), but that’s not all. What if I change one line in a big file and make a new commit? According to what we learned, that will require creating a new ‘blob’ object, since its contents have changed. That will result in two big, very similar objects in Git’s object database.
Git periodically looks for such cases and packs many objects together into a single ‘packfile’. Inside a packfile, Git takes advantage of the similarity between two nearly identical files by storing one version whole and the other as a delta against it. The version stored intact is typically the more recent one, because that’s what you’ll most likely be working with. This technique is called ‘delta compression’, and Git tells you about it all the time, especially when you deal with a remote repository. So now when you see this once-cryptic Git message:
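Here is a small Python sketch of that situation. It assumes a repository using Git’s default SHA-1 object format, where a blob’s ID is the hash of a `blob <size>\0` header followed by the file content, and the stored object is the zlib-compressed result. The helper names `blob_id` and `loose_object` and the generated file content are made up for this example.

```python
import hashlib
import zlib

def blob_id(content: bytes) -> str:
    """Compute the id of a blob: SHA-1 over a 'blob <size>\\0' header plus content."""
    header = b"blob " + str(len(content)).encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()

def loose_object(content: bytes) -> bytes:
    """Return the zlib-compressed bytes that would be stored for this blob."""
    header = b"blob " + str(len(content)).encode() + b"\0"
    return zlib.compress(header + content)

# A hypothetical big file, and a second version with a single line changed.
big_v1 = b"".join(b"line %d\n" % i for i in range(2000))
big_v2 = big_v1.replace(b"line 1000\n", b"line 1000 changed\n", 1)

id_v1, id_v2 = blob_id(big_v1), blob_id(big_v2)
size_v1, size_v2 = len(loose_object(big_v1)), len(loose_object(big_v2))

print(id_v1 != id_v2)    # a one-line change yields a completely different id
print(size_v1, size_v2)  # two full-size objects with almost identical content
```

zlib shrinks each object individually, but it knows nothing about the other version, so both copies are stored in full.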
You know Git is just being as efficient as it can be.
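The idea behind a delta can be sketched with Python’s standard `difflib` module. This is only an illustration of the principle, not Git’s actual packfile encoding, which is a binary format operating on bytes. The `make_delta` and `apply_delta` helpers and the file contents are assumptions invented for this example. Following the description above, the newer version is kept whole and the older one is expressed as copy and insert instructions against it.

```python
from difflib import SequenceMatcher

def make_delta(base, target):
    """Describe target as 'copy from base' and 'insert new lines' instructions."""
    ops = []
    matcher = SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))           # reuse base[i1:i2]
        elif tag in ("replace", "insert"):
            ops.append(("insert", target[j1:j2]))  # lines only target has
        # "delete": lines only in base, nothing to emit
    return ops

def apply_delta(base, ops):
    """Rebuild the target from the base plus the delta instructions."""
    out = []
    for op in ops:
        if op[0] == "copy":
            out.extend(base[op[1]:op[2]])
        else:
            out.extend(op[1])
    return out

# Hypothetical file: 500 lines, with line 250 changed in the newer version.
old_version = [f"line {i}\n" for i in range(500)]
new_version = list(old_version)
new_version[250] = "line 250 changed\n"

# Store the newer version whole and the older one as a delta against it.
delta = make_delta(new_version, old_version)
restored = apply_delta(new_version, delta)

print(delta)
print(restored == old_version)
```

The delta holds two copy instructions and a single inserted line, which is far smaller than a second full copy of the file.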
Sum it up
- Git is not diff-based, it is object-based.
- Git does not apply diffs to show you a version of your project.
- Git does traverse object trees to show you a version of your project.
- Git does use diffs to minimize disk space for its objects.
Git diagrams were taken from ‘Git Internals’ by Scott Chacon (a great book, you should read it), used under the Creative Commons Attribution-ShareAlike license.