Demystifying Git internals

I have been using Git for all my personal as well as professional projects for more than two years now, and I have always wanted to learn more about how it works internally. I finally got a chance to read about it. If understanding how things work is what excites you, then this post is for you.

At Dgraph, we have read-only Fridays, where we are free to read a book, a research paper or anything else that would help us become better. I used the opportunity to study the Git Internals chapter of the book Pro Git. I would recommend the book to anyone who wants to understand how Git works in detail. A few days back we shifted from Git flow to a simpler model similar to what’s mentioned here. I am not going to focus on why we did that in this post. My motivation for understanding Git better was one particular scenario. Say you tag a commit on a branch which is deleted later. Does the tag remain valid? Does it have all the information about the state of the repo? If yes, how? In the process of answering this, I understood a lot more.

Here is the scenario: I check out the release branch, make some final changes and tag the final commit on the branch. Later I check out master and delete the release branch. The tag is still there and still points to the relevant code. Let's look at how Git stores information in more detail to understand what happens here.

Contents of .git folder

When you run git init in a new or existing directory, Git creates the .git directory, which is where almost everything that Git stores and manipulates is located. These are the contents of the .git directory of one of my projects.

One thing to understand about Git is that it doesn’t store diffs of the contents of your files. It stores snapshots (the exact content of the files) at the point a commit is made. We are going to focus on the objects sub-directory in this post, which is where all the content is stored. Git has four types of objects.

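  1. blob: A blob object stores the contents of a single file. It does not store the file's name.
  2. tree: A tree object represents a directory. It contains references to blobs and to other trees (sub-directories), along with their names and modes.
  3. commit: A commit object contains a reference to a tree object (the top-level snapshot), references to its parent commits if any, and some other information (author, committer, message etc.).
  4. tag: A tag object is a reference to another object, usually a commit, along with a tagger and a message. It makes for easier referencing.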

If all this doesn’t make much sense right now, that’s okay. We will go through an example and understand this better.

Going through an example

$ git init

This creates the .git directory with all the sub-directories shown above, but the objects sub-directory does not contain any objects yet.

$ echo "First file" > first.txt
$ git add first.txt

Staging the file creates a blob object in the objects sub-directory, at the path .git/objects/4c/5fd919d52e3c1b08f7924cfa05d6de100912fd. The first two characters of the object's SHA-1 name the directory and the remaining 38 name the file. This object has type blob and holds the contents of the file. A blob is essentially the content of a file at a particular instant.
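To make the naming less magical, here is a minimal Python sketch of how an object ID and its path can be derived. It assumes the default SHA-1 object format, and the function names git_object_id and loose_object_path are mine, not Git's. Git hashes a short header ("blob", a space, the size in bytes and a NUL byte) followed by the file content.

```python
import hashlib


def git_object_id(obj_type: str, data: bytes) -> str:
    """Return the SHA-1 that Git assigns to an object of this type and content."""
    # Git hashes a small header ("<type> <size>\0") followed by the raw content.
    header = f"{obj_type} {len(data)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + data).hexdigest()


def loose_object_path(object_id: str) -> str:
    """Return the path, relative to the repository root, of a loose object."""
    # The first two hex characters name the directory, the remaining 38 the file.
    return f".git/objects/{object_id[:2]}/{object_id[2:]}"


if __name__ == "__main__":
    # `echo "First file" > first.txt` writes the text plus a trailing newline.
    blob_id = git_object_id("blob", b"First file\n")
    print(blob_id)
    print(loose_object_path(blob_id))
```

Running it prints an ID and a path that you can compare with the ones Git produced above. Because the ID depends only on the content, two files with identical contents share a single blob.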

$ git cat-file -p 4c5fd919d52e3c1b08f7924cfa05d6de100912fd
First file
$ git cat-file -t 4c5fd919d52e3c1b08f7924cfa05d6de100912fd
blob

git cat-file is a command used to inspect objects: -p prints the contents of an object and -t prints its type.
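If you open the object file directly you will only see binary data, because loose objects are zlib-compressed on disk. The sketch below writes an object into a temporary directory laid out like .git/objects and reads it back, which is roughly what cat-file -t and cat-file -p report for a blob. The helper names write_loose_object and read_loose_object are hypothetical, and the code never touches a real repository.

```python
import hashlib
import tempfile
import zlib
from pathlib import Path
from typing import Tuple


def write_loose_object(objects_dir: Path, obj_type: str, data: bytes) -> str:
    """Store an object the way Git stores loose objects and return its ID."""
    store = f"{obj_type} {len(data)}".encode("ascii") + b"\x00" + data
    object_id = hashlib.sha1(store).hexdigest()
    path = objects_dir / object_id[:2] / object_id[2:]
    path.parent.mkdir(parents=True, exist_ok=True)
    # Header and content are compressed together with zlib.
    path.write_bytes(zlib.compress(store))
    return object_id


def read_loose_object(objects_dir: Path, object_id: str) -> Tuple[str, bytes]:
    """Return (type, content) for a loose object."""
    raw = zlib.decompress((objects_dir / object_id[:2] / object_id[2:]).read_bytes())
    header, _, content = raw.partition(b"\x00")
    obj_type, size = header.decode("ascii").split(" ")
    if int(size) != len(content):
        raise ValueError("size in header does not match content length")
    return obj_type, content


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        objects_dir = Path(tmp) / "objects"
        oid = write_loose_object(objects_dir, "blob", b"First file\n")
        obj_type, content = read_loose_object(objects_dir, oid)
        print(obj_type)                   # what `cat-file -t` would show
        print(content.decode(), end="")   # what `cat-file -p` would show for a blob
```

For trees, cat-file -p pretty-prints the binary entries instead of dumping them raw, so this equivalence only holds for blobs, commits and tags.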

$ git commit -m "First file"
[master (root-commit) 2436b80] First file
1 file changed, 1 insertion(+)
create mode 100644 first.txt
$ git log
commit 2436b80815fde902030d71f08957f68a366dd91f
Author: Pawan Rawal <pawan@dgraph.io>
Date: Sat Aug 13 19:49:15 2016 +0530
First file

The commit creates two more object files in the objects sub-directory. One is a tree object and the other is a commit object.

Figure: State at this point (after the first commit)

$ git cat-file -p 2436b80815fde902030d71f08957f68a366dd91f
tree 94c68012bce86e6ada0b06c4707dbb9b317bc45d
author Pawan Rawal <pawan@dgraph.io> 1471097955 +0530
committer Pawan Rawal <pawan@dgraph.io> 1471097955 +0530
First file
$ git cat-file -t 2436b80815fde902030d71f08957f68a366dd91f
commit

Running git cat-file on the SHA-1 of the last commit above shows that the object is a commit object and that it contains a reference to a tree object. Running it on that tree object, in turn, shows that the tree contains a reference to the blob object we saw earlier.

$ git cat-file -p 94c68012bce86e6ada0b06c4707dbb9b317bc45d
100644 blob 4c5fd919d52e3c1b08f7924cfa05d6de100912fd first.txt
$ git cat-file -t 94c68012bce86e6ada0b06c4707dbb9b317bc45d
tree

So a commit object refers to a tree object, which refers to blobs or to other sub-tree objects, as we will see later.
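The chain from commit to tree to blob can be sketched in a few lines of Python. This is my reading of the object formats rather than Git's own code, and the helper names are made up. A tree body is a sequence of entries, each being the mode, a space, the name, a NUL byte and the 20 raw bytes of the referenced ID. A commit body is a few header lines, a blank line and the message.

```python
import hashlib
from typing import List, Optional, Tuple


def object_id(obj_type: str, body: bytes) -> str:
    """Hash an object body together with Git's "<type> <size>\\0" header."""
    header = f"{obj_type} {len(body)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


def tree_body(entries: List[Tuple[str, str, str]]) -> bytes:
    """Serialize (mode, name, object ID) entries into the body of a tree object."""
    body = b""
    # Sorting by name is enough for a flat list of files; Git applies a
    # slightly different ordering rule when directories are involved.
    for mode, name, oid in sorted(entries, key=lambda entry: entry[1]):
        body += f"{mode} {name}".encode() + b"\x00" + bytes.fromhex(oid)
    return body


def commit_body(tree: str, author: str, message: str, parent: Optional[str] = None) -> bytes:
    """Serialize a commit whose author and committer are the same person."""
    lines = [f"tree {tree}"]
    if parent:
        lines.append(f"parent {parent}")
    # A blank line separates the headers from the commit message.
    lines += [f"author {author}", f"committer {author}", "", message]
    return ("\n".join(lines) + "\n").encode()


if __name__ == "__main__":
    blob = object_id("blob", b"First file\n")
    tree = object_id("tree", tree_body([("100644", "first.txt", blob)]))
    who = "Pawan Rawal <pawan@dgraph.io> 1471097955 +0530"
    commit = object_id("commit", commit_body(tree, who, "First file"))
    print(blob, tree, commit, sep="\n")
```

If the encoding above matches what Git did in this session, the three printed IDs should be the blob, tree and commit IDs shown earlier. Treat that as something to check rather than a guarantee. It also shows why a commit ID changes whenever the author, timestamp, message or any file changes: all of them feed into the hash.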

Now, let's see what happens when I create and commit another file, this time inside a new directory.

$ mkdir first-folder
$ cd first-folder
$ echo "Second file" > second.txt
$ git add second.txt
$ git commit -m "Second commit"
[master ca3917d] Second commit
1 file changed, 1 insertion(+)
create mode 100644 first-folder/second.txt

Figure: State after second commit

Now when I run the cat-file command on the latest commit, here is what I see.

$ git cat-file -p ca3917d421d303bba47a34c9069f3524d84ad7be
tree 03de692bf6a38ac9c98bac37dc27534fbaf020b6
parent 2436b80815fde902030d71f08957f68a366dd91f
author Pawan Rawal <pawan@dgraph.io> 1471098853 +0530
committer Pawan Rawal <pawan@dgraph.io> 1471098853 +0530
Second commit

Apart from having a reference to a tree, this commit also has a reference to its parent commit.
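These parent references are what turn individual snapshots into a history. As a rough illustration, the sketch below models the two commits from this example as a dictionary and follows the parent pointers from the latest commit back to the root, which is in spirit what git log does. It assumes a linear history; real commits can have several parents (merges), and the names PARENTS and history are hypothetical.

```python
from typing import Dict, List, Optional

# Commit ID -> parent commit ID (None for the root commit), taken from the
# cat-file output of the two commits in this example.
PARENTS: Dict[str, Optional[str]] = {
    "2436b80815fde902030d71f08957f68a366dd91f": None,
    "ca3917d421d303bba47a34c9069f3524d84ad7be": "2436b80815fde902030d71f08957f68a366dd91f",
}


def history(tip: str, parents: Dict[str, Optional[str]]) -> List[str]:
    """Follow parent pointers from `tip` back to the root commit."""
    chain: List[str] = []
    current: Optional[str] = tip
    while current is not None:
        chain.append(current)
        current = parents[current]
    return chain


if __name__ == "__main__":
    for commit_id in history("ca3917d421d303bba47a34c9069f3524d84ad7be", PARENTS):
        print(commit_id)
```

Note that nothing points forward in time: a commit knows its parent, but the first commit has no idea that a second one exists.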

$ git cat-file -p 03de692bf6a38ac9c98bac37dc27534fbaf020b6
040000 tree e5223bc2e9eb09b0d966642c67059b4b8dda6aea first-folder
100644 blob 4c5fd919d52e3c1b08f7924cfa05d6de100912fd first.txt

On running cat-file on the tree object referred to by the second commit, we see that it contains a reference to a blob for first.txt. This is the same blob as before because the contents of the file didn’t change. If we were to modify the contents of first.txt, the blob reference would be different here. The tree also contains a reference to another tree object because first-folder is a directory, and tree objects are used to represent directories too.

# cat-file on the tree object for first-folder shows a reference to the blob that stores the content of second.txt.
$ git cat-file -p e5223bc2e9eb09b0d966642c67059b4b8dda6aea
100644 blob 20d5b672a347112783818b3fc8cc7cd66ade3008 second.txt

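The blob objects store the actual content, and the tree objects tie file names to blobs. At this point the object store holds two commits, three trees and two blobs. So each commit, through its tree, has all the information about the state of the repository at that point.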
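To see how a full snapshot falls out of these references, here is a small Python model of the three trees from the cat-file output above. The snapshot function (a name I made up) walks a tree recursively and returns a mapping from file path to blob ID. It works on an in-memory dictionary, not on a real repository.

```python
from typing import Dict, List, Tuple

Entry = Tuple[str, str, str, str]  # (mode, type, object ID, name)

# Tree ID -> entries, copied from the cat-file output in this post.
TREES: Dict[str, List[Entry]] = {
    "94c68012bce86e6ada0b06c4707dbb9b317bc45d": [
        ("100644", "blob", "4c5fd919d52e3c1b08f7924cfa05d6de100912fd", "first.txt"),
    ],
    "03de692bf6a38ac9c98bac37dc27534fbaf020b6": [
        ("040000", "tree", "e5223bc2e9eb09b0d966642c67059b4b8dda6aea", "first-folder"),
        ("100644", "blob", "4c5fd919d52e3c1b08f7924cfa05d6de100912fd", "first.txt"),
    ],
    "e5223bc2e9eb09b0d966642c67059b4b8dda6aea": [
        ("100644", "blob", "20d5b672a347112783818b3fc8cc7cd66ade3008", "second.txt"),
    ],
}


def snapshot(tree_id: str, trees: Dict[str, List[Entry]], prefix: str = "") -> Dict[str, str]:
    """Return {file path: blob ID} for everything below the given tree."""
    files: Dict[str, str] = {}
    for _mode, kind, oid, name in trees[tree_id]:
        path = prefix + name
        if kind == "tree":
            # A sub-tree is a directory: recurse and prefix its entries.
            files.update(snapshot(oid, trees, path + "/"))
        else:
            files[path] = oid
    return files


if __name__ == "__main__":
    first = snapshot("94c68012bce86e6ada0b06c4707dbb9b317bc45d", TREES)
    second = snapshot("03de692bf6a38ac9c98bac37dc27534fbaf020b6", TREES)
    print(first)
    print(second)
    # Both snapshots point at the very same blob for first.txt.
    print(first["first.txt"] == second["first.txt"])
```

Both commits are complete snapshots, yet the unchanged first.txt is stored only once, since the two root trees reference the same blob.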

Let's see what tagging does here. There are two types of tags: lightweight and annotated. Annotated tags create the fourth type of object, a tag object, which points to a commit and also records the tagger and a message. Lightweight tags don’t create any tag object. They are just a reference to the commit that was tagged, which is the latest commit here.

# creating a lightweight tag.
$ git tag v0.1.0
# Sorry I lied, we will take a sneak peek into the refs directory (run this from the repository root).
$ cat .git/refs/tags/v0.1.0
ca3917d421d303bba47a34c9069f3524d84ad7be
# Above is the SHA-1 of the latest commit
# creating an annotated tag
$ git tag -a v0.1.1
$ cat .git/refs/tags/v0.1.1
c236cc24750e43808dcef99d3a3372c9a1d94141
$ git cat-file -t c236cc24750e43808dcef99d3a3372c9a1d94141
tag
$ git cat-file -p c236cc24750e43808dcef99d3a3372c9a1d94141
object ca3917d421d303bba47a34c9069f3524d84ad7be
type commit
tag v0.1.1
tagger Pawan Rawal <pawan@dgraph.io> 1471101036 +0530
First tag

As you can see, the tag object has a reference to the commit object, which in turn leads to all the other information.

Figure: After the tag on second commit
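The difference between the two kinds of tags is one extra hop. The sketch below models the refs and objects from this example and "peels" a ref until it reaches a commit, which is what Git does when you ask for something like v0.1.1^{commit}. The dictionaries and the peel function are a simplified model of my own, not how Git stores things internally.

```python
from typing import Dict

COMMIT = "ca3917d421d303bba47a34c9069f3524d84ad7be"
TAG_OBJECT = "c236cc24750e43808dcef99d3a3372c9a1d94141"

# Ref name -> object ID, as found under .git/refs in this example.
REFS: Dict[str, str] = {
    "refs/heads/master": COMMIT,
    "refs/tags/v0.1.0": COMMIT,      # lightweight tag: points straight at the commit
    "refs/tags/v0.1.1": TAG_OBJECT,  # annotated tag: points at a tag object
}

# Object ID -> the fields of the object that matter for this sketch.
OBJECTS: Dict[str, Dict[str, str]] = {
    COMMIT: {"type": "commit"},
    TAG_OBJECT: {"type": "tag", "object": COMMIT},
}


def peel(ref: str, refs: Dict[str, str], objects: Dict[str, Dict[str, str]]) -> str:
    """Resolve a ref to the ID of the non-tag object it ultimately points to."""
    oid = refs[ref]
    # Tag objects can point at other tag objects, so keep following them.
    while objects[oid]["type"] == "tag":
        oid = objects[oid]["object"]
    return oid


if __name__ == "__main__":
    for name in ("refs/tags/v0.1.0", "refs/tags/v0.1.1"):
        print(name, "->", peel(name, REFS, OBJECTS))
```

Both tags end up at the same commit. The annotated one simply carries its own tagger and message along the way.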

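Branches in Git are also just pointers to commit objects. So when you delete a branch, only the reference (stored in .git/refs/heads) gets deleted; the commit objects are still there. Commits that nothing refers to any more can eventually be removed by Git's garbage collection, but a tag counts as a reference. So if you have a tag referencing a commit object, then even if you delete the branch on which the tag was created, the commit, its trees and its blobs stay in the repository, and you still have all the content needed to reconstruct the state of the working directory.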
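This brings us back to the question I started with. Here is a toy model of that scenario, with made-up short IDs (c for commits, t for trees, b for blobs) and a hypothetical tag v1.0 on the final release commit c3. The reachable function collects every object that can be reached from the remaining refs, which is the set that garbage collection has to keep.

```python
from typing import Dict, List, Set

# Hypothetical object graph: each ID maps to the IDs that object references.
# c1 and c2 are commits on master; c3 is the final commit on the release branch.
OBJECTS: Dict[str, List[str]] = {
    "c1": ["t1"],
    "c2": ["t2", "c1"],
    "c3": ["t3", "c2"],
    "t1": ["b1"],
    "t2": ["b1", "b2"],
    "t3": ["b1", "b2", "b3"],
    "b1": [],
    "b2": [],
    "b3": [],
}


def reachable(refs: Dict[str, str], objects: Dict[str, List[str]]) -> Set[str]:
    """Return every object that can be reached by following references from any ref."""
    seen: Set[str] = set()
    stack = list(refs.values())
    while stack:
        oid = stack.pop()
        if oid not in seen:
            seen.add(oid)
            stack.extend(objects[oid])
    return seen


if __name__ == "__main__":
    refs = {
        "refs/heads/master": "c2",
        "refs/heads/release": "c3",
        "refs/tags/v1.0": "c3",
    }
    del refs["refs/heads/release"]  # deleting the branch only removes this pointer
    print(sorted(reachable(refs, OBJECTS)))
    del refs["refs/tags/v1.0"]  # without the tag, nothing points at c3 any more
    print(sorted(set(OBJECTS) - reachable(refs, OBJECTS)))
```

After the branch is deleted, all nine objects are still reachable through the tag. Only when the tag is removed as well do c3, t3 and b3 become unreachable, and those are the ones that could eventually be cleaned up.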

I hope this was an insightful read and helped you understand better how Git works. I would love to hear your views on the post in the comments.