Understanding git under the hood

Henrique Mota
7 min read · May 23, 2018


In recent years, git has become an essential tool for programmers who work in software development teams. I remember when I started using this wonderful tool, with all the limitations I had because I hadn’t paid enough attention to what was happening under the hood.

I had the feeling that something was wrong: git shouldn’t be such a stressful experience. So I decided to embrace the struggle of overcoming my limitations and understand how things really work. My life became a lot easier, and hopefully this post will improve your git experience by taking at least some of that stress off your shoulders.

Git as a database

A big misunderstanding I find among git users is thinking that the working directory is the same thing as the git repository. They are strongly related, but they are different entities in this ecosystem.

Let’s go through an interactive explanation. I encourage you to execute the following commands in your terminal.

> mkdir git-project
> cd git-project
> git init
> ls -las
total 0
0 drwxr-xr-x 3 bullian staff 96 May 22 22:21 .
0 drwxr-xr-x 3 bullian staff 96 May 22 22:21 ..
0 drwxr-xr-x 9 bullian staff 288 May 22 22:21 .git

As you can see, we started by creating a directory and initialising git inside it. The immediate side effect was the creation of a hidden .git folder, and this folder is what we call the git repository. You can experiment by removing this folder:

> rm -rf .git
> git status
fatal: Not a git repository (or any of the parent directories): .git

As you can see, this is the folder git needs to maintain the version control of your project. Run git init again before continuing, since the commands that follow need a repository.
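If you want to ask git itself where the repository lives, here is a minimal sketch. It assumes a throwaway directory created with mktemp, which is my choice for the example and not part of the steps above:

```shell
set -e
repo=$(mktemp -d)      # throwaway directory (an assumption of this sketch)
cd "$repo"
git init -q

# Ask git where the repository (the database) is, relative to where we stand.
git_dir=$(git rev-parse --git-dir)
echo "repository: $git_dir"

# Ask git whether the current directory is part of a working directory.
inside=$(git rev-parse --is-inside-work-tree)
echo "inside a working directory: $inside"
```

The working directory is the folder you created; the repository is the .git folder inside it.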

It is important to know what this folder is for, but a more useful question is: what data is stored there, and what shape does it have? You can look at it as a timeline of your project’s versions. The files and folders you commit are materialised into a node called a commit; the next commit materialises into a new node that refers to the previous one as its parent. Summing up, you end up with a timeline, your git database, and as with any database you can visit those nodes whenever you want.

Let’s see how it works by adding two files, each in a commit of its own, and then committing a change to each of them:

> echo "Hello world A" > A
> git add A
> git commit -m "Add A"
> echo "Hello world B" > B
> git add B
> git commit -m "Add B"
> echo "One more line for A" >> A
> git add A
> git commit -m "Update A"
> echo "One more line for B" >> B
> git add B
> git commit -m "Update B"

We end up with a “timeline” like this:

Figure: git graph after performing these git operations

One thing that is important to highlight is that each commit is immutable: once created, it is stored as it is, and no action can change it. Each commit has a SHA-1 hash as its id and, with the exception of the first, references at least one parent. I say at least because there is a special kind of commit, the merge commit, that has two (or more) parents; it will be covered in a future post.
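One way to convince yourself of this immutability is a small experiment. Commands that look like they edit a commit, such as git commit --amend, actually create a new commit with a new hash and leave the old one untouched. This is only a sketch: the temporary directory and the placeholder identity are assumptions of mine.

```shell
set -e
repo=$(mktemp -d)                        # throwaway repository (assumption)
cd "$repo"
git init -q
git config user.name "Example"           # placeholder identity (assumption)
git config user.email "example@example.com"

echo "Hello world A" > A
git add A
git commit -q -m "Add A"
old=$(git rev-parse HEAD)

# "Change" the commit message: git writes a brand new commit instead.
git commit -q --amend -m "Add file A"
new=$(git rev-parse HEAD)

echo "before amend: $old"
echo "after amend:  $new"

# The original commit object is still in the database, unchanged.
old_type=$(git cat-file -t "$old")
echo "old object is still a: $old_type"
```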

In contrast to other version control systems like svn, git stores a snapshot of each file/folder version instead of storing the changes each one undergoes. This may seem at first just a random system design decision, but it was a deliberate choice by the author, Linus Torvalds. Because a commit reflects the exact state of the files and folders in your project at that moment, git can move from commit x to commit y (x and y being any commits in the project) without visiting other commits.

What are commits made of?

Now that we have seen at a higher level how a git graph evolves over several commits, it’s time to see what a commit is made of. The main question here is how a new version is stored and how we keep track of the previous one.
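Here is a small sketch of what snapshots buy you: you can read a file exactly as it was in any commit, and git computes the differences between two commits on demand. The temporary directory and the placeholder identity are assumptions of the example.

```shell
set -e
repo=$(mktemp -d)                        # throwaway repository (assumption)
cd "$repo"
git init -q
git config user.name "Example"           # placeholder identity (assumption)
git config user.email "example@example.com"

echo "Hello world A" > A
git add A
git commit -q -m "Add A"
echo "One more line for A" >> A
git add A
git commit -q -m "Update A"

# Read file A as it was one commit ago, straight from that snapshot.
first_version=$(git show HEAD~1:A)
echo "$first_version"

# The diff is not stored anywhere: it is computed from the two snapshots.
git --no-pager diff HEAD~1 HEAD
```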

Git uses 4 types of objects to persist a committed version:

  • commit
  • tree
  • blob
  • annotated tag, which we will not cover in depth here, but I will revisit tags in the next topic.

To see how these objects relate to each other, we are going to use 2 git commands:

> git log --graph --oneline 

to see the commits made up to this point; the first commit in the list is the latest in the timeline, and vice versa.

> git cat-file -p <hash-id-of-the-object>

to see the content of each git object.

It’s play time. Let’s execute the git log command; be aware that your commit hashes are going to be different from mine:
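Besides -p, git cat-file also accepts -t, which prints the type of an object. The sketch below uses it to meet three of the object types from the list above; the temporary directory and the placeholder identity are assumptions of the example.

```shell
set -e
repo=$(mktemp -d)                        # throwaway repository (assumption)
cd "$repo"
git init -q
git config user.name "Example"           # placeholder identity (assumption)
git config user.email "example@example.com"

echo "Hello world A" > A
git add A
git commit -q -m "Add A"

git cat-file -t HEAD            # the commit itself
git cat-file -t 'HEAD^{tree}'   # the tree that the commit references
git cat-file -t HEAD:A          # the blob behind file A in that tree
```

The names HEAD^{tree} and HEAD:A are just convenient ways to refer to objects without copying hashes around.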

> git log --graph --oneline
* e38905d (HEAD -> master) Update B
* 508ee95 Update A
* 5a9f826 Add B
* 306d81c Add A

So let’s start by looking at the contents of the first commit in the timeline, “Add A”:

> git cat-file -p 306d81c
tree 9a50d452a09890c32c41227b2819dc1da3707585
author bullian <***@gmail.com> 1527032662 +0100
committer bullian <***@gmail.com> 1527032662 +0100

Add A

As you can see, the commit references a tree and stores the author, the committer and the commit message.

Let’s keep going by visiting the tree:

> git cat-file -p 9a50d452a09890c32c41227b2819dc1da3707585
100644 blob 27773edb586df6205c184a604fba1859f170db78 A

As you can see, the tree references a blob and names it A. This matches the file that was added in the first commit. Now let’s look at the blob content:

> git cat-file -p 27773edb586df6205c184a604fba1859f170db78
Hello world A

This is pretty awesome: the first commit has a reference to the tree structure at that time, and the tree has a reference to the blobs that store the content of the files. Did you notice that a blob doesn’t carry a name itself? The name is kept at the tree level, and this is a very smart system design decision: if you have two identical files in your project, or you rename a file, git doesn’t need extra blobs for that, since the name recorded in the tree is the only difference between the files.

Now let’s look at the content of the next commit, to get an idea of how one commit leads to the next.

Again, let’s see git log:
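You can check the duplicated-file case yourself with the sketch below. The file name A-copy, the temporary directory and the placeholder identity are assumptions of the example.

```shell
set -e
repo=$(mktemp -d)                        # throwaway repository (assumption)
cd "$repo"
git init -q
git config user.name "Example"           # placeholder identity (assumption)
git config user.email "example@example.com"

echo "Hello world A" > A
cp A A-copy                              # hypothetical second file, same content
git add A A-copy
git commit -q -m "Add A and a copy"

# The tree lists two names...
git ls-tree HEAD

# ...but both names point to one and the same blob.
blob_a=$(git rev-parse HEAD:A)
blob_copy=$(git rev-parse HEAD:A-copy)
echo "A:      $blob_a"
echo "A-copy: $blob_copy"
```

The blob id depends only on the content of the file, which is why two names can share it.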

> git log --graph --oneline
* e38905d (HEAD -> master) Update B
* 508ee95 Update A
* 5a9f826 Add B
* 306d81c Add A

Let’s see the content of the “Add B” commit:

> git cat-file -p 5a9f826
tree 1148c5c9ac1a0aa5340947631b06312b39449df6
parent 306d81ca56f82a855631fac75e5de39ddee91009
author bullian <***@gmail.com> 1527032684 +0100
committer bullian <***@gmail.com> 1527032684 +0100

Add B

As expected, the tree has a different hash from the previous one, because it changed with the addition of a new file. Note also the new parent line, which references the previous commit, “Add A”.

Now let’s analyse the tree:

> git cat-file -p 1148c5c9ac1a0aa5340947631b06312b39449df6
100644 blob 27773edb586df6205c184a604fba1859f170db78 A
100644 blob d35f2bc25d6791cbb6a078464b6e4a002467648a B

You may be a bit surprised to see that the blob for file A has the same hash as in the tree from the previous commit. This is awesome, and if you think about it, it makes a lot of sense: nothing changed in that file, so there is no need to create another blob and take up more space.
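If you would rather not compare hashes by eye, the sketch below asks git for them directly. As before, the temporary directory and the placeholder identity are assumptions of the example.

```shell
set -e
repo=$(mktemp -d)                        # throwaway repository (assumption)
cd "$repo"
git init -q
git config user.name "Example"           # placeholder identity (assumption)
git config user.email "example@example.com"

echo "Hello world A" > A
git add A
git commit -q -m "Add A"
echo "Hello world B" > B
git add B
git commit -q -m "Add B"

# Blob behind file A in the previous commit and in the current one.
a_before=$(git rev-parse HEAD~1:A)
a_after=$(git rev-parse HEAD:A)

# Trees of the previous commit and of the current one.
tree_before=$(git rev-parse 'HEAD~1^{tree}')
tree_after=$(git rev-parse 'HEAD^{tree}')

echo "blob A: $a_before -> $a_after"
echo "tree:   $tree_before -> $tree_after"
```

The tree changes because it gained an entry for B, while the blob for A is simply referenced again.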

Demystifying HEAD, tags and branches

This is a very cool topic, because it is one of the things that dramatically changed my view of git for the better.

Why did I group all these elements together? Because they all have the same starting point: they are all references/pointers.

What do they reference? Directly or indirectly, they reference a commit.

Although all of them are references, they have different characteristics. Let’s analyse each one of them:

  • HEAD is the easiest one to understand. It is a moving pointer that follows where you currently are. It can reference a branch or a plain commit, with different implications.
    If you are on a branch, HEAD itself stays put, pointing at that branch, even when you add new commits or reset to a previous state, because, as we are going to see, the branch itself is the moving reference.
    If instead you are on a commit, HEAD itself moves when you add new commits or reset to a previous state. In this situation you are in a “detached HEAD” state, which only means that the actions you perform will not affect any branch.
  • Branches have a very curious and useful behaviour. A branch is a moving pointer that follows every new commit you make on it, and you can look at a branch as a context for your actions.
  • Tags: I will not cover this type of pointer in much depth, but a tag is basically a fixed pointer that can be used to “tag” a given commit, for example as a version of your software.
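The three kinds of pointers can be watched in action with the sketch below. The tag name v1, the temporary directory and the placeholder identity are assumptions of the example; the branch name is read from git because the default differs between setups.

```shell
set -e
repo=$(mktemp -d)                        # throwaway repository (assumption)
cd "$repo"
git init -q
git config user.name "Example"           # placeholder identity (assumption)
git config user.email "example@example.com"

echo "Hello world A" > A
git add A
git commit -q -m "Add A"

branch=$(git symbolic-ref --short HEAD)  # the branch HEAD points at
git tag v1                               # hypothetical tag on the first commit

echo "Hello world B" > B
git add B
git commit -q -m "Add B"

# The branch followed the new commit, the tag stayed where it was.
echo "HEAD:   $(git rev-parse HEAD)"
echo "branch: $(git rev-parse "$branch")"
echo "tag v1: $(git rev-parse v1)"

# Point HEAD directly at a commit instead of at a branch.
git checkout -q --detach
if git symbolic-ref -q HEAD >/dev/null; then
  state="on a branch"
else
  state="detached"
fi
echo "HEAD is $state"
```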

Distributed Version Control

Git is about version control, but it is also about concurrency. If you think about it, when you use it in a team, you are developing features at the same time as your colleagues. If you are a programmer and have worked with concurrency before, you know that where there is concurrency there are also potentially complicated situations to solve. Because of git’s distributed nature, the concurrency problem is mitigated: you can solve conflicts locally and, once everything is ok, “propagate” the state to a remote repository hosted on a service like GitHub or Bitbucket.

I hope you enjoyed it. Please tell me if you would like another post covering any particular subject of git, or if you noticed something wrong or poorly explained.

If you liked this post, read my next one:

Kind regards

Henrique Mota
