How does git work internally

Shalitha Suranga
Sep 1, 2018 · 7 min read

A Friendly introduction

When we work on very straightforward code projects (say, writing a simple bash script), there are only two points in the development timeline: start and finish. We write the code, then we finalize and ship it. Most projects, however, end up with more than two points in their timeline because of feature requests, bug fixes and sometimes reverts.

As mentioned above, once a project has many points in its development timeline, we really need a version control system (VCS). VCS tools let users manage their development paths (versions, features, patches or, technically, branches) and development histories without too much effort.

Git is a free and open source distributed version control system designed to handle everything from small to very large projects with speed and efficiency.

Git is a distributed system. This means Git users do not just send their code to a centralized codebase in order to record the history; everyone has their own copy of the development history.


Haha.. this article is about internals, so let’s begin. We’ll skip the Git basics. I found a good git-cheatsheet here

Walking to the door

We type git add and git commit on our keyboards. In other words, we stage changes to files and then commit them to the history. What happens internally? Maybe some magic? Or does Git manage a centralized database? Then how is the entire history available after git clone?

Opening the door..

Hashes, file-based key-value storage and a tree data structure: these are the key things behind Git. Each tree node, commit and file has its own unique 40-character SHA-1 representation (we can say that’s the key). These elements are added to a tree data structure that is persisted inside the .git/objects folder.
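To make the "key" part concrete, here is a minimal Python sketch of how such a key can be derived. It assumes the default SHA-1 object format; the function name git_object_id is made up for this illustration and is not part of Git.

```python
import hashlib


def git_object_id(obj_type: str, content: bytes) -> str:
    """Return the SHA-1 id of an object with the given type and content."""
    # The hash covers a small header ("<type> <size>" plus a NUL byte)
    # followed by the raw content, not the content alone.
    header = f"{obj_type} {len(content)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


if __name__ == "__main__":
    # An empty file: the same id that appears later in this article.
    print(git_object_id("blob", b""))
    # Any change to the content produces a completely different key.
    print(git_object_id("blob", b"Hello Git\n"))
```

Because the key depends only on the type and the content, two files with identical content share one blob in the store.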

.git directory

This directory is created automatically when a new repo is initialized or cloned. Git saves the history (file contents and commits) and the configuration inside this folder.

Go ahead and try these commands:

$ mkdir apple
$ cd apple
$ git init
$ ls -1 .git

branches: Git no longer uses this folder (deprecated)

config: Stores the repo’s configuration

HEAD: Reference to your current working branch

hooks: Scripts that are triggered by a Git event (before committing etc.). Normally these hooks are not enabled; you need to remove the .sample extension to make them work.

objects: File-based key-value storage that holds commits, tree nodes and file contents (in blob form)

Hey!! you are now inside ..


Plumbing commands (core commands) help us understand Git internals. Yes, you understood correctly: there is a harder way to commit changes than using simple high-level commands like git add and git commit.

git add (hard way)

Adding changes to the stage is like writing in a diary anonymously. The data is saved to .git/objects, but there is no commit message. In other words, no history has actually been written yet.

$ touch myfile.txt
$ git hash-object -w myfile.txt
$ find .git/objects -type f

git hash-object calculates the SHA-1 hash of the file and, because of the -w flag, writes the blob into the key-value storage.


mm.. now we have something in our database. So let’s try to read the stored object file with cat.


Wow, binary.. we can’t simply cat the file because Git stores objects zlib-compressed, not as plain text.
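What we saw with cat can be reproduced in a few lines of Python. This is only a sketch of the loose-object layout (header plus content, zlib-compressed, stored under a path derived from the hash). The helper names write_loose_object and read_loose_object are invented for this example, and it works in a temporary folder rather than a real .git directory.

```python
import hashlib
import tempfile
import zlib
from pathlib import Path
from typing import Tuple


def write_loose_object(objects_dir: Path, obj_type: str, content: bytes) -> str:
    """Store content in the loose-object layout and return its id."""
    store = f"{obj_type} {len(content)}".encode("ascii") + b"\x00" + content
    oid = hashlib.sha1(store).hexdigest()
    # The first two hex characters name the folder, the other 38 the file.
    path = objects_dir / oid[:2] / oid[2:]
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(zlib.compress(store))
    return oid


def read_loose_object(objects_dir: Path, oid: str) -> Tuple[str, bytes]:
    """Decompress a loose object and split its header from its content."""
    raw = zlib.decompress((objects_dir / oid[:2] / oid[2:]).read_bytes())
    header, _, content = raw.partition(b"\x00")
    obj_type, size = header.decode("ascii").split(" ")
    if int(size) != len(content):
        raise ValueError("size in header does not match content length")
    return obj_type, content


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        objects = Path(tmp) / "objects"
        oid = write_loose_object(objects, "blob", b"Hello Git\n")
        print(oid, read_loose_object(objects, oid))
```

With Git itself we do not need any of this, because git cat-file does the decompression for us.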

$ git cat-file -p e69de29bb2d1d6434b8b29ae775ad8c2e48c5391

This returns empty content since myfile.txt has no content yet. So let’s add some content to myfile.txt.

$ echo "Hello Git" > myfile.txt 
$ git hash-object -w myfile.txt

This returns a different hash because the file content has changed. So run git cat-file with the new hash.

$ git cat-file -p 9f4d96d5b00d98959ea9960f069585ce42b1349a

mm.. we got our file content. Now we can start the staging process.

$ git update-index --add --cacheinfo 100644 \
    9f4d96d5b00d98959ea9960f069585ce42b1349a myfile.txt

This command adds your file to .git/index, which holds the indexing information of files. Check the staged entries in the index file using ls-files.

$ git ls-files --stage
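The .git/index file itself is binary too. As a small, hedged peek inside, the sketch below decodes only its 12-byte header (signature, version and number of entries) and ignores the entries that follow. The function name parse_index_header is made up for this example.

```python
import struct


def parse_index_header(data: bytes) -> dict:
    """Decode the 12-byte header at the start of a Git index file."""
    if len(data) < 12:
        raise ValueError("index data is shorter than its header")
    # Big-endian: 4-byte signature, 4-byte version, 4-byte entry count.
    signature, version, entry_count = struct.unpack(">4sII", data[:12])
    if signature != b"DIRC":
        raise ValueError("not a Git index file")
    return {"version": version, "entries": entry_count}


if __name__ == "__main__":
    # A hand-built header standing in for a real index with one staged file.
    sample = b"DIRC" + struct.pack(">II", 2, 1)
    print(parse_index_header(sample))
```

Inside a real repository you could pass it the bytes of .git/index, for example Path(".git/index").read_bytes().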

Now, what do you think comes next? Yes, run git status.


Congratulations!! You staged a file the hard way.

git commit (hard way)

We wrote things in our diary, and now we have two choices. We can either tear out the page (git reset --hard) or put our signature on it (git commit).

So, as good people, we simply go ahead and sign what we wrote. First verify your details (the user.name and user.email values in your Git config).


Awesome!! Your signature is okay. A commit object has a SHA-1 hash (like any other Git object) and it points to a tree node.


So.. where is the tree node? We need to create one.

$ git write-tree

This creates a tree node from the current index entries (remember, we staged our blob there) and returns a new hash that represents our new tree node.
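For the curious, a tree node is just another hashed object whose body lists its entries. The sketch below shows one way to serialize and hash such a body. The names tree_body and tree_id are invented here, and the sorting is simplified, so treat it as an illustration for flat folders only.

```python
import hashlib
from typing import List, Tuple

Entry = Tuple[str, str, str]  # (mode, file name, 40-character hex id)


def tree_body(entries: List[Entry]) -> bytes:
    """Serialize entries as "<mode> <name>", a NUL byte, then the raw 20-byte id."""
    body = b""
    # Simplification: a plain byte-wise sort by name. This matches Git only
    # when the tree contains files and no sub-directories.
    for mode, name, hex_id in sorted(entries, key=lambda e: e[1].encode()):
        body += f"{mode} {name}".encode() + b"\x00" + bytes.fromhex(hex_id)
    return body


def tree_id(entries: List[Entry]) -> str:
    """Hash the body with a "tree <size>" header, like any other object."""
    body = tree_body(entries)
    header = f"tree {len(body)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


if __name__ == "__main__":
    # The blob we staged earlier in this article.
    staged = [("100644", "myfile.txt", "9f4d96d5b00d98959ea9960f069585ce42b1349a")]
    print(tree_id(staged))
```

Note that the tree stores the file name and mode, while the blob stores only the content. That is why renaming a file creates a new tree but no new blob.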

Now we have everything we need to create a commit.

$ echo "first commit" | git commit-tree \
    6e9432aeedbad83fbffb7f8aae4a5d1ab50b7fdf

See the first commit’s content:

$ git cat-file -p 1658642a6c164700c880d499da0b874c18829883
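What cat-file prints here is the plain-text body of the commit object. The following sketch assembles a body of that shape and hashes it. The author name, email and timestamp are hypothetical placeholders, and commit_body and commit_id are names made up for this example, so the resulting hash will not match the one above.

```python
import hashlib
from typing import Sequence


def commit_body(tree: str, message: str, author: str, timestamp: int,
                parents: Sequence[str] = ()) -> bytes:
    """Build the text body of a commit: header lines, blank line, message."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]
    # Assumption for brevity: author and committer are the same person
    # and the timezone is UTC.
    ident = f"{author} {timestamp} +0000"
    lines += [f"author {ident}", f"committer {ident}", "", message]
    return ("\n".join(lines) + "\n").encode("utf-8")


def commit_id(body: bytes) -> str:
    """Hash the body with a "commit <size>" header."""
    header = f"commit {len(body)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


if __name__ == "__main__":
    body = commit_body(
        "6e9432aeedbad83fbffb7f8aae4a5d1ab50b7fdf",  # tree hash from write-tree
        "first commit",
        "Jane Doe <jane@example.com>",  # hypothetical identity
        1535760000,                      # hypothetical timestamp
    )
    print(body.decode("utf-8"))
    print(commit_id(body))
```

Since the identity and the timestamp are part of the hashed text, the same tree committed by another person or at another time gets a different commit hash.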

You can also see the history via git log:

$ git log --stat 1658642a6c164700c880d499da0b874c18829883

Let’s do our second commit by updating myfile.txt

$ echo "Hello Git Pro" > myfile.txt

Now the file has another version, so we are going to create another tree node for this change.

$ git update-index myfile.txt
$ git write-tree

Since the file is already in the Git index, we can simply pass one argument to update-index.


Since each commit builds on the one before it, we need to pass the previous commit’s hash as an argument (-p) to the new commit.

$ echo "second commit" | git commit-tree \
    075e4ae2beb7edf5fda9fef8beba34a52f60a957 -p \
    1658642a6c164700c880d499da0b874c18829883

This returns the second commit’s hash value.
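Because every commit names its parent, the whole history can be rebuilt by following those links backwards. Here is a rough sketch of that walk over an in-memory dictionary that stands in for the object store. The ids c1 and c2 are placeholders, and the walk follows only the first parent, which is a simplification of what git log really does.

```python
from typing import Dict, List, Optional


def parents_of(commit_text: str) -> List[str]:
    """Collect the parent ids from the header lines of a commit body."""
    parents = []
    for line in commit_text.split("\n"):
        if line == "":  # a blank line ends the headers; the message follows
            break
        if line.startswith("parent "):
            parents.append(line[len("parent "):])
    return parents


def walk_first_parent(store: Dict[str, str], start: str) -> List[str]:
    """Follow first-parent links from start back to the root commit."""
    history: List[str] = []
    current: Optional[str] = start
    while current is not None:
        history.append(current)
        parents = parents_of(store[current])
        current = parents[0] if parents else None
    return history


if __name__ == "__main__":
    # Placeholder ids; a real store would be keyed by 40-character hashes.
    store = {
        "c1": "tree t1\nauthor someone\n\nfirst commit\n",
        "c2": "tree t2\nparent c1\nauthor someone\n\nsecond commit\n",
    }
    print(walk_first_parent(store, "c2"))
```

The first commit has no parent line at all, which is how the walk knows where the history starts.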


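If we run git log now, we still get no results, because no branch points to our commits yet. Therefore we need to set a reference to our latest commit (this assumes your current branch is master; check .git/HEAD if your Git uses a different default name).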

$ echo 314f04395e5e7c70d9f40d681c2f4c84237a7fea > .git/refs/heads/master
$ git log
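This works because git log starts from HEAD, and HEAD is usually just a pointer to a branch file like the one we wrote. The sketch below mimics that lookup in a temporary folder. The function name resolve_head is invented for this example, and it deliberately ignores packed refs, which a real repository may also use.

```python
import tempfile
from pathlib import Path


def resolve_head(git_dir: Path) -> str:
    """Return the commit id that HEAD currently resolves to."""
    head = (git_dir / "HEAD").read_text().strip()
    if head.startswith("ref: "):
        # Symbolic ref: HEAD names a branch file that holds the commit id.
        return (git_dir / head[len("ref: "):]).read_text().strip()
    # Detached HEAD: the file holds the commit id directly.
    return head


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        git_dir = Path(tmp)
        (git_dir / "refs" / "heads").mkdir(parents=True)
        (git_dir / "HEAD").write_text("ref: refs/heads/master\n")
        # Same step as the echo command above, done from Python.
        (git_dir / "refs" / "heads" / "master").write_text(
            "314f04395e5e7c70d9f40d681c2f4c84237a7fea\n"
        )
        print(resolve_head(git_dir))
```

Moving a branch forward is therefore nothing more than overwriting one small file with a new hash.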

Wow! Commits and tree nodes are connected as shown below. Furthermore, tree nodes can contain other tree nodes, depending on the directory structure you staged. This is the basic internal process behind Git’s functionality.

[Figure: how commits and tree nodes are connected. Note: this is not the structure of our scenario; it only illustrates the layout.]

Moreover, branching is a very powerful feature of a VCS. Basically, branches are just movable pointers to commits; each branch is a small file under .git/refs/heads that holds a commit hash.

Conclusion

This explanation focused on the internals of Git staging, the tree data structure and committing. There are other useful features that come into play when a remote repository is used, such as pulling and pushing.


References

https://git-scm.com/book/en/v1/Git-Internals

Useful links

Neutralinojs

Take a look at our latest open source work

Support me on Patreon

Happy version controlling!!!
