Inside Git: Understanding the Inner Mechanisms of Version Control

Md. Anjarul Islam
Published in Brain Station 23
7 min read · Mar 2, 2023


Git is a version control system that we use on a daily basis, but many of us don’t know the internal mechanisms of this interesting tool. When I first learned Git, it was like a magic box to me. I wondered how Git stores all its data so that it can show even the very first commit, along with the snapshot of the code at that commit, even for a large-scale project. I believe you have wondered the same thing, and you’ll be happy to learn what happens inside it. For this story, I am assuming that you already know what Git is and what its features are, and that you are still curious about how the magic happens inside.

Before starting the story, I want to briefly describe what I will explain and how.

There are many ways of explaining things, but my favorite is learning through questions and answers. At each step of the explanation I’ll find a problem, turn it into a question, and then solve the problem to answer that question.

How does Git store data/content?

Before starting to explore the details of how git stores content, let’s make our own git using a naive approach.

So, here is an example repository. On the left side of the image are the folders and file names, and on the right side are the file contents.

Repository

What would we do if we wanted to implement a commit feature like Git’s?
A very naive solution would be to copy all the contents of the working repository into a new version folder, while keeping each older version in its own folder. Like this:

naive version control system

To save all the versions/commits, we maintain a .vcs folder. Each version folder holds the whole repository as it was at that particular version.
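The naive approach can be sketched in a few lines of Python. This is only an illustration of the idea above, not how Git works: the `naive_commit` function and the `v1`, `v2`, ... folder names are my own assumptions, and only the `.vcs` folder name comes from the example.

```python
import shutil
import tempfile
from pathlib import Path

VCS_DIR = ".vcs"  # folder name from the naive design described above


def naive_commit(repo: Path) -> Path:
    """Copy the whole working tree into the next numbered version folder."""
    vcs = repo / VCS_DIR
    vcs.mkdir(exist_ok=True)
    version = sum(1 for p in vcs.iterdir() if p.is_dir()) + 1
    target = vcs / f"v{version}"  # "v1", "v2", ... is a hypothetical naming scheme
    # Copy everything except the .vcs folder itself, whether it changed or not.
    shutil.copytree(repo, target, ignore=shutil.ignore_patterns(VCS_DIR))
    return target


def demo() -> tuple:
    """Commit twice in a throwaway folder and read files back from each version."""
    with tempfile.TemporaryDirectory() as tmp:
        repo = Path(tmp)
        (repo / "example").mkdir()
        (repo / "about.txt").write_text("about")
        (repo / "example" / "hello.txt").write_text("hello")
        v1 = naive_commit(repo)
        (repo / "about.txt").write_text("about..")
        v2 = naive_commit(repo)
        return (
            (v1 / "about.txt").read_text(),
            (v2 / "about.txt").read_text(),
            (v1 / "example" / "hello.txt").read_text(),
            (v2 / "example" / "hello.txt").read_text(),
        )
```

Notice that hello.txt is copied into both version folders even though it never changed. That waste is exactly the problem we look at next.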

But this approach takes a lot of storage space, because we copy the full repository every time. To save space, you might think about storing only the changes between versions rather than copying all the files. Then each version folder would contain only the changes relative to the previous version. But that would be problematic, because no version folder would hold the full content. Checking out a version would be more difficult, and comparing two versions would be harder still.

If a file has been changed 100 times, its contents are spread across 100 version folders. To regenerate the file from history we’d have to take the changes from every version and apply them in order. And we don’t only add new lines to a file; we also remove and replace lines, so we’d have to account for that too. If you have thousands of files like this, there is a lot to merge! Comparing two arbitrary versions becomes very expensive.

So now we understand that if we want to save the change history and also compare versions easily, the simplest way is to keep a full copy of the repository for each commit/version. But how can we reduce the storage this needs?

Let’s review our naive solution again. We were copying all the files and folders regardless of whether they had changed. Do we need to copy the unchanged files again and again? Can’t we reuse the unchanged files across different versions/commits?

Yes, of course. And Git does exactly that. But how?

Before explaining the mechanism, I want to tell you how Git looks at things. Everything can be treated as content: file content, folder content, and so on. So Git treats everything as content and then tracks that content. Content is Git’s main concern.

Git maintains a database to store content. If it finds new content, Git saves it to the database; otherwise it reuses the previously stored content. So there is no duplicate content in Git’s database. But how does Git ensure that? The check has to be so precise that even a single-space change is detected.

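Hashing! Yes, that’s a great fit for identifying content. A good cryptographic hash function always produces the same hash value for the same content and, in practice, different values for different content. Collisions are theoretically possible but astronomically unlikely. Git uses SHA-1 by default.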
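To make this concrete, here is a small sketch of how a file’s ID can be computed. It assumes a repository using Git’s default SHA-1 object format, where Git hashes a short header ("blob", the size in bytes, and a zero byte) followed by the file’s bytes. The function name `blob_id` is my own.

```python
import hashlib


def blob_id(content: bytes) -> str:
    """Return the SHA-1 ID for file contents, hashed with Git's blob header."""
    header = f"blob {len(content)}\0".encode()
    return hashlib.sha1(header + content).hexdigest()


# The empty file has a well-known ID in SHA-1 repositories:
# blob_id(b"") -> "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"
```

Even a single extra space gives a completely different ID: `blob_id(b"hello")` and `blob_id(b"hello ")` do not match, while hashing the same bytes twice always gives the same value.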

So what Git does is maintain a key-value database where the key is the hash of some content and the value is the content itself. Git inserts content into the database only if its hash does not already exist there. Here is an example:

git database

The IDs here are truncated to their first 5 characters.
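Such a database is easy to imitate. The sketch below is a toy in-memory version, assuming a plain SHA-1 of the raw bytes as the key; the `ObjectStore` class and its method names are hypothetical and only illustrate the "insert only if the hash is new" rule.

```python
import hashlib
from typing import Dict


class ObjectStore:
    """A toy key-value database: key = hash of the content, value = the content."""

    def __init__(self) -> None:
        self.objects: Dict[str, bytes] = {}

    def put(self, content: bytes) -> str:
        key = hashlib.sha1(content).hexdigest()
        # Insert only if this exact content has never been stored before.
        if key not in self.objects:
            self.objects[key] = content
        return key

    def get(self, key: str) -> bytes:
        return self.objects[key]
```

Calling `put` twice with the same bytes returns the same key and stores the content only once, so identical files in different versions cost nothing extra.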

Now, let’s get back to our example repository. It will look like this after we put its content into Git’s database.

database and repository link up

This is a tree representation of our example repository. Repository is the root, example is a folder, and about, hello, and main are the file names.

So far we’ve stored only the file contents. The folder and file names are not stored anywhere. So what about them? How does Git track folder and file names?

As we’ve mentioned, Git treats everything as content. So can we represent the folders and file names as content?

Yes. Let’s represent a folder like this:

Folder name => [{filename1, hash1}, {filename2, hash2}, ... {filenameN, hashN}]

Then our example folder would be:

example => [{hello.txt, 7037}, {main.txt, f4230}]

Git keeps the file names and those files’ hash values at the folder level. Since the folder is now content too, Git generates a hash of that content and stores it in the database. But as this is a different type of content, Git calls it a tree, while file contents are called blobs. (In real Git, each tree entry also records a file mode, and a folder’s own name is stored in its parent’s tree.) Now we can generate the hash of a folder and save it to Git.

folder in database

Here is the simplified view of the last picture.

repository contents in database
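Here is a minimal sketch of a folder stored as content. The text format I use for a tree (one "kind hash name" line per entry) is a simplification I made up for readability, not Git’s real binary tree format, and the file contents are invented.

```python
import hashlib
from typing import Dict, List, Tuple

objects: Dict[str, bytes] = {}  # toy database: hash -> "<kind>\0<body>"


def put(kind: str, body: bytes) -> str:
    """Store body tagged with its kind ("blob" or "tree") and return its hash."""
    content = kind.encode() + b"\0" + body
    key = hashlib.sha1(content).hexdigest()
    objects.setdefault(key, content)  # keep the first copy, never duplicate
    return key


def put_tree(entries: List[Tuple[str, str, str]]) -> str:
    """Store a folder given (kind, name, hash) triples for its children."""
    # Sort by name so the same folder always serializes to the same bytes.
    ordered = sorted(entries, key=lambda entry: entry[1])
    lines = [f"{kind} {child_hash} {name}" for kind, name, child_hash in ordered]
    return put("tree", "\n".join(lines).encode())


def example_folder() -> str:
    """Store the 'example' folder from the article with made-up file contents."""
    hello = put("blob", b"Hello")
    main = put("blob", b"print('hi')")
    return put_tree([("blob", "hello.txt", hello), ("blob", "main.txt", main)])
```

Renaming hello.txt changes only the tree’s hash. The blob is untouched, because a file’s name is not part of the file’s content.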

Great! Now we’ve stored a folder in the database. The same is true for any type of content: from Git’s perspective, anything can be turned into content, hashed, and saved into the database.

Great! We are putting everything into Git’s database one by one. So what’s next?

Yes, the root folder remains! We can put that into the database as well and get a unique hash value for it too.

full repository snapshot

Wow! Git is really awesome! This is like the full picture of the project in a single frame where each pixel value is tracked: a complete snapshot of the repository. A single hash value represents a unique state of the full repository. If any tiny change is made anywhere in the repository, we get a different hash value. Another important point to notice is that the hash value of the full repository is generated from the current state of the repository, so it is independent of its past states.
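The whole snapshot step can be sketched as one recursive function. Here the repository is modelled as a nested Python dict (bytes for a file, dict for a folder), which is my own simplification, as are the tree text format and the file contents.

```python
import hashlib
from typing import Dict

objects: Dict[str, bytes] = {}  # toy database: hash -> "<kind>\0<body>"


def put(kind: str, body: bytes) -> str:
    """Store body tagged with its kind and return its hash."""
    content = kind.encode() + b"\0" + body
    key = hashlib.sha1(content).hexdigest()
    objects.setdefault(key, content)
    return key


def snapshot(node) -> str:
    """Store a file (bytes) or a folder (dict of name -> node); return its hash."""
    if isinstance(node, bytes):
        return put("blob", node)
    lines = []
    for name in sorted(node):
        child = node[name]
        kind = "blob" if isinstance(child, bytes) else "tree"
        # Children are stored first, so a folder's hash depends on their hashes.
        lines.append(f"{kind} {snapshot(child)} {name}")
    return put("tree", "\n".join(lines).encode())


def demo() -> tuple:
    """Snapshot the example repository, change one file, and snapshot again."""
    objects.clear()
    example = {"hello.txt": b"Hello", "main.txt": b"print('hi')"}
    root1 = snapshot({"about.txt": b"About this project", "example": example})
    count1 = len(objects)
    root2 = snapshot({"about.txt": b"About this project..", "example": example})
    return root1 != root2, count1, len(objects)
```

In this sketch the first snapshot stores 5 objects (3 blobs and 2 trees). The second one adds only 2 more: the new about.txt blob and a new root tree. The example folder and its files are reused as they are.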

Now we can change anything in the repository and take a snapshot of the full result. Great!

So far we can store the state of the repository. But where is the history, the commit messages we see when we run the command git log? Let’s see.

When we commit a change in Git, we add a commit message. So what does a commit actually contain?

A commit contains the snapshot hash of the repository at the time of the commit, the author info, the time of the commit, the commit message, etc.

commit in database

Following the same idea, we can treat a commit as content and store it in the database as well, and get a new ID again! Great!

Now, the history part. How does Git keep track of history? Simple. Since a commit can be represented by a single hash value, when creating a new commit we also store the previous commit’s hash value so that we can traverse back.
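A commit and its link to the previous commit can be sketched like this. The text layout of the commit is a simplified imitation, not Git’s exact format, and the tree hashes, author name, and timestamps used with it are placeholders I invented.

```python
import hashlib
from typing import Dict, List, Optional

objects: Dict[str, bytes] = {}  # toy database: hash -> "<kind>\0<body>"


def put(kind: str, body: bytes) -> str:
    """Store body tagged with its kind and return its hash."""
    content = kind.encode() + b"\0" + body
    key = hashlib.sha1(content).hexdigest()
    objects.setdefault(key, content)
    return key


def commit(tree: str, message: str, author: str, timestamp: int,
           parent: Optional[str] = None) -> str:
    """Store a commit: snapshot hash, optional parent hash, author, time, message."""
    lines = [f"tree {tree}"]
    if parent is not None:
        lines.append(f"parent {parent}")
    lines.append(f"author {author} {timestamp}")
    lines.extend(["", message])  # a blank line separates header and message
    return put("commit", "\n".join(lines).encode())


def log(commit_id: Optional[str]) -> List[str]:
    """Follow parent links back to the first commit, newest message first."""
    messages = []
    while commit_id is not None:
        body = objects[commit_id].partition(b"\0")[2].decode()
        header, _, message = body.partition("\n\n")
        messages.append(message)
        fields = dict(line.split(" ", 1) for line in header.splitlines())
        commit_id = fields.get("parent")  # None once we reach the first commit
    return messages
```

For example, after `c1 = commit("tree-1", "Initial commit", "dev", 0)` and `c2 = commit("tree-2", "Add two dots to about.txt", "dev", 60, parent=c1)`, calling `log(c2)` walks from the newest commit back to the first one.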

Let’s make the history.

Let’s change the content of the about.txt file. I just added two dots at the end of the file and made a new commit. Then the full picture becomes:

commits relations in database

Now we have two commits. The latest commit is green, and it has a reference to the previous commit. From any commit, we can now regenerate the full project by simply traversing the database using references. The logic is: start from the commit ID and get its content, which gives the hash of the root tree. If the content is a tree, it holds a list of file/folder names and their corresponding hash values. If the content is a blob, it is the content of that file. We can apply this logic recursively to regenerate the full repository.
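The traversal described above can be sketched end to end. As before, the nested-dict model of a repository, the simplified object text formats, and all function names are my own assumptions for illustration.

```python
import hashlib
from typing import Dict, Optional, Tuple

objects: Dict[str, bytes] = {}  # toy database: hash -> "<kind>\0<body>"


def put(kind: str, body: bytes) -> str:
    content = kind.encode() + b"\0" + body
    key = hashlib.sha1(content).hexdigest()
    objects.setdefault(key, content)
    return key


def get(key: str) -> Tuple[str, bytes]:
    kind, _, body = objects[key].partition(b"\0")
    return kind.decode(), body


def write(node) -> str:
    """Store a file (bytes) or a folder (dict of name -> node); return its hash."""
    if isinstance(node, bytes):
        return put("blob", node)
    lines = []
    for name in sorted(node):
        kind = "blob" if isinstance(node[name], bytes) else "tree"
        lines.append(f"{kind} {write(node[name])} {name}")
    return put("tree", "\n".join(lines).encode())


def commit(root: str, message: str, parent: Optional[str] = None) -> str:
    parent_line = f"parent {parent}\n" if parent else ""
    return put("commit", f"tree {root}\n{parent_line}\n{message}".encode())


def read(key: str):
    """Rebuild a file or folder from its hash by following references."""
    kind, body = get(key)
    if kind == "blob":
        return body  # a blob is simply the file's content
    folder = {}
    for line in body.decode().splitlines():
        _, child_hash, name = line.split(" ", 2)
        folder[name] = read(child_hash)  # recurse into files and sub-folders
    return folder


def checkout(commit_id: str):
    """Start from a commit ID, find its root tree, and rebuild the repository."""
    _, body = get(commit_id)
    root_hash = body.decode().splitlines()[0].split(" ", 1)[1]  # "tree <hash>"
    return read(root_hash)
```

Committing two versions of the example repository and calling `checkout` on each commit ID gives back each version in full, even though the unchanged files are stored only once.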

Great! Now the storage is optimized. There is no duplicate content in the database, and the full repository can still be regenerated from each version/commit.

One last thing. What about branches? A branch is nothing but a name assigned to a hash value. When we create a branch, Git assigns that name to a commit’s hash.
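A branch table can be sketched as a plain dictionary. The function names and the "c1"/"c2" commit IDs are placeholders of mine. Real Git keeps each branch name together with its commit hash under .git/refs/heads (or in the packed-refs file), which this toy does not try to reproduce.

```python
from typing import Dict

branches: Dict[str, str] = {}  # branch name -> commit hash


def create_branch(name: str, commit_id: str) -> None:
    """Creating a branch stores one name -> hash entry; no files are copied."""
    branches[name] = commit_id


def advance(name: str, new_commit_id: str) -> None:
    """A new commit on a branch just repoints the name to the new commit's hash."""
    branches[name] = new_commit_id


def demo() -> Dict[str, str]:
    """Create two branches at the same commit, then commit on one of them."""
    branches.clear()
    create_branch("main", "c1")                # placeholder commit ID
    create_branch("feature", branches["main"])  # same commit, second name
    advance("feature", "c2")                   # only "feature" moves
    return dict(branches)
```

This is why creating a branch in Git is so cheap: it only records a name for a hash that already exists in the database.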

I hope you now have a clear idea of how Git stores data to work its magic!


Md. Anjarul Islam
Brain Station 23

Experienced Senior Software Engineer at Brainstation-23 with a strong background in JavaScript technologies. Passionate about system design and problem solving.