A Beginner’s Guide to Git and Git Internals

Prakarsh Parashar
Published in Geek Culture
15 min read · Apr 4, 2021

I have struggled a lot with Git! Eight times out of ten, when I pushed my code, the project would break. I would then spend the entire night trying to make my code work again, to no avail. In the morning, half-asleep, I would finally seek help from my friend, who is a Git God. To my amazement, he would fix the project in just a couple of minutes. He would always say that Git is simple once you understand its underlying internals.
Frustrated with my struggles, I decided to spend an entire weekend studying Git internals. Since then, Git has seemed both simple and beautiful. Every Git problem that used to be a nightmare to solve now seems obvious and trivial.

Frankly speaking, Git is not an easy tool for beginners. Although it is beautifully designed on the inside, even its basic functionality can be a tough nut to crack for a newbie. Through this article, I want to open up the Git black box and dive into its internals. This is the first article in a series of 4 articles on Git’s internal structure. (The second article, on Git branching, has been published. You can find it here.)

The article is written for complete beginners. However, it would also be useful to those who have some familiarity with Git but are not confident with their Git skills or those who want to demystify the underlying internals of Git. The only prerequisite is some familiarity with commands like cd, ls, cat, etc.

Before beginning our introduction to Git, let us first understand what a Version Control System is.

What is a Version Control System?

Simply speaking, a Version Control System (VCS) is a tool that helps to record and manage changes to our Project.

For a better understanding, let us consider an example.

Imagine a world without Version Control Systems. Suppose you are working on a project and saving your project files in Google Drive (or on your local computer). You start on Day-1 and have completed part of the project (say 50%) by Day-7. Two days later (Day-9), you realize that the project has become too complicated and you want to go back to its state on Day-7. Sadly, you cannot revert to that earlier version, because the project files have been overwritten by the changes you made since. The only way out would have been to save a copy of your project (as on Day-7) somewhere. But saving a copy every time you complete a part of the project is cumbersome.

To avoid such problems, we use Version Control Systems. A VCS saves snapshots of your Project repository (Project folder) on your local computer whenever you ask it to. Every time you complete a part of your project, you issue a command to your VCS saying: take a snapshot of my current Project repository (all files and folders in the Project directory) and save it in the history. If you later want to return to a previous state of your project (a snapshot you saved earlier), you issue another command saying: revert my project to this particular snapshot.

You can revert to any snapshot in your Project’s VCS’s history

The name is Version Control because it saves versions (snapshots) of your Project. The importance of Version Control systems increases as the size and complexity of the project increases. As we’ll see, there are many additional features a VCS provides. They help you manage your projects in a much better way.
Git is a Version Control System. It is free and open-source, and it is one of the most popular VCSs. Other Version Control Systems include Mercurial and SVN.

Git — Introduction

So now we understand what Git is. But before beginning with Git’s data model, I would like to address another point of confusion for complete beginners. Apart from Git and Version Control Systems, many of you must have heard about GitHub. So a natural question is: what is GitHub, and how is it different from Git? (If you have not heard about GitHub, you can skip this part of the article.) You can think of GitHub as a Google Drive for projects. It stores your project in the cloud, along with all the snapshots that Git has been saving along the way. This is helpful if you want to access your project on a different machine or if you accidentally lose the data on your computer. GitHub is not only a backup for your projects, though. Its main selling point is that it makes collaboration on a project much easier.
Some beginners have the misconception that Git and GitHub are inseparable, and that to use Git you must use GitHub as well. This is not true. If you don’t want to store your project in the cloud and it does not require collaboration with others, GitHub is not required. Git will manage your project on your local computer on its own. Also note that there are other code hosting platforms apart from GitHub.
Now let’s begin with Git’s Data Model.

Git — Data Model

So, time to get started with Git’s internal design. Internally git stores the content of snapshots in different types of objects. Let us take a closer look at these objects.

Commit object
We talked about a snapshot of our Project repository. A commit object denotes one such snapshot. It lets us view a particular snapshot from the history that our VCS has saved. A commit object contains a pointer (reference) to its parent commit object, i.e. the snapshot saved just before it. This forms a linked list of commit objects. (A linked list is a list of objects in which each object is linked to the next via a pointer; that object, in turn, points to another, and so on.) The only exception is the initial commit, which does not have a parent.
NOTE: In the article on branches, we’ll discover that a commit object can also have 2 or more parents and a parent commit object can also have 2 or more children commit objects.
A commit object also contains other useful information like the author of the commit, the commit message, etc. We’ll cover those later in the article.

Commit object-3 is the latest commit, while commit object-1 is the oldest commit (which does not have a parent)

So how does this commit object help us view a snapshot of our Project?
Ans — It contains a reference to a tree object.

Tree object
A tree object represents a directory (folder). If the Project is contained in a directory named MyProject, we’ll call MyProject the root directory. The root directory is stored as a tree object, and the commit object contains a reference to it. A tree object contains information about the directory it represents, i.e. the files and sub-directories present in it. Sub-directories are also represented as tree objects, so a tree object can contain references to other tree objects.

Files pointed to by tree objects are stored as blob objects in git.

Blob object

A blob object represents a file in git. It contains only the content of the file; the file’s name is recorded in the tree object that points to the blob. If a file does not change from commit-1 to commit-2 (say), then the blob object also does not change.
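To make the relationship between the three object types concrete, here is a minimal Python sketch of the data model described above. It is only a mental model, not how Git is implemented: real Git links objects by hash rather than by direct reference, and the file names, author and message below are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Union


@dataclass
class Blob:
    content: bytes  # raw file content; the file name lives in the parent Tree


@dataclass
class Tree:
    # Maps an entry name to either a file (Blob) or a sub-directory (Tree).
    entries: Dict[str, Union[Blob, "Tree"]] = field(default_factory=dict)


@dataclass
class Commit:
    tree: Tree                  # snapshot of the root project directory
    parent: Optional["Commit"]  # None for the initial commit
    author: str
    message: str


def list_files(tree: Tree, prefix: str = "") -> List[str]:
    """Return every file path reachable from a tree, depth first."""
    paths = []
    for name, entry in sorted(tree.entries.items()):
        if isinstance(entry, Tree):
            paths.extend(list_files(entry, prefix + name + "/"))
        else:
            paths.append(prefix + name)
    return paths


# Hypothetical contents of a MyProject root directory.
root = Tree({
    "notes.txt": Blob(b"some notes\n"),
    "docs": Tree({"todo.txt": Blob(b"write more\n")}),
})
first = Commit(tree=root, parent=None, author="someone", message="first snapshot")
second = Commit(tree=root, parent=first, author="someone", message="second snapshot")

print(list_files(second.tree))  # ['docs/todo.txt', 'notes.txt']
print(second.parent is first)   # True
```

Starting from any commit, following tree takes you into that snapshot, and following parent takes you back in history.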

Now let us take a closer look at these objects. We’ll get our hands dirty and start executing some commands. First, install git on your system. You can refer to this article for installation steps. I would advise you to follow along by executing the same commands in your system.

NOTE: To avoid any confusion, >>> is my command prompt. 😅

Firstly, create an empty directory gitDemo which will be our root Project directory.

>>> mkdir gitDemo
>>> cd gitDemo

Next, we’ll initialize this repository as a git repository. git init initializes the repository as a git repository and creates a .git folder in our root project directory.

>>> git init
Initialized empty Git repository in /home/prakarsh/MediumGitArticle1/gitDemo/.git/

Listing the content of our gitDemo directory, we will notice a .git directory. The .git directory is responsible for storing all the important information to manage our project using Git.

>>> ls -a
. .. .git

Let’s look at the structure of the .git directory using the tree command.

>>> tree .git
.git
├── branches
├── config
├── description
├── HEAD
├── hooks
│   ├── applypatch-msg.sample
│   ├── commit-msg.sample
│   ├── fsmonitor-watchman.sample
│   ├── post-update.sample
│   ├── pre-applypatch.sample
│   ├── pre-commit.sample
│   ├── prepare-commit-msg.sample
│   ├── pre-push.sample
│   ├── pre-rebase.sample
│   ├── pre-receive.sample
│   └── update.sample
├── info
│   └── exclude
├── objects
│   ├── info
│   └── pack
└── refs
    ├── heads
    └── tags

We see that .git contains several files and sub-directories, such as branches, config, HEAD, info, refs and objects. In this article, we’ll focus on the objects directory.

>>> tree .git/objects
.git/objects
├── info
└── pack

We can see that there are only two items in the objects directory. This is because as of now, we’ve not created an object. First, let us create some objects. We’ll create some files and folders in our gitDemo directory and save our first snapshot.

>>> echo "I am file1" > file1
>>> ls
file1
>>> mkdir subDir1
>>> cd subDir1
>>> echo "I am file2" > file2
>>> ls
file2
>>> cd ..

Now, execute the command tree in your gitDemo directory to see the structure of your repository.

>>> tree
.
├── file1
└── subDir1
    └── file2

1 directory, 2 files

We’ve created some files and folders in our project repository. Time to save the first snapshot of our repository 😎. Execute the following commands to save the first snapshot of your repository.

>>> git add file1 subDir1/
>>> git commit -m "created file1, created subDir1 and created file2 in it"
[master (root-commit) 155891c] created file1, created subDir1 and created file2 in it
2 files changed, 2 insertions(+)
create mode 100644 file1
create mode 100644 subDir1/file2

Hurray! We’ve saved the first snapshot of our repository, and the first commit object has been created. But you must be wondering: what do the git add and git commit commands do?

git add
git add filename1 filename2 filename3 is used to stage the changes in your repository for commit (snapshot). It is like a preliminary step before the snapshot. Note that, if you’ve made changes to some file but you’ve not staged the file using git add, the file will not be considered for commit when you run git commit. i.e. the file will not be included in the snapshot. I am intentionally omitting a few details about the staging part and staging area to keep the focus on the Git Data Model. We’ll cover those details in another article.

git commit
git commit -m "commit message" saves a snapshot of the staged state of your repository in your commit history. The commit message describes the changes made in this snapshot. It is important to provide a good commit message. If you want to look at a previous state of a Project that has thousands of commits, it is a lot easier to find the right commit when the messages are good.

Let us look at all the objects created after the commit.

git log shows us information about all the commits in our Project history. Since we have only one commit now, it will output information about a single commit. Execute the command git log and observe the output. We can see that the first line contains a 40-character hexadecimal string. It is the SHA-1 hash of the content of the commit object. If you are not familiar with SHA-1 hashes, do not worry; just think of this string as a unique identifier for the commit object. All types of objects (commit, tree, blob) are uniquely identified by such a SHA-1 hash.
The second line provides information about the author of the commit (i.e. the one who made the changes in the repository). When you are collaborating with others on a project, sometimes it can be useful to know the author of the commit. The third line is the date and time of the commit. And the last line is the commit message which was added by the author at the time of commit.

>>> git log
commit 155891ca19d94f67159a992c77923818b57d74a5 (HEAD -> master)
Author: pprakarsh <prakarshparashar@gmail.com>
Date:   Sun May 3 10:41:06 2020 +0530

    created file1, created subDir1 and created file2 in it
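For the curious, the identifier can be reproduced outside Git. Git computes the SHA-1 of a short header (the object type, a space, the size in bytes and a NUL byte) followed by the object's content. The Python sketch below applies that rule to a blob; the function name git_object_id is my own, not part of Git.

```python
import hashlib


def git_object_id(obj_type: str, body: bytes) -> str:
    """SHA-1 of '<type> <size>\\0<body>', which is how Git names an object."""
    header = f"{obj_type} {len(body)}\0".encode()
    return hashlib.sha1(header + body).hexdigest()


# echo "I am file1" > file1 writes the text followed by a newline.
blob_id = git_object_id("blob", b"I am file1\n")
print(len(blob_id))  # 40
print(blob_id)
```

You can compare the printed value with what git hash-object file1 reports in your own repository.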

Execute this command git cat-file commit <commit-hash> to see the content of the commit object. You can copy the commit hash from the output of the git log command. In the command argument, you need not write the entire commit hash, even a few characters are enough to identify the commit almost all the time. Execute the command and observe the output. Notice that the first line in the output contains the SHA-1 hash of a tree object. This tree object represents the root Project directory.

The commit hash is a function of the content of the commit object, which includes the author, date, time, etc. These differ from one person’s commit to another’s, so the commit hash will vary even when the snapshot of the repository is the same. (You will notice that your commit hash differs from mine even though the content of the repository is exactly the same.)

>>> git cat-file commit 155891c
tree 4587af994fe34fed465b64b6ce8b8adb07a17aaf
author pprakarsh <prakarshparashar@gmail.com> 1588482666 +0530
committer pprakarsh <prakarshparashar@gmail.com> 1588482666 +0530

created file1, created subDir1 and created file2 in it

Commit object points to the Tree object corresponding to the root Project directory
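The following Python sketch illustrates why two people committing identical files still get different commit hashes. It assembles the text of a parentless commit object in the layout printed by git cat-file above and hashes it twice with timestamps one second apart. The identity and message are placeholders, and the helper names are my own.

```python
import hashlib


def git_object_id(obj_type: str, body: bytes) -> str:
    """SHA-1 of '<type> <size>\\0<body>', which is how Git names an object."""
    header = f"{obj_type} {len(body)}\0".encode()
    return hashlib.sha1(header + body).hexdigest()


def commit_body(tree_id: str, who: str, timestamp: int, message: str) -> bytes:
    """Assemble the text of a commit object that has no parent."""
    stamp = f"{who} {timestamp} +0530"
    return (f"tree {tree_id}\n"
            f"author {stamp}\n"
            f"committer {stamp}\n"
            f"\n{message}\n").encode()


tree_id = "4587af994fe34fed465b64b6ce8b8adb07a17aaf"  # root tree from the article
who = "someone <someone@example.com>"                  # hypothetical identity

first = git_object_id("commit", commit_body(tree_id, who, 1588482666, "first snapshot"))
later = git_object_id("commit", commit_body(tree_id, who, 1588482667, "first snapshot"))

print(first != later)  # True: same tree, different timestamp, different hash
```

The tree hash, in contrast, depends only on the files and directories, which is why your root tree hash should match mine if you typed the same commands.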

Execute the command git ls-tree <tree object hash> to take a peek at this tree object. Again copy the tree object hash from the output of the git cat-file commit <commit-hash> command executed above. We observe that the tree object contains a reference to a blob object (file1) and a reference to another tree object (subDir1).

>>> git ls-tree 4587af994fe34fed465b64b6ce8b8adb07a17aaf
100644 blob a83c85bfc5f4f056b0932eb2e5cf767f167a67f2 file1
040000 tree 8993f258581e2772c7a4c27ad07498caed668c57 subDir1

Now, let’s take a look at the blob object (file1) and the tree object (subDir1) referenced by the root tree object. Executing the command git cat-file blob <blob-hash> gives us the content of the blob object, i.e. the content of file1.
Executing the command git ls-tree <tree-hash> shows that the tree object for subDir1 (hash 8993f2 in the picture) contains a reference to another blob object (file2).

>>> git cat-file blob a83c85bfc5f4f056b0932eb2e5cf767f167a67f2
I am file1
>>> git ls-tree 8993f258581e2772c7a4c27ad07498caed668c57
100644 blob 4e01f27ee757413022834dcb1bbf4c66fb01f05c file2

Finally, before moving on to the second commit, we’ll have a look at the blob object (hash 4e01f27 in the picture) referenced by the tree object (which represents subDir1). Execute the command git cat-file blob <blob-hash>. The output is the content of subDir1/file2.

>>> git cat-file blob 4e01f27ee757413022834dcb1bbf4c66fb01f05c
I am file2
Pictorial representation of all objects after the first commit
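What git ls-tree prints is a readable rendering of a compact binary record. Each entry in a tree object is stored as the mode, a space, the name, a NUL byte and then the 20 raw bytes of the referenced object's hash. The Python sketch below encodes the single entry of the subDir1 tree in that layout and hashes it; the helper names are my own.

```python
import hashlib


def git_object_id(obj_type: str, body: bytes) -> str:
    """SHA-1 of '<type> <size>\\0<body>', which is how Git names an object."""
    header = f"{obj_type} {len(body)}\0".encode()
    return hashlib.sha1(header + body).hexdigest()


def tree_body(entries) -> bytes:
    """Encode (mode, name, hex_id) entries as '<mode> <name>\\0<20 raw bytes>'."""
    body = b""
    # Simplification: plain sort by name. Git orders directory names slightly
    # differently, which does not matter for this one-entry example.
    for mode, name, hex_id in sorted(entries, key=lambda entry: entry[1]):
        body += f"{mode} {name}\0".encode() + bytes.fromhex(hex_id)
    return body


# The single entry that git ls-tree printed for subDir1 in the article.
sub_dir1 = tree_body([("100644", "file2", "4e01f27ee757413022834dcb1bbf4c66fb01f05c")])

print(len(sub_dir1))                    # 33: 13 bytes of text + 20 bytes of hash
print(git_object_id("tree", sub_dir1))  # compare with the subDir1 hash above
```

Because the tree stores the hashes of its children, any change to a child changes the tree's own hash too.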

Let us go back and look at our .git/objects directory again. We see some new items listed. These are the objects that have been created: the commit object, the tree objects and the blob objects. Each object is stored in a sub-directory named after the first two characters of its hash, in a file named after the remaining 38 characters. Match the object hashes in the picture above with the output below. The content of all objects is stored here, in .git/objects.

>>> tree .git/objects
.git/objects
├── 15
│   └── 5891ca19d94f67159a992c77923818b57d74a5
├── 45
│   └── 87af994fe34fed465b64b6ce8b8adb07a17aaf
├── 4e
│   └── 01f27ee757413022834dcb1bbf4c66fb01f05c
├── 89
│   └── 93f258581e2772c7a4c27ad07498caed668c57
├── a8
│   └── 3c85bfc5f4f056b0932eb2e5cf767f167a67f2
├── info
└── pack
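If you try to cat one of these files directly, you will see gibberish, because Git compresses each object with zlib before writing it. The Python sketch below mimics this layout in a temporary directory, which stands in for .git/objects so that no real repository is touched. The function names are my own, and packed objects (the pack directory) are not covered.

```python
import hashlib
import os
import tempfile
import zlib


def write_loose_object(objects_dir: str, obj_type: str, body: bytes) -> str:
    """Store an object the way Git lays out loose objects; return its hash."""
    store = f"{obj_type} {len(body)}\0".encode() + body
    object_id = hashlib.sha1(store).hexdigest()
    folder = os.path.join(objects_dir, object_id[:2])  # first 2 hex characters
    os.makedirs(folder, exist_ok=True)
    with open(os.path.join(folder, object_id[2:]), "wb") as fh:  # other 38
        fh.write(zlib.compress(store))
    return object_id


def read_loose_object(objects_dir: str, object_id: str):
    """Decompress an object file and split it into (type, content)."""
    path = os.path.join(objects_dir, object_id[:2], object_id[2:])
    with open(path, "rb") as fh:
        store = zlib.decompress(fh.read())
    header, body = store.split(b"\0", 1)
    obj_type, size = header.decode().split(" ")
    if int(size) != len(body):
        raise ValueError("size in header does not match content length")
    return obj_type, body


with tempfile.TemporaryDirectory() as objects_dir:  # stand-in for .git/objects
    oid = write_loose_object(objects_dir, "blob", b"I am file1\n")
    stored_at = os.path.join(oid[:2], oid[2:])
    obj_type, body = read_loose_object(objects_dir, oid)

print(stored_at)
print(obj_type, body)  # blob b'I am file1\n'
```

This is essentially what git cat-file does for you when you hand it a hash.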

Now let us make some changes in our repository and commit again. Execute the commands below to make some changes in file1 and commit those changes.

>>> echo "Adding another line in file1" >> file1
>>> git add file1
>>> git commit -m "added another line in file1"
[master 6642a0d] added another line in file1
1 file changed, 1 insertion(+)

Execute git log to see the new commit added in the log. The latest commit is at the top.

>>> git log
commit 6642a0de776b52bbe993555a4bf14aed060afea2 (HEAD -> master)
Author: pprakarsh <prakarshparashar@gmail.com>
Date:   Sun May 3 22:14:12 2020 +0530

    added another line in file1

commit 155891ca19d94f67159a992c77923818b57d74a5
Author: pprakarsh <prakarshparashar@gmail.com>
Date:   Sun May 3 10:41:06 2020 +0530

    created file1, created subDir1 and created file2 in it

*Ignore HEAD -> master in the output above. We’ll cover this in the branching article.

Execute git cat-file commit <commit-hash> to see the content of the newly created commit object. In the first line, we observe that the tree object it points to has a different hash from the tree object of the previous commit, because file1 has changed. Thus commit-2 points to a different tree object.
Remember that at the beginning of the article we talked about the parent of a commit object. We can see that the newly created commit object has a parent data member. The hash in this data member is equal to the hash of the previous commit object, which confirms that the parent data member is a reference to the previous commit.

>>> git cat-file commit 6642a0d
tree 60f96fcee961146cdcf8c4cfd0faf5a0e821cebf
parent 155891ca19d94f67159a992c77923818b57d74a5
author pprakarsh <prakarshparashar@gmail.com> 1588524252 +0530
committer pprakarsh <prakarshparashar@gmail.com> 1588524252 +0530

added another line in file1

Commit-2 has been created. Notice the parent data member. It points to its parent (i.e. commit-1). Also, the new commit (i.e. commit-2) points to a different Tree object.

Now, we’ll explore the tree object (60f96fce). The newly created commit object contains a pointer to this tree object. Execute the command git ls-tree <tree-hash>.
The tree object (60f96fce) contains a pointer to a blob object and a pointer to another tree object (hash 8993f2585) corresponding to subDir1. Since subDir1 did not change between the two commits, its tree object (8993f2585) did not change either. Since file1 was changed, we have a new blob object corresponding to file1. We can look at the content of the new blob object using the command git cat-file blob <blob-hash>. The blob hash is available from the output of the git ls-tree <tree-hash> command executed just before.

>>> git ls-tree 60f96fcee961146cdcf8c4cfd0faf5a0e821cebf
100644 blob 668be5e5d8709844c9e14efb34565cdbc475b57f file1
040000 tree 8993f258581e2772c7a4c27ad07498caed668c57 subDir1
>>> git cat-file blob 668be5e5d8709844c9e14efb34565cdbc475b57f
I am file1
Adding another line in file1
Tree object corresponding to subDir1 remains the same, blob object for file1 changes.

Time for commit-3. We’ll make some changes in subDir1/file2. Execute the following commands to make changes in subDir1/file2 and commit those changes.

>>> echo "I have been changed" >> subDir1/file2
>>> git add subDir1/file2
>>> git commit -m "changed file2"
[master e097d8d] changed file2
1 file changed, 1 insertion(+)

To see the new commit, execute git log. The new commit object can be found at the top.

>>> git log
commit e097d8da357fe97c36c7178f0b770ae44b1f1f3c (HEAD -> master)
Author: pprakarsh <prakarshparashar@gmail.com>
Date:   Thu May 21 10:24:55 2020 +0530

    changed file2

commit 6642a0de776b52bbe993555a4bf14aed060afea2
Author: pprakarsh <prakarshparashar@gmail.com>
Date:   Sun May 3 22:14:12 2020 +0530

    added another line in file1

commit 155891ca19d94f67159a992c77923818b57d74a5
Author: pprakarsh <prakarshparashar@gmail.com>
Date:   Sun May 3 10:41:06 2020 +0530

    created file1, created subDir1 and created file2 in it
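The order of this output follows directly from the linked list of commits: start at the latest commit and keep following the parent pointer until a commit has none. Here is a small Python sketch of that walk, using the abbreviated hashes and messages from the output above. It is a simplification that assumes every commit has at most one parent, and the function name is my own.

```python
from typing import Dict, Iterator, Optional, Tuple

# Commit id -> (parent id or None, message), copied from the article's output.
commits: Dict[str, Tuple[Optional[str], str]] = {
    "155891c": (None, "created file1, created subDir1 and created file2 in it"),
    "6642a0d": ("155891c", "added another line in file1"),
    "e097d8d": ("6642a0d", "changed file2"),
}


def walk_history(head: str) -> Iterator[Tuple[str, str]]:
    """Yield (id, message) from the newest commit back to the root commit."""
    current: Optional[str] = head
    while current is not None:
        parent, message = commits[current]
        yield current, message
        current = parent


for commit_id, message in walk_history("e097d8d"):
    print(commit_id, message)
```

The loop prints e097d8d first and 155891c last, the same order in which git log lists the commits.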

To see the content of the newly created commit object, execute the command git cat-file commit <commit-hash>.

>>> git cat-file commit e097d8d
tree e1564dac72800b9ee123f05e882543c4b7db56d1
parent 6642a0de776b52bbe993555a4bf14aed060afea2
author pprakarsh <prakarshparashar@gmail.com> 1590036895 +0530
committer pprakarsh <prakarshparashar@gmail.com> 1590036895 +0530

changed file2

We see that commit-3 points to a different tree object (e1564dac)

Now, we’ll look into the newly created tree object (e1564dac) and the objects it points to. This tree object points to the blob object (668be5e) and a new tree object (ba6b84ec) corresponding to subDir1. Why a new tree object for subDir1? Because the content of subDir1 has changed: file2 now has a new blob, so the tree that references it has a new hash as well. The blob object (668be5e) corresponding to file1 does not change, because file1 does not change.

Execute the following commands to see the contents of these newly created objects.

>>> git ls-tree e1564dac72800b9ee123f05e882543c4b7db56d1
100644 blob 668be5e5d8709844c9e14efb34565cdbc475b57f file1
040000 tree ba6b84ec235a7cc5eeefba17dd744d270e2b258d subDir1
>>> git ls-tree ba6b84ec235a7cc5eeefba17dd744d270e2b258d
100644 blob 38ec3b70732a6b5e219fea0edbf48b26c7bd2234 file2
>>> git cat-file blob 38ec3b70732a6b5e219fea0edbf48b26c7bd2234
I am file2
I have been changed
Flow chart representation of git objects after Commit-3
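To tie the three commits together, the Python sketch below replays the three versions of our repository against an in-memory object store and counts how many new tree and blob objects each snapshot adds (commit objects are left out). It is a model under simplifying assumptions: plain files only, entries sorted by name, and helper names of my own choosing.

```python
import hashlib


def git_object_id(obj_type: str, body: bytes) -> str:
    """SHA-1 of '<type> <size>\\0<body>', which is how Git names an object."""
    header = f"{obj_type} {len(body)}\0".encode()
    return hashlib.sha1(header + body).hexdigest()


def store_snapshot(node, store: dict) -> str:
    """Store a file (bytes) or directory (dict) in `store`; return its id."""
    if isinstance(node, bytes):
        obj_type, body = "blob", node
    else:
        obj_type, body = "tree", b""
        for name in sorted(node):
            mode = "100644" if isinstance(node[name], bytes) else "40000"
            child_id = store_snapshot(node[name], store)
            body += f"{mode} {name}\0".encode() + bytes.fromhex(child_id)
    object_id = git_object_id(obj_type, body)
    store.setdefault(object_id, body)  # identical content is stored only once
    return object_id


file1_v1 = b"I am file1\n"
file1_v2 = file1_v1 + b"Adding another line in file1\n"
file2_v1 = b"I am file2\n"
file2_v2 = file2_v1 + b"I have been changed\n"

versions = [
    {"file1": file1_v1, "subDir1": {"file2": file2_v1}},  # commit-1
    {"file1": file1_v2, "subDir1": {"file2": file2_v1}},  # commit-2
    {"file1": file1_v2, "subDir1": {"file2": file2_v2}},  # commit-3
]

store = {}
new_per_commit = []
for version in versions:
    before = len(store)
    store_snapshot(version, store)
    new_per_commit.append(len(store) - before)

print(new_per_commit)  # [4, 2, 3]
```

Commit-2 adds only a new file1 blob and a new root tree, while commit-3 adds a new file2 blob, a new subDir1 tree and a new root tree. Everything else is reused, which matches what we saw with git ls-tree.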

I hope that now we have a clear understanding of Git’s Data model. We understand how Git stores different types of objects. In the next article, we’ll learn about one of the most powerful features of Git - branching. The second article can be found here.
