PPL 2020 — Document Builder: Git’s Underlying Implementation

Published in pepeel, Feb 24, 2020

This post is written as a part of individual review criteria of Fasilkom UI’s software engineering project course: PPL 2020

For most experienced software developers and computer science students, the word “git” is nothing alien. We use git every day, so we tend to assume that most people know it, when in fact it is used only within one fairly narrow field: tech.

Sometimes, I see people treat git as a magical tool that just works: it is there to help us get our code off our local machine, and it feels like it has been there forever.

In fact, git was only created in 2005 by Linus Torvalds. Before version control tools like it, managing the versions of a software project’s source code could be very challenging. Collaborating with other people? Even more so. Can you imagine sitting side by side for hours with a person whose code conflicts with yours, in front of two different versions of the code, trying to resolve the conflicts manually together? Let’s take a moment, as I write this article and as you read it, to thank all the people who created git and the concept of version control and made seamless collaboration in building software a reality.

In this post, I’d like to explore something that developers often miss when learning git: its internal implementation, and what really happens when we run git commands. The commands whose implementations I’ll explore are git init, git add, and git commit.

git init

git init — What happens when you initialize a repository

git init creates a git repository in the current directory. It does this by creating a .git directory and writing some files to it. This directory holds the history and the configuration of the repository. Many of these files, such as HEAD and config, are plain text and can be opened in a text editor. Everything in the .git directory belongs to git; the other files in the directory belong to the user and are known as the working copy.

Figure: an example of the contents of a .git directory
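To make the layout more concrete, here is a minimal Python sketch that builds a bare-bones .git skeleton by hand. It is only an illustration: the function name init_repository is my own, and the real git init writes more than this (for example config, description, and hooks), so treat it as a simplified model and not as what git actually does.

```python
import tempfile
from pathlib import Path


def init_repository(root: Path, branch: str = "master") -> Path:
    """Create a bare-bones .git directory under root (illustrative only)."""
    git_dir = root / ".git"
    # Object store: blobs, trees and commits end up in here.
    (git_dir / "objects").mkdir(parents=True, exist_ok=True)
    # One file per branch, each holding a commit hash.
    (git_dir / "refs" / "heads").mkdir(parents=True, exist_ok=True)
    # HEAD names the branch that is currently checked out.
    (git_dir / "HEAD").write_text(f"ref: refs/heads/{branch}\n")
    return git_dir


# Demo in a throwaway directory so nothing real is touched.
with tempfile.TemporaryDirectory() as tmp:
    git_dir = init_repository(Path(tmp))
    for path in sorted(git_dir.rglob("*")):
        print(path.relative_to(tmp))
```

The demo prints the paths it created, which gives a rough idea of what to look for in a real repository.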

git add

git add — What happens when you add some files

Two things happen when a user runs the git add command.

1. Add modified files from the working copy to .git/objects

When you run git add on file.txt, for example, git creates a blob in the .git/objects directory. This blob file contains a compressed copy of the content of file.txt. You might have noticed that after deleting a file from your working copy, you can run a chain of git commands to bring it back. This works because git saved the content of the file you added in the .git/objects directory.

2. Add the files to .git/index

The index is a list of every file that git is keeping track of. It is stored as a single file (not a directory) at .git/index. Each entry in .git/index maps a tracked file to the blob of compressed content stored in the .git/objects directory.
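The two steps above can be sketched in Python. A blob is stored as the zlib-compressed bytes of a short header ("blob", the content length, and a NUL byte) followed by the file content, and the SHA-1 of those uncompressed bytes becomes the object's name. The function names write_blob, read_blob, and add are hypothetical, and the plain dict standing in for the index is an assumption for readability: the real .git/index is a binary file that also stores file metadata.

```python
import hashlib
import tempfile
import zlib
from pathlib import Path


def write_blob(git_dir: Path, content: bytes) -> str:
    """Store content as a blob object and return its SHA-1 hex digest."""
    # Git hashes a small header plus the raw file content.
    store = b"blob " + str(len(content)).encode() + b"\0" + content
    sha = hashlib.sha1(store).hexdigest()
    # The first two hex characters name the directory, the other 38 the file.
    obj_path = git_dir / "objects" / sha[:2] / sha[2:]
    obj_path.parent.mkdir(parents=True, exist_ok=True)
    obj_path.write_bytes(zlib.compress(store))
    return sha


def read_blob(git_dir: Path, sha: str) -> bytes:
    """Decompress a stored blob and strip its header."""
    raw = zlib.decompress((git_dir / "objects" / sha[:2] / sha[2:]).read_bytes())
    _header, _, content = raw.partition(b"\0")
    return content


def add(git_dir: Path, index: dict, path: str, content: bytes) -> None:
    """Simplified 'git add': write the blob, then map path -> blob hash."""
    index[path] = write_blob(git_dir, content)


# Demo: "add" a file, then recover its content from the object store alone.
with tempfile.TemporaryDirectory() as tmp:
    index = {}  # stand-in for .git/index (assumption, not the real format)
    add(Path(tmp), index, "file.txt", b"hello world\n")
    print(index)
    print(read_blob(Path(tmp), index["file.txt"]))
```

Because the content can be read back from the object store using only the hash recorded in the index, a deleted working-copy file can be restored.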

git commit

git commit — What happens when you make a commit

When a user runs the commit command, git performs three steps:

It creates a tree graph representing the content of the current version of the project
It creates a commit object
It points the HEAD of the current branch to the commit object

Tree Graph

When the commit command runs, it first builds a tree graph. The tree graph records the location and content of every file in the project. It is made up of blobs and trees.

Blobs represent the contents of files. They are created when the files are added.

Trees represent the directories of the working copy. They are created when a commit is made.
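As a rough illustration of how trees tie blobs together, the sketch below builds the bytes of a tree object in Python. Each entry is a file mode, a name, a NUL byte, and the 20-byte binary hash of the blob or subtree it points to. The helper names hash_object and build_tree_body and the file names in the demo are made up for this example, and the sketch only computes hashes without writing anything to .git/objects.

```python
import hashlib


def hash_object(kind: str, body: bytes) -> str:
    """SHA-1 of '<kind> <size>' + NUL + body, the way git names objects."""
    store = kind.encode() + b" " + str(len(body)).encode() + b"\0" + body
    return hashlib.sha1(store).hexdigest()


def build_tree_body(entries) -> bytes:
    """Build a tree body from (mode, name, sha_hex) tuples.

    Mode "100644" marks a regular file, "40000" a subdirectory.
    """
    def sort_key(entry):
        mode, name, _sha = entry
        # Git orders entries by name, comparing directories as "name/".
        return name.encode() + (b"/" if mode == "40000" else b"")

    body = b""
    for mode, name, sha_hex in sorted(entries, key=sort_key):
        body += mode.encode() + b" " + name.encode() + b"\0" + bytes.fromhex(sha_hex)
    return body


# Demo: a root directory with a.txt and a src/ directory containing b.txt.
blob_sha = hash_object("blob", b"hello\n")
src_tree_sha = hash_object("tree", build_tree_body([("100644", "b.txt", blob_sha)]))
root_body = build_tree_body([
    ("40000", "src", src_tree_sha),
    ("100644", "a.txt", blob_sha),
])
print("root tree:", hash_object("tree", root_body))
```

Since a tree refers to its children by hash, changing any file changes the hash of every tree above it, all the way up to the root.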

Commit Object

After creating the tree graph, git creates the commit object. The commit object is a file created in the .git/objects/ directory. It contains three things:

The hash of the tree object that represents the root of the working copy, which is its reference to the tree graph
The author and committer with timestamps, plus the hash of the parent commit if there is one
Commit message
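The contents of a commit object can be sketched as plain text assembled in Python. The function names build_commit_body and hash_commit, the author identity, and the timestamp below are placeholders chosen for illustration; real commits may also carry extra headers that this sketch leaves out.

```python
import hashlib


def build_commit_body(tree_sha, message, author, timestamp,
                      tz="+0000", parent_sha=None) -> bytes:
    """Assemble the text of a commit object (simplified)."""
    lines = [f"tree {tree_sha}"]
    if parent_sha:
        # The first commit in a repository has no parent line.
        lines.append(f"parent {parent_sha}")
    lines.append(f"author {author} {timestamp} {tz}")
    lines.append(f"committer {author} {timestamp} {tz}")
    # A blank line separates the headers from the commit message.
    return ("\n".join(lines) + "\n\n" + message + "\n").encode()


def hash_commit(body: bytes) -> str:
    """Name the commit the same way as blobs and trees: header + body."""
    store = b"commit " + str(len(body)).encode() + b"\0" + body
    return hashlib.sha1(store).hexdigest()


# Demo with the hash of an empty tree and a made-up author.
body = build_commit_body(
    tree_sha="4b825dc642cb6eb9a060e54bf8d69288fbee4904",
    message="Initial commit",
    author="A U Thor <author@example.com>",  # placeholder identity
    timestamp=1582502400,                    # arbitrary fixed value
)
print(body.decode())
print(hash_commit(body))
```

The timestamps are part of the hashed text, so two commits with the same tree and message still get different hashes if they are made at different times.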

Pointing HEAD to commit object

In this step, the commit command points the current branch at the new commit object. HEAD normally refers to the current branch and not to a commit directly, so if we’re on the master branch, this is done by replacing the content of .git/refs/heads/master with our new commit hash.
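This last step is small enough to sketch in a few lines of Python. The function name update_current_branch is hypothetical, and the sketch assumes HEAD contains a line of the form "ref: refs/heads/<branch>"; it does not handle a detached HEAD or the safety measures real git takes when writing refs.

```python
import tempfile
from pathlib import Path


def update_current_branch(git_dir: Path, commit_sha: str) -> Path:
    """Write commit_sha into the ref file that HEAD points at."""
    head = (git_dir / "HEAD").read_text().strip()
    if not head.startswith("ref: "):
        raise ValueError("detached HEAD is not handled in this sketch")
    # "ref: refs/heads/master" -> <git_dir>/refs/heads/master
    ref_path = git_dir / head[len("ref: "):]
    ref_path.parent.mkdir(parents=True, exist_ok=True)
    # Overwrite whatever hash was there before with the new commit hash.
    ref_path.write_text(commit_sha + "\n")
    return ref_path


# Demo in a throwaway directory with a hand-written HEAD file.
with tempfile.TemporaryDirectory() as tmp:
    git_dir = Path(tmp) / ".git"
    git_dir.mkdir()
    (git_dir / "HEAD").write_text("ref: refs/heads/master\n")
    ref = update_current_branch(git_dir, "0" * 40)  # dummy commit hash
    print(ref.relative_to(tmp), "->", ref.read_text().strip())
```

Note that HEAD itself is left unchanged: only the branch file it names is rewritten.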

My Opinion

Through exploring these underlying implementations of git, I find that it is sometimes important to understand the internals of the technology and stack that we use every day and take for granted. This becomes useful when we run into problems: when we’re knowledgeable about the underlying implementation, we’re more aware of which solutions exist.

I think working in computer science keeps getting more convenient. When we run into problems, there is almost always a StackOverflow thread that helps us solve them. I think being a good software engineer and problem solver means being able to come up with solutions effectively when problems arise. I really recommend diving deep into the underlying technology to acquire knowledge that helps us solve everyday problems.

Thank you for reading, I hope it was useful :)