Knowing Git Inside-Out
Although we programmers use Git every day, many of us don’t really know what it does internally. In this article, I’ll try to explain what Git really is and how it works under the hood.
Types of Git commands
We use many Git commands in our day-to-day work, such as git commit, git push and git checkout. These common, user-friendly commands are called “Porcelain” commands. But because Git was initially a toolkit for version control rather than a full user-friendly VCS, it also has many subcommands that do low-level work, and those are called “Plumbing” commands.
Plumbing commands work at a low level and give access to the inner workings of Git, so we’ll mostly be dealing with them in this article. Many of these commands aren’t meant to be typed by hand on the command line; rather, they are meant to be used as building blocks for custom scripts and tools.
Observing a newly initialised .git directory
When we run git init, it creates a .git directory, which contains almost everything Git stores and manipulates. If you want to back up or clone your repo, copying this .git folder gives you almost everything you need. So now let’s look at what the .git directory contains.
A newly initialised .git folder contains entries such as config, hooks, HEAD, objects and refs. You may see somewhat different content depending on your Git version, and Git may add more once you create a file and add it.
config: contains your project-specific configuration options.
hooks: contains your client-side or server-side hook scripts, which run at points such as before a commit or a push.
HEAD, objects and refs are the core parts of Git, and Git changes them most frequently.
objects: stores all the content of your repository; Git also calls it the object database.
refs: stores pointers to commit objects in that data (branches, tags, remotes, etc.).
HEAD: this file contains a pointer to the branch you currently have checked out.
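To make the relationship between HEAD and refs concrete, here is a minimal Python sketch that reads them the way they are laid out on disk. The function name read_head is made up for this illustration, and the sketch assumes loose ref files only (it ignores the packed-refs file that Git may also use).

```python
from pathlib import Path
from typing import Optional, Tuple


def read_head(git_dir: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (ref_name, commit_sha) for the HEAD of a .git directory.

    Hypothetical helper for illustration; it only understands loose refs.
    """
    git_path = Path(git_dir)
    head = (git_path / "HEAD").read_text().strip()

    if not head.startswith("ref: "):
        # Detached HEAD: the file holds a commit SHA-1 directly.
        return None, head

    # Normal case: HEAD names a branch, e.g. "ref: refs/heads/master".
    ref_name = head[len("ref: "):]
    ref_file = git_path / ref_name

    # In a fresh repository the branch file does not exist until the first commit.
    sha = ref_file.read_text().strip() if ref_file.is_file() else None
    return ref_name, sha
```

Calling read_head(".git") inside a freshly initialised repository should return the branch ref name together with None, because no commit exists yet.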
Git is a filesystem
Git is a content-addressable filesystem. That means that at its core Git is a simple key-value data store: you can insert any kind of content into a Git repository, and Git hands you back a unique key that you can use later to retrieve that content.
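The idea of a content-addressable store can be sketched in a few lines of Python. This is a toy, not Git’s implementation: the class name ContentStore is invented here, the data lives in memory, and the key is a SHA-1 of the raw bytes only, whereas Git also hashes a small header.

```python
import hashlib
from typing import Dict


class ContentStore:
    """Toy in-memory content-addressable store (illustrative assumption)."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        # The key is derived from the content itself, so the caller
        # never chooses it; identical content always gets the same key.
        key = hashlib.sha1(data).hexdigest()
        self._objects[key] = data
        return key

    def get(self, key: str) -> bytes:
        # Retrieval needs nothing but the key handed out by put().
        return self._objects[key]
```

Storing the same bytes twice returns the same key and keeps a single copy, which is the property Git relies on to avoid storing duplicate content.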
For demonstration purposes, let’s look at the plumbing command git hash-object, which takes some data, optionally stores it in the .git/objects directory (also called the object database), and gives you back the unique key that now refers to that data object.
Let’s verify that there is nothing in the objects directory yet.
Git has initialised the directory and created the pack and info subdirectories in it, but there are no regular files.
Let’s use the git hash-object command to create a new data object and store it in the Git database manually.
In its simplest form, git hash-object takes the content you hand to it and merely returns the unique key that would be used to store it in your Git database. The -w option tells the command to not only return the key but also write the object to the database. Finally, the --stdin option tells it to read the content to be processed from stdin; otherwise, the command expects a filename argument at the end of the command, naming the file whose content should be used.
The output of the above command is a 40-character checksum hash. This is the SHA-1 hash: a checksum of the content you are storing plus a header.
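The “content plus header” part can be reproduced without Git. The sketch below assumes the default SHA-1 object format and a blob object, whose header is the word blob, a space, the content length in bytes and a NUL byte. The function name hash_blob and the sample content are chosen for this illustration.

```python
import hashlib


def hash_blob(content: bytes) -> str:
    """Compute the key Git would give a blob with this content (SHA-1 repos)."""
    # Header: object type, a space, the size in bytes, then a NUL byte.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# Sample content chosen for this sketch; note the trailing newline,
# which is what echo would add on the command line.
print(hash_blob(b"test content\n"))  # d670460b4b4aece5915caf5c68d12f560a9fe3e4
```

Because the header takes part in the hash, the result differs from a plain SHA-1 of the file content.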
Now if we look at the objects directory again, there is one entry for the new content. This is how Git initially stores content: as a single file per piece of content. The subdirectory is named with the first 2 characters of the SHA-1, and the filename is the remaining 38 characters.
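The directory layout described above can be sketched as follows. The helper name write_loose_object is an assumption made for this example; it mimics what the -w option does for a single loose object, including the zlib compression that Git applies to the header plus content before writing the file.

```python
import hashlib
import os
import zlib


def write_loose_object(git_dir: str, content: bytes, obj_type: str = "blob") -> str:
    """Store content as a loose object under git_dir/objects and return its key."""
    # What gets hashed and stored: "<type> <size>\0" followed by the content.
    store = obj_type.encode("ascii") + b" " + str(len(content)).encode("ascii") + b"\0" + content
    sha = hashlib.sha1(store).hexdigest()

    # First 2 hex characters name the subdirectory, the other 38 name the file.
    obj_dir = os.path.join(git_dir, "objects", sha[:2])
    os.makedirs(obj_dir, exist_ok=True)
    path = os.path.join(obj_dir, sha[2:])

    # Identical content maps to the same path, so it is written only once.
    if not os.path.exists(path):
        with open(path, "wb") as f:
            f.write(zlib.compress(store))
    return sha
```

Pointing git_dir at a scratch directory rather than a real repository is the safer way to experiment with this sketch.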
Retrieving the content stored by git hash-object
Once the content is in the Git object database, we can get it back with the git cat-file command. Passing -p tells the command to first figure out the type of the object and then display its content appropriately.
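Reading an object back is the reverse of storing it. The sketch below is a rough stand-in for what git cat-file has to do for a loose object: decompress the file, split off the header and return the type and the content. The name read_loose_object is hypothetical, and packed objects are not handled.

```python
import os
import zlib
from typing import Tuple


def read_loose_object(git_dir: str, sha: str) -> Tuple[str, bytes]:
    """Return (type, content) of a loose object stored under git_dir/objects."""
    path = os.path.join(git_dir, "objects", sha[:2], sha[2:])
    with open(path, "rb") as f:
        raw = zlib.decompress(f.read())

    # The header ends at the first NUL byte: b"<type> <size>".
    header, _, body = raw.partition(b"\0")
    obj_type, size = header.decode("ascii").split(" ")

    # Sanity check: the recorded size must match the actual content length.
    if int(size) != len(body):
        raise ValueError("size in header does not match content length")
    return obj_type, body
```

The first element of the result corresponds to what the -t option prints, and the second is the raw content that -p would format for display.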
Now we are able to store content in Git and retrieve it again. We can do the same with the content of a file, which gives us simple version control.
Version Controlling
Let us try creating a file with some content and putting it under version control.
Now that we have stored the file’s content, let us write some new content to the file and store that as another version.
Now the object database contains both versions of the file: the first one as well as the second.
We can now delete the local file file.txt and still retrieve either the first or the second version of its content from the Git object database.
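We could store every piece of content as a Git object like this, but it is not practical: we would have to remember the SHA-1 key of each version. Also, Git is not storing the filename, just the content. This type of object is called a “blob”. You can see the type of any object by running git cat-file -t followed by its SHA-1.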
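The point that a blob records content but no filename can be shown with a small in-memory sketch. The contents "version 1" and "version 2" are assumptions picked for this example, not necessarily what the file in the article held, and the dictionary stands in for the object database.

```python
import hashlib
from typing import Dict


def blob_key(content: bytes) -> str:
    """Key Git would assign to a blob with this content (SHA-1 repos)."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def save_version(database: Dict[str, bytes], content: bytes) -> str:
    """Store one version of some content and return its key."""
    key = blob_key(content)
    database[key] = content
    return key


database: Dict[str, bytes] = {}  # stand-in for .git/objects

# Two versions of the same (hypothetical) file; the filename is never used.
v1 = save_version(database, b"version 1\n")
v2 = save_version(database, b"version 2\n")

# Either version can be recovered from its key alone.
print(v1, database[v1])  # 83baae61804e65cc73a7201a7252750c76066a30 b'version 1\n'
print(v2, database[v2])  # 1f7a7a472abf3dd9643fd615f6da379c4acb3e3a b'version 2\n'
```

Nothing in the database says which file these versions belonged to, which is exactly the gap that tree objects fill.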
Tree Objects
To solve this problem and store the filename along with the content, Git has tree objects. A tree also lets Git store a group of files together, which is what we normally do when making commits. Git stores content in a manner similar to a UNIX filesystem, but a bit simplified. All content is stored as tree and blob objects, with trees corresponding to UNIX directories and blobs corresponding more or less to file contents. A single tree object contains one or more entries, each of which is the SHA-1 of a blob or subtree together with its mode, type and filename.
Now let us create some files and make a commit manually, using only low-level commands.
We first need to create a tree, which we do by staging some files. Git normally creates a tree by taking the state of your staging area, or index, and writing tree objects from it. The staging area is where the changes for your next commit are collected.
To stage the file we can use the git update-index command. We’ll use it to add our file.txt to the staging area. We need to pass the --add option because the file doesn’t exist in the staging area yet (we don’t even have a staging area set up yet). We also need the --cacheinfo option because the content we are adding is not in the working directory but in the object database. The option is followed by three values: the mode, the SHA-1 and the filename.
Modes
100644 - Normal file
100755 - Executable file
120000 - Symbolic link.
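As a rough illustration of how these three modes relate to a file on disk, the following sketch picks one for a given path. The function name git_mode is invented here; the sketch assumes a POSIX-style filesystem and does not cover directories, which Git records as trees rather than blobs.

```python
import os
import stat


def git_mode(path: str) -> str:
    """Pick the mode string Git would record for a file at this path."""
    st = os.lstat(path)  # lstat, so a symbolic link is not followed

    if stat.S_ISLNK(st.st_mode):
        return "120000"  # symbolic link
    if st.st_mode & stat.S_IXUSR:
        return "100755"  # executable file (owner execute bit set)
    return "100644"      # normal file
```

Git does not track the full permission bits of a file; for regular files it only distinguishes executable from non-executable.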
We can now use git write-tree to write the staging area out to a tree object. This command automatically creates a tree object from the state of the index if that tree doesn’t exist yet.
A few things to notice here: git write-tree has created the tree and printed its SHA-1, and when we check the type of that object, it says tree.
We can now create a new version of the file and also add a new file to the tree.
Notice that the latest tree lists both files along with their SHA-1 values.
Commit Objects
We have now stored content together with filenames using trees, but the original problem remains: we still have to remember a few SHA-1 values, which is not practical. You cannot track your versions with these SHA-1 values alone; you also need a message that identifies what each snapshot contains. We need to know why we saved a snapshot and when we saved it. This is the basic information that makes the history useful.
To solve this problem we create a commit object that carries this basic information, and we can do that using git commit-tree.
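A commit object is plain text wrapped in the usual header. The sketch below assembles one from a tree key, an author line, a timestamp and a message. The function names and the sample author are made up for this illustration, and for simplicity the author and the committer are assumed to be the same person.

```python
import hashlib
from typing import Optional


def build_commit(tree: str, message: str, author: str, timestamp: int,
                 tz: str = "+0000", parent: Optional[str] = None) -> bytes:
    """Serialise a commit object including its header."""
    lines = ["tree " + tree]
    if parent is not None:
        # Every commit after the first points back at its parent commit.
        lines.append("parent " + parent)
    # "When" is recorded as seconds since the epoch plus a timezone offset.
    lines.append("author {} {} {}".format(author, timestamp, tz))
    lines.append("committer {} {} {}".format(author, timestamp, tz))
    # A blank line separates the metadata from the message ("why").
    body = ("\n".join(lines) + "\n\n" + message + "\n").encode("utf-8")
    return b"commit " + str(len(body)).encode("ascii") + b"\0" + body


def commit_sha(raw_commit: bytes) -> str:
    """Key Git would assign to this commit (SHA-1 repos)."""
    return hashlib.sha1(raw_commit).hexdigest()


# Usage with placeholder values; the tree key here is Git's empty tree.
raw = build_commit(
    tree="4b825dc642cb6eb9a060e54bf8d69288fbee4904",
    message="first commit",
    author="A U Thor <author@example.com>",
    timestamp=1700000000,
)
print(commit_sha(raw))
```

Because the timestamp is part of the hashed text, committing the same tree at a different time yields a different commit SHA-1.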
Wow, we just made a commit using only low-level Git commands.
This is essentially what Git does when we run the git add and git commit commands: it stores the blobs and trees for what you have staged and uses them to make the commit.
That’s all for this article, thanks for reading. Keep committing :)