What really happens when I do: git add

Rafael Silva
Apr 15, 2023


If you have worked with Git before, you have had to run git add before saving your changes, or committing them, to use the more appropriate Git terminology. Have you ever wondered what actually happens when you run git add? Nothing “visible” changes, but somehow, if you run git status, your files suddenly appear in a magical place known as “the staging area.” Let’s dive into this mystery together.

This post is part of the series: The way I wish people had taught me git.

When you git clone a repository or create a brand new one with git init, a directory named .git is created by default. The leading . marks it as hidden, so it stays out of your sight unless you ask for it, usually by running ls -a or similar.

Actually, there are occasions where .git is just a simple file and not a directory. But we are getting ahead of ourselves; I’ll leave that for another time.

If you list the contents of the .git directory, it looks like this (with the version I have installed, v2.40.0):

$ tree -FL 1 .git/
.git/
├── branches/
├── COMMIT_EDITMSG
├── config
├── description
├── HEAD
├── hooks/
├── index
├── info/
├── logs/
├── objects/
├── ORIG_HEAD
└── refs/

This is where git stores its objects: all the committed history, your project’s files, branches, tags, etc. For this article, however, the most important piece is the index file.

So, what’s so important about the .git/index?

The .git/index file, or .dircache/index to use its birth name, is a simple binary file that holds metadata about the files you have added with (wait for it) git add. It provides an efficient way for git to track file changes and, later, to save them when you commit the changes to the project.

With git status you can check how the index differs from the last commit and, therefore, what’s currently staged to be committed. For example, after running git add file.c, you’ll see the file listed in the staging area:

$ git status
On branch main
Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
        modified:   file.c

This essentially means that the file’s information is in .git/index, patiently waiting for you to save it. And once git commit is executed, all the files in the staging area are saved under a new commit object:

$ git log --stat
commit 925fe5a307a7b970c3a6e8f29ef783f12bb88463 (HEAD -> main)
Author: Rafael Silva <email@doesnot.exist>
Date:   Fri Feb 30 00:00:00 2023 +0200

    add file.c to the project

 file.c | 1 +
 1 file changed, 1 insertion(+)

[ ... more commits will follow ... ]

In the above output, we run git log to show all the commits, or history, present in the project, with --stat to show the files that were stored with each commit. We can see our little file.c, which we git add’ed earlier, at the bottom.

Alright, that’s pretty simple! When you’re using git, you git add the files that you intend to save and run git commit to actually save them.

Now, let’s pull back the curtain on the .git/index file and see what it holds inside.

The guts of the .git/index file

The index file is a simple binary file that holds an array of file metadata: each file’s path, permissions, owner, change and modification times, etc.

[Figure omitted. Caption: This picture is a mere representation!]

The content of the file, however, is not saved directly in the index. Instead, a blob object is created to store the file content during the execution of git add. The content is also compressed to reduce Git’s storage usage.
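To make this less abstract, here is a minimal Python sketch of roughly what happens to the content during git add. It assumes a repository that uses SHA-1 object ids and stores the blob as a loose object, and it ignores filters such as line-ending conversion. The function names encode_blob and loose_object_path are mine, not Git’s.

```python
import hashlib
import zlib


def encode_blob(content: bytes) -> tuple:
    """Return (object id, compressed bytes) for a blob holding `content`."""
    # A blob is stored as the header "blob <size>\0" followed by the raw content.
    store = (b"blob %d\x00" % len(content)) + content
    oid = hashlib.sha1(store).hexdigest()  # the id is the SHA-1 of header + content
    return oid, zlib.compress(store)       # loose objects are zlib-compressed


def loose_object_path(oid: str) -> str:
    """Path of a loose object: first two hex digits form the directory name."""
    return f".git/objects/{oid[:2]}/{oid[2:]}"


oid, packed = encode_blob(b"hello\n")
print(oid)                     # ce013625030ba8dba906f756967f9e9ca394464a
print(loose_object_path(oid))  # .git/objects/ce/013625030ba8dba906f756967f9e9ca394464a
```

If you have Git at hand, echo hello | git hash-object --stdin should print the same id, which is a nice way to convince yourself that the object id depends only on the content.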

Remember that .git, or to be more precise .git/objects, is where git stores all its objects? Git has, basically, three fundamental object types (annotated tags are a fourth, but we can ignore them here):

  • A blob object, which holds the content of the file;
  • A tree object, which is the Git way of representing a directory: an object that simply points to other objects, which can be blobs or other trees, which is exactly what a directory is ;)
  • A commit object, which defines a point in time where the project was saved. It contains information about the author, committer and date, plus a pointer to the project’s root tree and to zero or more previous points in time, effectively creating a history or timeline of changes since the beginning of your project.
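The way the three objects point at each other can be sketched in a few lines of Python. This is a simplified illustration, assuming SHA-1 object ids and a tree that contains only regular files; the helper names (object_id, tree_body, commit_body) and the timestamp in the author line are made up for the example.

```python
import hashlib


def object_id(kind: bytes, body: bytes) -> str:
    """SHA-1 of "<kind> <size>\\0<body>", the scheme shared by all object types."""
    return hashlib.sha1(kind + (b" %d\x00" % len(body)) + body).hexdigest()


def tree_body(entries) -> bytes:
    """entries: (mode, name, hex object id) tuples, sorted by name.

    Simplification: plain files only, so a bytewise sort by name is enough.
    """
    out = b""
    for mode, name, oid in sorted(entries, key=lambda e: e[1].encode()):
        out += mode.encode() + b" " + name.encode() + b"\x00" + bytes.fromhex(oid)
    return out


def commit_body(tree: str, parents, author: str, message: str) -> bytes:
    """A commit names one tree, zero or more parents, and who/when/why."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]
    lines += [f"author {author}", f"committer {author}", "", message]
    return ("\n".join(lines) + "\n").encode()


blob = object_id(b"blob", b"int main(void) { return 0; }\n")
tree = object_id(b"tree", tree_body([("100644", "file.c", blob)]))
who = "Rafael Silva <email@doesnot.exist> 1681516800 +0200"  # hypothetical timestamp
commit = object_id(b"commit", commit_body(tree, [], who, "add file.c to the project"))
print(blob, tree, commit, sep="\n")
```

Notice that the commit only mentions the tree id, and the tree only mentions the blob id: change one byte of file.c and all three ids change.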

But let’s hold on for a moment, because there is something worth discussing here. Git could have ignored the file content until git commit, when we actually want to save it, and created the blob object then. However, it is designed to create the blob during git add, which most of the time is done incrementally: you work on a file and git add it, you work on some other files and git add them, and so on. When it’s time to commit, the content of every file is already in the object storage, inside .git/objects. Thus, during a commit, git basically just needs to write the tree objects that describe the directory layout and one new commit object. This is one of the reasons why the commit operation is so fast.

Interestingly, because Git already takes the content of the file when you add it to the index, if you change the content of the same file, you’ll see something like this:

$ git status
On branch main
Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
        modified:   file.c

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
        modified:   file.c

In the above output, we can see that our file.c appears both in the staging area (the version we added earlier) and among the changes that are not staged. This makes sense: the content of the file is captured when you run git add, so if you change the content again you have to tell Git about it. The index keeps the old snapshot until you do.
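Here is a toy model of that double listing, written in Python. It is only a sketch: the dictionaries standing in for the last commit, the index and the working directory, and the names blob_id and toy_status, are my own invention, and real Git first compares the stat data stored in the index so it can avoid re-hashing unchanged files.

```python
import hashlib


def blob_id(content: bytes) -> str:
    """Object id Git gives a blob with this content (SHA-1 repositories)."""
    return hashlib.sha1((b"blob %d\x00" % len(content)) + content).hexdigest()


def toy_status(head: dict, index: dict, worktree: dict) -> tuple:
    """Return (staged, unstaged) path lists.

    head and index map path -> blob id; worktree maps path -> file content.
    """
    # Staged: the index entry differs from what the last commit recorded.
    staged = sorted(p for p, oid in index.items() if head.get(p) != oid)
    # Not staged: the file on disk no longer hashes to the index entry.
    unstaged = sorted(
        p for p, oid in index.items()
        if p in worktree and blob_id(worktree[p]) != oid
    )
    return staged, unstaged


committed = b"int main(void) { return 0; }\n"
added = b"int main(void) { return 1; }\n"         # content when git add ran
edited_again = b"int main(void) { return 2; }\n"  # content on disk right now

head = {"file.c": blob_id(committed)}
index = {"file.c": blob_id(added)}
worktree = {"file.c": edited_again}
print(toy_status(head, index, worktree))  # (['file.c'], ['file.c'])
```

Running git add file.c again would replace the id in the index with the id of the latest content, and the second list would become empty.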

Alright, we got too excited. Let’s go back to the inner structure of the .git/index file.

The first 12 bytes of .git/index contain the index header. The first 4 bytes hold the file signature, telling git that this actually is an index file. The next 4 bytes are the version, which tells git how to load the file and lets it stay backward compatible with previous versions. Finally, the last 4 bytes of the header record how many entries the index currently holds¹.

In Git’s C source, the header is defined as follows²:

#define CACHE_SIGNATURE 0x44495243 /* "DIRC" */
struct cache_header {
    uint32_t hdr_signature;
    uint32_t hdr_version;
    uint32_t hdr_entries;
};

You can see this for yourself by using, for example, a hexdump tool to dump the start of the binary file.

$ hexdump -C -n 12 .git/index  # -C: hex + ASCII, -n 12: read only the first 12 bytes
00000000  44 49 52 43 00 00 00 02  00 00 00 02              |DIRC........|
0000000c
          ^           ^            ^
          |           |            |
          Signature   Version      # of entries

The first 4 bytes are the cache signature (DIRC), followed by the version (we are seeing version 2 here), and the last 4 bytes are the number of entries in the cache, which also happens to be 2.

The cache signature, DIRC, stands for dircache, which is the index file’s birth name¹.
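If you prefer code to hex, the same 12 bytes can be decoded with Python’s struct module. This is a small sketch; parse_index_header is a name I made up, and the bytes fed to it are the ones from the hexdump above. The index format stores numbers in network byte order (big-endian), hence the > in the format string.

```python
import struct


def parse_index_header(data: bytes) -> tuple:
    """Decode the 12-byte index header into (signature, version, entry count)."""
    # ">4sII": big-endian, a 4-byte signature and two unsigned 32-bit integers.
    signature, version, entries = struct.unpack(">4sII", data[:12])
    if signature != b"DIRC":
        raise ValueError("not a Git index file")
    return signature, version, entries


header = bytes.fromhex("44495243 00000002 00000002")
print(parse_index_header(header))  # (b'DIRC', 2, 2)
```

To try it on a real repository, replace header with open(".git/index", "rb").read(12).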

You can also see that with the file command:

$ file .git/index 
.git/index: Git index, version 2, 2 entries

Although, honestly, I wanted to use hexdump in this article to pretend that I’m smart.

In memory, each file entry is represented by the following structure²:

struct cache_entry {
    struct hashmap_entry ent;
    struct stat_data ce_stat_data;
    unsigned int ce_mode;
    unsigned int ce_flags;
    unsigned int mem_pool_allocated;
    unsigned int ce_namelen;
    unsigned int index; /* for link extension */
    struct object_id oid;
    char name[FLEX_ARRAY]; /* more */
};

This is the information that Git keeps about each file that you have git add’ed (the on-disk entry in .git/index stores a subset of it). Here’s a summary of some of the fields:

  • ce_mode holds the bits representing the file type and Unix permissions;
  • ce_flags holds internal flags used for various purposes;
  • ce_stat_data is a structure that holds the file’s change and modification times (ctime and mtime), the device the file is stored on, the inode number, the file size, the user and group ids, etc. Basically, what the stat() function returns about the file;
  • mem_pool_allocated is an internal flag recording whether the entry was allocated from a memory pool, which Git uses to make memory allocation faster;
  • ce_namelen is the length of the file name, that is, the length of the value in the name field. It is very useful when you are reading the binary file, as you need to know where one entry ends and the next one starts;
  • oid is the object id of the entry’s blob inside the object storage (.git/objects);
  • name is the path and name of the file that you git add’ed.

Using this structure, Git has everything that it needs to know about the file that you added, and it’s ready, whenever you are, to commit it.
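On disk, an entry looks a bit different from the C structure above. As I read the index format documentation¹, a version 2 entry is ten 32-bit stat fields, the 20-byte object id, 16 bits of flags whose low 12 bits carry the name length, and then the path, padded with NUL bytes. The Python sketch below parses one such entry. It assumes index version 2, SHA-1 object ids and paths shorter than 4095 bytes; parse_entry and the synthetic entry it is fed are made up for illustration.

```python
import struct

# Ten 32-bit fields, a 20-byte SHA-1 object id and 16 bits of flags,
# all in network byte order: 62 bytes before the path starts.
ENTRY_HEAD = struct.Struct(">10I20sH")


def parse_entry(data: bytes, offset: int = 0) -> tuple:
    """Parse one version 2 index entry; return (entry dict, next offset)."""
    (ctime_s, ctime_ns, mtime_s, mtime_ns,
     dev, ino, mode, uid, gid, size, oid, flags) = ENTRY_HEAD.unpack_from(data, offset)
    name_len = flags & 0x0FFF  # low 12 bits of the flags hold the path length
    start = offset + ENTRY_HEAD.size
    name = data[start:start + name_len].decode("utf-8")
    # Each entry is padded with 1 to 8 NUL bytes up to a multiple of 8 bytes.
    entry_len = (ENTRY_HEAD.size + name_len + 8) // 8 * 8
    entry = {"name": name, "mode": oct(mode), "size": size,
             "mtime": mtime_s, "oid": oid.hex()}
    return entry, offset + entry_len


# Synthetic entry for illustration only; every value in it is made up.
path = b"file.c"
raw = ENTRY_HEAD.pack(0, 0, 0, 0, 0, 0, 0o100644, 1000, 1000, 29,
                      bytes(20), len(path)) + path
raw += b"\x00" * (-len(raw) % 8 or 8)

entry, next_offset = parse_entry(raw)
print(entry["name"], entry["mode"], next_offset)  # file.c 0o100644 72
```

To walk a real index, start at offset 12 (right after the header) and call parse_entry as many times as the header’s entry count says, feeding each returned offset into the next call. Newer index versions change the entry layout, so check the version in the header first.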

It’s getting late, so let’s sum it up

In order to handle your project files smartly, git has a special place to store information about each file, so that it can act quickly when you instruct it to commit, or save, your changes. This is what git add is all about: processing the file, placing it in that special place, and telling Git that you care about it and want to save it later.

During git add, git already starts to process the file in order to be ready to act quickly. It makes everything appear so fast that you might even worry that it’s not doing the correct thing. But trust me, it is! Most of the time.

When you are ready and execute git commit, git takes all the entries from .git/index and creates a new, well-known, bespoke object: the commit. And by doing so, it makes your files part of history, your project’s history.

References

[1] https://github.com/git/git/blob/v2.40.0/Documentation/gitformat-index.txt

[2] https://github.com/git/git/blob/v2.40.0/cache.h

Footnote

Note: The command output and part of the information in this article were written when Git was at version v2.40.0, so outputs and details might vary depending on which version you are using.
