Git from the bits and pieces, beyond the basics — Part 1
Story of Git
Oh my Git! Nowadays it is an indispensable tool in my day-to-day work. When I first learned Git it was a daunting experience; I came from a Perforce background and I hated it (for good reasons, of course). But today Git is everywhere: developers, DevOps engineers, designers, and even content and art creators use Git in a big way. So knowing this tool by heart will go a long way.
In this story I will share how I learned Git. It is a bit different from the traditional guides you might find elsewhere: my focus is to give you the essence of how Git works behind the scenes and to build up your knowledge from the fundamental concepts.
We will start with the building blocks of Git, its data types and objects: what they are and how they are stored. Then we will go deep into the content change tracking architecture built on Git’s immutable graph data structure (and how reliable it is). Next we will introduce Git remotes and how they enable a decentralized workflow, even with multiple remotes. Then we will learn common Git commands and how they work, based on that foundation of objects and data types. What comes next is super fun: once we have a solid understanding of Git’s inner workings, we will learn an advanced tool called reflog, with which you can undo pretty much any accidental change you might have made (including a hard-deleted branch or a bad rebase). Some of these topics will be covered in the second part of this series.
So, in this post we have a lot of ground to cover. Let’s get started!
How did Git come about?
The Git project was started by Linus Torvalds out of a very serious need for a fast, efficient and massively distributed source code management system for Linux kernel development.
The kernel team had moved from a patch emailing system to the proprietary BitKeeper SCM in 2002. That ended in April 2005 when BitMover stopped providing a free version of its tool to the open source community because they felt some developers had reverse engineered it in violation of the license.
Since Linus had (and still has) a passionate dislike of just about all existing source code management systems, he decided to write his own. Thus, in April of 2005, Git was born. A few months later, in July, maintenance was turned over to Junio Hamano, who has maintained the project ever since.
Git started out as a collection of lower-level functions used in various combinations by shell and Perl scripts. Since 1.0, more and more of the scripts have been rewritten in C (referred to as built-ins), increasing portability and speed.
Though originally used just for the Linux kernel, the Git project spread rapidly and was quickly adopted by a number of other open source projects, such as X.org, Mesa3D, Wine and Fedora. Today, as we all know, almost all software development, especially in the open source world, happens through Git.
Note: This blog post takes a lot of inspiration from open source content. I alone could not have produced such magnificent illustrations. All credit for them is due to the original authors, whom I have listed in the references section at the end of the post. (All borrowed content is used with consent and in line with its terms of use.)
Version Controlling Techniques
So, what is version control? In very simple terms, it is keeping track of your changes as you go along creating and editing content or source code. It lets multiple people or teams collaborate effectively on the same project without stepping on each other’s toes. There are primarily two types of version control: centralized version control, where all interaction happens through a central server, and the more recent decentralized or distributed version control. Git falls into the latter category, along with some other very good choices like Mercurial.
Centralized Version Controlling
In centralized systems, there is generally a single collaboration model — the centralized workflow. One central hub, or repository, can accept code, and everyone synchronizes their work with it. A number of developers are nodes — consumers of that hub — and synchronize with that centralized location.
This means that if two developers clone from the server and both make changes, the first developer to push their changes back up can do so with no problems. The second developer must merge in the first one’s work before pushing changes up, so as not to overwrite the first developer’s changes.
Distributed Version Controlling
Distributed version control, on the other hand, is a form of version control in which the complete codebase, including its full history, is mirrored on every developer’s computer. This enables branching and merging right from the local setup, without having to sync with the server for every little thing. It speeds up most operations (except pushing and pulling) because they happen locally, improves the ability to work offline, and does not rely on a single location for backups.
Being distributed means that there is no central location which keeps track of your data. In a sense, no single copy of the data is more important than any other; everyone plays an equivalent and independent role in tracking changes.
“possibly the biggest advance in software development technology in the [past] ten years”. — Joel Spolsky
Git Repository, Data Types and Object Database
When we talk about Git, it is generally associated with a project/root folder whose content Git is supposed to track. This folder is also known as the Git repository, and it has a hidden folder called .git which holds all Git objects and references.
There are four main object types in Git, all of them immutable: blob, tree, commit and tag. The first three are the most important for really understanding the main functions of Git.
All of these objects are stored in the Git object database, which is kept in the Git directory (the .git folder of the repository).
Each object is compressed (with zlib) and referenced by the SHA-1 value of its contents plus a small header. The point of using a strong hash function like SHA-1 to identify the contents is that it lets us make a guarantee about the contents of the file/object: even the slightest change in the content produces a very different SHA-1. So the integrity of the content, the guarantee that it is always the same, is provided by this hashing mechanism.
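SHA stands for Secure Hash Algorithm. A SHA function produces a fixed-length identifier that, for practical purposes, uniquely identifies a specific piece of content. SHA-1 succeeded SHA-0 and produces a 160-bit digest, which Git writes as 40 hexadecimal characters. SHA-1 is no longer considered safe against deliberately crafted collisions, and newer Git versions can also work with SHA-256, but SHA-1 is what you will see in most repositories. Wikipedia (http://en.wikipedia.org/wiki/SHA1) has more on the topic.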
Blob
In Git, the contents of files are stored as blobs. It is important to note that it is the contents that are stored and tracked, not the files, their types or their filenames. The names and permissions of the files are not stored with the blob, just the contents. In fact, one interesting consequence is that you cannot really track a file in Git. You can only track contents and snapshots of contents.
So what does it look like? Git takes the file content, prepends a small header, and computes the SHA-1 hash of the result. It then compresses those same bytes with zlib (Zlib::Deflate, if you were to do it in Ruby) and stores the blob with the SHA-1 as its key.
This means that if you have two files anywhere in your project that are exactly the same, even if they have different names, Git will only store the blob once and reuse it everywhere possible. The low-level Git command hash-object does exactly this.
Tip: If we pass the -w flag to hash-object, it will also store the hashed object, along with its key, in the Git repository.
Tip: We can even look up the content in the Git database/repository given a hash key. So, if we wanted to get back the content from our previous example, we can simply run
git cat-file -p 52c0e8
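To make the hashing and storage steps concrete, here is a minimal Python sketch of what hash-object, its -w flag and cat-file do conceptually. The function names and the in-memory `object_db` dictionary are my own stand-ins (real Git writes files under .git/objects), so treat it as an illustration under those assumptions rather than Git's implementation.

```python
import hashlib
import zlib

# Hypothetical in-memory stand-in for .git/objects: hex id -> compressed bytes.
object_db = {}

def hash_object(content: bytes, obj_type: str = "blob", write: bool = False) -> str:
    """Return the id Git would give this content; optionally store it (like -w)."""
    header = f"{obj_type} {len(content)}".encode() + b"\0"
    store = header + content
    oid = hashlib.sha1(store).hexdigest()      # the hash covers header + raw content
    if write:
        object_db[oid] = zlib.compress(store)  # compression is only for storage
    return oid

def cat_file(oid: str) -> bytes:
    """Return the content of a stored object with the header stripped."""
    store = zlib.decompress(object_db[oid])
    _header, _, content = store.partition(b"\0")
    return content

# Two "files" with identical bytes end up as one stored blob.
readme_id = hash_object(b"test content\n", write=True)
copy_id = hash_object(b"test content\n", write=True)

print(readme_id)
print(cat_file(readme_id))
print(len(object_db))
```

Because the id depends only on the bytes, the second call does not add anything new to the store, which is the deduplication described above.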
If you have a working directory like the example below, Git would extract the files and represent them in Git Directory as blobs like this.
A working tree is any directory on your filesystem which has a repository associated with it (typically indicated by the presence of a sub-directory within it named .git). It includes all the files and sub-directories in that directory.
Tree
Directories in Git basically correspond to trees.
A tree is a simple list of the trees and blobs that it contains, along with the names and modes of those trees and blobs (very much like a directory in a filesystem). When pretty-printed, for example with git cat-file -p, the contents of a tree object read like a very simple text file that lists the mode, type, SHA and name of each entry.
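As a rough illustration of how a tree ties names to object ids, the sketch below serializes tree entries and hashes them the same way as any other object. The file and folder names (`README`, `docs`, `notes.txt`) and helper names are hypothetical, and the sorting is simplified: real Git orders directory entries as if their names ended in "/".

```python
import hashlib

def object_id(obj_type: str, body: bytes) -> str:
    """SHA-1 over the header (type, size, NUL byte) followed by the body."""
    header = f"{obj_type} {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()

def tree_body(entries) -> bytes:
    """entries: iterable of (mode, name, hex_object_id) tuples."""
    body = b""
    # Simplification: sort by name only (see the note in the text above).
    for mode, name, oid in sorted(entries, key=lambda e: e[1]):
        body += f"{mode} {name}".encode() + b"\0" + bytes.fromhex(oid)
    return body

def pretty(entries):
    """Text listing in the style of the pretty-printed tree: mode, type, id, name."""
    kind = {"100644": "blob", "100755": "blob", "40000": "tree"}
    return [f"{mode.zfill(6)} {kind[mode]} {oid}\t{name}"
            for mode, name, oid in sorted(entries, key=lambda e: e[1])]

blob_id = object_id("blob", b"test content\n")
sub_entries = [("100644", "notes.txt", blob_id)]
sub_tree_id = object_id("tree", tree_body(sub_entries))

# The root tree points at a blob and at the sub-tree, by id.
root_entries = [("40000", "docs", sub_tree_id), ("100644", "README", blob_id)]
root_tree_id = object_id("tree", tree_body(root_entries))

for line in pretty(root_entries):
    print(line)
print(root_tree_id)
```

Note that the same blob id appears under two different names: the names live in the trees, not in the blob.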
So, filling in the gaps in the mapping from file types to Git types, we can now represent our working directory like this:
Commit
So, now that we can store arbitrary trees of content in Git, where does the ‘history’ part of ‘tree history storage system’ come in? The answer is the commit object.
Whenever you save your changes in one or more files, you can create a new commit in Git. A commit is like a snapshot of your entire repository at that point in time, not just of one or two files.
The commit is very simple, much like the tree. It simply points to a tree and keeps an author, committer, message and any parent commits that directly preceded it.
Most of the time a commit will have only a single parent, but if you merge two branches, the resulting commit will point to both of them. Also, the very first commit in a repository will not have any parent at all.
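The following sketch shows, under simplifying assumptions, what a commit object contains and how the number of parent lines varies. The identity string, timestamp and helper names are placeholders of mine, and for brevity every commit points at the same (empty) tree.

```python
import hashlib

def object_id(obj_type: str, body: bytes) -> str:
    header = f"{obj_type} {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()

def commit_body(tree, parents, message,
                who="A U Thor <author@example.com>", when="0 +0000") -> bytes:
    """Build the text of a commit: tree, zero or more parents, identities, message."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]
    # Simplification: author and committer are the same placeholder identity.
    lines += [f"author {who} {when}", f"committer {who} {when}", "", message]
    return ("\n".join(lines) + "\n").encode()

EMPTY_TREE = object_id("tree", b"")  # id of a tree with no entries

root = commit_body(EMPTY_TREE, [], "first commit")            # no parent
root_id = object_id("commit", root)

second_id = object_id("commit", commit_body(EMPTY_TREE, [root_id], "second"))
other_id = object_id("commit", commit_body(EMPTY_TREE, [root_id], "on a branch"))

merge = commit_body(EMPTY_TREE, [second_id, other_id], "merge")  # two parents
merge_id = object_id("commit", merge)

print(root.decode())
print(merge.decode())
```

Since the parent ids are part of the hashed text, a commit's id depends on its entire ancestry.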
To see it in action, we can continue with our example file above: move it under a folder, create a sample commit, and then use cat-file to show the content of the tree object.
The objects in the Git repository now look like this:
Four objects will have been created. To really go deep, let’s look at their contents.
So you can see that, for the commit to happen, Git created two tree objects, one for the root folder and one for the contents of the exampleTree folder. There is also the earlier blob object, holding the contents of the file we discussed, and lastly the commit object itself. See, simple and elegant.
Tag
The final type of object you will find in a Git database is the tag. This is an object that provides a permanent shorthand name for a particular commit. It contains an object, type, tag, tagger and a message. Normally the type is commit and the object is the SHA-1 of the commit you’re tagging. The tag can also be GPG signed, providing cryptographic integrity to a release or version.
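Tip: Create an annotated tag (for example with git tag -a); it will create a new object in the .git/objects folder. We only have a handful of objects so far, so get its key (the first few characters of the SHA-1) and try cat-file on it. A lightweight tag, by contrast, is just a reference and does not create a new object.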
Git References
In addition to the Git objects, which are immutable (that is, they can never be changed), Git also stores references. Unlike the objects, references can change constantly. They are simple pointers to a particular commit, something like a tag, but easily moveable.
Git Branches and Remotes
Examples of references in Git are branches and remotes. A branch in Git is nothing more than a file in the .git/refs/heads/ directory that contains the SHA-1 of the most recent commit on that branch. To branch that line of development, all Git does is create a new file in that directory that points to the same SHA-1. As you continue to commit, one of the branches will keep changing to point to the new commit SHA-1s, while the other one stays where it was.
The basic data model I’ve been explaining so far looks something like this, we have Git Objects (Commits, Trees, Blobs) and Git References (Branches, remotes shown in grey).
Here, HEAD is what is called a symbolic reference. It is created when the Git repository is initialized, and it always points to the branch that is currently checked out. HEAD should ideally point to a branch, but if you check out a commit directly, HEAD becomes detached from the branch and points to the commit itself.
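Here is a small sketch of branches as files and HEAD as a symbolic reference, using a throwaway directory that imitates the layout described above. The commit ids are placeholders rather than real hashes, `resolve_head` is a hypothetical helper, and real Git may also keep references in a packed-refs file, which this ignores.

```python
import tempfile
from pathlib import Path

def resolve_head(git_dir: Path):
    """Return (ref_name_or_None, commit_id) for the HEAD file in git_dir."""
    head = (git_dir / "HEAD").read_text().strip()
    if head.startswith("ref: "):                 # symbolic: HEAD names a branch
        ref = head[len("ref: "):]
        return ref, (git_dir / ref).read_text().strip()
    return None, head                            # detached: HEAD holds a commit id

git_dir = Path(tempfile.mkdtemp()) / ".git"
heads = git_dir / "refs" / "heads"
heads.mkdir(parents=True)

first = "a" * 40    # placeholder commit ids
second = "b" * 40

(heads / "master").write_text(first + "\n")
(git_dir / "HEAD").write_text("ref: refs/heads/master\n")

# Creating a branch is just writing another file holding the same commit id.
(heads / "feature").write_text((heads / "master").read_text())

# A new commit on master moves only master; feature stays where it was.
(heads / "master").write_text(second + "\n")
on_branch = resolve_head(git_dir)

# Checking out a commit directly detaches HEAD from any branch.
(git_dir / "HEAD").write_text(first + "\n")
detached = resolve_head(git_dir)

print(on_branch)
print(detached)
```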
So, at its minimum, a Git repository that is tracking content on a branch has the branch pointing to a commit, which points to a tree, which in turn holds references to more trees and blobs, making the whole structure look like a graph.
Now, let’s say we change some content, which changes some files but not all. When we commit the changes, you can see in the illustrations below how the unmodified blobs are kept the same, while the new commit takes a snapshot of the contents at that point in time and points to a new tree, with the previous commit as its parent. At this point the branch also points to the new commit.
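The reuse of unmodified blobs can be sketched as follows. This is a simplified model of my own: a flat directory, commits without author information, and an in-memory `objects` dictionary as the store; `snapshot` is a hypothetical helper, not a Git command.

```python
import hashlib

objects = {}  # hypothetical object store: id -> (type, body)

def put(obj_type: str, body: bytes) -> str:
    oid = hashlib.sha1(f"{obj_type} {len(body)}".encode() + b"\0" + body).hexdigest()
    objects[oid] = (obj_type, body)   # storing the same content twice is a no-op
    return oid

def snapshot(files, parent=None, message=""):
    """files: dict name -> bytes. Returns (commit_id, {name: blob_id})."""
    blob_ids = {name: put("blob", data) for name, data in files.items()}
    tree = b"".join(
        f"100644 {name}".encode() + b"\0" + bytes.fromhex(blob_ids[name])
        for name in sorted(blob_ids)
    )
    tree_id = put("tree", tree)
    lines = [f"tree {tree_id}"] + ([f"parent {parent}"] if parent else []) + ["", message]
    return put("commit", ("\n".join(lines) + "\n").encode()), blob_ids

c1, blobs1 = snapshot({"a.txt": b"one\n", "b.txt": b"two\n"}, message="first")
count_after_first = len(objects)      # 2 blobs + 1 tree + 1 commit

# Only b.txt changes; a.txt hashes to the blob that is already stored.
c2, blobs2 = snapshot({"a.txt": b"one\n", "b.txt": b"two, edited\n"},
                      parent=c1, message="second")

print(blobs1["a.txt"] == blobs2["a.txt"])
print(count_after_first, len(objects))
```

The second commit adds only three objects (one new blob, one new tree, one new commit), even though it is a full snapshot.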
We can go ahead and make more changes and commit, and you will see how the whole graph evolves. We can also tag commits; tags are again permanent and, unlike branch references, do not move. So, after the next commit, even though the branch points to the latest commit, the tag created on the earlier commit remains in the same place.
In a sense, you can see what Git is doing: it is creating an immutable graph out of the basic object types we talked about earlier, and the objects are strongly interconnected because of the SHA-1 references they hold. For example, if a file were tampered with, not only would its SHA-1 change, but every object tracking that file and its SHA-1 would need to be updated as well, which in turn would change almost the whole graph. That is why Git is so reliable and tampering is so hard to hide.
In fact, if you have heard of blockchains, they essentially work on the same principle: each ledger block is hashed and linked by reference to the previous block, so that no one can tamper with a block in the middle without the whole chain collapsing.
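The tamper-evidence argument can be demonstrated with a deliberately simplified object encoding (plain text that mentions child ids, not Git's real format). Everything here is hypothetical scaffolding: `build_store` creates a blob, a tree and a commit, and `corrupted` re-hashes every stored object, loosely in the spirit of an integrity check.

```python
import hashlib

def oid_of(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest()

def build_store(content: bytes):
    """Build a tiny blob -> tree -> commit chain; returns (store, blob, tree, commit)."""
    store = {}

    def put(data: bytes) -> str:
        oid = oid_of(data)
        store[oid] = data
        return oid

    blob = put(b"blob:" + content)
    tree = put(f"tree: greeting.txt={blob}".encode())   # the tree mentions the blob id
    commit = put(f"commit: tree={tree}".encode())       # the commit mentions the tree id
    return store, blob, tree, commit

def corrupted(store):
    """Ids whose stored bytes no longer hash to their own key."""
    return [oid for oid, data in store.items() if oid_of(data) != oid]

store, blob, tree, commit = build_store(b"hello\n")
clean_before = corrupted(store)

# Tampering in place is detectable: the bytes no longer match the key.
store[blob] = b"blob:HELLO\n"
clean_after = corrupted(store)

# Changing the content "legitimately" yields new ids all the way up the chain.
_, blob2, tree2, commit2 = build_store(b"HELLO\n")

print(clean_before, clean_after == [blob])
print(commit != commit2)
```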
So far, content tracking is all good, but how does Git actually retrieve these objects in practice? Well, it gets the initial SHA-1 of the starting commit object by looking up the reference (stored in the .git/refs directory) for the branch, tag or remote you specify. Then it traverses the objects by walking the trees one by one, checking out the blobs under the names listed.
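The retrieval walk described above can be sketched like this. The object store is hand-built, with short labels instead of real SHA-1 ids, and the file names are made up; `checkout` here only returns a path-to-content mapping instead of writing files.

```python
# Hypothetical, hand-built object store; ids are short labels, not real hashes.
objects = {
    "c1": ("commit", {"tree": "t-root", "parents": []}),
    "t-root": ("tree", [("README", "b1"), ("src", "t-src")]),
    "t-src": ("tree", [("main.py", "b2")]),
    "b1": ("blob", b"docs\n"),
    "b2": ("blob", b"print('hi')\n"),
}
refs = {"refs/heads/master": "c1"}

def walk_tree(tree_id, prefix=""):
    """Yield (path, content) for every blob reachable from a tree."""
    _type, entries = objects[tree_id]
    for name, oid in entries:
        obj_type, payload = objects[oid]
        if obj_type == "tree":
            yield from walk_tree(oid, prefix + name + "/")   # descend into sub-tree
        else:
            yield prefix + name, payload                      # a file to write out

def checkout(ref):
    """Reference -> commit -> root tree -> all blobs under their names."""
    commit_id = refs[ref]
    _type, commit = objects[commit_id]
    return dict(walk_tree(commit["tree"]))

working_tree = checkout("refs/heads/master")
for path, content in working_tree.items():
    print(path, content)
```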
See how elegantly it unfolds. It is fast as well, because there is not really much to do. It is a self-evident and reliable way to track contents. Also, notice that creating a branch is a very cheap operation for Git. It is just another file in .git/refs holding the 40-character SHA-1 hash of the commit it points to.
Note: Conceptually, Git tracks full file contents and not diffs or patches. This makes life much simpler than in systems where checking out a file means taking a base version and applying the series of diffs accumulated over its history, which can be slow and error prone. Git’s approach of being a dumb content tracker really shines here. It also means that non-text content such as binary files is tracked in exactly the same way as text. (As a storage optimization Git may later pack objects and delta-compress them, but that happens below the object model and does not change what a commit means.)
Git’s cheap branching is in sharp contrast to the way most older VCS tools branch, which involves copying all of the project’s files into a second directory. That can take several seconds or even minutes, depending on the size of the project, whereas in Git the process is practically instantaneous. Also, because we record the parents when we commit, finding a proper merge base for merging is done for us automatically and is generally very easy.
Using Git Commands
Unlike other, similar tools you may have used, Git does not commit changes directly from the working tree into the repository. Instead, changes are first registered in something called the index. Think of it as a way of “confirming” your changes, one by one, before doing a commit (which records all your approved changes at once). Some find it more helpful to call it the “staging area” instead of the index. The index also gives you tremendous flexibility to choose exactly which changes are ready to go into the repository as a commit.
Git Add
The git add command adds a change in the working directory to the staging area, or index. It tells Git that you want to include updates to a particular file in the next commit. However, git add doesn’t really affect the repository in any significant way; changes are not actually recorded until you run git commit.
In conjunction with the git add command, you’ll also need git status to view the state of the working directory and the staging area.
git add <file>
git status
Stage all changes in <file> for the next commit.
git add <directory>
git status
Stage all changes in <directory> for the next commit. Visually, you can see that git add simply adds the changes to the index, and git commit only commits what is in the index. If nothing has been staged, git commit will not know about your modifications, even if you have changed files in the working tree.
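A toy model makes the relationship between working tree, index and commit easier to see. The `Repo` class below is purely illustrative: it stores file contents directly (real Git stages blob ids), and its `status` is much cruder than the real command.

```python
class Repo:
    """Toy model with three areas: working tree, index and committed snapshots."""

    def __init__(self):
        self.working_tree = {}   # path -> content as it is on disk
        self.index = {}          # path -> content staged for the next commit
        self.commits = []        # list of (message, snapshot of the index)

    def add(self, path):
        """Stage the current working-tree content of one path."""
        self.index[path] = self.working_tree[path]

    def status(self):
        """Paths whose working-tree content differs from what is staged."""
        return sorted(p for p, c in self.working_tree.items()
                      if self.index.get(p) != c)

    def commit(self, message):
        """Record whatever is in the index; the working tree is never consulted."""
        self.commits.append((message, dict(self.index)))

repo = Repo()
repo.working_tree["a.txt"] = "v1"
repo.working_tree["b.txt"] = "v1"

repo.add("a.txt")            # only a.txt is staged
repo.commit("add a")         # so only a.txt ends up in the commit

repo.working_tree["a.txt"] = "v2"   # edited again, but not re-added

print(repo.commits[0])
print(repo.status())
```

After the last edit, both paths show up as differing from the index: b.txt was never staged, and a.txt has changed since it was staged.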
Git Commit
The git commit command captures a snapshot of the project’s currently staged changes. Committed snapshots can be thought of as “safe” versions of a project; Git will never change them unless you explicitly ask it to. Prior to running git commit, the git add command is used to promote, or “stage”, the changes that will be stored in the commit. These two commands, git commit and git add, are two of the most frequently used.
Commits can be thought of as snapshots or milestones along the timeline of a Git project. Snapshots are always committed to the local repository; Git doesn’t force you to interact with a central repository until you’re ready. Just as the staging area is a buffer between the working directory and the project history, each developer’s local repository is a buffer between their contributions and the central repository. More on this when we cover remotes in Part 2.
git commit -m "commit message"
Git Log
The git log command displays committed snapshots. It lets you list the project history, filter it, and search for specific changes. By default, git log shows the history starting from the current HEAD.
Tip: There is a complementary command line tool called tig, built on top of Git, which is a very powerful way to browse logs, changes and the whole Git graph right from the terminal. I use it all the time.
Git Checkout
If git add and git commit are the commands for getting things into the Git repository, git checkout is the command for getting contents out of the Git repository and into the working tree, or current working directory.
Git checkout is a little overloaded, so I will show you the three variations that are used the most.
If we want to check out a particular branch, we say git checkout <branchname> and it does the following:
- Moves HEAD from its previous position to the branch being checked out.
- Updates the working tree with the contents stored in the repository for that branch’s latest commit.
- Updates the index to match that commit.
In another variation, if we use git checkout -b <new_branch> then it creates a new branch named <new_branch> from the current HEAD and switches to it. This is very commonly used when you want to create a branch to work on some feature without affecting the main branch.
Lastly, we can simply do git checkout <commit_id>. This also works, but the commit being checked out might not have a branch pointing to it. In that case you will see that Git reports a detached HEAD.
Git Reset
Git reset does pretty much what the name says: it resets the branch, the index and optionally the working tree to a commit stored in the repository. If we want to discard the changes currently staged for commit, we can use git reset, which resets the index but leaves the working directory alone. If we also want the working directory to match, and not just the index, we use git reset --hard, which updates both the index and the working directory. On the other hand, if we want only the branch reference to move, without touching either the index or the working tree, we can use git reset --soft.
For example, to move the branch back to three commits before where it is now, we can run git reset HEAD~3, where HEAD~3 is a relative way of referring to the commit three steps before the current HEAD.
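The three reset modes and the HEAD~3 notation can be modelled in a few lines. This is a sketch under my own assumptions: commits are a hand-written chain with one file each, and the repository state is a plain dictionary with `branch`, `index` and `worktree` keys.

```python
# Hypothetical linear history: commit id -> (first parent, snapshot of files).
commits = {
    "c1": (None, {"f": "1"}),
    "c2": ("c1", {"f": "2"}),
    "c3": ("c2", {"f": "3"}),
    "c4": ("c3", {"f": "4"}),
}

def ancestor(commit_id, n):
    """Follow the first parent n times, like HEAD~n."""
    for _ in range(n):
        commit_id = commits[commit_id][0]
    return commit_id

def reset(state, target, mode="mixed"):
    """Return the new state after resetting the current branch to target."""
    state = dict(state)
    state["branch"] = target                           # every mode moves the branch
    if mode in ("mixed", "hard"):
        state["index"] = dict(commits[target][1])      # mixed and hard reset the index
    if mode == "hard":
        state["worktree"] = dict(commits[target][1])   # only hard touches the files
    return state

start = {"branch": "c4", "index": {"f": "4"}, "worktree": {"f": "4 plus local edits"}}
target = ancestor("c4", 3)

soft = reset(start, target, "soft")
mixed = reset(start, target)           # the default mode
hard = reset(start, target, "hard")

for name, state in (("soft", soft), ("mixed", mixed), ("hard", hard)):
    print(name, state)
```

Only the hard reset throws away the local edits in the working tree, which is why it deserves extra care.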
Git Merge and Git Rebase
Git Merge and Git Rebase serve the same purpose. They are designed to integrate changes from multiple branches into one. Although the final goal is the same, those two methods achieve it in different ways, and it’s helpful to know the difference and how the process happens. Given the above context we have set so far, I think it will be easy to understand both.
Let us say we have a simple situation like this, where the Master branch has some changes which need to go to the Feature branch.
Merging takes the contents of a source branch and integrates them with a target branch. In this process, only the target branch is changed. The source branch and its history remain the same.
Merging is a common practice for developers using version control systems.
git checkout <target_branch>
git merge <source_branch> (Feature and Master in the diagram)
When we merge, Git creates a new commit on the target branch (unless the target can simply be fast-forwarded to the source). This is a special type of commit because it has two parents, as we discussed in the Commit section of the Git data types.
The advantage of merge is that it keeps both branches and their original history intact. On the downside, it pollutes the Git graph and log output with many merge commits as things evolve.
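To see why recorded parents make merging straightforward, the sketch below finds a merge base in a small hand-written commit graph and creates a two-parent merge commit. The commit names and helper functions are hypothetical, and the base search assumes a simple history with a single best common ancestor.

```python
# Hypothetical commit graph: id -> list of parent ids.
parents = {
    "A": [],
    "B": ["A"],
    "C": ["B"],   # tip of Master
    "D": ["B"],
    "E": ["D"],   # tip of Feature
}

def ancestors(commit):
    """The commit itself plus everything reachable through parent links."""
    seen, stack = set(), [commit]
    while stack:
        c = stack.pop()
        if c not in seen:
            seen.add(c)
            stack.extend(parents[c])
    return seen

def merge_base(a, b):
    """Deepest common ancestor (assumes there is a single best one)."""
    common = ancestors(a) & ancestors(b)
    return max(common, key=lambda c: len(ancestors(c)))

def merge(target_tip, source_tip):
    """Return the new tip of the target branch after merging the source into it."""
    if source_tip in ancestors(target_tip):
        return target_tip                  # nothing to do, already contained
    if target_tip in ancestors(source_tip):
        return source_tip                  # fast-forward, no merge commit needed
    new_id = f"M({target_tip},{source_tip})"
    parents[new_id] = [target_tip, source_tip]   # a merge commit has two parents
    return new_id

base = merge_base("E", "C")
feature = merge("E", "C")      # merge Master (C) into Feature (E)

print(base)
print(feature, parents[feature])
```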
Rebase, on the other hand, takes a very different approach. When you rebase a branch onto another, what you are essentially saying is that the target branch’s changes should be applied after all the latest changes on the source branch. I know this is a bit confusing. Think about it like this: if we want to rebase Feature onto Master, here are the steps that happen:
- Git will traverse both branches until it finds a common ancestor.
- In the process it will record the commits it traverses and their changes.
- Starting from the common ancestor, it will keep the source branch (in this case Master) commits intact.
- At the end of the source branch it will start applying the changes/commits from the target branch (the Feature branch here), one after another.
- Because the parents of these commits have changed, all the Feature branch commits will be created anew, and the Feature branch will point to the latest commit in that chain.
- So it is basically creating patches (changes) from one branch and applying them onto the other, keeping everything else in Git the same.
- Advanced (interactive) rebasing can also squash the original set of commits into one single commit, keeping the Git history very clean.
git checkout Feature
git rebase Master (Master and Feature in the diagram)
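The rebase steps listed above can be simulated in a simplified model where each commit carries a textual "change" label instead of a real diff. The helpers below are hypothetical, ids are shortened hashes of parent plus change, and conflicts are not modelled; the sketch also assumes the two branches share history.

```python
import hashlib

commits = {}  # hypothetical store: id -> (parent id, change label)

def commit(parent, change):
    """Create a commit whose id depends on its parent and its change."""
    cid = hashlib.sha1(f"{parent}:{change}".encode()).hexdigest()[:8]
    commits[cid] = (parent, change)
    return cid

def history(tip):
    """Commit ids from tip back to the root, newest first."""
    out = []
    while tip is not None:
        out.append(tip)
        tip = commits[tip][0]
    return out

def rebase(branch_tip, onto_tip):
    """Replay the branch's own commits on top of onto_tip; returns the new tip."""
    onto_history = set(history(onto_tip))
    to_replay = []
    tip = branch_tip
    while tip not in onto_history:         # walk back to the common ancestor
        to_replay.append(tip)
        tip = commits[tip][0]
    new_tip = onto_tip
    for old in reversed(to_replay):        # oldest first
        new_tip = commit(new_tip, commits[old][1])   # same change, new parent, new id
    return new_tip

base = commit(None, "base")
master = commit(base, "master work")
f1 = commit(base, "feature 1")
f2 = commit(f1, "feature 2")

new_feature = rebase(f2, master)
print([commits[c][1] for c in history(new_feature)])
print(new_feature != f2, f2 in commits)
```

The original Feature commits are not modified; new commits are created alongside them, which fits the immutability of objects discussed earlier.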
Which one you choose ultimately depends on your preferences and requirements. I tend to use rebase more than merge in my day-to-day work.
Conclusion
The journey with Git has been a really enjoyable one for me. In this post I tried to share my mental picture of how to think about Git, and about tracking content changes in general. But due to the volume of the content, I had to break it down into a two-part series. We can safely close the first part here, because we have covered Git and its use cases in a local setup.
In the next part, we will look in detail at how to work with Git remotes and how to work effectively with multiple collaborators. We will learn about the Git workflows that distributed teams follow and general practices such as those used on GitHub.
We will also learn about some advanced tools like the reflog, which can help in many situations, including recovering from a bad action in the repository. Taking inspiration from Git, we will also discuss how Docker and container registries follow very similar ideas for continuous delivery.
So, to summarize this post: we have learned what Git is, what its data types and structure are, how it tracks content, how it fits into distributed version control, and how some simple Git commands work.
Stay tuned for Part 2; it is coming very soon (I am almost done writing it as well).
References:
Here are the links to the references which inspired me to write this post. A lot of people have done a tremendous amount of work to make Git not only a powerful tool chain, but also to write very comprehensive guides and tutorials that help us learn how to make effective use of it.