The Architecture and History of Git: A Distributed Version Control System
As of 2018, almost 90% of the 74,000 developers surveyed by Stack Overflow prefer to use Git for version control. Git dominates all other version control systems and adoption is up almost 20% from 2017 according to the survey. However, Git has not always been this ubiquitous. Let’s take a look at its ascent into mass popularity.
Git was born out of the Linux kernel community’s frustrations with the available version control systems (VCSs). The development of the Linux kernel was quite unusual for its time: the project had a large number of contributors, and their involvement and knowledge of the codebase varied widely. As a result, the developers struggled to find a VCS that fit their needs. They settled for a mix of BitKeeper and Concurrent Versions System (CVS), with a group of core developers working on each system to manage the development of the kernel. BitKeeper provided distributed revision control, while CVS was a client-server version control system that let developers “check out” copies of the project, make changes, and then “check in” their changes back to the server.
In early 2005, Larry McVoy, the copyright holder of BitKeeper, announced the revocation of the license that allowed free use of the BitKeeper software. He claimed that Andrew Tridgell, an Australian computer programmer who was creating software that interoperated with BitKeeper, had reverse engineered BitKeeper’s protocols and violated its license. Many Linux core developers who relied on the free version of BitKeeper to develop the Linux kernel were now locked out of using it.
The Linux community’s relationship with BitKeeper had not been entirely conflict-free, but the developers had hoped to have a viable alternative in place before switching away from it. Linus Torvalds, the principal developer of the Linux kernel, began work on a new VCS after seeing no other free options that met their needs. In an email to the Kernel Mailing List, Linus conveyed how happy he was with BitKeeper and what it had done for kernel development, mainly that it had helped the team maintain a much finer-granularity view of changes and change-set tracking. He noted that although BitKeeper didn’t work out, it had been very helpful in changing the way the kernel was developed for the better.
Setting out to provide the team with an alternative to BitKeeper, Linus outlined certain design criteria for the new version control system. He wanted to maintain the benefits that BitKeeper afforded the team as well as develop some improvements.
Three key features were stressed: safeguards against content corruption, high performance, and distributed development workflows. Linus also emphasized that patching should take no longer than three seconds, citing source-control management systems that took upwards of 30 seconds to push a patch and update associated metadata. Such a system would obviously not scale well with the 250 developers working on the Linux kernel. Despite BitKeeper’s early influence on the creation of Git, Git allows more distributed and local-only workflows than BitKeeper. Project collaborators can work on repositories offline, commit incrementally, determine when their work is ready to be published, choose which changes to share, and push their changes to different branches.
A version control system usually has three core functions, all of which Linus built into Git. It must be able to store content, track changes to that content (the full history, including merge metadata), and optionally distribute the content and commit history to project collaborators.
Git uses a Directed Acyclic Graph (DAG) for content storage as well as commit and merge histories. A DAG is a directed graph with a finite number of vertices and edges (connections between vertices) that contains no cycles (hence acyclic). Being acyclic means that there is no way to start at Node A, follow any number of edges through Node B, and arrive back at Node A. Every DAG also has a topological ordering: its vertices can be arranged in a sequence such that every edge points from an earlier vertex to a later one (shown by the arrows moving from the top left to the bottom right in the image above).
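To make the definition concrete, here is a minimal Python sketch that produces a topological ordering using Kahn's algorithm and rejects graphs that contain a cycle. The function name and the small diamond-shaped graph are illustrative assumptions; this is not how Git itself stores or walks its graph.

```python
from collections import deque


def topological_order(edges):
    """Return the vertices in topological order, or raise ValueError on a cycle.

    `edges` maps each vertex to the list of vertices it points to.
    """
    # Collect every vertex, including ones that only appear as targets.
    vertices = set(edges)
    for targets in edges.values():
        vertices.update(targets)

    # Count the incoming edges of each vertex.
    indegree = {v: 0 for v in vertices}
    for targets in edges.values():
        for t in targets:
            indegree[t] += 1

    # Kahn's algorithm: repeatedly emit a vertex with no remaining incoming edges.
    ready = deque(sorted(v for v in vertices if indegree[v] == 0))
    order = []
    while ready:
        v = ready.popleft()
        order.append(v)
        for t in edges.get(v, []):
            indegree[t] -= 1
            if indegree[t] == 0:
                ready.append(t)

    # Vertices that were never emitted are stuck on a cycle.
    if len(order) != len(vertices):
        raise ValueError("graph contains a cycle, so it is not a DAG")
    return order


# Hypothetical diamond-shaped graph: A points to B and C, which both point to D.
diamond = {"A": ["B", "C"], "B": ["D"], "C": ["D"]}
print(topological_order(diamond))  # ['A', 'B', 'C', 'D']
```

Calling the function on a graph such as {"A": ["B"], "B": ["A"]} raises ValueError, because no vertex can be placed first without violating an edge.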
Git also utilizes this Directed Acyclic Graph structure for content storage. Git is essentially a content-addressable filesystem made up of objects that form a hierarchy mirroring the content’s filesystem tree. Git has three main primitive types it uses to represent content for a repository: trees, blobs, and commits. All content is stored as either tree or blob objects. A blob holds the contents of a file stored in the repository, and a tree object references subtrees or blobs. You can think of blobs as file contents and trees as directories. A commit object, on the other hand, carries a few main attributes. It points to a tree that represents a top-level snapshot of the project at the time of the commit. It also contains references to the commits that came directly before it (its parents), a field for the author of the commit and, optionally, a commit message.
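The relationship between blobs, trees, and commits can be sketched with a toy content-addressable store. Everything here is an assumption made for illustration: the ObjectStore class, the JSON serialization, and the sample project are hypothetical and do not match Git's real on-disk format. Only the overall shape (objects keyed by a hash of their content, trees pointing at blobs and subtrees, commits pointing at a tree and at parent commits) follows the description above.

```python
import hashlib
import json


class ObjectStore:
    """Toy content-addressable store; NOT Git's real object format."""

    def __init__(self):
        self.objects = {}

    def put(self, kind, payload):
        # Serialize deterministically, then use the SHA-1 of the bytes as the key.
        data = json.dumps({"kind": kind, "payload": payload}, sort_keys=True).encode()
        key = hashlib.sha1(data).hexdigest()
        self.objects[key] = (kind, payload)
        return key


def snapshot(store, directory):
    """Store a nested dict (name -> file text or sub-dict) and return its tree id."""
    entries = {}
    for name, value in directory.items():
        if isinstance(value, dict):
            entries[name] = ["tree", snapshot(store, value)]
        else:
            entries[name] = ["blob", store.put("blob", value)]
    return store.put("tree", entries)


def commit(store, tree_id, parents, author, message=""):
    """Store a commit that points at one tree and at zero or more parent commits."""
    return store.put("commit", {"tree": tree_id, "parents": parents,
                                "author": author, "message": message})


store = ObjectStore()
project = {"README": "hello\n", "src": {"main.c": "int main(void) { return 0; }\n"}}
root = snapshot(store, project)
first = commit(store, root, [], "Example Author", "initial commit")

# Change one file and commit again; the unchanged "src" subtree is reused as-is.
project["README"] = "hello, world\n"
second = commit(store, snapshot(store, project), [first], "Example Author", "update README")
print(len(store.objects))  # 8
```

The second snapshot adds only three new objects (a blob, a root tree, and a commit), because the unchanged file and its directory hash to keys that already exist in the store.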
All of these objects are referenced by a 40-character hexadecimal SHA-1 hash of their contents. Two identical objects will have the same hash, and two different objects will, for all practical purposes, have different hashes. Because the hash serves as an object’s identity, Git can tell whether two files or entire directory trees are identical by comparing hashes alone, which makes diffing efficient. The hash also safeguards against data corruption: recomputing an object’s hash and comparing it with the stored identifier readily reveals corruption or data loss.
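As a small illustration, the Python sketch below computes the identifier Git assigns to a blob: the SHA-1 of a short header ("blob", the content length in bytes, and a NUL byte) followed by the content. The helper names git_blob_id and is_intact are made up for this example, and the sketch assumes the default SHA-1 object format.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object id Git assigns to a blob with this content."""
    header = b"blob " + str(len(content)).encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def is_intact(expected_id: str, content: bytes) -> bool:
    """Recompute the hash and compare it with the id the object is stored under."""
    return git_blob_id(content) == expected_id


original = b"hello world\n"
object_id = git_blob_id(original)
print(object_id)                                # 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
print(is_intact(object_id, original))           # True
print(is_intact(object_id, b"hello w0rld\n"))   # False: one changed byte is detected
```

The printed identifier should match what `git hash-object` reports for a file containing the same bytes.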
Git also uses a DAG to track the history of changes to the content. As stated above, each commit object records the commits that came directly before it, and a commit can have any number of parent commits. Git’s use of DAGs to store content and to keep track of commit and merge histories allows it to maintain full branching capability, because the history of a file is linked all the way back up its directory structure to the root directory and a commit object.
When the branch “feature7” is merged into the master branch and master has gained no new commits since “feature7” branched off, Git performs a “fast-forward” merge, simply moving the master branch pointer forward to the tip of “feature7”. A “fast-forward” merge is only possible when the commit history of the branch being merged in (“feature7”) contains the latest commit of the branch being merged into (master).
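The fast-forward condition boils down to a reachability test on the commit graph. The Python sketch below models it with a plain dictionary that maps each commit to its parents; the commit ids c1 to c4 and the helper names are hypothetical, and real Git works on its object database rather than on a dictionary.

```python
def ancestors(parents, commit):
    """Return the set containing `commit` and every commit reachable from it."""
    seen = set()
    stack = [commit]
    while stack:
        c = stack.pop()
        if c not in seen:
            seen.add(c)
            stack.extend(parents.get(c, []))
    return seen


def can_fast_forward(parents, target_tip, source_tip):
    """True when the target branch tip is already part of the source branch history."""
    return target_tip in ancestors(parents, source_tip)


# Hypothetical history: master stops at c2, feature7 adds c3 and c4 on top of it.
parents = {"c1": [], "c2": ["c1"], "c3": ["c2"], "c4": ["c3"]}
branches = {"master": "c2", "feature7": "c4"}

if can_fast_forward(parents, branches["master"], branches["feature7"]):
    branches["master"] = branches["feature7"]  # no new commit, just move the pointer

print(branches["master"])  # c4
```

No commit object is created in this case; only the branch pointer changes.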
Git uses a different merge strategy when the tip commit of the branch you are on isn’t an ancestor of the branch you are merging in, meaning your development histories have diverged. In this case, Git uses the “recursive” strategy (Git 2.34 and later default to the similar “ort” strategy) and performs a three-way merge, comparing the two branch tips against their most recent common ancestor. Git creates a new snapshot of the file state and a new merge commit object that points to the snapshot. This merge commit object has two parents, pointing to the commit objects at the heads of the two branches being merged. Git’s use of a nonlinear content storage and commit history system allows it to seamlessly merge two branches of a project together.
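A three-way merge needs a common ancestor of the two branch tips to compare against. The Python sketch below finds such a merge base in a toy commit graph and then records a merge commit with two parents. The commit ids, the branch layout, and the function names are assumptions for illustration, and the sketch leaves out the actual content merge that Git performs on the three snapshots.

```python
def reachable(parents, start):
    """Return `start` plus every commit reachable by following parent links."""
    seen, stack = set(), [start]
    while stack:
        c = stack.pop()
        if c not in seen:
            seen.add(c)
            stack.extend(parents.get(c, []))
    return seen


def merge_bases(parents, a, b):
    """Common ancestors of a and b that are not ancestors of another common ancestor."""
    common = reachable(parents, a) & reachable(parents, b)
    return {c for c in common
            if not any(c != other and c in reachable(parents, other)
                       for other in common)}


# Hypothetical diverged history:
#   c1 - c2 - c3        (master)
#          \
#           c4 - c5     (feature)
parents = {"c1": [], "c2": ["c1"], "c3": ["c2"], "c4": ["c2"], "c5": ["c4"]}

base = merge_bases(parents, "c3", "c5")
print(base)  # {'c2'}

# The merge commit records both branch tips as its parents.
parents["m1"] = ["c3", "c5"]
```

The function returns a set because a history with earlier cross-merges can have more than one such ancestor.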
Distribution and Initialization
Git distributes a project’s content and history among collaborators using the distributed model. Every contributor has a full copy of the Git repository, where they can work offline, make changes, commit them, and (optionally) pull in new changes from a remote repository to stay up to date. When a collaborator is ready to share their changes, they can push them to a publicly accessible repository for other collaborators to access. Once the public repository verifies that the pushed commits can be applied to the target branch, the same objects that were created and stored in the local repository are created in the public repository, and the branch is updated for all collaborators to access.
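The push step can be modelled roughly as follows. This Python sketch is a simplified assumption of the behaviour described above, not Git's transfer protocol: repositories are plain dictionaries, only commits are tracked (no trees or blobs), and the names push, alice, and bob are invented for the example.

```python
def reachable_commits(commits, tip):
    """Return the ids of `tip` and every commit reachable through parent links."""
    seen, stack = set(), [tip]
    while stack:
        cid = stack.pop()
        if cid not in seen:
            seen.add(cid)
            stack.extend(commits[cid])
    return seen


def push(local, remote, branch):
    """Copy missing commits to `remote` and move its branch, or refuse the push."""
    new_tip = local["refs"][branch]
    old_tip = remote["refs"].get(branch)
    history = reachable_commits(local["commits"], new_tip)
    # Refuse if the remote branch holds work the pusher has not pulled in yet.
    if old_tip is not None and old_tip not in history:
        return False
    for cid in history:
        remote["commits"].setdefault(cid, local["commits"][cid])
    remote["refs"][branch] = new_tip
    return True


# Hypothetical repositories: "commits" maps a commit id to its list of parent ids.
remote = {"commits": {"c1": []}, "refs": {"master": "c1"}}
alice = {"commits": {"c1": [], "c2": ["c1"]}, "refs": {"master": "c2"}}
bob = {"commits": {"c1": [], "c3": ["c1"]}, "refs": {"master": "c3"}}

print(push(alice, remote, "master"))  # True: c2 builds directly on the remote's c1
print(push(bob, remote, "master"))    # False: bob has not pulled in c2 yet
```

In this model Bob would first have to pull in c2 and merge it with his own work before his push is accepted.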
To initialize a local Git repository, you run the command “git init”. This creates a new repository on your local filesystem by creating a .git directory inside your current working directory. The .git directory is a subdirectory of the root “working directory” and functions as the actual local repository, containing various config files, the object database, reference pointers for branches, and hook scripts that can be run at various points in the project’s lifecycle. Another important file, the Git index, is created at .git/index once you stage changes for the first time. The Git index file is the staging area between the working directory and the local repository, holding the specific changes within one or more files that are to be committed.
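The layout can be inspected with a short Python script that drives the git command line in a throwaway directory. This is a sketch that assumes a git executable is available on the PATH; the function name inspect_new_repo and the file name notes.txt are hypothetical, and the exact set of entries inside .git varies between Git versions and configurations.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def inspect_new_repo():
    """Create a throwaway repository and report what `git init` and `git add` produce.

    Returns None when no `git` executable is available.
    """
    if shutil.which("git") is None:
        return None
    with tempfile.TemporaryDirectory() as tmp:
        work = Path(tmp)
        subprocess.run(["git", "init", "--quiet", str(work)], check=True)
        # Top-level entries of the freshly created .git directory.
        names = sorted(p.name for p in (work / ".git").iterdir())

        index = work / ".git" / "index"
        index_before = index.exists()  # nothing has been staged yet

        # Stage a hypothetical file; this is when the index file gets written.
        (work / "notes.txt").write_text("draft\n")
        subprocess.run(["git", "-C", str(work), "add", "notes.txt"], check=True)
        index_after = index.exists()
    return names, index_before, index_after


result = inspect_new_repo()
if result is not None:
    names, index_before, index_after = result
    print(names)                      # typically includes HEAD, config, objects, refs
    print(index_before, index_after)
```

The second line of output shows whether .git/index existed before and after the first file was staged.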
Git was written with a toolkit design philosophy, in keeping with the command line tools used and built within the Linux community. While the toolkit design affords users more granular, low-level access to much of Git’s functionality, it gives Git a steep learning curve: the large suite of commands can be unintuitive to people unfamiliar with command line tools or other VCSs. Git is also difficult to link against and build into other services and applications. Many application developers who have built or are building tools on top of Git complain about the lack of a linkable library. The Git binary is not reentrant, meaning it cannot be interrupted in the middle of its execution and then be called again safely. This forces any application or web service using the binary to execute a call and wait for it to finish before calling it again, which hurts application speed. There are a couple of projects working on remedying this lack of a linkable library, most notably libgit2, a cross-platform linkable library implementation of Git.
Another set of issues with Git is its difficulty handling large numbers of files or very large files. If your project contains a lot of non-text files, such as images, that are updated frequently, Git becomes very slow, making the largest practical repository size only a few GB.
Git was engineered almost perfectly to fit the needs of Linus and the Linux team. It met every core requirement for a VCS that Linus outlined, and did so simply, elegantly, and efficiently. While there are some minor issues, Git is engineered quite well and will continue to be the VCS of choice for many years to come.