The Graph Side of Git

Shlomit Ahituv
Wix Engineering
Published in
6 min readJun 14, 2022

Introduction

“Everything must be made as simple as possible. But not simpler.”
― Albert Einstein

This quote captures Git quite accurately — It’s a simple tool to use, but has a very complex mechanism.
It’s one of the most common Version Control Systems, used to track content history and helps developers collaborate.

In this article we will try to show how git does its magic.

Git concepts

Git structure

Try to envision a git repository
The repository contains branches, and each one of those branches is a set of commits with a parental relationship. In fact, a branch is just an alias to a specific commit.
Each commit has a specific ID (based on a hash of the content) — unique to that repository.

Let’s try to visualize it.

  • Assume we have a git repository with a master branch with 2 commits (M1 is the parent commit of M2, M2 is the latest commit):
  • We want to introduce a new set of changes. Let’s create a branch called feature to create the changes independently from master:
  • We made a set of changes and committed them to the branch (commit F1):
  • As we continued to work on the feature, we added a few more commits:
  • Meanwhile, other people kept adding commits to master branch:
  • After testing our feature, we decide to merge the changes into master:

Note: When we merge the feature into master, a special merge commit (M5) is created, and this commit has two parent commits.

This structure might remind you of a familiar data structure called Graphs.

Graphs

A graph is a data structure based on Nodes and Edges. An edge connects exactly two nodes, and a node can be connected to zero or more other nodes.

A path is a collection of connected nodes, and a cycle is a path that starts and ends in the same node.

There are two types of graphs — directed and undirected — and as their names suggest, the former has directions on the edges (e.g. if A is connected to B, it does not mean that B is connected to A), while the latter doesn’t (i.e. a connection of nodes is symmetrical).

A DAG (Directed Acyclic Graph) is a special kind of graph, and, as its name suggests, is a directed graph in which there are no cycles (i.e. there are no paths that start and end on the same node).
A nice property of DAGs is that we can sort them in a topological order
which is an order of the nodes of the graph such that if we were to travel the entire graph, each node in the order comes before all its “dependencies” (i.e. all the nodes that can be reached from it).
Note: There can be more than one topological sort for a DAG.

Graphs are used to represent a structure of points with connections between them. Using this representation allows us to use algorithms for finding paths with certain properties (e.g. shortest, longest, etc.).

Git and Graphs

We can therefore deduce that a Git repository is basically a DAG:

  • commit → node
  • link between commit and parent commit → directed edge

By definition, there are no cycles since when a commit is generated it points to one (or more) parent commits that already exist. Since existing commits cannot be modified, we could never point to a newer commit, thus avoiding cycles.

Why does git use graphs as its data structure model?

  • We can think of a git repository history as a state machine — each commit is a state of the repository in a certain time, and the parents of each commit are the other valid states of the repository. This state machine is basically a graph.
  • Graphs allow git to use graph traversal algorithms for showing the history of files in a visual, easy-to-understand manner.

Useful graph-utilizing git commands

  • git log --graph — allows git to visualize the entire working tree (commits and their parents) in a graph. This is done using a modification of a DFS (Depth First Search) which traverses the graph from the current node (HEAD).
    We can even use the --topo-order flag to visualize the commits in a topological order. This order is only applicable for DAGs, and thus git can use this functionality.

Git Implementation

Git is open source, and its code can be found here.
You may notice a little chicken and egg situation, but of course this repository is simply a mirror of the original.
It’s written in C and we’ll go over the fundamentals in high level.

Before we go over the actual flows of commands in git, we’ll need to be familiar with a few terms:

  • Blob — an actual change to an actual file as part of a commit object.
  • Tree Object— a collection of files for a commit. These files can be either trees or blobs.

Commit

The commit structure is composed of the Tree object, the commit parent(s), and other metadata used by git.

When we create a commit of changes using git commit, the flow that happens is:

  • Go over the changes — Parse and create the tree object. As part of this process the blobs and subtrees themselves are parsed as well.
  • Read the commit message from the user.
  • Create the commit object.
  • Update the HEAD object to point to this newly created commit.

Eventually, each commit holds a link to their parent(s) and this is what makes the collections of commits a DAG.

Branch

Branches are aliases (symbolic refs) to commits.
Branches allow us to work separately on our changes easily, with a human perspective to gather all of our changes.
Branches can be created either with git checkout -b or with git switch -c

Last thoughts

Graphs seem like a very abstract concept, but digging deeper we can find many applications to them. Many of our day-to-day tools, both inside and outside the software industry (such as geographic maps and navigation, social networks, and even the world wide web) use graphs and their algorithms.

Graphs allow us to represent systems that consist of relations, and git definitely falls into that category.

Next time you run a git command, you can probably guess how git does its magic.

--

--