Implementations of Graphs

Tyler Elliot Bettilyon
Teb’s Lab
13 min readFeb 6, 2019

--

Any implementation of a graph needs a few essential features. These features comprise the minimal graph API that we’re going to implement. After we’ve described what our graph can do we’ll move on to question of how should we make a computer program that does those things. Following this two step approach is good advice for any programming task. So, what must our graph be able to do?

A graph needs to keep track of all the nodes in it, and all the edges that connect those nodes. We will also need a way to add nodes and edges to the graph in the first place. Minimally, this is enough to create a representation of a graph. However, for our API to be much use, we’ll want a few more features. As a minimal graph API for the purpose of this article we will support the following features:

  • Add a node to the graph.
  • Create an edge between any two nodes.
  • Check if a node exists in the graph.
  • Given a node, return it’s neighbors.
  • Return a list of all the nodes in the graph.
  • Return a list of all edges in the graph.

This leaves us with a fairly simple API to implement. A constructor and 6 methods.

A scaffold for a graph API.

While the API can be the same for directed and undirected graphs, the implementation of the API will have to be slightly different. For example, calling addEdge in an undirected graph must result in a bidirectional relationship in the underlying data, but a unidirectional relationship in a directed graph. It is actually somewhat common for an undirected graph to simply be implemented as a directed graph where the addEdge function always adds two edges instead of just one.

Weighted and unweighted graphs present similar implementation differences. It’s reasonable and common to simply use a uniform weight of 1 for all edges in an unweighted graph, and therefore easy to build an unweighted graph using a weighted graph API.

Nevertheless, it’s often easier for everyone involved to build separate classes for directed vs undirected graphs; same goes for weighted vs unweighted graphs. The following strategies are applicable for any kind of graph, weighted/unweighted and directed/undirected — but there will be small differences in the implementation details. There are also potential optimizations that apply to specific kinds of graph. For example in an unweighted graph that was built on top of a weighted graph API as we described above we are storing an unnecessary copy of the number 1 for every edge in the graph — building an implementation specifically for unweighted graphs allows us to eliminate this memory waste.

In the rest of this article we will examine and compare the performance of four different ways to implement a graph. Each of these tactics can be applied to any of the four major graph types that we’ve previously discussed, but supporting each type will require subtle changes to store or not store the weights and to ensure the graph is directed or undirected.

Recall our 4 major types of graph

To compare the different kinds of graphs, we’ll compare the speed of the individual functions of the API defined above as well as the total size of the underlying data structures using Big O notation. All of our Big O formulas will involve two important variables:

  • E — the number of edges in the graph.
  • N — the number of nodes in the graph.

We’re starting our exploration with more specialized implementations that are more useful in specific cases, and will work our way towards more general implementations that strike a balance between the extremes. For all of the implementations, we will be assuming that nodes are identified by some unique identifier; this is commonly a string supplied by the users so that the node names can be easily interpreted by the people using the graph, but could just as easily be a generated numerical identifier.

Edge List

The first implementation strategy is called an edge list. In this implementation the underlying data structure for keeping track of all the nodes and edges is a single list of pairs. Each pair represents a single edge and is comprised of the two unique IDs of the nodes involved. Each edge in the graph gets an entry in the edge list, and that single data structure then encodes the entire graph.

For a weighted graph the edge list must be a list of triples (rather than a list of pairs) where the final value is the weight. For example (A, B, 3) represents an edge between A and B with a weight of 3.

For an undirected graph the order of the two items in each pair is irrelevant, a single entry can represent the relationship between two nodes; (A, B) means you can both go from A to B as well as from B to A. For a directed graph implementations must encode direction into the edge list, typically by giving meaning to the order of the IDs such that the first identifier is the “from” node and the second is “to” node; (A, B) means you can go from A to B but tells us nothing about whether or not you can get from B to A.

Here is a directed/weighted graph and an edge list representing that graph:

Each row on the right represents a single edge. For example, the first row says you can get from A to D for a cost of 2.

In an edge list nodes only exist in the graph if they have at least one edge connected to them. This means that an edge list cannot represent solitary nodes. While this is a drawback, it’s not a huge one — the relationships between nodes are the central point of interest in a graph; a node with no relationships to speak of is generally not a particularly important member of the graph. In some specific cases though this could be a real problem: Imagine you’re using a graph to model a sewage system where nodes are buildings and edges are sewer pipes that connect buildings to the main sewer line. A solitary node in this graph would represent a building that isn’t connected to the main sewer line — city planners really should know about that node so an edge list would be a bad choice.

The total memory requirements of the edge list depends only on the number of edges in the graph. This means for sparse graphs (graphs where each node has relatively few neighbors) an edge list will be small. For dense graphs (graphs where each node has many neighbors) the edge list will be large. Consider the worst case scenario for the size of our edge list: a graph where every node has an edge to every other node (such a graph is called a fully connected graph). In a fully connected graph the number of edges is O(N²) where N is the number of nodes.

In addition to the total storage requirements we need to consider the speed of critical operations when evaluating graph implementations. Recall our simple API supports six operations:

  • Add a node to the graph: Impossible, we can only add edges.
  • Create an edge between any two nodes: O(1) — just append it to the edge list. This is technically amortized constant time, and depending on the implementation of the list being used could be O(E) instead of amortized constant time.
  • Check if a node exists in the graph: O(E) — scan the edge list until we find an edge containing the node or we reach the end of the list.
  • Given a node, return it’s neighbors: O(E) — scan the entire edge list keeping track of every edge outbound from the node in question.
  • Return a list of all the nodes in the graph: O(E) — scan the entire edge list keeping a set of all the unique node identifiers.
  • Return a list of all edges in the graph: O(E) — make a copy of the edge list and return it. (Simply returning a pointer to the actual edge list would be O(1) but would not be wise because it would allow code outside of the graph API to change the graph)

All of the information is stored in a single list whose size is dependent on the number of edges. Therefore, it’s not surprising that all of our operations (except adding edges) is bounded by the number of edges in the list. In the event that your graph has very few edges these operations will be fast, but as we’ll see some of the other implementations can be a lot faster in most situations. Nevertheless, if your problem is one where relationships are scarce, then your graph with be sparse, and an edge list might be a good option for solving this problem.

Adjacency Matrix

At the opposite extreme of an edge list is an adjacency matrix. In an adjacency matrix a 2D square matrix is created, where each node in the graph has an entry in both dimensions. Each cell contains information about the edge between the nodes indicated by the row and column. In an unweighted graph boolean true/false values are used to simply indicate that an edge exists or not. In a weighted graph the value in each cell indicates the weight of the edge between the two nodes, and some value needs to be reserved to indicate that the two nodes are not connected by an edge.

Consider this adjacency matrix for a weighted and directed graph where the value -1 indicates that an edge does not exist, and any other number indicates the weight or an existing edge:

Each row indicates which nodes can be reached from that node. For example, the first row indicates that from A you can only reach D for a cost of 2. The first column indicates that A can be reached from C for a cost of 12, and can be reached from D for a cost of 10.

It is worth noting that for this implementation and the next one nodes should really be represented by consecutive integer values. This is because those integers must be used as the indices into the cells of our matrix. If a programmer wants to use strings or some other unique identifiers to represent nodes, this feature can easily be achieved with a hash table that maps these names to the appropriate integer for the matrix. In our case: A maps to 0, B maps to 1, C maps to 2 … and so on.

Unlike an edge list, where the lack of a connection between any two nodes is implicit, an adjacency matrix makes the relationship (or lack thereof) between all possible two node combinations explicit. An advantage of this implementation is that checking for the existence of an edge between any two nodes is a constant time, O(1), operation. Additionally, getNeighbors is linear in terms of the number of nodes O(N) rather than in terms of the edges O(E).

A downside is that the matrix can be quite large, and potentially wasteful. The total size of the underlying representation is always a square matrix where both sides are N long. Therefore the storage requirement is always O(N²), even if there is only one edge in the whole graph.

This is especially wasteful on sparse graphs, where we’re using lot of space to indicate the lack of any important relationships. Consider the example above which is relatively sparse. There are 19 cells out of 25 with a -1 to indicate no edge exists; in the edge list we only needed 6 total entries to represent this graph instead of 25.

However, in a fully connected graph — one where each node has an edge to each other node — the edge list and the adjacency matrix will be the same size. In terms of speed, though, an edge list would be a disastrous choice for a fully connected graph. In this case E == N², which makes all six of our fundamental operations O(N²) for fully connected graphs using an edge list. What about our adjacency matrix?

  • Add a node to the graph: O(N²) — we only need to add one row to the matrix, but every individual row will also need one additional column. We will almost certainly need to reallocate memory and copy the whole matrix to do this.
  • Create an edge between any two nodes: O(1) — Lookup the row/column in constant time and change the value in that cell.
  • Check if a node exists in the graph: O(1) — check either that the name exists in the name to position hash table, or that the integer identifier is less than the length of the matrix.
  • Given a node, return it’s neighbors: O(N) — go to the appropriate row in constant time then make a copy (O(N)) to return.
  • Return a list of all the nodes in the graph: O(N) — either scan the hash table for all the identifiers, or simply return a list of integers 0-N.
  • Return a list of all edges in the graph: O(N²) — we have to examine every cell in the matrix to determine the complete list of edges.

With these features in mind, adjacency lists excel under a few conditions. If the graph is relatively dense then the N² size requirement will not waste memory. If the total number of nodes in the graph is known before the representation has to be constructed — that is we are first building the graph then performing analysis as opposed to incrementally building a graph — then we won’t pay the painful O(N²) cost to add a node to the graph. Finally, if we are frequently checking for the existence of an edge between two nodes in whatever analysis we’re performing then we maximize the benefit from the O(1) lookup time.

A Brief Aside, Looking Ahead

In order to highlight the strengths and weakness of these two representations consider the following. Suppose we want to use a web crawler to build a graph representing Wikipedia articles where nodes are articles and edges are hyperlinks inside those articles. Further suppose we wanted to use that graph to answer the question “what is the shortest path between article x and article y?for several pairs of articles.

While we’re crawling Wikipedia, we will keep discovering new articles, and therefore we will have to continuously add new nodes. If our underlying representation is an adjacency matrix paying the O(N²) cost of doing so will get expensive fast. If we instead simply use a set to keep track of URLs we’ve already seen, and an edge list to track the hyperlinks we find, then adding a node is two constant time operations.

If we want to start finding shortest paths using search we won’t be adding nodes to the graph. Instead we’ll be repeatedly calling getNeighbors as we saw in the previous section about BFS and DFS. The edge list’s performance on getNeighbors is very poor, so unless we think Wikipedia articles very rarely link to other Wikipedia articles, we really shouldn’t use an edge list for performing search.

Consider the possibility of using both. First, while crawling Wikipedia make use of an edge list for it’s favorable cost of adding nodes and edges. Then, once we’ve crawled all the articles from Wikipedia that we’re interested in, process the edge list into an adjacency matrix. This way we know all of the nodes beforehand so we can allocate all of the memory for our matrix exactly once and get the benefit of the faster getNeighbor and isNeighbor operations.

The next two representations are alternatives that are compromises between an edge list and an adjacency matrix. For most “general purpose” graphs, one of these two will likely be a better choice than either the adjacency matrix or the edge list.

Adjacency Lists

In an adjacency list rather than having a single data structure that encodes all of the information about the nodes and edges, we maintain a separate list of neighbors for each node individually. This implementation reduces the large storage requirements of an adjacency matrix while achieving more reasonable speeds for getNeighbors and isNeighbor than the edge list. Consider our running example:

Here, we have a list of pointers (one for each node). Those pointers lead to individual lists representing all of the outbound edges for each starting node. The overall size of our representation is now dependent on both the number of nodes and the number of edges; we’ll always have the list of nodes even if they all point to empty lists indicating no edges, and the rest of our data structures size depends on how many edges there are. We can say the overall storage requirement is O(N + E).

How about the speed of our operations? Assuming (as we did for the adjacency matrix) that we have sequential integer identifiers for each node, or a hash table that maps string identifiers to consecutive integers:

  • Add a node to the graph: O(1) — allocate space for an empty list, and add one item to the list of pointers to point to that empty list.
  • Create an edge between any two nodes: O(1) — follow the pointer for the “from” node and add an item to its adjacency list.
  • Check if a node exists in the graph: O(1) — check either that the name exists in the name to position hash table, or that the integer identifier is less than the length of the node list (the list of pointers).
  • Given a node, return it’s neighbors: O(N) — go to the appropriate adjacency list in constant time then make a copy. Note that in the adjacency matrix this always takes exactly N time whereas in the adjacency list it may well take less time, depending on how many neighbors the node in question actually has.
  • Return a list of all the nodes in the graph: O(N) — either scan the hash table for all the identifiers, or simply return a list of integers 0-N.
  • Return a list of all edges in the graph: O(N + E) — sequentially copy each of individual the individual adjacency lists into a single list to return.

Note also that checking if a node is a neighbor of another node is O(N) because we have to scan the individual adjacency lists to check for a matching edge. This is slower than the adjacency matrix O(1), and one of the weaknesses of the adjacency list.

Nested Hash Tables

The last implementation we’ll examine is a great general purpose compromise that relies on everyone’s favorite data structure, the hash table. This implementation is very similar to the adjacency list but instead of using a list to encode the edges for each node, we use a hash table. Additionally, instead of requiring sequential integer identifiers for the list of pointers, we just use a hash table for that too. Since we were already assuming a hash table existed between the names and sequential integers (A maps to 0, B maps to 1, and so on…) this isn’t much of a change. Indeed, we could have just as easily had our hash table map names to the pointers that lead to the adjacency lists in the previous example, but I digress.

The benefit of using a hash table to represent the list of neighbors is that determining if two nodes are neighbors becomes a constant time, rather than linear time operation. Another benefit is that this strategy is easy to implement as long as you have a hash table API. We have made an example of this implementation available on Github for you to explore.

This is currently the end of the series. More to come, sign up for our weekly newsletter to find out when articles have been added to this series!

--

--

Tyler Elliot Bettilyon
Teb’s Lab

A curious human on a quest to watch the world learn. I teach computer programming and write about software’s overlap with society and politics. www.tebs-lab.com