Data Structures Part 2: The Maze

In the first part of this trilogy (I know :/ it was supposed to be two parts), I have talked about relatively elementary structures; arms and legs of a programmer. This time, we will dig deep into more fancy stuff; the ideas that can be applied to many complex problems as well as simple ones. I would like to dedicate this part to the structure — theory if I may — that probably deserves the most attention:

Graph

Image Credit

As Shakespeare would put it: I owe more words to this data structure, than you shall see me pay; though I will try to clarify the few aspects that I find the most interesting and important.

These little discs and filaments you see in the image roam this universe with many names. I will use the aliases vertex (disc) and edge (filament). Note: this is not a valid graph coloring.

Depending on the relation between the two vertices at the ends of an edge, a graph is said to be directed or undirected. These two types are mutually exclusive, meaning that a graph cannot contain both types. An analogy for directed graphs would be one-way signs: Just imagine all streets having a one-way sign. And an undirected graph would intuitively have no one-way signs at all. Edges can also have weights: what if we had to pay for each street we use? Well, that would be a weighted city (now we call them weighted graphs). I need to define one last thing: Density of a graph. A graph is said to be dense if the number of edges is close to the maximum number of edges; called sparse otherwise.

Although there is no silver bullet (at least to my knowledge) in implementation of a graph, adjacency list often is the preferred data structure working in the backstage (following complexities for basic operations will be build up on adjacency list).

I will drop here the minimal C++ implementations for Edge and Vertex types:

struct Edge {
int weight;
Vertex *from;
Vertex *to;
}
struct Vertex {
int val;
vector<Edge*> edges;
}

Before we get our hands dirty with insertion, let’s get the space complexity out of the way: Each vertex holds the list of related edges. This means that each edge will be duplicately stored in two distinct vertices. So the total space required for whole graph is the space required for vertices, plus, space required for edges times two [|V| + 2|E|] which makes the space complexity [O(|V| + |E|)].

Addition (Edge)

Adding an edge is a constant time [O(1)] operation. Just instantiate the edge and connect the two ends (an edge always has two ends). This is all there is to it.

Addition (Vertex)

Adding a vertex is also a constant time [O(1)] operation. After instantiating the vertex, we will need to populate the edges vector (or some mutable array). The process of population of course depends on the quantity of edges we are willing to append.

Removal (Edge)

At first glance, this seems like a constant time operation, though it isn’t. We cannot just delete the edge, we also need to traverse the edges array of both ends and find the edge we are about to delete. What if the vertex at one end is connected to all the other vertices? This means that the edges array may contain as many edges as there are vertices. We will have to traverse all over it to find the edge we want to delete. Oops! |V| complexity appeared out of the blue. Thus, time complexity for an edge removal is [O(V)].

Quick recap: Just the deletion of an edge is of course constant time. Removing the edge from the vertices’ edges array is linear (count of edges is related to number of vertices that current vertex is connected to). Russia has 14 neighbors, so edges array of Russia would contain exactly 14 edges.

Removal (Vertex)

For every edge connected to the vertex, we need to go to the other end of the edge and remove the edge from the list. Finding the edge from the target vertex’s edges array is again linear time. Only this time it is not [O(|V|)], but [O(|E|)]. So we have the vertex A which is connected to vertices B, C and D. B is connected to A, C and D. We can to delete A.

  1. Remove the only edge from C’s edges array, also remove C since it is not connected to anything.
  2. Remove the only edge from D’s edges array, also remove D.
  3. Find edge that connects B to A. B is connected to all three also, so we need 3 more iterations at most (in this case we need to remove C and D too since they are removed from the graph).

While the total number of vertices is only 4, we have gone through 3 edges of A, both edges array of C and D contained single element so comes 2 from there and finally edges array of B contained 3 elements which sums to (3 + 1 + 1 + 3) 8. We just traversed every single edge!

Query

Are vertices A and B adjacent? All we have to do is traverse the edges array of A or B to see if there is an edge that connects the two. It is probably a good idea to use the one with less elements in it’s edges array. Again, edges count is related to the number of vertices. Time complexity: linear [O(|V|)].

Graph Coloring

Here comes the fun stuff. Graph coloring is the act of grouping vertices. Some vertices would be in Red Team, some in Blue Team, some in Green Team and so on and so forth. The amount of available colors should be predetermined and it depends on what we are hoping to achieve. Coloring is applied to vast number of problems such as scheduling, designing seating plans and even Sudoku.

Image Credit

The deal is, no two neighbors can have the same color. If a graph can be colored (or.. painted?) using k distinct colors, then that graph is called k-colorable. The 2-colorable are also called bipartite graphs. Trees are also 2-colorable.

The brute-force search for graph coloring is of course extremely inefficient. We try every possible combination of such graph until we find a valid layout. For k colors and n vertices, consider each of k^n assignments.

2-colorability algorithm is pretty straightforward.

  1. Apply Breadth-First-Search Algorithm and assign the parent’s opposite color for all siblings at the same level.
  2. Traverse all the edges and check if there are any two neighbors with the same color (if we find such an edge, then the graph is not 2-colorable or bipartite).
BFS starts from the root (or some specified node of the graph) and traverses the tree or graph level by level in the following order: 1 / 2–3–4 / 5–6–7–8 / 9–10–11–12.

For any k > 2, the problem is much more difficult (an NP-complete problem). Fastest algorithms for 3-colorability & 4-colorability are bound with [O(1.3289^n)] and [O(1.7272^n)] complexities respectively.

Going into more detail about NP-completeness, the problems with higher level coloring problems and how they are solved efficiently are unfortunately beyond my knowledge’s reach at this moment. If you are interested, you can (or should) refer to the holy book of algorithms: Introduction to Algorithms by Cormen et. al.

Shortest Path

This is a problem we sub-consciencely try to solve possibly dozens if not hundreds of times everyday: Which is the shortest route between point A and point B? To achieve our goal, we probably are using some sort of caching; check the memory for prior calculations for the same specific route.

Then we do have to calculate all possible routes and pick the most efficient one. I have no idea how our brains do it, but I know how computers are told to do it: Introducing, Mr. Edsger W. Dijkstra’s famous shortest path algorithm:

Dijkstra’s Algorithm

For starters, we need a (positive) weighted graph to symbolize distances. We will also need a hash table for unvisited nodes which in the beginning will contain all vertices. All the unvisited nodes should be marked with infinity (or some sufficiently large number) since distance is unknown and the initial vertex must be assigned 0 as tentative distance. We can only remove a vertex from unvisiteds table after we are done processing each neighbor of that vertex. The gif below is I think a great visualization of the algorithm.

  1. For each vertex X, calculate the distance to it’s unvisited neighbors, if the tentative distance is less than the already assigned value (infinite if not yet assigned), update the distance.
  2. Remove the vertex from the unvisiteds table when all neighbors are processed
  3. If the destination node — in this case vertex b — has been marked as visited (or simply removed from the unvisiteds) we have found the the shortest path.
  4. Otherwise continue with the vertex that has the smallest tentative distance.

The most efficient implementation for Dijkstra’s Algorithm uses Adjacency List together with Fibonacci Heap (I will talk about heaps later on) which results in O(|E| + |V|log|V|) complexity.

Minimum Spanning Tree

This time, we are interested in the skeleton of the graph that connects all the vertices together with the lowest cost. This problem is valid only for again weighted and undirected graphs. Negative weighted edges would cause problems since going over any negative weighted edge over and over again decreases the total weight of the skeleton, which results in an infinite cycle. For a graph with n vertices, a MST should contain n-1 edges. And if all the weights of the graph is unique, MST of the graph is also unique.

One great example for MST is electricity cables laid out in a city. We have to connect every neighborhood with the other in a greedy manner. Here comes Prim’s Algorithm and Kruskal’s Algorithm for aid. Both of these algorithms work greedily with few differences.

Kruskal’s Algorithm

Together with Prim’s this is not the most efficient algorithm though they work equally fast for sparse graphs. More sophisticated algorithms are being used if the graphs is or may be dense. I have picked the Kruskal’s for I think it is the simplest one among all.

  1. Initialize a forest (a forest would have a very similar implementation with a graph) F where each vertex in the graph is a separate tree, a collection Q containing all the edges in the graph and perform a sort operation so that the collection is non-descending. Perform the step below until Q is empty and F is not yet spanning (contains all the vertices).
  2. Remove the edge with the minimum value ([O(1) operation since the collection is sorted]) and check if edge connects two vertices. If so, add it to F.

Time complexity for Kruskal’s Algorithm is [O(|E| log|E|] or equivalently [O(|E| log|V|)] since E is at most V² and logV² = 2 * logV (link for source).

There are much much much more to say about graphs, however I must away ere break of day.

What’s Next

I was thinking it would be enough to divide the whole thing into two parts would be sufficient though it now seems unreasonable to force Balanced Trees and Heaps to this chapter. So please bear with me for one last chapter on Data Structures.