Building an Algorithm to Compute Strongly Connected Components (SCC) From Scratch

Toshiya Komoda
Published in The Startup
8 min read · Jan 28, 2021

Preface

When I learned algorithms for strongly connected components (SCC) at university, such as Kosaraju’s algorithm or Tarjan’s, I could see that they work, but I had no idea where they came from. These algorithms are so elegant that they seem to have sprung magically from some genius. Indeed, much of the theoretical basis of computer science today was developed by a small number of highly talented people in the 1970s and 80s. I remember feeling that I would never be able to come up with such noble algorithms.

Time has passed, and now I’m curious about computer science fundamentals again. In this article, I try to derive my own algorithm for SCC decomposition from scratch, pretending I have forgotten Kosaraju’s and Tarjan’s. Though my algorithms turned out to be naive and inefficient, building an algorithm from scratch gave me much deeper insight (and fun!). So I’d like to share the process in this article.

Mathematical Definitions of SCC

Before anything else, I need a formal definition of strongly connected components (SCC). Let’s take the following graph as an example.

Figure1. Directed Graph

In the example graph, there are 10 nodes = {A1, A2, A3, A4, A5, A6, A7, A8, A9, A10} and 12 edges = {(A1, A2), (A1, A5), (A2, A4), (A3, A2), (A4, A3), (A5, A6), (A5, A9), (A6, A7), (A7, A5), (A7, A8), (A9, A10), (A10, A1)}.

SCC decomposition groups the nodes so that nodes in the same group are strongly connected. But what does “strongly connected” mean? We need a mathematical definition.

Path: a chain of edges where each edge starts from the end node of the previous edge.
Example: [(A1, A5), (A5, A6), (A6, A7)] is a path from A1 to A7; A1 is the start node and A7 is the end node of the path. For convenience, we can abbreviate the duplication in the path expression as path(A1, A5, A6, A7).

Strong Connectivity definition 1: node1 and node2 are strongly connected if and only if there is a path from node1 to node2 and a path from node2 to node1.
Example: A1 and A7 are strongly connected, as there are path(A1, A5, A6, A7) and path(A7, A5, A9, A10, A1).

Following this definition of strong connectivity, the example graph decomposes into 3 components, as shown in the figure below.

Figure2. strongly connected components

Alternatively, we can give a mathematically equivalent definition of strong connectivity using the concept of a loop.

Loop: a path in which the start node and the end node are identical.
Example: [(A5, A6), (A6, A7), (A7, A5)] is a loop. As with paths, we can abbreviate the duplication in the expression as loop(A5, A6, A7).

Strong Connectivity definition 2: node1 and node2 are strongly connected if and only if there is a loop which includes both node1 and node2.
Example: A5 and A7 are strongly connected, as there is loop(A5, A6, A7).

Let me construct algorithms from these two definitions of strong connectivity in the following sections.

Algorithm from definition 1: brute force

Strong Connectivity definition 1: node1 and node2 are strongly connected if and only if there is a path from node1 to node2 and a path from node2 to node1.

A direct implementation of this definition is straightforward: simply iterating over all pairs of nodes is enough.

The Python code below does this; the input is given in DOT format and the output is also DOT, with color annotations (nodes in the same component get the same color).
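The original gist is not reproduced here, but a minimal self-contained sketch of the brute-force approach could look like the following. For brevity it takes a plain adjacency dict instead of parsing DOT, and the function names are mine:

```python
def reachable(graph, start, goal):
    """Return True if there is a path from start to goal (iterative DFS)."""
    stack, seen = [start], {start}
    while stack:
        node = stack.pop()
        if node == goal:
            return True
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False


def brute_force_scc(graph):
    """Group nodes into SCCs by checking reachability in both directions
    for every pair of nodes (definition 1, applied literally)."""
    components, assigned = [], set()
    for u in graph:
        if u in assigned:
            continue
        comp = {u}
        for v in graph:
            if v != u and v not in assigned \
                    and reachable(graph, u, v) and reachable(graph, v, u):
                comp.add(v)
        assigned |= comp
        components.append(comp)
    return components


# the example graph from Figure 1
example = {
    "A1": ["A2", "A5"], "A2": ["A4"], "A3": ["A2"], "A4": ["A3"],
    "A5": ["A6", "A9"], "A6": ["A7"], "A7": ["A5", "A8"], "A8": [],
    "A9": ["A10"], "A10": ["A1"],
}
components = brute_force_scc(example)  # 3 components, as in Figure 2
```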

This implementation correctly decomposes a given directed graph into SCCs. However, the obvious problem with this solution is its time complexity: it runs a depth-first search (DFS) for each of the N² pairs of nodes, ending up around O(N³) overall even for sparse graphs, where N is the number of nodes.

Algorithm from definition 2: ad hoc loop finding

Strong Connectivity definition 2: node1 and node2 are strongly connected if and only if there is a loop which includes both node1 and node2.

Next, let me start from definition 2, which uses the concept of a loop. Though this definition is mathematically equivalent to definition 1, its statement looks different, and it is not as easy to implement as the previous one. One thing we need to consider is how to find loops. Also, there is no guarantee of a one-to-one relationship between an SCC and a single loop.

Fortunately, the characteristics of DFS are well studied (see, for example, the lecture note from Duke University). We can use DFS to find loops in a directed graph. In a nutshell, finding a back edge is almost equivalent to finding a loop that defines an SCC. The steps look like this:

  1. Find back edges (u, v) during DFS; each back edge indicates there is a loop rooted at v. Let’s call v the root node of the loop.
  2. While backtracking in the DFS, propagate the root node of the loop to find all the members of the loop discovered in step 1.

Since the member nodes of a loop are in the same SCC, the result of step 2 gives enough information for the SCC decomposition. There are subtle details to explain, but let me give the Python code first.
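The full gist (150-plus lines) is not embedded here, so what follows is my condensed reconstruction of its core, not the exact code. It labels each node with the root of a loop it belongs to, propagating labels across tree, back, and forward/cross edges as explained in the rest of this section; the raw groups it returns may still contain the stale labels that the post-processing later in the article has to merge. The Figure 5 graph below is inferred from the back edges and visiting order discussed later:

```python
from collections import defaultdict

WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the DFS stack, visit-completed


def find_loops(graph):
    """One DFS pass: label each node with the root node of a loop.

    Returns (root_node_label, groups); groups may contain stale labels
    that still need a merging post-processing step.
    """
    color = {u: WHITE for u in graph}
    disc = {}                            # discovery time of each node
    root_node_label = {u: u for u in graph}  # every node labels itself at first
    on_stack = set()                     # gray nodes = current DFS ancestors
    clock = [0]

    def earlier(a, b):
        # prefer the label discovered earlier (closer to the DFS root)
        return a if disc[a] <= disc[b] else b

    def dfs(u):
        color[u] = GRAY
        disc[u] = clock[0]
        clock[0] += 1
        on_stack.add(u)
        for v in graph[u]:
            if color[v] == WHITE:                     # tree edge
                dfs(v)
                # propagate the child's label only while it is an ancestor of u
                if root_node_label[v] in on_stack:
                    root_node_label[u] = earlier(root_node_label[u],
                                                 root_node_label[v])
            elif color[v] == GRAY:                    # back edge: v roots a loop
                root_node_label[u] = earlier(root_node_label[u], v)
            elif root_node_label[v] in on_stack:      # forward/cross edge whose
                root_node_label[u] = earlier(root_node_label[u],  # label is an
                                             root_node_label[v])  # ancestor of u
        on_stack.discard(u)
        color[u] = BLACK

    for u in graph:
        if color[u] == WHITE:
            dfs(u)

    groups = defaultdict(set)
    for u, r in root_node_label.items():
        groups[r].add(u)
    return root_node_label, dict(groups)


# a graph consistent with the Figure 5 discussion: back edges (A4, A2) and
# (A6, A1), DFS visiting order A1, A2, A3, A4, A5, A6
figure5 = {"A1": ["A2"], "A2": ["A3", "A5"], "A3": ["A4"],
           "A4": ["A2"], "A5": ["A6"], "A6": ["A1"]}
labels, groups = find_loops(figure5)
```

Running this on the Figure 5 graph reproduces exactly the two stale groups quoted further down, which is why the merging step is necessary.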


As you can see, the implementation is much more complicated than the brute-force one. This complexity comes from subtle differences in how the different types of edges (tree, forward/cross, and back edges) are handled during DFS.

Figure3. Propagation of back edge information for tree edges

For tree edges, we need to propagate the back edge information from the child node whose visit has just completed. In Figure 3, DFS might visit nodes in the order A1, A2, A3, A4 and then find a back edge (A4, A2). Upon detecting the back edge, it stores A2 in A4.root_node_label. The root_node_label propagates to A3 and A2 as the DFS backtracks. When the visit of A2 completes, all of A2, A3, and A4 have A2 in their root_node_label. However, it must not propagate to A1, since A2 is not an ancestor of A1. Controlling when the propagation stops is what the tree-edge case requires.

Figure4. Case for Cross Edge

Figure 4 shows how back edge information propagates across cross edges. In that graph, DFS might visit nodes in the order A1, A2, A3, A6, A4, A5, and then find a back edge (A5, A1). The back edge information propagates to A5, A4, A6, A3, and A2 as the DFS backtracks to A2. At A2, the DFS visits A7, the next neighbor of A2. At A7, it finds a cross edge (A7, A4). Note that a cross edge is not traversed, but we need to pass the back edge information A1 from A4 to A7, because A1 is an ancestor of A7; that is, we store A1 in A7’s root_node_label. Note also that we must check whether A1 really is an ancestor of A7. The same applies to forward edges.

At the end of the DFS traversal, each node should have exactly one root_node_label. There are three types of nodes.

  • A loop root node (the destination of a back edge) has its own label as root_node_label.
  • A node that forms a single-node SCC also has its own label as root_node_label.
  • Every other node has one of its ancestors’ labels as root_node_label.

It is important to recall that nodes with the same root_node_label are in the same SCC. So the algorithm collects the root_node_label of each node and builds a hash table whose keys are root nodes and whose values are the lists of member nodes of the corresponding loops.
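In code, this collection step is just a dictionary grouping. A small sketch, using the Figure 3 labels as an example (root_node_label stands for the per-node mapping the DFS produced; the layout is illustrative, not the gist’s exact code):

```python
from collections import defaultdict

# labels after the DFS of Figure 3: A2, A3, A4 share the loop root A2
root_node_label = {"A1": "A1", "A2": "A2", "A3": "A2", "A4": "A2"}

groups = defaultdict(set)
for node, label in root_node_label.items():
    groups[label].add(node)
# groups: {"A1": {"A1"}, "A2": {"A2", "A3", "A4"}}
```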

Figure5. Post process of root_node_labels.

The tricky part is lines 136–156 of the code. For example, for Figure 5, we might get the following 2 loops before the post-processing, if the DFS visiting order is A1, A2, A3, A4, A5, A6.

{“A1”: {A6, A5, A2, A1}, “A2”: {A4, A3}}

Note that the keys of the hash are root nodes. As the graph in Figure 5 is fully strongly connected, we should get the single loop {“A1”: {A6, A5, A2, A1, A4, A3}} if we decompose it into SCCs correctly.

There are 2 back edges, (A4, A2) and (A6, A1), in the graph. If (A4, A2) is found before (A6, A1), A2 is stored in the root_node_label of A4, A3, and A2. After the back edge (A6, A1) is found, A2’s root_node_label is eventually updated to A1, but A3’s and A4’s root_node_label remain A2, because the visits of A3 and A4 have already completed when the back edge (A6, A1) is found. This is why we saw 2 components without the post-processing at lines 136–156.

To fix these outdated labels, we can merge each group that does not contain its own root node into the group that does contain that root node.
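A sketch of that merge step, assuming a root_node_label mapping like the one the DFS produces (the helper names here are mine, not the gist’s): follow each stale label until it reaches a node that labels itself, then regroup.

```python
from collections import defaultdict


def merge_stale_groups(root_node_label):
    """Regroup nodes after resolving stale labels transitively."""
    def resolve(node):
        # follow labels until we reach a node that labels itself;
        # each hop moves to an earlier-discovered node, so this terminates
        while root_node_label[node] != node:
            node = root_node_label[node]
        return node

    merged = defaultdict(set)
    for node in root_node_label:
        merged[resolve(node)].add(node)
    return dict(merged)


# labels from the Figure 5 example: A2 was relabeled to A1,
# but A3 and A4 still carry the outdated label A2
root_node_label = {"A1": "A1", "A2": "A1", "A3": "A2",
                   "A4": "A2", "A5": "A1", "A6": "A1"}
merged = merge_stale_groups(root_node_label)
```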

It’s been a bit of a long way, but we finally have our second algorithm for SCC computation.

Regarding time complexity, we need only one DFS traversal. That’s good compared to the N² traversals of the brute force. However, we need additional computation for the post-processing that merges loops, and it makes the worst-case time complexity worse. Merging loops in the post-processing takes O(L² * N) in the worst case, where L is the number of loops detected (= the number of back edges found in the DFS traversal) and N is the number of nodes. The number of back edges can be O(E), where E is the number of edges, and E can be O(N²). So, theoretically, the worst-case time complexity of the second algorithm is unfortunately worse than that of brute force.

However, for graphs with a small number of back edges, if we can treat that number as a constant, the second algorithm runs faster than brute force, since we only need one DFS traversal.

Back to Tarjan’s Algorithm

In this last section, I’d like to revisit Tarjan’s algorithm in order to appreciate again how noble it is. The next Python code is one implementation of Tarjan’s algorithm.
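The gist is not embedded here, but a textbook recursive implementation of Tarjan’s algorithm is short enough to quote in full. This is the standard formulation, not necessarily the author’s exact code:

```python
def tarjan_scc(graph):
    """Tarjan's algorithm: one DFS, lowlink values, and an explicit stack."""
    index, lowlink = {}, {}
    stack, on_stack = [], set()
    counter = [0]
    sccs = []

    def strongconnect(v):
        index[v] = lowlink[v] = counter[0]
        counter[0] += 1
        stack.append(v)
        on_stack.add(v)
        for w in graph[v]:
            if w not in index:                          # tree edge
                strongconnect(w)
                lowlink[v] = min(lowlink[v], lowlink[w])
            elif w in on_stack:                         # back edge, or cross
                lowlink[v] = min(lowlink[v], index[w])  # edge into current SCC
        if lowlink[v] == index[v]:                      # v is the root of an SCC
            component = set()
            while True:
                w = stack.pop()
                on_stack.discard(w)
                component.add(w)
                if w == v:
                    break
            sccs.append(component)

    for v in graph:
        if v not in index:
            strongconnect(v)
    return sccs


# the example graph from Figure 1 decomposes into the 3 components of Figure 2
example = {
    "A1": ["A2", "A5"], "A2": ["A4"], "A3": ["A2"], "A4": ["A3"],
    "A5": ["A6", "A9"], "A6": ["A7"], "A7": ["A5", "A8"], "A8": [],
    "A9": ["A10"], "A10": ["A1"],
}
components = tarjan_scc(example)
```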

An interesting thing I noticed after coming up with my inefficient algorithms is that, in Tarjan’s algorithm, the lowlink variable plays the role of tracking the root node of back edges. So the core idea seems to be the same. Of course, Tarjan’s algorithm has much more sophisticated logic and data structures, and better time complexity. I finally understood that it implements the backtracking of back edge information so well that it elegantly avoids the issues I encountered in my second algorithm.
