Algorithm Performance: Looking Beyond Big-O Analysis
A review of Robert Sedgewick’s Union-find case study
Recently, I’ve been going through Sedgewick’s Algorithms textbook and really liked his way of talking about algorithms and data structures in general. His point of view is that Big-O is too broad of a brush to paint the picture of an algorithm’s performance. For those reading who don’t know what Big-O is, it’s a tool used to describe the worst-case performance of an algorithm in mathematical terms. It describes how the run-time of an algorithm grows with input size. Sedgewick believes that you can do better and be able to describe performance at a deeper level through analyzing things like how many/what variables are used and the number of array accesses. The goal of this article is to unpack and add some of my own thoughts about what Sedgewick writes in his Union-find analysis. The format of this article will be to go through 4 different implementations of the Union-find data structure (in Java) and discuss the performance of each implementation.
What is Union-find you ask? It’s a data structure that is able to solve the dynamic connectivity problem. It should be able to take in a pair of integers and determine if they are connected. What do we mean by connected? We mean that two integers that have reflexive, symmetric and transitive properties are said to have an equivalence relation. Two integers having the same equivalence relation are said to be part of the same equivalence class which is also known as a connected component. The integers represent different nodes/sites and two integers are connected if they are part of the same component. Union-find is disjoint-set data structure, it finds all the non-overlapping sets.
Before I go into each of the three different implementations I’ll briefly discuss the Union-find API. We have the following methods:
- Count — returns the count of connected components, which is kept track through a count variable.
- Connected — returns whether two sites/nodes are connected, simply calls the find method
- Find — returns the id of the component the passed in site is a part of.
- Union — joins two sites into the same component.
Implementation # 1 — Quick-find
public class UnionFind {
private int[] id;
private int count;public UnionFind(int n) {
count = n;
id = new int[n];
for (int i = 0; i < n; i++) {
id[i] = i;
}
}public int count() {
return count;
}public boolean connected(int p, int q) {
return find(p) == find(q);
}public int find(int p) {
return id[p];
}public void union(int p, int q) {
int pID = find(p);
int qID = find(q);
if (pID == qID) return;
for (int = 0; i < id.length; i++) {
if (id[i] == pID) id[i] = qID;
count--;
}
}}
The quick-find approach is to maintain an invariant that p and q are connected if and only if the id of p and q are equal, which means that all the sites/nodes in a component must have the same id. This approach is called quick-find because find(p)
just returns id[p]
and connected()
just boils down to id[p] == id[q]
.
Although the approach is quick as it only needs to access the id array once to complete its operation, this implementation is not typically useful for large inputs because the union method needs to scan through the whole id array for each input pair.
An analysis of the quick-find algorithm shows that the find()
method uses one array access for each call, two array accesses for the connected()
method, and between n + 3 and 2n + 1 array accesses for each call to union()
. The n + 3 is the lower bound of performance where a component of size one is connected, we have 2 array accesses for the find()
method and n + 1 array accesses in the lower bound scenario: n accesses for the equality check in the for loop and one more for the changing of the value if it passes the condition. The 2n + 1 upper bound comes from the scenario when p is the size n-1, so that means we have n array accesses for the equality check and n-1 array value changes. That gives us 2n-1, when we add in the two array accesses for the find methods we get the total 2n + 1.
If we consider the Union-find data structure in the context of the dynamic connectivity problem, which is where a program takes in a sequence of pairs and builds the union find data structure and outputs the pair if it was not previously connected. What this means is for each pair we’re calling the union()
method. In the scenario that we end up with one component, this requires n-1 calls to union()
and at least n+3 array accesses which gives us (n+3)(n-1) ~ n² array accesses. This is quadratic time. When we talk about the order of growth for the performance of a program, we use the following terms:
- Constant — 1
- Logarithmic — log(n)
- Linear — n
- Linearithmic — n * log(n)
- Quadratic — n²
- Cubic — n³
- Exponential — 2^n
Based on the analysis of the quick-find approach, we can surmise that this approach wouldn’t be very practical for large inputs. We can now move on to the second approach.
Implementation # 2: Quick-union
public int find(int p) {
while (p != id[p]) p = id[p];
return p;
}public void union(int p, int q) {
int i = find(p);
int j = find(q);
if (i == j) return;
id[i] = j;
count--;
}
This approach requires that you look at the data structure in a different approach, the values in the id array no longer signify links to other nodes but links to a root node. The root node would link to itself. In fact, when the data structure is first initialized, all the sites are roots to themselves. When we implement the find()
method, we’re following the links to the root node.
Quick-union uses the forest of trees representation. What this means is that the id array represents a forest of trees. The trees are different components. The terms that describe a tree are size, depth and height. The size is the number of nodes in the tree, the depth of a node is how many links to the root node, and the height of a tree is the maximum depth among its nodes.
The implementation above is called quick-union, because we follow the nodes p and q and simply link them by linking their roots. The choice to link p to q or q to p is arbitrary, but it is a choice that we will later see is an opportunity for optimization.
The quick-union approach seems to be faster than the quick-find approach because it doesn’t have to go through the entire array for each input pair. The question is how much faster is it? It’s difficult to determine how much faster because the performance of this algorithm is input dependent. The find()
method at best would use just one array access and at worst would need 2n-1 array accesses (n array accesses for the equality check and n-1 for the value change). The worst case is when you have a node with the depth of n. Also, the 2n-1 count is a conservative measure because compiled code will typically not do an array access for the second reference to id[p]
. For the union()
method we can envision a best-case scenario where it runs in linear time, that would mean that each node in the sequence of pairs in the dynamic connectivity problem have a depth of zero. The worst case scenario is when for example in a union-find data structure initialized with five sites has the following sequence: 0–1, 0–2, 0–3, 0–4. This would run in quadratic time, because each time the depth of the tree would increase. The bulk of the work done in this implementation is in the find()
method. The number of array accesses for the pair 0-i is 2i+ 1 (site 0 is at depth i-1 and site i is at depth 0). Accordingly, we get (3 + 5 + 7 + … + 2n-1) ~ n² array accesses, which indicates quadratic time.
The quick-union implementation certainly is an improvement because it removes quick-find’s liability that union()
always takes linear time, but it’s not substantially better because we can’t guarantee in every situation that it will outperform quick-find. Hence, we move on to the next implementation: weighted quick-union which optimizes on the link between p and q in the union() method.
Implementation # 3: Weighted Quick-union
public class UnionFind {
private int[] id;
private int[] sz;
private int count;public UnionFind(int n) {
count = n;
id = new int[n];
sz = new int[n];
for (int i = 0; i < n; i++) {
id[i] = i;
sz[i] = 1;
}
}public int count() {
return count;
}public boolean connected(int p, int q) {
return find(p) == find(q);
}public int find(int p) {
while (p != id[p]) {
p = id[p];
}
return p;
}public void union(int p, int q) {
int i = find(p);
int j = find(q);
if (i == j) return;
if (sz[i] < sz[j]) {
id[i] = j;
sz[j] += sz[i];
} else {
id[j] = i;
sz[i] += sz[j];
}
count--;
}}
In this implementation, instead of arbitrarily connecting p and q, we connect them based on their size, we always want to connect the smaller sized component with the bigger sized component. In order to keep track of size, we updated the code a little bit to include a size array.
When analyzing the algorithm we have to keep in mind the property of a tree amongst the forest of trees: each tree has a height of k nodes when it has 2^k nodes. Furthermore, when we merge two trees we get 2 * 2^k → 2¹ * 2^k → 2 ^ k + 1 nodes in the new tree and we increase the height of the tree to k + 1. This observation generalizes to provide a proof that the weighted quick-union approach guarantees logarithmic performance.
We can unpack this a bit more. We take a look at the proposition that any node in a forest built by this approach for n sites has a depth of at most log(n). We can prove this by induction: the height of every tree of size s is at most log(s). The base case follows from the fact that the tree is height zero when k is one. By inductive hypothesis assume that the tree height of a tree of size i is at most log(i) for all i < s. When we combine a tree of size i with a tree of size j and i is less than j, than we’re increasing the depth of every node in the smaller tree by one, but now they are in a tree of size i + j = s, so the property is preserved since 1 + log(i) = log(i + i) ≤ log(i + j) = log(s).
The weighted quick-union algorithm uses at most c * m * log(n) array accesses to process m pairs amongst n sites. This performance is much better than the performance guarantees of the earlier two implementations. There is one last implementation that seeks to optimize even further! This gets us as close as possible to constant time: weighted quick-union w/ path compression.
Implementation # 4: Weighted Quick-union w/ path compression
public int find(int p) {
int root = p;
while (id[root] != root) root = id[root];
while (id[p] != p) {
p = id[p]
id[p] = root;
}
return root;
}
This last implementation’s only difference between the previous implementation is that it compresses the path from node p to the root, so every node along the path will simply point directly to the root. Although this is an improvement, we’re not likely to see much improvement over the weighted quick-union method in a practical situation. This approach comes close to amortized constant time, but not quite. There is no algorithm that exists that can compute each union-find operation in amortized constant time.
In conclusion, I’d like to drill in the idea that one should consider the situation at hand before designing an algorithm and should develop a solution in an iterative manner and consider and verify the performance of the algorithm through empirical analysis, mathematical analysis or both. Analyzing Quick-find helped us see that it’s best performance would be logarithmic / close to but not quite amortized constant time and we were able to come to that conclusion based on mathematical proofs and empirical data.
Stay tuned for future articles discussing other topics in Robert Sedgewick’s Algorithms textbook.
If you’d like to connect with me (I’m looking for an engineer role!) please follow the link to my linkedin: https://www.linkedin.com/in/hassan-b-malik/