Merging Communities

Rupesh Sasne · Published in Algorist Weekly · Jul 31, 2019 · 4 min read

Ever wondered how Facebook or LinkedIn figures out whether two people may know each other through some common friends? Read on to find out… 😉

I came across this problem statement while practicing problem-solving on HackerRank.

So, there were two types of queries, M & Q:

  1. M I J => merge the communities containing persons I & J (if they belong to different communities).
  2. Q I => print the size of the community to which person I belongs.
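For example (an illustration of my own; every person starts alone in their own community):

M 1 2   => the communities of 1 & 2 merge into {1, 2}
Q 1     => prints 2
M 2 3   => {1, 2} and {3} merge into {1, 2, 3}
Q 3     => prints 3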

The algorithms seemed pretty obvious: merge the two different communities, & return the size of the community whenever requested.

Merge

merge(i, j)
  ci = community(i)
  cj = community(j)
  if (ci == cj) return
  mergeCommunities(ci, cj)
  1. Maintain a list of sets.
  2. Initially, each person is in its own community, represented by a singleton set.
  3. To merge the communities of p1 & p2, search the list for their respective community sets. Only if the two sets are different, merge them and clear one of them.

Query

query(i)
  ci = community(i)
  return ci.size()
  1. Search the list to get the community set for the queried person.
  2. Return its size.

That seemed pretty good, initially :). So I implemented the above algorithms as follows.
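Roughly, in Python it looked like this (a minimal sketch; the class and method names are my own):

class NaiveCommunities:
    def __init__(self, n):
        # initially, each person is in its own singleton community
        self.communities = [{i} for i in range(1, n + 1)]

    def community(self, person):
        # linear scan over the list of sets: O(n) per lookup
        for c in self.communities:
            if person in c:
                return c

    def merge(self, i, j):
        ci = self.community(i)
        cj = self.community(j)
        if ci is cj:
            return
        ci |= cj      # move everyone from cj into ci
        cj.clear()    # ... and clear the other set

    def query(self, i):
        return len(self.community(i))

Every merge and every query pays that O(n) scan, which is where the O(n*q) below comes from.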

I did some dry runs, ran the sample tests; it worked great up until the moment I submitted it.

It gave me the TLE verdict. 🤦‍♂

I should have expected this; the TLE verdict was obvious, as the time complexity of this naive algorithm is O(n*q), where 1 ≤ n ≤ 10⁵ & 1 ≤ q ≤ 2*10⁵.

“How can I improve it?” I gave it some thought.

I had to design an efficient data structure for merge such that searching would be better than O(n). And what is better than O(n)? O(log n), isn’t it?

What if I connected two persons P1 & P2 (and hence their communities) via some common ancestor? In other words, what if I put P1 & P2 in an n-ary tree, where initially every node acts as the root of its own community with size 1?

Merging communities via a common ancestor

In this case, if we maintain a hash from each person to their node, finding the person will require O(1) & finding the root of their community will take O(log n).

Proposition: the depth of any node x is at most log n.

Let’s modify the Merge & Query algorithms

Merge

merge(p1, p2)
  p1Root = root(p1)
  p2Root = root(p2)
  if (p1Root == p2Root) return
  p2Root.parent = p1Root
  p1Root.size += p2Root.size

root(p)
  traverse to the root using the parent pointer; the root is the node
  that doesn't have a parent (or that points to itself as its parent)

Query

query(p)
  pRoot = root(p)
  return pRoot.size

And here is the implementation; I'm not providing the main method for brevity.

Merging communities via a common ancestor
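In Python, this version looks roughly like the following (a minimal sketch with names of my own; note that merge always hangs p2's root under p1's root):

class Node:
    def __init__(self):
        self.parent = None   # None marks a root
        self.size = 1        # community size; meaningful only at the root

class Communities:
    def __init__(self, n):
        # hash from person to node, so locating a person is O(1)
        self.nodes = {i: Node() for i in range(1, n + 1)}

    def root(self, p):
        # walk the parent pointers up to the root
        node = self.nodes[p]
        while node.parent is not None:
            node = node.parent
        return node

    def merge(self, p1, p2):
        p1_root = self.root(p1)
        p2_root = self.root(p2)
        if p1_root is p2_root:
            return
        p2_root.parent = p1_root      # unconditionally attach p2's tree
        p1_root.size += p2_root.size

    def query(self, p):
        return self.root(p).size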

That also seemed pretty good, at least better than before.

I ran the sample tests; it worked great, up until the moment I submitted it.

It gave me a TLE verdict, again!!! ⏰

But it was definitely better than the first solution; only 3 tests were failing this time due to timeout. I had to find out what was going wrong. The time complexity should have dropped to O(n + q*log n) from O(q*n); still, it was giving a TLE verdict.

“Is there any scope for improvement?” 🤔

Had I really achieved O(n + q*log n)? Let’s get back to the diagram Merging communities via a common ancestor: in the second step, while merging p2 & p6, what if we had made p6 the parent of the root of p2? In other words, what if we merged the larger community into the smaller one?

Merging larger communities into smaller ones

If we keep doing this, one can clearly see that the height of the tree will grow dramatically. I hadn’t considered this scenario in the implementation, and that's why the algorithm performed at O(q*n) in the worst case.
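Concretely, consider the sequence merge(p2, p1), merge(p3, p2), merge(p4, p3), … (an example of my own): each call attaches the current root under a fresh singleton, so the tree degenerates into a chain of depth n - 1 and a single root() call degrades to O(n).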

So I guarded the merge algorithm as follows; the idea was to merge the smaller community into the larger one to keep the height in check.

Optimized merge algorithm
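In the Python sketch above, the guard amounts to replacing merge with a weighted union (union by size):

    def merge(self, p1, p2):
        p1_root = self.root(p1)
        p2_root = self.root(p2)
        if p1_root is p2_root:
            return
        if p1_root.size < p2_root.size:
            p1_root, p2_root = p2_root, p1_root  # make p1_root the larger one
        p2_root.parent = p1_root   # the smaller tree hangs off the larger root
        p1_root.size += p2_root.size

With this guard, a node's depth grows only when its tree is merged into one at least as large, so the size of its tree at least doubles for every extra level; that is exactly why the depth stays bounded by log n.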

That seemed pretty good; I ran the sample tests, and this time it worked great even after I submitted 😅.

The complexity is O(n + q*log n).

Now the question: how do we know whether two persons are connected, that too in O(log n) time?

connected(p1, p2) = root(p1) == root(p2)
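In the sketch above, that's a one-liner:

    def connected(self, p1, p2):
        # two persons are connected iff their trees share a root
        return self.root(p1) is self.root(p2)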

The algorithm I used is called:

Union-Find (Disjoint Sets)

One can optimize the above implementation further by compressing the path to the root node. How? Think and let me know in the comments.
