The ROCK Algorithm in Machine Learning

Aman Gupta · Published in Geek Culture · Jun 7, 2021

‘If You Smell What The Rock Is Cooking!’

This is exactly what came to my mind when I first read about a hierarchical clustering algorithm called ‘ROCK’. The creators of this technique unknowingly drew an analogy between Dwayne Johnson, the versatile American actor, producer, retired professional wrestler, and former American and Canadian football player, and a clustering technique that can treat data like his long list of achievements, i.e. categorical attributes, as variables in a clustering exercise.
In other words, ROCK, not Mr. Johnson, but RObust Clustering using linKs, is a novel technique that helps us cluster categorical data.

Introduction

Clustering is a data mining technique that groups similar data points or queries together, which helps in identifying similar subject areas. A major challenge is identifying heterogeneous subject areas where queries are frequently asked. There are a number of agglomerative (hierarchical) clustering algorithms used to cluster data, and most of them rely on distance measures to calculate the similarity between data points. This reliance becomes a hindrance when our data consists of many, or only, categorical variables.

Recently, some members of my team were dealing with Experian data that had more than 300 variables, all of them categorical. One solution we came up with was a mix of one-hot encoding, bucketing based on information gain, and bucketing based on simple majority. This resulted in close to 2000 encoded variables. Specialists in the field of data science would suggest performing some kind of feature-importance analysis to control this explosion of variables. Eliminating features with low importance/impact helped us limit the number of input variables and obtain nicely homogeneous clusters. The project was delivered and the client was happy with the outcome.
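As a rough illustration of how quickly one-hot encoding inflates the feature space, here is a small sketch in Python; the column names, category counts and data values below are made up for the example and are not the actual Experian fields:

```python
import pandas as pd

# Hypothetical frame standing in for a wide, all-categorical data set
df = pd.DataFrame({
    "region":    ["N", "S", "E", "W"] * 25,
    "segment":   [f"seg_{i % 10}" for i in range(100)],
    "risk_band": [f"band_{i % 7}" for i in range(100)],
})

encoded = pd.get_dummies(df)          # one indicator column per category level
print(df.shape, "->", encoded.shape)  # (100, 3) -> (100, 21)
```

Three categorical columns already become 21 indicator columns here; with 300+ variables, each with many levels, the encoded width grows into the thousands, which is exactly the explosion described above.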

However, it occurred to me that there might be a more sophisticated solution to the problem of clustering categorical variables. This led me to ‘The ROCK’.

Traditional hierarchical algorithms rely on Lp distance metrics (such as the Euclidean distance) or the Jaccard coefficient to measure the similarity between data points. For domains with discrete non-numeric attributes, the unsuitability of the Lp distance metrics and the Jaccard coefficient as estimates of the similarity between clusters is well established, and the situation only worsens as the number of attributes/dimensions increases. So ROCK instead introduces the notion of links between data points, which helps us overcome the problems with Lp distances and the Jaccard coefficient.

Explanation

Let a pair of points be neighbors if their similarity exceeds a certain threshold. The similarity value for a pair of points can be based on Lp distances, the Jaccard coefficient or any other non-metric similarity function. The number of links between a pair of points is then the number of common neighbors of the two points. Points belonging to a single cluster will in general have many common neighbors, and consequently more links. Merging the clusters/points with the largest number of links first therefore results in better, more meaningful clusters.

Unlike distances or similarities between a pair of points, which are local properties involving only the two points in question, the link concept incorporates global information about the other points in their neighborhood. The larger the number of links between a pair of points, the greater the likelihood that they belong to the same cluster. Clustering using links thus injects global knowledge into the process and is therefore more robust.
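To make the neighbor and link definitions concrete, here is a minimal sketch in plain Python. The records, the attribute=value encoding and the threshold value are made-up illustrative choices, not part of the original paper:

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard coefficient between two records viewed as sets of attribute=value items."""
    return len(a & b) / len(a | b)

# Toy categorical records (made-up values for illustration)
records = [
    {"colour=red", "size=S", "brand=acme"},
    {"colour=red", "size=M", "brand=acme"},
    {"colour=red", "size=L", "brand=acme"},
    {"colour=blue", "size=S", "brand=zeta"},
    {"colour=blue", "size=M", "brand=zeta"},
    {"colour=blue", "size=L", "brand=zeta"},
]

theta = 0.3  # similarity threshold: pairs scoring above it are "neighbors"
n = len(records)

# neighbors[i] holds the indices of points whose similarity with point i exceeds theta
neighbors = [set() for _ in range(n)]
for i, j in combinations(range(n), 2):
    if jaccard(records[i], records[j]) > theta:
        neighbors[i].add(j)
        neighbors[j].add(i)

def link(i, j):
    """Number of common neighbors of points i and j."""
    return len(neighbors[i] & neighbors[j])

print(link(0, 1))  # points in the same red/acme group share a common neighbor -> 1
print(link(0, 3))  # points from different groups share no neighbors -> 0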


The ROCK algorithm is divided into three general parts:

  1. Obtaining a random sample of the data.
  2. Performing agglomerative clustering on the sample using the link-based approach. A goodness measure is used to determine which pair of clusters is merged at each step.
  3. Using these clusters, assigning the remaining data points to them.

Candidate merges are kept in a heap, ordered by the goodness measure:

goodness(Ki, Kj) = link(Ki, Kj) / [ (ni + nj)^(1 + 2*f(theta)) - ni^(1 + 2*f(theta)) - nj^(1 + 2*f(theta)) ]

Here link(Ki, Kj) is the number of cross links between the two clusters, and ni and nj are the numbers of points in each cluster. The function f(theta) depends on the data, but it is chosen to satisfy the property that each point in Ki has approximately ni ^ f(theta) neighbors within the cluster.

The pair of clusters for which this goodness measure is maximum is the best pair to merge at any given step. Based on the logic of links, pairs of clusters with a very large number of cross links are good candidates for merging. However, this naive approach of looking only at the raw cross-link count works for well-separated clusters; when outliers are present, or points from different clusters happen to be neighbors, a large cluster can swallow other clusters, and points from different clusters end up merged into a single cluster, simply because a large cluster typically has more cross links with every other cluster. We therefore use the expected number of cross links between the clusters (the denominator of the goodness measure above) as a normalization factor, a heuristic that steers the algorithm towards merging clusters that are strongly linked relative to their size.
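To see how this normalization behaves in practice, here is a small sketch of the goodness measure. The f(theta) = (1 - theta) / (1 + theta) form is the heuristic the original paper proposes for market-basket-style data, and the link counts and cluster sizes below are made-up numbers:

```python
def f(theta):
    # Heuristic form of f(theta) suggested in the original ROCK paper for
    # market-basket-like data; other data may warrant a different choice.
    return (1.0 - theta) / (1.0 + theta)

def goodness(cross_links, n_i, n_j, theta):
    """Cross links between clusters Ki and Kj, normalized by the expected
    number of cross links given the two cluster sizes."""
    exponent = 1.0 + 2.0 * f(theta)
    expected = (n_i + n_j) ** exponent - n_i ** exponent - n_j ** exponent
    return cross_links / expected

# A big cluster with many raw cross links to a small one...
print(goodness(cross_links=40, n_i=100, n_j=5, theta=0.5))  # ~0.24
# ...still scores lower than a small, tightly linked pair of clusters.
print(goodness(cross_links=12, n_i=6, n_j=5, theta=0.5))    # ~0.60
```

Without the denominator, the first pair (40 cross links) would always win; with the normalization, the smaller but proportionally better-connected pair is merged first.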

Implementation

PyPI offers a robust library for implementing the ROCK algorithm in Python. For R lovers, a good implementation and explanation can be found here.
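One option on PyPI is the pyclustering package, which ships a ROCK implementation. The sketch below assumes its usual process()/get_clusters() interface and a small made-up numeric sample, so treat the exact parameters as illustrative and double-check the library's documentation before relying on them:

```python
# pip install pyclustering
from pyclustering.cluster.rock import rock

# Small made-up 2-D sample; pyclustering's ROCK treats points within the
# given radius of each other as neighbors.
sample = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.1],
          [5.0, 5.0], [5.1, 4.9], [4.8, 5.2]]

# Arguments: data, neighborhood radius, desired number of clusters
instance = rock(sample, 1.0, 2)
instance.process()

print(instance.get_clusters())  # expected: two groups of indices, e.g. [[0, 1, 2], [3, 4, 5]]
```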

Conclusion

We have a number of other weapons for clustering categorical data, such as K-modes and MROCK (Modified ROCK), which may give us more homogeneous clusters than ROCK does. Then there are the data preparation techniques discussed earlier, like encoding and bucketing, which are also lifesavers under certain circumstances. The takeaway is that the choice of technique/algorithm greatly depends on the data at hand. We must evaluate the performance of each clustering approach, using extrinsic measures such as the Adjusted Rand index, Fowlkes-Mallows score, mutual-information-based scores, homogeneity, completeness and V-measure, or intrinsic measures such as the Silhouette Coefficient, Calinski-Harabasz Index and Davies-Bouldin Index, to decide on the best one. This is why it is called Data Science: it takes domain expertise, experience and experimentation to get to the most appropriate solution.
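For instance, when ground-truth labels are available you can compare approaches with an extrinsic score, and otherwise fall back to an intrinsic one. A minimal sketch with scikit-learn, using dummy labels and features purely for illustration:

```python
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Dummy ground-truth vs. predicted cluster labels (extrinsic evaluation)
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1]
print(adjusted_rand_score(y_true, y_pred))

# Dummy encoded feature matrix (intrinsic evaluation, no ground truth needed)
X = [[0, 1], [0, 1], [1, 0], [1, 0], [1, 1], [1, 1]]
print(silhouette_score(X, y_pred))
```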

This article would be incomplete without thanking Sir Sudipto Guha, Sir Rajeev Rastogi and Sir Kyuseok Shim for developing this amazing clustering technique — ROCK.

I look forward to your comments and suggestions.
Thanks for Reading! Stay Safe!
