What is K Medoid Clustering: Why and How?
Being a data science enthusiast (which you are, since you’re reading this), you must have come across quite a number of K-prefixed phrases like k means, k modes, k nearest neighbors, and so on. While they might seem a little baffling at first glance, with a little thorough reading you’ll find that, although they might look similar, each makes perfect sense in its own right. In this lesson we’re gonna take a closer look at one of their siblings, K Medoid Clustering. So let’s get going.
Here’s a summary of what we’ll be doing here: first, why K Medoids; then a decent definition; and finally we’ll work through an example to help solidify your understanding. I’m gonna assume you’re familiar with the clustering problem and the k means algorithm; if not, here’s a little brush-up.
Why K medoids
To explain why we need K Medoids, or why the concept of a medoid over a mean, let’s seek an analogy. Here’s a list of the weights of 5 students in your class: 62, 64, 65, 62, 120. Their mean is 74.6, which is greater than 80% of the individual weights in the set, and what’s funnier is that the mean excluding the last item (120) is 63.25, which represents the set quite well. Values like 120, which lie far outside the general distribution of the data points, are called outliers, and as seen above, the mean is the measure worst affected by them. Now think of another measure, the median. Here’s how to find it: sort the data in ascending order, which gives 62, 62, 64, 65, 120, and point at the one in the middle (in case of an even number of items, it’s the average of the two in the middle). That’s the median, which is 64 in this case. Note that the median represents the data almost accurately even in the presence of an outlier. Statistically, medians are less prone to outliers, which makes them a more reliable measure of center. Hopefully this makes it clear why K Medoids is preferred over k means.
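If you want to verify those numbers yourself, here’s a tiny check using nothing but Python’s standard library:

```python
from statistics import mean, median

# Weights of the 5 students, the last one (120) being the outlier
weights = [62, 64, 65, 62, 120]

print(mean(weights))        # 74.6  -> dragged up by the outlier
print(mean(weights[:-1]))   # 63.25 -> without the outlier, far more representative
print(median(weights))      # 64    -> barely affected by the outlier
```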
A decent definition
We are now ready to ingest a nice, intuitive definition of the problem at hand. Formally speaking, K Medoids is a clustering algorithm that partitions the data points into clusters around medoids (the least dissimilar points) and constantly attempts to lower the dissimilarity among the points in the same cluster. The key point here is that a medoid is essentially an actual data point from the input set, unlike in k means, where the mean is merely an average and usually not a member of the dataset.
The algorithm
The algorithm is quite intuitive. In words: we’re given two input parameters, the value of k, i.e. the number of clusters to be formed, and the dataset on which the clusters are formed. First, take k random data points as medoids. Then, for every data point except the ones taken as medoids, compute the dissimilarity with each of the k medoids using any distance metric, such as Euclidean, Manhattan, or anything else, and assign the current point to the cluster represented by the medoid with which it has the least dissimilarity. Once every node (point) has been assigned to a cluster, calculate the total dissimilarity of the iteration, which is the sum of the dissimilarities between every node and the medoid of the cluster in which it resides. This is one iteration. This is repeated with different nodes as medoids until the total dissimilarity turns out to be greater than that of the previous iteration, at which point we stop and keep the previous clustering. Below is a sketch of the procedure,
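Since a pseudocode diagram doesn’t translate well here, the following is a minimal Python sketch of the same recipe. The helper names (manhattan, total_cost, k_medoids) are my own, and the re-pick strategy follows the random selection described above rather than the systematic swap step of the classic PAM algorithm:

```python
import random

def manhattan(p, q):
    """Manhattan distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

def total_cost(points, medoids):
    """Assign every point to its nearest medoid; return the clusters and
    the total dissimilarity (sum of point-to-medoid distances)."""
    clusters = {m: [] for m in medoids}
    cost = 0
    for p in points:
        nearest = min(medoids, key=lambda m: manhattan(p, m))
        clusters[nearest].append(p)
        cost += manhattan(p, nearest)
    return clusters, cost

def k_medoids(points, k, max_tries=100):
    """Naive k-medoids: keep picking new random medoid sets and stop as soon
    as a pick no longer lowers the total dissimilarity."""
    medoids = random.sample(points, k)
    clusters, best_cost = total_cost(points, medoids)
    for _ in range(max_tries):
        candidate = random.sample(points, k)
        new_clusters, new_cost = total_cost(points, candidate)
        if new_cost >= best_cost:   # swapping cost is not negative -> stop
            break
        medoids, clusters, best_cost = candidate, new_clusters, new_cost
    return medoids, clusters, best_cost
```

Calling k_medoids(data, 3) on a list of (height, weight) tuples would give back the chosen medoids, the clusters keyed by medoid, and the final total dissimilarity.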
Now, who’s up for a worked-out example? Come on, it’s gonna be fun and enlightening(!).
How it works
We’re gonna work out an example here. The statement is something like this: given a dataset containing the height and weight of n = 10 people, arrange the dataset into 3 clusters. Here’s the tabular form,
This dataset was created on the fly, for the sole purpose of illustration, and any real-life inconsistency should be ignored. Now, note that we’re expecting the dataset to be clustered into 3 disjoint sets. Let’s take the 3 medoids randomly; for convenience I’m gonna select the first three nodes, so we have the medoids = { m1(155, 67), m2(161, 68), m3(152, 76) }. Now if you’re doing it by hand, like me, here’s the trick: chalk out 3 more columns on the right (since k = 3) with headers d1, d2 and d3, where dj represents the distance of the ith node from the jth medoid. Calculation time! I’m gonna use the Manhattan distance since it’s plain math, devoid of the square-and-root hassle. If you’re not familiar with it, the Manhattan distance between (a, b) and (c, d) is |a−c| + |b−d|, i.e. the sum of the absolute coordinate-wise differences. Once the distances are calculated, here’s how the table should appear,
A quick recap of the calculations for node 4 (178, 82): to calculate d1, we take medoid m1(155, 67), hence d1 = |155−178| + |67−82| = 23 + 15 = 38; likewise d2 = |161−178| + |68−82| = 31 and d3 = |152−178| + |76−82| = 32, so the least distance for node 4 is d2 (there’s a quick check of this in the snippet after the list below). Here are a few things a clever eye would note in the second table,
- It excludes the first 3 nodes since they are the medoids
- I’ve marked the least distance with apple green because that’s the cluster we’re gonna push the corresponding node into
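As promised, here’s the node-4 arithmetic again in a couple of lines of Python, using the medoids we picked above:

```python
def manhattan(p, q):
    # |a - c| + |b - d| for points (a, b) and (c, d)
    return sum(abs(x - y) for x, y in zip(p, q))

medoids = [(155, 67), (161, 68), (152, 76)]      # m1, m2, m3
node_4 = (178, 82)

print([manhattan(node_4, m) for m in medoids])   # [38, 31, 32] -> least is d2, so N4 joins cluster 2
```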
The clusters would initially consist of only the corresponding medoids, and once the calculations wrap up, they’re gonna look much like this:
Cluster 1: { N1, N5, N7, N10 }
Cluster 2: { N2, N4, N6, N8, N9 }
Cluster 3: { N3 }
Ni denotes the ith node. These clusters are the result of the first iteration, and to calculate the overall dissimilarity measure for this iteration we add up all the least distances: D1 = 31 + 3 + 11 + 15 + 24 + 3 + 5 = 92. This metric is used as a ranking score to track the best clustering possible.
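And if you want to double-check that total, the least distances read off the marked column add up as expected:

```python
least_distances = [31, 3, 11, 15, 24, 3, 5]   # one per non-medoid node
print(sum(least_distances))                   # 92, the total dissimilarity D1 of iteration 1
```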
In the following iterations, which I‘m not going to show here, the same calculations are repeated with different, randomly selected medoids, and this goes on until, for the ith iteration, Dᵢ − Dᵢ₋₁ > 0, i.e. the swapping cost is positive, at which point we stop and keep the clustering from the previous iteration.
So that’s it. Hopefully this builds a clear and understandable foundation. You may code this out yourself; it shouldn’t take long, and as always, SPREAD LOVE.
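And if you’d rather lean on a library than hand-roll the loop, the scikit-learn-extra package ships a KMedoids estimator. Here’s a rough sketch assuming that package is installed; the data here is randomly generated just to have something to fit:

```python
import numpy as np
from sklearn_extra.cluster import KMedoids

# Synthetic (height, weight) pairs, purely for illustration
rng = np.random.default_rng(0)
X = np.column_stack([rng.integers(150, 181, size=10),   # heights
                     rng.integers(60, 91, size=10)])    # weights

km = KMedoids(n_clusters=3, metric="manhattan", method="pam", random_state=0).fit(X)

print(km.labels_)           # cluster index assigned to each row
print(km.cluster_centers_)  # the medoids -- actual rows of X, unlike k-means centroids
```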