Partitional Clustering using CLARANS Method with Python Example

A detailed explanation of the Clustering Large Applications based on RANdomized Search (CLARANS) clustering technique, with an example in Python

Yash Dagli
Analytics Vidhya
5 min read · Aug 13, 2019

--

Table of Contents:

  1. Overview of Clustering
  2. Brief Description of Partitioning Methods
  3. Comparison of Partitioning Methods
  4. Overview of CLARANS
  5. CLARANS Algorithm
  6. Python Example
  7. References

1. Overview of Clustering.

Clustering is a form of unsupervised learning, because no class labels are present in the data.

In general, clustering is the process of partitioning a set of data objects into subsets, where each subset is a cluster: objects in a cluster are similar to one another, yet dissimilar to objects in other clusters.

Based on the characteristics of the algorithms, there are 4 main types of clustering techniques: partitioning methods, hierarchical methods, density-based methods, and grid-based methods.

CLARANS is a type of Partitioning method.

2. Brief Description of Partitioning Methods.

Partitioning methods are the most fundamental type of cluster analysis: they organize the objects of a set into several exclusive groups, or clusters (i.e. each object belongs to exactly one cluster).

Partitioning algorithms require the number of clusters ( k ) as their starting point.

Thus, given a dataset D consisting of n points and a value k (k << n), a partitioning algorithm organizes the objects into k partitions (clusters).

The clusters are formed by optimizing an objective partitioning criterion, such as a dissimilarity function based on distance, so that the objects within a cluster are “similar” to one another and “dissimilar” to objects in other clusters in terms of the data set attributes.
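A common instance of such a criterion is the total cost of a partition: the sum of the distances from each point to its nearest cluster representative (medoid). The following is a minimal sketch; the point set and medoids are made-up illustration data:

```python
import math

def total_cost(points, medoids):
    """Sum of Euclidean distances from each point to its nearest medoid."""
    return sum(min(math.dist(p, m) for m in medoids) for p in points)

# Two obvious groups; one medoid chosen inside each group
points = [(0, 0), (1, 0), (0, 1), (9, 9), (10, 9)]
medoids = [(0, 0), (9, 9)]

print(total_cost(points, medoids))  # 3.0
```

Partitioning algorithms search for the k medoids (or centroids) that minimize this quantity.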

3. Comparison of Partitioning Methods.

K-means: The k-means algorithm defines the centroid of a cluster as the mean value of the points within the cluster.

That is why K-means is sensitive to noise and outliers: even a small number of such points can substantially shift the mean.

3.1 — K-medoids: To overcome this sensitivity to outliers, instead of taking the mean value as the centroid, we can take an actual data point (a medoid) to represent each cluster. This is what K-medoids does.
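A tiny numeric sketch of why the medoid is more robust than the mean (the data values are made up for illustration):

```python
points = [1.0, 2.0, 3.0, 4.0, 100.0]  # 100.0 is an outlier

# The mean is dragged far toward the outlier.
mean = sum(points) / len(points)  # 22.0

# The medoid is the actual data point that minimizes the total
# distance to all other points, so it stays near the bulk of the data.
medoid = min(points, key=lambda c: sum(abs(p - c) for p in points))  # 3.0

print(mean, medoid)
```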

But the k-medoids method becomes very expensive when the dataset and the value of k are large.

3.2 — CLARA: To scale up the K-medoids method, CLARA was introduced. Instead of taking the whole dataset into consideration, CLARA uses a random sample of the dataset, from which the best medoids are taken.

But the effectiveness of CLARA depends on the sample size: CLARA cannot find a good clustering if any of the best sampled medoids is far from the best k-medoids.

3.3 — CLARANS (Clustering Large Applications based upon RANdomized Search) : It presents a trade-off between the cost and the effectiveness of using samples to obtain clustering.

4. Overview of CLARANS:

CLARANS presents a trade-off between the cost and the effectiveness of using samples: unlike CLARA, it does not confine the search to a fixed sample, but instead examines a random sample of neighbors at each step of the search.

First, it randomly selects k objects in the data set as the current medoids. It then randomly selects a current medoid x and an object y that is not one of the current medoids.

Then it checks for the following condition:

Can replacing x by y improve the absolute-error criterion?

If yes, the replacement is made. CLARANS conducts such a randomized search l times. The set of the current medoids after the l steps is considered a local optimum.

CLARANS repeats this randomized process m times and returns the best local optimum as the final result.

5. CLARANS Algorithm

Explanation:

The algorithm requires numlocal (the number of local searches, i.e. iterations, to perform), maxneighbor (the maximum number of neighbors examined) and the number of clusters to be formed ( k ) as input.

Before the iteration starts, mincost (the best cost found so far) is set to infinity and bestnode (the best set of medoids) is set to an empty tuple; then i is set to 1.

Now k random data points are selected as the current medoids, and clusters are formed by assigning each point to its nearest medoid (Euclidean distance can be used).

After this, an inner loop starts with j set to 1. A random current medoid is selected, along with a random non-medoid data point (a random neighbor) as a candidate replacement. If swapping in the candidate yields a lower TotalCost (the sum of the distances between all the points in the clusters and their respective medoids) than keeping the current medoid, the swap is made and j is reset to 1; otherwise j = j + 1.

Once j > maxneighbor, the current medoids form a local optimum, and their TotalCost is compared with mincost. If the TotalCost is less than mincost, then mincost is updated to this TotalCost and bestnode is updated to the current medoids.

i is then incremented; if it is greater than numlocal, bestnode is returned as the output, otherwise the whole process is repeated.
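The steps above can be sketched in plain Python. This is a simplified illustration of the algorithm, not a library implementation; the helper names, the toy data and the parameter choices are assumptions made for this sketch:

```python
import math
import random

def total_cost(data, medoids):
    """Sum of distances from every point to its nearest medoid."""
    return sum(min(math.dist(p, m) for m in medoids) for p in data)

def clarans(data, k, numlocal, maxneighbor, seed=0):
    rng = random.Random(seed)
    mincost, bestnode = float("inf"), None
    for _ in range(numlocal):                  # numlocal random restarts
        current = rng.sample(data, k)          # random initial medoids
        j = 0
        while j < maxneighbor:
            i = rng.randrange(k)               # pick a random current medoid
            y = rng.choice([p for p in data if p not in current])
            neighbor = current[:i] + [y] + current[i + 1:]
            if total_cost(data, neighbor) < total_cost(data, current):
                current, j = neighbor, 0       # accept the swap, reset counter
            else:
                j += 1                         # reject, try another neighbor
        cost = total_cost(data, current)       # cost of this local optimum
        if cost < mincost:
            mincost, bestnode = cost, current  # keep the best local optimum
    return bestnode, mincost

# Two well-separated toy clusters
data = [(0, 0), (1, 0), (0, 1), (9, 9), (10, 9), (9, 10)]
medoids, cost = clarans(data, k=2, numlocal=2, maxneighbor=10)
print(medoids, cost)
```

With these toy points the search should settle on one medoid inside each of the two groups.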

6. Python Example.

Outputs:

Output of print("A peek into the dataset : ", data[:4]):

The above output shows that the iris dataset contains datapoints having 4 features.

Output of print("Index of the points that are in a cluster : ", clusters):

We can see that there are 3 clusters: one whose first element is datapoint 53, another whose first element is datapoint 0, and a third starting with datapoint 50.

Output of print("The target class of each datapoint : ", iris.target):

The output shows that datapoints 1 to 50 belong to class 1, datapoints 51 to 100 belong to class 2, and the rest belong to class 3. These are the actual classes each datapoint belongs to, i.e. setosa, versicolor and virginica.

Thus, the CLARANS algorithm was able to partition the datapoints of the iris dataset into the appropriate clusters with good accuracy.

7. References

  1. Swarndeep Saket J, Sharnil Pandya, “An Overview of Partitioning Algorithms in Clustering Techniques.” IJARCET, Volume 5, Issue 6, June 2016.
  2. Novikov, A., 2019. PyClustering: Data Mining Library. Journal of Open Source Software, 4(36), p. 1230.
  3. Raymond T. Ng and Jiawei Han, “CLARANS: A Method for Clustering Objects for Spatial Data Mining.” IEEE Transactions on Knowledge and Data Engineering, Vol. 14, No. 5, September/October 2002.
  4. Jiawei Han, Micheline Kamber, Jian Pei. “Data Mining: Concepts and Techniques.”
