Partitional Clustering

Divyanshu Anand
Published in Analytics Vidhya · 7 min read · Jul 4, 2020

Still wondering what clustering is all about? Let's take an example to understand the concept. Put yourself in the shoes of a shopkeeper who wishes to understand the preferences of his customers to grow his business. Is it possible to look at the details of each customer and devise a unique business strategy for each one of them? Definitely not. What you can do, however, is cluster all of your customers into, say, 5 groups based on their purchasing habits, and use a separate strategy for the customers in each of these 5 groups so that you can market your products to each group effectively and appropriately. This is what we call clustering, and it can be a powerful means of identifying unmet customer needs.

Clustering is the task of grouping a set of customers in such a way that customers in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). Cluster analysis uses mathematical models to discover groups of similar customers based on the smallest variations among customers within each group. A clustering algorithm tries to find natural groups in the data on the basis of some similarity measure. It locates the centroid of each group of data points, and to carry out effective clustering it evaluates the distance between each point and the centroid of its cluster. The goal of clustering is to determine the intrinsic grouping in a set of unlabelled data, i.e. data without any dependent variable.
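To make the idea of "distance to a centroid" concrete, here is a minimal sketch (using NumPy; the customer data below is invented for illustration): the centroid of a group is simply the mean of its points, and each point's Euclidean distance to it measures how typical that point is of the group.

```python
import numpy as np

# Toy set of customers described by two features, e.g. visits and spend.
points = np.array([[1.0, 2.0], [1.5, 1.8], [8.0, 8.0], [9.0, 9.5]])

# The centroid of a group is simply the mean of its points.
centroid = points.mean(axis=0)

# Euclidean distance of each point to the centroid.
distances = np.linalg.norm(points - centroid, axis=1)
print(centroid)   # mean of the four points
print(distances)  # one distance per point
```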

Clustering

Clustering is dividing data points into homogeneous classes or clusters:

  • Points in the same group are as similar as possible
  • Points in different groups are as dissimilar as possible

Applications of Clustering -

  • Clustering helps marketers understand their customer base and work on target areas. It groups people (according to different criteria such as willingness to buy, purchasing power, etc.) based on their similarity in many ways related to the product under consideration.
  • Clustering helps in identification of groups of houses on the basis of their value, type and geographical locations.
  • Clustering is used to study earthquakes. Based on the areas hit by an earthquake in a region, clustering can help identify the next probable location where an earthquake might occur.

What is Partitioning in Clustering?

Partitional Clustering

The most popular class of clustering algorithms is the iterative relocation algorithms. These minimize a given clustering criterion by iteratively relocating data points between clusters until a (locally) optimal partition is attained. Many algorithms fall under the partitioning method; some of the popular ones are K-Means, PAM (k-medoids), CLARA (Clustering Large Applications), etc.

Partitioning Algorithms used in Clustering -

Types of Partitional Clustering

K-Means Algorithm (a centroid-based technique): It is one of the most commonly used algorithms for partitioning a given data set into a set of k groups (i.e. k clusters), where k represents the number of groups. It classifies objects into multiple groups (i.e., clusters) such that objects within the same cluster are as similar as possible (high intra-class similarity), whereas objects from different clusters are as dissimilar as possible (low inter-class similarity). In k-means clustering, each cluster is represented by its center (i.e., centroid), which corresponds to the mean of the points assigned to the cluster. The basic idea behind k-means clustering is to define clusters so that the total intra-cluster variation (known as total within-cluster variation) is minimized.
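The total within-cluster variation that k-means minimizes can be computed directly, as in this small sketch (NumPy assumed; the points and labels are invented for illustration): for each cluster, sum the squared distances of its points to its centroid, then add the cluster totals.

```python
import numpy as np

def within_cluster_variation(points, labels, centroids):
    """Total within-cluster sum of squared distances (the k-means objective)."""
    return sum(
        np.sum((points[labels == k] - centroids[k]) ** 2)
        for k in range(len(centroids))
    )

# Two tight groups of two points each, already labelled.
points = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.1, 4.9]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([points[labels == k].mean(axis=0) for k in range(2)])

print(within_cluster_variation(points, labels, centroids))  # 0.05
```

A good clustering is one where this value is small relative to the spread of the whole data set.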

Process for K-means Algorithm

Steps involved in K-Means Clustering :

  1. The first step when using k-means clustering is to indicate the number of clusters (k) that will be generated in the final solution.
  2. The algorithm starts by randomly selecting k objects from the data set to serve as the initial centers for the clusters. The selected objects are also known as cluster means or centroids.
  3. Next, each of the remaining objects is assigned to its closest centroid, where "closest" is defined using the Euclidean distance between the object and the cluster mean. This step is called the "cluster assignment" step.
  4. After the assignment step, the algorithm computes the new mean value of each cluster; this step is called the "centroid update". Now that the centers have been recalculated, every observation is checked again to see whether it might be closer to a different cluster, and all the objects are reassigned using the updated cluster means.
  5. The cluster assignment and centroid update steps are repeated iteratively until the cluster assignments stop changing (i.e. until convergence is achieved). That is, the clusters formed in the current iteration are the same as those obtained in the previous iteration.
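The five steps above can be sketched in a few lines of NumPy (a minimal illustration, not a production implementation — real use would call something like scikit-learn's KMeans, which also handles multiple restarts and empty clusters):

```python
import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: pick k random objects as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: cluster assignment — each point goes to its nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: centroid update — recompute each cluster's mean.
        new_centroids = np.array(
            [points[labels == j].mean(axis=0) for j in range(k)]
        )
        # Step 5: stop once the centroids (and hence assignments) no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated pairs of points should end up in two clusters.
points = np.array([[1.0, 1.0], [1.1, 0.9], [8.0, 8.0], [8.2, 7.9]])
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```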
Example of K-Means Clustering
Plot of K-Means Clustering

K-Medoids Algorithm (Partitioning Around Medoid) :

  1. A medoid can be defined as the point in the cluster whose average dissimilarity to all the other points in the cluster is minimal (equivalently, whose similarity to them is maximal).
  2. In k-medoids clustering, each cluster is represented by one of the data points in the cluster. These points are called cluster medoids. The term medoid refers to an object within a cluster whose average dissimilarity to all the other members of the cluster is minimal; it corresponds to the most centrally located point in the cluster.
  3. These objects (one per cluster) can be considered representative examples of the members of that cluster, which may be useful in some situations. Recall that, in k-means clustering, the center of a given cluster is calculated as the mean value of all the data points in the cluster.
  4. K-medoids is a robust alternative to k-means clustering. The algorithm is less sensitive to noise and outliers than k-means because it uses medoids as cluster centers instead of means.

Steps involved in K-Medoids Clustering :

  1. The PAM algorithm is based on the search for k representative objects or medoids among the observations of the data set.
  2. After finding a set of k medoids, clusters are constructed by assigning each observation to the nearest medoid.
  3. Next, each selected medoid m and each non-medoid data point are swapped and the objective function is computed. The objective function corresponds to the sum of the dissimilarities of all objects to their nearest medoid.
  4. The SWAP step attempts to improve the quality of the clustering by exchanging selected objects (medoids) and non-selected objects. If the objective function can be reduced by interchanging a selected object with an unselected object, then the swap is carried out. This is continued until the objective function can no longer be decreased. The goal is to find k representative objects which minimize the sum of the dissimilarities of the observations to their closest representative object.
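The BUILD/SWAP idea above can be sketched as follows (a simplified NumPy illustration of PAM, not the full algorithm — it initializes medoids randomly rather than with the proper BUILD phase, and the toy data is invented):

```python
import numpy as np
from itertools import product

def total_cost(dist, medoids):
    # Sum of dissimilarities of every object to its nearest medoid.
    return dist[:, medoids].min(axis=1).sum()

def pam(points, k, seed=0):
    n = len(points)
    # Pairwise Euclidean dissimilarity matrix.
    dist = np.linalg.norm(points[:, None] - points[None, :], axis=2)
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(n, size=k, replace=False))
    improved = True
    while improved:
        improved = False
        # SWAP step: try exchanging each medoid with each non-medoid object
        # and keep the swap if it lowers the objective function.
        for i, h in product(range(k), range(n)):
            if h in medoids:
                continue
            candidate = medoids.copy()
            candidate[i] = h
            if total_cost(dist, candidate) < total_cost(dist, medoids):
                medoids = candidate
                improved = True
    # Assign each observation to its nearest medoid.
    labels = dist[:, medoids].argmin(axis=1)
    return labels, medoids

points = np.array([[1.0, 1.0], [1.2, 1.1], [0.9, 1.0],
                   [8.0, 8.0], [8.1, 7.9], [7.9, 8.2]])
labels, medoids = pam(points, k=2)
print(labels)   # one label per observation
print(medoids)  # indices of the chosen representative objects
```

Note that, unlike k-means centroids, each returned medoid is the index of an actual observation in the data set.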

Difference between K-Means & K-Medoids Clustering -

  1. K-means attempts to minimize the total squared error, while k-medoids minimizes the sum of dissimilarities between points labeled as belonging to a cluster and a point designated as the center of that cluster. In contrast to the k-means algorithm, k-medoids chooses actual data points as centers (medoids or exemplars).
  2. K-medoids is more robust to noise and outliers than k-means because it minimizes a sum of general pairwise dissimilarities instead of a sum of squared Euclidean distances.
  3. K-medoids has the advantage of working with arbitrary dissimilarity measures, not just Euclidean distance, and so lends itself well to analysing mixed-type data that includes both numerical and categorical features.

Conclusion -

In this post, we discussed what clustering is and how one can apply it to business decision making. We then covered partitioning-based clustering algorithms and two prominent techniques, K-means and K-medoids, as well as the differences between the two, so that one may be chosen over the other according to the business problem.
