Unsupervised Learning - 1

Clustering

Romaly Das
AlmaBetter
Sep 12, 2021 · 7 min read


Image from AlmaBetter

What is unsupervised learning?

With supervised learning, we always had a target variable. We knew what output we desired to predict and had many metrics, such as accuracy, recall, precision, and the R² score, to evaluate a supervised learning model. With unsupervised learning, we don’t really have a target variable. We simply try to cluster data points that share some similarities.

So, in short, unsupervised learning in ML is where a model looks for patterns in a dataset with no labels and with minimal human supervision. We do not tell the model what to learn, but we allow it to find patterns and draw conclusions from the dataset. Unsupervised learning tasks mostly involve grouping data into similar clusters, dimensionality reduction, and density estimation.

In this blog, we will discuss the types of clustering.

What is clustering?

Image from AlmaBetter

Clustering can be loosely defined as organizing objects into different groups or clusters, where members of the same cluster have some similarities and are dissimilar to data points that belong to other clusters. In the above plot, we can safely say that data points in cluster 0 are very different from cluster 1, but data points within cluster 0 are similar to each other.

TYPES OF CLUSTERING ALGORITHMS

Clustering can be connectivity-based, centroid-based, distribution-based, density-based, fuzzy, or constraint-based. In this blog, we will discuss K-Means clustering (centroid-based, using the E-M algorithm) and hierarchical clustering (connectivity-based).

K-Means Clustering Algorithm

Clustering algorithms seek to learn underlying patterns and properties from data in order to divide it into well-separated, labeled clusters. The simplest clustering algorithm is K-Means. This algorithm searches for a pre-determined number of clusters within an unlabeled, multidimensional dataset. Each cluster has a center, which is the arithmetic mean of all the points inside that cluster, and each point is closer to its own cluster center than to any other cluster center. K-Means is limited to linear cluster boundaries, so the algorithm will often be ineffective for clusters with complex geometries. Since it is iteration-based, K-Means can also be slow for a large number of samples.
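As a rough sketch (assuming scikit-learn and a toy dataset generated with make_blobs; the choice of n_clusters=3 and random_state is purely illustrative), fitting K-Means looks like this:

```python
# Minimal K-Means sketch with scikit-learn on synthetic blob data.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)  # toy data (assumed)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)        # cluster index for every point
centers = kmeans.cluster_centers_     # arithmetic mean of each cluster
print(labels[:10])
print(centers)
```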

So, how does the K-Means clustering algorithm assign data points to the pre-defined clusters?
It follows the Expectation-Maximization (E-M) algorithm.
These are the steps followed by the E-M algorithm (a rough code sketch of these steps follows the figure below):

  • Guess some initial cluster centers
  • Assign each point to the nearest cluster center (E-step) - This step updates our expectation of which cluster each point belongs to.
  • Set each cluster center to the mean of the points assigned to it (M-step) - This step maximizes the objective that defines the location of the cluster centers.
  • Repeat until the cluster assignments no longer change, i.e., until the algorithm has converged.
Image from AlmaBetter
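To make the E-M loop concrete, here is a rough NumPy-only sketch of these steps. The initialization strategy, the convergence test, and the function name kmeans_em are assumptions for illustration, not the exact implementation used by any particular library:

```python
import numpy as np

def kmeans_em(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Guess some clusters: pick k random points as the initial centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # E-step: assign every point to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # M-step: move every center to the mean of the points assigned to it
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Repeat until the centers (and hence the assignments) stop changing
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```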

Some red flags about the E-M algorithm:

  • The most optimal result may not be achieved - Although every E-M step produces clusters at least as good as the previous step, the algorithm is not guaranteed to reach a globally optimal solution.
  • The number of clusters (n_clusters) has to be selected beforehand.

There are two ways by which we can guess an optimal number of clusters: one is the silhouette score, the other is the elbow method.

Silhouette Analysis

Silhouette analysis is used to study the separation distance between the resulting clusters. The silhouette plot displays a measure of how close each point in one cluster is to the points in the neighboring clusters. The silhouette coefficient ranges from -1 to 1. Coefficients near +1 indicate that the clusters are well separated. A value of 0 means a point lies very close to the decision boundary between two neighboring clusters, and a value of -1 indicates that points have been assigned to the wrong clusters.

The silhouette score/coefficient is calculated as S = (b − a) / max(a, b), where a is the mean distance from a point to the other points in its own cluster and b is the mean distance from that point to the points in the nearest neighboring cluster.
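As a hedged example (assuming scikit-learn and the same kind of toy blob data as above), the silhouette score for different values of k can be computed like this:

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)  # toy data (assumed)

# Silhouette analysis needs at least 2 clusters
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))  # closer to +1 is better
```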

Elbow Method

In this method, we vary the number of clusters and, for each value of k, compute the WCSS (within-cluster sum of squares). WCSS is the sum of squared distances between each point and the centroid of its cluster. For different values of k, we get different values of WCSS. WCSS is highest at k = 1 and then gradually decreases as k increases. Since the resulting plot looks like an elbow, this is known as the elbow method.

The elbow point, where the curve bends sharply, gives us the desirable number of clusters for a particular dataset.
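A minimal sketch of the elbow method, assuming scikit-learn (whose KMeans exposes the WCSS as inertia_) and matplotlib; the range of k values tried and the toy dataset are assumptions:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)  # toy data (assumed)

ks = range(1, 11)
wcss = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_ for k in ks]

plt.plot(list(ks), wcss, marker="o")   # look for the "elbow" in this curve
plt.xlabel("number of clusters k")
plt.ylabel("WCSS (inertia)")
plt.show()
```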

Hierarchical Clustering Algorithm

With hierarchical clustering, we do not have to pre-define the number of clusters. There are two types of hierarchical clustering.

  • Agglomerative - Bottom-up approach - Each point starts in its own cluster; we then merge the closest pair of clusters and repeat this step until only one cluster is left. Since we are merging clusters at each step, this is also known as additive hierarchical clustering.
  • Divisive - Top-down approach - This is the inverse of agglomerative clustering. Instead of starting with n clusters, we start with a single cluster containing all the data points. Then, at each iteration, the cluster is split at its farthest point, and this process is repeated until each cluster contains only a single point.

How does hierarchical clustering work?

Let’s take an example of how agglomerative hierarchical clustering works.

First, let’s try to understand what a proximity matrix is. In the given figure, we have a student_id column and a marks column, with marks of 10, 7, 28, 20, and 35 for students 1 to 5.

Imagine trying to find out the differences in marks between student 1 and the rest of the students and storing these differences in a list, just for our understanding. Let’s try that out now. To find the differences, we will use the Euclidean distance.

difference_list_student_1 = sqrt((10-10)²), sqrt((10-7)²), sqrt((10-28)²), sqrt((10-20)²), sqrt((10-35)²)
=> difference_list_student_1 = 0, 3, 18, 10, 25

Similarly, we can do this for the other student_ids too, and we can form a 5×5 matrix, also known as the distance matrix or the proximity matrix.

In this final matrix, one can notice that all the diagonal elements are 0, as they should be.
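For illustration, the same 5×5 proximity matrix can be built with NumPy; the marks [10, 7, 28, 20, 35] are taken from the worked example above, and the variable names are assumptions:

```python
import numpy as np

marks = np.array([10, 7, 28, 20, 35], dtype=float)
# With a single feature, the Euclidean distance reduces to the absolute difference
proximity = np.abs(marks[:, None] - marks[None, :])
print(proximity)   # 5x5 matrix; the diagonal is all zeros, as expected
```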

Now, the first step in hierarchical clustering is to assign each point to its own individual cluster. Once we have done that, we look in the proximity matrix for the smallest distance and merge the two clusters with that smallest distance. Then we update the proximity matrix, and we keep repeating this until just a single cluster is left.
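Here is a rough sketch of this merge loop, assuming SciPy’s hierarchical-clustering utilities on the same marks data (the choice of single linkage and the plot labels are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

marks = np.array([[10.0], [7.0], [28.0], [20.0], [35.0]])   # one feature per student

# linkage() repeatedly merges the closest pair of clusters until one is left
Z = linkage(marks, method="single", metric="euclidean")

dendrogram(Z, labels=["1", "2", "3", "4", "5"])   # visualize the merge order
plt.xlabel("student_id")
plt.ylabel("merge distance")
plt.show()
```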

Linkage:

Linkage is a way to find out how individual clusters are linked to other clusters. There are 4 very popular linkages: single-linkage, complete-linkage, average-linkage, and centroid linkage. In this blog, we will focus on the first three types of linkages.

Single Linkage: The distance between two clusters is the shortest distance between any two points, one from each cluster. This approach can separate non-elliptical shapes as long as the gap between the two clusters is not too small. But if there is any noise between the clusters, this approach will not be helpful.

Complete Linkage: The distance between two clusters is the longest distance between any two points, one from each cluster. Complete linkage works well even when there is noise between the clusters, but it is more biased towards globular structures.

Average Linkage: The distance between every pair of observations, one from each cluster, is added up and divided by the number of pairs to get an average inter-cluster distance. Average linkage and complete linkage are the two most popular linkage methods in hierarchical clustering. Like complete linkage, average linkage is also biased towards globular clusters.
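As a hedged comparison, scikit-learn’s AgglomerativeClustering lets us swap between these three linkage criteria; the dataset and n_clusters=3 below are assumptions for illustration:

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)  # toy data (assumed)

for method in ("single", "complete", "average"):
    labels = AgglomerativeClustering(n_clusters=3, linkage=method).fit_predict(X)
    print(method, labels[:10])   # cluster assignments differ with the linkage used
```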

CONCLUSION

Clustering is used to find groups that have not been explicitly labeled in the data. Clustering finds uses in search engines, crime hot-spot detection, insurance fraud detection, customer segmentation, diagnostic systems, etc.

