Clustering Analysis & PCA Visualisation — A Guide on Unsupervised Learning


If you are a Machine Learning enthusiast who wants to learn about Unsupervised Learning algorithms and their applications, this blog is meant for you. We will look at how these algorithms can be used to solve real-world problems.

Introduction :

The purpose of Unsupervised Learning is to find structure in a dataset. Some of the commonly used families of unsupervised methods are Clustering, Dimensionality Reduction, and association rule learning (Apriori & Eclat).

Clustering — used for market segmentation and for analysing astronomical data.

Dimensionality Reduction — PCA and LDA are used for visualisation and feature extraction.

Apriori and Eclat — used to build recommendation engines and to mine frequent patterns.

In this blog, we will discuss how to use and implement Clustering algorithms for analysis and Dimensionality Reduction for visualization.

NOTE: Since both Clustering and Dimensionality Reduction Algorithms are based on distance, it would be better if we normalize our data first and then apply these algorithms on our dataset.
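As a quick illustration of this note, here is a minimal normalisation sketch using scikit-learn's StandardScaler. The variable names are illustrative: `raw_features` stands for whatever numeric feature matrix you start with, and the scaled matrix `X` is what the later code sketches in this post assume.

```python
# Minimal sketch: scale each numeric feature to zero mean and unit variance
# before applying distance-based algorithms such as clustering or PCA.
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(raw_features)  # raw_features: any numeric feature matrix
```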

The Task: We have a dataset of 200 customers visiting a mall, containing each customer’s age, sex, annual income, and spending score. We want to find out what different types of customers visit the mall.

Data Frame
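The original post shows the data frame as an image; below is a hypothetical loading step. The file name, column names, and the 0/1 encoding of sex are assumptions chosen to match the description above (200 customers, four attributes). The resulting `raw_features` matrix is what the normalisation note above turns into the scaled matrix `X`.

```python
# Hypothetical loading step: the file name, column names, and encoding below
# are assumptions, not taken from the original post.
import pandas as pd

df = pd.read_csv("Mall_Customers.csv")
print(df.shape)  # expect 200 rows, as described above

# Encode the categorical sex column so all four features are numeric.
df["Gender_num"] = df["Gender"].map({"Male": 0, "Female": 1})
raw_features = df[["Gender_num", "Age", "Annual Income", "Spending Score"]]
```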

Part I: Exploratory Data Analysis

A good approach is to first look for patterns in the dataset, to get more insight into the data.

Spending Score vs Sex

As can be seen, the mall should be more oriented towards female customers: more females visit the mall, and their spending scores are more spread out than the males'. In the plot, grey represents females and green represents males; the grey curve has a higher peak than the green one, signifying that females spend more.
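A rough sketch of how such a distribution plot can be produced with seaborn, assuming the hypothetical `df` DataFrame from the loading sketch and its assumed column names:

```python
# Sketch of the Spending Score distribution split by sex (column names assumed).
import matplotlib.pyplot as plt
import seaborn as sns

sns.kdeplot(data=df, x="Spending Score", hue="Gender", fill=True)
plt.title("Spending Score distribution by sex")
plt.show()
```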

Age vs Spending Score

Here, we have overlaid a scatter plot on a line plot to see how the spending score varies with age. We can infer that older customers tend to spend less.
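A sketch of this view, again assuming the hypothetical `df` and column names: a scatter of individual customers with a line tracing the mean spending score at each age.

```python
# Sketch: scatter of customers plus the mean Spending Score for each Age.
import matplotlib.pyplot as plt

mean_by_age = df.groupby("Age")["Spending Score"].mean()

plt.scatter(df["Age"], df["Spending Score"], alpha=0.5, label="customers")
plt.plot(mean_by_age.index, mean_by_age.values, color="red", label="mean score")
plt.xlabel("Age")
plt.ylabel("Spending Score")
plt.legend()
plt.show()
```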

Part II: Hierarchical Clustering & PCA Visualisation

In Clustering, we identify the number of groups and use a Euclidean or non-Euclidean distance to differentiate between the clusters.

Hierarchical Clustering :

Hierarchical Clustering is of two types: Agglomerative & Divisive

Steps:

STEP 1: Treat each data point as a single-point cluster.

STEP 2: Take the 2 closest data points & combine them into a single cluster.

STEP 3: Take the 2 closest clusters & merge them into one cluster.

STEP 4: Repeat Step 3 until only 1 cluster remains.

What exactly is the distance between 2 clusters? It could be the distance between their closest points, their furthest points, the average distance between all pairs of points, or the distance between the centroids of the clusters.

For Hierarchical Clustering, we use Dendrograms to identify the number of clusters, and then we use our findings to create the Clusters.

Dendrogram for the Agglomerative Clustering
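The dendrogram itself can be produced with scipy; a minimal sketch is shown below, assuming the scaled feature matrix `X` from the normalisation step. Ward linkage is an assumption, not something stated in the original post.

```python
# Minimal dendrogram sketch on the scaled features (Ward linkage assumed).
import matplotlib.pyplot as plt
import scipy.cluster.hierarchy as sch

linkage_matrix = sch.linkage(X, method="ward")
sch.dendrogram(linkage_matrix)
plt.title("Dendrogram for Agglomerative Clustering")
plt.xlabel("Customers")
plt.ylabel("Euclidean distance")
plt.show()
```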

How to choose the number of Clusters? We use the dendrogram: find the longest vertical line that is not crossed by any (extended) horizontal line and place the distance threshold across it. The number of vertical lines that this threshold cuts through is the number of clusters.

If you look at the green vertical line between distance values 150–250, you will see that it is not crossed by any horizontal line; placing the threshold in that range cuts the dendrogram into 5 clusters. When reading the dendrogram, it helps to work from the top down.

Code for Hierarchical Clustering

After finding that the optimal number of clusters is 5, we use the sklearn library's AgglomerativeClustering class to fit the model and predict the labels (segment type) for our dataset.
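A minimal sketch of that step, assuming the scaled matrix `X`; the Ward linkage is an assumption:

```python
# Agglomerative clustering with the 5 clusters suggested by the dendrogram.
from sklearn.cluster import AgglomerativeClustering

hc = AgglomerativeClustering(n_clusters=5, linkage="ward")
hc_labels = hc.fit_predict(X)  # one segment label per customer
```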

PCA :

Principal Component Analysis is a method of Dimensionality Reduction. It projects data from a higher-dimensional space to a lower-dimensional one while retaining as much of the original variance as possible. It works with the eigenvectors of the data's covariance matrix.

We will use PCA to convert our 4-D dataset to a 3-D one so that we can visualise it. Even though we can operate mathematically in any number of dimensions, we cannot actually see the results unless they are in 2-D or 3-D.

Code for PCA

Here, the ‘n_components’ parameter represents the number of dimensions we want to reduce our dataset to; we chose 3.
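A minimal PCA sketch, assuming the scaled feature matrix `X`:

```python
# Reduce the scaled 4-D features to 3 principal components for plotting.
from sklearn.decomposition import PCA

pca = PCA(n_components=3)
X_pca = pca.fit_transform(X)
print(pca.explained_variance_ratio_)  # fraction of variance kept per component
```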

Visualizing Clusters :

Code for 3-D Visualisation

We can see our 5 clusters, as segmented by the Hierarchical Clustering algorithm, plotted in the newly created 3-D space.
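A rough sketch of such a 3-D scatter plot, written as a small helper so the same code can be reused later for the K-Means labels. The helper name and styling are illustrative; it assumes `X_pca` from the PCA sketch and `hc_labels` from the clustering sketch.

```python
# Illustrative helper: 3-D scatter of the PCA projection, coloured by cluster label.
import matplotlib.pyplot as plt

def plot_clusters_3d(X_3d, labels, title):
    fig = plt.figure()
    ax = fig.add_subplot(projection="3d")
    ax.scatter(X_3d[:, 0], X_3d[:, 1], X_3d[:, 2], c=labels, cmap="viridis")
    ax.set_xlabel("PC 1")
    ax.set_ylabel("PC 2")
    ax.set_zlabel("PC 3")
    ax.set_title(title)
    plt.show()

plot_clusters_3d(X_pca, hc_labels, "Hierarchical clustering in PCA space")
```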

Part III: K-Means Clustering and PCA Visualisation

K-Means Clustering-

This is another way of clustering, in which we initialise ‘k’ points as cluster centroids and then assign every other point to the cluster whose centroid it is closest to. As points are added to a cluster, its centroid keeps changing, so the assignment and centroid-update steps are repeated until the clusters stabilise.

Random Initialisation — Since the starting centroids are chosen at random, k-means can end up with different solutions depending on how it was initialised. In Python (scikit-learn), we use the k-means++ initialisation technique to mitigate this issue.

How to choose the number of Clusters? Here, instead of a dendrogram, we use the Elbow Method. We run the K-means algorithm for a range of cluster counts, plot the loss for each, and choose the count after which the rate of reduction in loss drops sharply. That point looks like an ‘elbow’ on the arm-like curve.

Image Courtesy: Google
Code for Elbow Method
Visualization for Elbow Method
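A sketch of the Elbow Method on the scaled matrix `X`; scikit-learn exposes the within-cluster sum of squares (the "loss" above) as `inertia_`. The range of cluster counts and the random seed are assumptions.

```python
# Elbow method sketch: plot the K-Means inertia for k = 1..10.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

inertias = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=42)
    km.fit(X)
    inertias.append(km.inertia_)

plt.plot(range(1, 11), inertias, marker="o")
plt.xlabel("Number of clusters k")
plt.ylabel("Inertia (within-cluster sum of squares)")
plt.show()
```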

As you will notice, 5 is the optimal number of clusters in this case, as the loss decreases only slightly once we go beyond 5 clusters.

Code for K-Means

We use sklearn’s KMeans class to define and run our algorithm. Notice that we used the ‘k-means++’ technique for the init parameter; n_clusters, as before, represents the number of clusters.
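A minimal sketch of that step (the `n_init` and `random_state` values are assumptions):

```python
# K-Means with 5 clusters, using the k-means++ initialisation.
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=5, init="k-means++", n_init=10, random_state=42)
km_labels = kmeans.fit_predict(X)  # cluster label for each customer
```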

Visualization :

We will use the same dataset that we obtained after reducing the dimensions with PCA.

Code for 3-D visualization of the K-means Algorithm
Visualization of Clusters from K-Means Algorithm
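Reusing the hypothetical `plot_clusters_3d` helper sketched earlier, the same PCA projection can be coloured by the K-Means labels:

```python
# Same 3-D view as before, now coloured by the K-Means cluster labels.
plot_clusters_3d(X_pca, km_labels, "K-Means clustering in PCA space")
```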

Conclusion :

Both Hierarchical Clustering and K-Means found the optimal number of clusters to be 5, which cross-verifies the number of segments we created.

Also, even though both algorithms produced 5 clusters, the cluster assignments were not identical; in some cases, the same examples ended up with different labels.

The dataset and the complete Python solution can be found here.
