UNSUPERVISED LEARNING IN PYTHON: Hierarchical clustering / t-SNE

Hierarchical clustering (also called hierarchical cluster analysis or HCA) is a method of cluster analysis that seeks to build a hierarchy of clusters. Strategies for hierarchical clustering generally fall into two types:

Shawn · 4 min read · Aug 31, 2022
  • Agglomerative: This is a “bottom-up” approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
  • Divisive: This is a “top-down” approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.

How it works:

Step 1: First, we assign every point to its own cluster.

Step 2: Next, we find the smallest distance in the proximity matrix and merge the two points (or clusters) with that distance. We then update the proximity matrix.

For example, if the smallest distance in the matrix is 3, between points 1 and 2, we merge points 1 and 2 into a single cluster.

Step 3: Repeat the merge-and-update process until only one cluster remains. Cutting the resulting dendrogram at a chosen distance yields a flat clustering; in the example, cutting at distance 15 leaves 3 clusters.
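The steps above can be sketched in a few lines. This is a minimal, naive single-linkage version on made-up 1-D points (the values are for illustration only, not from the figures referenced above):

```python
import numpy as np

# Toy 1-D points; hypothetical values chosen for illustration
points = np.array([1.0, 4.0, 9.0, 16.0])

# Step 1: every point starts as its own cluster
clusters = [[i] for i in range(len(points))]

def cluster_distance(a, b):
    # Single linkage: cluster distance = closest pair of members
    return min(abs(points[i] - points[j]) for i in a for j in b)

merge_distances = []
# Step 2/3: repeatedly merge the closest pair ("update the proximity matrix")
while len(clusters) > 1:
    pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
    i, j = min(pairs, key=lambda p: cluster_distance(clusters[p[0]], clusters[p[1]]))
    merge_distances.append(cluster_distance(clusters[i], clusters[j]))
    merged = clusters[i] + clusters[j]
    clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]

print(merge_distances)  # → [3.0, 5.0, 7.0]
```

The sequence of merge distances is exactly what `linkage()` records, and what a dendrogram plots on its vertical axis.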
The linkage method is explained in detail at https://dataaspirant.com/hierarchical-clustering-algorithm/. Use fcluster() to extract flat clusters at a chosen distance.

Case Study:

Use the linkage() function to obtain a hierarchical clustering of the grain samples, and use dendrogram() to visualize the result. A sample of the grain measurements is provided in the array samples, while the variety of each grain sample is given by the list varieties.
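A sketch of this exercise. In the course, `samples` and `varieties` are provided; here they are replaced with made-up stand-in data so the snippet runs on its own:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless-safe backend; omit in a notebook
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Stand-ins for the course data: grain measurements and variety names
rng = np.random.default_rng(0)
samples = rng.normal(size=(9, 4))
varieties = ['Kama wheat'] * 3 + ['Rosa wheat'] * 3 + ['Canadian wheat'] * 3

# Complete linkage: cluster distance = farthest pair of points
mergings = linkage(samples, method='complete')

dendrogram(mergings, labels=varieties, leaf_rotation=90, leaf_font_size=6)
plt.show()
```

`linkage()` returns one row per merge, so for 9 samples `mergings` has 8 rows; the dendrogram draws those merges bottom-up.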

Case 2:

Use the fcluster() function to extract the cluster labels for this intermediate clustering, and compare the labels with the grain varieties using a cross-tabulation.

The hierarchical clustering has already been performed and mergings is the result of the linkage() function. The list varieties gives the variety of each grain sample.
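A sketch of this case, again with made-up stand-ins for `samples` and `varieties` (the cut height of 6 is illustrative, chosen to separate the three toy groups):

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster

# Stand-ins for the course data: three well-separated groups of measurements
rng = np.random.default_rng(1)
samples = np.vstack([rng.normal(loc=c, scale=0.1, size=(10, 4))
                     for c in (0.0, 5.0, 10.0)])
varieties = ['Kama wheat'] * 10 + ['Rosa wheat'] * 10 + ['Canadian wheat'] * 10

mergings = linkage(samples, method='complete')

# Cut the tree at height 6: every cluster formed below distance 6 survives
labels = fcluster(mergings, 6, criterion='distance')

# Cross-tabulate cluster labels against the known varieties
df = pd.DataFrame({'labels': labels, 'varieties': varieties})
ct = pd.crosstab(df['labels'], df['varieties'])
print(ct)
```

A good clustering shows each row of the cross-tab dominated by a single variety.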

t-SNE

t-Distributed Stochastic Neighbor Embedding

It is used for dimensionality reduction, like PCA.

PCA is a linear dimensionality reduction method; if the relationships between features are nonlinear, PCA may fail to capture them (underfitting).

t-SNE is also a dimensionality reduction method, but it uses a more complex formulation to relate the high- and low-dimensional spaces. t-SNE models pairwise similarities in the high-dimensional data with a Gaussian distribution, models similarities in the low-dimensional embedding with a Student's t-distribution, measures the mismatch between the two with the KL divergence, and minimizes that divergence with (stochastic) gradient descent to find the embedding.

  • t-SNE is not a linear dimensionality reduction method and takes much longer to run than PCA.
  • Distances between groups in a t-SNE plot may be meaningless.
  • Cluster sizes in a t-SNE plot may also be meaningless.
  • The t-SNE algorithm is stochastic: repeated runs can produce different results, whereas PCA is deterministic and gives the same result every time.
Source: https://sonraianalytics.com/what-is-tsne/
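A minimal sketch of running t-SNE with scikit-learn, on made-up data standing in for the course's grain samples (the `perplexity` and `learning_rate` values are illustrative, and `random_state` is fixed only to make the stochastic result repeatable):

```python
import numpy as np
from sklearn.manifold import TSNE

# Toy high-dimensional data: two separated groups in 10 dimensions
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(20, 10)) for c in (0.0, 5.0)])

# learning_rate and perplexity are the main knobs to tune;
# without random_state, each run can give a different embedding
model = TSNE(n_components=2, learning_rate=200, perplexity=15, random_state=42)
embedding = model.fit_transform(X)

print(embedding.shape)  # one 2-D point per sample
```

Note that t-SNE has no separate `transform()` for new data, unlike PCA; `fit_transform()` must be rerun on the full dataset.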

If you like my content, please clap and follow me, thank you :)

There’ll be more articles and more content related to Data Science. Hope you enjoy it!

References:

  • Unsupervised Learning in Python, DataCamp, Benjamin Wilson
  • https://mortis.tech/2019/11/program_note/664/
