UNSUPERVISED LEARNING IN PYTHON: Hierarchical clustering / t-SNE
Hierarchical clustering (also called hierarchical cluster analysis or HCA) is a method of cluster analysis which seeks to build a hierarchy of clusters. Strategies for hierarchical clustering generally fall into two types:
- Agglomerative: This is a “bottom-up” approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
- Divisive: This is a “top-down” approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.
How it works:
Step 1: First, assign every point to its own cluster.
Step 2: Look at the smallest distance in the proximity matrix, merge the two closest points, and update the proximity matrix. In the course's example, the smallest distance is 3, so points 1 and 2 are merged first.
Step 3: Repeat Step 2 until only a single cluster remains.
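Below is a minimal sketch of this process using SciPy's linkage(); the five 1-D points are hypothetical, chosen so the closest pair merges first:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Five hypothetical 1-D points; each starts in its own cluster (Step 1)
points = np.array([[2.0], [5.0], [9.0], [15.0], [16.0]])

# linkage() repeatedly merges the two closest clusters (Steps 2-3)
mergings = linkage(points, method='single')

# Each row: [cluster_i, cluster_j, merge_distance, n_points_in_new_cluster]
print(mergings)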
Case 1:
Use the linkage() function to obtain a hierarchical clustering of the grain samples, and use dendrogram() to visualize the result. A sample of the grain measurements is provided in the array samples, while the variety of each grain sample is given by the list varieties.
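A minimal sketch of this exercise follows; since the DataCamp arrays aren't reproduced here, samples and varieties are filled in with hypothetical stand-ins, and method='complete' is one common linkage choice:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Hypothetical stand-ins for the exercise's `samples` and `varieties`
rng = np.random.default_rng(0)
samples = rng.random((12, 7))                    # 12 grain samples, 7 measurements each
varieties = ['variety %d' % (i % 3) for i in range(12)]

# Hierarchical clustering of the samples
mergings = linkage(samples, method='complete')

# Visualize the merge hierarchy as a dendrogram, labelled by variety
dendrogram(mergings, labels=varieties, leaf_rotation=90, leaf_font_size=6)
plt.show()
```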
Case 2:
Use the fcluster() function to extract the cluster labels for this intermediate clustering, and compare the labels with the grain varieties using a cross-tabulation. The hierarchical clustering has already been performed, and mergings is the result of the linkage() function. The list varieties gives the variety of each grain sample.
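Continuing the sketch above, fcluster() cuts the tree at a chosen height (6 here is a hypothetical threshold, not from the source) and pandas' crosstab() compares the resulting labels with the varieties:

```python
import pandas as pd
from scipy.cluster.hierarchy import fcluster

# Cut the dendrogram at height 6 to extract flat cluster labels
labels = fcluster(mergings, 6, criterion='distance')

# Cross-tabulate cluster labels against the grain varieties
df = pd.DataFrame({'labels': labels, 'varieties': varieties})
ct = pd.crosstab(df['labels'], df['varieties'])
print(ct)
```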
t-SNE
t-Distributed Stochastic Neighbor Embedding
It is used for dimensionality reduction, like PCA.
PCA is a linear dimensionality reduction method; if the relationships between features are nonlinear, PCA may underfit and fail to capture them.
t-SNE is also a dimensionality reduction method, but it uses a more complex formulation to relate the high- and low-dimensional spaces. t-SNE models pairwise similarities in the high-dimensional data with a Gaussian probability distribution, models the low-dimensional data with a Student's t-distribution, measures the mismatch between the two with the KL divergence, and finally minimizes that divergence with gradient descent (or stochastic gradient descent).
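A minimal scikit-learn sketch on hypothetical data (the learning_rate value and random_state are illustrative choices, not from the source):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Hypothetical 50-dimensional data
X = np.random.rand(100, 50)

# learning_rate often needs tuning (try roughly 50-200); random_state
# pins the otherwise stochastic result so reruns match
model = TSNE(n_components=2, learning_rate=100, random_state=42)
embedding = model.fit_transform(X)

# Plot the 2-D embedding
plt.scatter(embedding[:, 0], embedding[:, 1])
plt.show()
```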
- t-SNE is not a linear dimensionality reduction method and takes much longer to run than PCA.
- Distances between clusters in a t-SNE plot may be meaningless.
- Cluster sizes in a t-SNE plot do not reflect the sizes of the original clusters.
- The t-SNE algorithm is stochastic: repeated runs can produce different results, while PCA is deterministic and gives the same result every time.
If you like my content, please clap for me and follow me, thank you :)
There'll be more articles and more content related to Data Science. Hope you enjoy it!
Reference:
Benjamin Wilson, "Unsupervised Learning in Python", DataCamp.