A Deep Dive into Agglomerative Clustering!

Introduction: This article covers every aspect of agglomerative clustering, together with a hands-on project.

Harshit Dawar
Analytics Vidhya
5 min read · May 30, 2020


Agglomerative clustering (Not full dendrogram) [Image by Author]

A request to everyone: please go through my articles on Clustering & Hierarchical Clustering before reading this one. They will be a great help, as they cover the basics of clustering and its type “Hierarchical Clustering”!

Introduction to Agglomerative Clustering!

  • It is the bottom-up approach to Hierarchical clustering.
  • It follows a very simple clustering pattern: it starts by merging the two points closest to each other in terms of distance, & this approach continues recursively until the whole data is clustered.
  • It starts with “n” clusters, one for each data point.
  • It creates a “proximity matrix”, also known as a “distance matrix” or “similarity matrix”. Each value in that matrix corresponds to the distance between a pair of points. If there are, say, 10 points, the matrix will be of dimension 10 x 10, holding the distance from each point to every other point (see the sketch after this list).
  • After two clusters are merged, the proximity matrix is updated.
  • The above step is repeated until all points are merged into a single cluster.
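
As a quick illustration of the proximity matrix, here is a minimal sketch using SciPy's distance_matrix; the 10 random 2-D points are assumed just for demonstration:

```python
import numpy as np
from scipy.spatial import distance_matrix

points = np.random.rand(10, 2)               # 10 random 2-D points (illustrative)
proximity = distance_matrix(points, points)  # pairwise Euclidean distances
print(proximity.shape)                       # (10, 10): each point vs. every other point
```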

Ways to find the distance between the clusters!

There are four common ways to do this (a code sketch follows the list):

  1. Single-Linkage Clustering: In this approach, the distance between the closest pair of points in the 2 clusters is taken into consideration.
  2. Complete-Linkage Clustering: In this approach, the distance between the farthest pair of points in the 2 clusters is taken into consideration.
  3. Average-Linkage Clustering: In this approach, the average of the pairwise distances between all points of the 2 clusters is taken into consideration.
  4. Centroid-Linkage Clustering: In this approach, the distance between the centroids of the 2 clusters is taken into consideration.
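
Here is a minimal sketch of how these four strategies map onto SciPy's hierarchy.linkage; the random data is assumed purely for illustration:

```python
import numpy as np
from scipy.cluster import hierarchy

data = np.random.rand(20, 2)                    # 20 random 2-D points (illustrative)
for method in ['single', 'complete', 'average', 'centroid']:
    Z = hierarchy.linkage(data, method=method)  # merge history for this linkage strategy
    print(method, Z.shape)                      # each Z records n-1 = 19 merge steps
```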

Advantages of Hierarchical Clustering!

  • It doesn’t require the number of clusters to be specified in advance.
  • Very easy to implement.
  • Produces a Dendrogram, which helps in understanding the data.

Disadvantages of Hierarchical Clustering!

  • It can never undo a previous step of the algorithm. For example, if the algorithm forms a cluster initially & realizes afterward that it was not optimal, that step cannot be undone.
  • Generally takes a longer time to run as compared to other clustering algorithms like KMeans.
  • Sometimes, it is difficult to identify the correct number of clusters from the dendrogram.

Difference between Hierarchical Clustering & KMeans Clustering!

Comparison of Hierarchical Clustering & KMeans [Image by Author]

Project: Agglomerative Clustering on Randomly Generated Data!

Project Starts!

In this project, agglomerative clustering is illustrated on data generated from scratch!
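
The embedded snippet for this step is not reproduced here, so below is a minimal sketch of what it describes, assuming scikit-learn's make_blobs; the exact values of n_samples, centers, & cluster_std are illustrative assumptions, not the original ones.

```python
import numpy as np
import pandas as pd
from scipy.spatial import distance_matrix
from scipy.cluster import hierarchy
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Generate random data: 4 blobs around the given centers (values assumed).
X, y = make_blobs(n_samples=500,
                  centers=[[4, 4], [-2, -1], [1, 1], [10, 4]],
                  cluster_std=0.9)

# Scatter plot of the raw, unclustered points.
plt.style.use('ggplot')
plt.figure(figsize=(10, 6))
plt.scatter(X[:, 0], X[:, 1], marker='.')
plt.xticks([])
plt.yticks([])
plt.show()
```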

In the above code, the important libraries are imported for the project. The purpose of each library is as follows:

  • NumPy: It is used for handling “numpy.ndarray” objects. It is very powerful and fast.
  • Pandas: It is used for handling data frames in data science.
  • SciPy: It stands for Scientific Python. In the above code, distance_matrix & hierarchy are imported from the scipy library, in order to calculate the proximity matrix (explained above) & to illustrate the dendrogram (explained above).
  • Matplotlib: It is used for illustrating various plots in python.
  • Sklearn: It is a machine learning library & is used for almost every task of machine learning.
  • In the code above, “make_blobs” is used to generate the random data: “n_samples” represents the number of samples to generate, “centers” takes a list as input, containing a center for each cluster that “make_blobs” will generate, & “cluster_std” represents the standard deviation of the points within each cluster.

In “X” & “y”, make_blobs returns the randomly generated points & their corresponding cluster labels.

  • “plt.style.use” sets the style to be used for plotting the randomly generated points.
  • “figsize” sets the figure size for the plot.
  • “plt.scatter” plots the first & second columns of the randomly generated data as a scatter plot.
  • “plt.xticks([])” & “plt.yticks([])” disable the x & y tick labels for the scatter plot.
  • “plt.show()” is used to display the plot.
Scatter plot for the above random generated data. [Image by Author]
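
The snippet for this step is not shown either; a minimal sketch of the model initialization, assuming scikit-learn's AgglomerativeClustering API and continuing from the data generated above, might look like this:

```python
# Initialize the model: 4 clusters, Euclidean distance, average linkage.
# ("affinity" was the parameter name at the time of writing; newer
# scikit-learn versions rename it to "metric".)
agglom = AgglomerativeClustering(n_clusters=4, affinity='euclidean', linkage='average')
agglom.fit(X)
```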

In the above code, the Agglomerative clustering model is initialized with a few parameters: “n_clusters”, which represents the number of clusters (hierarchical clustering works even without it; scikit-learn defaults to 2 clusters if it is not specified), “affinity”, which is set to “euclidean”, i.e. Euclidean distance will be used to calculate the distance between points, and finally “linkage”, which is set to “average”, representing “Average-Linkage Clustering” (explained above).
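A minimal sketch of the cluster plot described next, continuing from the fitted model above:

```python
# Color each point by the cluster label assigned by the model.
plt.figure(figsize=(10, 6))
plt.scatter(X[:, 0], X[:, 1], c=agglom.labels_, cmap='viridis', marker='.')
plt.xticks([])
plt.yticks([])
plt.show()
```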

The above code plots the data points according to the cluster assigned to them by agglomerative clustering.

Output of Agglomerative clusters [Image by Author]
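
A minimal sketch of the annotated plot described next; the viridis colormap & font size are illustrative assumptions:

```python
# Draw each point as the number of the cluster it belongs to.
plt.figure(figsize=(10, 6))
for i in range(X.shape[0]):
    plt.text(X[i, 0], X[i, 1], str(agglom.labels_[i]),
             color=plt.cm.viridis(agglom.labels_[i] / 3.0),  # labels 0..3 for 4 clusters
             fontsize=9)
# plt.text does not autoscale the axes, so set the limits explicitly.
plt.xlim(X[:, 0].min() - 1, X[:, 0].max() + 1)
plt.ylim(X[:, 1].min() - 1, X[:, 1].max() + 1)
plt.xticks([])
plt.yticks([])
plt.show()
```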

The above code plots the data on a scatter plot, & also annotates each point with the number of the cluster it belongs to.

Cluster number assigned to each data point. [Image by author]
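
A minimal sketch of the distance-matrix & dendrogram step, assuming SciPy's hierarchy module; squareform is used because hierarchy.linkage expects a condensed distance matrix:

```python
from scipy.spatial.distance import squareform

# Full n x n proximity matrix: distance between every pair of points in X.
dist_matrix = distance_matrix(X, X)

# Build the hierarchy with average linkage and draw the dendrogram.
Z = hierarchy.linkage(squareform(dist_matrix), method='average')
plt.figure(figsize=(12, 6))
hierarchy.dendrogram(Z)
plt.show()
```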

The above code, first of all, creates a distance matrix between “X” and “X”, so that the distance between every pair of points in “X” is calculated; then the hierarchy is built from that data with average-linkage clustering. Finally, the dendrogram is plotted.

Output Dendrogram [Image by Author]

For more detailed information on this project, check out this project on Github by clicking on the link given below.

If you are interested in the Agglomerative project implemented on a real dataset, then check out that also by clicking on the link given below.

I hope my article explains everything related to agglomerative hierarchical clustering, along with the explanation of the project. Thank you so much for investing your time in reading my article and boosting your knowledge!

