Hierarchical clustering — Boss and Employee
Bang bang, I'm back with another clustering algorithm. This algorithm is pretty fun to work out on paper but just as tricky to conceptualise in code. So let's understand what the algorithm is and then implement it in code!
So what is this algorithm all about?
A dendrogram. That's it. Mostly.
But yeah, this algorithm is an unsupervised learning technique for making groups or clusters of similar objects in our dataset — said every clustering algorithm ever. The difference here is that it creates a hierarchy of clusters by merging or splitting them based on similarity measures, using either a bottom-up or a top-down approach to construct the hierarchical clustering schema.
Now grouping these hierarchical clusters together is where the dendrogram comes into play. It brings together similar clusters iteratively, starting with each data point as its own cluster. This process ends with a tree-like structure that shows the relationships between clusters and their hierarchy.
A dendrogram from hierarchical clustering shows how data points group together at different levels. It's like a family tree for your data, revealing patterns and unusual points. This makes it a handy tool for exploring and understanding your data better.
There are mainly two types of hierarchical clustering:
- Agglomerative hierarchical clustering — a type of hierarchical clustering where each data point starts in its own cluster, and pairs of clusters are merged together step by step until all data points belong to a single cluster. This method helps in identifying the structure and natural groupings in the data.
- Divisive hierarchical clustering — a type of hierarchical clustering where all data points start in a single cluster. This cluster is then split step by step into smaller clusters until each data point is in its own cluster. This method helps to identify natural groupings and structures within the data by progressively dividing it.
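Before building the bottom-up version by hand, it helps to see the whole pipeline once. The sketch below uses SciPy's `scipy.cluster.hierarchy` module (assuming SciPy is installed); `linkage` records the agglomerative merge history that a dendrogram would draw, and `fcluster` cuts the tree into flat clusters:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Tiny 2-D dataset: two obvious blobs
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [5.2, 4.9]])

# Bottom-up (agglomerative) merge history; 'single' = nearest-member linkage
Z = linkage(X, method="single")

# Z has one row per merge: (cluster a, cluster b, distance, new cluster size),
# so clustering 6 points yields 5 merges
print(Z.shape)  # (5, 4)

# Cut the tree into 2 flat clusters
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

Passing `Z` to `scipy.cluster.hierarchy.dendrogram` would draw the family-tree picture described above.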
Another thing to learn here is the Silhouette Score.
Silhouette score is a method to determine the quality of a clustering solution. It helps to measure how similar an object is to its own cluster compared to other clusters. The silhouette value for each sample is composed of two scores:
1. a(i): The average distance between the sample i and all other points in the same cluster.
2. b(i): The average distance between the sample i and all points in the nearest cluster that the sample i is not a part of.
Silhouette Score Calculation:
s(i) = (b(i) - a(i)) / max(a(i), b(i))
The score ranges from -1 (poorly clustered) to 1 (well clustered).
Steps to Calculate Silhouette Scores
- Compute Pairwise Distances: Calculate the pairwise distances between all points in the dataset.
- Cluster Assignment: Assign each point to a cluster using your clustering algorithm of choice (e.g., K-means, DBSCAN).
- Calculate a(i): For each point, compute the average distance to all other points in the same cluster.
- Calculate b(i): For each point, compute the average distance to all points in the nearest different cluster.
- Compute Silhouette Score: Use the formula above to compute the silhouette score for each point.
- Average Silhouette Score: The overall silhouette score for the clustering solution is the mean of all individual silhouette scores.
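The steps above can be sketched in plain Python. This is a minimal, unoptimised sketch that assumes Euclidean distance and that every cluster contains at least two points:

```python
import math

def silhouette_scores(points, labels):
    # Per-point silhouette: s(i) = (b(i) - a(i)) / max(a(i), b(i))
    def dist(p, q):
        return math.dist(p, q)  # Euclidean distance (Python 3.8+)

    scores = []
    for i, p in enumerate(points):
        # a(i): mean distance to the other points in the same cluster
        same = [q for j, q in enumerate(points) if labels[j] == labels[i] and j != i]
        a = sum(dist(p, q) for q in same) / len(same)
        # b(i): mean distance to the points of the nearest *other* cluster
        b = min(
            sum(dist(p, q) for j, q in enumerate(points) if labels[j] == other)
            / labels.count(other)
            for other in set(labels) if other != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return scores

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
labels = [0, 0, 1, 1]
scores = silhouette_scores(points, labels)
print(sum(scores) / len(scores))  # close to 1: tight, well-separated clusters
```

In practice you would reach for `sklearn.metrics.silhouette_score`, which does the same computation efficiently.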
Now that you've understood the theory, let's mama mia the code!
This code implements Agglomerative Hierarchical Clustering, a bottom-up approach to clustering where individual data points are successively merged into clusters until a certain number of clusters (K) is reached.
1. Initialization (`__init__` method):

```python
def __init__(self, data, K, M):
    self.data = data
    self.N = len(data)
    self.K = K
    self.measure = get_distance_measure(M)
    self.clusters = self.init_clusters()
```

- `data`: the dataset to be clustered.
- `K`: the desired number of final clusters.
- `M`: the distance measure (like Euclidean or Manhattan) used to compute the similarity between clusters.
- `self.clusters`: a dictionary that holds the clusters. Initially, each data point starts as its own cluster.
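The helper `get_distance_measure` isn't shown in this excerpt, so what it returns is an assumption. One plausible measure it could hand back is single-linkage Euclidean distance, where the distance between two clusters is the distance between their two closest members:

```python
import math

def single_linkage(cluster_a, cluster_b):
    # Cluster-to-cluster distance = distance between their two closest members.
    # math.dist computes Euclidean distance (Python 3.8+).
    return min(math.dist(p, q) for p in cluster_a for q in cluster_b)

a = [(0, 0), (0, 1)]
b = [(0, 3), (4, 0)]
print(single_linkage(a, b))  # 2.0, from (0, 1) to (0, 3)
```

Swapping in complete linkage (`max` instead of `min`) or average linkage changes how chain-like the resulting clusters get.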
2. Initialize Clusters (`init_clusters` method):

```python
def init_clusters(self):
    return {data_id: [data_point] for data_id, data_point in enumerate(self.data)}
```

The `init_clusters` method creates a dictionary where each key is an index (`data_id`) and each value is a list containing a single data point. At the start, every data point forms its own cluster. For example:

```python
{0: [point_0], 1: [point_1], 2: [point_2], ...}
```
3. Find Closest Clusters (`find_closest_clusters` method):

```python
def find_closest_clusters(self):
    min_dist = math.inf  # relies on `import math` at the top of the file
    closest_clusters = None
    clusters_ids = list(self.clusters.keys())
    for i, cluster_i in enumerate(clusters_ids[:-1]):
        for cluster_j in clusters_ids[i+1:]:
            dist = self.measure(self.clusters[cluster_i], self.clusters[cluster_j])
            if dist < min_dist:
                min_dist, closest_clusters = dist, (cluster_i, cluster_j)
    return closest_clusters
```

Goal: identify the two clusters that are closest to each other.
- Distance calculation: the method iterates over every possible pair of clusters and computes the distance between them using the chosen distance measure (`self.measure`).
- `min_dist`: keeps track of the smallest distance found so far.
- `closest_clusters`: stores the pair of clusters (by their IDs) that are closest.
For example, if clusters 0 and 3 are closest, it returns `(0, 3)`.
4. Merge Clusters (`merge_and_form_new_clusters` method):

```python
def merge_and_form_new_clusters(self, ci_id, cj_id):
    new_clusters = {0: self.clusters[ci_id] + self.clusters[cj_id]}
    for cluster_id in self.clusters.keys():
        if (cluster_id == ci_id) | (cluster_id == cj_id):
            continue
        new_clusters[len(new_clusters.keys())] = self.clusters[cluster_id]
    return new_clusters
```

- Merging: this method merges two clusters (with IDs `ci_id` and `cj_id`) into a new one.
- The merged cluster is given the new ID `0` (it could be any index) and contains all points from both clusters. The method then builds a new set of clusters, re-indexed from scratch, excluding the two that were merged.
For example, if clusters 0 and 3 are merged, it creates a new cluster that combines points from both and leaves the other clusters unchanged.
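To see the re-indexing in action, here is the same merge logic as a standalone function (identical to the method above, minus `self`) applied to four singleton clusters:

```python
def merge_and_form_new_clusters(clusters, ci_id, cj_id):
    # Merged cluster takes ID 0; survivors are re-indexed 1, 2, ...
    new_clusters = {0: clusters[ci_id] + clusters[cj_id]}
    for cluster_id in clusters:
        if cluster_id in (ci_id, cj_id):
            continue
        new_clusters[len(new_clusters)] = clusters[cluster_id]
    return new_clusters

clusters = {0: [(0, 0)], 1: [(5, 5)], 2: [(0, 1)], 3: [(5, 6)]}
merged = merge_and_form_new_clusters(clusters, 0, 2)
print(merged)  # {0: [(0, 0), (0, 1)], 1: [(5, 5)], 2: [(5, 6)]}
```

Note that the old IDs 1 and 3 become 1 and 2 — the IDs are just dictionary keys, not stable labels.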
5. Run the Algorithm (`run_algorithm` method):

```python
def run_algorithm(self):
    while len(self.clusters.keys()) > self.K:
        closest_clusters = self.find_closest_clusters()
        self.clusters = self.merge_and_form_new_clusters(*closest_clusters)
```

This is the main loop of the algorithm. It runs as long as the number of clusters is greater than `K` (the desired final number of clusters). Step-by-step:
- Find the two closest clusters using `find_closest_clusters()`.
- Merge them using `merge_and_form_new_clusters()`.
- Repeat until the number of clusters is reduced to `K`.
6. Printing Clusters (`print` method):

```python
def print(self):
    for id, points in self.clusters.items():
        print("Cluster: {}".format(id))
        for point in points:
            print("   {}".format(point))
```

This method prints out the clusters and their points in a readable format: for each cluster, it lists the cluster ID and all the data points belonging to it.
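Putting it all together: here is a condensed, self-contained version of the class walked through above, with a single-linkage Euclidean measure passed in directly (an assumption, since `get_distance_measure` isn't part of the excerpt), run on a tiny two-blob dataset:

```python
import math

def single_linkage(cluster_a, cluster_b):
    # Cluster distance = distance between the two closest members
    return min(math.dist(p, q) for p in cluster_a for q in cluster_b)

class AgglomerativeClustering:
    def __init__(self, data, K, measure):
        self.data = data
        self.K = K
        self.measure = measure
        # Every data point starts as its own singleton cluster
        self.clusters = {i: [p] for i, p in enumerate(data)}

    def find_closest_clusters(self):
        min_dist, closest = math.inf, None
        ids = list(self.clusters)
        for i, ci in enumerate(ids[:-1]):
            for cj in ids[i + 1:]:
                d = self.measure(self.clusters[ci], self.clusters[cj])
                if d < min_dist:
                    min_dist, closest = d, (ci, cj)
        return closest

    def merge_and_form_new_clusters(self, ci, cj):
        new = {0: self.clusters[ci] + self.clusters[cj]}
        for cid, pts in self.clusters.items():
            if cid not in (ci, cj):
                new[len(new)] = pts
        return new

    def run_algorithm(self):
        # Keep merging the closest pair until only K clusters remain
        while len(self.clusters) > self.K:
            self.clusters = self.merge_and_form_new_clusters(*self.find_closest_clusters())

data = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
model = AgglomerativeClustering(data, K=2, measure=single_linkage)
model.run_algorithm()
print(len(model.clusters))  # 2
```

With `K=2`, the three points near the origin end up in one cluster and the three near (9, 9) in the other, exactly the grouping you would read off the dendrogram.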
That was it, guys, for this implementation. Hope you understood everything and will implement this algorithm on your own too!
Happy Coding! See ya next time!