Hierarchical clustering: Boss and Employee


I’m back with another clustering algorithm. This one is pretty fun to work through on paper but just as tricky to conceptualise in code. So let’s understand what the algorithm is and then implement it in code!


So what is this algorithm all about?

A dendrogram. That’s it. Mostly.

But yeah, this algorithm is an unsupervised learning technique for making groups or clusters of similar objects in our dataset (said every clustering algorithm ever). The difference here is that it creates a hierarchy of clusters by merging or splitting them based on similarity measures, using either a bottom-up or a top-down approach to construct a hierarchical clustering of the data.

Now, showing how these clusters come together is where the dendrogram comes into play. Similar clusters are merged iteratively, starting with each data point as its own cluster. This process ends up with a tree-like structure that shows the relationships between clusters and their hierarchy.

A dendrogram from hierarchical clustering shows how data points group together at different levels. It’s like a family tree for your data, revealing patterns and unusual points. This makes it a handy tool for exploring and understanding your data better.
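To see one in practice, here is a minimal sketch using SciPy’s hierarchy tools on a made-up toy dataset (the points and the Ward linkage choice are purely illustrative assumptions):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Toy 2-D dataset: two loose groups, around x=1 and x=10
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

# Build the merge hierarchy (Ward linkage is one common choice)
Z = linkage(X, method="ward")

# Plot the "family tree" of merges
dendrogram(Z, labels=[f"p{i}" for i in range(len(X))])
plt.title("Dendrogram of the toy dataset")
plt.show()
```

Reading the tree from bottom to top shows which points merge first and how the two groups only join at the very top, at a much larger distance.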

There are mainly two types of hierarchical clustering:

- Agglomerative hierarchical clustering is a type of hierarchical clustering where each data point starts in its own cluster, and pairs of clusters are merged step by step until all data points belong to a single cluster. This method helps in identifying the structure and natural groupings in the data (a quick off-the-shelf sketch follows this list).

- Divisive hierarchical clustering is a type of hierarchical clustering where all data points start in a single cluster. This cluster is then split step by step into smaller clusters until each data point is in its own cluster. This method helps to identify natural groupings and structures within the data by progressively dividing it.
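If you just want agglomerative clustering off the shelf before rolling your own, scikit-learn ships an implementation. A minimal sketch, where the data, cluster count and linkage are placeholder choices:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

# Merge bottom-up until only 2 clusters remain
model = AgglomerativeClustering(n_clusters=2, linkage="average")
labels = model.fit_predict(X)
print(labels)  # something like [1 1 1 0 0 0]
```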

Another thing to learn here is the Silhouette Score.

Silhouette score is a method to determine the quality of a clustering solution. It helps to measure how similar an object is to its own cluster compared to other clusters. The silhouette value for each sample is composed of two scores:

1. a(i): The average distance between the sample i and all other points in the same cluster.

2. b(i): The average distance between the sample i and all points in the nearest cluster that the sample i is not a part of.

Silhouette Score Calculation:

For a sample i, the silhouette score is s(i) = (b(i) - a(i)) / max(a(i), b(i)). It ranges from -1 to 1, and values closer to 1 mean the point sits comfortably inside its own cluster.

Steps to Calculate Silhouette Scores

  1. Compute Pairwise Distances: Calculate the pairwise distances between all points in the dataset.
  2. Cluster Assignment: Assign each point to a cluster using your clustering algorithm of choice (e.g., K-means, DBSCAN).
  3. Calculate a(i): For each point, compute the average distance to all other points in the same cluster.
  4. Calculate b(i): For each point, compute the average distance to all points in the nearest different cluster.
  5. Compute Silhouette Score: Use the formula above to compute the silhouette score for each point.
  6. Average Silhouette Score: The overall silhouette score for the clustering solution is the mean of all individual silhouette scores.
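As a rough sketch of those steps, here they are with scikit-learn doing the heavy lifting (the toy data and the choice of 2 clusters are assumptions for illustration):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_samples, silhouette_score

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

# Step 2: assign each point to a cluster
labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)

# Steps 1 and 3-5: pairwise distances, a(i), b(i) and s(i) per point
per_point_scores = silhouette_samples(X, labels)

# Step 6: the overall score is the mean of the per-point scores
overall_score = silhouette_score(X, labels)
print(per_point_scores, overall_score)
```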

Now that you’ve understood the theory, let’s mama mia the code!

This code implements Agglomerative Hierarchical Clustering, a bottom-up approach to clustering where individual data points are successively merged into clusters until a certain number of clusters (K) is reached.

1. Initialization (__init__ method):

def __init__(self, data, K, M):
    self.data = data
    self.N = len(data)
    self.K = K
    self.measure = get_distance_measure(M)
    self.clusters = self.init_clusters()

data: This is the dataset to be clustered.

K: The desired number of final clusters.

M: The distance measure (like Euclidean or Manhattan) to compute the similarity between clusters.

self.clusters: This dictionary will hold the clusters. Initially, each data point starts as its own cluster.
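One piece the walkthrough doesn’t show is get_distance_measure. Here is one plausible sketch of it, assuming M is a string like "euclidean" or "manhattan" and that the distance between two clusters is single linkage (the closest pair of points); the real helper may differ:

```python
import math

def euclidean(p, q):
    # Straight-line distance between two points
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    # City-block distance between two points
    return sum(abs(a - b) for a, b in zip(p, q))

def get_distance_measure(M):
    # Returns a function that measures the distance between two clusters
    # (lists of points) using single linkage: the closest pair of points.
    point_dist = manhattan if M == "manhattan" else euclidean
    def cluster_distance(cluster_a, cluster_b):
        return min(point_dist(p, q) for p in cluster_a for q in cluster_b)
    return cluster_distance
```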

2. Initialize Clusters (init_clusters method):

def init_clusters(self):
    return {data_id: [data_point] for data_id, data_point in enumerate(self.data)}

The init_clusters method initializes clusters. It creates a dictionary where each key is an index (data_id) and each value is a list with a single data point.

At the start, every data point forms its own cluster. For example:

{0: [point_0], 1: [point_1], 2: [point_2], ...}

3. Find Closest Clusters (find_closest_clusters method):

def find_closest_clusters(self):
    # needs "import math" at the top of the file for math.inf
    min_dist = math.inf
    closest_clusters = None
    clusters_ids = list(self.clusters.keys())

    for i, cluster_i in enumerate(clusters_ids[:-1]):
        for j, cluster_j in enumerate(clusters_ids[i+1:]):
            dist = self.measure(self.clusters[cluster_i], self.clusters[cluster_j])
            if dist < min_dist:
                min_dist, closest_clusters = dist, (cluster_i, cluster_j)
    return closest_clusters

Goal: Identify two clusters that are closest to each other.

Distance calculation: The method iterates over every possible pair of clusters and computes the distance between them using the chosen distance measure (self.measure).

min_dist: Keeps track of the smallest distance found.

closest_clusters: Stores the pair of clusters (by their IDs) that are closest.

For example, if clusters 0 and 3 are closest, it returns (0, 3).

4. Merge Clusters (merge_and_form_new_clusters method):

def merge_and_form_new_clusters(self, ci_id, cj_id):
    new_clusters = {0: self.clusters[ci_id] + self.clusters[cj_id]}

    for cluster_id in self.clusters.keys():
        if (cluster_id == ci_id) or (cluster_id == cj_id):
            continue
        new_clusters[len(new_clusters)] = self.clusters[cluster_id]
    return new_clusters

Merging: This method merges two clusters (with IDs ci_id and cj_id) into a new one.

The merged cluster is given the new ID 0 (could be any index) and contains all points from both clusters. Then, it creates a new set of clusters excluding the merged ones.

For example, if clusters 0 and 3 are merged, it creates a new cluster that combines points from both and leaves the other clusters unchanged.
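To make the re-indexing concrete, here is a standalone toy run of the same merging logic (the cluster contents are made up):

```python
clusters = {0: [[1, 2]], 1: [[8, 8]], 2: [[9, 8]], 3: [[1, 3]]}

# Merge clusters 0 and 3 exactly the way merge_and_form_new_clusters does
new_clusters = {0: clusters[0] + clusters[3]}
for cluster_id in clusters:
    if cluster_id in (0, 3):
        continue
    new_clusters[len(new_clusters)] = clusters[cluster_id]

print(new_clusters)
# {0: [[1, 2], [1, 3]], 1: [[8, 8]], 2: [[9, 8]]}
```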

5. Run the Algorithm (run_algorithm method):

def run_algorithm(self):
    while len(self.clusters.keys()) > self.K:
        closest_clusters = self.find_closest_clusters()
        self.clusters = self.merge_and_form_new_clusters(*closest_clusters)

This is the main loop of the algorithm. It runs as long as the number of clusters is greater than K (the desired final number of clusters).

Step-by-step process:

Find the two closest clusters using find_closest_clusters().

Merge them using merge_and_form_new_clusters().

Repeat until the number of clusters is reduced to K.

6. Printing Clusters (print method):

def print(self):
    for id, points in self.clusters.items():
        print("Cluster: {}".format(id))
        for point in points:
            print("   {}".format(point))

This method prints out the clusters and their points in a readable format.

For each cluster, it lists its ID and all the data points belonging to it.
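To tie everything together, here is a hedged end-to-end sketch. The methods are the ones walked through above, collected into a class; the class name, the get_distance_measure helper and the "euclidean" argument are my assumptions, not necessarily the author’s exact code:

```python
import math

def get_distance_measure(M):
    # Assumed helper: single-linkage distance between two clusters,
    # using Euclidean distance between points.
    def euclidean(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
    return lambda ca, cb: min(euclidean(p, q) for p in ca for q in cb)

class AgglomerativeHierarchicalClustering:
    def __init__(self, data, K, M):
        self.data = data
        self.N = len(data)
        self.K = K
        self.measure = get_distance_measure(M)
        self.clusters = self.init_clusters()

    def init_clusters(self):
        return {data_id: [data_point] for data_id, data_point in enumerate(self.data)}

    def find_closest_clusters(self):
        min_dist, closest_clusters = math.inf, None
        clusters_ids = list(self.clusters.keys())
        for i, cluster_i in enumerate(clusters_ids[:-1]):
            for cluster_j in clusters_ids[i + 1:]:
                dist = self.measure(self.clusters[cluster_i], self.clusters[cluster_j])
                if dist < min_dist:
                    min_dist, closest_clusters = dist, (cluster_i, cluster_j)
        return closest_clusters

    def merge_and_form_new_clusters(self, ci_id, cj_id):
        new_clusters = {0: self.clusters[ci_id] + self.clusters[cj_id]}
        for cluster_id in self.clusters.keys():
            if cluster_id in (ci_id, cj_id):
                continue
            new_clusters[len(new_clusters)] = self.clusters[cluster_id]
        return new_clusters

    def run_algorithm(self):
        while len(self.clusters.keys()) > self.K:
            self.clusters = self.merge_and_form_new_clusters(*self.find_closest_clusters())

    def print(self):
        for cluster_id, points in self.clusters.items():
            print("Cluster: {}".format(cluster_id))
            for point in points:
                print("   {}".format(point))

# Toy run: two obvious groups, so K=2 should separate them cleanly
data = [[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]]
model = AgglomerativeHierarchicalClustering(data, K=2, M="euclidean")
model.run_algorithm()
model.print()
```

With the toy data above this prints two clusters, one holding the points around x = 1 and the other the points around x = 10.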

That was it, guys, for this implementation. I hope you understood everything and will implement this algorithm on your own too!

Happy Coding! See ya next time!
