Practical Implementation Of K-means, Hierarchical, and DBSCAN Clustering On Dataset With Hyperparameter Optimization

Janibasha Shaik
Published in Analytics Vidhya
11 min read · Sep 17, 2020

Clustering Algorithms with Hyperparameter optimization

Table of contents :

(i) Article Agenda

(ii) Data Processing

(iii) K-means Clustering with Hyperparameter Optimization

(iv) Hierarchical Clustering

(v) DBSCAN Clustering with Hyperparameter Optimization

(vi) Conclusion

(vii) References

Image Source: Scikit learn

(i) Article Agenda :

This article focuses purely on implementing clustering algorithms on a real data set, together with hyperparameter optimization for each of them.

Prerequisites: Basic understanding of K-means, Hierarchical, and DBSCAN Clustering

Throughout this article, I use the Mall_Customers.csv data set from https://www.kaggle.com/vjchoudhary7/customer-segmentation-tutorial-in-python

Please download the data set from the above link

(ii) Data Processing :

We read a CSV file using pandas

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Reading the csv file
df = pd.read_csv('Customers.csv')
df.head()
Top 5 rows of df

The data set contains 5 features

Problem statement: we need to cluster people on the basis of their Annual Income (k$) and how much they spend (Spending Score (1-100))

So our features for clustering are Annual Income (k$) and Spending Score (1-100)

The spending score is simply a score assigned on the basis of how much a customer spends

f1: Annual Income (k$)

f2: Spending Score(1–100)

Now we need to create an array with f1 (x) and f2 (y) from the data frame df

# converting features f1 and f2 (columns 3 and 4 of df) into an array
X = df.iloc[:, [3, 4]].values

Now that the features are in array form, we can proceed to the implementation step

(iii) K-means Clustering :

K-means clustering is a centroid-based algorithm

K = no. of clusters = hyperparameter

We find the K value using the Elbow method

The K-means objective function is argmin over the centroids of sum(||x - c||²)

where x = a data point in the cluster

c = the centroid of that cluster

Objective: we need to minimize the squared distance between each data point and its cluster centroid

If we have K clusters then we have K centroids

Intracluster distance: Distances between data points in the same cluster

Intercluster distance: Distances between different clusters

Our main aim is to choose clusters with a small intracluster distance and a large intercluster distance
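To make the objective concrete, here is a minimal sketch of my own (toy numbers, not the Mall data) that computes the within-cluster sum of squared distances by hand and compares it with what scikit-learn reports as inertia_:

import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D points standing in for (Annual Income, Spending Score)
points = np.array([[15, 39], [16, 81], [17, 6], [18, 77], [86, 82], [88, 17]], dtype=float)

km = KMeans(n_clusters=2, init='k-means++', n_init=10, random_state=0).fit(points)

# Objective: for every point, squared distance to its assigned centroid, summed up
assigned_centroids = km.cluster_centers_[km.labels_]
wcss = ((points - assigned_centroids) ** 2).sum()

print(wcss, km.inertia_)  # the two values agree (up to floating-point error)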

We use K-means++ initialization (a probabilistic approach)
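Roughly, k-means++ picks the first centroid uniformly at random and then picks each further centroid with probability proportional to the squared distance from the nearest centroid chosen so far. Below is a rough sketch of that seeding step, purely for illustration; the code that follows uses scikit-learn's built-in init='k-means++':

import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """Illustrative k-means++ style seeding: pick k initial centroids."""
    rng = np.random.default_rng(seed)
    # First centroid: a uniformly random data point
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance from every point to its nearest already-chosen centroid
        diffs = X[:, None, :] - np.array(centroids)[None, :, :]
        d2 = (diffs ** 2).sum(axis=-1).min(axis=1)
        # Sample the next centroid with probability proportional to that squared distance
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)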

from sklearn.cluster import KMeans

# objective function is nothing but argmin over c of sum(||x - c||^2); c: centroid, x: point in the data set
objective_function = []
for i in range(1, 11):
    clustering = KMeans(n_clusters=i, init='k-means++')
    clustering.fit(X)
    # inertia_ is the minimized intra-cluster (within-cluster) sum of squared distances
    objective_function.append(clustering.inertia_)

# objective_function contains the minimized intra-cluster distances for K = 1 to 10
objective_function

objective_function: the within-cluster sums of squared distances for K = 1 to 10

[269981.28000000014,
183116.4295463669,
106348.37306211119,
73679.78903948837,
44448.45544793369,
37233.81451071002,
31599.13139461115,
25012.917069885472,
21850.16528258562,
19701.35225128174]

We tried K values from 1 to 10, but we don't yet know for sure which K is best

So to find the best K we do hyperparameter optimization

Elbow method: Hyperparameter optimization

# For finding the optimal no. of clusters we use the elbow technique
# The elbow technique is a plot between the no. of clusters and the objective_function
# We take K at the point where the objective function curve has an elbow shape

plt.plot(range(1, 11), objective_function)
plt.title('The Elbow Method')
plt.xlabel('Number of Clusters K')
plt.ylabel('objective_function')
plt.show()
Hyperparameter optimization

In the above plot the elbow joint occurs at K = 5, so we take the optimum K = 5

Now we train the model with the optimum K value

# Training the model with the optimal no. of clusters
tuned_clustering = KMeans(n_clusters=5, init='k-means++', random_state=0)
labels = tuned_clustering.fit_predict(X)

# Centroids (x and y coordinates) of all clusters
tuned_clustering.cluster_centers_[:]

labels returns the predicted cluster for each customer:

array([3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1,
3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 0,
3, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 4, 2, 0, 2, 4, 2, 4, 2,
0, 2, 4, 2, 4, 2, 4, 2, 4, 2, 0, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2,
4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2,
4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2,
4, 2])

tuned_clustering.cluster_centers_[:] returns the centroid coordinates of each cluster:

array([[55.2962963 , 49.51851852],
[25.72727273, 79.36363636],
[86.53846154, 82.12820513],
[26.30434783, 20.91304348],
[88.2 , 17.11428571]])

Visualizing the clusters

# Visualizing the clusters
plt.scatter(X[labels == 0, 0], X[labels == 0, 1], c='green', label='cluster1')
plt.scatter(X[labels == 1, 0], X[labels == 1, 1], c='yellow', label='cluster2')
plt.scatter(X[labels == 2, 0], X[labels == 2, 1], c='red', label='cluster3')
plt.scatter(X[labels == 3, 0], X[labels == 3, 1], c='orange', label='cluster4')
plt.scatter(X[labels == 4, 0], X[labels == 4, 1], c='blue', label='cluster5')
plt.scatter(tuned_clustering.cluster_centers_[:, 0], tuned_clustering.cluster_centers_[:, 1], s=300, c='black', label='centroid')
plt.title('Clusters of customers')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.legend()
plt.show()
Clustering people based on Annual Income and Spending Score

Evaluation: how good is our clustering?

To check how good our clustering is, we use the Silhouette coefficient

Silhouette coefficient

In the above diagram, we have two clusters C1 and C2

a = intracluster distance (the mean distance from a point to the other points in its own cluster)

b = intercluster distance (the mean distance from that point to the points in the nearest other cluster)

Silhouette coefficient = (b - a) / max(a, b)

If our clustering is good, the intracluster distance is small compared to the intercluster distance, so the Silhouette coefficient is positive

If our clustering is bad, the intracluster distance is larger than the intercluster distance, so the Silhouette coefficient is negative

The Silhouette coefficient lies between -1 and 1

The closer the value is to 1, the better the clustering

Values near or below 0 indicate bad clustering
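As a tiny worked example of the formula: if a point's mean distance to its own cluster is a = 2 and its mean distance to the nearest other cluster is b = 8, then s = (8 - 2) / max(2, 8) = 0.75. The sketch below (toy data and variable names of my own) checks the hand computation against scikit-learn's per-sample scores:

import numpy as np
from sklearn.metrics import silhouette_samples

# Two tiny, well-separated clusters
X_toy = np.array([[1.0, 1.0], [1.5, 1.2], [1.2, 0.8],
                  [8.0, 8.0], [8.3, 7.9], [7.8, 8.2]])
labels_toy = np.array([0, 0, 0, 1, 1, 1])

p = X_toy[0]
same = X_toy[1:3]   # other members of point 0's cluster
other = X_toy[3:]   # members of the nearest other cluster

a = np.linalg.norm(same - p, axis=1).mean()   # mean intracluster distance for point 0
b = np.linalg.norm(other - p, axis=1).mean()  # mean distance to the nearest other cluster
s_manual = (b - a) / max(a, b)

print(round(s_manual, 4), round(silhouette_samples(X_toy, labels_toy)[0], 4))  # the two should match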

from sklearn import metrics
metrics.silhouette_score(X, tuned_clustering.labels_, metric='euclidean')

We get a Silhouette coefficient of 0.553931997444648

It is moving towards 1, so our clustering is good

If you want to visualize the Silhouette coefficient for different numbers of clusters:

# Visualizing the Silhouette coefficient for different numbers of clusters
import matplotlib.cm as cm
from sklearn.metrics import silhouette_score, silhouette_samples

for n_clusters in range(2, 10):
    # Create a subplot with 1 row and 2 columns
    fig, (ax1, ax2) = plt.subplots(1, 2)
    fig.set_size_inches(18, 7)

    # The 1st subplot is the silhouette plot
    # The silhouette coefficient can range from -1 to 1, but in this example all
    # values lie within [-0.1, 1]
    ax1.set_xlim([-0.1, 1])
    # The (n_clusters+1)*10 is for inserting blank space between silhouette
    # plots of individual clusters, to demarcate them clearly.
    ax1.set_ylim([0, len(X) + (n_clusters + 1) * 10])

    # Initialize the clusterer with the n_clusters value and a random generator
    # seed of 10 for reproducibility.
    clusterer = KMeans(n_clusters=n_clusters, random_state=10)
    cluster_labels = clusterer.fit_predict(X)

    # The silhouette_score gives the average value for all the samples.
    # This gives a perspective into the density and separation of the formed clusters
    silhouette_avg = silhouette_score(X, cluster_labels)
    print("For n_clusters =", n_clusters,
          "The average silhouette_score is :", silhouette_avg)

    # Compute the silhouette scores for each sample
    sample_silhouette_values = silhouette_samples(X, cluster_labels)

    y_lower = 10
    for i in range(n_clusters):
        # Aggregate the silhouette scores for samples belonging to
        # cluster i, and sort them
        ith_cluster_silhouette_values = sample_silhouette_values[cluster_labels == i]
        ith_cluster_silhouette_values.sort()

        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i

        color = cm.nipy_spectral(float(i) / n_clusters)
        ax1.fill_betweenx(np.arange(y_lower, y_upper),
                          0, ith_cluster_silhouette_values,
                          facecolor=color, edgecolor=color, alpha=0.7)

        # Label the silhouette plots with their cluster numbers at the middle
        ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))

        # Compute the new y_lower for the next plot
        y_lower = y_upper + 10  # 10 for the 0 samples

    ax1.set_title("The silhouette plot for the various clusters.")
    ax1.set_xlabel("The silhouette coefficient values")
    ax1.set_ylabel("Cluster label")

    # The vertical line for the average silhouette score of all the values
    ax1.axvline(x=silhouette_avg, color="red", linestyle="--")
    ax1.set_yticks([])  # Clear the y-axis labels / ticks
    ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])

    # 2nd plot showing the actual clusters formed
    colors = cm.nipy_spectral(cluster_labels.astype(float) / n_clusters)
    ax2.scatter(X[:, 0], X[:, 1], marker='.', s=30, lw=0, alpha=0.7,
                c=colors, edgecolor='k')

    # Labeling the clusters
    centers = clusterer.cluster_centers_
    # Draw white circles at cluster centers
    ax2.scatter(centers[:, 0], centers[:, 1], marker='o',
                c="white", alpha=1, s=200, edgecolor='k')
    for i, c in enumerate(centers):
        ax2.scatter(c[0], c[1], marker='$%d$' % i, alpha=1,
                    s=50, edgecolor='k')

    ax2.set_title("The visualization of the clustered data.")
    ax2.set_xlabel("Feature space for the 1st feature")
    ax2.set_ylabel("Feature space for the 2nd feature")

    plt.suptitle(("Silhouette analysis for KMeans clustering on sample data "
                  "with n_clusters = %d" % n_clusters),
                 fontsize=14, fontweight='bold')

plt.show()

For n_clusters = 2 The average silhouette_score is : 0.3273163942500746
For n_clusters = 3 The average silhouette_score is : 0.46761358158775435
For n_clusters = 4 The average silhouette_score is : 0.4931963109249047
For n_clusters = 5 The average silhouette_score is : 0.553931997444648
For n_clusters = 6 The average silhouette_score is : 0.5379675585622219
For n_clusters = 7 The average silhouette_score is : 0.5270287298101395
For n_clusters = 8 The average silhouette_score is : 0.4548653400650936
For n_clusters = 9 The average silhouette_score is : 0.4595491760122954

K=5 Silhouette coefficient

If we observe the above scores, K = 5 gives the highest Silhouette coefficient

If we observe the above plot, no cluster has negative silhouette values; everything moves towards 1

So we can conclude that our clustering is good

(iv) Hierarchical Clustering :

Hierarchical clustering is a tree-based clustering method

Dendrogram: a tree-like diagram that records the sequence of merges

In Hierarchical clustering, we use Agglomerative clustering

Step 1: consider each data point as its own cluster

Step 2: merge clusters based on their similarity (distance)

If two clusters are very near to each other, group those two clusters into one cluster

Repeat until only a single cluster remains

In agglomerative clustering we don't need to give the number of clusters as a hyperparameter; we can cut the hierarchy at however many clusters we want (see the sketch below)

So the number of clusters is not a hyperparameter here
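As an illustration (my own sketch, not from the original code, with variable names of my own), the same linkage matrix used to draw the dendrogram below can be cut into any desired number of flat clusters afterwards:

import scipy.cluster.hierarchy as sch

# Build the linkage matrix once (Ward merges, as in the dendrogram below)
Z = sch.linkage(df.iloc[:, [3, 4]].values, method='ward')

# Cut the same hierarchy into 5 flat clusters, or 3, or any other number
labels_k5 = sch.fcluster(Z, t=5, criterion='maxclust')
labels_k3 = sch.fcluster(Z, t=3, criterion='maxclust')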

# Dendrogram
import scipy.cluster.hierarchy as sch

cluster_visualising = sch.dendrogram(sch.linkage(df.iloc[:, [3, 4]].values, method='ward'))
plt.title('Dendrogram')
plt.xlabel('Customers')
plt.ylabel('Euclidean distances')
plt.show()
Dendrogram

If we observe the above dendrogram, it groups the customers by repeatedly merging the pair of clusters with the minimum Euclidean distance

Number of clusters from the dendrogram: we locate the longest vertical line that is not crossed by any horizontal merge line, cut there, and count the branches

From the dendrogram we choose K = 5 clusters

For cluster similarity (linkage) we use the Ward method

The Ward method is less susceptible to noise and outliers

# AgglomerativeClustering model initialization
from sklearn.cluster import AgglomerativeClustering

clustering_model = AgglomerativeClustering(n_clusters=5, affinity='euclidean', linkage='ward')
clustering_model.fit(df.iloc[:, [3, 4]].values)

# Predicting clusters
clustering_prediction = clustering_model.fit_predict(df.iloc[:, [3, 4]].values)

clustering_prediction returns the predicted clusters:

array([4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3,
4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 1,
4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 2, 0, 2, 0, 2,
1, 2, 0, 2, 0, 2, 0, 2, 0, 2, 1, 2, 0, 2, 1, 2, 0, 2, 0, 2, 0, 2,
0, 2, 0, 2, 0, 2, 1, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2,
0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2,
0, 2], dtype=int64)

Visualizing clusters

plt.scatter(df.iloc[:, [3, 4]].values[clustering_prediction == 0, 0], df.iloc[:, [3, 4]].values[clustering_prediction == 0, 1], s=100, c='green', label='Cluster 1')
plt.scatter(df.iloc[:, [3, 4]].values[clustering_prediction == 1, 0], df.iloc[:, [3, 4]].values[clustering_prediction == 1, 1], s=100, c='red', label='Cluster 2')
plt.scatter(df.iloc[:, [3, 4]].values[clustering_prediction == 2, 0], df.iloc[:, [3, 4]].values[clustering_prediction == 2, 1], s=100, c='blue', label='Cluster 3')
plt.scatter(df.iloc[:, [3, 4]].values[clustering_prediction == 3, 0], df.iloc[:, [3, 4]].values[clustering_prediction == 3, 1], s=100, c='yellow', label='Cluster 4')
plt.scatter(df.iloc[:, [3, 4]].values[clustering_prediction == 4, 0], df.iloc[:, [3, 4]].values[clustering_prediction == 4, 1], s=100, c='orange', label='Cluster 5')
plt.title('Clusters of customers')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.legend()
plt.show()
Clustering of people based on their Annual Income and Spending Score

Evaluation

from sklearn import metrics
metrics.silhouette_score(df.iloc[:, [3, 4]].values, clustering_prediction, metric='euclidean')

Silhouette coefficient: 0.5529945955148897

The coefficient is moving towards 1, so the clustering is good

(v) DBSCAN Clustering :

DBSCAN is a density-based clustering algorithm

We need to know the following terms before implementing it:

Dense region, Sparse region, Core point, Border point, Noise point, Density edge, Density-connected points

Please refer to the link below to understand the above-mentioned terms

In DBSCAN, Min Points and epsilon are the hyperparameters

Step 1: for every point in the data set, label whether it is a core point, border point, or noise point (see the sketch after these steps)

Step 2: remove all noise points

Step 3: for each core point p that is not yet assigned to a cluster,

create a new cluster with that unassigned core point and add all points that are density-connected to it into this new cluster

Step 4: assign each border point to the cluster of its nearest core point
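To make Step 1 concrete, here is a small sketch of my own (using the eps and min_samples values arrived at later in this section) showing how the three point types can be read off a fitted scikit-learn DBSCAN model:

import numpy as np
from sklearn.cluster import DBSCAN

db = DBSCAN(eps=9, min_samples=4).fit(X)

core_mask = np.zeros(len(X), dtype=bool)
core_mask[db.core_sample_indices_] = True   # core points
noise_mask = db.labels_ == -1               # noise points are labelled -1
border_mask = ~core_mask & ~noise_mask      # in a cluster, but not a core point

print(core_mask.sum(), border_mask.sum(), noise_mask.sum())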

Hyperparameter optimization:

In DBSCAN Min points and epsilon are the hyperparameters

Min points :

Rule of thumb: Min Points ≥ dimensionality + 1

If the data set is noisier, we use a larger Min Points value because it removes noisy points more easily

Epsilon (radius): Elbow method

Step 1: for every data point x, we compute the distance d to its nearest neighbour

Step 2: sort the distances in increasing order

Step 3: plot the sorted distances against the point index

We choose the optimum distance (epsilon) at the point where there is a sharp rise in the graph

# We use NearestNeighbors for calculating the distance between points
from sklearn.neighbors import NearestNeighbors

# Calculating the distance from each point to its nearest neighbour
neigh = NearestNeighbors(n_neighbors=2)
distance = neigh.fit(X)
distances, indices = distance.kneighbors(X)

# Sorting the nearest-neighbour distances in increasing order
# (column 0 is the distance of each point to itself, so we keep column 1)
sorted_distances = np.sort(distances, axis=0)[:, 1]

# Plot of point index vs sorted distance (candidate epsilon)
plt.plot(sorted_distances)
plt.xlabel('Points sorted by distance')
plt.ylabel('Epsilon (distance to nearest neighbour)')
plt.show()
Epsilon optimization

If we observe the graph, there is a sharp rise at around epsilon = 9, so we choose the epsilon (radius) to be 9
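Reading the elbow off a plot is a judgment call, so it can help to check how sensitive the result is to the chosen radius. A small sketch of my own (values around the elbow; not part of the original article):

from sklearn.cluster import DBSCAN

# How do the number of clusters and noise points change around the chosen radius?
for eps in [5, 7, 9, 11]:
    labels = DBSCAN(eps=eps, min_samples=4).fit_predict(X)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = int((labels == -1).sum())
    print(f"eps={eps}: {n_clusters} clusters, {n_noise} noise points")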

Implementing DBSCAN with optimized hyperparameters

from sklearn.cluster import DBSCAN

# Initializing DBSCAN with the optimized hyperparameters
clustering_model = DBSCAN(eps=9, min_samples=4)

# Fit the model to X
clustering_model.fit(X)

# Labels predicted by DBSCAN (-1 marks noise points)
predicted_labels = clustering_model.labels_

# Visualizing the clusters
plt.scatter(X[:, 0], X[:, 1], c=predicted_labels, cmap='Paired')
plt.xlabel('Annual Income')
plt.ylabel('Spending Score')
plt.title("DBSCAN")

Evaluation :

from sklearn import metrics
metrics.silhouette_score(X, predicted_labels)

Silhouette coefficient: 0.4259680122384905

The Silhouette coefficient is positive and moving towards 1, so our clustering is reasonably good

(vi) Conclusion :

I hope you found this article helpful. Please give me your feedback; it will help me improve.

(vii) References :
