Practical Implementation of K-means, Hierarchical, and DBSCAN Clustering on a Dataset with Hyperparameter Optimization
Clustering algorithms with hyperparameter optimization
Table of contents:
(i) Article Agenda
(ii) Data Processing
(iii) K-means Clustering with Hyperparameter Optimization
(iv) Hierarchical Clustering
(v) DBSCAN Clustering with Hyperparameter Optimization
(vi) Conclusion
(vii) References
(i) Article Agenda:
This article focuses on the practical implementation of clustering algorithms on a data set, along with hyperparameter optimization.
Prerequisites: Basic understanding of K-means, Hierarchical, and DBSCAN Clustering
Throughout this article, I use the Mall_Customers.csv data set from https://www.kaggle.com/vjchoudhary7/customer-segmentation-tutorial-in-python
Please download the data set from the above link
(ii) Data Processing:
We read a CSV file using pandas
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Reading the csv file
df = pd.read_csv('Customers.csv')
df.head()
The data set contains 5 features
Problem statement: we need to cluster customers based on their Annual Income (k$) and how much they spend (Spending Score (1–100))
So our features for clustering are Annual Income (k$) and Spending Score (1–100)
Spending Score is a score assigned based on how much a customer spends
f1: Annual Income (k$)
f2: Spending Score(1–100)
Now we need to create an array with f1 (x) and f2 (y) from the data frame df
# converting features f1 and f2 into an array
X=df.iloc[:,[3,4]].values
Now that the features are in array form, we can proceed to the implementation step
(iii) K-means Clustering:
K-means clustering is a centroid-based algorithm
K = number of clusters = hyperparameter
We find the K value using the Elbow method
The K-means objective function is argmin over the centroids of Σ ||x − c||²
where x = a data point in the cluster
c = the centroid of that cluster
Objective: minimize the squared distance between each data point and its cluster centroid
If we have K-clusters then we have K-centroids
Intracluster distance: Distances between data points in the same cluster
Intercluster distance: Distances between different clusters
Our main aim is to choose clusters with small intracluster distance and large intercluster distance
We use K-means++ initialization (a probabilistic approach for placing the initial centroids)
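Before running the elbow loop below, it helps to see what the objective actually computes. Here is a minimal sketch (my addition, assuming the feature array X built above) that evaluates the within-cluster sum of squared distances by hand; it is the same quantity scikit-learn reports as inertia_:
# Sketch (not part of the original walkthrough): compute the K-means objective manually
import numpy as np
from sklearn.cluster import KMeans

km = KMeans(n_clusters=5, init='k-means++', random_state=0).fit(X)

manual_objective = 0.0
for label, centroid in enumerate(km.cluster_centers_):
    points_in_cluster = X[km.labels_ == label]
    # squared Euclidean distance of every point in the cluster to its centroid
    manual_objective += np.sum((points_in_cluster - centroid) ** 2)

print(manual_objective)   # approximately equal to km.inertia_ (up to floating-point error)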
from sklearn.cluster import KMeans

# objective function is argmin over c of sum(||x - c||^2), where c = centroid, x = point in the data set
objective_function = []
for i in range(1, 11):
    clustering = KMeans(n_clusters=i, init='k-means++')
    clustering.fit(X)
    objective_function.append(clustering.inertia_)  # inertia_ is the within-cluster sum of squared distances

# objective_function now contains the intra-cluster distances for each K
objective_function
objective_function: the minimized intracluster distances for K = 1 to 10
[269981.28000000014,
183116.4295463669,
106348.37306211119,
73679.78903948837,
44448.45544793369,
37233.81451071002,
31599.13139461115,
25012.917069885472,
21850.16528258562,
19701.35225128174]
We tried K values from 1 to 10, but we do not yet know for sure which K is best
So to find the best K we do hyperparameter optimization
Elbow method: Hyperparameter optimization
# for finding the optimal number of clusters we use the elbow technique
# the elbow technique is a plot of the number of clusters against objective_function
# we take K at the point where the curve has an elbow shape
plt.plot(range(1, 11), objective_function)
plt.title('The Elbow Method')
plt.xlabel('Number of Clusters K')
plt.ylabel('objective_function')
plt.show()
In the above plot we see an elbow at K = 5, so we take the optimum K = 5
Now we train a model with an optimum K value
# Training the model with the optimal number of clusters
tuned_clustering = KMeans(n_clusters=5, init='k-means++', random_state=0)
labels = tuned_clustering.fit_predict(X)

# Centroids (x and y coordinates) of all clusters
tuned_clustering.cluster_centers_[:]
labels returns the predicted cluster of each data point
array([3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1,
3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 0,
3, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 4, 2, 0, 2, 4, 2, 4, 2,
0, 2, 4, 2, 4, 2, 4, 2, 4, 2, 0, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2,
4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2,
4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2,
4, 2])
tuned_clustering.cluster_centers_[:] returns the centroid coordinates of each cluster
array([[55.2962963 , 49.51851852],
[25.72727273, 79.36363636],
[86.53846154, 82.12820513],
[26.30434783, 20.91304348],
[88.2 , 17.11428571]])
Visualizing the clusters
# visualizing the clusters
plt.scatter(X[labels == 0, 0], X[labels == 0, 1], c='green', label='cluster1')
plt.scatter(X[labels == 1, 0], X[labels == 1, 1], c='yellow', label='cluster2')
plt.scatter(X[labels == 2, 0], X[labels == 2, 1], c='red', label='cluster3')
plt.scatter(X[labels == 3, 0], X[labels == 3, 1], c='orange', label='cluster4')
plt.scatter(X[labels == 4, 0], X[labels == 4, 1], c='blue', label='cluster5')

# plotting the centroids
plt.scatter(tuned_clustering.cluster_centers_[:, 0], tuned_clustering.cluster_centers_[:, 1], s=300, c='black', label='centroid')

plt.title('Clusters of customers')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.legend()
plt.show()
Evaluation: how good is our clustering?
To check how good our clustering is, we use the Silhouette coefficient
Consider two clusters C1 and C2, and a point for which we measure:
a = mean intracluster distance (the average distance from the point to the other points in its own cluster)
b = mean intercluster distance (the average distance from the point to the points of the nearest other cluster)
Silhouette coefficient = (b − a) / max(a, b)
If our clustering is good, the intracluster distance is small relative to b, so the Silhouette coefficient is positive
If our clustering is bad, the intracluster distance is large relative to b, so the Silhouette coefficient is negative
The Silhouette coefficient lies between -1 and 1
The closer the value is to 1, the better the clustering
A value below 0 means the clustering is bad
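To make the formula concrete, here is a small sketch (my addition, assuming X and the labels array produced by the tuned K-means model above) that computes a, b, and the coefficient for a single point and checks it against scikit-learn's silhouette_samples:
# Sketch (not part of the original walkthrough): silhouette coefficient of one point
import numpy as np
from sklearn.metrics import silhouette_samples

def silhouette_of_point(i, X, labels):
    own_label = labels[i]
    same_cluster = X[labels == own_label]
    # a: mean distance from point i to the other points in its own cluster
    a = np.linalg.norm(same_cluster - X[i], axis=1).sum() / (len(same_cluster) - 1)
    # b: mean distance from point i to the points of the nearest other cluster
    b = min(np.mean(np.linalg.norm(X[labels == other] - X[i], axis=1))
            for other in set(labels) if other != own_label)
    return (b - a) / max(a, b)

print(silhouette_of_point(0, X, labels))   # manual value for the first customer
print(silhouette_samples(X, labels)[0])    # scikit-learn's value, should agree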
from sklearn import metrics
metrics.silhouette_score(X, tuned_clustering.labels_,
metric='euclidean')
We got a Silhouette coefficient of 0.553931997444648
It is moving towards 1, so our clustering is good
If you want to visualize the Silhouette coefficient per cluster:
# visualizing the Silhouette coefficient
from sklearn.metrics import silhouette_samples, silhouette_score
import matplotlib.cm as cm

for n_clusters in range(2, 10):
    # Create a subplot with 1 row and 2 columns
    fig, (ax1, ax2) = plt.subplots(1, 2)
    fig.set_size_inches(18, 7)

    # The 1st subplot is the silhouette plot
    # The silhouette coefficient can range from -1, 1 but in this example all
    # lie within [-0.1, 1]
    ax1.set_xlim([-0.1, 1])
    # The (n_clusters+1)*10 is for inserting blank space between silhouette
    # plots of individual clusters, to demarcate them clearly.
    ax1.set_ylim([0, len(X) + (n_clusters + 1) * 10])

    # Initialize the clusterer with n_clusters value and a random generator
    # seed of 10 for reproducibility.
    clusterer = KMeans(n_clusters=n_clusters, random_state=10)
    cluster_labels = clusterer.fit_predict(X)

    # The silhouette_score gives the average value for all the samples.
    # This gives a perspective into the density and separation of the formed clusters
    silhouette_avg = silhouette_score(X, cluster_labels)
    print("For n_clusters =", n_clusters,
          "The average silhouette_score is :", silhouette_avg)

    # Compute the silhouette scores for each sample
    sample_silhouette_values = silhouette_samples(X, cluster_labels)

    y_lower = 10
    for i in range(n_clusters):
        # Aggregate the silhouette scores for samples belonging to
        # cluster i, and sort them
        ith_cluster_silhouette_values = sample_silhouette_values[cluster_labels == i]
        ith_cluster_silhouette_values.sort()

        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i

        color = cm.nipy_spectral(float(i) / n_clusters)
        ax1.fill_betweenx(np.arange(y_lower, y_upper),
                          0, ith_cluster_silhouette_values,
                          facecolor=color, edgecolor=color, alpha=0.7)

        # Label the silhouette plots with their cluster numbers at the middle
        ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))

        # Compute the new y_lower for the next plot
        y_lower = y_upper + 10  # 10 for the 0 samples

    ax1.set_title("The silhouette plot for the various clusters.")
    ax1.set_xlabel("The silhouette coefficient values")
    ax1.set_ylabel("Cluster label")

    # The vertical line for the average silhouette score of all the values
    ax1.axvline(x=silhouette_avg, color="red", linestyle="--")

    ax1.set_yticks([])  # Clear the y-axis labels / ticks
    ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])

    # 2nd plot showing the actual clusters formed
    colors = cm.nipy_spectral(cluster_labels.astype(float) / n_clusters)
    ax2.scatter(X[:, 0], X[:, 1], marker='.', s=30, lw=0, alpha=0.7,
                c=colors, edgecolor='k')

    # Labeling the clusters
    centers = clusterer.cluster_centers_
    # Draw white circles at cluster centers
    ax2.scatter(centers[:, 0], centers[:, 1], marker='o',
                c="white", alpha=1, s=200, edgecolor='k')
    for i, c in enumerate(centers):
        ax2.scatter(c[0], c[1], marker='$%d$' % i, alpha=1,
                    s=50, edgecolor='k')

    ax2.set_title("The visualization of the clustered data.")
    ax2.set_xlabel("Feature space for the 1st feature")
    ax2.set_ylabel("Feature space for the 2nd feature")

    plt.suptitle(("Silhouette analysis for KMeans clustering on sample data "
                  "with n_clusters = %d" % n_clusters),
                 fontsize=14, fontweight='bold')

    plt.show()
For n_clusters = 2 The average silhouette_score is : 0.3273163942500746
For n_clusters = 3 The average silhouette_score is : 0.46761358158775435
For n_clusters = 4 The average silhouette_score is : 0.4931963109249047
For n_clusters = 5 The average silhouette_score is : 0.553931997444648
For n_clusters = 6 The average silhouette_score is : 0.5379675585622219
For n_clusters = 7 The average silhouette_score is : 0.5270287298101395
For n_clusters = 8 The average silhouette_score is : 0.4548653400650936
For n_clusters = 9 The average silhouette_score is : 0.4595491760122954
If we observe the above scores, K = 5 gives the highest Silhouette coefficient
If we observe the above plots, no cluster has negative silhouette values; everything moves towards 1
So we can conclude our clustering is good
(iv) Hierarchical Clustering:
Hierarchical clustering is tree-based clustering
Dendrogram: a tree-like diagram that records the sequence of merges
In Hierarchical clustering, we use Agglomerative clustering
Step 1: consider each data point as its own cluster
Step 2: merge clusters based on their similarity (distance)
If two clusters are very near to each other, group them into one cluster
Repeat until only a single cluster remains
In agglomerative clustering we do not need to give the number of clusters as a hyperparameter; we can cut the hierarchy at however many clusters we want
So the number of clusters is not a hyperparameter
# Dendrogram
import scipy.cluster.hierarchy as sch
cluster_visualising = sch.dendrogram(sch.linkage(df.iloc[:, [3, 4]].values, method='ward'))
plt.title('Dendrogram')
plt.xlabel('Customers')
plt.ylabel('Euclidean distances')
plt.show()
If we observe the above dendrogram, it groups the customers according to minimum Euclidean distance
Number of clusters from the dendrogram: we select the longest vertical line that is not cut by any horizontal line, draw a horizontal cut through it, and count the vertical lines it crosses
From the dendrogram we choose K = 5 clusters
For cluster similarity we use the Ward method, which merges the pair of clusters that gives the smallest increase in within-cluster variance
The Ward method is less susceptible to noise and outliers
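Because the full hierarchy is built first, we can cut the same linkage at whatever K we like. Here is a short sketch (my addition; scipy's fcluster is used for illustration and is not part of the original code) of cutting the Ward linkage into the K = 5 clusters we read off the dendrogram:
# Sketch: cut an existing Ward linkage at a chosen number of clusters
import scipy.cluster.hierarchy as sch

linkage_matrix = sch.linkage(df.iloc[:, [3, 4]].values, method='ward')

# cut the tree so that exactly 5 clusters remain (the K chosen from the dendrogram)
labels_k5 = sch.fcluster(linkage_matrix, t=5, criterion='maxclust')

# the same linkage can be cut at any other K without rebuilding the hierarchy
labels_k3 = sch.fcluster(linkage_matrix, t=3, criterion='maxclust')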
# AgglomerativeClustering model initialization
from sklearn.cluster import AgglomerativeClustering
clustering_model = AgglomerativeClustering(n_clusters=5, affinity='euclidean', linkage='ward')
clustering_model.fit(df.iloc[:, [3, 4]].values)

# Predicting clusters
clustering_prediction = clustering_model.fit_predict(df.iloc[:, [3, 4]].values)
clustering_prediction returns the predicted clusters
array([4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3,
4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 1,
4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 2, 0, 2, 0, 2,
1, 2, 0, 2, 0, 2, 0, 2, 0, 2, 1, 2, 0, 2, 1, 2, 0, 2, 0, 2, 0, 2,
0, 2, 0, 2, 0, 2, 1, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2,
0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2,
0, 2], dtype=int64)
Visualizing clusters
plt.scatter(df.iloc[:, [3, 4]].values[clustering_prediction == 0, 0], df.iloc[:, [3, 4]].values[clustering_prediction == 0, 1], s=100, c='green', label='Cluster 1')
plt.scatter(df.iloc[:, [3, 4]].values[clustering_prediction == 1, 0], df.iloc[:, [3, 4]].values[clustering_prediction == 1, 1], s=100, c='red', label='Cluster 2')
plt.scatter(df.iloc[:, [3, 4]].values[clustering_prediction == 2, 0], df.iloc[:, [3, 4]].values[clustering_prediction == 2, 1], s=100, c='blue', label='Cluster 3')
plt.scatter(df.iloc[:, [3, 4]].values[clustering_prediction == 3, 0], df.iloc[:, [3, 4]].values[clustering_prediction == 3, 1], s=100, c='yellow', label='Cluster 4')
plt.scatter(df.iloc[:, [3, 4]].values[clustering_prediction == 4, 0], df.iloc[:, [3, 4]].values[clustering_prediction == 4, 1], s=100, c='orange', label='Cluster 5')

plt.title('Clusters of customers')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.legend()
plt.show()
Evaluation
from sklearn import metrics
metrics.silhouette_score(df.iloc[:,[3,4]].values, clustering_prediction , metric='euclidean')
Silhouette coefficient: 0.5529945955148897
The coefficient is moving towards 1, so the clustering is good
(v) DBSCAN Clustering:
DBSCAN is a density-based clustering algorithm
We need to know a few terms before implementing it:
Dense region, sparse region, core point, border point, noise point, density edge, density-connected points
Please refer to the link below to understand the above-mentioned terms
In DBSCAN, Min points and epsilon are the hyperparameters
Step 1: for every point in the data set, label it as a core point, border point, or noise point
Step 2: remove all noise points
Step 3: for each core point p that is not yet assigned to a cluster,
create a new cluster with p and add all points that are density connected to p into this new cluster
Step 4: assign each border point to the cluster of its nearest core point
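Scikit-learn performs these steps internally. As a small sketch (my addition, assuming X from above and the eps = 9, min_samples = 4 values tuned later in this section), the core / border / noise labelling can be recovered from a fitted DBSCAN model like this:
# Sketch: recover core, border and noise points from a fitted DBSCAN model
import numpy as np
from sklearn.cluster import DBSCAN

db = DBSCAN(eps=9, min_samples=4).fit(X)

core_mask = np.zeros(len(X), dtype=bool)
core_mask[db.core_sample_indices_] = True        # Step 1: core points
noise_mask = db.labels_ == -1                    # noise points are labelled -1
border_mask = ~core_mask & ~noise_mask           # in a cluster, but not a core point

print(core_mask.sum(), border_mask.sum(), noise_mask.sum())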
Hyperparameter optimization:
In DBSCAN, Min points (min_samples) and epsilon (eps) are the hyperparameters
Min points:
Rule of thumb: Min points ≥ dimensionality + 1
If the data set is noisier, we use a larger Min points value because it removes noisy points more easily. Here the data has 2 dimensions, so Min points should be at least 3; we use 4.
Epsilon (radius): Elbow method
Step 1: for every data point x, compute the distance d to its nearest neighbour
Step 2: sort the distances in increasing order
Then plot the graph of point index against distance
We choose the optimum distance (epsilon) at the point where there is a sharp rise in the graph
# we use NearestNeighbors for calculating the distance between points
from sklearn.neighbors import NearestNeighbors

# calculating distances to the nearest neighbour
neigh = NearestNeighbors(n_neighbors=2)
distance = neigh.fit(X)

# indices and distance values
distances, indices = distance.kneighbors(X)

# sorting the nearest-neighbour distances in increasing order
sorted_distances = np.sort(distances, axis=0)[:, 1]

# plot of point index vs distance (epsilon)
plt.plot(sorted_distances)
plt.xlabel('Points sorted by distance')
plt.ylabel('Epsilon (distance)')
plt.show()
If we observe the graph, there is a sharp rise at a distance of about 9, so we choose epsilon (radius) = 9
Implementing DBSCAN with optimized hyperparameters
from sklearn.cluster import DBSCAN

# initializing DBSCAN with the optimized hyperparameters
clustering_model = DBSCAN(eps=9, min_samples=4)

# fit the model to X
clustering_model.fit(X)

# predicted labels by DBSCAN
predicted_labels = clustering_model.labels_

# visualizing clusters
plt.scatter(X[:, 0], X[:, 1], c=predicted_labels, cmap='Paired')
plt.xlabel('Annual Income')
plt.ylabel('Spending Score')
plt.title("DBSCAN")
plt.show()
Evaluation :
from sklearn import metrics
metrics.silhouette_score(X, predicted_labels)
Silhouette coefficient: 0.4259680122384905
The Silhouette coefficient is moving towards 1, so our clustering is good
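One caveat (my addition, not part of the original analysis): DBSCAN labels noise points as -1, and silhouette_score treats that -1 group as one more cluster, which can pull the score down. A quick sketch of scoring only the non-noise points, assuming predicted_labels from above:
# Sketch: silhouette score computed on the non-noise points only
non_noise = predicted_labels != -1
metrics.silhouette_score(X[non_noise], predicted_labels[non_noise])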
(vi) Conclusion:
I hope you find this article helpful. Please give me your feedback; it will be helpful for me.