Theory & Python Implementation of K-Means vs DBSCAN
The unsupervised machine learning family includes several popular models, such as:
1. K-Means
2. DBSCAN
3. PCA, etc.
K-Means:
K-Means is an unsupervised machine learning model: there is no labelled dataset, so we take a clustering approach. The goal is to partition the data into groups, and the number of groups is denoted by K.
Working method (a minimal from-scratch sketch follows this list):
- Choose the number of clusters k and pick k points from the dataset as initial centroids.
- Assign every other point to the cluster of its nearest chosen point.
- Calculate the centroid of each cluster.
- Reassign every point to the cluster whose centroid is nearest.
- Repeat the last two steps until the error stops decreasing.
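To make these steps concrete, below is a minimal from-scratch sketch in NumPy; the function name, the synthetic data, and the stopping rule are illustrative assumptions rather than part of the example that follows.

import numpy as np

def k_means(points, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: choose k points from the dataset as the initial centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign each point to its nearest centroid
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its cluster
        new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# illustrative usage on two well-separated random blobs
points = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
labels, centroids = k_means(points, k=2)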
Choosing the K value: Suppose we have a set of observations (x1, x2, …, xn). K-means clustering aims to partition the n observations into k (≤ n) sets S = {S1, S2, …, Sk} so as to minimize the within-cluster sum of squares (WCSS):

WCSS = Σᵢ₌₁ᵏ Σ_{x ∈ Sᵢ} ‖x − μᵢ‖², where μᵢ is the centroid of cluster Sᵢ.

This is because, when forming clusters, we want to maximize the distance between clusters and minimize the distance between data points within the same cluster.
Elbow Method: We plot the error (WCSS) against the number of clusters. As the number of clusters increases, the error decreases, but after a certain point it stops decreasing noticeably; that point gives the optimal value of K. To apply this method we minimize the WCSS, measuring distances between data points with Euclidean distance. Because the plot bends like an arm at the optimal K, this is called the Elbow Method.
Python Implementation:
#......Import the relevant libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
#......read the data
data = pd.read_csv('3.12. Example.csv')
plt.scatter(data['Satisfaction'],data['Loyalty'])
x = data.copy()
#......scale the data
scaler = StandardScaler()
x_scaled= scaler.fit_transform(x)
x_scaled
#......to minimize the error by WCSS
WCSS = []
for k in range(2, 30):
    model = KMeans(n_clusters=k)
    model.fit(x_scaled)
    WCSS.append(model.inertia_)
#......look for the K where the WCSS stops dropping sharply
plt.plot(range(2, 30), WCSS, 'o--')
pd.Series(WCSS).diff().plot(kind='bar')
#.....fit the final model with the K chosen from the elbow plot
kmeans = KMeans(n_clusters=4)  # 4 is a placeholder; use the K where the elbow appears
clusters = x.copy()
#.....labelling the clusters
clusters['cluster_pred'] = kmeans.fit_predict(x_scaled)
clusters
plt.scatter(clusters['Satisfaction'],clusters['Loyalty'],c=clusters['cluster_pred'],cmap='rainbow')
plt.xlabel('Satisfaction')
plt.ylabel('Loyalty')
Practical Application:
- Customer Segmentation where the dataset is unlabeled
- Inventory categorization
- Behavioral segmentation
DBSCAN:
Where clustering in K-Means is done by measuring distances to centroids, DBSCAN (Density-Based Spatial Clustering of Applications with Noise) forms clusters from regions where points are densely packed, so it can work on datasets whose clusters are not well separated. In real life, data is rarely well structured or well separated.
Parameters (a short scikit-learn sketch follows this list):
- Epsilon: the radius of the neighborhood around a data point. If it is too high, most potential clusters merge into one; if it is too low, most points fail to find enough neighbors within the radius and end up as noise or tiny clusters.
- Minimum points: the minimum number of points required inside an Epsilon neighborhood. For very large datasets, it should also be larger.
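As a quick sketch of how these two parameters map onto scikit-learn's API (the values shown are scikit-learn's defaults, used here only as placeholders to be tuned):

from sklearn.cluster import DBSCAN
# eps is the neighborhood radius; min_samples is the minimum number of
# points required inside that radius for a point to count as a core point
dbscan = DBSCAN(eps=0.5, min_samples=5)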
Types of data point (see the sketch after this list):
- Core point: a point with at least the minimum number of points within Epsilon distance of it.
- Border point: a point with fewer than the minimum number of points within Epsilon distance, but which lies in the neighborhood of a core point.
- Outlier: a point that is neither a core point nor a border point.
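All three point types can be recovered from a fitted scikit-learn DBSCAN. The sketch below assumes a small synthetic dataset from make_blobs and a hand-picked eps:

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# illustrative synthetic data; any (n_samples, n_features) array works
data, _ = make_blobs(n_samples=200, centers=3, random_state=42)
dbscan = DBSCAN(eps=0.9, min_samples=5).fit(data)

core_mask = np.zeros(len(data), dtype=bool)
core_mask[dbscan.core_sample_indices_] = True  # core points
outlier_mask = dbscan.labels_ == -1            # outliers are labelled -1
border_mask = ~core_mask & ~outlier_mask       # in a cluster, but not core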
Working method:
- Find all neighbor points within Epsilon of each point and identify the core points.
- Create a cluster for each core point that is not already assigned to one.
- Find all of the core point's density-connected points and assign them to the same cluster.
- Iterate through the remaining points and assign them to a cluster where possible. Points that end up in no cluster are the outliers.
Python Implementation:
Below is basic syntax that can be reused wherever DBSCAN is needed to form clusters. Along the way we compare K-Means and DBSCAN through a generalized plotting function, which makes it clear which method can separate the data.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv('...')  # path elided in the original; point this at your dataset
df.describe()
df['column'].value_counts()  # 'column' is a placeholder for a categorical feature
df_dummies = pd.get_dummies(df.drop('unique', axis=1))
data = df_dummies  # assumes the resulting feature columns include 'X1' and 'X2'

def display_categories(model, data):
    labels = model.fit_predict(data)
    sns.scatterplot(data=data, x='X1', y='X2', hue=labels, palette='Set1')

from sklearn.cluster import KMeans
K_Means_model = KMeans(n_clusters=2)
display_categories(K_Means_model, data)

from sklearn.cluster import DBSCAN
DBSCAN_model = DBSCAN(eps=0.15)
display_categories(DBSCAN_model, data)
# for detecting outliers (DBSCAN labels them -1)
np.sum(DBSCAN_model.labels_ == -1)
100 * np.sum(DBSCAN_model.labels_ == -1) / len(DBSCAN_model.labels_)
# sweep eps to see how the number of outliers responds
outlier_percent = []
number_of_outliers = []
for eps in np.linspace(0.001, 10, 100):
    dbscan = DBSCAN(eps=eps)
    dbscan.fit(data)
    number_of_outliers.append(np.sum(dbscan.labels_ == -1))
    perc_outlier = 100 * np.sum(dbscan.labels_ == -1) / len(dbscan.labels_)
    outlier_percent.append(perc_outlier)
sns.lineplot(x=np.linspace(0.001, 10, 100), y=number_of_outliers)
- KMeans fails on data that does not have the same variance in all directions, whereas DBSCAN's density-based approach captures clusters of any shape and variance, as the sketch below shows.
- KMeans is badly affected by outliers. DBSCAN identifies dense regions by grouping points that are close to each other under a distance measure; the remaining points are left as outliers.
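The contrast is easy to reproduce on a synthetic two-moons dataset, where the clusters are dense but not spherical; the eps value below is a hand-picked assumption for this data, not a general recommendation.

import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, DBSCAN

X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

# K-Means cuts the moons apart by straight-line distance to the centroids
kmeans_labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)
# DBSCAN follows the dense curve of each moon instead
dbscan_labels = DBSCAN(eps=0.3).fit_predict(X)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(X[:, 0], X[:, 1], c=kmeans_labels, cmap='rainbow')
ax1.set_title('KMeans')
ax2.scatter(X[:, 0], X[:, 1], c=dbscan_labels, cmap='rainbow')
ax2.set_title('DBSCAN')
plt.show()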