DBSCAN Clustering

Balaji C
4 min read · Nov 11, 2023


Density-Based Spatial Clustering of Applications with Noise (DBSCAN).

Unlike KMeans or K-medoids, the desired number of clusters (K) is not given as input; instead, DBSCAN determines dense clusters from the data points.

The main aim of DBSCAN is to find clusters that satisfy a minimum size and density. Density is defined as a minimum number of points within a certain distance of each other. It handles outliers easily and efficiently: outliers do not lie in dense regions, so they cannot form clusters.

Concept of Min.points and ε (threshold value, Eps).

DBSCAN rests on a few main concepts: ε, Min.points, core points, border points, and noise points.

ε: Defines the neighborhood around a data point, i.e. if the distance between two points is less than or equal to ε, they are considered neighbors. If ε is chosen too small, a large part of the data will be treated as outliers. If ε is too large, clusters will merge and the majority of data points will end up in the same cluster. One way to choose ε is from a k-distance graph.

Min.points: The minimum number of neighbors (data points) within the ε radius. The larger the dataset, the larger the value of Min.points that should be chosen.
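
To make the k-distance idea concrete, here is a minimal sketch in plain Python. The helper name `k_distances` and the sample points are made up for illustration, and setting k close to Min.points is a common heuristic rather than a rule. For every point it takes the distance to its k-th nearest neighbor, then sorts those values; plotted in order, the sharp bend ("elbow") in the curve suggests a candidate ε.

```python
import math

def k_distances(points, k):
    """Return the sorted distance from each point to its k-th nearest neighbor."""
    result = []
    for i, p in enumerate(points):
        # Distances from p to every other point, nearest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return sorted(result)

# Made-up 2-D points: a tight square of four points plus one far-away point.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (2.0, 2.0), (10.0, 10.0)]

curve = k_distances(points, k=2)
print(curve)  # four values of 1.0, then a jump to about 12.04
```

In this toy case the jump from 1.0 to roughly 12 marks the elbow, so an ε slightly above 1.0 would keep the square together and leave the distant point out.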

Core point: A point is a core point if at least Min.points points lie within ε of it, counting the point itself (the convention used in the example that follows).

Border point: A point that has fewer than Min.points points within ε but lies in the neighborhood of a core point.

Noise point: A point that is neither a core point nor a border point.
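
The three definitions can be sketched directly in code. This is an illustrative sketch, not a full DBSCAN implementation: the function name `classify_points` and the sample points are assumptions, and a point is assumed to count itself among the points within ε.

```python
import math

def classify_points(points, eps, min_pts):
    """Label each point 'core', 'border' or 'noise'.

    Assumes a point counts itself among the points within eps.
    """
    # Indices of all points within eps of each point (including itself).
    neighbours = [
        [j for j, q in enumerate(points) if math.dist(p, q) <= eps]
        for p in points
    ]
    is_core = [len(n) >= min_pts for n in neighbours]

    labels = []
    for i, n in enumerate(neighbours):
        if is_core[i]:
            labels.append("core")
        elif any(is_core[j] for j in n):
            # Not dense enough itself, but inside a core point's neighborhood.
            labels.append("border")
        else:
            labels.append("noise")
    return labels

# Made-up points: a dense group, one point on its edge, one far away.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (8, 8)]
print(classify_points(points, eps=1.0, min_pts=3))
# ['core', 'core', 'core', 'core', 'border', 'noise']
```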

Example: ε = 1.9, Min.points = 4

We need to find the distance between every pair of points using the Euclidean distance. The Euclidean distance used for the distance matrix is d = √((x₂ − x₁)² + (y₂ − y₁)²)

i.e. d(P1, P2) = d((7,4), (6,4)) = √((6 − 7)² + (4 − 4)²) = 1
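
The same calculation in Python, as a small sketch. The helper name `euclidean` is made up; the coordinates of P1 and P2 are the ones given above.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two 2-D points given as (x, y) tuples."""
    return math.sqrt((q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2)

p1 = (7, 4)  # P1 from the example
p2 = (6, 4)  # P2 from the example

d = euclidean(p1, p2)
print(d)         # 1.0
print(d <= 1.9)  # True: P1 and P2 are neighbors for eps = 1.9
```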

P1: P2, P5 and P9 are within ε = 1.9. Counting P1 itself, that is 4 points, so P1 is a core point.

P2: P1, P5 and P9 are within ε = 1.9. Counting P2 itself, that is 4 points, so P2 is a core point.

P3: P8 and P9 are within ε = 1.9. That is only 3 points, so P3 is not a core point; it is a border point because P8 and P9 are core points.

P4: P6 and P7 are within ε = 1.9. That is only 3 points, so P4 is not a core point. Neither P6 nor P7 is a core point, so P4 is a noise point.

P5: P1, P2 and P6 are within ε = 1.9. Counting P5 itself, that is 4 points, so P5 is a core point.

P6: P4 and P5 are within ε = 1.9. That is only 3 points, so P6 is not a core point; it is a border point because P5 is a core point.

P7: P4 and P11 are within ε = 1.9. That is only 3 points, and neither P4 nor P11 is a core point, so P7 is a noise point.

P8: P3, P10 and P11 are within ε = 1.9. Counting P8 itself, that is 4 points, so P8 is a core point.

P9: P1, P2 and P3 are within ε = 1.9. Counting P9 itself, that is 4 points, so P9 is a core point.

P10: Only P8 is within ε = 1.9. That is 2 points, so P10 is not a core point; it is a border point because P8 is a core point.

P11: P7 and P8 are within ε = 1.9. That is only 3 points, so P11 is not a core point; it is a border point because P8 is a core point.

P12: No other points are within ε = 1.9, so it is neither a core point nor a border point. Hence it is a noise point.
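
The walkthrough can be checked mechanically. The sketch below takes the neighbor lists exactly as listed above (the coordinates themselves are not needed), applies the core/border/noise definitions with Min.points = 4, and then groups core points that are neighbors of each other. The variable and function names are illustrative only. Under these definitions P4 comes out as noise, since neither of its neighbors is a core point.

```python
# Neighbor lists copied from the walkthrough above (eps = 1.9).
neighbours = {
    "P1": ["P2", "P5", "P9"],
    "P2": ["P1", "P5", "P9"],
    "P3": ["P8", "P9"],
    "P4": ["P6", "P7"],
    "P5": ["P1", "P2", "P6"],
    "P6": ["P4", "P5"],
    "P7": ["P4", "P11"],
    "P8": ["P3", "P10", "P11"],
    "P9": ["P1", "P2", "P3"],
    "P10": ["P8"],
    "P11": ["P7", "P8"],
    "P12": [],
}
MIN_PTS = 4

# A point is core if it and its neighbors add up to at least MIN_PTS points.
core = {p for p, nbrs in neighbours.items() if len(nbrs) + 1 >= MIN_PTS}

def kind(p):
    if p in core:
        return "core"
    if any(n in core for n in neighbours[p]):
        return "border"
    return "noise"

labels = {p: kind(p) for p in neighbours}

# Group core points that are neighbors of each other into clusters.
clusters = []
unvisited = set(core)
while unvisited:
    stack = [unvisited.pop()]
    group = set(stack)
    while stack:
        current = stack.pop()
        for n in neighbours[current]:
            if n in unvisited:
                unvisited.remove(n)
                group.add(n)
                stack.append(n)
    clusters.append(sorted(group))

print(labels)
print(sorted(clusters))  # [['P1', 'P2', 'P5', 'P9'], ['P8']]
```

The core points fall into two groups, {P1, P2, P5, P9} and {P8}, and each border point joins the cluster of a core point in its neighborhood. P3 is within ε of both P8 and P9, so which cluster it ends up in depends on the order in which points are processed.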

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN

# Load the mall customers data and keep the three numeric features.
df = pd.read_csv('Mall_Customers.csv')
X_train = df[['Age', 'Annual Income (k$)', 'Spending Score (1-100)']]

# Fit DBSCAN; labels_ holds one cluster id per row, with -1 marking noise.
clustering = DBSCAN(eps=12.5, min_samples=4).fit(X_train)
DBSCAN_dataset = X_train.copy()
DBSCAN_dataset['Cluster'] = clustering.labels_

# Separate the clustered points from the outliers.
clustered = DBSCAN_dataset[DBSCAN_dataset['Cluster'] != -1]
outliers = DBSCAN_dataset[DBSCAN_dataset['Cluster'] == -1]

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# x and y are passed as keywords; recent seaborn releases do not accept them positionally.
sns.scatterplot(x='Annual Income (k$)', y='Spending Score (1-100)',
                data=clustered, hue='Cluster', ax=axes[0],
                palette='Set2', legend='full', s=200)
sns.scatterplot(x='Age', y='Spending Score (1-100)',
                data=clustered, hue='Cluster', ax=axes[1],
                palette='Set2', legend='full', s=200)

# Overlay the outliers as small black dots.
axes[0].scatter(outliers['Annual Income (k$)'], outliers['Spending Score (1-100)'], s=10, label='outliers', c="k")
axes[1].scatter(outliers['Age'], outliers['Spending Score (1-100)'], s=10, label='outliers', c="k")
axes[0].legend()
axes[1].legend()

plt.setp(axes[0].get_legend().get_texts(), fontsize='12')
plt.setp(axes[1].get_legend().get_texts(), fontsize='12')

plt.show()
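
Before plotting, it can help to see how many clusters and outliers a given eps and min_samples produce. The helper below is a small sketch; `summarize_labels` and the sample label list are hypothetical, with the list standing in for `clustering.labels_`. It relies only on the scikit-learn convention that noise is labeled -1.

```python
from collections import Counter

def summarize_labels(labels):
    """Count points per cluster; scikit-learn marks noise with the label -1."""
    counts = Counter(int(label) for label in labels)
    n_noise = counts.pop(-1, 0)
    return {"n_clusters": len(counts), "n_noise": n_noise, "sizes": dict(counts)}

# Hypothetical label array, standing in for clustering.labels_.
example_labels = [0, 0, 1, -1, 1, 0, -1, 2, 2, 2]
print(summarize_labels(example_labels))
# {'n_clusters': 3, 'n_noise': 2, 'sizes': {0: 3, 1: 2, 2: 3}}
```

Calling `summarize_labels(clustering.labels_)` after fitting gives a quick check of whether the chosen parameters leave too many points as noise or merge everything into one cluster.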
