“Clustering Algorithms, you say? I must be late to the party.”

Explaining Partition-based and Density-based Clustering Algorithms

Stuti Sehgal
DataX Journal
6 min read · Jul 16, 2020


Clustering is the process of grouping similar data points by finding patterns and similarities in unlabelled, unseen data. Clustering algorithms are widely used in market segmentation, search engines, recommendation systems, and diagnostic systems.

If you encounter a dataset without a labeled target variable, there is still a way to gain insight from it without the hassle of labeling it. Even when there is no clear target variable to predict, the data can be put to use. This is the world of ‘unsupervised learning’.

Photo by Franki Chamaki on Unsplash

The algorithm tries to find the underlying structure of the data through different approaches, which can be divided into three sub-categories:

1. Partition-based clustering: K-means, K-median

2. Hierarchical clustering: Agglomerative, Divisive

3. Density-based clustering: DBSCAN

K-Means Clustering:

Partition-based clustering algorithms like K-means group data points with similar features into K clusters (K being a positive integer), such that points with dissimilar features end up in different clusters.

Convex-spherical shaped datasets -> Clusters of data points

K-means is an iterative algorithm that partitions the dataset into K pre-defined, distinct, non-overlapping clusters, where each data point belongs to exactly one group. It tries to minimize the intra-cluster distance and maximize the inter-cluster distance. The less variation we have within clusters, the more homogeneous the data points within the same cluster are.

Methodological Approach towards K-means Clustering:

  1. Perform MinMax scaling before applying the K-means algorithm, plot the dataset (EDA, Exploratory Data Analysis), and choose K, i.e., the optimum number of clusters.
  2. Initialize the centroids by selecting K random data points (a centroid is the arithmetic mean of all the data points that belong to its cluster).
    Note: scikit-learn’s default k-means++ initialization tries to spread the initial centroids far apart.
  3. Assign each data point to the cluster whose centroid minimizes the squared distance to that point.
  4. Compute the new centroid of each cluster from its assigned points.
  5. Convergence: repeat steps 3 and 4, reassigning points using Euclidean (or Manhattan) distances, until the within-cluster sum of squared errors stops decreasing (see the sketch below).
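As a quick illustration of these steps, here is a minimal NumPy sketch. It is illustrative only: the helper name kmeans and its parameters are chosen for this example, and the scikit-learn implementation later in the article is what we actually use.

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: initialize centroids by picking k random data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 3: assign each point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its assigned points
        # (this sketch ignores the empty-cluster edge case)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop when the centroids no longer move (convergence)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids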

Elbow Method:

The elbow method gives us an idea of the optimum K, i.e., the number of clusters, based on the sum of squared distances (SSE) between data points and their assigned clusters’ centroids. We pick K at the spot where the SSE curve starts to flatten out and form an elbow, i.e., where the within-cluster sum of squared errors stops decreasing sharply.

Elbow method -> minimized within-cluster sum of squared errors (the cost function J)
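For reference, the quantity the elbow method tracks (what scikit-learn exposes as inertia_) is the within-cluster sum of squared errors:

J = \sum_{i=1}^{K} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2

where \mu_i is the centroid of cluster C_i. Picking K at the elbow trades off a low J against a small number of clusters.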

Scikit-learn Implementation:

We start by creating a sample dataset using the datasets module of scikit-learn. We will create a dataset with 2 features and a standard deviation of 0.5 for each cluster. The number of samples is 150, and we choose three points as centroids.

from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

X, y = make_blobs(n_samples=150, n_features=2, centers=3, cluster_std=0.5,
                  shuffle=True, random_state=0)

Plot dataset.

plt.scatter(X[:, 0], X[:, 1], marker='o', s=50)
plt.grid()
plt.show()

We set n_init=10 to run the K-means clustering algorithm 10 times independently with different random centroids and keep the final model with the lowest SSE. The max_iter parameter specifies the maximum number of iterations for each single run.

from sklearn.cluster import KMeans

km = KMeans(n_clusters=3, n_init=10, max_iter=300)
km.fit(X)
pred = km.predict(X)
pred

Within-Cluster Sum of Squared Errors:

km.inertia_

Plot cluster and centroids.

plt.scatter(X[:, 0], X[:, 1], c=pred, cmap='winter')
centers = km.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200)
plt.show()

Using the elbow method to find the optimal number of clusters.

WSSE = []
for i in range(1, 11):
    km = KMeans(n_clusters=i, random_state=0)
    km.fit(X)
    WSSE.append(km.inertia_)

plt.plot(range(1, 11), WSSE, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Within-cluster sum of squared errors')
plt.show()

import pandas as pd
pd.DataFrame(WSSE, index=range(1, 11))

DBSCAN Clustering:

K-means clustering works well for spherical-shaped clusters. However, when it comes to arbitrarily shaped clusters, outlier detection, and anomaly detection, density-based techniques are more effective. With Density-Based Spatial Clustering of Applications with Noise (DBSCAN), there is no need to know the number of clusters in advance, unlike the K-means algorithm. DBSCAN finds all dense regions of the plotted sample points and treats these regions as clusters; it is robust to noise points that lie far away from any cluster core.

Types of data points in arbitrary DBSCAN dataset

There are two key parameters of DBSCAN:

1. eps: The distance that defines the neighborhoods. Two points are considered neighbors if the distance between them is less than or equal to eps. In general, eps should be chosen as small as possible; the knee method described below can help pick a suitable value.

2. min_points: The minimum number of data points within the eps neighborhood required to form a dense region.

Types of points:

  1. Core point: a data point whose eps-neighborhood contains at least min_points sample points.
  2. Border point: a data point that lies within the neighborhood of a core point but is not itself a core point.
  3. Noise point: a data point that is far away from all clusters and core points.

Methodological Approach towards DBSCAN Clustering:

  1. Plot the data points. Pick an unassigned point at random and compute its eps-neighborhood to check whether it is a core point. If it is not, mark it as noise for now.
  2. If the number of points within the eps radius of the sample point is ≥ min_points, it is a core point; form a temporary cluster around it and check the data points in its neighborhood.
  3. All points in the direct density reach of the core point join the temporary cluster, and all points in the direct density reach of its neighboring core points are added as well.
  4. Combine temporary clusters to obtain the final clusters.
    For each temporary cluster, check whether the data points in it are core points; if so, merge their temporary clusters into a new cluster.
    A point that was marked as noise may be revisited and become part of a cluster as a border point.
  5. Repeat step 4 until every point has been visited, i.e., each point is either a core point, a neighbor of a core point (border point), or noise (see the sketch below).
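The following is a minimal from-scratch sketch of this procedure. It is illustrative only: names such as region_query, dbscan_sketch, eps, and min_points are chosen for this example, and the scikit-learn DBSCAN used below is what we actually rely on.

import numpy as np

def region_query(X, idx, eps):
    # Indices of all points within eps of point idx (the point itself included)
    return np.where(np.linalg.norm(X - X[idx], axis=1) <= eps)[0]

def dbscan_sketch(X, eps=0.3, min_points=10):
    labels = np.full(len(X), -1)           # -1 marks noise / not yet assigned
    visited = np.zeros(len(X), dtype=bool)
    cluster_id = 0
    for i in range(len(X)):
        if visited[i]:
            continue
        visited[i] = True
        neighbors = region_query(X, i, eps)
        if len(neighbors) < min_points:
            continue                        # noise for now; may become a border point later
        labels[i] = cluster_id              # i is a core point: start a new cluster
        queue = list(neighbors)
        while queue:                        # expand through density-reachable points
            j = queue.pop()
            if not visited[j]:
                visited[j] = True
                j_neighbors = region_query(X, j, eps)
                if len(j_neighbors) >= min_points:   # j is also a core point
                    queue.extend(j_neighbors)
            if labels[j] == -1:             # unclaimed point joins the current cluster
                labels[j] = cluster_id
        cluster_id += 1
    return labels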

Knee-Method:

The goal is to compute, for every point, the average distance to its K nearest neighbors, sort these values, and select eps at the distance where a sharp change (the knee) happens. The value of K is set equal to min_points; by default, K = min_points = 4.
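A sketch of this k-distance calculation with scikit-learn’s NearestNeighbors is shown below. It assumes the scaled data X_scaled created in the implementation that follows, and K = min_points = 4.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

k = 4  # K = min_points
# n_neighbors=k+1 because each point is returned as its own nearest neighbor
nn = NearestNeighbors(n_neighbors=k + 1).fit(X_scaled)
distances, _ = nn.kneighbors(X_scaled)
avg_k_distances = np.sort(distances[:, 1:].mean(axis=1))  # drop the self-distance of 0
plt.plot(avg_k_distances)
plt.xlabel('Points sorted by average distance to their {} nearest neighbors'.format(k))
plt.ylabel('Average k-nearest-neighbor distance')
plt.title('Pick eps at the knee of this curve')
plt.show()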

Scikit-learn Implementation:

We start by creating a sample dataset using the datasets module of scikit-learn. After creating the sample data points, we normalize the values using the StandardScaler class from the preprocessing module of scikit-learn. We will create and plot a dataset of 400 samples with a standard deviation of 0.6 for each cluster.

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn import metrics
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

X, y = make_blobs(n_samples=400, cluster_std=0.6, random_state=0)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='cool')

# Normalize the features before running DBSCAN
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

We can now create a DBSCAN object, fit the data, and visualize the clusters.

dbscan = DBSCAN(eps=0.3, metric="euclidean", min_samples=10)
clusters = dbscan.fit_predict(X_scaled)

plt.scatter(X[:, 0], X[:, 1], c=dbscan.labels_, cmap="plasma")
plt.xlabel("Feature 0")
plt.ylabel("Feature 1")
plt.title('eps={}, min_samples={}'.format(dbscan.eps, dbscan.min_samples))
plt.show()

print("Adjusted Rand Index: %0.3f" % metrics.adjusted_rand_score(y, dbscan.labels_))

Parameter eps (Application of Knee-Method):

plt.figure(figsize=(12, 8))
for i, k in enumerate([0.15, 0.3, 0.5, 1.5], start=1):
    plt.subplot(2, 2, i)
    db = DBSCAN(eps=k, metric="euclidean", min_samples=10)
    db.fit(X_scaled)
    plt.scatter(X[:, 0], X[:, 1], c=db.labels_, cmap="plasma")
    # plt.xlabel("Feature 0")
    # plt.ylabel("Feature 1")
    plt.title('eps={}, min_samples={}'.format(db.eps, db.min_samples))
plt.show()

Parameter min_samples:

plt.figure(figsize=(12, 8))
for i, k in enumerate([5, 10, 15, 20], start=1):
    plt.subplot(2, 2, i)
    db = DBSCAN(eps=0.3, metric="euclidean", min_samples=k)
    db.fit(X_scaled)
    plt.scatter(X[:, 0], X[:, 1], c=db.labels_, cmap="plasma")
    # plt.xlabel("Feature 0")
    # plt.ylabel("Feature 1")
    plt.title('eps={}, min_samples={}'.format(db.eps, db.min_samples))
plt.show()

Which clustering technique to use depends on the dataset. Even though K-means is the most widely used clustering approach, there are datasets where DBSCAN produces better clusters.

And finally, if you liked this article or if it helped you in any way, please do give a clap. And I wouldn’t mind 50 of them ;)


Stuti Sehgal
DataX Journal

A Computer Science undergraduate student with an interest in application development, machine learning, data science, and artificial intelligence | Summer IT Intern @Google