Decluttering Clustering

Prasanna Sethuraman
7 min read · Mar 19, 2023


Since you have landed on this page, you likely already know that clustering is a fundamental technique in Machine Learning and Data Analysis. It is useful for several objectives such as classification, segmentation, dimensionality reduction, outlier detection and extracting additional insights from data. Unsupervised clustering algorithms such as K-means and DBSCAN need no prior knowledge or training and can identify clusters by grouping data points that lie close to each other. There are several good tutorials on the web on K-means and DBSCAN, so this article provides only a quick review of the two algorithms and then shows simple implementations for the one-dimensional case.

Let us start by importing the usual packages.

import matplotlib.pyplot as plt
from scipy import signal
import numpy as np

The data clusters can be created by generating different ellipses and adding noise. We define the following method to generate the data for each cluster.

def generate_ellipse(npts, majaxis, minaxis, rotation, center, nvar):
    # Generate angles between 0 and 2*pi
    angles = np.linspace(0, 2 * np.pi, npts)
    # Calculate x and y points on the rotated ellipse
    x1 = center[0] + majaxis*np.cos(angles)*np.cos(rotation) - minaxis*np.sin(angles)*np.sin(rotation)
    y1 = center[1] + majaxis*np.cos(angles)*np.sin(rotation) + minaxis*np.sin(angles)*np.cos(rotation)
    # Add Gaussian noise (nvar is used as the standard deviation) to both coordinates
    x1 += np.random.normal(0, nvar, size=x1.shape)
    y1 += np.random.normal(0, nvar, size=y1.shape)
    ellipse = np.column_stack((x1, y1))
    return ellipse

With the following code, we then generate data that has two clusters.

ellipse1 = generate_ellipse(1000,1,0.5,np.pi/4,(5,2),0.5)
ellipse2 = generate_ellipse(1000,1,0.5,np.pi/4,(1,1),0.5)
data = np.concatenate((ellipse1, ellipse2))
plt.figure()
plt.scatter(data[:, 0], data[:, 1])
plt.title('Data')
plt.xlabel('X')
plt.ylabel('Y')
plt.show()

K-means clustering finds the centroids that represent each of the clusters in the data. We do, however, need to specify the number of clusters in the dataset up front. The algorithm starts with random centroids for the specified number of clusters. For each centroid, we find all the data points that are closer to it than to any other centroid. This step can also be replaced by computing the likelihood of the data points belonging to the cluster defined by this centroid. Once a subset of data points is associated with a cluster, its centroid is recomputed from all the data points associated with that cluster. This constitutes one iteration of the algorithm. The iterations continue until the centroids no longer change.

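Before turning to scikit-learn, here is a minimal from-scratch sketch of that iteration; the function name and parameters are mine, purely for illustration.

# Minimal K-means sketch: assignment step, update step, stop when centroids settle
def kmeans_sketch(points, n_clusters=2, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Start from randomly chosen data points as the initial centroids
    centroids = points[rng.choice(len(points), n_clusters, replace=False)]
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # Update step: each centroid becomes the mean of its assigned points
        new_centroids = np.array([
            points[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(n_clusters)
        ])
        # Stop when the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

scikit-learn’s KMeans class implements the same idea with better initialization, and that is what we use here.
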
from sklearn.cluster import KMeans
# Fit a k-means model to the data
kmeans = KMeans(n_clusters=2)
kmeans.fit(data)

# Get the cluster labels and centroids
labels = kmeans.labels_
centroids = kmeans.cluster_centers_

# Plot the data points and centroids
plt.figure()
plt.scatter(data[:, 0], data[:, 1], c=labels)
plt.scatter(centroids[:, 0], centroids[:, 1], marker='x', s=200, linewidths=3, color='r')
plt.title('K-means Clustering')
plt.show()

If we change the number of clusters, K-means groups the data points into that many clusters. In practice, we don’t know how many clusters there are in the data and need additional strategies to estimate the number of clusters; one such strategy, the elbow method, is sketched after the following example.

kmeans = KMeans(n_clusters=5)
kmeans.fit(data)

# Get the cluster labels and centroids
labels = kmeans.labels_
centroids = kmeans.cluster_centers_

# Plot the data points and centroids
plt.figure()
plt.scatter(data[:, 0], data[:, 1], c=labels)
plt.scatter(centroids[:, 0], centroids[:, 1], marker='x', s=200, linewidths=3, color='r')
plt.title('K-means Clustering')
plt.show()
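
One common strategy, sketched here as an aside rather than taken from the original walkthrough, is the elbow method: fit K-means for a range of cluster counts and look for the point where the inertia curve stops dropping sharply.

# Elbow-method sketch: fit K-means for several values of k and plot the
# within-cluster sum of squares (inertia). The "elbow" where the curve
# flattens out suggests a reasonable number of clusters.
inertias = []
k_values = range(1, 10)
for k in k_values:
    inertias.append(KMeans(n_clusters=k).fit(data).inertia_)

plt.figure()
plt.plot(list(k_values), inertias, '-o')
plt.title('Elbow Method')
plt.xlabel('Number of clusters k')
plt.ylabel('Inertia')
plt.show()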

DBSCAN, on the other hand, uses an epsilon ball to group data points together. We draw an epsilon ball around a data point, and if there are at least a certain number of neighbours within the epsilon radius, we associate those neighbours with the data point. This is then repeated for each of the neighbours. All the data points that are reachable from one another (that is, connected through a chain of points, each within epsilon distance of the next) are grouped together into one cluster. Data points that don’t have sufficient neighbours within their epsilon ball are treated as outliers.

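To make the epsilon-ball idea concrete, the following small helper (my own sketch, not part of the article or the library) flags which points would qualify as core points; the full DBSCAN algorithm then expands clusters outward from these core points.

# Illustrative sketch of the epsilon-ball test at the heart of DBSCAN (not the
# full algorithm): count the neighbours within eps of each point. Points with
# at least min_samples neighbours (counting themselves, as scikit-learn does)
# are core points; the rest are border points or outliers.
def core_point_mask(points, eps=0.2, min_samples=5):
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    neighbour_counts = np.sum(dists <= eps, axis=1)
    return neighbour_counts >= min_samples
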
The following plot shows how DBSCAN groups the data points. With these parameters we still identify 4 clusters instead of 2, and the points near the cluster boundaries are left out as outliers.

from sklearn.cluster import DBSCAN
dbscan = DBSCAN(eps=0.2, min_samples=5)
labels = dbscan.fit_predict(data)
# Count the number of clusters (excluding outliers)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("Number of clusters from DBSCAN:", n_clusters)

# Plot the results
plt.figure()
plt.scatter(data[:, 0], data[:, 1], c=labels)
plt.title('DBSCAN Clustering')
plt.xlabel('X')
plt.ylabel('Y')
plt.show()

If we get the number of clusters right, there isn’t a huge difference between K-means and DBSCAN on the above dataset. But there are datasets where K-means simply cannot find the right clusters while DBSCAN works well. See, for example, the dataset plotted below.

ellipse1 = generate_ellipse(1000,2,1,np.pi/4,(3,2),0.1)
ellipse2 = generate_ellipse(1000,4,2,np.pi/4,(3,2),0.1)
data = np.concatenate((ellipse1, ellipse2))
plt.figure()
plt.scatter(data[:, 0], data[:, 1])
plt.title('Data')
plt.xlabel('X')
plt.ylabel('Y')
plt.show()

We now see the stark difference in performance between K-means and DBSCAN.

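A sketch along the following lines reproduces the comparison; the eps and min_samples values are my assumptions for this noise level rather than carefully tuned settings.

# Run both algorithms on the concentric ellipses and plot them side by side.
# With two clusters, K-means can only split the plane with a straight boundary,
# while DBSCAN can follow the ring structure.
kmeans = KMeans(n_clusters=2).fit(data)
labels_dbscan = DBSCAN(eps=0.2, min_samples=5).fit_predict(data)

fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(10, 4))
ax1.scatter(data[:, 0], data[:, 1], c=kmeans.labels_)
ax1.set_title('K-means Clustering')
ax2.scatter(data[:, 0], data[:, 1], c=labels_dbscan)
ax2.set_title('DBSCAN Clustering')
plt.show()
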
But even DBSCAN will fail if we increase the noise variance, as we see in the plots below.

ellipse1 = generate_ellipse(1000,2,1,np.pi/4,(3,2),0.25)
ellipse2 = generate_ellipse(1000,4,2,np.pi/4,(3,2),0.25)
data = np.concatenate((ellipse1, ellipse2))
plt.figure()
plt.scatter(data[:, 0], data[:, 1])
plt.title('Data')
plt.xlabel('X')
plt.ylabel('Y')
plt.show()
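
Running DBSCAN on this noisier data, again with assumed rather than tuned parameters, illustrates the failure: as the gap between the rings becomes comparable to the noise, the epsilon balls connect points across the gap and the two clusters merge (or fragment, depending on eps).

# DBSCAN on the noisier data; eps and min_samples are assumptions, not tuned values
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(data)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("Number of clusters from DBSCAN:", n_clusters)

plt.figure()
plt.scatter(data[:, 0], data[:, 1], c=labels)
plt.title('DBSCAN on Noisier Data')
plt.xlabel('X')
plt.ylabel('Y')
plt.show()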

Let us now focus on the one-dimensional case, where the data points lie along a line. We will use integer data because we are interested in data points that lie on a grid, and integers can always be scaled to the grid resolution. One key advantage of the one-dimensional case is that the data can be ordered by simply sorting it.

data = np.sort(np.random.choice(np.arange(1, 100), size=10, replace=False))
print(data)

An unorthodox way of clustering here is to treat the data points as the locations of non-zero values in a signal. We can then group data points together by “smoothing”. See the following code for an example: we create an impulse vector by assigning a 1 to every data point location, smooth it by convolving the impulse vector with a Gaussian pulse, and estimate the centroids by identifying the peaks in the resulting signal.

t = np.linspace(-1, 1, 11) 
i, q, e = signal.gausspulse(t, fc=1.5, retquad=True, retenv=True)

impvec = np.zeros(100)
impvec[data] = 1

clustered_signal = np.convolve(impvec, e)
# Keep the central 100 samples so the smoothed signal aligns with the original grid
clustered_signal = clustered_signal[5:105]
peak_locs, _ = signal.find_peaks(clustered_signal)
print("Estimated centroid locations: ", peak_locs)

fig, (ax1, ax2, ax3) = plt.subplots(nrows=1, ncols=3, figsize=(15, 3))

ax1.plot(e, '--*', color='red')
ax1.set_title('Gaussian Pulse')
ax2.stem(impvec)
ax2.set_title('Impulse vector')
ax3.plot(clustered_signal,'-*')
ax3.stem(peak_locs, clustered_signal[peak_locs], linefmt='r-', markerfmt='ro', basefmt='r-')
ax3.stem(data,impvec[data],linefmt='g-', markerfmt='g+', basefmt='g-')
ax3.set_title('Clustered Signal and Centroids')
plt.show()

Let us now apply DBSCAN on the same dataset and see what we get for the centroids.

# Perform DBSCAN on the data
dbscan = DBSCAN(eps=3, min_samples=2)
dbscan.fit_predict(data.reshape(-1, 1))

# Plot the results
labels = dbscan.labels_
unique_labels = set(labels)
#print(unique_labels)

# Count the number of clusters (excluding outliers)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("Number of DBSCAN clusters: ", n_clusters)

centroids_dbscan = []
for k in unique_labels:
    if k == -1:
        # For noise points, keep the original points themselves as centroids
        centroids_dbscan.extend(data[labels == k])
    else:
        class_members = (labels == k)
        this_centroid = np.mean(data[class_members])
        centroids_dbscan.append(this_centroid)
print("DBSCAN Centroids:", np.sort(centroids_dbscan))

Recall that the expectation maximization (EM) algorithm starts with a predefined number of random centroids and finds the data points that are closest to each centroid, before recomputing the centroids based on these associations. We now combine this EM-style update with the epsilon ball of DBSCAN to get an algorithm that starts with the first data point as the centroid; for every additional data point that is within the epsilon ball of the centroid, a new centroid is computed by averaging the current centroid with the new data point. If the next data point is outside the epsilon ball of the centroid, it initiates a new cluster and becomes the first centroid for that cluster. We only need a single pass over the data to find the number of clusters and their centroids. Note that this works for the one-dimensional case because the data can be ordered!

epsilon = 5
centroids_em = np.zeros(data.shape)
centroids_em[0] = data[0]
cluster_idx = 0
for k in range(1, len(data)):
    if abs(data[k] - centroids_em[cluster_idx]) < epsilon:
        # Data point is within the epsilon ball: average it into the current centroid
        centroids_em[cluster_idx] = 0.5 * (centroids_em[cluster_idx] + data[k])
    else:
        # Data point is too far away: start a new cluster with it as the centroid
        cluster_idx += 1
        centroids_em[cluster_idx] = data[k]

centroids_em = centroids_em[0:cluster_idx + 1]
print("EM Centroids:", centroids_em)

Finally, we compare the different ways of clustering the one-dimensional data. The following plot shows that the results from the different clustering algorithms closely match.

plt.figure()
plt.stem(data,4*impvec[data],linefmt='g-', markerfmt='g+', basefmt='g-',label='data')
plt.stem(peak_locs,clustered_signal[peak_locs],linefmt='r-', markerfmt='ro', basefmt='r-', label='smoothing')
plt.stem(centroids_dbscan,2*np.ones(len(centroids_dbscan)),linefmt='b-', markerfmt='bs', basefmt='b-', label='dbscan')
plt.stem(centroids_em,3*np.ones(len(centroids_em)),linefmt='k-', markerfmt='kx', basefmt='k-', label='EM')
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.show()
