Crash Course in Data Science: Clustering Techniques

0xdevshah · Published in AI Skunks · Mar 27, 2023 · 10 min read

Clustering is a technique in unsupervised machine learning where data points are grouped into subsets, or clusters, based on their similarity or proximity to each other.

Clustering has many applications in various fields such as data analysis, image processing, natural language processing, and bioinformatics, among others.

1. K-Means Clustering

K-Means Clustering is a popular clustering algorithm that partitions data points into a specified number of clusters based on their Euclidean distances to the cluster centers.

a. Parameters: The number of clusters to be formed.
b. Use Case: This algorithm is suitable for general-purpose clustering problems with roughly even cluster sizes and flat (convex) geometry, and it works best when the number of clusters is not too large.
c. Metric Used: Points are clustered based on their Euclidean distances.

Mathematical equation: argmin Σi ||xi - μ_c(i)||², where xi is a data point, μ_c(i) is the centroid of the cluster to which xi is assigned, and the sum runs over all data points in the dataset. K-means clustering aims to partition n data points into k clusters in such a way that the sum of the squared distances between each data point and its assigned cluster centroid is minimized.

Choosing the number of clusters in K-means requires careful consideration to ensure optimal clustering performance. The following technical steps can help in choosing the appropriate number of clusters:

  1. Elbow Method: This method involves running K-means for a range of cluster numbers and plotting the within-cluster sum of squares (WCSS) against the number of clusters. The WCSS decreases as the number of clusters increases, but the plot eventually flattens out, forming an “elbow,” beyond which adding more clusters does not significantly reduce the WCSS. The number of clusters at the elbow point is a good indication of the optimal number of clusters (a sketch of this and the silhouette method appears after this list).
  2. Silhouette Method: This approach is based on the Silhouette score, which measures the similarity of an observation to its cluster compared to other clusters. The Silhouette score ranges from -1 to 1, with a score closer to 1 indicating that the observation is well-matched to its cluster and poorly matched to other clusters. The Silhouette score can be computed for a range of cluster numbers, and the optimal number of clusters is the one with the highest average Silhouette score.
  3. Gap Statistic Method: This technique compares the WCSS of the K-means clusters to the expected WCSS under a null reference distribution, which is generated by randomizing the data. The optimal number of clusters is the one with the largest gap between the observed WCSS and the expected WCSS.
  4. Domain Knowledge: In some cases, the optimal number of clusters may be known based on the domain or problem context. For example, if the goal is to segment customers into high, medium, and low-value segments, then three clusters may be appropriate.
  5. Visual Inspection: Finally, it may be helpful to visualize the data using dimensionality reduction techniques like PCA or t-SNE, and observe if some natural groupings or clusters can be discerned visually. This approach can provide insights into the appropriate number of clusters.
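
Below is a minimal sketch of the elbow and silhouette approaches using scikit-learn; the candidate range of 2 to 10 clusters, the random toy data, and the variable names are arbitrary choices for illustration:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.rand(100, 2)
candidate_ks = range(2, 11)
wcss, sil = [], []
for k in candidate_ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)                     # within-cluster sum of squares (elbow method)
    sil.append(silhouette_score(X, km.labels_))  # average silhouette score
best_k = list(candidate_ks)[int(np.argmax(sil))]

Plotting wcss against the candidate cluster counts gives the elbow curve, while best_k is simply the candidate with the highest average silhouette score.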

In the following example, we generate a random dataset with 100 data points and 2 features. We then initialize the K-Means model with n_clusters=3, which specifies the number of clusters to form. Finally, we fit the model to the data and get the cluster labels for the resulting clustering.

from sklearn.cluster import KMeans
import numpy as np

# Random toy dataset: 100 points with 2 features
X = np.random.rand(100, 2)
# Initialize K-Means to form 3 clusters and fit it to the data
kmeans = KMeans(n_clusters=3)
kmeans.fit(X)
# Cluster assignment (0, 1, or 2) for each data point
labels = kmeans.labels_
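
To tie this back to the objective above, the minimized WCSS is exposed on the fitted model as inertia_, and (continuing the snippet) it can be reproduced by hand from the centroids and labels:

# Sum of squared distances from each point to its assigned centroid
centers = kmeans.cluster_centers_
wcss = np.sum((X - centers[labels]) ** 2)
assert np.isclose(wcss, kmeans.inertia_)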

Drawbacks:

  • It requires a predetermined number of clusters.
  • It is sensitive to the initial choice of centroids.
  • It is sensitive to outliers and noisy data.

Here are the best applications and use case scenarios for using KMeans:

  • Customer Segmentation: K-means clustering can be used to segment customers based on their purchasing behavior or demographic data. This can help businesses to tailor their marketing strategies and improve customer engagement.
  • Image Segmentation: K-means clustering can be used to segment images into regions based on color or texture features. This is useful in various computer vision applications such as object detection and tracking.
  • Anomaly Detection: K-means clustering can be used to identify anomalies or outliers in data sets. This is useful in fraud detection and other applications where unusual data points need to be identified.

Overall, KMeans can be a useful tool for exploratory data analysis, pattern recognition, and clustering in a wide range of applications and industries.

2. Affinity Propagation

Affinity propagation is a clustering algorithm that does not require the number of clusters to be pre-specified. It works by iteratively exchanging messages (responsibilities and availabilities) between data points, based on a fixed similarity matrix, and then using the accumulated messages to identify clusters around exemplars, i.e. representative data points.

a. Parameters: The damping factor and sample preference are the two parameters.
b. Use Case: It is well suited for situations with many clusters, uneven cluster size, and non-flat geometry.
c. Metric Used: The similarity between points is determined based on graph distances.

Mathematical equation: s(i, k) = -||xi - xk||² for i ≠ k, where s(i, k) is the similarity between data points xi and xk. The self-similarity (preference) s(k, k) is typically set to the median of the pairwise similarities, and a damping factor λ is applied to the responsibility and availability message updates to stabilize convergence.

from sklearn.cluster import AffinityPropagation
from sklearn.metrics import pairwise_distances
import numpy as np

# Random toy dataset: 100 points with 2 features
X = np.random.rand(100, 2)
damping = 0.5
# Similarities are negative squared Euclidean distances;
# the preference is set to their median (the scikit-learn default when unset)
similarities = -pairwise_distances(X, metric='sqeuclidean')
preference = np.median(similarities)
aff = AffinityPropagation(damping=damping, preference=preference)
aff.fit(X)
# Cluster assignment for each data point
labels = aff.labels_

In this example, we generate a random dataset with 100 data points and 2 features. We initialize the Affinity Propagation model with damping=0.5, which controls the update rate of the message-passing algorithm, and set preference to the median of the pairwise similarities (negative squared Euclidean distances), which is also what scikit-learn uses when no preference is given. Finally, we fit the model to the data and get the cluster labels for the resulting clustering.
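
For intuition about the message passing itself, here is a simplified NumPy sketch of the responsibility and availability updates; it is an illustrative re-implementation of the idea under the definitions above, not scikit-learn's internals, and the function name and fixed iteration count are my own choices:

import numpy as np

def affinity_propagation_sketch(X, damping=0.5, n_iter=200):
    # Similarity matrix: negative squared Euclidean distances,
    # with the preference (diagonal) set to the median similarity
    S = -np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    n = len(X)
    np.fill_diagonal(S, np.median(S))
    R = np.zeros((n, n))  # responsibilities r(i, k)
    A = np.zeros((n, n))  # availabilities a(i, k)

    for _ in range(n_iter):
        # r(i, k) = s(i, k) - max over k' != k of [a(i, k') + s(i, k')]
        AS = A + S
        idx = np.argmax(AS, axis=1)
        first_max = AS[np.arange(n), idx]
        AS[np.arange(n), idx] = -np.inf
        second_max = AS.max(axis=1)
        R_new = S - first_max[:, None]
        R_new[np.arange(n), idx] = S[np.arange(n), idx] - second_max
        R = damping * R + (1 - damping) * R_new  # damped update

        # a(i, k) = min(0, r(k, k) + sum over i' not in {i, k} of max(0, r(i', k)))
        Rp = np.maximum(R, 0)
        np.fill_diagonal(Rp, np.diag(R))
        A_new = Rp.sum(axis=0)[None, :] - Rp
        diag_A = np.diag(A_new).copy()
        A_new = np.minimum(A_new, 0)
        np.fill_diagonal(A_new, diag_A)
        A = damping * A + (1 - damping) * A_new  # damped update

    # Exemplars are points k where r(k, k) + a(k, k) > 0
    exemplars = np.flatnonzero(np.diag(R) + np.diag(A) > 0)
    labels = np.argmax(S[:, exemplars], axis=1)
    labels[exemplars] = np.arange(len(exemplars))  # each exemplar labels itself
    return exemplars, labels

Each point ends up assigned to the exemplar it is most similar to, which mirrors what the scikit-learn model above exposes through labels_ and cluster_centers_indices_.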

Drawbacks:

  • It can be computationally expensive for large datasets.
  • It may converge to suboptimal solutions, depending on the initial choice of exemplars.
  • It requires tuning of a preference parameter.

Here are the best applications and use case scenarios for using Affinity Propagation:

  • Gene Clustering: Affinity propagation can be used to cluster genes based on their expression profiles. This can help in understanding the function of genes and in identifying potential drug targets.
  • Image Segmentation: Affinity propagation can be used to segment images based on the similarity of their features. This is useful in various computer vision applications such as object recognition and tracking.
  • Social Network Analysis: Affinity propagation can be used to identify communities or groups of nodes in social networks based on their interactions and connections. This can help in understanding the structure and dynamics of social networks.

3. Spectral Clustering

Spectral Clustering is a powerful clustering algorithm that operates on a similarity graph of the data. The algorithm works by first constructing an affinity matrix that captures the pairwise similarity between data points, and then computing the eigenvalues and eigenvectors of this matrix to obtain a lower-dimensional representation of the data. The resulting eigenvectors are then used to cluster the data using a standard clustering algorithm like K-Means or hierarchical clustering.

a. Parameters: The number of clusters is the only parameter.
b. Use Case: It is most effective in cases with a limited number of clusters, even cluster size, and non-flat geometry.
c. Metric Used: Graph distances are used to measure the similarity between points.

Mathematical equation: L = I - D^(-1/2) * W * D^(-1/2), where W is the similarity (affinity) matrix, D is the diagonal matrix of row sums of W, I is the identity matrix, and L is the symmetric normalized graph Laplacian. Spectral clustering is a graph-based clustering algorithm that partitions the data using the eigenvectors of the graph Laplacian. The Laplacian is a matrix that describes the connectivity of data points in a graph, and its spectral properties (eigenvalues and eigenvectors) are used to identify clusters.

from sklearn.cluster import SpectralClustering
import numpy as np

# Random toy dataset: 100 points with 2 features
X = np.random.rand(100, 2)
# RBF (Gaussian) affinity: exp(-gamma * squared Euclidean distance), with gamma = 10
affinity_matrix = np.exp(-10 * np.sum((X[:, np.newaxis, :] - X[np.newaxis, :, :])**2, axis=-1))
n_clusters = 3
# 'precomputed' tells scikit-learn to use the affinity matrix built above
sc = SpectralClustering(n_clusters=n_clusters, affinity='precomputed')
sc.fit(affinity_matrix)
# Cluster assignment for each data point
labels = sc.labels_

In this example, we generate a random dataset with 100 data points and 2 features. We construct an affinity matrix by applying a Gaussian (RBF) kernel with gamma = 10 to the squared Euclidean distances between data points. We then initialize the Spectral Clustering model with n_clusters=3 and affinity='precomputed', indicating that we are using a precomputed affinity matrix. Finally, we fit the model to the data and get the cluster labels for the resulting clustering.
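
To make the eigenvector step described above concrete, here is a simplified NumPy sketch of the full pipeline (affinity matrix, normalized Laplacian, spectral embedding, then K-Means on the embedding); it illustrates the idea rather than reproducing scikit-learn's exact implementation, and the function name and row-normalization step are my own choices:

import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering_sketch(X, n_clusters=3, gamma=10.0):
    # RBF affinity matrix W
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-gamma * sq_dists)
    # Symmetric normalized Laplacian: L = I - D^(-1/2) W D^(-1/2)
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.eye(len(X)) - D_inv_sqrt @ W @ D_inv_sqrt
    # Eigenvectors of the smallest eigenvalues form the spectral embedding
    eigvals, eigvecs = np.linalg.eigh(L)
    embedding = eigvecs[:, :n_clusters]
    # Normalize rows and cluster the low-dimensional embedding with K-Means
    embedding = embedding / np.linalg.norm(embedding, axis=1, keepdims=True)
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embedding)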

Drawbacks:

  • It requires the computation of a similarity matrix, which can be computationally expensive for large datasets.
  • It may be sensitive to the choice of kernel or similarity measure.
  • It may not perform well when the clusters have non-convex shapes.

Here are the best applications and use case scenarios for using Spectral Clustering:

  • Image Segmentation: Spectral clustering can be used to segment images based on their spectral properties. This is useful in various computer vision applications such as object detection and tracking.
  • Document Clustering: Spectral clustering can be used to group similar documents together based on their spectral properties. This is useful in various applications such as text classification and information retrieval.
  • Bioinformatics: Spectral clustering can be used to cluster genes or proteins based on their spectral properties. This can help in identifying gene expression patterns and in understanding various biological processes.

4. Agglomerative Clustering

Agglomerative Clustering is a hierarchical clustering algorithm that builds a tree-like structure of clusters by recursively merging the most similar clusters according to a distance metric. The algorithm starts with each data point as its own cluster and iteratively merges the closest pair of clusters until a stopping criterion is met. The result is a dendrogram that shows the hierarchy of the clusters.

a. Parameters: Number of clusters or a distance threshold, linkage type, and distance metric.
b. Use Case: This algorithm is ideal for problems with many clusters, connectivity constraints, and non-Euclidean distances.
c. Metric Used: Any pairwise distance metric can be used to measure similarity between points.

Mathematical equation (single linkage): d(A, B) = min dist(a, b) over a in A and b in B, where A and B are two clusters and dist is a pairwise distance metric. Agglomerative clustering is a hierarchical clustering algorithm that works by iteratively merging the closest pairs of clusters based on a linkage criterion. The linkage criterion defines the distance between two clusters and can be built on various distance metrics such as Euclidean distance or correlation.

from sklearn.cluster import AgglomerativeClustering
import numpy as np

# Random toy dataset: 100 points with 2 features
X = np.random.rand(100, 2)
n_clusters = 3
# Ward linkage merges the pair of clusters giving the smallest increase in within-cluster variance
agg = AgglomerativeClustering(n_clusters=n_clusters, linkage='ward')
agg.fit(X)
# Cluster assignment for each data point
labels = agg.labels_

In this example, we generate a random dataset with 100 data points and 2 features. We then initialize the Agglomerative Clustering model with n_clusters=3 and linkage=’ward’, which specifies the linkage criterion to be the Ward variance minimization algorithm. Finally, we fit the model to the data and get the cluster labels for the resulting clustering.

The AgglomerativeClustering object in scikit-learn applies a hierarchical clustering technique in which each observation starts as a separate cluster, and clusters are merged gradually based on a chosen linkage criterion that determines the merging strategy.

The available linkage criteria are:

  • Ward linkage: This approach aims to minimize the sum of squared differences within all clusters, and it is similar to the objective function of k-means clustering but with a hierarchical agglomerative approach.
  • Maximum or complete linkage: This method minimizes the maximum distance between observations of pairs of clusters.
  • Average linkage: This method minimizes the average of the distances between all observations of pairs of clusters.
  • Single linkage: This approach aims to minimize the distance between the closest observations of pairs of clusters.
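
The dendrogram mentioned above can be built and inspected directly with SciPy's hierarchical clustering utilities; the sketch below uses Ward linkage on the same kind of toy data (plotting assumes matplotlib is available):

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.random.rand(100, 2)
# Build the full merge tree with Ward linkage
Z = linkage(X, method='ward')
# Plot the dendrogram showing the hierarchy of merges
dendrogram(Z)
plt.show()
# Cut the tree into 3 flat clusters
labels = fcluster(Z, t=3, criterion='maxclust')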

Drawbacks:

  • It can be computationally expensive for large datasets.
  • It may not perform well when the clusters have non-convex shapes.
  • It may suffer from the “chaining” problem, especially with single linkage, where clusters are merged through chains of nearby points rather than on the basis of their overall internal structure.

Here are the best applications and use case scenarios for using Agglomerative Clustering:

  • Molecular Biology: Agglomerative clustering can be used to cluster molecular structures based on their similarity. This can help in understanding the function of molecules and in identifying potential drug targets.
  • Image Segmentation: Agglomerative clustering can be used to segment images based on the similarity of their features. This is useful in various computer vision applications such as object recognition and tracking.
  • Ecology: Agglomerative clustering can be used to cluster species based on their ecological niches. This can help in understanding the distribution of species and in identifying areas of high biodiversity.

5. DBSCAN

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a clustering algorithm that groups together points that are close to each other based on a density criterion. The algorithm requires two hyperparameters: eps, which specifies the maximum distance between points in the same neighborhood, and min_samples, which specifies the minimum number of points required to form a dense region.

a. Parameters: The neighborhood radius (eps) and the minimum number of samples (min_samples).
b. Use Case: This algorithm is most suitable for non-flat geometry, uneven cluster sizes, and outlier removal.
c. Metric Used: The similarity between points is determined by distances between nearest points.

Mathematical equation: density(x) = |{y: dist(x, y) ≤ ε}|, where ε is a distance threshold and dist is a distance metric. DBSCAN is a density-based clustering algorithm that identifies clusters as areas of high density separated by areas of low density. It works by fixing the radius ε and a minimum number of points min_samples, then labeling points whose ε-neighborhood contains at least min_samples points as core points; points within ε of a core point but without enough neighbors of their own are border points, and all remaining points are treated as noise.

from sklearn.cluster import DBSCAN
import numpy as np

# Random toy dataset: 100 points with 2 features
X = np.random.rand(100, 2)
# eps: neighborhood radius; min_samples: points required to form a dense (core) region
dbscan = DBSCAN(eps=0.3, min_samples=5)
dbscan.fit(X)
# Cluster assignment for each data point; noise points are labeled -1
labels = dbscan.labels_

In this example, we generate a random dataset with 100 data points and 2 features. We then initialize the DBSCAN model with eps=0.3 and min_samples=5. Finally, we fit the model to the data and get the cluster labels for the resulting clustering.
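
Because DBSCAN labels noise points with -1 rather than forcing them into a cluster, the number of clusters and outliers can be read directly off the labels (continuing the snippet above):

# Points labeled -1 are noise/outliers; all other labels are cluster indices
n_clusters_found = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(f"clusters found: {n_clusters_found}, noise points: {n_noise}")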

Drawbacks:

  • It requires the tuning of two parameters: epsilon and min_samples.
  • It may not perform well when the clusters have varying densities or sizes.
  • It may not be able to cluster data with complex shapes or overlapping clusters.

Here are the best applications and use case scenarios for using DBSCAN:

  • Anomaly Detection: DBSCAN can be used to identify anomalies or outliers in data sets. This is useful in fraud detection and other applications where unusual data points need to be identified.
  • Customer Segmentation: DBSCAN can be used to segment customers based on their purchasing behavior or demographic data. This can help businesses to tailor their marketing strategies and improve customer engagement.
  • Image Segmentation: DBSCAN can be used to segment images based on the similarity of their features. This is useful in various computer vision applications such as object detection and tracking.
