Fundamental Tasks of AI — Part 2 — Clustering (1 of 2)
In the previous article, we looked at how AI performs the fundamental tasks of Classification and Regression and their applicability in real life. In this article, let us look at Clustering, the unsupervised learning pioneer, which actually pre-dates many supervised learning techniques, having emerged in the early days of ML when labelled data was scarce.
Clustering’s early adoption, versatility, simplicity, efficiency, and role as a foundation for other techniques have solidified its position as a classic in the ML world.
Clustering
Clustering is the task of dividing data points into a number of groups such that data points in the same group are more similar to each other than to data points in other groups.
Clustering in Artificial Intelligence is a powerful unsupervised learning technique, as it determines the intrinsic grouping of unlabelled data. Clustering is very useful for exploring unknown datasets, revealing hidden relationships within the data, and reducing huge, complex datasets to easily digestible clusters. It is suitable for various tasks including customer segmentation, image compression, text mining, gene expression analysis, document clustering (for efficient information retrieval), object recognition (and automatic annotation), anomaly detection, fraud detection and phylogenetics.
Clustering using ML/DL has the following advantages:
· Uncovers non-linear relationships: AI can identify complex, non-linear relationships between features, leading to more accurate and nuanced clustering solutions.
· Discovers hidden patterns: AI can analyze complex, high-dimensional data, revealing subtle relationships and clusters that might elude human intuition.
· Identifies outliers and anomalies: AI can automatically flag unusual data points that deviate from the clusters, potentially indicating fraud, errors, or interesting phenomena.
· Reduced bias: AI algorithms rely solely on objective data patterns, minimizing bias (compared to humans’ unconscious biases).
· Diverse applications without needing domain expertise: A single clustering algorithm can be applied to a wide range of problems — customer segmentation to gene expression analysis — with minimal adaptation.
In the context of clustering, “non-globular data” refers to data points that don’t neatly form compact, spherical clusters. In real life, clusters can take various shapes and forms (elongated, spiral, clusters with holes, clusters that look like a cloud or a splash of paint), making them more challenging to identify. That is why exploring and experimenting with different clustering algorithms is needed for real-life implementations.
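To make the point about cluster shapes concrete, here is a minimal sketch (assuming scikit-learn and matplotlib are installed; the dataset sizes and noise levels are arbitrary choices) that generates globular and non-globular toy datasets. The same datasets can be reused to experiment with the algorithms discussed in the rest of this article.

```python
# Generate toy datasets with different cluster shapes: globular blobs,
# elongated "moons" and concentric rings (non-globular).
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs, make_circles, make_moons

datasets = {
    "globular blobs": make_blobs(n_samples=300, centers=3, random_state=42),
    "elongated moons": make_moons(n_samples=300, noise=0.05, random_state=42),
    "concentric rings": make_circles(n_samples=300, noise=0.05, factor=0.5, random_state=42),
}

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, (name, (X, y)) in zip(axes, datasets.items()):
    ax.scatter(X[:, 0], X[:, 1], c=y, s=10)   # colour by the generating group
    ax.set_title(name)
plt.show()
```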
Centroid-based (Partitioning) Clustering
Centroid-based clustering algorithms are widely used as they are simple, efficient and can handle large datasets. They divide data points into a predefined number (k) of non-overlapping clusters based on the proximity of data points to cluster centers (centroids).
There are various centroid-based clustering algorithms including:
1. K-Means clustering: Intuitive centroid-based algorithm that divides the data into k clusters by minimizing the total distance between data points and their closest cluster center. Every data point searches for its closest cluster leader (centroid), and these leaders keep moving around based on their followers, until everyone settles into a comfortable neighbourhood.
K-means produces compact and spherical clusters, but it requires specifying the number of clusters beforehand and is sensitive to outliers and to the initial placement of the centroids.
2. K-Medoids clustering: Similar to K-means but uses actual data points (medoids) as cluster centers, making it more robust to outliers and better at handling non-spherical clusters.
A centroid is a calculated point (that may not be an actual data point) within a cluster, typically the average of all its data points. A medoid (like the person standing in the middle of a group photo) is an actual data point within the cluster, chosen as the representative point that minimizes the total distance between itself and all other points in the cluster.
Three well-known K-Medoids algorithms are PAM (Partitioning Around Medoids), the most powerful but complex and time-consuming; CLARA (Clustering Large Applications), which clusters samples of the data to scale to large datasets; and CLARANS (Clustering Large Applications based on RANdomized Search), which uses randomized search to improve scalability further.
3. K-Means++ clustering: K-means randomly chooses its initial cluster centers (centroids), which can lead to poor clustering, especially when there are outliers or clusters of uneven density. K-means++ tackles this by probabilistically selecting each new centroid based on its distance from the centroids already chosen. By picking diverse, well-spread centroids, K-Means++ provides better starting points, addressing the problems of initialization bias, suboptimal solutions, slow convergence, and difficulty with outliers or unevenly distributed data.
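As a rough illustration of the ideas above, here is a minimal sketch (assuming scikit-learn; the toy data and k=3 are arbitrary) that runs K-Means with k-means++ initialization and then computes each cluster’s medoid by hand to contrast it with the centroid. It only sketches the centroid-vs-medoid distinction and is not a full K-Medoids (PAM) implementation.

```python
# K-Means with k-means++ initialization, plus a hand-rolled medoid per cluster.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# init="k-means++" spreads the initial centroids out (it is also the default).
kmeans = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

for k in range(3):
    members = X[labels == k]
    centroid = kmeans.cluster_centers_[k]         # mean point, may not be a real sample
    dists = pairwise_distances(members)           # pairwise distances within the cluster
    medoid = members[dists.sum(axis=1).argmin()]  # actual sample minimizing total distance
    print(f"cluster {k}: centroid={centroid.round(2)}, medoid={medoid.round(2)}")
```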
Hierarchical Clustering
Hierarchical clustering builds a “tree-like” hierarchy of clusters, where each node represents a cluster and its children represent subclusters. This hierarchical structure allows the analysis of data from different perspectives — zoom out to see the big picture, or zoom in to specific levels, exploring subclusters and finer details within them.
Popular hierarchical clustering algorithms include:
1. Agglomerative (Bottom-up) clustering: Starts with each data point as its own cluster and iteratively merges the “closest” pair of clusters until the desired number of clusters is reached or a stopping criterion is met.
While agglomerative clustering offers a beautiful way to visualize the data structure through a dendrogram, choosing the appropriate linkage criterion is crucial for making sense of those clusters and ensuring they accurately reflect the underlying data relationships (a short sketch comparing linkage criteria appears at the end of this section).
a) Single linkage defines the distance between two clusters as the distance between their closest pair of data points (one from each cluster) and merges the pair of clusters with the smallest such distance. This can lead to elongated, “chain-like” clusters where outliers or points close to the merging edge heavily influence the cluster shape. Single linkage is computationally efficient, especially for large datasets, and is considered “non-globular data friendly” as it excels at identifying elongated, chain-like clusters or those with irregular shapes.
b) Complete linkage defines the distance between two clusters as the distance between their farthest pair of data points and merges the pair of clusters with the smallest such distance. This tends to produce compact, spherical clusters and is less sensitive to outliers in the data, but it might miss broader relationships between clusters that differ only in a few distant points. It shines on clean, globular clusters but has mixed results on anything else.
c) Average linkage balances both perspectives, taking the average distance between all pairs of points across the two clusters. This provides a more “global” view but can still be affected by outliers or sparse data. It performs well across a wider range of data types compared to single or complete linkage, handling globular clusters, some non-globular shapes, and even moderate noise levels without being overly sensitive.
d) Ward’s method aims to minimize the total variance within the merged clusters at each step of the clustering process. It measures how spread out the points are within each cluster (typically using the sum of squared distances from each point to the cluster center) and merges the pair of clusters whose merger produces the smallest increase in total within-cluster variance. It seeks to create clusters that are as tight and cohesive as possible, while also maintaining a balance between clusters of different sizes. It gives a quantitative measure of cluster compactness, aiding interpretability, but it is computationally expensive.
2. Divisive (Top-down) clustering: Starts with the entire dataset as a single cluster and recursively splits clusters into sub-clusters until the desired number of clusters is reached. This approach is commonly referred to as DIANA (DIvisive ANAlysis clustering).
DIANA is computationally efficient, flexible (with different splitting criteria), handles non-globular shapes, and is less sensitive to outliers, making it a valuable tool for exploratory data analysis. However, the dendrograms generated from DIANA are less intuitive and harder to visualize, and the initial splits can significantly influence the final cluster structure, potentially missing broader relationships.
a) DIANA with Maximum Diameter (DIANA-MaxDiam): The most common variant, which splits the cluster with the largest diameter at each step. It leads to well-separated, tight clusters, ideal for data with clear boundaries and globular shapes.
b) DIANA with Minimum Variance (DIANA-MinVar): This variant focuses on minimizing the variance within each subcluster. This results in internally consistent clusters where data points within each group are as similar as possible, suitable for data with diverse features or non-globular shapes.
c) Adaptive DIANA (ADIANA): This algorithm adds a layer of flexibility by allowing the splitting criterion to dynamically adapt based on the current cluster structure. It can switch between maximizing diameter and minimizing variance depending on the local characteristics of the data, potentially leading to more balanced and natural cluster formations.
d) Feature-Weighted DIANA (FW-DIANA): This aims at emphasizing specific aspects of the data that are crucial for defining the clusters, leading to more relevant and interpretable results. This approach incorporates domain knowledge by assigning weights to different features during the splitting process.
3. Hybrid clustering: Algorithms like CLARA (Clustering Large Applications) and ROCK (RObust Clustering using linKs) combine elements of both agglomerative and divisive approaches, allowing for more flexibility and customization.
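Here is the promised sketch (assuming scikit-learn, SciPy and matplotlib; the moon-shaped dataset and the choice of two clusters are arbitrary) comparing linkage criteria and drawing a dendrogram. Divisive algorithms such as DIANA are not shipped with scikit-learn, so they are not shown here.

```python
# Agglomerative clustering with different linkage criteria, plus a dendrogram.
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=200, noise=0.05, random_state=42)

# Single linkage tends to follow the elongated "moon" shapes, while
# complete/ward linkage prefer compact, globular groupings.
for link in ("single", "complete", "average", "ward"):
    labels = AgglomerativeClustering(n_clusters=2, linkage=link).fit_predict(X)
    print(link, "->", labels[:10])

# Dendrogram of the Ward hierarchy: zoom out for the big picture,
# cut lower to explore finer subclusters.
Z = linkage(X, method="ward")
dendrogram(Z, truncate_mode="level", p=5)
plt.title("Ward linkage dendrogram (truncated)")
plt.show()
```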
Density-based Clustering
Density-based clustering is a powerful technique that groups data points based on their local density (rather than their proximity to a cluster center). Points are grouped based on how close they are to their neighbours and how many neighbours they have. As the algorithm automatically identifies clusters based on data density, there is no need to guess the number of clusters beforehand.
For each point, the number of neighbours is counted based on the specified distance threshold (called epsilon radius). Points with a high local density (many neighbours) are considered core points, which form the core of clusters. Points that have very few neighbours are considered noise, and excluded from clusters, making it useful for anomaly detection.
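Before looking at the specific algorithms, here is a toy sketch (NumPy only, with made-up eps and minPts values) of the neighbour-counting idea described above: points with many neighbours inside the epsilon radius are treated as core points, while sparse points become candidate noise.

```python
# Count each point's neighbours within an epsilon radius and split the data
# into "core" (dense) points and candidate noise (sparse) points.
import numpy as np

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.3, size=(50, 2)),    # a dense blob
               rng.uniform(-3, 3, size=(10, 2))])   # scattered sparse points

eps, min_pts = 0.5, 5
dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
neighbour_counts = (dists <= eps).sum(axis=1) - 1               # exclude the point itself

is_core = neighbour_counts >= min_pts
print("core points:", is_core.sum(), "| sparse (candidate noise):", (~is_core).sum())
```

Algorithms such as DBSCAN build on exactly this bookkeeping, additionally tracking border points and connecting core points into clusters.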
Popular density-based clustering algorithms include:
1. DBSCAN (Density-Based Spatial Clustering of Applications with Noise): Most well-known and widely used. It identifies clusters based on eps (epsilon radius) and minPts (minimum points — number of neighbours a data point must have to be considered a core point). Points that have fewer than minPts neighbours but are neighbours of core points are considered border points. Remaining points are considered noise and excluded from clusters.
2. DBSCAN++ (An Improved DBSCAN): DBSCAN++ is specifically designed to handle noisy datasets more effectively. DBSCAN++ considers both the number of neighbours and their distances, giving closer neighbours more influence. It checks if adding a neighbour significantly increases the overall cluster density, and if not, it is considered as noise. It also tries to automatically adjust the eps value based on the local data density.
3. HDBSCAN (Hierarchical DBSCAN): HDBSCAN uses a hierarchical approach, by performing DBSCAN clustering with multiple eps values, creating a dendrogram that captures cluster information at various granularities.
HDBSCAN has several parameters that can be adjusted, such as min_cluster_size, min_samples, cluster_selection_epsilon, max_cluster_size, metric, alpha, algorithm, leaf_size, n_jobs, cluster_selection_method, allow_single_cluster, store_centers, and copy. These parameters allow you to extract clusters of different sizes and densities, catering to data with diverse cluster structures.
HDBSCAN is implemented in Python in the scikit-learn library (from version 1.3 onwards) and is also available as a standalone package (hdbscan · PyPI).
4. OPTICS (Ordering Points To Identify the Clustering Structure): It aims to reveal the underlying structure of clusters by analyzing the reachability distance of each data point. Reachability distance represents the minimum distance from the point to a higher-density region, signifying how easily it can be reached from a core point. The algorithm maintains a priority queue in which points are sorted in ascending order of their reachability distance, and iteratively processes the queue until all points have been processed.
The ordering process reveals valuable insights into the data’s cluster structure.
- Density peaks: Points with the lowest reachability distances are considered density peaks, representing the cores.
- Valleys: Points with high reachability distances are considered valleys, often lying between clusters.
- Gradients: The change in reachability distance across the ordering reflects the density gradient around each point, helping us understand how clusters transition into each other.
OPTICS is a fascinating algorithm: unlike DBSCAN, it reveals the full spectrum of cluster structures, including dense cores, sparse edges, and transitional areas. The cluster ordering can be used to create informative visualizations that highlight the density landscape and the relationships between clusters. Points with significantly high reachability distances can be flagged as potential outliers or anomalies.
However, choosing specific thresholds or segmentation methods for extracting clusters involves some judgment. OPTICS is implemented in Python using the scikit-learn library (a short sketch running DBSCAN, HDBSCAN and OPTICS on the same dataset appears at the end of this section).
5. DENCLUE (DENsity-based CLUstEring): It takes a unique approach by combining density estimation with outlier detection. Instead of simply counting neighbours, DENCLUE uses kernel density estimation to calculate a smooth, continuous density function over the data. This function reflects the local density around each point, considering not just the number of neighbours but also their distances. The “influence function” assigns higher weights to closer neighbours, resulting in a more accurate representation of how each point influences the surrounding density. The density function is then analyzed to identify its local maxima (peaks) and minima (valleys). Peaks correspond to attractors, forming the cluster cores; minima represent rejecters, considered outliers; and points with intermediate density, connecting attractors and rejecters, are designated as bridges.
DENCLUE’s focus on outlier detection makes it most suitable for data mining and anomaly detection.
Unlike the other algorithms in this section, DENCLUE is not part of scikit-learn, but standalone Python implementations of its density-attractor approach are available.
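To tie the section together, here is a hedged sketch (assuming scikit-learn 1.3 or later, which ships HDBSCAN; the eps, min_samples and min_cluster_size values below are arbitrary) that runs DBSCAN, HDBSCAN and OPTICS on the same non-globular dataset. DBSCAN++ and DENCLUE are omitted, as they are not available in scikit-learn.

```python
# Run three density-based algorithms on the same moon-shaped dataset and
# report how many clusters and noise points each one finds.
import numpy as np
from sklearn.cluster import DBSCAN, HDBSCAN, OPTICS  # HDBSCAN needs scikit-learn >= 1.3
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=400, noise=0.08, random_state=42)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)
hdb = HDBSCAN(min_cluster_size=15).fit(X)
opt = OPTICS(min_samples=10).fit(X)

for name, labels in [("DBSCAN", db.labels_), ("HDBSCAN", hdb.labels_), ("OPTICS", opt.labels_)]:
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # label -1 marks noise
    print(f"{name}: {n_clusters} clusters, {np.sum(labels == -1)} noise points")

# OPTICS also exposes the reachability ordering behind its density landscape.
print("reachability (first 5 points in cluster order):",
      np.round(opt.reachability_[opt.ordering_][:5], 3))
```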
Conclusion
The following picture (which itself shows only a subset of the clustering algorithm landscape) explains why this article turned out to be so long… The next article will cover the remaining algorithm approaches for clustering!