ML Part 5: Clustering

Avicsebooks

Clustering is a type of unsupervised machine learning technique used for grouping similar objects or data points into clusters. The goal of clustering is to partition a dataset into subsets, or clusters, such that data points within the same cluster are more similar to each other than to those in other clusters, based on certain criteria or features.

Here are some key points about clustering:

  1. Unsupervised Learning: Clustering is an unsupervised learning technique, meaning that it does not require labeled data for training. Instead, it relies on the intrinsic structure of the data to identify patterns and groupings.
  2. Objective: The main objective of clustering is to find natural groupings or clusters in the data without prior knowledge of the class labels. Clustering algorithms seek to maximize the intra-cluster similarity and minimize the inter-cluster similarity.
  3. Types of Clustering Algorithms: There are various clustering algorithms, each with its own approach and characteristics. Some common clustering algorithms include K-means clustering, hierarchical clustering, density-based clustering (e.g., DBSCAN), and Gaussian mixture models (GMM).
  4. Cluster Representation: Each cluster typically has a representative point or centroid that summarizes the characteristics of the cluster. The choice of representation depends on the clustering algorithm used.
  5. Evaluation: Clustering algorithms may require evaluation to assess their effectiveness in partitioning the data into meaningful clusters. Evaluation metrics such as silhouette score, Davies–Bouldin index, or within-cluster sum of squares (WCSS) can be used to measure the quality of clustering.
  6. Applications: Clustering finds applications in various fields, including customer segmentation, market analysis, image segmentation, document clustering, anomaly detection, and recommendation systems.
  7. Challenges: Clustering can be challenging due to factors such as the curse of dimensionality, noisy or ambiguous data, and the subjective nature of defining similarity or distance metrics.

Clustering for segmentation

Clustering for segmentation involves using clustering algorithms to partition a dataset into distinct groups or segments based on similarities among data points. Segmentation aims to identify homogeneous subgroups within the data, allowing for more targeted analysis or actions.

Let’s illustrate clustering for segmentation with an example:

Example: Customer Segmentation for Marketing

Imagine you work for a retail company that wants to segment its customers for targeted marketing campaigns. You have a dataset containing information about customers, such as their age, income, spending habits, and purchase history. Your goal is to divide customers into distinct segments based on their similarities, so that marketing strategies can be tailored to each segment’s preferences and behaviours.

Here’s how clustering can be applied to achieve customer segmentation:

  1. Data Preparation: First, you preprocess and prepare the customer data, ensuring that it is cleaned, normalized, and relevant features are selected.
  2. Clustering Algorithm Selection: Next, you choose an appropriate clustering algorithm. For this example, you might choose K-means clustering or hierarchical clustering, which are commonly used for customer segmentation tasks.
  3. Feature Selection and Scaling: You may need to select the features that best represent customer behaviour and scale them appropriately to ensure that no single feature dominates the clustering process.
  4. Clustering: Apply the chosen clustering algorithm to the customer data. The algorithm will partition the customers into clusters based on similarities in their feature values. Each cluster represents a segment of customers who share similar characteristics.
  5. Interpretation: Analyse the resulting clusters to understand the characteristics of each segment. You may use visualization techniques to explore the clusters and identify meaningful patterns.
  6. Segment Profiling: Profile each segment by examining its demographic and behavioural traits. This helps in understanding the unique characteristics of each segment and devising targeted marketing strategies.
  7. Marketing Strategy Development: Based on the segment profiles, develop tailored marketing strategies for each segment. For example, you might design promotions or campaigns that appeal to the preferences and needs of specific customer segments.
  8. Evaluation and Refinement: Evaluate the effectiveness of the segmentation and marketing strategies over time. Refine the segmentation approach as needed based on feedback and changes in customer behaviour.

By using clustering for segmentation, the retail company can better understand its customer base, personalize marketing efforts, and improve customer engagement and satisfaction. This approach enables more effective resource allocation and ultimately leads to better business outcomes.
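
To make this workflow concrete, here is a minimal sketch in Python with scikit-learn, assuming a small hypothetical customer table with age, income, and spending-score columns (the column names and values are invented purely for illustration):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Hypothetical customer data (in practice this would come from your CRM or data warehouse).
customers = pd.DataFrame({
    "age":            [23, 45, 31, 52, 36, 29, 60, 41],
    "income":         [35000, 82000, 54000, 91000, 61000, 40000, 75000, 66000],
    "spending_score": [77, 35, 60, 20, 55, 80, 30, 48],
})

# Scale the features so that income (large values) does not dominate the distance calculation.
X = StandardScaler().fit_transform(customers)

# Partition the customers into a chosen number of segments (3 here, picked for illustration).
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
customers["segment"] = kmeans.fit_predict(X)

# Profile each segment by its average feature values.
print(customers.groupby("segment").mean())
```

In practice you would tune the number of segments (for example with the elbow method discussed later) and profile each segment carefully before acting on it.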

Clustering vs classification

Clustering and classification are both fundamental techniques in machine learning, but they serve different purposes and operate in different ways. Here’s a comparison between clustering and classification:

Purpose:

  • Clustering: Clustering is an unsupervised learning technique used to group similar data points together into clusters based on their intrinsic characteristics or similarities. The primary goal of clustering is to discover hidden patterns or structures in the data without any prior knowledge of class labels.
  • Classification: Classification is a supervised learning technique used to assign predefined class labels to data points based on their features. The goal of classification is to learn a mapping from input features to class labels using labelled training data, allowing the model to predict the class labels of unseen instances.

Supervision:

  • Clustering: Clustering is an unsupervised learning task, meaning that it does not require labelled data for training. Clustering algorithms partition the data into clusters based solely on the similarity of data points without reference to any class labels.
  • Classification: Classification is a supervised learning task, meaning that it requires labelled data for training. Classification algorithms learn to classify data points into predefined classes based on the features and corresponding class labels provided during training.

Output:

  • Clustering: The output of clustering is a partitioning of the data into clusters, with each cluster containing data points that are more similar to each other than to those in other clusters. Clustering algorithms do not assign explicit class labels to data points.
  • Classification: The output of classification is a predictive model that can assign class labels to new, unseen instances based on their features. Classification models learn to generalize from the training data to make predictions on new data.

Objective:

  • Clustering: The objective of clustering is to discover hidden patterns, groupings, or structures in the data, which can be useful for exploratory data analysis, segmentation, or anomaly detection.
  • Classification: The objective of classification is to learn a mapping from input features to class labels, allowing the model to accurately predict the class labels of new instances. Classification is commonly used for tasks such as spam detection, sentiment analysis, and image recognition.

Examples:

  • Clustering: Customer segmentation, image segmentation, document clustering, and anomaly detection are examples of tasks where clustering is commonly used.
  • Classification: Email spam detection, sentiment analysis of social media posts, and handwritten digit recognition are examples of tasks where classification is commonly used.

Why use Clustering

  1. Exploratory data analysis.
  2. Summary generation.
  3. Outlier detection.
  4. Finding duplicates.
  5. Preprocessing step.

Clustering Algorithms

Partition-based Clustering.

Partition-based clustering is a type of clustering algorithm that divides the dataset into a set of non-overlapping clusters, where each data point belongs to exactly one cluster. These algorithms iteratively refine the partitioning of the data based on certain criteria, such as minimizing the intra-cluster distance or maximizing the inter-cluster distance. One of the most popular partition-based clustering algorithms is K-means clustering.

Hierarchical Clustering.

Hierarchical clustering is a type of clustering algorithm that organizes data points into a hierarchical tree or dendrogram. It does so by recursively merging or dividing clusters of data points based on their similarity or distance.

There are two main approaches to hierarchical clustering:

  1. Agglomerative Hierarchical Clustering: This bottom-up approach starts with each data point as a separate cluster and then iteratively merges the closest pairs of clusters until only one cluster remains. The merging process continues until a stopping criterion is met, such as a predefined number of clusters or a specific distance threshold. The result is a dendrogram that illustrates the hierarchical structure of the clusters.
  2. Divisive Hierarchical Clustering: This top-down approach begins with all data points belonging to a single cluster and then recursively divides the cluster into smaller clusters until each data point is in its own cluster. Similar to agglomerative clustering, the division process continues until a stopping criterion is met.

Here’s a high-level overview of how agglomerative hierarchical clustering works:

  1. Initialization: Start with each data point as a singleton cluster.
  2. Merge Step: Calculate the pairwise distances or similarities between clusters and merge the two closest clusters into a larger cluster. This process continues until all data points are in a single cluster.
  3. Dendrogram Construction: As clusters are merged, a dendrogram is constructed to visualize the hierarchy of clusters. The vertical axis of the dendrogram represents the distance or similarity between clusters, while the horizontal axis represents the individual data points.
  4. Stopping Criterion: Determine when to stop merging clusters based on a predefined threshold, such as a maximum number of clusters or a specific distance threshold.
  5. Final Clusters: Once the merging process is complete, the final clusters are obtained based on the stopping criterion.

Hierarchical clustering has several advantages, including its ability to reveal the hierarchical structure of the data, its flexibility in handling different distance metrics, and its suitability for small to medium-sized datasets.
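
As a rough illustration, here is a minimal sketch of agglomerative clustering using SciPy on a small synthetic 2-D dataset; the linkage method, cluster count, and data are arbitrary choices for demonstration:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Small synthetic 2-D dataset: three loose groups of points.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.4, size=(10, 2)),
    rng.normal(loc=[4, 4], scale=0.4, size=(10, 2)),
    rng.normal(loc=[0, 4], scale=0.4, size=(10, 2)),
])

# Bottom-up (agglomerative) merging; Ward linkage merges the pair of clusters
# that gives the smallest increase in within-cluster variance.
Z = linkage(X, method="ward")

# One possible stopping criterion: cut the tree into 3 flat clusters.
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)

# The dendrogram visualizes the hierarchy of merges.
dendrogram(Z)
plt.xlabel("data points")
plt.ylabel("merge distance")
plt.show()
```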

Density-based Clustering.

Density-based clustering is a type of clustering algorithm that identifies clusters in a dataset based on the density of data points. Unlike partition-based algorithms like K-means or hierarchical clustering, density-based clustering does not require a predefined number of clusters and can find clusters of arbitrary shape and size. Instead, it groups together data points that are closely packed in high-density regions, while marking points in low-density regions as outliers or noise.

DBSCAN

One of the most popular density-based clustering algorithms is DBSCAN (Density-Based Spatial Clustering of Applications with Noise). Here’s how DBSCAN works:

  1. Parameters: DBSCAN defines two important parameters:
  • Epsilon (ε): the maximum distance that defines the neighbourhood of a data point.
  • MinPts: the minimum number of data points required to form a dense region (i.e., for a point to qualify as a core point).
  2. Core Points: For each data point, DBSCAN counts the neighbouring points within distance ε. If the number of neighbours is greater than or equal to MinPts, the point is considered a core point.
  3. Density Reachability: DBSCAN then identifies the points reachable from each core point. A point is reachable if it lies within the ε-neighbourhood of a core point, or if a chain of core points leads to it.
  4. Cluster Formation: Starting from a core point, DBSCAN expands the cluster by adding reachable points until no more can be added. Each cluster consists of all points directly or indirectly reachable from a core point.
  5. Noise Points: Data points that are neither core points nor reachable from any core point are classified as noise points or outliers.

DBSCAN has several advantages, including its ability to handle clusters of varying shapes and sizes, its robustness to noise and outliers, and its avoidance of the need to specify the number of clusters in advance. However, it may struggle with datasets of varying densities or with clusters of significantly different densities.
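
A minimal sketch of DBSCAN with scikit-learn is shown below; the synthetic two-moons data and the eps/min_samples values (corresponding to ε and MinPts above) are chosen purely for illustration and would need tuning on real data:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons: a non-convex shape that K-means handles poorly.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

# eps plays the role of ε and min_samples the role of MinPts described above.
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

# A label of -1 marks a noise point; other labels identify the discovered clusters.
labels = db.labels_
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = list(labels).count(-1)
print(f"clusters found: {n_clusters}, noise points: {n_noise}")
```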

K-Means Clustering

K-means clustering is an unsupervised machine learning algorithm used to partition a dataset into K distinct clusters. The algorithm aims to minimize the within-cluster sum of squares (WCSS), also known as inertia, by iteratively assigning data points to the nearest cluster centroid and updating the centroids.

Similarity Measure:

In K-means clustering, similarity measures are used to quantify the distance between data points and cluster centroids. The most common similarity measure is Euclidean distance, which calculates the straight-line distance between two points in a multidimensional space. Other distance metrics, such as Manhattan distance or cosine similarity, can also be used based on the nature of the data.
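
As a small illustration of this assignment logic, the sketch below (with made-up coordinates) computes Euclidean and Manhattan distances from one point to a set of candidate centroids and picks the nearest one:

```python
import numpy as np

# A single point and three candidate centroids (made-up coordinates).
point = np.array([2.0, 3.0])
centroids = np.array([[0.0, 0.0],
                      [5.0, 5.0],
                      [2.5, 2.5]])

# Euclidean (straight-line) distance from the point to each centroid.
euclidean = np.linalg.norm(centroids - point, axis=1)

# Manhattan (city-block) distance as an alternative metric.
manhattan = np.abs(centroids - point).sum(axis=1)

print("Euclidean distances:", euclidean)
print("Manhattan distances:", manhattan)
print("Assigned cluster (nearest by Euclidean distance):", int(np.argmin(euclidean)))
```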

How K-means Works:

  1. Initialization: Choose the number of clusters K and randomly initialize K cluster centroids.
  2. Assignment Step: For each data point, calculate the distance to each centroid using a similarity measure (e.g., Euclidean distance) and assign the point to the nearest centroid, forming K clusters.
  3. Update Step: Recalculate the centroid of each cluster as the mean of all data points assigned to it. The new centroids represent the centre of mass of the data points in each cluster.
  4. Iteration: Repeat the assignment and update steps until the convergence criteria are met. Convergence occurs when the cluster assignments and centroids no longer change significantly between iterations, or after a fixed number of iterations.
  5. Convergence: Once convergence is reached, the final clusters and centroids are obtained.

Centroids:

In K-means clustering, centroids are the representative points of the clusters. They serve as the centre of mass for the data points within each cluster. The centroids are updated iteratively during the algorithm’s execution to minimize the WCSS. The final centroids represent the final cluster centres after convergence.

K-means Algorithm:

Here’s a detailed explanation of how the K-means algorithm works:

Algorithm Steps:

  1. Initialization:
  • Choose the number of clusters, K, to partition the data into.
  • Randomly initialize K cluster centroids. Centroids are the initial representative points for each cluster.
  2. Assignment Step:
  • For each data point in the dataset, calculate the distance to each centroid.
  • Assign each data point to the nearest centroid, forming K clusters.
  3. Update Step:
  • Recalculate the centroids of the clusters by taking the mean of all data points assigned to each cluster.
  • The new centroids represent the centre of mass of the data points in each cluster.
  4. Iteration:
  • Repeat the assignment and update steps until the convergence criteria are met. Convergence occurs when the cluster assignments and centroids no longer change significantly between iterations, or after a fixed number of iterations.
  5. Convergence:
  • Once convergence is reached, the final clusters and centroids are obtained.

Detailed Explanation:

  • Objective: The goal of the K-means algorithm is to minimize the within-cluster sum of squares (WCSS), also known as inertia, by iteratively assigning data points to clusters and updating centroids.
  • Assignment Step: In the assignment step, each data point is assigned to the nearest centroid based on a similarity measure, commonly the Euclidean distance.
  • The data points are grouped into clusters, where each cluster is represented by its centroid.
  • Update Step: In the update step, the centroids of the clusters are recalculated based on the data points assigned to each cluster.
  • The new centroids represent the centre of mass of the data points in each cluster.
  • Iteration: The assignment and update steps are repeated iteratively until convergence criteria are met. Convergence typically occurs when the cluster assignments and centroids stabilize, indicating that the algorithm has found a stable clustering solution.
  • Convergence:
  • Once convergence is reached, the final clusters and centroids are obtained, and the algorithm terminates.

Example:

Let’s illustrate the K-means algorithm with a simple example:

Suppose we have a dataset of 2D points, and we want to partition it into K=3 clusters.

  1. Initialization: Randomly initialize three cluster centroids.
  2. Assignment Step: Assign each data point to the nearest centroid.
  3. Update Step: Recalculate the centroids based on the data points assigned to each cluster.
  4. Iteration: Repeat the assignment and update steps until convergence.
  5. Convergence: Once convergence is reached, obtain the final clusters and centroids.
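
The following sketch mirrors this toy example using scikit-learn's KMeans on synthetic 2-D data (generated with make_blobs; the exact points are arbitrary):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D points scattered around three centres, standing in for the dataset above.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

# n_init=10 runs K-means from 10 different random initializations and keeps the
# solution with the lowest WCSS (inertia), mitigating sensitivity to initialization.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

print("Cluster assignments (first 10 points):", kmeans.labels_[:10])
print("Final centroids:\n", kmeans.cluster_centers_)
print("WCSS (inertia):", kmeans.inertia_)
```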

K-means clustering is widely used for various applications such as customer segmentation, image compression, and anomaly detection. It is computationally efficient and scalable, making it suitable for large datasets. However, it is sensitive to the initial placement of centroids and may converge to local optima. Therefore, it’s common to run the algorithm multiple times with different initializations and choose the clustering result with the lowest WCSS.

K-means characteristics

K-means clustering is a popular algorithm for partitioning a dataset into a predetermined number of clusters. Here are the key characteristics of K-means clustering:

  1. Unsupervised Learning: K-means is an unsupervised learning algorithm, meaning it doesn’t require labelled data for training. Instead, it seeks to identify patterns and structures within the data based solely on the input features.
  2. Partitioning Algorithm: K-means partitions the dataset into K clusters, where K is a predefined number specified by the user. Each data point belongs to exactly one cluster.
  3. Centroid-based: K-means clusters data points around centroids, which are representative points for each cluster. Centroids are iteratively updated during the algorithm’s execution to minimize the within-cluster sum of squares (WCSS), also known as inertia.
  4. Objective Function: The objective of K-means clustering is to minimize the distance between data points and their respective cluster centroids. This is achieved by minimizing the WCSS, which quantifies the compactness of the clusters.
  5. Iterative Optimization: K-means uses an iterative optimization process to converge to a clustering solution. It alternates between assigning data points to the nearest centroid (assignment step) and updating the centroids based on the data points assigned to each cluster (update step). This process continues until convergence criteria are met.
  6. Sensitivity to Initializations: K-means clustering is sensitive to the initial placement of centroids. Different initializations can lead to different clustering results, as the algorithm may converge to different local optima. To mitigate this, it’s common to run the algorithm multiple times with different initializations and choose the clustering result with the lowest WCSS.
  7. Euclidean Distance: By default, K-means uses Euclidean distance as the similarity measure to calculate the distance between data points and centroids. However, other distance metrics can be used depending on the nature of the data.
  8. Scalability: K-means is computationally efficient and scalable, making it suitable for large datasets. Its time complexity is linear with respect to the number of data points, making it applicable to datasets with thousands or even millions of data points.
  9. Assumes Globular Clusters: K-means assumes that clusters are spherical or globular in shape and have roughly equal variance. As a result, it may not perform well on datasets with irregularly shaped or non-convex clusters.
  10. Noisy and Outlier Sensitivity: K-means is sensitive to noisy data and outliers, as they can significantly impact the clustering results. Outliers may be erroneously assigned to clusters or form separate clusters themselves.

K in K-means and the error (WCSS)

As the parameter K for K-means clustering increases, the within-cluster sum of squares (WCSS) typically decreases.

Here’s why:

  1. More Clusters: Increasing K means dividing the dataset into a larger number of clusters, with each cluster potentially containing fewer data points.
  2. Smaller Distances: With more clusters, the centroids are more likely to be closer to the data points they represent. This results in smaller distances between data points and their respective cluster centroids.
  3. Decreased Inertia: The WCSS (also known as inertia) is the sum of squared distances of each data point to its nearest cluster centroid. With smaller distances, the sum of squared distances decreases, leading to lower WCSS.

However, it’s important to note that decreasing WCSS with increasing K doesn’t necessarily mean that the clustering is better or more meaningful. There is a trade-off between model complexity (number of clusters) and the interpretability and usefulness of the clustering results.

Using techniques like the elbow method or silhouette analysis can help determine an appropriate value of K that balances the trade-off between model complexity and clustering quality.
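
Here is a minimal sketch of the elbow method with scikit-learn: fit K-means for a range of K values on some data (synthetic here) and plot the WCSS against K:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with four underlying groups (unknown to the algorithm).
X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

# Fit K-means for K = 1..10 and record the WCSS (inertia) for each.
ks = range(1, 11)
wcss = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_ for k in ks]

# The "elbow" where the curve flattens suggests a reasonable value of K.
plt.plot(list(ks), wcss, marker="o")
plt.xlabel("number of clusters K")
plt.ylabel("WCSS (inertia)")
plt.title("Elbow method")
plt.show()
```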

Gauging the performance of a K-means clustering model

When ground truth labels are not available, evaluating the performance of a K-means clustering model becomes more challenging. However, there are several metrics and techniques that can be used to assess the quality of the clustering results. Here are some commonly used methods:

  1. Inertia or Within-Cluster Sum of Squares (WCSS):
  • Inertia measures the sum of squared distances of data points to their nearest cluster centroid. Lower inertia indicates tighter clusters.
  • While not a direct measure of clustering quality, decreasing inertia suggests better clustering.
  • However, inertia tends to decrease as the number of clusters increases, making it less useful for determining the optimal number of clusters.
  2. Silhouette Score:
  • The silhouette score measures how similar an object is to its own cluster compared to other clusters.
  • It ranges from -1 to 1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighbouring clusters.
  • The average silhouette score across all data points can be used as an overall measure of clustering quality; a higher score suggests better clustering.
  3. Gap Statistic:
  • The gap statistic compares the within-cluster dispersion to that expected under an appropriate reference null distribution.
  • It helps estimate the optimal number of clusters by comparing the observed dispersion to that of a random distribution.
  • The number of clusters with the largest gap statistic is considered optimal.
  4. Calinski-Harabasz Index:
  • The Calinski-Harabasz index measures the ratio of between-cluster dispersion to within-cluster dispersion.
  • A higher index value indicates better clustering, with more compact and well-separated clusters.
  5. Davies–Bouldin Index:
  • The Davies–Bouldin index measures the average similarity between each cluster and its most similar cluster, where similarity is based on both within-cluster and between-cluster dispersion.
  • A lower index value indicates better clustering, with well-separated and distinct clusters.
  6. Visual Inspection:
  • Visual inspection of the clustering results can provide qualitative insights into the quality of the clusters.
  • Techniques such as scatter plots, heatmaps, or t-SNE visualizations can be used to explore the data and the cluster assignments.
It’s important to note that these evaluation metrics provide indications of clustering quality but do not guarantee optimal clustering. It’s often recommended to use multiple metrics and compare results across different parameter settings (e.g., number of clusters) to make informed decisions about the clustering model.
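
As a rough illustration, the sketch below computes three of these label-free metrics with scikit-learn for several candidate values of K on synthetic data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (silhouette_score,
                             calinski_harabasz_score,
                             davies_bouldin_score)

# Synthetic data; in practice X would be your feature matrix.
X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

# Compare candidate values of K using metrics that need no ground-truth labels.
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    print(f"K={k}  "
          f"silhouette={silhouette_score(X, labels):.3f}  "
          f"Calinski-Harabasz={calinski_harabasz_score(X, labels):.1f}  "
          f"Davies-Bouldin={davies_bouldin_score(X, labels):.3f}")
```

Higher silhouette and Calinski-Harabasz values, and lower Davies–Bouldin values, indicate more compact and better-separated clusters.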

Part 4: Link

Part 6: Link
