Understanding K-means Clustering

Nermeen Abd El-Hafeez
6 min read · Sep 28, 2023


What is the K-means clustering algorithm?

Clustering is a fundamental technique in unsupervised machine learning, used to identify patterns within data by grouping similar data points together. The core objective of a clustering algorithm is to locate data points that share common characteristics, thus assigning them to the same cluster. To achieve this, clustering algorithms utilize a critical component known as a distance measure to quantify the similarity or dissimilarity between data points.

Euclidean Distance

At the core of many clustering techniques, including K-means, is the Euclidean distance. It serves as the metric for assessing dissimilarity between two data points, say Observation A and Observation B, represented by the coordinate pairs (X1, Y1) and (X2, Y2). The Euclidean distance between these two points is calculated as follows:

Euclidean Distance = √((X1 − X2)² + (Y1 − Y2)²)
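This formula translates directly into a few lines of Python; a minimal sketch for two 2-D points (the function name is illustrative):

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two 2-D points a = (x1, y1) and b = (x2, y2)."""
    return math.sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2)

print(euclidean_distance((1, 2), (4, 6)))  # 5.0
```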

Centroid

The K-means clustering algorithm relies on centroids to establish clusters. In simple terms, the centroid of a set of points in a two-dimensional X-Y plane is itself a point, characterized by its X and Y coordinates. For instance, given three points with coordinates (X1, Y1), (X2, Y2), and (X3, Y3), the centroid of these points can be computed as the average of their X and Y coordinates:

Centroid = ((X1 + X2 + X3) / 3, (Y1 + Y2 + Y3) / 3)

This formula can be extended to encompass n points as follows:

Centroid = ((X1 + X2 + … + Xn) / n, (Y1 + Y2 + … + Yn) / n)
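As a small Python sketch, the centroid is just the coordinate-wise mean of the points:

```python
def centroid(points):
    """Centroid of a list of (x, y) points: the mean of each coordinate."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

print(centroid([(0, 0), (2, 0), (1, 3)]))  # (1.0, 1.0)
```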

K-means in Action

Now, let’s explore how the K-means algorithm operates in practice. Suppose we intend to divide a dataset into two clusters (so k = 2) based on the Euclidean distance between data points and their respective centroids. In general, n signifies the number of data points, while k represents the desired number of clusters.

Step 1: Assignment Step

  • Initially, we calculate the Euclidean distance between each data point and the two cluster centers (initial centroids).
  • Subsequently, we assign each data point to the nearest centroid based on the minimum distance, effectively grouping them into their respective clusters.

Step 2: Optimization Step

  • In the next phase, we recompute the centroids. These new centroids are derived by computing the mean of individual points within each cluster.
  • This process yields updated cluster centers, often referred to as the “next optimal centroids”.

Step 3: Iteration

  • After computing the two new centroids in Step 2, we return to Step 1.
  • We reassign each data point to the nearest of these updated centroids, forming the updated, more optimal clusters.
  • This assignment follows the same methodology, involving the computation of the Euclidean distance between a data point and the centroids, followed by assignment to the nearest centroid.

This iterative cycle continues until the centroids no longer change significantly or until a predefined stopping condition is met. At this point, the K-means algorithm has successfully grouped the data points into ‘k’ clusters, with each data point belonging to the cluster represented by its nearest centroid.
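Putting the three steps together, here is a compact NumPy sketch of the loop just described. It uses randomly chosen data points as the initial centroids and a simple movement threshold as the stopping condition; it is illustrative rather than production code (scikit-learn's KMeans adds smarter initialization and other refinements):

```python
import numpy as np

def kmeans(X, k, max_iters=100, tol=1e-6, seed=0):
    """Minimal K-means: X is an (n, d) array, k the desired number of clusters."""
    rng = np.random.default_rng(seed)
    # Pick k distinct data points at random as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 1 (assignment): Euclidean distance from every point to every
        # centroid, then assign each point to its nearest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 2 (optimization): recompute each centroid as its cluster's mean
        # (keeping the old centroid if a cluster happens to be empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 3 (iteration): stop once the centroids barely move.
        if np.linalg.norm(new_centroids - centroids) < tol:
            break
        centroids = new_centroids
    return labels, centroids

# Example: two well-separated blobs should be recovered cleanly with k = 2.
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
labels, centroids = kmeans(X, k=2)
```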

Choosing the Right K

Elbow Method

In the Elbow Method, the goal is to find the ideal number of clusters (K) for a dataset. This method involves systematically varying the number of clusters from 1 to a certain upper limit (commonly 10) and assessing the Within-Cluster Sum of Squares (WCSS) for each K.

Understanding WCSS:

WCSS quantifies the sum of squared distances between each data point and the centroid of its respective cluster. In simpler terms, it measures how closely data points are grouped around their cluster’s center. A smaller WCSS indicates that data points are tightly clustered within their respective groups.
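Computed directly from this definition (reusing X, labels, and centroids from the sketch above), WCSS looks like this:

```python
def wcss(X, labels, centroids):
    """Sum of squared distances from each point to its own cluster's centroid."""
    return sum(((X[labels == j] - centroids[j]) ** 2).sum()
               for j in range(len(centroids)))

print(wcss(X, labels, centroids))
```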

Interpreting the Elbow Point:

As K increases, the WCSS always decreases, because more centroids sit closer to the data. Plotting WCSS against K typically yields a curve that drops steeply at first and then flattens; the “elbow” is the K at which this bend occurs. The K value associated with the elbow point represents the optimal number of clusters for your dataset. Beyond this point, adding more clusters does not provide substantial improvements in capturing data patterns, and the WCSS decreases at a much gentler rate.
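With scikit-learn, the WCSS of a fitted model is exposed as its inertia_ attribute, so the elbow plot takes only a few lines (the K range and random_state here are arbitrary choices):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

ks = range(1, 11)
wcss_values = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss_values.append(km.inertia_)  # inertia_ is the WCSS for this K

plt.plot(ks, wcss_values, marker="o")
plt.xlabel("Number of clusters (K)")
plt.ylabel("WCSS")
plt.title("Elbow Method")
plt.show()
```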

Silhouette Method

The Silhouette Method is a powerful and intuitive technique for selecting the ideal number of clusters (K) in K-means clustering. Unlike the Elbow Method, which focuses solely on the Within-Cluster Sum of Squares (WCSS), the Silhouette Method delves into the quality of the clustering itself.

Silhouette Coefficient: A Measure of Cluster Fit

The Silhouette Coefficient serves as a fundamental metric within this method, providing a quantitative measure of how effectively an individual data point aligns with its designated cluster in comparison to alternative clusters. This metric yields values spanning the range from -1 to 1, offering invaluable insights into the quality of the clustering assignments:

  • A Silhouette Coefficient of 1 stands as a clear indicator that a data point seamlessly integrates with its own cluster while maintaining significant dissimilarity from other clusters. This signifies a robust and distinct cluster assignment.
  • Conversely, a value of -1 suggests that the data point might be better suited within a neighboring cluster, implying a suboptimal assignment.
  • When the coefficient approaches 0, it hints that the data point resides near or straddles the boundary between clusters, signaling potential overlap or uncertainty in cluster delineations.

Calculating the Silhouette Coefficient: A Formula for Precision

To unveil the optimal K, the Silhouette Method computes the Silhouette Coefficient for every data point within the dataset and subsequently derives the average across all data points. This average Silhouette Coefficient grants a holistic view of the overall clustering quality for a given K.

The formula for calculating the silhouette coefficient for a specific data point i is as follows:

S(i) = (b(i) − a(i)) / max{a(i), b(i)}

Where:

  • S(i) is the silhouette coefficient for the data point i.
  • a(i) is the average distance from data point i to all other data points within the same cluster (intra-cluster distance).
  • b(i) is the smallest average distance from data point i to all data points in any other cluster, minimized over those clusters (inter-cluster distance).
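scikit-learn implements this per-point formula in silhouette_samples, and its average in silhouette_score; a quick sketch, reusing X and labels from the earlier example:

```python
from sklearn.metrics import silhouette_samples, silhouette_score

s = silhouette_samples(X, labels)   # S(i) for every data point, each in [-1, 1]
print(s.min(), s.mean(), s.max())
print(silhouette_score(X, labels))  # the average coefficient for this clustering
```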

Peak Silhouette Coefficient: Peaks of Clustering Excellence

In practice, when plotting the average Silhouette Coefficient against a range of K values, the graph often reveals one or more discernible peaks. These peaks mark points where clustering quality reaches its zenith, showcasing robust and well-defined cluster configurations.

Selecting the Optimal K: A Data-Driven Decision

To ascertain the optimal number of clusters, the task is straightforward: identify the K value that corresponds to the highest peak on the Silhouette Coefficient plot. This value signifies the number of clusters that best encapsulates the underlying data patterns while maintaining separation between clusters.
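In code, that decision amounts to scoring each candidate K and taking the one with the highest average silhouette (again reusing X; the K range is an arbitrary choice):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

scores = {}
for k in range(2, 11):  # the silhouette requires at least 2 clusters
    labels_k = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels_k)

best_k = max(scores, key=scores.get)
print(f"Best K by average silhouette: {best_k} (score = {scores[best_k]:.3f})")
```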

Interpreting Results: Clarity Through Coefficients

A notably high average Silhouette Coefficient serves as a testament to well-structured clusters with minimal overlap or misassignment. By harnessing the Silhouette Method, data-driven decisions come to the forefront, ensuring that your K-means analysis accurately captures the intricacies of your dataset’s underlying patterns.

In conclusion, K-means clustering stands as a powerful tool in the world of unsupervised machine learning, allowing us to uncover hidden structures and patterns within our data. From its foundation in Euclidean distance to the intricacies of centroid-based clustering, we’ve delved into the core concepts that drive this algorithm.
