K-Means Clustering

Vivek Salunkhe
3 min read · Jul 27, 2021


Clustering or Cluster Analysis is the task of grouping a set of objects in such a way that objects in the same group are more similar to each other than to those in other groups.

Clustering. Image by author

Types of Clustering:

  1. K-Means Clustering.
  2. Hierarchical Clustering.

In today’s blog we will discuss K-Means Clustering.

K-Means Intuition: K-Means is an iterative algorithm that tries to partition the data into K distinct, non-overlapping clusters. The steps involved in K-Means Clustering are listed below (a code sketch follows the list):

  1. Choose the number K of clusters.
  2. Select K points at random as the initial centroids (not necessarily from the dataset).
  3. Assign each data point to its closest centroid, forming K clusters. (Choose an appropriate distance metric based on the business problem at hand.)
  4. Compute and place the new centroid of each cluster.
  5. Reassign each data point to its new closest centroid. If any reassignment took place, go to Step 4; otherwise, STOP.
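
To make these steps concrete, here is a minimal from-scratch sketch in Python, assuming NumPy, Euclidean distance, and (for simplicity) initial centroids sampled from the dataset itself; the names X, k, and max_iters are illustrative:

```python
import numpy as np

def k_means(X, k, max_iters=100, seed=42):
    rng = np.random.default_rng(seed)
    # Step 2: pick K random data points as the initial centroids
    # (a simplification; they need not come from the dataset)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 3: assign each point to its closest centroid (Euclidean distance)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its assigned points
        # (note: this sketch does not handle the empty-cluster edge case)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop once the centroids (and hence assignments) no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Example usage on toy 2-D data
X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
              [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])
labels, centroids = k_means(X, k=2)
print(labels)
print(centroids)
```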

Examples:

  1. Market Segmentation
  2. Fraud detection
  3. Identifying clusters of customers within your company or business.

Selecting K, the Appropriate Number of Clusters:

  1. Within Cluster Sum of Squares (WCSS): WCSS is the sum of the squared distances between each point and the centroid of its cluster. As the number of clusters increases, the WCSS value decreases, reaching 0 when every data point is its own cluster.
WCSS Formula. Image by author
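
For reference, the formula in the image above can be written in standard notation as:

```latex
% WCSS: squared distance from each point x to the centroid mu_i of its
% cluster C_i, summed over all K clusters
\mathrm{WCSS} = \sum_{i=1}^{K} \sum_{x \in C_i} \lVert x - \mu_i \rVert^{2}
```

Here C_i denotes the i-th cluster and mu_i its centroid.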

We choose the appropriate number of clusters at the point where the decrease in WCSS between successive values of K goes from large to insignificant. The popular method to visualize this is the Elbow Method, sketched in code below.

Elbow Method. Image by author
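
A short sketch of the Elbow Method using scikit-learn follows; the toy matrix X is illustrative, and scikit-learn’s inertia_ attribute is exactly the WCSS defined above:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
              [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])

wcss = []
ks = range(1, 6)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss.append(km.inertia_)  # inertia_ is the WCSS for this value of K

# Plot WCSS against K and look for the "elbow" where the curve flattens
plt.plot(ks, wcss, marker="o")
plt.xlabel("Number of clusters K")
plt.ylabel("WCSS")
plt.title("Elbow Method")
plt.show()
```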

Advantages of the K-Means algorithm:

  1. Easy and simple to implement.
  2. Scales very well to large datasets.
  3. Convergence is guaranteed.
  4. Works well on unseen (test) data.
  5. Generalizes well.

Disadvantages of the K-Means algorithm:

  1. Random Initialization Trap: The choice of the initial random centroids matters; a poor choice can lead the K-Means algorithm to converge to incorrect clusters. This can be overcome with the K-Means++ initialization (see the sketch after this list).
  2. Choosing the optimal number of clusters K is not straightforward.
  3. Struggles with data of varying cluster sizes and densities.
  4. Sensitive to outliers.
  5. If the data contains a large number of features, it may suffer from the Curse of Dimensionality.
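
As a brief sketch of point 1, scikit-learn’s KMeans supports K-Means++ initialization directly via its init parameter; the toy data below is illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
              [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])

# init="k-means++" spreads the initial centroids apart, which tends to
# avoid the poor local optima that purely random initialization can cause
km_pp = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=42).fit(X)

# A single purely random initialization, to illustrate the trap
km_rand = KMeans(n_clusters=2, init="random", n_init=1, random_state=0).fit(X)

print("k-means++ WCSS:", km_pp.inertia_)
print("random init WCSS:", km_rand.inertia_)
```

Running several random restarts (a larger n_init) also mitigates the trap, but K-Means++ typically reaches a good solution with fewer restarts.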

Implementation: Refer to the following link for Python and R implementations of K-Means Clustering:

K-Means Clustering
