Basics of K-means clustering

Mohit Gupta
3 min read · Mar 7, 2024


Apart from reading these questions, try to write the code for a K-means implementation from scratch. The main benefit is that while writing the code, the answers to many of these questions emerge naturally, or at least become easier to appreciate. You can follow the tutorial provided here.

1. What is K-means clustering?

Answer: K-means clustering is a popular unsupervised machine learning algorithm used for partitioning a dataset into K distinct, non-overlapping clusters.

2. What is the principle behind it?

Answer: The principle behind K-means clustering is to minimize the within-cluster variance, aiming to create clusters where data points within the same cluster are as similar as possible.

Objective function: minimize the intra-cluster variance, i.e., the sum of squared distances between each point and its cluster centroid.
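As a concrete sketch of that objective (toy 1-D values, not from the article), the within-cluster sum of squares for a fixed assignment can be computed directly:

```python
import numpy as np

# Toy 1-D data: two obvious groups and a fixed cluster assignment
X = np.array([[1.0], [1.2], [0.8], [8.0], [8.2], [7.8]])
labels = np.array([0, 0, 0, 1, 1, 1])

# Each centroid is the mean of its cluster's points
centroids = np.array([X[labels == k].mean(axis=0) for k in (0, 1)])

# The objective K-means minimizes: within-cluster sum of squared distances
wcss = sum(((X[labels == k] - centroids[k]) ** 2).sum() for k in (0, 1))
```

Any other assignment of these six points would give a larger `wcss`, which is exactly what "minimize within-cluster variance" means.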

3. What are the steps involved in it?

Answer:

Step 0 — Preprocess the data
Step 1 — Choose K
Step 2 — Select K points randomly as initial cluster centroids
Step 3 — Assign points to the K clusters based on Euclidean distance
Step 4 — Recalculate each centroid as the mean of its assigned points
Step 5 — Repeat steps 3–4 until convergence

4. What preprocessing is needed, and why is it important? (about Step-0)

Answer: The two main preprocessing steps are:

i. Scaling/standardization of the data — to ensure that all features contribute equally to the distance computation, which is crucial for the algorithm’s performance.

ii. Shuffling/randomizing the dataset — the order of the data can affect the result in order-dependent variants (e.g., mini-batch K-means) and in some initialization schemes.
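A minimal z-score standardization sketch for point (i), with toy values assumed (scikit-learn's `StandardScaler` does the same thing):

```python
import numpy as np

# Toy features on wildly different scales: height in metres, weight in grams
X = np.array([[1.70, 65000.0],
              [1.80, 72000.0],
              [1.60, 58000.0]])

# z-score standardization: zero mean, unit variance per feature,
# so both features contribute comparably to Euclidean distances
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
```

Without this step, the gram-scale feature would completely dominate every distance computation.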

5. How to choose K? (about Step-1)

Answer: The number of clusters (K) can be chosen using methods like:

a. Elbow method
b. Silhouette score, or
c. Domain knowledge about the data.
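A sketch of the elbow method from (a), under the assumption of three well-separated Gaussian blobs: WCSS keeps falling as K grows, but the drop flattens sharply past the true K. The helper below is a bare-bones K-means kept only for illustration:

```python
import numpy as np

def kmeans_wcss(X, k, seed, n_iters=50):
    """Run plain K-means once; return the within-cluster sum of squares."""
    rng = np.random.default_rng(seed)
    c = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        labels = np.linalg.norm(X[:, None] - c[None], axis=2).argmin(axis=1)
        c = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else c[j]
                      for j in range(k)])
    labels = np.linalg.norm(X[:, None] - c[None], axis=2).argmin(axis=1)
    return sum(((X[labels == j] - c[j]) ** 2).sum() for j in range(k))

def best_wcss(X, k, n_init=5):
    # Best of several random restarts (scikit-learn's n_init does the same)
    return min(kmeans_wcss(X, k, seed) for seed in range(n_init))

rng = np.random.default_rng(1)
# Three well-separated blobs, so the "true" K is 3
X = np.vstack([rng.normal(m, 0.3, size=(30, 2)) for m in ([0, 0], [5, 5], [10, 0])])

# Plotting WCSS against K reveals a sharp bend (the "elbow") at K = 3
curve = {k: best_wcss(X, k) for k in range(1, 7)}
```

In practice one plots `curve` and picks the K where the curve bends rather than reading the raw numbers.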

6. How to select the K initial cluster centers? What is K-means++? (about Step-2)

Answer: The problem with random initialization in K-means clustering is that it can lead to suboptimal cluster centers, affecting both the convergence speed and the quality of the final clustering. This randomness can result in inconsistent results across runs and may lead to convergence to local minima rather than the global minimum.

To mitigate this, K-means++ is an improvement over random initialization, where initial cluster centers are chosen in a more intelligent way to improve the convergence speed and the quality of the final clustering.
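The K-means++ seeding rule can be sketched in a few lines (the toy blob data is assumed): the first centre is chosen uniformly at random, and each subsequent centre is sampled with probability proportional to its squared distance from the nearest centre chosen so far, which spreads the initial centres out.

```python
import numpy as np

def kmeans_pp_init(X, k, rng):
    """K-means++ seeding: after a uniformly random first centre, each new
    centre is sampled with probability proportional to its squared distance
    from the nearest centre chosen so far."""
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance of every point to its nearest existing centre
        d2 = np.min(np.linalg.norm(X[:, None] - np.array(centers)[None], axis=2) ** 2,
                    axis=1)
        # Far-away points are proportionally more likely to be chosen
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)

rng = np.random.default_rng(0)
# Three tight, well-separated blobs
X = np.vstack([rng.normal(m, 0.2, size=(20, 2)) for m in ([0, 0], [10, 0], [0, 10])])
centers = kmeans_pp_init(X, 3, rng)
```

On data like this, the squared-distance weighting makes it very likely that each blob contributes one initial centre, which is exactly the failure mode of uniform random initialization.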

7. Can another type of distance be used instead of Euclidean distance? (about Step-3)

Answer: Yes, other distance metrics such as Manhattan distance (K-medians) or cosine similarity (spherical K-means) can be used instead of Euclidean distance (standard K-means).
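A side-by-side sketch of the two distance metrics on toy values (the point and centroids are assumed for illustration):

```python
import numpy as np

point = np.array([2.0, 3.0])
centroids = np.array([[0.0, 0.0], [4.0, 4.0]])

# Euclidean (L2) distance: the standard K-means metric
d_euclid = np.linalg.norm(centroids - point, axis=1)

# Manhattan (L1) distance: the metric behind the K-medians variant
d_manhattan = np.abs(centroids - point).sum(axis=1)
```

Swapping the metric also changes the correct centroid update: with L1, the per-coordinate median (not the mean) minimizes the objective, which is why the variant is called K-medians.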

8. What are the convergence conditions for K-means clustering, i.e., what are the stopping criteria?

Answer:

a. Centroid stability — centroids stop moving between iterations
b. Assignment stability — points are assigned to the same clusters in consecutive rounds
c. A maximum number of iterations is reached
d. Centroid movement falls below a user-defined tolerance

9. How to evaluate the effectiveness of K-means clustering?

Answer: The effectiveness of K-means clustering can be evaluated using metrics like:

a. Silhouette score
b. Dunn Index
c. WCSS (Within-Cluster Sum of Squares)
d. Davies–Bouldin index, or
e. Visual inspection of the resulting clusters.
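Metric (a) can be computed straight from its definition without any library (toy data assumed; scikit-learn's `silhouette_score` implements the same formula):

```python
import numpy as np

def silhouette(X, labels):
    """Mean silhouette coefficient, straight from the definition:
    s(i) = (b - a) / max(a, b), where a is the mean distance to the
    point's own cluster and b is the mean distance to the nearest
    other cluster."""
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False
        if not same.any():
            scores.append(0.0)  # convention for singleton clusters
            continue
        d = np.linalg.norm(X - X[i], axis=1)
        a = d[same].mean()
        b = min(d[labels == c].mean() for c in set(labels) - {labels[i]})
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
good = silhouette(X, np.array([0, 0, 0, 1, 1, 1]))  # matches the true grouping
bad = silhouette(X, np.array([0, 1, 0, 1, 0, 1]))   # scrambles the two groups
```

Scores near +1 indicate tight, well-separated clusters; scores near 0 or below indicate points sitting between clusters or assigned to the wrong one.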

10. What shapes of clusters does K-means work best for?

Answer: K-means clustering works best for clusters that are spherical or isotropic in shape and have similar densities. If the clusters have irregular shapes or varying densities, K-means may produce suboptimal results, leading to misclassification of data points.

Try to visualize how K-means works on various shapes of clusters:

Fig 1 — K-means clustering results (as per scikit-learn)
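A toy construction (not the scikit-learn figure) shows why elongated clusters trip K-means up: a vertical cut down the middle achieves a lower WCSS than the true horizontal grouping, so the objective itself prefers the wrong partition.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two long, thin, parallel clusters: strongly elongated, not spherical
xs = rng.uniform(0, 10, size=100)
A = np.column_stack([xs, rng.normal(0.0, 0.05, size=100)])
B = np.column_stack([xs, rng.normal(1.0, 0.05, size=100)])
X = np.vstack([A, B])

def wcss(X, labels):
    return sum(((X[labels == k] - X[labels == k].mean(axis=0)) ** 2).sum()
               for k in np.unique(labels))

true_labels = np.repeat([0, 1], 100)        # the actual horizontal clusters
xsplit_labels = (X[:, 0] > 5).astype(int)   # a vertical cut down the middle

w_true = wcss(X, true_labels)
w_split = wcss(X, xsplit_labels)
# w_split < w_true: the objective rewards the compact-looking but wrong cut
```

Since K-means only ever minimizes WCSS, no initialization strategy can rescue it here; density- or graph-based methods (e.g., DBSCAN, spectral clustering) handle such shapes better.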

11. What are some variants of K-means clustering?

Answer: Variants of K-means clustering include K-medoids, K-medians, spherical K-means, hierarchical K-means, and kernel K-means, each with different clustering approaches.

12. How does the number of dimensions affect the performance of K-means clustering?

Answer: The performance of K-means clustering can degrade in high-dimensional spaces due to the curse of dimensionality, as the distance between points becomes less meaningful in higher dimensions. Dimensionality reduction techniques may be applied to mitigate this issue.
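One common mitigation, sketched here with a from-scratch PCA via SVD (the blob data and dimension counts are illustrative): project the data onto its top principal components before clustering.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two clusters that differ only in the first 2 dimensions, plus 48 noise dimensions
informative = np.vstack([rng.normal(m, 0.5, size=(40, 2)) for m in ([0, 0], [6, 6])])
noise = rng.normal(0.0, 0.5, size=(80, 48))
X = np.hstack([informative, noise])

# PCA via SVD: centre the data and keep the top-2 principal components
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
X_reduced = Xc @ Vt[:2].T
```

The 50-dimensional distances are dominated by accumulated noise, but in the 2-dimensional projection the two groups separate cleanly along the first component, so K-means on `X_reduced` recovers them easily.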

