Data Scientists’ Interview Guide: k-Means

Common interview questions around k-Means, such as its pros and cons, when to use it, variations of the simple k-Means, and how to code it from scratch in Python

Karun Thankachan
CodeX
10 min read · May 18, 2023


As part of the data science interview process, you will typically face a machine learning round that tests your understanding of basic ML algorithms such as Linear Regression, Logistic Regression, and SVM. In this post we cover one of the most commonly asked ones, k-Means: how to explain it in simple terms during an interview, its pros and cons, the best situations in which to use it, and its common variations.

What is k-Means?

K-Means is a popular algorithm used for clustering, which is a technique to group similar data points together. Imagine you have a bunch of unlabeled data points, and you want to find patterns or groups within them. K-Means can help you with that.

The general structure of the K-Means algorithm is as follows:

  1. Initial Step: First, you need to decide on the number of clusters you want to create, let’s say you choose K=3.
  2. Random Initialization: Next, you randomly select three points in your dataset as the initial centroids of the clusters. A centroid is simply the representative point for a cluster.
  3. Assignment Step: Now, for each data point, you calculate its distance to each centroid, typically using the Euclidean distance. The data point is assigned to the cluster whose centroid is closest to it.
  4. Update Step: Once all the data points have been assigned to clusters, you recalculate the centroids based on the mean of the data points within each cluster. This step is called the update step.
  5. Repeat: Steps 3 and 4 are repeated until the algorithm converges, which means the centroids no longer change significantly or the maximum number of iterations is reached.
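
Outside of interviews you would rarely code this loop by hand. As a quick illustration, here is a minimal sketch of the same procedure using scikit-learn's KMeans, assuming scikit-learn is available (the toy data is made up for illustration):

from sklearn.cluster import KMeans
import numpy as np

# Six toy 2-D points forming two loose groups (made-up data)
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
print(kmeans.labels_)           # cluster assignment for each point
print(kmeans.cluster_centers_)  # final centroids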

Let’s go through the steps of K-Means again, this time using some mock data to make it more concrete. Suppose we have mock data for six animals (Lion, Tiger, Giraffe, Zebra, Monkey, and Elephant), where each animal is represented by its weight and height.

We want to cluster these animals into three groups based on their weight and height using K-Means. Let’s go through the algorithm step by step:

Step 1: Initialization
We randomly initialize the centroids. Let’s assume the initial centroids are:
Centroid 1: [180, 100]
Centroid 2: [400, 300]
Centroid 3: [250, 150]

Step 2: Assignment
We calculate the distance between each animal and the centroids. Each animal is assigned to the nearest centroid.

Step 3: Centroid Update
We update the centroids by calculating the mean of the assigned animals’ weights and heights for each cluster.
New Centroid 1: [190, 115]
New Centroid 2: [500, 375]
New Centroid 3: [150, 135]

Step 4: Repeat
We repeat steps 2 and 3 until convergence. In this example, we’ll perform two more iterations.

Iteration 2:
Assignment: each animal is reassigned to whichever of the updated centroids is now nearest.

Centroid Update:
New Centroid 1: [50, 140]
New Centroid 2: [0, 0]
New Centroid 3: [600, 400]

Iteration 3:
Assignment: each animal is again assigned to its nearest centroid; the assignments no longer change.

Centroid Update:
New Centroid 1: [50, 140]
New Centroid 2: [0, 0]
New Centroid 3: [600, 400]

Step 5: Convergence
The algorithm stops when the centroids no longer change. In this example, the centroids did not change between the second and third iterations, so we stop.

Final Output:
The final clusters and centroids are:

Cluster 1: [Lion, Tiger, Giraffe, Zebra, Monkey]
Cluster 2: []
Cluster 3: [Elephant]

Centroid 1: [50, 140]
Centroid 2: [0, 0]
Centroid 3: [600, 400]

Please note that in this example, Cluster 2 ended up empty. This can happen if there are no data points assigned to a particular centroid. In practice, it’s essential to handle such cases and adjust the algorithm accordingly.
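
One simple way to handle an empty cluster is to reseed its centroid and keep iterating. The helper below is a hypothetical sketch of that idea; reseeding with a random data point is just one of several common strategies:

import random

def fix_empty_clusters(clusters, centroids, data):
    # Hypothetical helper: reseed any empty cluster's centroid with a
    # random data point, then let the next iteration reassign points
    for i, cluster in enumerate(clusters):
        if not cluster:
            centroids[i] = random.choice(data)
    return centroids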

How to choose k-value?

Choosing the right value of k, the number of clusters, is an important step in applying the K-Means algorithm. While there is no definitive rule for determining the optimal k-value, here are a few approaches commonly used:

1. Domain Knowledge: If you have prior knowledge or understanding of the data and the problem domain, it can provide insights into the expected number of clusters. For example, if you’re analyzing customer segments, you might have an idea of how many distinct customer groups exist based on demographics or purchasing behavior.

2. Elbow Method: Plot the within-cluster sum of squares (WCSS) against the number of clusters (k). WCSS is the sum of squared distances between each data point and the centroid of its cluster. The elbow method suggests choosing the value of k at the “elbow” of the curve, the point beyond which additional clusters provide diminishing improvements in reducing WCSS (a code sketch appears at the end of this section).

3. Silhouette Score: Compute the silhouette score for different values of k. The silhouette score measures the compactness and separation of clusters. It ranges from -1 to 1, where higher values indicate better-defined clusters. Choose the k-value that maximizes the silhouette score.

4. Business Constraints: Consider any practical or business constraints that may influence the choice of k. For example, the availability of resources or the need for interpretability might favor a specific number of clusters.

5. Hyperparameter tuning: Use a validation dataset and metrics associated with cluster quality (e.g. silhouette score) in a hyperparameter selection approach such as random search or grid search to determine the optimal value of k. The sketch at the end of this section loops over candidate values of k in exactly this way.

It’s important to note that the choice of k is subjective and may require experimentation and evaluation. It’s advisable to try multiple values of k and assess the quality of the resulting clusters using domain knowledge or validation techniques.
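
As referenced above, here is a minimal sketch that computes both WCSS (exposed as inertia_ in scikit-learn) and the silhouette score across candidate values of k (the data is a random placeholder):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.RandomState(0).rand(200, 2)  # random placeholder data

for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss = km.inertia_                     # within-cluster sum of squares
    sil = silhouette_score(X, km.labels_)  # mean silhouette over all points
    print(f"k={k}: WCSS={wcss:.1f}, silhouette={sil:.3f}")

# Pick k at the WCSS "elbow", or the k that maximizes the silhouette score.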

When to use k-Means?

K-Means, like any algorithm, has its drawbacks, advantages, and specific use cases. Let’s explore them:

Drawbacks of K-Means

  1. Sensitive to initial centroids: K-Means is sensitive to the initial placement of centroids. Depending on the initial positions, the algorithm may converge to different solutions or get stuck in suboptimal solutions.
  2. Requires predefined number of clusters: You need to specify the number of clusters (K) in advance. Determining the optimal number of clusters can be challenging and may require domain knowledge or additional techniques.
  3. Assumes isotropic clusters: K-Means assumes that the clusters are spherical and have similar sizes. It struggles with clusters of different shapes, densities, or sizes.
  4. Affected by outliers: Outliers can significantly impact the clustering result. K-Means tends to assign outliers to the nearest cluster, even if they don’t belong to any specific cluster.
  5. May converge to local optima: Depending on the initial centroids and data distribution, K-Means may converge to a local optimum rather than the globally optimal solution.

Advantages of K-Means

  1. Simplicity and efficiency: K-Means is relatively simple to understand and implement. It scales well to large datasets and is computationally efficient, making it suitable for many applications.
  2. Interpretability: The resulting clusters in K-Means are easy to interpret since each data point belongs to a specific cluster. It provides meaningful insights into the structure of the data.
  3. Scalability: K-Means can handle large datasets efficiently. It has a linear time complexity with respect to the number of data points and clusters.
  4. Parallelizable: K-Means is parallelizable, meaning it can be distributed across multiple processors or machines, making it faster for larger datasets.

Best Situations to Use K-Means

  1. Unsupervised learning: When you have unlabeled data and want to discover natural groupings or patterns within it, K-Means can be a suitable choice.
  2. Data exploration: K-Means can be used to gain initial insights into the structure of the data and identify potential clusters or outliers.
  3. Customer segmentation: In marketing, K-Means can help segment customers into different groups based on their behavior, preferences, or purchase patterns.
  4. Image compression: K-Means can be employed for image compression by clustering similar colors together and reducing the color palette without significant loss of visual quality (see the sketch after this list).
  5. Anomaly detection: By considering outliers as anomalies, K-Means can help identify unusual or suspicious data points.
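
To make the image-compression use case concrete, here is a minimal color-quantization sketch (the image array is a random placeholder; in practice you would load a real image, e.g. with PIL):

import numpy as np
from sklearn.cluster import KMeans

# Assume img is an (H, W, 3) uint8 RGB array; a random placeholder here
img = np.random.randint(0, 256, size=(64, 64, 3), dtype=np.uint8)

pixels = img.reshape(-1, 3).astype(float)
km = KMeans(n_clusters=16, n_init=10, random_state=0).fit(pixels)  # 16-color palette
compressed = km.cluster_centers_[km.labels_].reshape(img.shape).astype(np.uint8)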

It’s important to note that K-Means is just one of many clustering algorithms available, and its suitability depends on the specific problem and data at hand. If the drawbacks of K-Means are significant in your scenario, you might consider exploring alternative clustering methods such as DBSCAN, hierarchical clustering, or Gaussian mixture models.

Variations of k-Means

Several variations and extensions of K-Means have been developed to address specific challenges or improve upon its limitations. Let’s explore some commonly used variations, along with their advantages and disadvantages:

K-Means++

  • Advantage: K-Means++ improves the initial centroid selection process in the standard K-Means algorithm. It does this in the following manner:
  1. Choose the first centroid randomly from the dataset.
  2. For each data point, calculate the squared distance to the nearest centroid chosen so far.
  3. Select the next centroid from the data points with probability proportional to that squared distance, so points farther from the already chosen centroids have a higher chance of being selected.
  4. Repeat steps 2 and 3 until all k centroids have been chosen.
  • Disadvantage: K-Means++ may require slightly more computational overhead during initialization compared to random initialization in the standard K-Means algorithm.
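
In scikit-learn, k-means++ is the default initialization; a minimal sketch making that choice explicit (the data is a random placeholder):

import numpy as np
from sklearn.cluster import KMeans

X = np.random.RandomState(0).rand(100, 2)  # random placeholder data
# init="k-means++" is scikit-learn's default, shown explicitly here
km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0).fit(X)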

Mini-Batch K-Means

  • Advantage: Mini-Batch K-Means is a variant that uses random subsets (mini-batches) of the training data to perform updates at each iteration. This approach reduces memory requirements and can significantly speed up the convergence of the algorithm, making it suitable for large datasets.
  • Disadvantage: Mini-Batch K-Means may sacrifice some clustering quality compared to the standard K-Means algorithm, especially if the mini-batch size is small.
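
A minimal sketch using scikit-learn's MiniBatchKMeans (the dataset and batch size are arbitrary choices for illustration):

import numpy as np
from sklearn.cluster import MiniBatchKMeans

X = np.random.RandomState(0).rand(100_000, 2)  # stand-in for a large dataset
mbk = MiniBatchKMeans(n_clusters=3, batch_size=1024, n_init=10,
                      random_state=0).fit(X)
print(mbk.cluster_centers_)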

Fuzzy C-Means

  • Advantage: Fuzzy C-Means extends K-Means to allow data points to belong to multiple clusters with different degrees of membership. It provides more flexibility in handling data points that are ambiguous or have overlapping characteristics.
  • Disadvantage: Fuzzy C-Means introduces additional complexity compared to K-Means, requiring the determination of membership degrees and potentially increasing computational requirements.
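
Fuzzy C-Means is not part of scikit-learn, but a single update step can be sketched directly in NumPy from the standard membership and centroid formulas (a simplified illustration, with the fuzzifier m defaulting to 2):

import numpy as np

def fuzzy_cmeans_step(X, centroids, m=2.0):
    # Distance of every point to every centroid: shape (n_points, n_clusters)
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    d = np.fmax(d, 1e-10)  # guard against division by zero
    # Membership u[i, j] = 1 / sum_k (d[i, j] / d[i, k]) ** (2 / (m - 1))
    u = 1.0 / np.sum((d[:, :, None] / d[:, None, :]) ** (2 / (m - 1)), axis=2)
    # New centroids are membership-weighted means of the data points
    w = u ** m
    new_centroids = (w.T @ X) / w.sum(axis=0)[:, None]
    return new_centroids, u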

K-Means with Dimensionality Reduction

  • Advantage: Combining K-Means with dimensionality reduction techniques like Principal Component Analysis (PCA) can help improve clustering performance, especially when dealing with high-dimensional data. Dimensionality reduction can help capture the most informative features and reduce noise.
  • Disadvantage: Dimensionality reduction techniques may discard some less informative features, potentially leading to the loss of valuable information for clustering.
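
A minimal PCA-then-K-Means sketch with scikit-learn (the dimensions and component counts are arbitrary for illustration):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline

X = np.random.RandomState(0).rand(500, 50)  # stand-in for high-dimensional data
pipe = make_pipeline(PCA(n_components=10),
                     KMeans(n_clusters=3, n_init=10, random_state=0))
labels = pipe.fit_predict(X)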

It’s important to note that the suitability of these variations depends on the specific problem, data characteristics, and computational constraints. Experimentation and analysis of the specific requirements of your task are essential to determine the most appropriate variant of K-Means or alternative clustering algorithms.

Implementing k-Means in Python from Scratch

Some interviewers may ask you to implement k-Means from scratch in the language of your choice. Here, we look at implementing it in Python.

import math
import random

# Euclidean distance between two points in n-dimensional space
def euclidean_distance(point1, point2):
    distance = 0
    for i in range(len(point1)):
        distance += (point1[i] - point2[i]) ** 2
    return math.sqrt(distance)

# K-Means algorithm implementation
def kmeans(data, k, max_iterations=100):
    centroids = random.sample(data, k)  # Randomly initialize centroids

    for _ in range(max_iterations):
        clusters = [[] for _ in range(k)]  # Create empty clusters

        # Assignment step: attach each data point to its nearest centroid
        for point in data:
            distances = [euclidean_distance(point, centroid) for centroid in centroids]
            nearest_centroid = distances.index(min(distances))
            clusters[nearest_centroid].append(point)

        prev_centroids = centroids.copy()  # Store previous centroids

        # Update step: move each centroid to the mean of its cluster
        for i in range(k):
            if clusters[i]:  # Skip empty clusters to avoid division by zero
                centroids[i] = [sum(feature) / len(clusters[i]) for feature in zip(*clusters[i])]

        # Convergence check: stop if the centroids did not move
        if prev_centroids == centroids:
            break

    return clusters, centroids

# Example usage
data = [
    [2, 4],
    [3, 2],
    [6, 8],
    [7, 6],
    [25, 28],
    [26, 27],
]

k = 2
max_iterations = 100

# Apply K-Means and print the clusters and centroids
clusters, centroids = kmeans(data, k, max_iterations)
print("Clusters:")
for i, cluster in enumerate(clusters):
    print(f"Cluster {i+1}: {cluster}")
print("Centroids:")
for i, centroid in enumerate(centroids):
    print(f"Centroid {i+1}: {centroid}")

In this example, we start by defining a helper function euclidean_distance that calculates the Euclidean distance between two points in n-dimensional space.

The kmeans function takes the data, the number of clusters (k), and an optional maximum number of iterations as input. It randomly initializes the centroids and then iteratively assigns data points to the nearest centroid and updates the centroids based on the mean of each cluster. The algorithm stops when either the centroids converge or the maximum number of iterations is reached.

In the example usage section, we provide some sample data points. We set the number of clusters (k) to 2 and the maximum number of iterations to 100. We call the kmeans function to obtain the clusters and centroids, and then we print them in the console.

Please note that this is a basic implementation of K-Means and may not handle certain complexities such as initialization methods or handling empty clusters. It’s recommended to further enhance the code based on additional questions your interviewer may come up with.

Final Thoughts

In this blog post, we explored the fundamentals of the K-Means clustering algorithm. We learned that K-Means is an iterative algorithm that partitions data into k clusters by minimizing the distance between data points and cluster centroids. We discussed its advantages, such as simplicity and scalability, as well as its drawbacks, including sensitivity to initial centroids and the need to specify the number of clusters.

Some of the typical questions you can expect in interviews around the above concepts are:

  • Explain how k-Means works. Can you code it in Python?
  • What would the result of the following <sample input> look like?
  • When would you use k-Means? Why?
  • How do you determine the value of k?
  • What are k-Means’ drawbacks? How would you address XYZ drawback?

With this you are one step closer to being prepped for your DS/ML interview! All the best.

Credits

This post was written with help from ChatGPT. Some of the prompts used are:

What other variations of k-Means are commonly used in machine learning? Explain the advantages and disadvantages of each.

Summarize this entire chat in 2–3 lines as the ending to a blog post on k-means

