Data Scientists’ Interview Guide: k-Nearest Neighbor

Common interview questions asked around kNN such as its pros/cons, when to use it, variations of the simple kNN, and how to code it from scratch in Python

Karun Thankachan
CodeX
10 min read · May 12, 2023



As part of data science interviews, you will typically face a machine learning round that tests your understanding of basic ML algorithms such as Linear Regression, Logistic Regression, SVM, etc. In this post we cover one of the most commonly asked ones, kNN: how to explain it in simple terms during an interview, its pros and cons, the best situations to use it, and its common variations.

What is kNN?

kNN stands for “k-Nearest Neighbors” and it’s a simple machine learning algorithm used for classification or regression tasks. The basic idea is to find the ‘k’ closest data points in the training set to a given test data point and use the labels of those closest points to make a prediction for the test point.

Let me give you an example to make things clearer. Let's say we have a dataset of 50 flowers, where each flower has two features: petal length and petal width. We want to classify a new flower based on its petal length and width using kNN.

Suppose we have a new flower with a petal length of 5.2 and a petal width of 1.8, and we want to classify it using the kNN algorithm with k=3. Here's how we would go about it:

1. Calculate the distance between the new flower and each flower in the dataset. You can use different distance metrics (covered in the next section). In this example we will use Euclidean distance, which is just the square root of the sum of the squared differences between the feature values of the two flowers.

2. Select the k nearest neighbors based on the calculated distances. In this case, since k=3, we would select the three nearest neighbors.

3. Determine the class of the new flower based on the majority class of the k nearest neighbors. In this case, since all three of the nearest neighbors are of the class Virginica, we would classify the new flower as Virginica.

That’s the basic idea behind kNN. Of course, in practice, we would want to use a larger dataset with more features, and we would want to tune the value of k (hyperparameter tuning using a validation set) to achieve the best classification accuracy.

The same can be applied to regression as well, where you would instead take the mean of the k closest neighbors' values.
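If the interviewer allows libraries, the same workflow can be sketched with scikit-learn. This is a minimal sketch on made-up petal measurements (not the 50-flower dataset above); only the fit/predict pattern matters.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Illustrative training data: [petal length, petal width] and class labels
X_train = np.array([[1.4, 0.2], [4.7, 1.4], [5.1, 1.9], [5.9, 2.1], [1.3, 0.3], [4.2, 1.3]])
y_train = np.array(["Setosa", "Versicolor", "Virginica", "Virginica", "Setosa", "Versicolor"])

# k=3 neighbors, Euclidean distance (scikit-learn's default metric)
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# Classify the new flower from the example above
print(knn.predict([[5.2, 1.8]]))  # majority vote of the 3 closest flowers -> ['Virginica']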

Distance Metrics

As mentioned before, when using kNN there are several distance metrics that can be used to measure the similarity between data points. Let's take a closer look at the most commonly used ones.

Euclidean distance
It's the most commonly used distance metric in kNN. It measures the straight-line distance between two points in Euclidean space. In other words, it measures the shortest distance between two points as if you were drawing a line between them. The formula for Euclidean distance between two points A and B with n dimensions can be expressed as:

d(A,B) = sqrt((A1-B1)² + (A2-B2)² + … + (An-Bn)²)

where A1, A2, …, An and B1, B2, …, Bn are the values of the n dimensions of points A and B, respectively.

Advantages —
1. Euclidean distance is easy to compute and widely used in many applications.
2. It is sensitive to differences in all dimensions, not just some of them, so it can accurately represent 'similarity'.

Disadvantages —
It is affected by the scale of the features. Features with larger values can dominate the distance calculation.
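For reference, here is a minimal NumPy sketch of the formula above (the point values are arbitrary):

import numpy as np

A = np.array([5.2, 1.8])
B = np.array([5.1, 1.9])

# square root of the sum of squared per-dimension differences
euclidean = np.sqrt(np.sum((A - B) ** 2))
print(euclidean)  # ~0.141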

Manhattan distance
Also known as taxicab distance or city block distance, Manhattan distance measures the distance between two points by summing the absolute differences of their coordinates. It is called taxicab distance because it is like calculating the distance between two points on a grid-like city block system.


The formula for Manhattan distance between two points A and B with n dimensions can be expressed as:

d(A,B) = |A1-B1| + |A2-B2| + … + |An-Bn|

where A1, A2, …, An and B1, B2, …, Bn are the values of the n dimensions of points A and B, respectively.

Advantages —
Manhattan distance is also easy to compute and works well with datasets that have high dimensionality.

Disadvantages —
It does not capture the straight-line distance between two points, only the sum of the absolute differences of their coordinates.
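The same two illustrative points as before, measured with Manhattan distance in NumPy:

import numpy as np

A = np.array([5.2, 1.8])
B = np.array([5.1, 1.9])

# sum of absolute per-dimension differences
manhattan = np.sum(np.abs(A - B))
print(manhattan)  # ~0.2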

Minkowski distance
Minkowski distance is a generalized distance metric that includes both Euclidean distance and Manhattan distance as special cases. The formula for Minkowski distance between two points A and B with n dimensions can be expressed as:

d(A,B) = (|A1-B1|^p + |A2-B2|^p + … + |An-Bn|^p)^(1/p)

where A1, A2, …, An and B1, B2, …, Bn are the values of the n dimensions of points A and B, respectively, and p is a parameter that determines the order of the distance metric. When p=1, Minkowski distance is the same as Manhattan distance. When p=2, Minkowski distance is the same as Euclidean distance.

Advantages —
Minkowski distance allows you to control the order of the distance metric based on the nature of the problem.

Disadvantages —
It requires you to select an appropriate value for the parameter p.
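A short sketch showing how Minkowski distance reduces to the two earlier metrics (minkowski here is a hand-rolled helper written for illustration, not a library call):

import numpy as np

def minkowski(a, b, p):
    # (sum of |a_i - b_i|^p) ^ (1/p)
    return np.sum(np.abs(a - b) ** p) ** (1 / p)

A = np.array([5.2, 1.8])
B = np.array([5.1, 1.9])

print(minkowski(A, B, p=1))  # matches Manhattan distance (~0.2)
print(minkowski(A, B, p=2))  # matches Euclidean distance (~0.141)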

Overall, the choice of distance metric in kNN depends on the nature of the problem and the data. Some tips:

  1. Euclidean distance is a good default choice for continuous data. It works well when the data is dense and the differences between features are important.
  2. Manhattan distance is a good choice when the data has many outliers or when the scale of the features is different. For example, if we are comparing distances between two cities, the distance metric should not be affected by the difference in elevation or terrain between the cities.
  3. Minkowski distance with p=1 is equivalent to Manhattan distance, and Minkowski distance with p=2 is equivalent to Euclidean distance. So, if you are unsure which distance metric to use, you could try experimenting with different values of p in the Minkowski distance.

You may need to try different distance metrics and see which one gives the best results.

When to use kNN?

Like any machine learning algorithm, kNN has its own strengths and weaknesses. Let’s take a look at some of the drawbacks, advantages, and ideal use cases for kNN:

Drawbacks
- kNN can be sensitive to the choice of distance metric used to calculate the distances between data points. Different distance metrics may yield different results for the same dataset.
- It can also be sensitive to the choice of k, the number of nearest neighbors to consider. Choosing k too small may lead to overfitting, while choosing k too large may lead to underfitting.
- kNN can be computationally expensive, especially for large datasets, since it involves calculating distances between the query point and all data points in the dataset.
- It may not work well with high-dimensional data, since the curse of dimensionality can cause the distances between data points to become very similar, making it hard to identify the k-nearest neighbors.

Advantages
- kNN is a simple and intuitive algorithm that is easy to understand and implement.
- It can work well with both binary and multi-class classification problems. It can also be used for both classification and regression problems.
- It can handle both linear and nonlinear decision boundaries.
- It doesn’t require any assumptions about the underlying distribution of the data.

Ideal use cases
- kNN is best suited for small to medium-sized datasets with relatively low dimensionality.
- It can be useful in situations where the decision boundary is highly irregular or nonlinear.
- It can be effective in cases where the data is clustered or has distinct groups.
- It can be used as a baseline algorithm to compare the performance of other, more complex models.

In general, kNN is a useful and versatile algorithm that can be a good starting point for many machine learning problems. However, it’s important to keep in mind its limitations and drawbacks, and to carefully choose the distance metric and value of k based on the nature of the problem and the data.

Variations of kNN

Besides the standard kNN algorithm, there are several variations of kNN that are commonly used in machine learning to combat different shortcomings of the traditional kNN discussed thus far. Let’s take a look at some of these variations, along with their advantages and disadvantages:

Weighted kNN
Instead of just considering the k-nearest neighbors equally, we assign weights to them based on their distance from the query point. Closer neighbors are assigned higher weights, and farther neighbors are assigned lower weights. This way, we can give more importance to the closest neighbors while still considering the influence of further neighbors.

Advantages —
Weighted kNN can give more accurate results than standard kNN, especially when the data has a lot of noise or outliers.
It can handle imbalanced datasets, where some classes have much fewer instances than others, by assigning higher weights to the instances of the minority class.

Disadvantages —
It can be more computationally expensive than standard kNN, since it involves calculating weights for each of the k-nearest neighbors.
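One common scheme is inverse-distance weighting, where each neighbor's vote counts as 1/distance. Below is a minimal sketch with a hypothetical helper function and toy data; scikit-learn offers the same idea out of the box via KNeighborsClassifier(weights='distance').

import numpy as np

def weighted_knn_predict(X_train, y_train, x_test, k=3, eps=1e-9):
    # Euclidean distances from the query point to every training point
    distances = np.sqrt(np.sum((X_train - x_test) ** 2, axis=1))
    k_idx = np.argsort(distances)[:k]

    # Accumulate inverse-distance weights per class label
    votes = {}
    for i in k_idx:
        weight = 1.0 / (distances[i] + eps)  # eps guards against division by zero
        votes[y_train[i]] = votes.get(y_train[i], 0.0) + weight

    # Return the label with the largest total weight
    return max(votes, key=votes.get)

X_train = np.array([[1.4, 0.2], [4.7, 1.4], [5.1, 1.9], [5.9, 2.1]])
y_train = np.array(["Setosa", "Versicolor", "Virginica", "Virginica"])
print(weighted_knn_predict(X_train, y_train, np.array([5.2, 1.8])))  # -> Virginica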

Ball Tree kNN
In standard kNN, we calculate the distances between the query point and all data points in the dataset, which can be computationally expensive when the dataset is large. Ball Tree kNN addresses this problem by using a data structure called a ball tree to efficiently find the k-nearest neighbors. The ball tree partitions the data space into a tree of nested hyperspheres, which allows for faster searching of the nearest neighbors.

Advantages —
Ball Tree kNN can be much faster than standard kNN, especially for high-dimensional datasets.
It can handle datasets with non-uniform distributions, where the density of data points varies across the dataset.

Disadvantages —
Building the ball tree can be computationally expensive for large datasets.
The accuracy of the algorithm can be affected by the choice of the ball tree parameters.
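In practice you rarely build the ball tree yourself; scikit-learn exposes it both directly and through the algorithm parameter of its kNN estimator. A minimal sketch on synthetic data:

import numpy as np
from sklearn.neighbors import BallTree

rng = np.random.default_rng(0)
X_train = rng.random((1000, 5))  # 1000 synthetic points with 5 features

# Build the ball tree once; leaf_size is one of the tunable parameters
tree = BallTree(X_train, leaf_size=40)

# Query the 3 nearest neighbors of a new point
distances, indices = tree.query(rng.random((1, 5)), k=3)
print(indices)  # indices into X_train of the 3 closest points

# Equivalent route through the classifier:
# KNeighborsClassifier(n_neighbors=3, algorithm='ball_tree')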

Radius kNN
In radius kNN, instead of finding the k-nearest neighbors, we find all data points within a certain radius around the query point. This can be useful in cases where we want to find all the data points that are similar to the query point, rather than just the k-most similar ones.

Advantages —
Radius kNN can be useful when we don’t know how many neighbors we want to consider, or when we want to find all the neighbors within a certain distance.
It can be faster than standard kNN when the dataset is sparse, meaning that there are large areas of empty space between data points.

Disadvantages —
The choice of radius can be tricky, since if the radius is too small, we may miss some important neighbors, and if it’s too large, we may include too many irrelevant neighbors.
It can be computationally expensive to find all the data points within a certain radius, especially for high-dimensional datasets.
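scikit-learn implements this idea as RadiusNeighborsClassifier; here is a minimal sketch on toy data (the radius value is arbitrary):

from sklearn.neighbors import RadiusNeighborsClassifier

X_train = [[1.4, 0.2], [4.7, 1.4], [5.1, 1.9], [5.9, 2.1]]
y_train = ["Setosa", "Versicolor", "Virginica", "Virginica"]

# Every training point within distance 1.0 of the query gets a vote
rnn = RadiusNeighborsClassifier(radius=1.0)
rnn.fit(X_train, y_train)
print(rnn.predict([[5.2, 1.8]]))  # majority vote over neighbors inside the radius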

Overall, for interviews, remember that kNN can be modified quite a bit to combat any drawback the interviewer may surface. The final choice of kNN variant depends on the nature of the problem and the data.

Implementing kNN in Python from Scratch

The final question commonly asked in data science interviews is: Can you implement kNN from scratch? So let's look at how to do this.

import numpy as np

class KNN:
    def __init__(self, k):
        self.k = k

    def fit(self, X, y):
        # kNN is a lazy learner: fitting just stores the training data
        self.X_train = X
        self.y_train = y

    def predict(self, X):
        y_pred = np.zeros(X.shape[0])

        for i, x_test in enumerate(X):
            # Calculate distances between x_test and all training examples
            # (distance metric: Euclidean)
            distances = np.sqrt(np.sum((self.X_train - x_test) ** 2, axis=1))

            # Get indices of the k-nearest neighbors
            k_indices = np.argsort(distances)[:self.k]

            # Get labels of the k-nearest neighbors
            k_labels = self.y_train[k_indices]

            # Assign the most common label to y_pred[i]
            # (np.bincount requires non-negative integer labels)
            y_pred[i] = np.bincount(k_labels.astype(int)).argmax()

        return y_pred

Let’s break down the code:

  • We define a class called KNN, which takes in the value of k as a parameter during initialization.
  • We define a fit() method, which takes in the training data X and labels y and saves them as instance variables.
  • We define a predict() method, which takes in a test set X and returns a list of predicted labels.
  • Inside the predict() method, we initialize a numpy array y_pred to hold the predicted labels.
  • We loop over each test example x_test in X and calculate the distances between x_test and all training examples using the Euclidean distance metric.
  • We get the indices of the k-nearest neighbors using np.argsort(), and then get their labels from the training set.
  • We use np.bincount() to count the occurrences of each label in the k-nearest neighbors, and then assign the most common label to y_pred[i].
  • Finally, we return the y_pred array.
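As a quick sanity check, here is how the class might be used on toy data (feature values are made up; integer-encoded labels are assumed, since np.bincount only accepts non-negative integers):

X_train = np.array([[1.4, 0.2], [4.7, 1.4], [5.1, 1.9], [5.9, 2.1]])
y_train = np.array([0, 1, 2, 2])  # integer-encoded classes

knn = KNN(k=3)
knn.fit(X_train, y_train)
print(knn.predict(np.array([[5.2, 1.8]])))  # expected: [2.]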

Note that this is a very basic implementation of the kNN algorithm, and there are many ways to optimize it and improve its performance based on any follow-up the interviewer has.

Conclusion

K-Nearest Neighbors (kNN) is a simple and intuitive machine learning algorithm that can be used for both classification and regression problems. The algorithm works by finding the k-nearest data points in the training set to a query point, and then assigning a label to the query point based on the majority vote of the k-nearest neighbors.

One of the main advantages of kNN is its simplicity and flexibility. However, kNN also has its limitations like being computationally expensive, especially for large datasets, and may not work well with high-dimensional data due to the curse of dimensionality.

Overall, kNN can be a useful baseline algorithm for many machine learning problems, but it’s important to carefully consider its strengths and weaknesses before using it in practice. With that you are one step closer to being prepped for your next Data Science Interview.

Credits

This post was written with help from ChatGPT. Some of the prompts used are:

Explain the different distance metrics that can be used in kNN. Include the math. Mention the advantages and disadvantages of each.

What other variations of kNN are commonly used in machine learning? Explain the advantages and disadvantages of each.
