Introduction to k-Nearest Neighbors (kNN) | Python Code | Machine Learning

Preethi Thakur
10 min read · Oct 14, 2022


The k-nearest neighbors (kNN) algorithm is a non-parametric, supervised learning classifier that uses proximity to make classifications or predictions about the grouping of an individual data point. While it can be used for either regression or classification problems, it is typically used as a classification algorithm, working off the assumption that similar points can be found near one another.

[Figure: how a kNN classifier works.]

Intuition

The kNN algorithm is different from most other machine learning algorithms. Most ML models have parameters that are estimated during training, but kNN does its computation at prediction time rather than during training. Why? Because of the intuition behind the algorithm, which is,

“When a new data point arrives, the kNN algorithm starts by finding the nearest neighbors of this new data point. It then takes the target values of these neighbors and uses them to make a prediction for the new data point.”

From this intuition we can say that the kNN algorithm is:

Instance-based learning: The model memorizes the entire set of training instances and then generalizes to new data by using a similarity measure (such as Euclidean distance) to compare new points to the learned instances. In this blog we will learn how this happens.

Lazy learning: The model does nothing during the training phase; it simply memorizes the training dataset. Only at prediction time does it search the entire training set for the nearest neighbors, which makes prediction expensive.

Non-parametric: It does not make any assumptions about the underlying data distribution (for example, linear regression assumes the data follows a linear relationship).

kNN for Classification

For classification problems, when we want to predict the class of a data point, the algorithm queries the k points that are closest to it and assigns the class label by majority voting: the most frequent label among the neighbors becomes the predicted label.

Look at the example below: we take the 7 nearest neighbors (k = 7), which belong to Class A and Class B. The majority of the neighbors belong to Class B, so the data point is labelled as Class B.
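To make the voting step concrete, here is a minimal sketch with a made-up toy dataset (the points, labels, and k value are purely illustrative):

import numpy as np
from collections import Counter

# hypothetical toy data: 2-D points with class labels "A" or "B"
X_train = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
                    [6.0, 9.0], [1.2, 0.5], [5.5, 8.5], [6.2, 9.5]])
y_train = np.array(["A", "A", "B", "B", "A", "B", "B"])

def knn_classify(query, X, y, k=3):
    distances = np.linalg.norm(X - query, axis=1)  # Euclidean distance to every training point
    nearest = np.argsort(distances)[:k]            # indices of the k closest points
    return Counter(y[nearest]).most_common(1)[0][0]  # majority vote over their labels

print(knn_classify(np.array([5.8, 8.8]), X_train, y_train, k=3))  # -> "B"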

kNN for Regression

Regression problems use a similar concept: the algorithm queries the k points closest to the data point and returns the average of their target values as the prediction. The main distinction is that classification is used for discrete values, whereas regression is used for continuous values.
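A minimal sketch of the regression case, again with made-up data, simply replaces the vote with an average of the neighbors’ target values:

import numpy as np

# hypothetical training data: a 1-D feature with a continuous target
X_train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y_train = np.array([1.2, 1.9, 3.2, 3.9, 5.1])

def knn_regress(query, X, y, k=3):
    distances = np.linalg.norm(X - query, axis=1)  # distance to every training point
    nearest = np.argsort(distances)[:k]            # k closest points
    return y[nearest].mean()                       # average their targets

print(knn_regress(np.array([2.5]), X_train, y_train, k=3))  # ~2.1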

kNN Mathematical Description

Now we know that to predict the class label of a given query point we need to identify its nearest neighbors. But how do we do this? You might have guessed already: we calculate the distance. There are a few distance metrics we can use to calculate the distance between the query point and the other data points. These distance metrics help to form decision boundaries that divide the data points on the basis of their class labels.

  • Euclidean distance: This is the most commonly used distance measure, and it is limited to real-valued vectors. Using the formula below, it measures the straight-line distance between the query point and the other point being measured.

Euclidean distance is the straight-line distance between two points. The points can lie in a space of any number of dimensions and are represented by their coordinates.

For two points p = (x₁, y₁) and q = (x₂, y₂) in the plane, the Euclidean distance is d(p, q) = √((x₂ − x₁)² + (y₂ − y₁)²), that is, the square root of the sum of the squares of the coordinate differences (x₂ − x₁) and (y₂ − y₁). The Euclidean distance is simply the length of the difference vector, which is why it is also referred to as the norm, or Euclidean norm.

  • Manhattan distance: This is another popular distance metric, which sums the absolute differences between the coordinates of two points: d(p, q) = |x₂ − x₁| + |y₂ − y₁|, or more generally Σᵢ |pᵢ − qᵢ|.
  • Minkowski distance: This distance measure is the generalized form of the Euclidean and Manhattan distance metrics: d(a, b) = (Σᵢ |aᵢ − bᵢ|^p)^(1/p). The parameter p allows for the creation of other distance metrics: Euclidean distance is recovered when p is equal to two, and Manhattan distance when p is equal to one.
  • p=1: Manhattan Distance (l1 norm)
  • p=2: Euclidean Distance (l2 norm)

  • Hamming distance: This technique is typically used with Boolean or string vectors (categorical variables), counting the positions where the two vectors do not match. As a result, it has also been referred to as the overlap metric. For two vectors a and b of length n, it can be written as d(a, b) = Σᵢ 1[aᵢ ≠ bᵢ], i.e. the number of mismatching positions.
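To make the four metrics concrete, here is a small sketch that computes each of them with NumPy for two hypothetical vectors (the values are arbitrary):

import numpy as np

a = np.array([1.0, 2.0, 3.0])   # example numerical vectors
b = np.array([4.0, 0.0, 3.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))          # l2 norm of the difference
manhattan = np.sum(np.abs(a - b))                  # l1 norm of the difference
p = 3
minkowski = np.sum(np.abs(a - b) ** p) ** (1 / p)  # generalizes both (p=1, p=2)

u = np.array(["red", "green", "blue"])             # categorical vectors
v = np.array(["red", "blue", "blue"])
hamming = np.sum(u != v)                           # number of mismatching positions

print(euclidean, manhattan, minkowski, hamming)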

Choosing optimal value of k

There isn’t a single prescribed way to determine the best value of k; you need to experiment with a few values before deciding which one to go forward with. Let’s look at the common ways of choosing an optimal k.

  • One common heuristic is to set k to roughly the square root of N, the number of samples in the training set.
  • Another way is to treat a part of the training samples as “unknown”: categorize this held-out test set with the kNN algorithm and analyze how good the categorization is by comparing it with the labels we already have. A cross-validation sketch along these lines follows this list.
  • When dealing with a binary (two-class) problem, choose an odd value for k. What would you do if the neighbors were split evenly between the two classes (50–50)? It would be difficult to predict the class label, so avoid even values of k in such cases.
  • Low values of k (k = 1 or k = 2) can be noisy and subject to the effects of outliers, and the chance of overfitting is also high in such cases.
  • Larger values of k lead to smoother decision boundaries, but k shouldn’t be too large; otherwise, groups with fewer data points will always be outvoted by larger groups. Moreover, a larger k is computationally more expensive.
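As a sketch of the cross-validation approach, we can let scikit-learn search over a grid of k values. This assumes a feature matrix X and label vector y are already loaded; the 1–25 range and 5 folds are arbitrary choices:

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# assumes X (features) and y (labels) are already prepared
param_grid = {"n_neighbors": list(range(1, 26))}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_)  # the k value with the best cross-validated accuracy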

Standardization

One major drawback of calculating distance measures directly from the training set arises when variables have different measurement scales or when there is a mixture of numerical and categorical variables. For example, if one variable is annual income in dollars and another is age in years, income will have a much larger influence on the computed distance. One solution is to standardize the training set using the z-score, x_std = (x − mean) / standard deviation, as sketched below.
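A minimal sketch of z-score standardization on a made-up feature matrix (this is the same transformation StandardScaler applies later in this post):

import numpy as np

# hypothetical raw features: age in years, income in dollars
X = np.array([[25, 40_000],
              [35, 60_000],
              [45, 120_000]], dtype=float)

# subtract the column mean, divide by the column standard deviation
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_std)  # every column now has mean 0 and standard deviation 1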

kNN Advantages and Disadvantages

Advantages

  1. Easy to implement: Given the algorithm’s simplicity and accuracy, this is a great model for many machine learning use cases that don’t require highly complex techniques.
  2. Adapts easily: As new training samples are added, the algorithm adjusts to account for the new data, since all training data is stored in memory.
  3. Few hyperparameters: kNN only requires a value of k and a distance metric, so it is very quick to develop.

Disadvantages

  1. Does not scale well: Since KNN is a lazy algorithm, it takes up more memory and data storage compared to other classifiers. This can be costly from both a time and money perspective. More memory and storage will drive up business expenses and more data can take longer to compute.
  2. Curse of dimensionality: The kNN algorithm tends to fall victim to the curse of dimensionality, which means that it doesn’t perform well with high-dimensional inputs. This is sometimes referred to as the peaking phenomenon: after the algorithm attains the optimal number of features, additional features increase the number of classification errors, especially when the sample size is small.
  3. Prone to overfitting: Due to the “curse of dimensionality”, KNN is also more prone to overfitting. While feature selection and dimensionality reduction techniques are leveraged to prevent this from occurring, the value of k can also impact the model’s behavior. Lower values of k can overfit the data, whereas higher values of k tend to “smooth out” the prediction values since it is averaging the values over a greater area, or neighborhood. However, if the value of k is too high, then it can underfit the data.
  4. Performance drop: kNN is less likely to perform well on advanced tasks like Computer Vision and Natural Language Processing (NLP). You can try to push the performance of kNN with techniques such as bagging, but at a certain level of complexity kNN will probably be less effective than other models regardless of how it is tuned.

Implementation with Python

  1. The first step is to import the necessary Python libraries and read the dataset.
#importing necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
#read dataset
df = pd.read_csv("../dataset.csv")
#size
df.shape
#output: (400, 5)
df.head()

Our dataset has 400 samples and 5 columns.

2. Now let’s separate the predictors (input features X) and the response (target y). As predictors X we take the two numerical columns, Age and Estimated Salary, and as the response y we take the Purchased column.

X = df.iloc[:,2:-1].values
y = df.iloc[:,-1].values

With this code, X and y are NumPy arrays. Why arrays? In the mathematical description above you can see that each data point is treated as a vector, and distances are calculated between these vectors.

3. Now we need to split the data into train and test sets. The train set has 80% of the samples (320) and the test set has 20% (80).

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=43)
X_train.shape
#output: (320, 2)
y_train.shape
#output: (320,)
X_test.shape
#output: (80, 2)
y_test.shape
#output: (80,)

Data preprocessing — Feature Scaling

4. The feature values vary a lot; in particular, the Estimated Salary values are much larger than the Age values. This distorts the Euclidean distance and hurts performance by giving more weight to the variable with the larger norm (magnitude). We need to perform feature scaling.

Let’s scale the values using StandardScaler() from scikit-learn library.

trf = StandardScaler()
X_train_trf = trf.fit_transform(X_train)
X_test_trf = trf.transform(X_test)

Finding the k Nearest Neighbors

5. Let’s train the kNN classifier and check the accuracy score of the model. Before going ahead we need to understand the hyperparameter n_neighbors: it is the value of k, i.e. how many nearest neighbors we want to compare the new data point with.

knn = KNeighborsClassifier(n_neighbors=20)
knn.fit(X_train_trf, y_train)
y_pred = knn.predict(X_test_trf)
print("Accuracy: ", accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

Here I took 20 neighbors, checked the model’s accuracy, and got 0.83 (83%), which is not bad. This will not always hold on larger datasets, where the accuracy score might fall much lower. The confusion matrix tells us how many right and wrong predictions the model made (43 and 24 are correct predictions for the two classes; 7 and 6 are wrong predictions).
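If you prefer named counts, scikit-learn’s binary confusion matrix can be unpacked directly. A small sketch, reusing y_test and y_pred from above:

# rows are true classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print("correct:", tn + tp, "wrong:", fp + fn)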

I took k = 20, but it is not ideal to pick an even value of k, because there is a chance of a tie while voting. What if 10 neighbors belong to one class and the other 10 belong to the other class?

But we can improve this score by choosing the right value of k. Let’s see how.

error_train = []
error_test = []
for i in range(1, 26):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train_trf, y_train)
    cm_train = confusion_matrix(y_train, knn.predict(X_train_trf))
    cm_test = confusion_matrix(y_test, knn.predict(X_test_trf))
    # error rate = misclassified samples (off-diagonal counts) / total samples
    error_train.append((cm_train[0][1] + cm_train[1][0]) / cm_train.sum())
    error_test.append((cm_test[0][1] + cm_test[1][0]) / cm_test.sum())
# plot training and test error against k
plt.plot(range(1, 26), error_train, label='training error rate')
plt.plot(range(1, 26), error_test, label='test/validation error rate')
plt.xlabel('K Value')
plt.ylabel('Error')
plt.legend()
plt.show()

Here we compute the error rate on the train and test sets for different values of k. Plotting the output, we get:

It looks like the test error is lowest, i.e. the accuracy is highest, near k = 11.
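Instead of eyeballing the plot, we can also read off the best k programmatically from the error_test list built above (a small sketch):

best_k = int(np.argmin(error_test)) + 1  # +1 because the loop started at k = 1
print("best k:", best_k)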

Prediction

6. Now for our final prediction, let’s update the n_neighbors value to k = 11 in our model and check the accuracy score and confusion matrix once again.

knn = KNeighborsClassifier(n_neighbors=11)
knn.fit(X_train_trf, y_train)
y_pred = knn.predict(X_test_trf)
print("Accuracy: ", accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

There you go! We increased the accuracy score by 0.02 (a 2% improvement), and the number of correct predictions increased as well.

Conclusion

This was a basic approach to the kNN model. It takes only a few steps to move from a basic kNN model to a fully tuned one, but the performance increase is worth it.

In this tutorial we have:

  • Understood the intuition and mathematical foundations behind the kNN algorithm.
  • Used scikit-learn for data preprocessing and to fit a kNN classifier.
