Understanding the KNN-Algorithm

Nitin · Published in Analytics Vidhya · May 4, 2020 · 5 min read

KNN is a supervised learning algorithm that can be used for both regression and classification problems, though it is mostly used for classification. KNN predicts the class of a test point by calculating the distance between that point and every point in the training data, and then selecting the k training points closest to it.

  1. We are given a training dataset, plus a new test point that we need to assign to one of the two classes.

  2. The k-NN algorithm calculates the distance between the test point and every point in the training data.

  3. After calculating the distances, it selects the k training points nearest to the test point. Let’s assume the value of k is 3 for our example.

So here k = 3, green = 2 and red = 1, and KNN will classify the test point as green. This is a classification example; for regression, the predicted value is the mean of the k selected points. This approach, in which the distance to every training point is computed, is known as brute-force KNN. There are many ways of calculating the distance, such as Euclidean, Manhattan and Hamming.
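Putting the three steps together, a minimal brute-force sketch in Python (an illustration written for this explanation, not the author's code) could look like the following. It assumes small NumPy arrays for the training data and uses Euclidean distance:

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=3, regression=False):
    # 1. distance from the test point to every training point (brute force)
    distances = np.sqrt(((X_train - x_test) ** 2).sum(axis=1))
    # 2. indices of the k nearest training points
    nearest = np.argsort(distances)[:k]
    # 3. majority vote for classification, mean for regression
    if regression:
        return y_train[nearest].mean()
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[1, 2], [2, 3], [3, 3], [8, 8], [9, 9]])
y_train = np.array(['green', 'green', 'red', 'red', 'red'])
print(knn_predict(X_train, y_train, np.array([2, 2]), k=3))  # 'green' (2 green vs 1 red among the 3 nearest)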

Euclidean Distance:

  • Most Common method.
  • Consider two points P(p1, p2) and Q(q1, q2). The Euclidean distance is ED(p, q) = sqrt[(q1 - p1)² + (q2 - p2)²]

Manhattan Distance:

  • The distance is the sum of the absolute differences between the corresponding coordinates of the two points (vectors).
  • MD(p, q) = |q1 - p1| + |q2 - p2|
  • Less influenced by outliers than Euclidean distance, and often preferred for very high-dimensional datasets.
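As a quick illustration (not part of the original article), SciPy exposes these metrics directly; the toy points below are made up just for this example:

from scipy.spatial.distance import euclidean, cityblock, hamming

p, q = [1, 2], [4, 6]
print(euclidean(p, q))                # 5.0 -> sqrt((4-1)**2 + (6-2)**2)
print(cityblock(p, q))                # 7   -> |4-1| + |6-2| (Manhattan)
print(hamming([1, 0, 1], [1, 1, 0]))  # 0.666... -> fraction of positions that differ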

Brute-force KNN is computationally very expensive, so there are alternative search structures such as the kd-tree and the Ball Tree, which avoid computing every distance and scale better to larger datasets. The Ball Tree in particular tends to cope better than the kd-tree as the number of dimensions grows, although both eventually degrade toward brute force in very high dimensions.
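In scikit-learn the choice of search structure is controlled by the algorithm parameter of KNeighborsClassifier; a minimal, illustrative sketch:

from sklearn.neighbors import KNeighborsClassifier

# 'brute' computes the distance to every training point;
# 'kd_tree' and 'ball_tree' partition the training data so fewer distances are needed
knn_brute = KNeighborsClassifier(n_neighbors=5, algorithm='brute')
knn_kd = KNeighborsClassifier(n_neighbors=5, algorithm='kd_tree')
knn_ball = KNeighborsClassifier(n_neighbors=5, algorithm='ball_tree')
# algorithm='auto' (the default) lets scikit-learn pick one based on the data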

Choosing the Value of K:

  • A low k-value gives high variance and low bias. With a very low k there is a chance of overfitting the data.
  • A high k-value gives lower variance and higher bias. With a very high k there is a chance of underfitting the data. A quick way to see this trade-off is sketched below.
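Here is one way to see the trade-off, using a synthetic dataset (an illustration only, not the diabetes data used later): compare train and test accuracy over a range of k values.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

for k in (1, 3, 5, 11, 25, 51):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    # a large gap between train and test accuracy suggests overfitting (k too small);
    # both scores falling together suggests underfitting (k too large)
    print(k, round(knn.score(X_train, y_train), 3), round(knn.score(X_test, y_test), 3))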

Is the KNN algorithm a lazy learner? Yes, it is. Algorithms such as SVM, logistic regression and naive Bayes are eager learners: they generalize from the training dataset before ever seeing the test dataset. In other words, they build a model from the training data and then make predictions on the test data. KNN, being a lazy learner, waits for the test dataset; only when a test point arrives does it do its work against the training data. Lazy algorithms work less while training and more while testing.
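A rough way to see this laziness (an illustration on synthetic data, not from the original article) is to time fit against predict; fitting mostly just stores the training data, while predicting does the distance work.

import time
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=20000, n_features=20, random_state=0)
knn = KNeighborsClassifier(n_neighbors=5)

t0 = time.time()
knn.fit(X, y)        # near-instant: mainly stores X and y
t1 = time.time()
knn.predict(X)       # this is where the distance computations happen
t2 = time.time()
print("fit: %.3fs, predict: %.3fs" % (t1 - t0, t2 - t1))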

I am going to show you an example of the KNN algorithm using the Diabetes dataset from the UCI Repository. The steps I am going to perform are:

  1. Data Cleaning
  2. Data Visualization
  3. Model Creation
  4. Selecting a k-value using GridSearchCV
  5. Model Validation using cross_val_score
#importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split,cross_val_score,GridSearchCV
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.metrics import accuracy_score
%matplotlib inline
#reading data
data=pd.read_csv('diabetes.csv')
data.head()

There are no missing values in the dataset; however, there are outliers present and some of the variables are skewed.

[Figure: checking the distribution of the variables]
[Figure: outliers]
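The checks behind these observations are not shown in the article; reusing the imports and the data DataFrame loaded above, one possible version is:

print(data.isnull().sum())    # confirm there are no missing values
print(data.skew())            # skewness of each variable
data.hist(figsize=(10, 8))    # distribution of the variables
plt.show()
sns.boxplot(data=data)        # boxplots make the outliers visible
plt.xticks(rotation=45)
plt.show()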

After cleaning the dataset and removing the outliers, we proceed to building the KNN model. First we split the data into train and test sets, then train the model and make predictions. Then we perform a grid search to find the best parameters and check whether our initial model was overfitting.
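The cleaning code itself is not reproduced here; the split below expects df (the cleaned, scaled features) and y (the label). One possible way to produce them, assuming the usual 'Outcome' label column and an IQR-based outlier clip followed by StandardScaler (not necessarily exactly what was done for the scores shown below), is:

# separate features and target
y = data['Outcome']
features = data.drop(columns=['Outcome'])

# clip outliers to the 1.5 * IQR whiskers, column by column
q1, q3 = features.quantile(0.25), features.quantile(0.75)
iqr = q3 - q1
features = features.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr, axis=1)

# scale the features, keeping a DataFrame so the code below can use df
scaler = StandardScaler()
df = pd.DataFrame(scaler.fit_transform(features), columns=features.columns)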

x_train, x_test, y_train, y_test = train_test_split(df, y, test_size=0.3, random_state=42)
knn = KNeighborsClassifier()
knn.fit(x_train, y_train)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')
y_pred = knn.predict(x_test)
knn.score(x_train, y_train)
0.8435754189944135
print("The accuracy score is : ", accuracy_score(y_test,y_pred))
The accuracy score is : 0.7272727272727273
param_grid = {'algorithm': ['ball_tree', 'kd_tree', 'brute'],
              'leaf_size': [18, 20, 25, 27, 30, 32, 34],
              'n_neighbors': [3, 5, 7, 9, 10, 11, 12, 13]}
gridsearch = GridSearchCV(knn, param_grid).fit(x_train, y_train)
gridsearch.best_params_
{'algorithm': 'ball_tree', 'leaf_size': 18, 'n_neighbors': 11}
knn = KNeighborsClassifier(algorithm='ball_tree', leaf_size=18, n_neighbors=11)
knn.fit(x_train, y_train)
y_pred = knn.predict(x_test)
knn.score(x_train, y_train)
0.8063314711359404
print("The accuracy score is : ", accuracy_score(y_test,y_pred))
The accuracy score is : 0.7445887445887446
cv_results = cross_val_score(knn, x_test, y_test)
msg = "%s: %f (%f)" % ('KNN', cv_results.mean(), cv_results.std())
print(msg)
KNN: 0.807018 (0.040441)

As you can see, KNN was overfitting initially: we got a high training score of about 84% but a test score of only about 72.7%. After using the GridSearch parameters, the training score dropped to about 80.6% while the test score rose to about 74.5%, so it is safe to say the initial model was overfitting. One reason is that, since we didn't provide an n_neighbors value, KNN took the default of 5; recall from above that a small k-value can cause overfitting.

I have tried to explain KNN in the simplest form in this article. KNN is a good algorithm for beginners: it's easy to understand, the mathematics behind it is easy to grasp, and it can be explained to anyone without much trouble.
