Introduction To K-Nearest Neighbors

Amod Kolwalkar
Analytics Vidhya
Published in
6 min readAug 15, 2019

K-Nearest Neighbors is simple algorithm. The elegance of this algorithm lies in it’s simplicity. Despite it’s various drawback’s such as high compute time in high dimension data it is still widely used in areas of bio-informatics and computer vision tasks.

Link to this paper making K-NN one of the top 10 Machine Learning Algorithms.

When you read about K-NN you will have to come across some terms such as supervised learning,hyper parameter tuning,non-parametric, distance measures,classification,regression, lazy learner.

Supervised Learning-Learning a task where output labels are given.

Classification- It is a Supervised learning task where the labels are discrete value or labels for e.g. if your checking if a patient has cancer or not based on the tumor you will have tumor labels of malignant and benign.This can be extended to more than a binary class but the labels are discrete values.

Regression- It is a Supervised learning task where we predict the outcome given real valued data. Class of prediction problems which fall in this are prediction of stock market prices, real estate prices with a given time data.

Best part here is we can use K-NN to do both the Supervised Learning Tasks.

  • Lazy Learner- Consider this as a student who learns one day before the exam. You might see them prosper well in examination’s which don’t have much need of concept understanding or the conceptual rigor required to pass these tests are really low. Student’s from Indian Engineering colleges reading this probably would understand this concept the best .
  • This is the main reason for it’s drawback of failing in high dimensional data. Failing here means in two ways :-
  1. If you have an internet company where business requirements require low latency i.e you need a fast output for your query. K-NN might take ages.
  2. You might not get the best performance metric score on K-NN due to data distribution. K-NN in general tends to be a high bias model i.e it’s makes it’s decision based on a distance metric.
  • Distance Measures
Source Google Images

There are many more distance measure here are some of the most prominent ones.

  • Hyper-parameter- This is like your tuning nob of the model in this case K is the hyper-parameter which decides the number of neighbors you select.

We have gone through some typical Machine Learning Jargon now let’s look at K-NN from an intuitive sense.

Intuition For KNN- Choose your friends wisely!

Consider this image you want some advice and you have two sets of friends whom you would like to take advice from. Let’s call the blue points you school friends and the red points your college friends. You are the smiley and want to take advice from them if you should go for action movie or romantic movie. Based on some metric you will make the decision, most likely when you are around your college friends you will get influenced by them and make decisions similar to what they do, if you are around your school friends you will do what your school friends would do.

The distance here decides your decisions as in the image case if you are more closer you are to the blue points you might end up becoming a blue point else red otherwise.

Failure Cases for KNN

Well this naive example was just to give you a high level sense of how K-NN works but even in this example things may not work they way they would as people evolve, social networks are very dynamic and keep changing, certain people ( The person influencing you can be an outlier point also !) influence you more than a group would do.

Real Word Example with Code

We will use KNN on Pima Indians Diabetes Database to predict if a patient has diabetes or not. We will cover concepts of train test split, hyper parameter tuning and explain the concepts by using sklearn funtions.

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.model_selection import train_test_split #to split into train and test
from sklearn.model_selection import GridSearchCV #for hyperparam tuning
from sklearn.neighbors import KNeighborsClassifier #KNN Classifier

After importing the required libraries we will load the dataset. I have used Kaggle Kernels it’s a jupyter notebook like interface, where you can import kaggle datasets there itself and no need to download them on the local machine.

df=pd.read_csv("../input/diabetes.csv")
df.head() #print's the top 5 rows in the dataframe
PIMA Dataset

We can see that this dataset has 768 rows and 9 Columns, Outcome is the predictor label which denotes presence of diabetes or not.

y=df['Outcome'] 
X=df.drop('Outcome',axis=1)
Class Distribution Counts

We found the ratio so as to do the train test split accordingly. Consider the train test split concept of preparing for a test where you study for some questions and in the test you get new questions. The whole idea to have new questions is to see how well the student has learnt the concepts. So it’s always a good practice so separate train from test. If the test see’s the train data then it leads to data leakage where we might get a very good model score but it doesn’t serve the purpose of model building. The whole idea of test is to check how well a model does on unseen data.

Cross Validation

Now before getting into the details of Hyperparamter tuning, let us understand the concept of Cross validation.

The trained model’s performance is dependent on way the data is split. It might not representative of the model’s ability to generalize.

The solution is cross validation.

Cross-validation is a technique to evaluate predictive models by partitioning the original sample into a training set to train the model, and a test set to evaluate it.

In k-fold cross-validation, the original sample is randomly partitioned into k equal size sub samples. Of the k sub samples, a single sub sample is retained as the validation data for testing the model, and the remaining k-1 sub samples are used as training data. The cross-validation process is then repeated k times (the folds), with each of the k sub samples used exactly once as the validation data. The k results from the folds can then be averaged (or otherwise combined) to produce a single estimation. The advantage of this method is that all observations are used for both training and validation, and each observation is used for validation exactly once.

We can input K value as the number of neighbors and check for model performance. But instead we will do hyper parameter tuning i.e we take a set of K values and perform KNN on it and select the one that gives the best model score on test data cause there are chances of overfit and underfit. Overfit happens when the model learns well on train and performs poorly on train. Underfit happens when it learns poorly on train and it’s performance on test data is also low.

Source: Google Images
model=KNeighborsClassifier() # KNN Classifier
param_grid = {'n_neighbors':np.arange(1,50)} # taking K[1:50]
knn= GridSearchCV(model,param_grid,cv=5,scoring='roc_auc',n_jobs=-1,pre_dispatch='2*n_jobs') #Grid Search for Hyperam Tuning
knn.fit(X_tr,y_tr) #fit the KNN model on the train data

ROC-AUC Metric

As you can see that data set is imbalanced to the ratio of 65:35 imagine if I wrote a one line model which says print(1) I would have got 65% accuracy on this model. Hence in most cases accuracy must be avoided cause real world data is never balanced.

This blog explains ROC-AUC metric in a nice way. ROC-AUC metric is a good metric to use when data is imbalanced. AUC suggests that there is a 78% chance of a point being correctly classified given positive and negative class points.

knn.best_params_    #{'n_neighbors': 10}
knn.best_score_ #0.783873305758252

--

--