Understanding KNN Algorithm

Tanya Gupta
Published in The Startup · 6 min read · Jan 11, 2021

Today, we will be learning about the KNN algorithm: its applications, the intuition behind it, and its implementation on the MNIST dataset.

Source: edureka.co

Introduction

KNN, or K-Nearest Neighbours, is a supervised classification algorithm, and it is as straightforward an algorithm as its name suggests. It classifies a given instance based on the groups of its surrounding "K" nearest neighbours. For instance:

Assume you are given two groups: food and vehicles. If you were given the task of classifying an apple into one of these groups, the distance between the apple and the other food items would be smaller than its distance to the vehicle group, because the apple's features are more similar to theirs than to those of the vehicles. If you don't get it yet, don't worry; we will elucidate it further down the road. For now, you can admire the nice visuals given below:

Classification of flowers based on their appearance. (source: researchgate.net)

Where is it used?

It can be best utilized if the dataset:

  • has labelled data (otherwise you won't be able to group an item);
  • is small, because the algorithm involves a lot of distance computations (although it can be used on fairly large datasets);
  • is noise free, i.e., the columns are not filled with random values irrelevant to the datatype of the column.

Applications:

  • Used in face recognition: you could use this algorithm to judge the similarity between your test cases and the training set and accordingly "label" or "classify" the test image.
  • Used in recommendation systems: during online shopping, it sorts out similar items for you based on your past purchases or search queries.
  • Used in retail for pattern recognition to detect credit card fraud and theft.
  • Used in image classification.

Notice that most of the above applications use the KNN algorithm's ability to find some commonality among instances of objects.

Why is it known as "instance-based learning"?

Before we get into this question, let's take a small detour to learn what "instance-based learning" is.

Source: blogs.sas.com

For instance, if you were to teach a kid the difference between a cat and a dog, you would emphasize their appearance: the shape of the ears, the paws, etc. You would be explaining the differences between them by giving instances of cats and dogs, and the child would learn from those instances.

This is what we call instance-based learning. This type of learning is also known as "lazy learning" because it delays all the calculations until there is a new instance to classify.

Similarly, in the KNN algorithm we compare a test instance with the training set based on their features. This comparison is made mathematically by calculating the Euclidean distance between them.
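For two feature vectors, the Euclidean distance is the square root of the sum of squared differences. A minimal sketch with NumPy (the values here are just for illustration):

```python
import numpy as np

def euclidean_distance(x, y):
    """Euclidean distance between two feature vectors."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sqrt(np.sum((x - y) ** 2))

# Example: two instances with three features each
print(euclidean_distance([1, 2, 3], [4, 6, 3]))  # 5.0
```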

What is K and how to find it?

For K=3, the star would be classified as class B, but for K=6 it would have been classified as class A. (source: LaptrinhX)

K simply defines the number of nearest neighbours used for classifying the test instance. K can be chosen as:

  • the square root of the number of instances in our set;
  • an odd number. If K were even, there could be a tie, with an equal number of neighbours belonging to each group; for a two-class problem, an odd K always gives a clear majority.

The K-value should not be too large, because this would increase the number of computations and the processing time. However, if we take it too small, it might lead to unstable predictions.
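As a quick illustration of this rule of thumb, here is a small sketch (the training-set size is an assumed value for the example):

```python
import math

n_train = 6000               # assumed number of training instances
k = int(math.sqrt(n_train))  # square root of the training-set size
if k % 2 == 0:               # nudge to an odd value to avoid ties
    k += 1
print(k)                     # 77
```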

The Implementation and steps

Let’s get into the steps for understanding the implementation better:

  • We are given a training set and a test instance to classify, so we calculate the difference in features between our test instance and every row in the training set.
  • These differences are squared and summed for every row, and the square root of that sum tells us how "close" each training instance is to our test instance. This is the Euclidean distance.
  • We sort these distances in ascending order and choose the first K neighbours.
  • Among these neighbours, we see which group holds the majority. That group is our result (a generic sketch of these steps follows this list).
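To make the steps concrete, here is a minimal generic sketch with plain NumPy and toy data (this is not the MNIST code from the next section):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, K):
    # Steps 1-2: Euclidean distance from the test instance to every training row.
    distances = np.sqrt(np.sum((X_train - x_test) ** 2, axis=1))
    # Step 3: indices of the K closest training rows.
    nearest = np.argsort(distances)[:K]
    # Step 4: majority vote among their labels.
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy example: two clusters of 2-D points labelled "A" and "B".
X_train = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
y_train = np.array(["A", "A", "A", "B", "B", "B"])
print(knn_predict(X_train, y_train, np.array([2, 2]), K=3))  # prints "A"
```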

Given below is the implementation of the KNN algorithm on the MNIST dataset.

The MNIST (Modified National Institute of Standards and Technology) dataset provides us with images of the digits 0 to 9 in different handwritings. It has 60,000 images in the training set and 10,000 images in the testing set.

MNIST dataset ( Source:researchgate.net)

(The screenshots below are from Google Colab.)

  • We first import the necessary libraries for the implementation.
  • In case there are null values, you could fill them with 0s or with the mean of the values from 0 to 255, i.e., 128.
  • I sorted the rows on an arbitrary column value to create more randomness in the label values.
  • The pixel positions are the column labels for this dataset (a sketch of this preparation step follows this list).
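As a rough sketch of this preparation step (the file names here are assumptions for illustration; the actual notebook may differ):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Load the MNIST CSVs (file names assumed); each row is one image,
# with a 'label' column followed by 784 pixel columns.
train_df = pd.read_csv("mnist_train.csv")
test_df = pd.read_csv("mnist_test.csv")

# Fill any null pixel values with 0 (128, the mean of 0-255, would also work).
train_df = train_df.fillna(0)
test_df = test_df.fillna(0)

# Sort on an arbitrary pixel column to shuffle the order of the label values.
train_df = train_df.sort_values(by=train_df.columns[1]).reset_index(drop=True)
```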
  • The KNNClassifier() function calculates the distance between our test instance and each of our training instances.
  • The neighbours list stores tuples of (label, distance). Here the label is the digit from 0 to 9 represented by the pixel values in that row.
  • We sort the neighbours list by distance and check which label the majority of our K neighbours belong to.
  • count is a dictionary which stores the number of instances of each label among those K neighbours.
  • testAccuracy() checks how many of our predictions match the labels in our test set and then calculates the percentage accuracy.
  • plotAccuracy() plots the 'accuracy' vs 'K-values' curve using the matplotlib library (a sketch of these functions follows this list).
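Since the code itself lives in the screenshots, here is a minimal sketch of what these functions might look like, using the data frames from the preparation sketch above (the exact signatures in the notebook may differ):

```python
import numpy as np
import matplotlib.pyplot as plt

def KNNClassifier(train_df, test_row, K):
    """Predict the label of one test row from its K nearest training rows."""
    neighbours = []  # list of (label, distance) tuples
    for _, train_row in train_df.iterrows():
        dist = np.sqrt(np.sum((train_row.values[1:] - test_row.values[1:]) ** 2))
        neighbours.append((train_row["label"], dist))
    neighbours.sort(key=lambda pair: pair[1])      # sort by distance, ascending
    count = {digit: 0 for digit in range(10)}      # votes per label
    for label, _ in neighbours[:K]:
        count[label] += 1
    return max(count, key=count.get)               # label with the most votes

def testAccuracy(predictions, test_df):
    """Percentage of predictions that match the true test labels."""
    correct = sum(p == t for p, t in zip(predictions, test_df["label"]))
    return 100.0 * correct / len(predictions)

def plotAccuracy(k_values, accuracy_val):
    """Plot the accuracy vs K-values curve."""
    plt.plot(k_values, accuracy_val, marker="o")
    plt.xlabel("K-values")
    plt.ylabel("accuracy (%)")
    plt.show()
```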

In the above code snippet,

  • The predictions list stores all the labels predicted by our classifier.
  • For every row in our test set, we initialize the values in the count dictionary to 0. The keys here are the numbers 0 to 9.
  • After all iterations are over for a particular K, we test its accuracy and store it in our accuracy_val list.
  • Notice that I have used only odd values for K (a sketch of this loop follows this list).
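A rough sketch of this driver loop, continuing from the function sketches above (the choice of K values is an assumption for illustration):

```python
k_values = [1, 3, 5, 7, 9]        # only odd values of K
accuracy_val = []

for K in k_values:
    predictions = []              # labels predicted by our classifier for this K
    for _, test_row in test_df.iterrows():
        predictions.append(KNNClassifier(train_df, test_row, K))
    accuracy_val.append(testAccuracy(predictions, test_df))

plotAccuracy(k_values, accuracy_val)
```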

The output for the above is:

Confusion Matrix

This matrix tells us the counts of false positives, false negatives, true positives and true negatives. It is a very good tool for judging the accuracy of our classifier.

We will explain the above terms with an example. Assume that a doctor is diagnosing patients with diabetes. In this scenario,

False positive: the doctor predicts that a patient has diabetes when in fact they do not.

False negative: the doctor predicts that a patient doesn't have diabetes when in fact they do.

True positive: the doctor predicts that a patient has diabetes and it turns out to be true.

True negative: the doctor predicts that a patient doesn't have diabetes and it turns out to be true.
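For the MNIST classifier above, the confusion matrix can be computed from the true and predicted labels; a minimal sketch using scikit-learn, continuing with the predictions list from the earlier sketch:

```python
from sklearn.metrics import confusion_matrix

# Rows are true digits, columns are predicted digits; entry (i, j) counts
# test images of digit i that were predicted as digit j.
cm = confusion_matrix(test_df["label"], predictions)
print(cm)
```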

That’s all for today. I hope that you learned something out of this. Thank you for reading my blog!!!
