K-Nearest Neighbor from Scratch (KNN)

Dipnot · Published in Analytics Vidhya · Jan 9, 2021

Hello, and welcome to K-Nearest Neighbor from Scratch, the third article in our Machine Learning from Scratch series. In this article, we will examine and code the K-Nearest Neighbor algorithm together.

While coding our algorithm, we will use the Iris dataset, an ideal dataset for classification problems. You can access it here or through the sklearn library. Below you can see how to import the dataset into your code.

import pandas as pd
from sklearn import datasets

# Load the Iris dataset and wrap it in a DataFrame so df.describe() works
iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

df.describe()

K-Nearest Neighbor

The main purpose of the K-Nearest Neighbor algorithm is to determine which category a given data point belongs to.
Our algorithm (K-Nearest Neighbor, or KNN for short) predicts a point's class by looking at its closest neighbors in the coordinate system.

So what does that mean?

If we place the data in a coordinate system, the star in the middle marks the point we want to classify, and the K elements closest to it tell us which category that point most likely belongs to. In the picture, two different values are tried for K: 3 and 6. If K is 3, the nearest neighbors are 1 member of class A and 2 members of class B; based on this majority, we can say our star belongs to class B. If K is 6, the situation changes: the nearest neighbors are now 4 members of class A and 2 members of class B, so the most logical decision for our star is class A.

Our goal when using our algorithm is to find the best K value and minimize the margin of error.
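
To make the voting concrete, here is a tiny sketch of that majority vote in Python; the neighbor labels are hypothetical values chosen to match the picture's example:

from collections import Counter

# Hypothetical labels of the star's neighbors, sorted nearest first
neighbor_labels = ["B", "B", "A", "A", "A", "A"]

for k in (3, 6):
    votes = Counter(neighbor_labels[:k])        # count the k nearest labels
    print(k, "->", votes.most_common(1)[0][0])  # majority class wins

Running this prints "3 -> B" and "6 -> A", exactly the flip described above.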

Briefly: the KNN algorithm rests on two basic pillars, distance and K (the number of nearest neighbors). As noted above, we choose K ourselves.

So what do we use as distance variables?

To answer this question: there are multiple distance formulas that can be used in the KNN algorithm, including:

  1. Minkowski Distance
  2. Manhattan Distance
  3. Euclidean Distance
  4. Cosine Distance
  5. Jaccard Distance
  6. Hamming Distance

There are various distance formulas like these. In our code, we will use Euclidean Distance, the most frequently used of them. To observe the effect of other distance metrics on the code, I suggest you read the article “Effects of Distance Measure Choice on KNN Classifier Performance — A Review”.

Euclidean Distance

Euclidean Distance is the most frequently used distance metric in the KNN algorithm. It works on the same principle as the straight-line distance between two points in geometry.

Euclidean Distance Formula: d(p, q) = √((p₁ − q₁)² + (p₂ − q₂)² + … + (pₙ − qₙ)²)
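
As a quick sketch, the formula translates directly into a few lines of Python (a NumPy one-liner would work just as well):

import math

def euclidean_distance(p, q):
    # Straight-line distance between two equal-length points
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

print(euclidean_distance([0, 0], [3, 4]))  # -> 5.0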

Why is the K value important?

The graph shows error rates for different selected K values. Choosing a K value equal to or close to one will cause overfitting in our code. So what is overfitting? Overfitting is a situation where the model memorizes the training data and becomes powerless against new, unseen data. In that case, our code looks only at the nearest single point and makes false predictions without really analyzing the data. As a result, our error rate increases.

To find the best K value, we need to try multiple K values in our code and choose the one that performs best.

Let’s Code:

1. First, let’s import the necessary libraries.
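
The article shows this step as an image; a minimal set of imports that covers the sketches below would be:

import heapq                     # efficient selection of the K smallest distances
import math                      # square root for Euclidean distance
from collections import Counter  # majority vote among neighbors
import pandas as pd              # DataFrame handling
from sklearn import datasets     # the Iris dataset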

2. We load our data and hand it to the model after pre-processing 😊

In this code snippet, after loading our dataset, we split it into two parts, df and df_test. We then try the values in the range (3, 10) to determine the best K and calculate an error score for each.
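
This step is also shown as an image in the article, and the exact code is on GitHub. A sketch of it, assuming a simple shuffled 80/20 split and a misclassification-rate error score, could look like this (the KNN class used here is defined in step 3):

# Load Iris into a DataFrame and shuffle the rows
iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["target"] = iris.target
df = df.sample(frac=1, random_state=42).reset_index(drop=True)

# Split into a training part (df) and a test part (df_test)
split = int(len(df) * 0.8)
df, df_test = df.iloc[:split], df.iloc[split:]

X_train, y_train = df.drop(columns="target").values, df["target"].values
X_test, y_test = df_test.drop(columns="target").values, df_test["target"].values

# Try K values in range(3, 10) and report an error score for each
for k in range(3, 10):
    model = KNN(X_train, y_train, k=k)   # KNN class from step 3
    preds = [model.predict(x) for x in X_test]
    error = sum(p != t for p, t in zip(preds, y_test)) / len(y_test)
    print(f"K = {k}, error rate = {error:.3f}")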

3. We code our class structure.

We start coding our class structure with the pre-definitions in the “__init__” function. Next, we add the “__distance” function, where we define the Euclidean distance calculation our class will use. As a last step, we finish with our “predict” function. In the predict function, we use the “heapq” structure to sort the measured distances. You could sort the distances in a plain list instead, but the “heapq” structure will make your code a bit more efficient. After sorting the measured distances from smallest to largest, you can find which category your data belongs to by taking a majority vote over the first K elements.
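
The full class is on the GitHub page; a self-contained sketch that follows the structure described above (pre-definitions in “__init__”, a private “__distance” method computing Euclidean distance, and a “predict” method that uses heapq) might look like this:

import heapq
import math
from collections import Counter

class KNN:
    def __init__(self, X_train, y_train, k=3):
        # Pre-definitions: training points, their labels, and the chosen K
        self.X_train = X_train
        self.y_train = y_train
        self.k = k

    def __distance(self, p, q):
        # Euclidean distance between two points
        return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

    def predict(self, x):
        # Pair each training point's distance to x with its label
        distances = [
            (self.__distance(x, p), label)
            for p, label in zip(self.X_train, self.y_train)
        ]
        # heapq.nsmallest keeps only the K closest neighbors
        nearest = heapq.nsmallest(self.k, distances)
        # Majority vote among the K nearest labels
        labels = [label for _, label in nearest]
        return Counter(labels).most_common(1)[0][0]

With this class in place, the K-selection loop from step 2 runs end to end.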

That’s it from this article 😊

To see the full version of the code, visit my Github page: https://github.com/capogluuu/ml_scratch

THANK YOU
