k-NN — Love thy neighbor

Bhanu Kiran
6 min read · Jan 9, 2023


This might be one of my shortest blogs ever. Something this short might not seem credible, and it may feel like some information is missing. But trust me, by the end of this blog you will understand k-NN and the theory behind the model.

k-Nearest Neighbors

k-Nearest Neighbors, also known as kNN, is a widely used model in the world of machine learning, mainly as a supervised technique. What does supervised mean? If you are not familiar with the term, it means there is a target variable, so your model can map a relationship between X — your features — and y — your output. So basically your data is just a bunch of rows and columns, with a target or output column. In other words, your data is just features and targets.

In such cases, grouping records together can help find hidden patterns in the data, and these hidden patterns carry a lot of value. From them, you can separate the data into their respective classes and do analysis and all that jazz.

Say, for instance, I have a bunch of scattered data, and I have already grouped it.

Fig 1. data in groups

Now, take a moment to analyze Fig 1. and try to answer this question: if I were to put in a new instance, or a new data point, how am I going to group it?

Fig 2. adding a new instance

If you answered something along the lines of checking the closest point beside the new instance, or checking the “neighbors” of the new instance, then you are absolutely right. When there is a new instance, kNN checks the neighbors of that instance on the plane, and whichever group holds the majority of those neighbors is the group or category the new instance falls into.

But how does the model know how many neighbors to check? This is done by assigning a value to our model, and this is what the k in kNN refers to. If I take the value k=1, then kNN becomes 1-nearest neighbor, as seen in Fig 3.

Fig 3. k = 1

If we assign our k value to be k=2, then our kNN becomes 2-nearest neighbors, as seen in Fig 4. below.

Fig 4. k = 2

Following the trend above, if I take k=6, then my kNN becomes 6-nearest neighbors, as seen in Fig 5. below.

Fig 5. k=6

How does the prediction happen after assigning my k value and telling my kNN how many neighbors to check? Well, this happens via majority voting.

For any new instance and a given k value, the model makes the prediction by taking the majority of the neighbors' categories/classes/groups.

Fig 6. majority voting
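To make this concrete, here is a minimal sketch using scikit-learn's KNeighborsClassifier. The data points and labels are made up purely for illustration; the thing to notice is the n_neighbors parameter, which is our k, and that the prediction is the majority vote of those neighbors.

```python
# A minimal sketch of kNN classification with scikit-learn.
# The data points and labels below are invented for illustration.
from sklearn.neighbors import KNeighborsClassifier

X = [[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0], [1.2, 0.9], [5.5, 7.5]]
y = ["A", "A", "B", "B", "A", "B"]          # two classes, like the two groups in Fig 1.

knn = KNeighborsClassifier(n_neighbors=3)   # k = 3 neighbors to check
knn.fit(X, y)

new_instance = [[1.3, 1.5]]                 # the new point from Fig 2.
print(knn.predict(new_instance))            # majority vote of the 3 nearest neighbors -> ['A']
```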

Does this just happen by itself? No. As you know, machine learning is built upon statistics and a lot of math: the neighbors are sorted by a distance measure, generally Euclidean distance, and the first k points are selected.

Fig 7. distance measures

Now it becomes more intuitive: given a new instance, the kNN model takes a value for k, as seen in the figures above, and checks that many neighbors. By measuring the distance to the neighbors, we can identify the majority class closest to the new instance and assign that class to the new record. As simple as that!
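If you prefer to see the mechanics spelled out, here is a rough from-scratch sketch with NumPy, assuming Euclidean distance: compute the distance to every training point, sort, keep the first k, and take the majority vote. Again, the toy data is made up for illustration.

```python
# A from-scratch sketch of kNN prediction: Euclidean distance, sort, first k, majority vote.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, new_point, k=3):
    # Euclidean distance from the new point to every training point
    distances = np.linalg.norm(X_train - new_point, axis=1)
    # indices of the k closest neighbors
    nearest = np.argsort(distances)[:k]
    # majority vote among their classes
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

X_train = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0], [1.2, 0.9], [5.5, 7.5]])
y_train = np.array(["A", "A", "B", "B", "A", "B"])
print(knn_predict(X_train, y_train, np.array([1.3, 1.5]), k=3))  # -> 'A'
```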

If you followed along, you can observe that the above method works for classification. But kNN can also be used for regression: it is as simple as finding the k most similar records and predicting the average of their values for the new record!
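For example, a tiny sketch of kNN regression with scikit-learn's KNeighborsRegressor; the toy numbers are invented, and the prediction is simply the average of the k nearest targets.

```python
# A minimal sketch of kNN regression: predict the average of the k nearest targets.
from sklearn.neighbors import KNeighborsRegressor

X = [[1.0], [2.0], [3.0], [4.0], [5.0]]     # toy feature values, made up for illustration
y = [1.2, 1.9, 3.1, 3.9, 5.2]               # numeric targets instead of classes

reg = KNeighborsRegressor(n_neighbors=2)
reg.fit(X, y)
print(reg.predict([[2.5]]))                 # average of the 2 closest targets -> [2.5]
```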

Now that we understand kNN, there are a few things to keep in mind.

1. Standardization

When measuring, we are not interested in “how much” but in “how different from the average”. For models such as kNN, it is essential to standardize the data prior to applying the model (also referred to as normalization). Doing so puts all our variables on similar scales. This is important because it ensures that a variable does not overly influence the model simply due to the scale of its original measurement.

In other words, in Fig 6. you have different colors of data, which is done to visualize the different classes. You are not going to compare apples with oranges; you compare apples with apples. By applying standardization, you put the apples and the oranges on the same scale, so they can be compared fairly.
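As a sketch of what this looks like in practice, you could wrap standardization and kNN together with scikit-learn's StandardScaler and a Pipeline. The dollar/kilometer features here are invented just to show two wildly different scales.

```python
# A small sketch of standardizing features before kNN, keeping both steps in one pipeline.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# toy data: one feature in dollars, one in kilometers (very different scales)
X = [[30000, 1.2], [45000, 0.8], [32000, 5.5], [50000, 6.1]]
y = ["A", "A", "B", "B"]

model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
model.fit(X, y)                     # the scaler keeps the dollar feature from dominating the distance
print(model.predict([[40000, 1.0]]))
```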

2. Selecting the k value

Choosing a large k value will result in oversimplification. Observe Fig 6: if I take my k value to be 10 and the neighbors turn out to be mostly of class B, then my new instance will automatically be classified as class B, and we do not want this!

On the other hand, a small k value overfits the model and makes it very sensitive to noise. If my k = 1, then only one neighbor is compared, and if the model classifies an apple as an orange, then, as stupid as it sounds, the next time I throw an apple at my model, it says orange.

So what do we do? The most common method is to test a range of values and optimize the performance of the model. This is why, when you see examples or other blogs or code online, you find a curve with a bunch of k values. It is done to find the optimal k value for the model, as in the sketch below.
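Something like this, which scans k from 1 to 20 with 5-fold cross-validation; the dataset (iris) and the range of k are arbitrary choices for illustration.

```python
# A sketch of picking k by testing a range of values with cross-validation.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

scores = {}
for k in range(1, 21):
    knn = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(knn, X, y, cv=5).mean()   # average accuracy over 5 folds

best_k = max(scores, key=scores.get)                      # the k with the best average accuracy
print(best_k, scores[best_k])
```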
