Understanding Classical ML algorithms

An Illustrated Guide to the K-Nearest Neighbors (KNN) Algorithm

Narasimha Shenoy
6 min read · Oct 3, 2022

A series of articles covering the intuition, math and code for the k-nearest neighbors algorithm.

Photo by Nina Strehl on Unsplash

Intent
This series of articles aspires to elucidate ML algorithms and concepts by breaking them down into intuition and math, each illustrated with the help of short gif viz. for understanding and clarity.

For ease of reading, I have split this article on kNN algorithm over three parts —

1. The geometric intuition for kNN algorithm (click here to read)

2. Significance of ‘k’ in the kNN algorithm (this article)

3. Distance measures used to find the k nearest neighbors (next article)

K, THE HYPERPARAMETER

We covered the intuition behind the kNN algorithm and the role of the term ‘k’ in the previous article. Here, we start off by understanding how different values of the hyperparameter k affect model performance.

Building upon the working of the kNN algorithm, I’ll take a logical leap and try to define a boundary separating the classes as well as possible. To understand this, please observe the gif viz. below, which uses the same dataset in 2-D space as in the previous article.

I’ll take the value of the hyperparameter k to be 5 (arbitrarily). Now, since k is 5, a minimum of 3 negative points in the neighborhood of a query point results in it being classified as belonging to the negative class. k being an odd number, we don’t have a tie-breaker case to deal with. The same goes for the positive class.

Now, imagine we have moved our query point across the surface and recorded its model-predicted class. With the arrangement of the points in the dataset (which is fairly separable), we might get a straight edge separating the positive and the negative classes. I draw a line along the straight edge and that is our decision surface.

The line/curve that separates the points belonging to two different classes is termed a ‘decision surface’.
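To make this concrete, here is a rough Python sketch of the same procedure (the 2-D dataset below is made up for illustration, not the exact points from the viz.): every point on a grid is classified by a majority vote of its 5 nearest neighbors, and the boundary between the predicted regions is the decision surface.

```python
import numpy as np
from collections import Counter

# A made-up, roughly separable 2-D dataset: 1 = positive class, 0 = negative class.
rng = np.random.default_rng(42)
X_pos = rng.normal(loc=[2.0, 2.0], scale=0.8, size=(20, 2))
X_neg = rng.normal(loc=[-1.0, -1.0], scale=0.8, size=(20, 2))
X = np.vstack([X_pos, X_neg])
y = np.array([1] * 20 + [0] * 20)

def knn_predict(query, X, y, k=5):
    """Classify `query` by a majority vote of its k nearest neighbors."""
    distances = np.linalg.norm(X - query, axis=1)  # Euclidean distance to every training point
    nearest = np.argsort(distances)[:k]            # indices of the k closest points
    votes = Counter(y[nearest])
    return votes.most_common(1)[0][0]              # class with the most votes

# "Move the query point across the surface": evaluate a grid and record the predicted class.
xx, yy = np.meshgrid(np.linspace(-4, 5, 200), np.linspace(-4, 5, 200))
grid = np.c_[xx.ravel(), yy.ravel()]
labels = np.array([knn_predict(q, X, y, k=5) for q in grid]).reshape(xx.shape)
# Plotting `labels` as a filled contour (e.g. plt.contourf(xx, yy, labels)) reveals the decision surface.
```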

Yes, there are a few misclassifications: notice the red triangles on the same side of the decision surface as the blue stars. But the curve is smooth (basically a line) and easy to plot.

Now, let us plot the decision surface when k equals 1. This is the minimum value the model hyperparameter can take.

Here, we see that the curve is pretty contorted. However, all the points are correctly classified.

This seems perfect. So, does setting k=1 give us the best model? Not really.

If you look at the red triangle highlighted in the viz., you’ll notice that it sits a bit too far away from its fellow negative reviews. In fact, the decision surface with k equal to 5 classifies it as a positive review.

There can be many reasons for this point’s unusual location. Perhaps the human reviewer made a mistake while labeling this data point. Maybe the review was written in a very convoluted way. Regardless of the reason, if a new query point lands close to this point with k=1, it will be classified as a negative review. However, if the location of that query point shifts even slightly, which is possible by just changing a neutral word or making any other minor modification, the prediction may flip to positive.

So we observe that, for small values of k, the predicted class labels are very volatile and keep changing as different points become the neighbors under consideration. If we treat the highlighted red triangle as an outlier, we can see that for small values of k, the predictions are heavily impacted by the presence of outliers.

We now consider k to be the highest value possible, which is the number of data points itself. This means that all the points, including the outliers, are considered neighbors, and the prediction for any query point is simply the majority vote.

In our case, the negative review points are greater in number than the positive ones. Hence, regardless of its location, a new query point will always be classified as a negative review.
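As a quick sanity check of this extreme (a scikit-learn sketch on a toy dataset of my own, not the article’s data), setting the number of neighbors equal to the size of the training set makes every prediction the majority class, no matter where the query lies:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy data: 12 negative reviews (0) and 8 positive reviews (1), so class 0 is the majority.
X = np.random.default_rng(0).normal(size=(20, 2))
y = np.array([0] * 12 + [1] * 8)

# k equal to the number of training points: every training point votes on every query.
model = KNeighborsClassifier(n_neighbors=len(X)).fit(X, y)

# No matter where the query lies, the majority class (0) wins the vote.
print(model.predict([[100.0, 100.0], [-100.0, -100.0], [0.0, 0.0]]))  # -> [0 0 0]
```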

In the case of a perfectly balanced dataset, the prediction is as good as tossing a coin, with each outcome equally likely.

BIAS & VARIANCE TRADE-OFF

Having visualized the different possible values and the extremes of the hyperparameter k, we can now understand how tinkering with it can cause overfitting or underfitting.

For the uninitiated, in mathematical modeling —

Overfitting occurs when a mathematical model fits closely or exactly to a particular set of data, and may therefore fail to fit to additional data or predict future observations reliably.

Underfitting occurs when a mathematical model cannot adequately capture the underlying structure of the data and fails to fit to additional data or predict future observations reliably.

In overfitting, the presence of outliers creates a decision surface that is far from the truth; the surface is significantly affected by the outliers/noise. In such cases, we generally have very high train accuracy and poor test accuracy. The model is working extremely hard to get each training point correctly classified. Small changes in the data (or in the value of the hyperparameter k) result in very different predictions for a query point, and this is a case of high variance.

In underfitting, the decision surface does too little work and blindly gives the majority class label as the prediction. Hence, even if you make a small change in the value of k, you do not find much difference in the predictions. This is a case of high bias.

A well-fit model is one whose value of k strikes a bias-variance trade-off. The decision surface is smoother and tries to classify as many points as possible correctly (except the noise/outliers). It is less prone to the outliers/noise than the decision surface obtained with overfitting, and it does not blindly predict the majority class; instead, it considers the local neighborhood, unlike in the case of underfit models.
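One simple way to see this trade-off in numbers (a scikit-learn sketch on a synthetic dataset; the exact figures will vary and are only illustrative) is to compare train and test accuracy for a tiny k, a huge k, and something in between:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary-classification data with a little label noise playing the role of outliers.
X, y = make_classification(n_samples=400, n_features=2, n_informative=2,
                           n_redundant=0, flip_y=0.05, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=7)

for k in [1, 15, len(X_train)]:
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(f"k={k:>3}  train acc={model.score(X_train, y_train):.2f}  "
          f"test acc={model.score(X_test, y_test):.2f}")

# Typically: k=1 scores ~1.00 on train but noticeably lower on test (overfitting, high variance),
# k=len(X_train) collapses to the majority class (underfitting, high bias),
# and an intermediate k sits near the sweet spot.
```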

The table below is a summary of the bias-variance trade-off for extreme cases as well as the optimum case —

Table — Bias-Variance trade-off for kNN algorithm summary (by author)

OPTIMAL VALUE OF K, THE HYPERPARAMETER

Since k can take any discrete value between 1 and the number of data points in the training set, there will be an optimal value which hits that sweet spot on the bias-variance trade-off line.

One way for arriving at this optimal value is by using cross-validation.

Cross-Validation is a statistical method of evaluating and comparing learning algorithms by dividing data into two segments: one used to learn or train a model and the other used to validate the model. In typical cross-validation, the training and validation sets must cross-over in successive rounds such that each data point has a chance of being validated against. The basic form of cross-validation is k-fold cross-validation. Other forms of cross-validation are special cases of k-fold cross-validation or involve repeated rounds of k-fold cross-validation. (source)

Cross-validation can be carried out for different values of k. Each value of k gives us a different cross-validation accuracy point on the graph.

The value of k corresponding to the highest accuracy may be selected as the optimal k.

Graph — Cross-validation for deciding upon optimal value of hyperparameter, k (by author)

In case multiple values of k yield the same highest accuracy and we have lots of data, picking the largest such value of k is preferable. The reasoning is that we can place relatively greater trust in the prediction, since a greater number of points support it.
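Putting this into practice could look something like the sketch below (using scikit-learn’s cross_val_score on a synthetic dataset; the range of k values and the 10 folds are my own choices, and note that the number of folds in k-fold cross-validation is unrelated to the k of kNN): score a range of k values with cross-validation and keep the largest k among those tied at the best mean accuracy.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=400, n_features=2, n_informative=2,
                           n_redundant=0, flip_y=0.05, random_state=7)

k_values = list(range(1, 52, 2))               # odd values of k avoid ties within a single vote
mean_scores = []
for k in k_values:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=10)
    mean_scores.append(scores.mean())          # mean cross-validation accuracy for this k

best_score = max(mean_scores)
# Among the k values tied at the best accuracy, prefer the largest one.
best_k = max(k for k, s in zip(k_values, mean_scores) if np.isclose(s, best_score))
print(f"optimal k = {best_k} with mean CV accuracy {best_score:.3f}")
```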

Having understood the impact of the hyperparameter and a technique to decide the optimal value of k, I’ll conclude this article.
In the next post, we shall see how the distance between neighbors is calculated to arrive at the k nearest neighbors (the math, basically!).
Thanks for sticking around, and I hope to see you in the next post, right through to completion.😀

Bouquets💐 & brickbats🧱 may be directed to me here.
