Some study notes on machine learning algorithms — The series

Ep01: The one with the Nearest Neighbors

Elisa Ribeiro
comunidadeds
11 min read · Sep 18, 2022


It’s been a while since I published something here on Medium (and it was in PT-BR), so I decided to try it out again. This time, I will bring some transcripts/thoughts/notes on machine learning algorithms that I am studying or that I need a refresher on.

I’ll start with a group of supervised algorithms that are simple and that can be used for classification and regression: Nearest Neighbors.

Disclaimer: As I said, my idea with these posts is to write them at the same time I study or reinforce my knowledge of the ML algorithms. Do not take every single word you read here as a sacred fact. I can make mistakes too, you know? And if you read something wrong, please, warn me so I can correct the mistake and we can all be smarter :)

Nearest Neighbors

The principle of nearest neighbor algorithms is to predict the label of a new data point based on the closest labeled data points from the training set. The number of neighbors used in this process can be fixed (the k from k-nearest neighbors) or it can vary based on the density of points (the radius from radius-based neighbors).

How is the distance measured?

As you noticed, NN methods rely on the distance between points. However, distance can be measured in different ways: when we open Google Maps to check how far away the grocery store is, we can take the distance in a straight line or along the actual streets of the city.

The same happens for NN methods, and the distance between points can be calculated by different methods. The most common distance measure used for NN methods is the standard Euclidean Distance (the straight line), but it can be Manhattan Distance (which is interesting if you are working with geographical data), Cosine Distance, etc.

An interesting image on the topic, summarizing some of the most common distance measures:

Image found here. Created by Maarten Grootendorst

For the NN methods implemented in scikit-learn, you can set the distance metric using the metric parameter. If you want to know all the distance metrics that scikit-learn accepts, you can import sklearn.neighbors and run sklearn.neighbors.VALID_METRICS['brute'].
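For example, a quick check (the exact list of metrics depends on your scikit-learn version):

import sklearn.neighbors

# Every distance metric accepted by the brute-force neighbors search
print(sklearn.neighbors.VALID_METRICS['brute'])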

Keep in mind that it is usually a good idea to test different distance metrics, as there is no universal distance that performs better in every situation. Below is a test with most of those distance metrics and how the score is affected by them:
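Since the original snippet was an image, here is a sketch of how such a test could be set up with GridSearchCV. The breast_cancer dataset and the subset of metrics below are illustrative choices, not necessarily the ones used in the original experiment:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

# Compare several of the metrics accepted by the brute-force search
param_grid = {
    'metric': ['euclidean', 'manhattan', 'chebyshev',
               'canberra', 'braycurtis', 'cosine'],
    'algorithm': ['brute'],  # 'brute' supports all the metrics above
}

grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
grid.fit(X, y)

print(grid.best_params_)           # the metric with the best cross-validation score
print(round(grid.best_score_, 3))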

The chosen distance after the GridSearch was the canberra distance

Just a reminder, again: the canberra distance performed best for this particular dataset. Always test different distance metrics on your data. If you are interested in how these distances (and others) can affect a KNN model, here is an interesting paper on the subject. If you have any sources on this topic, feel free to send them to me so I can improve this article!

Nearest Neighbors: lazy and flexible

Another property of NN methods, according to the scikit-learn documentation, is that:

“ Neighbors-based methods are known as non-generalizing machine learning methods, since they simply “remember” all of its training data” [1]

This means that the final model does not have complex rules against which new data is tested and classified. Neighbors-based methods simply store the “layout” of the data. When new data arrives, the model can compare it with the training set, measure distances, and then classify it.

Neighbors-based methods are also non-parametric, which means they make no assumptions about the relationship between the target (y) and the explanatory variables (X₁… Xₚ). This makes Neighbors-based methods more flexible, but also more prone to overfitting (in comparison with parametric methods).

How are new points classified?

As I explained earlier, the idea of NN methods is to predict the label of a data point by looking at its neighbors. This decision is made by majority vote, so the new point is classified according to the group that has the most points close to it.

By default, NN methods use uniform weights ( weights = 'uniform' ), so each vote has the same impact on the final result. In some cases, it can be useful to weigh these votes differently, giving more importance to the ‘opinion’ of the closest neighbors. You can do that by setting the argument like this: weights = 'distance'.

If you set weights = 'distance', the distance used for the weighting is the same one set up in the metric argument.
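In scikit-learn this is just another constructor argument. A minimal sketch (the Manhattan metric here is only an example):

from sklearn.neighbors import KNeighborsClassifier

# Closer neighbors get a bigger say in the vote; the distances used for
# the weighting come from the same metric set in the metric argument.
knn = KNeighborsClassifier(n_neighbors=5, weights='distance', metric='manhattan')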

All of the properties I explained so far are the basis of NN methods, and they apply to both Nearest Neighbor classification and regression. For both cases, the most common methods (those implemented in scikit-learn) are K Nearest Neighbors (KNN) and Radius Nearest Neighbors.

The difference between the two tasks is that NN classification is used for data with discrete labels, while NN regression is used with continuous labels. Classification assigns the new point the label of the ‘winning’ group, while regression predicts the value of the new point as the “mean of the labels of its nearest neighbors.”
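To see the regression side with a tiny made-up example (the numbers below are arbitrary):

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Toy data: y is roughly 2 * x
X_train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y_train = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

reg = KNeighborsRegressor(n_neighbors=3)
reg.fit(X_train, y_train)

# The prediction for x = 3.6 is the mean of the targets of its three
# nearest neighbors (x = 3, 4 and 5): (6.2 + 8.1 + 9.8) / 3 ≈ 8.03
print(reg.predict([[3.6]]))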

How many neighbors are used in the voting?

As I said, NN methods use majority vote to decide on a new point. But you might be wondering how many votes (or points) are used in this decision, right?

For KNN, the number of neighbors used to make the decision on the new point is given by the n_neighbors parameter, set by the user. n_neighbors is the k, and it is usually set to an odd integer to avoid ties during the voting.

So imagine you want to predict whether a data point belongs to the blue group or the purple group. First, KNN finds the k closest points (from the training set). Then, it counts how many of those points belong to each group, and the group with the most points wins. Finally, the new data point you wanted to classify is labeled as part of the winning group.

As the purple group “wins”, the orange point is assigned to this group.
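In code, that whole walk-through boils down to a fit and a predict. A tiny made-up example (the coordinates and labels are arbitrary):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Two made-up clusters: class 0 = "blue", class 1 = "purple"
X_train = np.array([[1.0, 1.0], [1.5, 1.8], [1.2, 0.8],   # blue
                    [4.0, 4.2], [4.5, 3.8], [3.8, 4.5]])  # purple
y_train = np.array([0, 0, 0, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

new_point = np.array([[3.5, 3.9]])        # the new ("orange") point
dist, idx = knn.kneighbors(new_point)     # the 3 closest training points
print(idx)                                # all three belong to the purple cluster
print(knn.predict(new_point))             # -> [1], the purple group wins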

As you can see, k (or n_neighbors) controls a lot of things in KNN, including the model’s tendency to overfit or underfit. Smaller k values can lead to overfitting, while larger k values can cause the opposite (underfitting).

If you can’t quite visualize this, here comes a demo of how different values of k create different decision boundaries. This code snippet was slightly modified from the “Introduction to Machine Learning with Python” book.
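The snippet was originally embedded as an image; here is a close sketch of the book’s version, assuming the book’s helper package mglearn is installed:

import matplotlib.pyplot as plt
import mglearn
from sklearn.neighbors import KNeighborsClassifier

X, y = mglearn.datasets.make_forge()

fig, axes = plt.subplots(1, 3, figsize=(10, 3))
for n_neighbors, ax in zip([1, 3, 9], axes):
    # Fit a KNN with a different k in each panel and draw its decision boundary
    clf = KNeighborsClassifier(n_neighbors=n_neighbors).fit(X, y)
    mglearn.plots.plot_2d_separator(clf, X, fill=True, eps=0.5, ax=ax, alpha=0.4)
    mglearn.discrete_scatter(X[:, 0], X[:, 1], y, ax=ax)
    ax.set_title(f"{n_neighbors} neighbor(s)")
axes[0].legend(loc=3)
plt.show()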

Notice the blue point on the left. This point is either noisy or it was mislabeled in the source (these things happen with real-life data). Now, look at how squiggly the decision boundary is with k = 1 and how it ‘embraces’ the blue dot. However, as we increase the value of k, the decision boundary becomes stiffer and no longer cares about these extreme points. You could even say that a large k regularizes the model…

If we look at how the scores behave in similar cases, here is the code for different values of k and how the score changes for the training and test sets of the breast_cancer dataset:
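The original code was also an image, so here is a reconstruction in the spirit of the book’s breast_cancer example (the exact split and scores may differ slightly from the figures discussed below):

import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=66)

training_accuracy, test_accuracy = [], []
neighbors_settings = range(1, 11)

for n_neighbors in neighbors_settings:
    clf = KNeighborsClassifier(n_neighbors=n_neighbors)
    clf.fit(X_train, y_train)
    training_accuracy.append(clf.score(X_train, y_train))  # accuracy on seen data
    test_accuracy.append(clf.score(X_test, y_test))        # accuracy on unseen data

plt.plot(neighbors_settings, training_accuracy, label="training accuracy")
plt.plot(neighbors_settings, test_accuracy, label="test accuracy")
plt.xlabel("n_neighbors")
plt.ylabel("Accuracy")
plt.legend()
plt.show()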

Notice that when k is small, the training set has a score of 1.0 and that score drops to 0.92 when the model is applied to the test set. This is typical when you have a model that is overfitting.

When k is high, the opposite can occur: the training set has a lower score, and when you apply the model to the test set there is even a chance it will perform better than on the training set. This happens because the model is too simple, so it is underfitting the data. And finally, when you have a good value for k, your score will probably drop a little on the test set as well, but the difference won’t be so drastic.

Now, for the Radius Neighbors algorithm, the number of neighbors used in the voting is determined by the radius parameter. Any point within a fixed radius of the new point (including points lying on the boundary) is considered a neighbor and takes part in predicting the label of the new point. The radius parameter is a floating-point value, also specified by the user. Everything else works the same for Radius Neighbors and KNN.
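A minimal sketch of the radius-based variant (made-up points; the radius value is arbitrary):

import numpy as np
from sklearn.neighbors import RadiusNeighborsClassifier

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [4.0, 4.2], [4.5, 3.8]])
y_train = np.array([0, 0, 1, 1])

# Every training point within a distance of 1.5 from the query gets a vote
rnc = RadiusNeighborsClassifier(radius=1.5)
rnc.fit(X_train, y_train)

# Only the two class-1 points fall inside the radius around this query
print(rnc.predict([[4.2, 4.0]]))  # -> [1]
# (If a query had no neighbors inside the radius, you would need to set
# the outlier_label parameter to avoid an error.)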

A note on ties

Before talking about the qualities and limitations of KNN, we need to consider the possibility of ties happening during the testing step. When you think about KNN and how a new point is classified, there are two scenarios where this process can get stuck: first, when deciding which neighbors to use as reference, and then during the voting process itself.

When two or more points are equally distant from a query point, you need to decide which point (or points) will be used as neighbors in the voting process. In this case, scikit-learn will choose the point that comes first in the training dataset. In their words:

Warning: Regarding the Nearest Neighbors algorithms, if two neighbors k and k+1 have identical distances but different labels, the result will depend on the ordering of the training data. [1]

The second scenario involves choosing the label for the new point, which happens during the voting process. As I mentioned before, there is a reason to set k mostly to odd numbers. If you set k=4, for example, it is possible to have two “votes” for one group and two “votes” for the other. That is a tie, so which label is given to the new query point?

In cases like this, scikit-learn apparently classifies the new point as part of the group with the lowest label. If you look at the source code of KNN, you will see that for cases in which you have weights = 'uniform', the new label is returned using the mode function from the scipy package, which returns the most common number (label). However, when this function encounters a tie, it returns the lowest number as the mode.

There is an interesting thread on StackExchange that shows an experiment with KNN to find out more about this.
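The snippet from the thread was embedded as an image, so here is a minimal reconstruction of the same idea (not the exact code from the thread): four training points, a 2 x 2 tie in the vote, and the lowest label winning.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Four training points placed symmetrically around the origin:
# two belong to class 7 and two belong to class 3.
X_train = np.array([[-1.0, 0.0], [1.0, 0.0], [0.0, -1.0], [0.0, 1.0]])
y_train = np.array([7, 7, 3, 3])

clf = KNeighborsClassifier(n_neighbors=4, weights='uniform')
clf.fit(X_train, y_train)

# The query at the origin is equidistant from all four neighbors, so the
# vote is tied 2 x 2 and the lowest label should win.
print(clf.predict([[0.0, 0.0]]))  # -> [3]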

Credits to Haritz Laboa for the original experiment.

And if you want to test the mode function from the scipy package, here is the code to do so:
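A quick sketch (the numbers are just an example):

from scipy import stats

# 5 and 2 both appear twice, so the "most common value" is a tie
values = [5, 2, 5, 2, 9]
print(stats.mode(values))
# The reported mode is 2: the lowest of the tied values,
# even though 5 appears first in the list.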

Note that it returns the lowest value, not the first value on the list.

The good, the bad, and the ugly of Nearest Neighbors

I guess by now you noticed that Nearest Neighbors methods are all about distances. Do you see a problem here?


If you work with a lot of features, it is reasonable to expect that they will not all be of the same type or on the same scale. Some features can be weights, real-life distances (meters and centimeters), currency, etc.

So, to use all these different things with a Nearest Neighbor method, we need to scale the data first, as this model can be highly affected by the scale of the features. Variables that are on a larger scale will have a larger effect on the distance between data points, and hence on the Nearest Neighbor method itself.

Below is another demonstration of how scaling can affect the score of a KNN:
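The original code was shown as an image; here is a sketch of how such a comparison could look, using a pipeline whose scaling step is swapped by GridSearchCV, including 'passthrough' (no scaling at all). The dataset and scalers below are illustrative choices:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

pipe = Pipeline([('scaler', StandardScaler()),
                 ('knn', KNeighborsClassifier())])

# The 'scaler' step is replaced by each candidate in turn;
# 'passthrough' skips the step entirely (no scaling).
param_grid = {'scaler': [StandardScaler(), MinMaxScaler(), 'passthrough']}

grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)

for params, score in zip(grid.cv_results_['params'],
                         grid.cv_results_['mean_test_score']):
    print(params['scaler'], round(score, 3))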

And in the output, “passthrough” means that no scaling was performed.

The algorithm parameter of Nearest Neighbors

Before I end this article, there is one more thing to talk about, and that’s the algorithm parameter of the Nearest Neighbors methods on scikit-learn.

This parameter changes the algorithm used to compute the nearest neighbors themselves and can take the values ball_tree, kd_tree, brute, or auto.

Brute Force basically computes the distances between all pairs of points, so when a new point arrives, it compares the distance from this point to every point in the training set (or within a certain radius) to decide on the neighbors. This makes Brute Force a slow approach that gets worse as you increase the number of samples and features (O[DN²]).

K-D Tree tries to improve on Brute Force by using logic, sort of like those logic tests in interviews: “if point A is distant from point B, and point B is close to point C, then A is distant from C as well.” By storing these relationships in a tree, K-D Tree can select neighbors without explicitly calculating all of their distances, so it is a lot faster than Brute Force (O[DN log(N)]). However, this algorithm suffers from the infamous “curse of dimensionality”, so if you have too many features you can use Ball Tree (O[D log(N)]) instead.

Ball Tree divides the training data points using a set of center points (centroids) and radii. When a new data point comes in, Ball Tree calculates the distance from that point to those centroids and derives lower and upper bounds on the distances. With this, it can also decide on the neighbors without having to calculate every single distance.

The algorithm is chosen “automagically” if you set algorithm = 'auto'. According to scikit-learn docs, this is done based on: the number of samples (n_samples) and dimensionality (n_features) of the data; the data structure itself; the k value; and the size of the testing set.
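A small sketch showing that changing algorithm changes how the neighbors are searched, not which neighbors are found (random data, just for illustration):

import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(0)
X = rng.rand(1000, 3)  # 1000 random points in 3 dimensions

# The same neighbors come back; only the search strategy differs
for algo in ['brute', 'kd_tree', 'ball_tree', 'auto']:
    nn = NearestNeighbors(n_neighbors=5, algorithm=algo).fit(X)
    dist, idx = nn.kneighbors(X[:1])
    print(algo, idx[0])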

You can read more about these algorithms, and how they are chosen here.

Recap time

Finally, to end this article, here comes a quick summary:

  • Nearest neighbors decide by majority vote (weights = 'uniform' by default)
  • Nearest neighbors are non-parametric and non-generalizing methods
  • Larger k or n_neighbors on the KNN can lead to underfitting
  • Smaller k or n_neighbors on the KNN can lead to overfitting
  • The advantages of a KNN: simple and easy to implement; fast to ‘train’.
  • The disadvantages of a KNN: computationally expensive during the ‘testing’ step, especially with larger datasets or if your data has too many features; sensitive to outliers and scaling. Don’t forget that.
  • Some Nearest Neighbor methods can use different algorithms to search for the neighbors. Here is a quick summary of all of them:

Brute Force is basically the dumb math-addicted one, calculating distances all over the place. K-D Tree tries to be the smart-lazy one by using those interview logic tests we hate so much. And Ball Tree is even more lazy, just grouping data and making general calculations.

I hope you liked this article and that you found it as useful to you as it was to me (making myself write about these things helps me study them). Again, if you see something wrong or confusing here, please warn me so I can improve this article! Feel free to reach me if you want :)
