# K-Nearest-Neighbor (KNN) explained, with examples!

Hello, and welcome to my first story on medium!

Today, I’ll be explaining how the algorithm K-Nearest-Neighbor works and how it can be used for classification. We will touch upon the theory, bias/variance trade-off followed by implementation and a very simple example.

## Theory of K-Nearest-Neighbor (KNN)

K-Nearest-Neighbor is a non-parametric algorithm, meaning that no prior information about the distribution is needed or assumed for the algorithm. Meaning that KNN does only rely on the data, to be more exact, the training data. An example can be seen in the figure below:

In general, the algorithm is pretty simple. When the model meets an unlabelled datapoint it does measure the distance to the K nearest neighbours, thereby the name, and then the unlabelled datapoint will be classified as belonging to the class which has the most training instances among the K nearest neighbour. See below image for an illustration:

Here the yellow X is an unlabelled datapoint, and K = 3. The most prominent class is Class 1, hence the datapoint is classified to be Class 1.

As you probably have guessed already, K is a hyperparameter of the algorithm, which specify the number of neighbours (training instances) the unlabelled datapoint will be compared against. Setting K=1 reduces the algorithm to Nearest-Neighbor (NN), resulting in choosing the class of the closest datapoint to be the label of the unlabelled datapoint.

Besides the number of neighbours, K controls the bias/variance trade-off of the model. In general, you are interested in low bias and low variance, but this is almost never the case. So either choose a model with high bias and then fits the model to provide low variance or vice versa. High bias tends to lose important patterns in the dataset, while high variance tends to overfit the training data.

In other words, setting K = 1, tends to overfit the complex structure of the training set and noise in the data will influence a lot, but setting K too high, annihilate the model from finding the relevant pattern in the data. The most optimal value for K can be found using a simple grid search over a predefined set of values for K.

## Probability Calculation of Outcome

Normally, the KNN algorithm is not used for probability estimation, however, it is possible to estimate density and posterior probability of a given classification.

Let us assume, that we create a sphere at the unlabelled datapoint, and we expand that sphere until it contains exactly K neighbours (training instances). See illustration below:

Let V be the volume of the sphere and K_k be the number of points belonging to class k, N_k be the total number of point of class k in the training data and N to be the total number of training points. Then the likelihood can be calculated as:

And the evidence or marginalization can be calculated as:

And the prior probability for all the classes (in this case Class 1 and Class 2) can be calculated as follows:

Now that prior probability, likelihood and evidence is calculated, we can calculate the posterior probability that a datapoint belongs to a certain class, by combining the above equations:

As we can see, the posterior probability can be derived by the ratio of the points belonging to a given class and the K value determined earlier. When classification, it is common to choose the class with the highest posterior probability as the class of interest.

## Implementation of KNN using Scikit-Learn

Implementation of KNN is straightforward using the Scikit-Learn library for Python.

fromsklearn.neighborsimportKNeighborsClassifierX_train = ... # Training data in format (n_instances, n_features)

y_train = ... # Training labels (n_instances, )

k = 3

model = KNeighborsClassifier(n_neighbors=k)

model.fit(X_train, y_train)

That was all for now, I’ll get back soon with some more articles about Machine Learning. I hope that this article helped on understanding KNN as an algorithm, and you feel prepared to use this algorithm in your ML pipeline.

Feel free to contact me, with any comments or suggestions.

Thank you for reading…

## Sources:

- Pattern Recognition and Machine Learning, 2006, Christopher M. Bishop
- Scikit-Learn