# K nearest neighbor


Motto of K nearest neighbor:

K nearest neighbor is one of the simplest algorithms in the category of non-parametric learning. We can use this algorithm for both regression and classification problems.

K nearest neighbor (popularly known as KNN) is a supervised learning algorithm, i.e. it depends on labeled input data to learn a function that produces a suitable output when given new, unlabeled data. Both regression and classification can be done with supervised learning algorithms.

Example to understand supervised learning:

Let’s suppose we have a friend to whom we are trying to explain what an apple looks like.

So, what do we do?

We will show him many pictures, some that contain an apple (telling him that this is an apple) and the rest containing other fruits, for instance watermelons, strawberries, bananas etc. (telling him that this is not an apple). With the help of this method he will be able to learn to differentiate between apples and other fruits. Voila!! We made him learn whether a fruit is an apple or not, hence we supervised him. This kind of learning is known as supervised learning.

If we are just making our friend learn whether the fruit is an apple or not, it is termed a classification problem, while if we are making him learn to predict the weight of the apple, which can hold values like 75 grams, 80 grams, 65 grams, 70 grams etc., then it is termed a regression problem.

Let’s get back to what KNN is!

This easy yet widely used algorithm, proposed by Thomas Cover, is helpful in both classification and regression, and builds on the idea that things belonging to one group are close to each other. In more technical terms, the output depends on the k closest training examples in the feature space. KNN is also termed a lazy learner because it is an instance-based learning algorithm: the machine learns the training instances by heart, and a new data point is then generalized based on a similarity measure computed from the learnt training data. KNN relies on the distance from one point to another to produce its result.

From the above image we can see that the new data point is close to category two, hence it will be classified as category two.

From the above diagram (known as a Voronoi diagram) we can see different regions shaded with different colors; the new data point will be classified depending upon which colored region it falls into.

As mentioned earlier, KNN uses the approach of measurement of distance between two points for providing the appropriate output results.

There are a number of ways we can calculate the distance between two points, depending upon our data and the kind of problem that we are solving. The most commonly used formula for finding the distance is the Euclidean distance:
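For two points p = (p₁, …, pₙ) and q = (q₁, …, qₙ), the Euclidean distance is d(p, q) = √((p₁ − q₁)² + … + (pₙ − qₙ)²). A minimal sketch in Python:

```python
import math

def euclidean_distance(p, q):
    # square root of the sum of squared coordinate differences
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

# the classic 3-4-5 right triangle: distance from (0, 0) to (3, 4) is 5
print(euclidean_distance((0, 0), (3, 4)))  # 5.0
```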

Simple implementation of KNN for classifying a point:
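A minimal sketch of such a classifier, following the steps listed below. The actual group 0 and group 1 coordinates, the point p, and the value of k used here are illustrative assumptions:

```python
import math

def classify_point(points, p, k=3):
    """Classify point p into group 0 or group 1 by majority vote
    among its k nearest neighbors."""
    # array to hold (distance, group) pairs for every training point
    distance = []
    for group in points:
        for feature in points[group]:
            # Euclidean distance from p to this training point
            d = math.sqrt((p[0] - feature[0]) ** 2 + (p[1] - feature[1]) ** 2)
            distance.append((d, group))

    # sort all distances in ascending order and keep the k nearest
    distance = sorted(distance)[:k]

    # count how many of the k nearest neighbors fall into each group
    freq0 = sum(1 for d, group in distance if group == 0)
    freq1 = sum(1 for d, group in distance if group == 1)
    return 1 if freq1 > freq0 else 0

# illustrative training points for group 0 and group 1 (assumed values)
points = {
    0: [(1, 12), (2, 5), (3, 6), (3, 10), (3.5, 8), (2, 11), (2, 9), (1, 7)],
    1: [(5, 3), (3, 2), (1.5, 9), (7, 2), (6, 1), (3.8, 1), (5.6, 4), (4, 2), (4.5, 2)],
}

p = (2.5, 7)   # point to be classified
k = 3          # number of nearest neighbors to consider

print(classify_point(points, p, k))
```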

In the above code:

1. We create a function classify_point to classify our point p amongst the other already existing points.

2. We create an array distance to store all the distances from our point p.

3. For calculating the distance, we use the Euclidean distance formula.

4. We then sort all the distances in ascending order and store them.

5. We have only two groups of points in the above implementation, i.e. group 0 or group 1.

6. The entire distance array is checked, and if p is closer to more points in group 1 among its k nearest neighbors, it is categorized in group 1, else in group 0.

7. We have then predefined our group 1 points and group 0 points for classification.

8. Point p is given to be classified.

9. The value of k is given, which tells our code how many nearest neighbors are to be considered for classification.

10. We call our function for classification of our point p.

General Steps to follow in case of applying KNN on a Dataset:

1. Give a value of K to specify the number of nearest neighbors to be considered.

2. Calculate the distance between the point whose result is to be predicted and each point in the data.

3. Append the distance, along with the index of the data point used for the calculation, into an array sorted in ascending order.

4. Pick the first K entries from the sorted array.

5. Check the labels of the selected K entries.

6. In case of a regression problem, return the mean of the K labels.

7. In case of a classification problem, return the mode of the K labels.
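The steps above can be sketched as one generic predict function that handles both tasks. The datasets and query values here are illustrative assumptions:

```python
import math
from statistics import mean, mode

def knn_predict(data, query, k, task="classification"):
    """data: list of (feature_vector, label) pairs."""
    # steps 2-3: distance from the query to every training point
    distances = []
    for index, (features, _label) in enumerate(data):
        d = math.dist(features, query)  # Euclidean distance
        distances.append((d, index))

    # steps 3-4: sort ascending and keep the first K entries
    distances.sort()
    k_labels = [data[index][1] for _d, index in distances[:k]]

    # steps 6-7: mean for regression, mode for classification
    return mean(k_labels) if task == "regression" else mode(k_labels)

# classification: illustrative (height, weight) -> body-type labels
clf_data = [((170, 60), "slim"), ((180, 90), "heavy"),
            ((172, 62), "slim"), ((178, 88), "heavy")]
print(knn_predict(clf_data, (171, 61), k=3))

# regression: illustrative 1-D feature -> numeric label
reg_data = [((1,), 10.0), ((2,), 20.0), ((3,), 30.0), ((4,), 40.0)]
print(knn_predict(reg_data, (2.5,), k=2, task="regression"))
```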

How to choose the right K value?

To reduce the number of errors, it is important to choose the most appropriate K value, whether to correctly classify our sample or to give the most suitable value in case of regression. This is done by running our model several times with different K values until we get a satisfactory result.

Some important points we can keep in mind while choosing our K value are:

1. Do not choose a very small value of K; for example, K=1 or K=2 will not give a very satisfactory result, as the model has hardly any points to base its decision on and becomes very sensitive to noise.

2. Usually a moderately large K value gives a good result, but keep in mind to avoid values that are too large, as the model may over-smooth the decision boundary and underfit.

3. In case we are doing a classification problem, it’s always a good idea to use an odd K value so that one of the categories gets a higher vote and the chance of a tie is avoided.
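One simple way to follow this advice is to evaluate a few odd K values on held-out points and keep the one with the fewest errors. The toy clusters and held-out points below are assumed values for illustration:

```python
import math
from statistics import mode

def knn_classify(train, query, k):
    # sort (distance, label) pairs and vote among the k nearest
    dists = sorted((math.dist(x, query), label) for x, label in train)
    return mode(label for _d, label in dists[:k])

# illustrative toy data: two well-separated clusters, labels 0 and 1
train = [((1, 1), 0), ((1, 2), 0), ((2, 1), 0), ((2, 2), 0),
         ((8, 8), 1), ((8, 9), 1), ((9, 8), 1), ((9, 9), 1)]
held_out = [((1.5, 1.5), 0), ((8.5, 8.5), 1), ((2.5, 2.5), 0), ((7.5, 7.5), 1)]

# try several odd K values and count misclassifications on held-out data
for k in (1, 3, 5, 7):
    errors = sum(knn_classify(train, x, k) != y for x, y in held_out)
    print(f"k={k}: {errors} errors")
```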

Pros and Cons of KNN:

Pros:

1. KNN is a very simple algorithm to understand and implement.

2. As KNN is a memory-based algorithm, it immediately adapts to new data we feed into it.

3. This algorithm can be used for both classification and regression.

4. We have only one main hyperparameter, i.e. the value of K, to tune, hence decreasing our complexity.

Cons:

1. KNN is easy to implement, but as the dataset grows the algorithm becomes slow, since every prediction requires computing distances to all stored points.

2. KNN functions well with a small number of input variables, but as the number of variables increases, the algorithm faces difficulty in predicting the output of a new data point.

3. Choosing the right value of K becomes an issue sometimes for giving out the right results.

4. KNN is very sensitive to outliers and does not function properly when the data is imbalanced.

Some concrete examples where KNN is used:

1. It is used in concept search, where one tries to search for similar documents.

2. It can be efficiently used in recommender systems, where similar movies, products or videos can be recommended to users.

3. It is widely used in the medical field, for instance in the prediction of diabetes and breast cancer.

4. It is used for text categorization or text mining.

5. It is used for agricultural purposes such as predicting the daily rate of precipitations and other important weather variables.

Summary:

KNN is an instance-based, easy-to-use algorithm which can be used for both classification and regression problems. It measures the distance between points (most popularly the Euclidean distance), selects the given number of examples (i.e. K) nearest to the new data point, then chooses the most frequent label found (in the case of classification) or takes the mean of the labels (in the case of regression). It is important to choose the right value of K to get a satisfactory result. Although KNN is easy to use, it can become slow as the amount of data increases. KNN is a very popular algorithm that has a number of real-life applications.
