
K-Nearest Neighbors algorithm (KNN) With Python

Abhijeet Pujara · Published in Analytics Vidhya · May 10, 2020 · 4 min read


This article covers six parts:

  1. What is K-Nearest Neighbor (K-NN)?
  2. How to find “K” in K-Nearest Neighbor
  3. Distance functions
  4. Pros of K-Nearest Neighbor
  5. Cons of K-Nearest Neighbor
  6. K-Nearest Neighbor with Python (with code)

In this article, we will explore the famous supervised machine learning algorithm “K-Nearest Neighbor,” which can be used to solve both classification and regression problems, and see the code for it.

What is K-Nearest Neighbor?

K-nearest neighbors (KNN) is a supervised machine learning algorithm that can be used to solve both classification and regression tasks. The algorithm makes predictions by calculating the similarity between the input sample and each training instance. For classification, the output is a class membership, and K is always a positive integer: the algorithm essentially forms a majority vote among the K training instances most similar to a given “unseen” observation, where similarity is defined by a distance metric between two data points.

The following two properties define KNN well:

Lazy learning algorithm: KNN is a lazy learning algorithm because it has no specialized training phase; it uses all of the training data at classification time.

Non-parametric learning algorithm: KNN is also a non-parametric learning algorithm because it assumes nothing about the distribution of the underlying data.
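
To make the majority-vote idea concrete, here is a minimal from-scratch sketch; the toy data and the helper name knn_predict are illustrative assumptions, not code from this article.

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Euclidean distance from x_new to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Majority vote among their labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]])  # made-up points
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.2, 1.9]), k=3))  # -> 0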

How to find the best K value

The choice of “K” has a drastic impact on the results we obtain from KNN. “K” is the number of similar neighbors consulted for a new data point, and those neighbors are identified by feature similarity. Choosing the right value of “K” is a process called parameter tuning, and it is important for better accuracy. Finding the value of “K” is not easy: a small K makes the model sensitive to noise, while a large K smooths over genuine class boundaries.
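
One common way to tune K (a sketch of standard practice, not a method prescribed by this article) is to try a range of values and keep the one with the best cross-validated accuracy:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
scores = {}
for k in range(1, 21, 2):  # odd K values help avoid tied votes
    knn = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(knn, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print(f"best K = {best_k}, CV accuracy = {scores[best_k]:.3f}")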

Some frequently used distance functions are:

  1. Manhattan distance
  2. Minkowski distance
  3. Euclidean distance
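
As a quick illustration, here are minimal NumPy versions of the three measures (the function names are mine); Minkowski generalizes the other two, reducing to Manhattan at p = 1 and Euclidean at p = 2:

import numpy as np

def manhattan(a, b):
    # sum of absolute differences
    return np.sum(np.abs(a - b))

def minkowski(a, b, p=3):
    # p-th root of the sum of p-th powers of absolute differences
    return np.sum(np.abs(a - b) ** p) ** (1 / p)

def euclidean(a, b):
    # square root of the sum of squared differences
    return np.sqrt(np.sum((a - b) ** 2))

a, b = np.array([1.0, 2.0]), np.array([4.0, 6.0])
print(manhattan(a, b), euclidean(a, b), minkowski(a, b, p=2))  # 7.0 5.0 5.0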

Pros of KNN

  1. KNN is very easy to implement. Only two parameters are required: the value of K and the distance function.
  2. Since the KNN algorithm requires no training before making predictions, new data can be added seamlessly without retraining and without harming the accuracy of the algorithm.
  3. No training period: KNN is called a lazy learner (instance-based learning) because it does not learn anything during a training period.

Cons of KNN

1. It does not work well with large datasets: every prediction requires computing the distance to all training points, which is slow.

2. The KNN algorithm doesn’t work well with high-dimensional data because, with a large number of dimensions, distances between points become less meaningful (the curse of dimensionality).

3. Need for feature scaling: we need to do feature scaling (standardization or normalization) before applying the KNN algorithm to any dataset, as shown in the sketch after this list; otherwise features with large numeric ranges dominate the distance and KNN may generate wrong predictions.

4. KNN is sensitive to noise in the dataset. We need to manually impute missing values and remove outliers.
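
Here is a minimal scaling sketch for con 3, assuming scikit-learn’s StandardScaler and made-up numbers; the scaler is fit on the training data only and then reused on the test split, so no test information leaks into training:

import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])  # second feature dwarfs the first
X_test = np.array([[2.5, 350.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean and std from train
X_test_scaled = scaler.transform(X_test)        # apply the same transform to test
print(X_train_scaled)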

KNN With Python

We will use the classic Iris dataset of flower measurements.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris_data = load_iris()
print(iris_data.target_names)  # ['setosa' 'versicolor' 'virginica']

X = iris_data.data     # the four flower measurements
Y = iris_data.target   # the class label (0, 1, or 2)
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=47)

Finally, we fit the K-NN model and look at its output.
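
A minimal sketch of that step, continuing from the split above and assuming scikit-learn’s KNeighborsClassifier with K = 5 (the article showed this step as an image, so the exact settings are assumptions):

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

knn = KNeighborsClassifier(n_neighbors=5)  # K = 5 is an assumed choice
knn.fit(x_train, y_train)
y_pred = knn.predict(x_test)
print(accuracy_score(y_test, y_pred))      # fraction of correct test predictions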

Happy Learning !!!

Happy coding :)

And Don’t forget to clap clap clap…
