Finding the Optimum Number of Neighbours (n) for KNN Classification Using Python

Sai Kumar Gandhi
Published in Analytics Vidhya · 4 min read · Jun 22, 2021

Source: ResearchGate

Classification Algorithms in Machine Learning: In many Machine Learning problems we need to identify which class an unknown object or data point belongs to. This task is called Classification, and a problem may involve ‘m’ different classes. Python (through scikit-learn) offers various classification models such as GaussianNB, DecisionTree, RandomForest, KNN with n neighbours, LogisticRegression, and SVM classifiers. In this post, we will discuss how KNN works and how to decide the optimum ‘n’ neighbours for maximum accuracy.
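
All of the classifiers mentioned above are available in scikit-learn. A minimal, purely illustrative sketch of how they are typically imported and instantiated (default parameters, not from the original article):

# Illustrative only: the classifiers mentioned above, as exposed by scikit-learn
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

models = {
    "GaussianNB": GaussianNB(),
    "DecisionTree": DecisionTreeClassifier(),
    "RandomForest": RandomForestClassifier(),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
}
# Each model exposes the same interface: model.fit(X_train, y_train); model.predict(X_test)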

KNN Classifier: KNN is a Supervised Classification algorithm that often performs competitively with other classification models, and it can be used for both classification and regression problems. During training it stores the labelled examples, learning how examples of one class differ from those of other classes, and effectively maps them onto a feature space with boundaries at the edges of the classes. When a new object or example needs to be classified, KNN looks at the attributes of the given example and compares them with the examples seen during training. The example is assigned to the class whose nearby examples (its neighbours) it most closely resembles.

For example, suppose we have trained KNN with two classes, Flying and Non-Flying animals, and now need to identify which class a chicken belongs to. At first glance, other classifiers might place the chicken among Flying animals because of its wings. KNN, however, compares all of the chicken's properties with the training examples. In this case the chicken would sit near the border between Flying and Non-Flying animals, since it has wings but cannot really fly with them. The distance (typically Euclidean) between the chicken and the neighbouring samples on the graph is then calculated, and the chicken is assigned to the class to which most of its nearest neighbours belong.
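
To make the distance-and-vote idea concrete, here is a minimal sketch with made-up feature vectors (the animal features and numbers are hypothetical, chosen only to illustrate the mechanism):

import numpy as np
from collections import Counter

# Hypothetical toy data: each animal described by [has_wings, can_fly, body_weight_kg]
X_train = np.array([[1, 1, 0.02],   # sparrow  -> Flying
                    [1, 1, 4.5],    # eagle    -> Flying
                    [1, 0, 4.0],    # penguin  -> Non-Flying
                    [0, 0, 4.0],    # cat      -> Non-Flying
                    [0, 0, 300.0]]) # cow      -> Non-Flying
y_train = np.array(["Flying", "Flying", "Non-Flying", "Non-Flying", "Non-Flying"])

chicken = np.array([1, 0, 2.5])     # has wings, cannot fly

k = 3
# Euclidean distance from the chicken to every training example
distances = np.linalg.norm(X_train - chicken, axis=1)
nearest = np.argsort(distances)[:k]                  # indices of the k closest samples
vote = Counter(y_train[nearest]).most_common(1)[0][0]
print(vote)  # -> 'Non-Flying' with this toy data: most of the 3 closest neighbours are Non-Flying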

Source: kdnuggets

How to figure out the optimal ‘n’ neighbours: Depending on the number of neighbours ‘n’ used in the KNN classifier, the same example might fall under different classes. So how do we figure out the optimal ‘n’ for better accuracy and a low error rate? To answer this, we plot the Error Rate against the K value. Let’s see how to draw this graph using Python.

Import StandardScaler and KNeighborsClassifier, and scale the training and testing datasets. We also need Matplotlib for drawing the graphs. You can install these packages using pip. (Don’t forget to import matplotlib as plt and NumPy as np; they are included in the snippet below.)
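
The article’s own dataset and train/test split are not shown. As a stand-in, here is a minimal sketch that produces the X_train, X_test, y_train, y_test variables used below, using scikit-learn’s built-in breast-cancer dataset (purely illustrative; substitute your own data):

# Hypothetical setup: the original dataset isn't shown, so a built-in one is used as a stand-in
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)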

#feature Scaling
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

st_x = StandardScaler()
x_train = st_x.fit_transform(X_train)   # fit the scaler on the training set, then transform it
x_test = st_x.transform(X_test)         # reuse the training-set scaling on the test set

Now find the error rate for each value of ‘n’ neighbours. You can use the range function for this purpose. Here I compute the error rates for 1 to 39 neighbours and store them in a list.

error_rate = []
for i in range(1, 40):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(x_train, y_train)
    pred_i = knn.predict(x_test)
    # fraction of test samples misclassified with i neighbours
    error_rate.append(np.mean(pred_i != y_test))
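
As an aside (not from the original article): a single train/test split can make the chosen K sensitive to how the data happened to be split. A hedged alternative sketch scores each K with cross-validation instead:

# Alternative sketch: pick K by cross-validated accuracy rather than a single split
from sklearn.model_selection import cross_val_score

cv_error = []
for i in range(1, 40):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=i), x_train, y_train, cv=5)
    cv_error.append(1 - scores.mean())   # mean error rate across the 5 folds
best_k = cv_error.index(min(cv_error)) + 1
print("Lowest cross-validated error at K =", best_k)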

Now we need to plot these error rates against the neighbour values to see which ‘n’ gives the lowest error rate. For this, we can use the matplotlib package. I also print the lowest error rate and the corresponding ‘n’ value along with the plot for quick reference.

plt.figure(figsize=(10, 6))
plt.plot(range(1, 40), error_rate, color='blue', linestyle='dashed',
         marker='o', markerfacecolor='red', markersize=10)
plt.title('Error Rate vs. K Value')
plt.xlabel('K')
plt.ylabel('Error Rate')
req_k_value = error_rate.index(min(error_rate)) + 1   # K with the lowest error (list index is K-1)
print("Minimum error:-", min(error_rate), "at K =", req_k_value)
plt.show()

The graph would look like the image below. In my case the lowest error occurs at n=10, so that is the value we will use for the KNN classifier.

Next, I train the KNN classifier on the dataset with n=10 neighbours and check what accuracy it achieves. The predictions for the test set are stored in y_pred.

#Fitting K-NN classifier to the training set
from sklearn.neighbors import KNeighborsClassifier
# minkowski distance with p=2 is the standard Euclidean distance
classifier = KNeighborsClassifier(n_neighbors=req_k_value, metric='minkowski', p=2)
classifier.fit(x_train, y_train)
y_pred = classifier.predict(x_test)

With the predictions for the test set in hand, I now calculate the accuracy and the confusion matrix for the classifier we have trained.

#Creating the Confusion matrix
from sklearn.metrics import accuracy_score, confusion_matrix
print("KNN - 10 neighbors Accuracy score and Confusion matrix")
print(accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print("-------------------------")
Accuracy of 88.52 for n=10 neighbors

Let's visualise the confusion matrix using matplotlib.

from sklearn.metrics import plot_confusion_matrix   # available in scikit-learn versions before 1.2

plot_confusion_matrix(classifier, x_test, y_test)
plt.show()
Confusion Matrix for KNN with n=10 (optimal) neighbors
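
Note that plot_confusion_matrix was removed in scikit-learn 1.2. On newer versions, a small equivalent sketch (not from the original article) is:

# Equivalent on scikit-learn >= 1.0, where ConfusionMatrixDisplay is available
from sklearn.metrics import ConfusionMatrixDisplay

ConfusionMatrixDisplay.from_estimator(classifier, x_test, y_test)
plt.show()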

Conclusion: By following the above steps we can choose the optimum ‘n’ neighbours and obtain maximum accuracy on the dataset with a low error rate. However, accuracy alone can be misleading when the training data is very small, because of overfitting or underfitting. So, to be safe, also calculate Recall, Precision, F-Measure, Specificity, Sensitivity, and the ROC score, and draw the ROC curve. I will discuss how to calculate these in upcoming posts.
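
As a quick preview (a minimal sketch, not the full treatment promised above): scikit-learn's classification_report covers precision, recall, and F-measure in one call, and roc_auc_score, in this simple form, assumes a binary target and needs predicted probabilities.

# Quick preview: precision, recall and F-measure per class, plus ROC-AUC (binary case)
from sklearn.metrics import classification_report, roc_auc_score

print(classification_report(y_test, y_pred))
# predict_proba gives class probabilities; column 1 is the positive class in binary problems
print("ROC-AUC:", roc_auc_score(y_test, classifier.predict_proba(x_test)[:, 1]))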

Stay Tuned 😉

Note: If this helped you in learning anything, do click on claps. Do let me know what you think of the article in the comments.

