The aim of writing this article is to summarise the KNN or the K-nearest neighbor algorithm in such a way that the parameters that will be discussed will be useful to decide how the algorithm will be used for classification. Let us start with the first parameter that is the problem definition.
1) Problem Definition:
This is the first and most important parameter as it refers to the part that if we should use the KNN algorithm or not as many other algorithms can be used for classification. The main advantage of KNN over other algorithms is that KNN can be used for multiclass classification. Therefore if the data consists of more than two labels or in simple words if you are required to classify the data in more than two categories then KNN can be a suitable algorithm.
There are no specific assumptions that should be made concerning the data. For example, the data has to be linearly separable to use the Logistic regression algorithm. As the KNN is capable of performing the multiclass classification it does not require any specific assumptions. It works on all kinds of data on which the classification is to be performed. The graphical representation of the two parameters that are discussed is shown below.
As we can see below, there are more than two classes and the data is also not linearly separable. The new data element will be classified in either of the three classes.
3) Preliminary concepts:
The preliminary concepts that one needs to be aware of are: nearest neighbors, distance metrics — Euclidian distance, Manhattan distance, Majority vote. All these concepts are the basic mathematical concepts on which the classification in the KNN algorithm is done.
As the classification is done by taking into consideration the K-nearest neighbors, we need to decide the optimum value of ‘K’ or the number of nearest neighbors. By default, the value of ‘K’ is set as ‘5’. One more thing that needs to be considered is that as the algorithm makes use of majority voting, it will be better if we set an odd number value for ‘K’ to avoid any condition where a two-class classification needs to be performed and say for example if we set the value of ‘K’ an even number like 6, then there can be a situation where there are equal votes for both the classes (3 for each class). In this case, there is a possibility of misclassification. Hence we should consider an odd number value for ‘K’. A simple method for deciding the optimum value is by trying different values of ‘K’. For this, we can simply make use of a ‘for’ loop. A syntax for the same is shown below.
for i in range(1,105,2):
knn = KNeighborsClassifier(n_neighbors=i)
#Train the model using the training sets
#Predict the response for test dataset
y_pred = knn.predict(X_test)
print(“Accuracy:”,metrics.accuracy_score(y_test, y_pred), “for “,i)
The accuracy for all the values of ‘K’ will be printed based on which we will be able to decide the optimum value to get the maximum accuracy.
Other parameters that can be used for getting the optimum accuracy are by using cross-validation.
5) Time and space complexity:
The time and space complexity is decided by the algorithm we use for choosing the neighbors. There are a total of three algorithms and each one has different time complexity.
1. Brute Force:
Consider there are N samples and D dimensions. The time complexity will be O[DN²]. Thus if we have small dimensions and overall a small dataset, this would take an acceptable amount of time. An increase in the size of the data set will correspond to an increase in time complexity.
2. K-D Tree:
This algorithm improves the time complexity by reducing the number of distance calculations. The time complexity for this algorithm turns out to be O[D N *log(N)]. This is better than brute force if the number of samples is more. But an increase in the dimensions of the data will again cause the algorithm to take a longer time.
3. Ball Tree:
If data is having higher dimensions then this algorithm is used. The time complexity for this algorithm turns out to be O[Dlog(N)].
6) Limitations of the KNN algorithm:
As it is clear that the KNN algorithm internally calculates the distance between the points, it is therefore obvious that the time taken by the algorithm for classification will be more as compared to other algorithms in certain cases. It is advised to use the KNN algorithm for multiclass classification if the number of samples of the data is less than 50,000. Another limitation is the feature importance is not possible for the KNN algorithm. It means that there is not an easy way which is defined to compute the features which are responsible for the classification.
Thus by taking into consideration all the parameters that are discussed, it will be easy to use the KNN algorithm for classification and get optimum accuracy. That’s all for today!