Week 4 — K-Nearest Neighbor

Öner İnce
Published in bbm406f19 · Dec 22, 2019

Last week we examined the correlations between features. This week we start with our first classification algorithm. As a starting point, we chose kNN, one of the simplest algorithms in machine learning. kNN finds the training points most similar to a target point, and the prediction is determined by the k closest training examples in the feature space. Since the algorithm examines all of the training data for every prediction, it is quite slow.
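As a rough sketch, the basic idea looks like this in NumPy (the names X_train, y_train, and x_query are placeholders, not the variables from our code):

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=5):
    # Distance from the query point to every training point (Euclidean here)
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k closest training examples
    nearest = np.argsort(dists)[:k]
    # Majority vote among the neighbors' labels
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]
```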

People living in a tent city in Kathmandu after the earthquake — CNN

We used the sklearn library for our implementation and varied its parameters to see which settings make the algorithm work best.

Here are the parameters we examined:

Metric > Determines which distance measure is used when computing the similarity between test data and training data.

Weight > Determines how much each neighbor's similarity influences the prediction.

k value > Determines how many neighbors are considered during prediction; it controls the balance between overfitting and underfitting.

Number of cross-validation folds > We used k-fold cross-validation to avoid overfitting and to choose the hyperparameter k.

Number of chosen features > Determines how many features are considered, selected by looking at their correlation coefficients.

To find the best configuration, we examined the effect of these parameters on the results by changing them one at a time. We kept the k value, the number of features, and the number of cross-validation folds constant.
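As a hedged sketch, such a sweep over the metric and weighting scheme with sklearn might look like the following (the data here is synthetic, and the fixed k and fold values are assumed, not our exact settings):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Placeholder data standing in for our train set
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = rng.integers(0, 4, size=500)

k = 5        # fixed number of neighbors (assumed value)
n_folds = 5  # fixed number of cross-validation folds (assumed value)

# Vary the metric and the weighting scheme one combination at a time
for metric, weights in [("manhattan", "uniform"),
                        ("manhattan", "distance"),
                        ("cosine", "distance")]:
    knn = KNeighborsClassifier(n_neighbors=k, metric=metric,
                               weights=weights, algorithm="brute")
    scores = cross_val_score(knn, X, y, cv=n_folds)
    print(f"{metric} / {weights}: mean accuracy {scores.mean():.4f}")
```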

We first applied unweighted Manhattan distance and achieved an accuracy of 34.67%. This configuration ran in about 18 minutes on average.

Then we applied weighted Manhattan distance and achieved an accuracy of 38.49%. Letting more similar points contribute more to the prediction, instead of giving every neighbor the same weight, improved our accuracy. This configuration ran in about 20 minutes on average.

Next, we applied weighted Manhattan distance with a larger number of cross-validation folds. Accuracy did not improve much, while the running time increased considerably, to about 28 minutes on average.

Finally, we applied weighted cosine similarity and achieved an accuracy of 36.12%. The running time increased to about 34 minutes on average and our accuracy dropped.

In the end, weighted Manhattan gave the best result. To test the effect of the number of neighbors, we then varied our k value.

Weighted Manhattan

In this experiment, we found that accuracy increased with the k value up to a point and showed no further change after that.

Having seen how slowly the algorithm runs, we tried to accelerate it. Because of its structure, kNN is not well suited to running on a GPU, but cosine similarity can be implemented with dot products, which are well suited to the GPU.

The cosine between two vectors x and w is defined as cos(x, w) = x · w / (|x| · |w|). If we normalize the “w” vector so that |w| = 1, the formula reduces to cos(x, w) = x · w / |x|. Replacing “x” with the test or validation data and “w” with the training data, we can compute cosine similarity with a single-layer neural network that has no activation function and needs no training epochs: the whole computation is one matrix multiplication.
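A minimal sketch of this idea, assuming PyTorch as the GPU framework (the post does not name one) and placeholder array names; the vote here is unweighted for simplicity:

```python
import torch
import torch.nn.functional as F

def cosine_knn_predict(X_train, y_train, X_val, k=5, batch_size=1024, device="cuda"):
    # PyTorch assumed; X_train, y_train, X_val are placeholder NumPy arrays.
    # Normalize the training matrix once; a dot product with normalized rows
    # then equals cosine similarity, so each batch needs a single matmul.
    W = F.normalize(torch.as_tensor(X_train, dtype=torch.float32), dim=1).to(device)
    y_train = torch.as_tensor(y_train).to(device)
    preds = []
    for start in range(0, len(X_val), batch_size):  # validation data sent batch by batch
        xb = torch.as_tensor(X_val[start:start + batch_size], dtype=torch.float32).to(device)
        xb = F.normalize(xb, dim=1)
        sims = xb @ W.T                                 # cosine similarity to every training point
        _, idx = sims.topk(k, dim=1)                    # k most similar training points
        preds.append(y_train[idx].mode(dim=1).values)   # unweighted majority vote
    return torch.cat(preds).cpu()
```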

During this process, the size of the training data made it hard to fit everything into GPU memory. To work around this, we sent the validation data to the GPU batch by batch. We also took advantage of the KD-tree algorithm to reduce the size of the training data seen by each query. A KD-tree addresses kNN’s problem of examining all training data: it splits the data into two subsets along the feature with the largest variance and continues this process recursively for each subset, dividing up the space to be searched. We used this idea to partition our training data, which made it feasible to run the algorithm on the GPU.
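As a rough sketch of that splitting step (the depth limit and names here are illustrative, not our exact implementation):

```python
import numpy as np

def kd_split(X, y, depth=0, max_depth=2):
    """Recursively split (X, y) on the highest-variance feature; return the leaf subsets."""
    # Illustrative sketch: max_depth=2 mirrors splitting on only 2 features
    if depth >= max_depth or len(X) < 2:
        return [(X, y)]
    feature = int(np.argmax(X.var(axis=0)))   # feature with the largest variance
    median = np.median(X[:, feature])
    left = X[:, feature] <= median            # two subsets around the median value
    return (kd_split(X[left], y[left], depth + 1, max_depth) +
            kd_split(X[~left], y[~left], depth + 1, max_depth))
```

At query time, each validation point is compared only against the subset it falls into, which is what shrinks the amount of training data each GPU batch has to scan.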

In the end, we split our data on only 2 features. Since this division is not fine-grained enough, our accuracy was lower than that of the sklearn implementation, but we achieved roughly a 10x speedup: with the best settings, our algorithm reached an accuracy of 36 percent and ran in about 2 minutes.

KD-tree implementation

Conclusion

In conclusion, we observed a speed-accuracy tradeoff among the methods we tried, and we saw that even a slow algorithm like kNN can be accelerated considerably. We also believe that, with more carefully tuned settings and better datasets, our approach could outperform the sklearn implementation in both accuracy and speed.

See you next week!
