K-NN from scratch
K-Nearest Neighbors is one of the most widely used algorithms for classification problems. Implementing it from scratch is a great programming exercise and can give us plenty of insight into data processing, learning optimization, and statistical learning in general. That's what I'll try to accomplish in this article, and it will hopefully lead us to a second part discussing PCA.
How it works
If you are familiar with libraries such as scikit-learn, you know how easy it can be to implement a machine learning algorithm. But there is a lot happening behind a classification or regression algorithm, and understanding the way these algorithms work is essential to building a truly efficient model that works properly. Also, implementing ML algorithms from scratch is one of the best ways to really learn what artificial intelligence is about. All right, let's dig in.
You can find plenty of detailed explanations of the algorithm with a quick Google search, but here I'll focus on a more intuitive approach and discuss some results. So, imagine you have a dataset with a number of features and corresponding labels, and some unknown data points that you wish to classify according to the available labels.
The image above illustrates the problem. The features of our data are their coordinates in the plane (x, y), and the labels are the colors. If we introduce a new point in space, K-Nearest Neighbours, as the name says, will try to classify this new point based on the distances to its neighbors. The K in the name stands for the number of neighbors that the classification will be based on. The gif below shows the classification process. The nearest neighbors "vote" for the classification based on their own classes. So, if we have k = 3 with two red nearest neighbors and one blue neighbor, red gets two votes against one for blue, and the new point is classified as red, and so on. You should notice that if we use an even value of K we can get a tie, and this should be handled when developing the algorithm.
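To make the voting concrete, here is a minimal sketch of just the counting step (the labels and k = 3 are made-up illustrative values):
from collections import Counter

# classes of the k = 3 nearest neighbors (illustrative values)
neighbor_labels = ['red', 'red', 'blue']
votes = Counter(neighbor_labels)
print(votes.most_common(1)[0][0])   # 'red' wins, two votes to one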
The steps for the algorithm are:
- Calculate the distance between the new instance and each point in the known data.
- Count the classes of the K Nearest Neighbors.
- Classify the instance based on the majority of classes obtained in the previous step.
We use the Euclidean distance for step 1:
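For a query point q and a known point x, each with n features, it reads:

d(x, q) = \sqrt{\sum_{i=1}^{n} (x_i - q_i)^2}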
Let's go step-by-step through the code. We will mostly use NumPy functions to do the job, and as a working example we will apply our algorithm to the Wine dataset.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
To run our algorithm we take the Wine dataset as an example. We can easily load it with the scikit-learn datasets module.
data = load_wine()
X = pd.DataFrame(data['data'], columns=data['feature_names'])
Y = pd.DataFrame(data['target'], columns=['target'])
The X data frame holds 178 rows and 13 feature columns (alcohol, malic_acid, ash, and so on).
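If you want to inspect it yourself, a quick look at the shape and the first rows does the job:
print(X.shape)    # (178, 13)
print(X.head())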
So we have points in a space with 13 coordinates, which is very hard (actually impossible) to visualize, but the Euclidean distance still works in a real space of n dimensions. We generate a new dataframe called "scaled", with the data properly standardized:
scaled = (X - X.mean())/X.std()
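As a quick sanity check, every column of "scaled" should now have mean close to 0 and standard deviation equal to 1:
print(scaled.mean().round(2))   # approximately 0 for every feature
print(scaled.std().round(2))    # 1.0 for every feature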
The Y dataframe keeps the class of each row of X. The values are 0, 1 or 2, and they represent a type of wine. So suppose we get a new data point with the 13 features available, but we don't know which kind of wine it is (0, 1 or 2). KNN comes to the rescue!
The Algorithm
The code for the algorithm is fairly simple, but I'll comment on it step-by-step. First, we define a function to calculate the distance from a query point to our data:
def euclidean(X, query):
    return np.sqrt(np.sum((X - query) ** 2))
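A tiny test with two points in the plane (a 3-4-5 triangle, so the distance should come out as 5):
a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])
print(euclidean(a, b))   # 5.0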
And the function for KNN looks like:
def knn(query, k, X, Y):
    Y = np.asarray(Y).ravel()                        # flatten the labels to a 1-D array
    E = np.apply_along_axis(euclidean, 1, X, query)  # distance from the query to every known point
    ids = np.argsort(E)[:k]                          # positions of the k nearest neighbors
    classes = np.unique(Y)
    count = np.zeros(len(classes))
    i = 0
    for c in classes:
        count[i] = np.sum(c == Y[ids])               # votes for class c among the neighbors
        i += 1
    max_prob_class = classes[np.argmax(count)]
    return max_prob_class
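As a quick sanity check, we can query with a row that is already in the data; with k = 1 the point finds itself, so the prediction has to match its true label:
pred = knn(X.values[0], 1, X, Y)
print(pred, Y.values[0][0])   # both print 0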
We calculate the distance from our query to each point with the "apply_along_axis" function, and we store in the "ids" variable the positions of the k nearest neighbors. Then we initialize the "count" vector with zeros, to count the classes of the nearest neighbors. The last for loop is responsible for counting and storing those values, using the Y vector. The voting result is stored in "max_prob_class":
max_prob_class = classes[np.argmax(count)]
So, if we have 3 classes, our "count" variable is initialized like this:
count = [0 0 0]
and our “classes” variable like this:
classes = [0 1 2]
The loop will check the classes of the k nearest neighbors, using their locations in the Y vector stored in the "ids" variable:
count[i] = np.sum(c == Y[ids])
After the loop runs, the "count" vector will look something like:
count = [5 0 1]
meaning the query has five neighbors with class 0, zero neighbors with class 1 and one neighbor with class 2, so the query point will be classified as 0. The "np.argmax" function gets the index of the biggest value in the "count" variable, and when we pass that index to "classes" it returns the winning class. We store it in the "max_prob_class" variable, short for "class with the maximum probability".
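Here is that selection step in isolation, with the same illustrative numbers:
count = np.array([5, 0, 1])
classes = np.array([0, 1, 2])
print(classes[np.argmax(count)])   # np.argmax returns index 0, so the winning class is 0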
Testing (Raw Data / Standardized Data)
Let's do some data science with our fresh algorithm. We need to split our data into train and test sets, as usual.
msk = np.random.rand(len(X)) < 0.5   # boolean mask: each row has a 50% chance of landing in the training set
X_train = X[msk]
Y_train = Y[msk]
X_test = X[~msk]
Y_test = Y[~msk]
Now we can run some trials for different values of k and measure the accuracy of our algorithm. In the loop below we calculate the accuracy for k ranging from 1 to 10:
R = []
y_true = np.asarray(Y_test).ravel()
for k in range(1, 11):
    r = []
    accuracy = 0
    for i in range(len(X_test)):
        r.append(knn(X_test.values[i], k, X_train, Y_train))
        if r[i] == y_true[i]:
            accuracy += 1
    R.append([k, accuracy / len(X_test)])
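Before plotting, you can also read the best k straight out of R:
best_k, best_acc = max(R, key=lambda row: row[1])
print(best_k, best_acc)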
We then plot the results:
We got an accuracy range of 0.69–0.76… actually not that bad for a first run!
Here’s the code to do the plotting:
plt.figure(figsize=(10,5))
plt.scatter(np.asarray(R)[:,0], np.asarray(R)[:,1])
plt.plot(np.asarray(R)[:,0], np.asarray(R)[:,1], '--')
plt.title('KNN results for different K values')
plt.ylabel('accuracy')
plt.xlabel('number of neighbours')
plt.grid(True)
plt.show()
Next Steps
You can take this algorithm and test it on other datasets to see how it performs. In the next article, I'll show how we can improve our results with some data preprocessing techniques! See ya!