K-Nearest Neighbors
Introduction-
Most real-world problems that can be solved using machine learning are supervised learning problems. Classification, the task of assigning an object to one of many categories, is one such supervised learning task. In this post, we will address a classification problem using a popular and simple algorithm called K-Nearest Neighbors, and we will build it entirely from scratch.
In general, most machine learning algorithms have three important traits: Representation, Evaluation, and Optimization.
1. Representation: As its name suggests, this is the intuition behind the algorithm, i.e., how it uses the training data to make predictions. Does it learn parameters from the training data to build a function that then makes predictions (giving rise to model-based algorithms), or does it rely directly on the instances in the training data (giving rise to instance-based learning algorithms)?
2. Evaluation: It refers to the evaluation criteria you use to compare two models. Examples- precision, recall, f1 score, accuracy, etc.
3. Optimization: If your algorithm learns parameters from the training data, how do you arrive at their optimal values to get the best-performing model? Examples- Convex Optimization (Gradient Descent), Combinatorial Optimization (Greedy Search), Constrained Optimization (Linear Programming), etc.
Now let's look at the representation of K-Nearest Neighbors, and then we will see which representation category it falls into.
KNN Representation-
The intuition behind KNN is very simple. Consider the following scenario:
Here, points of the same color belong to the same class, as is evident from the figure.
So, what class would you assign to the star-marked point? The first thing that comes to mind is this: the star point seems closer to the red points, so let's assign it the red class.
That’s it!!
That’s exactly what KNN does.
Since KNN relies on training instances to classify test points it’s an instance-based algorithm.
KNN Evaluation-
We will use the f1 score to evaluate our model’s performance and to choose the best model among different models having different hyperparameters.
The F1 score takes precision and recall into account.
F1 score = (2 * (recall) * (precision)) / (recall + precision)
If this seems strange, note that the F1 score is just the harmonic mean of recall and precision.
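For example, with a made-up precision of 0.5 and a recall of 1.0:
# Hypothetical values, just to illustrate the formula
precision, recall = 0.5, 1.0
(2 * recall * precision) / (recall + precision)
>> 0.6666666666666666
The harmonic mean is pulled toward the smaller of the two values (the arithmetic mean here would be 0.75), so a model cannot get a high F1 score by doing well on only one of precision or recall.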
KNN Optimization-
Since KNN does not use any cost function, there is no optimization step in KNN.
Now let’s focus on some small things.
What if many examples of different classes are close to our test point?
Intuitively, we take the class that occurs most frequently among the nearest points being considered.
We talked about “closeness” to the red points in the above example.
How do we define closeness?
Intuitively, the distance between two points can be used as the measure of closeness. It turns out that we can represent the distance between two points in many ways.
But the general form is this:
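For two points x and y with d features, this general form (the nth norm of the vector of coordinate-wise differences, i.e., the distance vector, for some chosen n) can be written in plain text as:
distance(x, y) = (|x1 - y1|^n + |x2 - y2|^n + ... + |xd - yd|^n) ^ (1/n)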
So, we use the nth norm of the distance vector as the measure of closeness between two points.
Here n can vary from 1 to ∞.
When we make n = ∞, the norm simply picks out the maximum value of the vector. This can be derived using limits, although we won't be diving into that in this blog.
We are free to choose n as we wish. However, our model's performance will vary based on the value of n we choose. Hence, n is one of the hyperparameters of the KNN model.
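As a quick illustration with a made-up difference vector, here is how the norm behaves as n grows:
# Absolute differences |xi - yi| between two hypothetical points
diff = [3.0, 2.0, 1.0]
for n in [1, 2, 10, 100]:
    print(n, sum(d ** n for d in diff) ** (1 / n))
# The value shrinks from 6.0 (n = 1) towards 3.0, the maximum difference, as n grows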
What’s K in KNN?
As the name suggests, K in K-Nearest Neighbors represents the number of training examples we consider to classify the test point.
K is another hyperparameter of the KNN model.
Now we are in a position to make our own KNN model from scratch.
The only libraries we will be using are NumPy, pandas, and a plotting library, Plotly.
# Distance between two points
def get_distance(vec1, vec2, norm):
    diff_vect = []
    for i in range(len(vec1)):
        diff_vect.append(abs(vec1[i] - vec2[i]) ** norm)
    return sum(diff_vect) ** (1 / norm)
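A quick sanity check on a made-up pair of points (a 3-4-5 right triangle, where norm = 2 should give the Euclidean distance of 5):
get_distance([0, 0], [3, 4], norm = 2)
>> 5.0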
# most frequent class
def most_frequent(classes):
    return max(set(classes), key = classes.count)
# F1 score calculation
def f1_score(y_pred, y_true):
    from sklearn.metrics import f1_score
    return f1_score(y_true, y_pred, average = 'weighted')
# Finding k nearest neighbors
def get_k_nearest(X_train, y_train, test_example, k, norm):
    distances = []
    for i in range(len(X_train)):
        d = get_distance(X_train[i], test_example, norm)
        distances.append((i, d))
    # Sort the training points by distance and keep the k closest
    distances.sort(key = lambda x : x[1])
    k_nearest = distances[:k]
    # Return the class labels of the k nearest training points
    classes = [y_train[point[0]][0] for point in k_nearest]
    return classes
# Making predictions
def predict(X_train, y_train, X_test, k, norm):
    y_pred = []
    for test_example in X_test:
        k_nearest_points = get_k_nearest(X_train, y_train, test_example, k, norm)
        pred = most_frequent(k_nearest_points)
        y_pred.append(pred)
    return y_pred
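Here is a tiny made-up example of how these functions fit together, with two well-separated clusters labeled 0 and 1:
X_train = [[0, 0], [0, 1], [5, 5], [6, 5]]
y_train = [[0], [0], [1], [1]] # labels stored as one-element rows, as get_k_nearest expects
predict(X_train, y_train, [[0.5, 0.5], [5.5, 5.0]], k = 3, norm = 2)
>> [0, 1]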
Now let's test our model on a dataset. We will also compare its performance with scikit-learn's KNN implementation.
Here is the well-commented code that uses our functions (I have defined these functions in a helper module, which I import below).
# Importing Libraries
import numpy as np # Numpy
import helper # Our helper module, in which we defined all of the functions written above from scratch
import pandas as pd # Pandas
# Loading Data
X_df = pd.read_csv('train_X_knn.csv', index_col = False) # Feature matrix; does not contain the true label column
y_df = pd.read_csv('train_Y_knn.csv', index_col = False, header = None) # True label vector
y_df.columns = ['Labels']
dataset_df = pd.concat([X_df, y_df], axis = 1)
Our dataset is about different types of plastics found in the ocean, each characterized by its chemical composition. There are a total of 6 types of plastics with labels ranging from 1 to 6. Here are the first 5 rows of our dataset.
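To display them, assuming the combined dataset_df built above:
# First 5 rows of the combined feature + label DataFrame
dataset_df.head()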
# Let's see the number of rows and columns in our data
X_df.shape, y_df.shape
>> ((160, 7), (160, 1))
We have 160 rows and 7 features.
Time to train our Model :)
Since our model's performance depends on the norm and on the value of k, we try several combinations of these to find the best one.
scores = []  # List of tuples (k_value, norm, f1_score) that keeps track of the scores of models with different k and norm combinations
# Let's try different combinations of k and norm values to find the best possible hyperparameters for our model.
for k in range(1, 6):
    for n in range(1, 6):
        y_pred = helper.predict(X_df.values, y_df.values, X_df.values, k = k, norm = n)
        score = helper.f1_score(y_pred, y_df.values)
        scores.append((k, n, np.mean(score)))
# Since we tried 25 combinations, the scores list is long, so let's find the model with the highest f1_score
best_model = max(scores, key = lambda x : x[2])
best_model
>> (1, 1, 1.0)
Let's plot our scores as a function of the k and norm values:
import IPython
import plotly.express as px
k_values, norm_values, f1_scores = [[tup[i] for tup in scores] for i in [0, 1, 2]]
px.scatter_3d(x = k_values, y = norm_values, z = f1_scores, labels = {'x' : 'k values', 'y' : 'norm values', 'z' : 'f1 score'}, title = 'Score of different KNN Models', height = 700, width = 700)
From this plot, we can see that some models have an f1 score of 1 (in fact, all models with k = 1 do), and after that, the best f1 score is given by k = 3, which is highlighted in the plot. So which one should we choose?
Is k = 1 the best selection?
From the graph, we can see that all models with k = 1 have an f1_score of 1.0, so it seems we should select k = 1.
Trap-
We trained our model on the complete dataset and are then predicting on the same data. So when we predict, each test example is closest to itself in the training data. That is why k = 1 gives the best f1_score. But this will not hold for new data points, so it is not wise to choose k = 1; otherwise our model will have high variance on new data points. Instead, we should choose a value larger than 1 that gives the best f1_score, to reduce the variance of our model.
Selecting the model with k = 4 and norm = 1, our f1_score = 0.8325.
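A more robust way to check this choice (not done in the experiment above) would be to hold out part of the data and score the model only on points it has never seen, for example with scikit-learn's train_test_split. A minimal sketch:
from sklearn.model_selection import train_test_split
# Keep ~20% of the data aside as a validation set
X_tr, X_val, y_tr, y_val = train_test_split(X_df.values, y_df.values, test_size = 0.2, random_state = 42)
# Score a candidate model on the held-out points only
y_val_pred = helper.predict(X_tr, y_tr, X_val, k = 4, norm = 1)
helper.f1_score(y_val_pred, y_val)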
Let’s compare with scikit-learn’s implementation
from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier(n_neighbors = 3, p = 1)
clf.fit(X_df.values, y_df.values.reshape(len(y_df), ))
>> KNeighborsClassifier(n_neighbors=3, p=1)
clf.score(X_df.values, y_df.values)
Scikit-Learn score -> 0.8375
So close :) This shows that our implementation is correct. Although scikit-learn's implementation is much more sophisticated, our KNN from scratch still works reasonably well.
Also, if you have been into machine learning for some time, you must have heard about preprocessing the data before feeding it to the algorithm. Since our KNN model uses a distance metric, we could first scale the data so that no feature dominates the others, and we could also do feature selection. But since this blog is aimed at writing KNN from scratch and comparing its performance with scikit-learn, I have not done that here.
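As a rough sketch of what that scaling could look like (using scikit-learn's StandardScaler, which is not part of the workflow above):
from sklearn.preprocessing import StandardScaler
# Bring every feature to zero mean and unit variance so no single feature dominates the distance
X_scaled = StandardScaler().fit_transform(X_df.values)
y_pred_scaled = helper.predict(X_scaled, y_df.values, X_scaled, k = 4, norm = 1)
helper.f1_score(y_pred_scaled, y_df.values)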
When should we use KNN?
- As we can see from the implementation, KNN needs to store all the training examples to make predictions, so if our dataset is very large, other techniques like KD-Trees, Inverted Lists, and Locality Sensitive Hashing should be used. Note that some of these are heuristics and may not return the true nearest neighbors.
- Our implementation computes the distance from a test point to every training point, which makes it computationally very expensive. The above-mentioned techniques are used to speed up the nearest-neighbor search (see the sketch below).
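For example, scikit-learn's KNeighborsClassifier can be asked to build a KD-Tree index instead of brute-force searching; a sketch with the same hyperparameters as above:
from sklearn.neighbors import KNeighborsClassifier
# algorithm = 'kd_tree' builds a tree index so that (in low dimensions) queries need not scan every training point
clf_tree = KNeighborsClassifier(n_neighbors = 3, p = 1, algorithm = 'kd_tree')
clf_tree.fit(X_df.values, y_df.values.reshape(len(y_df), ))
clf_tree.score(X_df.values, y_df.values)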
This is all for this blog. Thanks for reading this far :)
CONTRIBUTED BY: Vikas Choudhary