K-Neighbors From Scratch

Datascience George · Published in The Startup · Nov 9, 2020

Using just Python and NumPy

Why use K-Neighbors?

K-Neighbors is good at modeling complex non-linear relationships. It is a supervised form of machine learning, and it differs from linear and tree-based algorithms because it is distance based.

How Does K-Neighbors Work?

K-Neighbors estimates values by examining nearby data points. A K-Neighbors model works by first saving the training data. Then, when making predictions, it takes each point in the testing data and finds its “neighbors” in the training data by comparing distances. The model is preset to fetch some number, “K”, of neighbors. If the problem is classification, it looks at the counts of the neighbors’ classes and sets the prediction to the class with the highest count. If the problem is regression, it sets the prediction to the mean of the neighbors’ target values.
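To make the two prediction rules concrete, here is a minimal sketch with toy neighbor values I made up (this is just an illustration, not part of the final class):

import numpy as np
from collections import Counter

# hypothetical target values of the K=3 nearest neighbors
neighbor_classes = ["cat", "dog", "cat"]      # classification targets
neighbor_targets = np.array([2.0, 3.0, 4.0])  # regression targets

# classification: the class with the highest count among the neighbors
class_pred = Counter(neighbor_classes).most_common(1)[0][0]  # "cat"

# regression: the mean of the neighbors' target values
reg_pred = neighbor_targets.mean()  # 3.0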

Setting K

The hyperparameter “K” is often set to an odd number for classification so that votes between two classes cannot tie; for regression the parity of “K” does not matter. The larger the “K” value, the less likely the model is to overfit. However, if you set “K” too large the model can quickly and severely underfit.
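One practical way to pick “K” is to try several values on held-out data and compare the errors. A minimal sketch, assuming a train/validation split (X_train, y_train, X_val, y_val are placeholder arrays) and using the KNRegressor class we build below:

import numpy as np

# try a few K values and report the validation error for each
for k in [1, 3, 5, 11, 21]:
    model = KNRegressor(k=k)            # class defined later in this post
    model.fit(X_train, y_train)
    preds = np.array(model.predict(X_val))
    mse = ((preds - y_val)**2).mean()   # mean squared error
    print(k, mse)

A very small K usually scores well on the training data but poorly on the validation data (overfitting), while a very large K flattens the predictions toward the global mean (underfitting).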

Coding From Scratch

I will be coding a regressor from scratch by creating a Python class. I’m going to make it Scikit-learn style with .fit and .predict methods. The first thing to do is create an __init__ method that sets up the attributes the class needs: the “K” value, a place to save the training feature data (X values), and a place to save the training target data (y values).

import numpy as np

class KNRegressor():
    def __init__(self, k=6):
        self.k = k
        self.X_train = None
        self.y_train = None

The next thing to do is create the .fit method. All this will do is take in X and y data and save it to the object.

    def fit(self, X_train, y_train):
        self.X_train = X_train
        self.y_train = y_train

It would be nice to have a helper function to calculate distance. For this post I’m going to stick to Euclidean distance, the type of distance we all know. The formula is essentially the Pythagorean theorem generalized to n dimensions: we are finding the distance between two points on an n-dimensional graph. Written out, where a represents a row of X values from the test set, b represents a row of X values from the training set, i indexes a feature, and n represents the total number of features:

distance(a, b) = sqrt( Σᵢ₌₁ⁿ (aᵢ − bᵢ)² )

The key to doing this in Python is to use vector math. Vector math does arithmetic element-wise on ordered lists, and NumPy arrays automatically use it when you apply normal Python arithmetic operators. To compute the distance, take two vectors and subtract one from the other, square that difference vector, find its sum, and finally take the square root of the sum.
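Here is a quick demonstration of that recipe on two made-up vectors before we wrap it in a method:

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

difs = a - b                   # array([-3., -4.,  0.])
squares = difs**2              # array([ 9., 16.,  0.])
summation = squares.sum()      # 25.0
distance = np.sqrt(summation)  # 5.0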

    def euclidean_distance(self, v1, v2):
        difs = v1 - v2
        squares = difs**2

        # np.sum works for arrays and scalars alike,
        # so this also handles rows with a single feature
        summation = np.sum(squares)

        distance = np.sqrt(summation)

        return distance
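As a quick sanity check (toy values of my choosing), the helper should agree with NumPy’s built-in np.linalg.norm:

reg = KNRegressor()
v1 = np.array([0.0, 0.0])
v2 = np.array([3.0, 4.0])
print(reg.euclidean_distance(v1, v2))  # 5.0
print(np.linalg.norm(v1 - v2))         # 5.0, same answer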

Another good use for a helper function would be to find the indices of the K smallest values in an array of distances. To make this function we can take advantage of Python’s built-in enumerate and sorted functions. Enumerating an array pairs an index with each point in a form similar to a nested list. Call the list function on the enumeration and you will get a list of tuples. Then you can sort the list of tuples by the second value of each tuple (the first value is the index). To sort by the second value, just write a lambda function lambda x: x[1] and pass it as sorted’s key parameter. Now create a list for storing the indices called nearest_indices. Finally, loop over the first K tuples of the sorted list, slice out the first value of each tuple, append each value to nearest_indices, and return nearest_indices.
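Here is what those two steps produce on a toy list of distances (just an illustration, not part of the class):

distances = [4.2, 1.1, 3.0]
enum_dist = list(enumerate(distances))
# [(0, 4.2), (1, 1.1), (2, 3.0)]
sort_dist = sorted(enum_dist, key=lambda x: x[1])
# [(1, 1.1), (2, 3.0), (0, 4.2)]
# the first K tuples now carry the indices of the K smallest distances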

    def find_K_nearest_indices(self, distances):
        # will create a nested-list-like structure
        enum_dist = list(enumerate(distances))
        # sort by distance (the second value in each tuple)
        sort_dist = sorted(enum_dist, key=lambda x: x[1])
        # create a list for storing indices
        nearest_indices = []

        # loop over first K tuples in "sort_dist"
        for t in sort_dist[:self.k]:
            # and record the first value (the index) of each tuple
            nearest_indices.append(t[0])
        # return the nearest indices
        return nearest_indices
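As a side note, NumPy’s np.argsort returns the indices that would sort an array, so the same helper could be a one-liner. A sketch of an equivalent, hypothetically named method:

    def find_K_nearest_indices_np(self, distances):
        # indices of the K smallest distances, same result as the loop above
        return list(np.argsort(distances)[:self.k])

I kept the explicit loop in the class because it shows every step, but argsort is the idiomatic NumPy way.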

Now to make the predict method. Let’s start by making a list called “predictions” to store our predictions. We find the K nearest neighbors of each X value by looping through each row of the testing X data and, for each of those rows, looping through each row of the training X data, forming a nested loop. Inside the outer loop we create a list called “distances” to record the distances. In the inner loop we find and record those distances using our helper function euclidean_distance. After the inner loop closes, “distances” holds the distances between one testing value and all of the training values. We then use our second helper function, find_K_nearest_indices, to find the indices of the neighbors, take the mean of the y values at those indices, and append that mean to “predictions”. After the outer loop closes, all that is left to do is return the predictions!

    def predict(self, X_test):
        # list for recording predictions
        predictions = []
        # loop over X_test
        for i in range(len(X_test)):
            # list for recording distances
            distances = []
            # loop over X_train
            for j in range(len(self.X_train)):
                # find the distance between this test row and each training row
                distances.append(self.euclidean_distance(
                    X_test[i], self.X_train[j]))
            # find nearest indices
            indices = self.find_K_nearest_indices(distances)
            # record mean of y values for those indices
            predictions.append(self.y_train[indices].mean())
        # return the predictions
        return predictions

There you have it!

To use the regressor, instantiate a model from the class, then simply call .fit and pass in the training X and y data. Then call .predict on the testing X data and the method will return a list of predictions. Optionally, pass in a K value when creating the object if you want K to be a number other than the default of 6.

model = KNRegressor()
model.fit(X_train, y_train)
model.predict(X_test)
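If you want a fully self-contained test, a sketch with synthetic data (the shapes, seed, and target function are my own choices) might look like this:

import numpy as np

# synthetic regression data: y depends non-linearly on X
rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(100, 2))
y = np.sin(X[:, 0]) + X[:, 1]**2

# simple train/test split
X_train, X_test = X[:80], X[80:]
y_train, y_test = y[:80], y[80:]

model = KNRegressor(k=5)
model.fit(X_train, y_train)
preds = np.array(model.predict(X_test))
print("MSE:", ((preds - y_test)**2).mean())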
