A Guide to k-Nearest Neighbors

A straightforward guide explaining step by step how to implement the algorithm

Johan Mazorra
The Startup
10 min read · Jul 31, 2020


(Image via Analytics Vidhya)

k-Nearest Neighbors Explained

The k-Nearest Neighbors algorithm is quite a straightforward method. The whole training dataset is stored, and when a prediction is required, the k most similar records to the new record are found in that stored data. From these neighbors, a summarized prediction is made.

Similarity between records can be measured in several ways. A good starting point is the Euclidean distance, since we are mostly dealing with tabular data when approaching Data Science problems in Python. It is the straight-line distance between two points (I will expand on the Euclidean distance and explain its substance in another section of this article).

Once the neighbors are found, a summary prediction can be made by returning the most common outcome or by taking the average. KNN is therefore very flexible and can be used for classification or even regression problems.

Also, keep in mind that KNN is a supervised learning algorithm, meaning that the examples in the dataset must have labels assigned to them: their classes must be known. There are two other significant details to point out about KNN. First, KNN is a non-parametric algorithm. This means no assumptions about the structure of the data are made when the model is used; rather, the model is built entirely from the given data. Second, there is no explicit training step when using KNN. KNN does not generalize the training set into a fitted model, so all of the training data is needed, and used, whenever the model is asked to make predictions.

Furthermore, we will be following three basic steps for implementing the k-Nearest Neighbors algorithm:

1) Formulate the Euclidean distance.

2) Fit the model to the training data and return the predictions.

3) Obtain the Nearest Neighbors.

The Importance of the Value Of K

(Image via Wikimedia Commons, CC)

Before we proceed with the steps for implementing the KNN algorithm, let me explain why the value of K matters. The fundamental risk when using KNN is that a poor value of K (the wrong number of neighbors to consider) may be picked. If this happens, the predictions that are returned can be off considerably. It is therefore important, when using a KNN algorithm, to pick a proper value for K. We want a value for K that maximizes the model's ability to make predictions on unseen data while minimizing the number of errors it makes.

Lower values of K mean that the predictions delivered by KNN are less stable and reliable. To get an intuition for why this happens, consider a situation where we have seven neighbors around a target data point. Assume that the KNN model is working with a K value of two (we are asking it to look at the two nearest neighbors to make a prediction). If the majority of the neighbors (five out of seven) belong to the Blue class, but the two nearest neighbors just happen to be Red, the model will predict that the query sample is Red. Despite the model's guess, in such a situation Blue would be the better call.

If so, why not simply pick the highest K value we can? Because telling the model to consider too many neighbors will also reduce accuracy. As the radius the KNN model examines increases, it will eventually start considering data points that are closer to other groups than they are to the target data point, and misclassification will start to happen. For instance, even if the point that was originally picked sits in one of the red regions above, if K is set too high the model will reach into other regions to gather points. When using a KNN model, various values of K are tried to see which gives the model the best performance, as sketched below.
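As a rough illustration of that search (a minimal sketch using scikit-learn's built-in iris data purely as a convenient example dataset; the range of K values tried here is arbitrary):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Try a range of K values and keep the one with the best cross-validated accuracy.
scores = {}
for k in range(1, 21):
    model = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(model, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print("Best K:", best_k, "with mean accuracy", round(scores[best_k], 3))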

I. So what exactly is the Euclidean Distance?

Euclidean distance is the distance between two points in Euclidean space. Euclidean geometry was originally formulated by the Greek mathematician Euclid around 300 B.C. to study the relationships among angles and distances. This system of geometry is still in use today and is the one that high school students typically study. Euclidean geometry specifically covers spaces of two and three dimensions, but it can easily be generalized to higher dimensions.

A straightforward way to understand how the Euclidean distance is derived is to:

1) Take two points P and Q in a two-dimensional Euclidean space. We will describe P with the coordinates (p1, p2) and Q with the coordinates (q1, q2). Next, construct a line segment with endpoints P and Q. This line segment forms the hypotenuse of a right triangle. Continuing, we note that the lengths of the legs of this triangle are given by |p1 - q1| and |p2 - q2|. The distance between the two points is then given by the length of the hypotenuse.

2) Then, use the Pythagorean theorem to determine the length of the hypotenuse. This theorem states that c² = a² + b², where c is the length of a right triangle's hypotenuse and a, b are the lengths of the other two legs. This gives us c = (a² + b²)^(1/2) = ((p1 - q1)² + (p2 - q2)²)^(1/2). The distance between two points P = (p1, p2) and Q = (q1, q2) in two-dimensional space is therefore ((p1 - q1)² + (p2 - q2)²)^(1/2).

3) Extend the result to a three-dimensional space. The distance between points P = (p1, p2, p3) and Q = (q1, q2, q3) can then be given as ((p1 - q1)² + (p2 - q2)² + (p3 - q3)²)^(1/2).

4) Generalize the solution to the distance between two points P = (p1, p2, …, pn) and Q = (q1, q2, …, qn) in n dimensions. This general solution can be given as ((p1 - q1)² + (p2 - q2)² + … + (pn - qn)²)^(1/2).

Now, you may be asking: “How would I translate this into code? Where would I even start?” The answer to that is actually quite simple. Let me guide you through it with a sample Euclidean distance function:
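Here is a minimal sketch of such a function, assuming NumPy is available and that the two points are passed as equal-length sequences of coordinates:

import numpy as np

def euclidean_distance(p, q):
    # Straight-line distance between two points of equal dimension:
    # ((p1 - q1)^2 + (p2 - q2)^2 + ... + (pn - qn)^2)^(1/2)
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return np.sqrt(np.sum((p - q) ** 2))

For example, euclidean_distance([0, 0], [3, 4]) returns 5.0, the length of the hypotenuse of a 3-4-5 right triangle.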

II. Fit the model and return the predictions

The next step is to fit the KNN classifier, which is the most straightforward step to follow. This step is not strictly necessary for making predictions; however, loading the whole training dataset into memory is what the accurate predictions in the following step rely on.

In Python we would implement it like this:
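Here is a minimal sketch of that step, written as a small classifier class (the class name and its attributes are illustrative choices, not a fixed API):

import numpy as np

class KNNClassifier:
    def __init__(self, k=5):
        # Number of neighbors to consider when making a prediction.
        self.k = k

    def fit(self, X_train, y_train):
        # KNN is a lazy learner: "fitting" simply stores the training data
        # so it can be searched at prediction time.
        self.X_train = np.asarray(X_train, dtype=float)
        self.y_train = np.asarray(y_train)
        return self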

Then, after fitting the model we will add the predictive function:
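A possible sketch of that prediction step, reusing euclidean_distance() from step I (the class from the previous snippet is re-opened via subclassing only to keep this snippet short; in a single file the method would simply sit inside the class body):

from collections import Counter
import numpy as np

class KNNClassifier(KNNClassifier):
    def predict(self, X):
        X = np.asarray(X, dtype=float)
        predictions = []
        for x in X:
            # Distance from this query point to every training point.
            distances = [euclidean_distance(x, x_train) for x_train in self.X_train]
            # Indices of the k smallest distances, i.e. the nearest neighbors.
            nearest_idx = np.argsort(distances)[:self.k]
            nearest_labels = self.y_train[nearest_idx]
            # Majority vote: the most frequent label among the neighbors wins.
            predictions.append(Counter(nearest_labels).most_common(1)[0][0])
        return np.array(predictions)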

Let me clarify what this piece of code is doing. Basically, we have a few for loops: a main loop which in turn contains several inner loops.

The main loop iterates through the data points of X. On each iteration, the euclidean_distance() function from the first step is called to compute the distance to each data point in X_train (yes, there are more efficient ways to do this). Since we are not interested in every distance value for every pair of data points, we discard most of these Euclidean distance values and keep only the k smallest ones (with k being the number of neighbors).

Finally, we simply iterate over the neighbors, count the occurrences of their labels, and return the most frequent one as our prediction.

III. Obtaining the nearest neighbors

We now have a technique for making predictions and can measure how accurate our KNN classifier is. Ideally, we would also want a method that simply returns the predicted neighbors themselves.

The function below returns a list of tuples containing the k closest neighbors and the Euclidean distances to the corresponding inputs:
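A sketch of such a helper is below; the function name and the (distance, neighbor) ordering of the tuples are my choices for illustration, and it reuses the euclidean_distance() function from step I:

def get_neighbors(X_train, x_query, k):
    # Return the k nearest training points to x_query as
    # (distance, training_point) tuples, sorted from closest to farthest.
    distances = [(euclidean_distance(x_query, x_train), x_train) for x_train in X_train]
    distances.sort(key=lambda pair: pair[0])
    return distances[:k]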

Now, let’s test the performance of the KNN classifier we built by comparing it against Sklearn’s own KNN classifier. I created two tests with different datasets to compare their prediction accuracy and see whether one of the two achieves a better result. For the first test, I used the wine quality dataset, and for the second I used the digits dataset directly from sklearn’s dataset module.

Load sklearn’s packages and prepare the wine quality data.
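One way this setup might look is sketched below; the exact file name, separator, and target column depend on which copy of the wine quality dataset is used (assumed here to be the UCI winequality-red.csv file), and the scaling step is my own addition, since Euclidean distances are sensitive to feature scale:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumed local copy of the UCI wine quality data (semicolon-separated).
wine = pd.read_csv("winequality-red.csv", sep=";")

X = wine.drop(columns=["quality"]).values
y = wine["quality"].values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Standardize features so that no single column dominates the distance.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)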

KNN Classifier from Sklearn
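A minimal sketch of that baseline (n_neighbors=5 is just an illustrative choice):

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

sk_model = KNeighborsClassifier(n_neighbors=5)
sk_model.fit(X_train, y_train)

sk_predictions = sk_model.predict(X_test)
print("Sklearn KNN accuracy:", accuracy_score(y_test, sk_predictions))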

Result for the accuracy of Sklearn’s KNN classifier

KNN Classifier built from scratch
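And the same evaluation with the classifier sketched earlier, using the same value of k so the comparison stays fair:

from sklearn.metrics import accuracy_score

scratch_model = KNNClassifier(k=5)
scratch_model.fit(X_train, y_train)

scratch_predictions = scratch_model.predict(X_test)
print("From-scratch KNN accuracy:", accuracy_score(y_test, scratch_predictions))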

Result for the accuracy of the KNN classifier built from scratch.

For the second test, the only difference is the dataset being imported (in this case the digits dataset) and how the data is loaded; everything else is exactly the same. Let me show you below:
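A sketch of that loading step for the digits data (the split parameters are again illustrative):

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

# Load the handwritten digits data straight from sklearn's dataset module.
X, y = load_digits(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)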

As stated above, everything else will be set up in the same manner as in the first test. So let’s check the results below starting with sklearn’s KNN classifier.

Yes! As we can see, my personal goal of beating Sklearn’s KNN classifier algorithm was achieved.

Conclusion

Finally, we implemented a rather simple and highly accurate version of the KNN classifier algorithm. In these tests its performance was superior to Sklearn’s KNN classifier, which is a satisfying achievement, since I put a lot of effort into beating it. In the future, I plan to build more predictive models and implement them in a way that outperforms other machine learning libraries.

Link to the repository for the KNN classifier algorithm that was built from scratch:
