Choosing K Nearest Neighbors

And saving time with GridSearchCV

Jamel Dargan

Published in

The Startup

7 min readAug 5, 2020

Dyed squares, in variegated shades from light pink to dark blue — Photo by Jerome on Unsplash

Classification is more-or-less just a matter of figuring out to what available group something belongs.

Is Old Town Road a rap song or a country song?

Is the tomato a fruit or a vegetable?

Machine learning (ML) can help us efficiently classify such data, even when we do not know (or have names for) the classes to which they belong. In cases where we do have labels for our groups, an easy-to-implement algorithm that may be used to classify new data is K Nearest Neighbors (KNN). This article will consider the following, with regards to KNN:

What is KNN
The KNN Algorithm
How to implement a simple KNN in Python, step by step

Supervised Learning

In the image above, we have a collection of dyed squares, in variegated shades from light pink to dark blue. If we decide to separate the cards into two groups, where should we place the cards that are purple or violet?

In supervised learning we are given labeled data, e.g., knowing that, “these 5 cards are red-tinted, and these five cards are blue-tinted.” A supervised learning algorithm analyzes the training data — in this case, the 10 identified cards — and produces an inferred function. This function may then be used for mapping new examples or determining to which or the two classes each of the other cards belongs.

What is Classification?

Classification is an example of supervised learning. In ML, this involves identifying to which of a set of categories a new observation belongs, on the basis of a training dataset containing observations whose category membership is known (is labeled). Practical examples of classification include assigning an email as spam or not spam or predicting whether or not a client will default on a bank loan.

K Nearest Neighbors

The KNN algorithm is commonly used in many simpler ML tasks. KNN is a non-parametric algorithm which means that it doesn’t make any assumptions about the data. KNN makes its decision based on similarity measures, which may be thought of as the distance of one example from others. This distance can simply be Euclidean distance. Also, KNN is a lazy algorithm, which means that there is little to no training phase. Therefore, new data can be immediately classified.

Advantages and Disadvantages of KNN

Advantages

Makes no assumptions about the data
Simple algorithm
Easily applied to classification problems

Disadvantages

High sensitivity to irrelevant features
Sensitive to the scale of data used to compute distance
Can use a lot of memory

Grouped rows of forks and spoons, with identical items stacked and held together with rubber bands — Photo by Alina Kovalchuk on Unsplash

While KNN is considered a ‘lazy learner’, it can also be a bit of an over-achiever — searching the entire dataset to compute the distance between each new observation and each known observation.

So, how do we use KNN?

Algorithm of KNN

We start by selecting some value of k, such as 3, 5 or 7.

The value of k can be any number below the number of observations in the dataset. When the choice is between an even number of classes, setting this parameter to an odd number avoids the possibility of a tie between the two.

One approach for selecting k is to use the integer nearest to the square root of the number of samples in the labeled classes (+/- 1 if the square root is an even number). Given 10 labeled points from our two classes, we would set k equal to 3, the integer nearest to √10.

Choose k samples closest to the new data point according to their Euclidean distance from that point.
For each data point in test: Calculate the distance between test data and each row of training data with the help of Euclidean distance.
Now, sort point distances in ascending order according to the distance computed.
Choose top k from the distance array.
Now, assign a class to the test sample based on most frequent class of these rows.

If you comfortably read through those bullet points, you may already know enough about ML algorithms that you did not need to read this article (but please, continue).

Essentially, each of the k nearest neighbors is a vote for its own class. The new data point will be classified based on which class has the greater number of votes out of the test points k nearest neighbors.

Example

Let’s see an example to understand better.

Suppose we have some data which is plotted as follows:

Scatter plot with five red points near the upper-right and five purple points converging toward the lower-right — 10 data-points in two classes

You can see that there are two classes of data, one red and the other purple.

Now, consider that we have a test data point (indicated in black ) and we have to predict whether it belongs to the red class or the purple class. We will compute the Euclidean distance of the test point with k nearest neighbors. Here k = 3.

Scatter plot with lines connecting a black test point to its 3 nearest neighbors and a circle around the connected points — Test point encircled with its three nearest neighbors

Now, we have computed the distance between the test point and its three nearest neighbors. Two of the neighboring points are from the red class, and one is from the purple class. Hence this data point will be classified as belonging to the red class.

Implementation using Python

We will use the Numpy and Sklearn libraries to implement KNN. In addition, we will use Sklearn’s GridSearchCV function.

Grid Search CV

Grid search is the process of performing hyperparameter tuning in order to determine the optimal values of the hyperparameters for a given model. This is significant as the performance of the entire model is based on the values specified.

Why use it?

Models can involve more than a dozen parameters. Each of these parameters can take on specific characteristics, based on their hyperparameter settings; and hyperparameters can present as ranges or conditions, some of which may be programmatically changed during modeling.

Manually selecting best hyperparameters in the ML process can feel like a nightmare for practitioners. Sklearn’s GridSearchCV instance helps to automate this process, programatically determining the best settings for specified parameters.

So, what does this look like in (pseudocode) practice? We start be importing required libraries.

import pandas as pd
import numpy as npfrom sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score

KNN function

We will create a custom KNN method with 5 parameters: training examples, training labels, test examples, test label and a list of possible values of k to train on.

First, we create a KNeighborsClassifier() object, imported from Sklearn. Then we create a dictionary named “parameters” and store the list k in it. Our third step is to pass the classifier, i.e. KNN, and the parameters to GridSearchCV and fit this model on the training data. GridSearchCV will optimize hyperparameters for training and we will make predictions on test data using the tuned hyperparameters. To predict the labels on test data, we call model.predict(). We can check the accuracy of our model and its predictions with the accuracy_score() function we import from Sklearn.

def KNN(x_tr, y_tr, x_te, y_te, k):
    print('\nTraining Started for values of k', [each for each in k],'.......')
    # Create an knn object using imported KNeighborsClassifier() from sklearn
    knn = KNeighborsClassifier()# parameters i.e. k neighbors list
    parameters = {'n_neighbors':k}
    
    # Training the model
    model = GridSearchCV(knn, param_grid = parameters, cv=3)
    model.fit(x_tr, y_tr)
    print('Best value of k is ',model.best_params_)
    
    # Making Predictions on test data
    print('\nPredicting on Test data.......')
    pred = model.predict(x_te)
    print('\nAccuracy of model on test is', accuracy_score(y_te, pred)*100,'%')
    return accuracy_score(y_te, pred)

This custom method is just some pre-processing done on the Google Playstore dataset. Note: a version of the dataset may be obtained from Kaggle. Data filenames and required pre-processing steps may vary.

def data_preprocess():
    # processing Apps.csv
    data = pd.read_csv('apps.csv')
    columns = ['App', 'Category', 'Rating', 'Size', 'Type', 'Price', 'Genres']
    data[columns]
    new_data = data[columns].copy()
    new_data = new_data.fillna(0)
    for each in range(0, len(new_data['Rating'])):
        if new_data['Rating'][each] == 0:
            new_data.at[each, 'Rating'] = new_data['Rating'].mean()
    price_list = [float(each.replace("$","")) for each in new_data.Price]
    new_data.Price = price_list
    
    # processing User_reviews.csv
    data2 = pd.read_csv('user_reviews.csv')
    column = ['App', 'Sentiment_Polarity', 'Sentiment_Subjectivity', 'Sentiment']
    data2[column]
    new_data2 = data2[column].copy()
    
    # merging the two datasets into one final dataset
    df = new_data.merge(new_data2, on='App')
    df.Sentiment = df['Sentiment'].replace(to_replace='Positive', value=1).replace(to_replace='Negative', value=-1).replace(to_replace='Neutral', value=0)
    df.Sentiment_Polarity = df.Sentiment_Polarity.fillna(df.Sentiment_Polarity.mean())
    df.Sentiment_Subjectivity = df.Sentiment_Subjectivity.fillna(df.Sentiment_Subjectivity.mean())
    df = df[df['Sentiment'].notna()]
    df.Type = df['Type'].replace(to_replace='Free', value=1).replace(to_replace='Paid', value=0)
    df = df.drop(['Size'], axis=1)
    
    # separating dataset into samples and labels
    X = df.iloc[:, 0:7]
    y = df.iloc[:, 8:9]
    
    # encoding the dataset 
    X = pd.get_dummies(X)
    print('\nFinished pre-processing data....')
    return X, y

We create a main function and all the processing is done in this function. We will call the above created methods in this main function. Also, we are applying some data normalization techniques in this function and calling the custom function on our data.

Normalization may not be required, depending on the data you use.

Finished pre-processing data....Training Started for values of k [3, 5, 7] .......
Best value of k is  {'n_neighbors': 7}Predicting on Test data.......Accuracy of model on test is 86.07469428225184 %

Running our function results in a respectable accuracy score of 86 %.

In this article, we took a look at the K Nearest Neighbors machine learning algorithm. We discussed how KNN uses Euclidean distance to compare the similarity of test data features to those of labeled training data. We also explored a simple solution for determining a value for k. In our custom code example, we demonstrated the use of Sklearn’s GridSearchCV for optimizing our model’s hyperperameters (and for sparing ourselves the intense manual effort that might be otherwise required to exhaustively tune those hyperparameters).

We can dive much deeper into KNN theory and leverage it over a broad range of applications. KNN has many uses, from data mining to recommender systems and competitor analysis. For those seeking to further explore KNN in Python, a good course of action is to try it for yourself.

If you would like some suggestions, let me know in the comments or feel free to connect with me on Linkedin.