K-Nearest Neighbors

Introduction to Advanced Machine Learning

Melissa Huerta Dev
WomenWhoCode Silicon Valley
5 min read · Dec 7, 2020


During September, WWCode Data Science led six machine learning sessions. In this article we’ll cover the first session, an introduction to K-Nearest Neighbors, with Sneha Thanasekaran as the speaker.

What is K-Nearest Neighbor?

In machine learning, K-Nearest Neighbors (KNN) is a classification algorithm based on the idea that similar cases, with similar class labels, are near each other. It uses feature similarity (distance metrics) to predict outcomes.

Let’s see this example:

New data point in KNN Classification (Image taken from WWC talk)
  • When a new data point enters (see the “?” in the yellow box), the KNN model measures the distance between its vector position and those of the other data points, which were stored beforehand.
  • After these distances are calculated, a number “K” is chosen; it determines how many neighbors to consider. K = 3 means that we are going to look at the 3 closest neighbors.
Giving a value to K in KNN Classification (Image taken from WWC talk)
  • Count how many neighbors belong to Class A and how many belong to Class B.
  • Now we can predict the class as the one that is most frequent among the neighbors; in this example, Class B.

How does KNN work? 🤔

During the training phase, the model simply stores the vector positions of all the data points in the training data set. The computation happens in the prediction phase: when a new data point enters, the model finds the ‘k’ closest data points in the training data set, as determined by a distance metric.

There are 2 important parameters that make KNN work (a minimal sketch follows this list):

  • The number of neighbors to consider in the prediction, known as K.
  • The distance metric, which can be computed by the Euclidean, Minkowski, or Manhattan method.
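To make both parameters concrete, here is a minimal from-scratch sketch in Python. The sample points are invented for illustration; this is not the session’s code.

```python
from collections import Counter
import math

def euclidean(p, q):
    """Straight-line distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_predict(train_points, train_labels, new_point, k=3):
    """Return the majority class among the k nearest training points."""
    # 1. Measure the distance from the new point to every stored point.
    distances = [(euclidean(p, new_point), label)
                 for p, label in zip(train_points, train_labels)]
    # 2. Keep the k closest neighbors.
    nearest = sorted(distances)[:k]
    # 3. Vote: the most frequent label among the neighbors wins.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy example: two small clusters, Class A and Class B.
points = [(1, 1), (1, 2), (2, 1), (6, 6), (6, 7), (7, 6)]
labels = ["A", "A", "A", "B", "B", "B"]
print(knn_predict(points, labels, (5.5, 6.0), k=3))  # -> B
```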

How to choose the K value?

Setting K too small makes the result sensitive to noise, and therefore less exact, while setting K too large makes the computation expensive.

Using our last example:

Finding the K values (Image taken from WWC talk)

> If we set k = 1, the result is that the new point is Class A.

> If we set k = 3, the result is that the new point is Class B.

> If we set k = 5, the result is that the new point is Class A.

Testing with different K values (Image taken from WWC talk)
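Reusing the knn_predict function from the sketch above, with points invented so that the vote flips the same way, you can reproduce this effect:

```python
# Invented layout: the single nearest neighbor is Class A, the next two
# are Class B, and the two after that are Class A again, so the vote
# flips as k grows (mirroring the k = 1 / 3 / 5 example above).
points = [(5.0, 5.0), (5.3, 5.0), (5.0, 5.4), (5.6, 5.0), (5.0, 5.7)]
labels = ["A", "B", "B", "A", "A"]
new_point = (4.9, 5.0)

for k in (1, 3, 5):
    print(k, knn_predict(points, labels, new_point, k=k))
# 1 A
# 3 B
# 5 A
```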

To find the optimal value, different values of K are tested and the one with the highest accuracy is chosen. For this example, we can see that when K = 9 we get 91% accuracy (the minimum error rate). A sketch of this search in code follows the chart below.

Overall accuracy vs The value of K (Image taken from WWC talk)
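In code, the search looks something like the following scikit-learn sketch; the synthetic dataset stands in for real data, so the numbers it prints are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# A stand-in dataset; replace with your own features X and labels y.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

best_k, best_acc = None, 0.0
for k in range(1, 20, 2):  # odd values of K avoid ties in binary problems
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    acc = model.score(X_test, y_test)  # overall accuracy on held-out data
    if acc > best_acc:
        best_k, best_acc = k, acc
print(best_k, best_acc)
```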

How to get the distance?

There are three common distance metrics, which were mentioned above; of these, Minkowski has been found to give better results for this purpose. Note that Minkowski generalizes the other two: with p = 1 it reduces to Manhattan distance, and with p = 2 to Euclidean distance.
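Here is a short sketch of what each metric computes for two feature vectors:

```python
def manhattan(a, b):
    # Sum of absolute coordinate differences (Minkowski with p = 1).
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    # Straight-line distance (Minkowski with p = 2).
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def minkowski(a, b, p=3):
    # General form; p is a tunable parameter.
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

print(manhattan((1, 2), (4, 6)))   # 7
print(euclidean((1, 2), (4, 6)))   # 5.0
print(minkowski((1, 2), (4, 6)))   # ~4.5
```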

👩🏽‍💻 Hands on with Excel

We will start by working with an Excel sheet; you can work on your local machine or online in Google Sheets.

  • Copy or download this file.
  • You will find this information: the height and weight of customers and their shirt size preference.
Height, Weight and Shirt Size of Old customers
  • You will have to develop a model to predict the preferred sizes of new customers! 🧠
Incoming customers information.

Now it’s time to solve this! You can follow these steps:

  • Step 1: compute the Euclidean distance using the formula =SQRT(($F$19-A2)^2+($G$19-B2)^2), starting at row 2, and drag it down to row 16.
Euclidean distance formula in Excel
  • Step 2: sort column D in ascending order for better visibility.
  • Step 3: for the first predicted size we have K = 3, height = 161, and weight = 61, so take the 3 (K’s value) closest neighbors to (161, 61).
3 closest neighbors for the first exercise
  • As we can see, size S appears twice and size L only once, so our predicted size should be S (SMALL).
  • Repeat Steps 1, 2, and 3 for the other two incoming entries; remember that you need to redo the calculation for each new incoming entry.
  • This should be your result after calculation:
Predicted Size Results

You made it, bravo! 🥳
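If you would like to double-check your spreadsheet with code, here is a minimal Python sketch of the same steps. The five customer rows are invented stand-ins, so swap in the real values from the file.

```python
from collections import Counter

# Made-up (height, weight, size) rows; use the data from the file instead.
customers = [
    (158, 58, "S"), (160, 59, "S"), (163, 61, "L"),
    (165, 63, "L"), (168, 66, "L"),
]
new_customer = (161, 61)  # height and weight from the first exercise
k = 3

# Step 1: Euclidean distance to every old customer (the =SQRT(...) column).
scored = sorted(
    customers,
    key=lambda c: ((c[0] - new_customer[0]) ** 2
                   + (c[1] - new_customer[1]) ** 2) ** 0.5,
)
# Steps 2-3: with the rows sorted ascending by distance, keep the k
# nearest and take a majority vote on the size column.
nearest_sizes = [size for _, _, size in scored[:k]]
print(Counter(nearest_sizes).most_common(1)[0][0])  # -> S
```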

👩🏽‍💻 Hands on with Python and Google Colab

Now that we are a little more familiar with KNN, we can move on to working with Python; we will use Google Colab.

  • Open Google Colab by clicking here.
  • Select the “GitHub” tab, copy this link, and press Enter.
  • Select the first page.
  • Download the googleplaystore.csv file from here.
  • Then upload it to your Google Colab workspace.
  • After this, you will be able to interact with the following exercise; remember to run it step by step (by clicking on the ▶️ button).

You will find a definition of KNN and also an explanation of every step: Data Preparation (getting rid of the null values), Feature Analysis (looking at patterns in the data using different graphics), and Label Encoding. After this, you can continue with the Model Training (a sketch of these steps follows)! 💪🏽
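Here is a rough sketch of the kind of steps the notebook walks through. The column names below (“Category”, “Rating”, “Type”) come from the googleplaystore.csv schema, but the exact features, target, and encoding the notebook uses may differ.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Data Preparation: load the file and get rid of null values.
df = pd.read_csv("googleplaystore.csv").dropna()

# Label Encoding: convert a string column into numeric codes.
df["Category"] = LabelEncoder().fit_transform(df["Category"])

# Model Training: an illustrative feature/target choice, not
# necessarily the notebook's actual one.
X = df[["Category", "Rating"]]
y = LabelEncoder().fit_transform(df["Type"])  # e.g. Free vs. Paid
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print(knn.score(X_test, y_test))  # accuracy on held-out data
```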

This blog post is a summary of a series of event sessions run by WomenWhoCode Silicon Valley on Sep 8, 2020. You can view the recording of the event below.


K-Nearest Neighbors’ Series by WWCode Data Science

🙋‍🙋🏽‍♀️ View and register for our upcoming events at http://bit.ly/siliconvalley_events
