K-Nearest Neighbor with Practical Implementation

Amir Ali
The Art of Data Science
9 min read · Jul 21, 2018


In this chapter, we will discuss the k-Nearest Neighbors algorithm, a supervised machine learning algorithm used for classification problems.

This chapter spans 3 parts:

  1. What is the k-Nearest Neighbors Algorithm?
  2. How do k-Nearest Neighbors work?
  3. Practical Implementation of k-Nearest Neighbors in Scikit-Learn.

1. What is the k-Nearest Neighbors Algorithm?

kNN is one of the simplest classification algorithms and one of the most widely used learning algorithms. kNN belongs to the supervised learning family of algorithms. Informally, this means that we are given a labelled dataset consisting of training observations (x, y) and would like to capture the relationship between x and y. More formally, our goal is to learn a function h: X → Y so that, given an unseen observation x, h(x) can confidently predict the corresponding output y.

kNN is a simple algorithm that stores all available cases and classifies new cases based on a similarity measure (e.g. a distance function). kNN has been used in statistical estimation and pattern recognition as a nonparametric technique since the beginning of the 1970s.

kNN is a non-parametric, instance-based learning algorithm.

kNN is also a lazy algorithm: it builds no explicit model during training and defers all computation until a new point needs to be classified.

The kNN algorithm is based on feature similarity: how closely out-of-sample features resemble our training set determines how we classify a given data point. Let's take an example to understand kNN better.

1.1 Real-Life Example

Let's explain this example briefly through the figure above. In the figure, we have two classes: Class A belongs to the yellow family and Class B belongs to the purple family. We train the dataset in the kNN model (which we discuss later), but for this example just focus on two choices of k: k = 3 means the three nearest neighbors, and k = 6 means the six nearest neighbors. When k = 3, two of the three nearest neighbors belong to the purple class and one belongs to the yellow class, so by majority vote the point is assigned to the purple class. Similarly, when k = 6, four of the six nearest neighbors belong to the yellow class and two belong to the purple class, so the majority vote is yellow and the point is assigned to the yellow class. This is how kNN works.

Let’s explain briefly how kNN works.

2. How Does the k-Nearest Neighbor Algorithm Work?

In the classification setting, the k-Nearest Neighbor algorithm essentially boils down to forming a majority vote between the k most similar instances to a given 'unseen' observation. Similarity is defined according to a distance metric between two data points. A popular choice is the Euclidean distance. Very often, especially when measuring distance in the plane, we use the formula for the Euclidean distance. According to this formula, the distance between two points in the plane with coordinates (x, y) and (a, b) is given by:

dist((x, y), (a, b)) = √((x − a)² + (y − b)²)

More formally, given a positive integer k, an unseen observation x, and a similarity metric d, the kNN classifier performs the following two steps:

  1. It runs through the whole dataset computing d between x and each training observation. We'll call the k points in the training dataset closest to x the set A. Note that k is usually odd to prevent tie situations.
  2. It then estimates the conditional probability for each class, i.e. the fraction of points in A with that class label, and assigns the most probable class, as in the sketch below.
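
To make these two steps concrete, here is a minimal from-scratch sketch in Python with NumPy. This is only an illustration, not the scikit-learn implementation used later, and the toy data is made up:

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    # Step 1: compute the Euclidean distance from x to every training observation
    distances = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    # ...and keep the indices of the k closest points (the set A)
    nearest = np.argsort(distances)[:k]
    # Step 2: majority vote over the labels of the k nearest neighbors
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]

# Toy data (hypothetical values): two sepal measurements per flower
X_train = np.array([[5.3, 3.7], [5.1, 3.8], [7.2, 3.0], [5.4, 3.4]])
y_train = np.array(['Setosa', 'Setosa', 'Virginica', 'Setosa'])
print(knn_predict(X_train, y_train, np.array([5.2, 3.1]), k=3))  # -> Setosa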

“kNN searches the memorized training observations for the instances that most closely resemble the new instance and assigns to it their most common class.”

An alternative way of understanding kNN is by thinking of it as calculating a decision boundary (i.e. the boundaries separating two or more classes), which is then used to classify new points.

Dataset:

This is the Iris dataset, taken from the UCI Repository. In this dataset, we have 3 attributes: sepal length, sepal width, and species. Species is the target attribute, and it takes three values (Setosa, Virginica, and Versicolor). Our goal is to find which of the three species a new flower belongs to using k-Nearest Neighbors.

Target: A new flower has been found; we need to classify this "Unlabeled" instance.

Features of the new unlabeled flower: sepal length = 5.2, sepal width = 3.1.

Solution:

Step 1: Find Distance:

Our first step is to find the distance, using the Euclidean distance formula, between the observed (new flower) and actual (training) sepal length and sepal width. For the first instance of the dataset:

X = Observed sepal length = 5.2

Y = Observed sepal width = 3.1

Now for the actual values, which are given in the dataset:

A = Actual sepal length = 5.3

B = Actual sepal width = 3.7

Distance formula:

d = √((X − A)² + (Y − B)²)
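
Plugging the observed and actual values for the first instance into this formula (a quick arithmetic check):

d = √((5.2 − 5.3)² + (3.1 − 3.7)²) = √(0.01 + 0.36) = √0.37 ≈ 0.61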

This is how we find the distance for the first instance; similarly, we find the distances for all the remaining instances, as shown in the table below.

Step 2: Find Rank:

In this step, we find the rank after finding the distance. The rank simply numbers the instances in ascending order of distance, as you can see in the table below:

If we look at the table, instance number 5 has the minimum distance (0.22), so we give it rank 1, as in the table below.

Similarly, we find the rank for all the other instances, as shown in the table below (a small NumPy sketch of this ranking step follows).
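
As a sketch, this ranking step is a one-liner with NumPy's argsort. The distances below are hypothetical stand-ins for the table, apart from the 0.22 minimum given above and the 0.61 we computed:

import numpy as np

distances = np.array([0.61, 0.90, 1.20, 0.75, 0.22, 1.05])
# Rank 1 = smallest distance; argsort of argsort gives each instance's rank
ranks = distances.argsort().argsort() + 1
print(ranks)  # [2 4 6 3 1 5] — instance 5 gets rank 1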

Step 3: Find the Nearest Neighbor:

In our last step, we find the nearest neighbors on the basis of distance and rank, and classify our unknown flower by the species of those neighbors. According to the rank, we find the k nearest neighbors:

For k = 1

The nearest neighbor's species is Setosa, so the prediction for k = 1 is Setosa.

For k = 2

Both of the two nearest neighbors are Setosa (no other species appears), so the prediction for k = 2 is Setosa.

For k = 5

The majority vote is for Setosa (Setosa = 3, Versicolor = 1, Virginica = 1), so on the basis of the highest vote the kNN prediction for k = 5 is Setosa.

So this is how the k-Nearest Neighbors algorithm works.

Note: If you want this article, check out my Academia.edu profile.

3. Practical Implementation of k-Nearest Neighbors in Scikit-Learn

Dataset description:

This is the same Iris dataset described above, taken from the UCI Repository: the predictor attributes are sepal length and sepal width, and the target attribute species takes three values (Setosa, Virginica, and Versicolor). Our goal is to find which of the three species a new flower belongs to using k-Nearest Neighbors.

Part 1: Data Preprocessing:

1.1 Import the Libraries

In this step, we import three libraries for the data preprocessing part. A library is a tool that you can use to do a specific job. First of all, we import the numpy library, used for multidimensional arrays; then the pandas library, used to import the dataset; and lastly the matplotlib library, used for plotting graphs.
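
A sketch of this step (the original code was shown as an image):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt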

1.2 Import the dataset

In this step, we import the dataset using the pandas library. After importing our dataset, we define our predictor and target attributes. Our predictor attributes are sepal length and sepal width, as you can see in the sample dataset, which we call 'X' here; species is the target attribute, which we call 'y'.
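
A sketch, assuming the data sits in a CSV file named iris.csv with the species in the last column (the file name and column order are assumptions):

dataset = pd.read_csv('iris.csv')
X = dataset.iloc[:, :-1].values  # predictors: sepal length and sepal width
y = dataset.iloc[:, -1].values   # target: species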

1.3 Encoding the Categorical data

In this step, we encode our categorical data. If we look at our dataset, the species attribute is text, so we encode it in this part using the LabelEncoder from the sklearn.preprocessing library. (A OneHotEncoder would additionally be needed for non-ordinal predictor columns, which this dataset does not have.)
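
A sketch of the encoding step:

from sklearn.preprocessing import LabelEncoder

labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)  # Setosa / Versicolor / Virginica -> 0 / 1 / 2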

1.4 Split the dataset for test and train

In this step, we split our dataset into a training set and a test set: 75% of the dataset is used for training and the remaining 25% for testing.
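
A sketch using scikit-learn's train_test_split (the random_state is an arbitrary choice for reproducibility):

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)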

1.5 Feature Scaling

Feature scaling is a very important part of data preprocessing. If a dataset's attributes contain numeric values on very different scales, with some values very high and some very low, this can cause problems for our machine learning model. To solve that problem, we put all values on the same scale; there are two common methods, the first being normalization and the second standardization (the StandardScaler).

Here we use the StandardScaler, imported from the sklearn library.
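
A sketch of the scaling step (note that the scaler is fitted on the training set only and then applied to the test set):

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)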

Part 2: Building the kNN classifier model:

In this part, we build our kNN classifier using the Scikit-Learn library.

2.1 Import the Libraries

In this step, we start building our kNN model. To do this, we first import the kNN classifier from the Scikit-Learn library.
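
A sketch of the import:

from sklearn.neighbors import KNeighborsClassifier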

2.2 Initialize our kNN model

After the import, we initialize our model with 5 nearest neighbors and the Minkowski metric with power p = 2. Note that the Minkowski distance with p = 2 is exactly the Euclidean distance.
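
A sketch of the initialization:

classifier = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)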

2.3 Fitting the kNN Model

In this step, we fit the model to the training data; X_train and y_train are our training data.
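
A sketch of the fitting step:

classifier.fit(X_train, y_train)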

Part 3: Making the Prediction and Visualizing the result:

In this part, we make predictions on our test set and visualize the results using the matplotlib library.

3.1 Predict the test set Result

In this step, we predict our test set result.
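
A sketch of the prediction step:

y_pred = classifier.predict(X_test)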

3.2 Confusion Matrix

In this step, we build a confusion matrix for our test set result. To do that, we import confusion_matrix from sklearn.metrics; then we pass it two parameters: first y_test, which is the actual test set result, and second y_pred, which is the predicted result.
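
A sketch of this step:

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)
print(cm)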

3.3 Accuracy Score

In this step, we calculate the accuracy score from the actual test results and the predicted test results.
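
A sketch of this step:

from sklearn.metrics import accuracy_score

print(accuracy_score(y_test, y_pred))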

3.4 Visualize our Test Set Result

In this step, we visualize our test set result. To do this, we use the matplotlib library. In the resulting graph, we can see that only 2 points are mapped incorrectly and the remaining ones are mapped correctly, according to the model's test set result.
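
A sketch of a decision-region plot in the scaled feature space (the colors, the 0.01 grid step, and the axis labels are arbitrary choices; this assumes the two scaled predictors in X_test and the integer-encoded labels in y_test from the steps above):

from matplotlib.colors import ListedColormap

X_set, y_set = X_test, y_test
# Build a fine grid over the two scaled features
x1, x2 = np.meshgrid(
    np.arange(X_set[:, 0].min() - 1, X_set[:, 0].max() + 1, 0.01),
    np.arange(X_set[:, 1].min() - 1, X_set[:, 1].max() + 1, 0.01))
# Color each grid point by the class the model predicts there
grid_pred = classifier.predict(np.c_[x1.ravel(), x2.ravel()]).reshape(x1.shape)
plt.contourf(x1, x2, grid_pred, alpha=0.4, cmap=ListedColormap(('red', 'green', 'blue')))
# Overlay the actual test points, one color per true class
for i, label in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == label, 0], X_set[y_set == label, 1],
                color=('red', 'green', 'blue')[i], label=label)
plt.title('kNN (Test set)')
plt.xlabel('Sepal Length (scaled)')
plt.ylabel('Sepal Width (scaled)')
plt.legend()
plt.show()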

If you want the dataset and code, you can also check my GitHub profile.

End Notes:

If you liked this article, be sure to click ❤ below to recommend it and if you have any questions, leave a comment and I will do my best to answer.

To stay more aware of the world of machine learning, follow me. It's the best way to find out when I write more articles like this.

You can also follow me on GitHub for the code & dataset, follow me on Academia.edu for this article, follow me on Twitter, email me directly, or find me on LinkedIn. I'd love to hear from you.

That’s all folks, Have a nice day :)
