KNN Algorithm: A Practical Implementation Of KNN Algorithm In R

Sahiti Kappagantula
Published in Edureka
17 min read · Apr 16, 2019

With the amount of data that we’re generating, the need for advanced Machine Learning algorithms has increased. One such algorithm is the K Nearest Neighbor (KNN) algorithm. In this blog on the KNN Algorithm In R, you will understand how the KNN algorithm works and how to implement it using the R language.

The following topics will be covered in this KNN Algorithm In R blog:

  1. What Is KNN Algorithm?
  2. Features Of KNN Algorithm
  3. How Does KNN Algorithm Work?
  4. KNN Algorithm Pseudocode
  5. KNN Algorithm Use Case
  6. Practical Implementation Of KNN Algorithm In R

What Is KNN Algorithm?

KNN, which stands for K Nearest Neighbors, is a Supervised Machine Learning algorithm that classifies a new data point into a target class based on the features of its neighboring data points.

Let’s try to understand the KNN algorithm with a simple example. Let’s say we want a machine to distinguish between images of cats and dogs. To do this, we feed in a dataset of cat and dog images and train our model to detect the animals based on certain features. For example, pointy ears can be used to identify cats, while long ears can be used to identify dogs.

What is KNN Algorithm? — KNN Algorithm In R — Edureka

After studying the dataset during the training phase, when a new image is given to the model, the KNN algorithm will classify it as either a cat or a dog depending on the similarity of its features. So if the new image has pointy ears, the model will classify it as a cat because it is similar to the cat images. In this manner, the KNN algorithm classifies data points based on how similar they are to their neighboring data points.

Now let’s discuss the features of the KNN algorithm.

Features Of KNN Algorithm

The KNN algorithm has the following features:

  • KNN is a Supervised Learning algorithm that uses a labeled input data set to predict the output for new data points.
  • It is one of the simplest Machine Learning algorithms and can be easily implemented for a varied set of problems.
  • It is mainly based on feature similarity: KNN checks how similar a data point is to its neighbors and classifies the data point into the class it is most similar to.
  • Unlike most algorithms, KNN is a non-parametric model, which means that it makes no assumptions about the underlying data distribution. This is useful in practice, since most real-world data sets don’t follow neat theoretical assumptions.
  • KNN is a lazy algorithm: it memorizes the training data set instead of learning a discriminative function from it, deferring all the real work to prediction time.
  • KNN can be used for solving both classification and regression problems (a minimal regression sketch follows this list).
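The classification case is what the rest of this blog implements, so here’s a quick sketch of the regression case, on made-up 1-D data: instead of taking a majority vote, KNN regression averages the target values of the K nearest neighbors.

#KNN regression on made-up data: predict by averaging the target
#values of the k nearest training points
x.train <- c(1, 2, 3, 4, 5, 6)
y.train <- c(2.1, 3.9, 6.2, 8.1, 9.8, 12.2)

x.new <- 3.5
k <- 3

nearest <- order(abs(x.train - x.new))[1:k] #indices of the k closest points
mean(y.train[nearest])                      #KNN regression estimate for x.new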

How Does KNN Algorithm Work?

To understand how the KNN algorithm works, let’s consider the following scenario:

  • In the above image, we have two classes of data, namely Class A (squares) and Class B (triangles)
  • The problem statement is to assign the new input data point to one of the two classes by using the KNN algorithm
  • The first step in the KNN algorithm is to define the value of ‘K’. But what does the ‘K’ in the KNN algorithm stand for?
  • ‘K’ stands for the number of Nearest Neighbors, hence the name K Nearest Neighbors (KNN).
  • In the above image, I’ve defined the value of ‘K’ as 3. This means that the algorithm will consider the three neighbors that are closest to the new data point in order to decide its class.
  • The closeness between the data points is calculated by using measures such as Euclidean and Manhattan distance, which I’ll be explaining below.
  • At ‘K’ = 3, the neighbors include two squares and one triangle. So, if I were to classify the new data point based on ‘K’ = 3, it would be assigned to Class A (squares).
  • But what if the ‘K’ value is set to 7? Here, I’m telling my algorithm to look for the seven nearest neighbors and classify the new data point into the class it is most similar to.
  • At ‘K’ = 7, the neighbors include three squares and four triangles. So, if I were to classify the new data point based on ‘K’ = 7, it would be assigned to Class B (triangles), since the majority of its neighbors belong to Class B. You can reproduce this vote with the short R sketch below.
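Here is a tiny, self-contained R sketch of this vote, using the knn() function from the ‘class’ package. The coordinates are made up purely to reproduce the scenario described above:

#Toy reproduction of the scenario above (made-up coordinates)
library(class)
train.points <- data.frame(x = c(3, 1.5, 3, -2, 5, 3, 6, 3),
                           y = c(5, 4, 0, 4, 4, 1.5, 4, 7.5))
train.labels <- factor(c('A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'))
new.point <- data.frame(x = 3, y = 4)
knn(train.points, new.point, cl = train.labels, k = 3) #vote among 3 neighbors -> "A"
knn(train.points, new.point, cl = train.labels, k = 7) #vote among 7 neighbors -> "B"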

In practice, there’s a lot more to consider while implementing the KNN algorithm. This will be discussed in the demo section of the blog.

Earlier I mentioned that KNN uses Euclidean distance as a measure to check the distance between a new data point and its neighbors. Let’s see how.

  • Suppose we want to measure the distance between two points P1 and P2 by using the Euclidean distance measure.
  • The coordinates of P1 and P2 are (1,4) and (5,1) respectively.
  • The Euclidean distance is then calculated like so:

Dist(P1, P2) = √((5 − 1)² + (1 − 4)²) = √(16 + 9) = √25 = 5
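In R, the same calculation is a one-liner; here’s a quick sketch:

#Euclidean distance between P1 = (1, 4) and P2 = (5, 1)
p1 <- c(1, 4)
p2 <- c(5, 1)
sqrt(sum((p1 - p2)^2)) #returns 5
dist(rbind(p1, p2))    #same result with base R's dist(), which defaults to Euclidean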

It is as simple as that! KNN makes use of simple measures in order to solve complex problems; this is one of the reasons why KNN is such a commonly used algorithm.

To sum it up, let’s look at the pseudocode for the KNN Algorithm.

KNN Algorithm Pseudocode

Consider the training set (Xi, Ci),

  • where Xi denotes the feature variables of the i-th data point, with i ranging over i = 1, 2, …, n
  • and Ci denotes the output class of Xi for each i

We assume that Ci ∈ {1, 2, 3, …, c} for all values of ‘i’, where ‘c’ denotes the total number of classes.

Now suppose there’s a new data point ‘x’ whose output class needs to be predicted. This can be done by using the K Nearest Neighbors (KNN) Algorithm.

KNN Algorithm Pseudocode:

  1. Calculate the distance d(x, Xi) between ‘x’ and every training point Xi, for i = 1, 2, …, n (using, for example, the Euclidean distance).
  2. Sort the n distances in ascending order.
  3. Take the first K points; these are the K nearest neighbors of ‘x’.
  4. Count how many of these K neighbors fall into each class Ci.
  5. Assign ‘x’ to the class with the most votes among its K neighbors.

The above pseudocode can be used for solving a classification problem by using the KNN Algorithm.
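To make the pseudocode concrete, here is a minimal from-scratch sketch in R. The function name knn_predict is our own invention for illustration, and the quick demo uses the built-in iris data set:

#From-scratch sketch of the pseudocode above (knn_predict is a made-up name)
knn_predict <- function(X, C, x, k) {
  #Step 1: Euclidean distance from x to every training point Xi
  distances <- apply(X, 1, function(Xi) sqrt(sum((Xi - x)^2)))
  #Steps 2-3: sort the distances and keep the k nearest neighbors
  nearest <- order(distances)[1:k]
  #Steps 4-5: majority vote among the classes of the k neighbors
  names(which.max(table(C[nearest])))
}

#Quick demo on the built-in iris data set
knn_predict(iris[, 1:4], iris$Species, x = c(5.9, 3.0, 5.1, 1.8), k = 5)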

Before we get into the practical implementation of KNN, let’s look at a real-world use case of the KNN algorithm.

KNN Algorithm Use-Case

Surely you have shopped on Amazon! Have you ever noticed that when you buy a product, Amazon shows you a list of recommendations based on your purchase? Not only that, Amazon also displays a section that says, ‘Customers who bought this item also bought…’.

Machine learning plays a huge role in Amazon’s recommendation system. The logic behind a recommendation engine is to suggest products to customers based on other customers who have a similar shopping behavior.

Consider an example: let’s say a customer A who loves mystery novels bought the Game Of Thrones and Lord Of The Rings book series. A couple of weeks later, another customer B who reads books of the same genre buys Lord Of The Rings. He does not buy the Game Of Thrones series, but Amazon recommends it to customer B, since his shopping behavior and choice of books are quite similar to customer A’s.

Therefore, Amazon recommends products to customers based on how similar their shopping behaviors are. This similarity can be understood by implementing the KNN algorithm which is mainly based on feature similarity.
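As a toy illustration (the purchase histories below are invented), this “similar shopping behavior” is just distance between feature vectors, which is exactly the measure KNN relies on:

#Invented 0/1 purchase histories for three customers across five books
purchases <- rbind(A = c(1, 1, 1, 0, 0),
                   B = c(0, 1, 1, 0, 0),
                   C = c(0, 0, 0, 1, 1))
#Smaller distance = more similar shopping behavior
dist(purchases) #B is much closer to A than to C, so A's purchases drive B's recommendations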

Now that you know how KNN works and how it is used in real-world applications, let’s discuss the implementation of KNN using the R language. If you’re not familiar with the R language, you can go through this video recorded by our Machine Learning experts.

Practical Implementation Of KNN Algorithm In R

Problem Statement: To study a bank credit dataset and build a Machine Learning model that predicts whether an applicant’s loan can be approved or not, based on their socio-economic profile.

Dataset Description: The bank credit dataset contains information about 1,000 loan applicants. This includes their account balance, credit amount, age, occupation, loan records, etc. By using this data, we can predict whether or not to approve the loan of an applicant.

Logic: This problem statement can be solved using the KNN algorithm, which will classify the applicant’s loan request into one of two classes: approved (Creditability = 1) or rejected (Creditability = 0).

Now that you know the objective of this project, let’s get started with the coding part.

Step 1: Import the dataset

#Import the dataset
loan <- read.csv("C:/Users/zulaikha/Desktop/DATASETS/knn dataset/credit_data.csv")

After importing the dataset, let’s take a look at the structure of the dataset:

str(loan)
'data.frame': 1000 obs. of 21 variables:
 $ Creditability                    : int 1 1 1 1 1 1 1 1 1 1 ...
 $ Account.Balance                  : int 1 1 2 1 1 1 1 1 4 2 ...
 $ Duration.of.Credit..month.       : int 18 9 12 12 12 10 8 6 18 24 ...
 $ Payment.Status.of.Previous.Credit: int 4 4 2 4 4 4 4 4 4 2 ...
 $ Purpose                          : int 2 0 9 0 0 0 0 0 3 3 ...
 $ Credit.Amount                    : int 1049 2799 841 2122 2171 2241 3398 1361 1098 3758 ...
 $ Value.Savings.Stocks             : int 1 1 2 1 1 1 1 1 1 3 ...
 $ Length.of.current.employment     : int 2 3 4 3 3 2 4 2 1 1 ...
 $ Instalment.per.cent              : int 4 2 2 3 4 1 1 2 4 1 ...
 $ Sex...Marital.Status             : int 2 3 2 3 3 3 3 3 2 2 ...
 $ Guarantors                       : int 1 1 1 1 1 1 1 1 1 1 ...
 $ Duration.in.Current.address      : int 4 2 4 2 4 3 4 4 4 4 ...
 $ Most.valuable.available.asset    : int 2 1 1 1 2 1 1 1 3 4 ...
 $ Age..years.                      : int 21 36 23 39 38 48 39 40 65 23 ...
 $ Concurrent.Credits               : int 3 3 3 3 1 3 3 3 3 3 ...
 $ Type.of.apartment                : int 1 1 1 1 2 1 2 2 2 1 ...
 $ No.of.Credits.at.this.Bank       : int 1 2 1 2 2 2 2 1 2 1 ...
 $ Occupation                       : int 3 3 2 2 2 2 2 2 1 1 ...
 $ No.of.dependents                 : int 1 2 1 2 1 2 1 2 1 1 ...
 $ Telephone                        : int 1 1 1 1 1 1 1 1 1 1 ...
 $ Foreign.Worker                   : int 1 1 1 2 2 2 2 2 1 1 ...

Note that the ‘Creditability’ variable is our output or target variable. Its value represents whether an applicant’s loan was approved (1) or rejected (0).

Step 2: Data Cleaning

From the structure of the dataset, we can see that besides the ‘Creditability’ target variable there are 20 predictor variables that could help us decide whether or not an applicant’s loan should be approved.

Some of these variables are not essential for predicting an applicant’s loan, for example, variables such as Telephone, Concurrent.Credits, Duration.in.Current.address, Type.of.apartment, etc. Such variables should be removed because they only add complexity to the Machine Learning model without improving its predictions.

loan.subset <- loan[c('Creditability','Age..years.','Sex...Marital.Status',
                      'Occupation','Account.Balance','Credit.Amount',
                      'Length.of.current.employment','Purpose')]

In the above code snippet, I’ve filtered down the predictor variables. Now let’s take a look at the structure of the reduced dataset:

str(loan.subset)
'data.frame': 1000 obs. of 8 variables:
$ Creditability : int 1 1 1 1 1 1 1 1 1 1 ...
$ Age..years. : int 21 36 23 39 38 48 39 40 65 23 ...
$ Sex...Marital.Status : int 2 3 2 3 3 3 3 3 2 2 ...
$ Occupation : int 3 3 2 2 2 2 2 2 1 1 ...
$ Account.Balance : int 1 1 2 1 1 1 1 1 4 2 ...
$ Credit.Amount : int 1049 2799 841 2122 2171 2241 3398 1361 1098 3758 ...
$ Length.of.current.employment: int 2 3 4 3 3 2 4 2 1 1 ...
$ Purpose : int 2 0 9 0 0 0 0 0 3 3 ...

Now we have narrowed the 21 variables down to 8 columns: the ‘Creditability’ target variable plus 7 predictors that are significant for building the model.

Step 3: Data Normalization

You must always normalize the data set so that no single variable dominates the distance calculation and the output remains unbiased. To see why, let’s take a look at the first few observations in our data set.

head(loan.subset)
  Creditability Age..years. Sex...Marital.Status Occupation Account.Balance Credit.Amount
1             1          21                    2          3               1          1049
2             1          36                    3          3               1          2799
3             1          23                    2          2               2           841
4             1          39                    3          2               1          2122
5             1          38                    3          2               1          2171
6             1          48                    3          2               1          2241
  Length.of.current.employment Purpose
1                            2       2
2                            3       0
3                            4       9
4                            3       0
5                            3       0
6                            2       0

Notice the Credit.Amount variable: its values are in the 1000s, whereas the rest of the variables hold single- or double-digit values. Since KNN works on distances, leaving the data unnormalized would let Credit.Amount dominate every distance calculation and lead to a biased outcome.

#Normalization
normalize <- function(x) {
  return ((x - min(x)) / (max(x) - min(x)))
}
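A quick check on a made-up vector shows what this min-max function does: the smallest value maps to 0, the largest to 1, and everything else lands in between.

#Sanity check on a made-up vector
normalize(c(10, 20, 30, 40, 50))
#[1] 0.00 0.25 0.50 0.75 1.00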

In the below code snippet, we’re storing the normalized data set in the ‘loan.subset.n’ variable and also dropping the ‘Creditability’ variable, since it’s the response variable that needs to be predicted.

loan.subset.n <- as.data.frame(lapply(loan.subset[,2:8], normalize))

This is the normalized data set:

head(loan.subset.n)
  Age..years. Sex...Marital.Status Occupation Account.Balance Credit.Amount
1  0.03571429            0.3333333  0.6666667       0.0000000    0.04396390
2  0.30357143            0.6666667  0.6666667       0.0000000    0.14025531
3  0.07142857            0.3333333  0.3333333       0.3333333    0.03251898
4  0.35714286            0.6666667  0.3333333       0.0000000    0.10300429
5  0.33928571            0.6666667  0.3333333       0.0000000    0.10570045
6  0.51785714            0.6666667  0.3333333       0.0000000    0.10955211
  Length.of.current.employment Purpose
1                         0.25     0.2
2                         0.50     0.0
3                         0.75     0.9
4                         0.50     0.0
5                         0.50     0.0
6                         0.25     0.0

Step 4: Data Splicing

After cleaning and normalizing the data set, the next step is data splicing, i.e. splitting the data set into a training set and a testing set. This is done in the following code snippet:

set.seed(123)
dat.d <- sample(1:nrow(loan.subset.n), size=nrow(loan.subset.n)*0.7, replace = FALSE) #random selection of 70% of the data
train.loan <- loan.subset.n[dat.d,] # 70% training data
test.loan <- loan.subset.n[-dat.d,] # remaining 30% test data
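If you want to be sure the split worked as intended, a quick (optional) sanity check looks like this:

#Optional sanity check on the split sizes
nrow(train.loan) #should print 700
nrow(test.loan)  #should print 300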

After deriving the training and testing data sets, the below code snippet creates a separate data frame for the ‘Creditability’ variable, so that our final predictions can be compared with the actual values:

#Creating separate dataframe for 'Creditability' feature, which is our target
train.loan_labels <- loan.subset[dat.d, 1]
test.loan_labels <- loan.subset[-dat.d, 1]

Step 5: Building a Machine Learning model

At this stage, we have to build a model by using the training data set. Since we’re using the KNN algorithm to build the model, we must first install the ‘class’ package provided by R. This package contains the knn() function:

#Install class package
install.packages('class')
# Load class package
library(class)

Next, we’re going to calculate the number of observations in the training data set. The reason we’re doing this is that we want to initialize the value of ‘K’ in the KNN model. A common rule of thumb for a starting ‘K’ value is the square root of the total number of observations in the training data set.

#Find the number of observations
NROW(train.loan_labels)
[1] 700

So, we have 700 observations in our training data set. The square root of 700 is around 26.45, therefore we’ll create two models. One with ‘K’ value as 26 and the other model with a ‘K’ value as 27.
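Rather than computing the square root by hand, you can let R do it:

#Square root of the number of training observations, as a starting value for K
sqrt(NROW(train.loan_labels)) #26.45751, so we try K = 26 and K = 27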

knn.26 <- knn(train=train.loan, test=test.loan, cl=train.loan_labels, k=26)
knn.27 <- knn(train=train.loan, test=test.loan, cl=train.loan_labels, k=27)

Step 6: Model Evaluation

After building the model, it is time to calculate the accuracy of the created models:

#Calculate the proportion of correct classification for k = 26, 27
ACC.26 <- 100 * sum(test.loan_labels == knn.26)/NROW(test.loan_labels)
ACC.27 <- 100 * sum(test.loan_labels == knn.27)/NROW(test.loan_labels)
ACC.26
[1] 67.66667
ACC.27
[1] 67.33333

As shown above, the accuracy for K = 26 is 67.66% and for K = 27 it is 67.33%. We can also check the predicted outcome against the actual value in tabular form:

# Check prediction against actual value in tabular form for k=26
table(knn.26, test.loan_labels)
      test.loan_labels
knn.26   0   1
     0  11   7
     1  90 192

knn.26
  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1
 [51] 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[101] 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[151] 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1
[201] 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1
[251] 0 1 1 0 1 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1
Levels: 0 1

# Check prediction against actual value in tabular form for k=27
table(knn.27, test.loan_labels)
      test.loan_labels
knn.27   0   1
     0  11   8
     1  90 191

knn.27
  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1
 [51] 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[101] 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[151] 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1
[201] 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1
[251] 0 1 1 0 1 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1
Levels: 0 1

You can also use a confusion matrix to calculate the accuracy. To do this, we must first install the popular caret package:

install.packages('caret')
library(caret)

Now, let’s use the confusion matrix to calculate the accuracy of the KNN model with K value set to 26:

confusionMatrix(table(knn.26, test.loan_labels))
Confusion Matrix and Statistics

      test.loan_labels
knn.26   0   1
     0  11   7
     1  90 192

               Accuracy : 0.6767
                 95% CI : (0.6205, 0.7293)
    No Information Rate : 0.6633
    P-Value [Acc > NIR] : 0.3365

                  Kappa : 0.0924
 Mcnemar's Test P-Value : <2e-16

            Sensitivity : 0.10891
            Specificity : 0.96482
         Pos Pred Value : 0.61111
         Neg Pred Value : 0.68085
             Prevalence : 0.33667
         Detection Rate : 0.03667
   Detection Prevalence : 0.06000
      Balanced Accuracy : 0.53687

       'Positive' Class : 0

So, from the output, we can see that our model predicts the outcome with an accuracy of 67.67%, which is reasonable given that we worked with a fairly small data set. That said, it is only slightly above the No Information Rate of 66.33%, so there is clearly room for improvement. A point to remember is that the more good-quality data you feed the machine, the more efficient the model will be.

Step 7: Optimization

In order to improve the accuracy of the model, you can use a number of techniques, such as the elbow method or a maximum-percentage-accuracy graph. In the below code snippet, I’ve created a loop that calculates the accuracy of the KNN model for ‘K’ values ranging from 1 to 28. This way you can check which ‘K’ value results in the most accurate model:

i = 1
k.optm = 1
for (i in 1:28) {
  knn.mod <- knn(train=train.loan, test=test.loan, cl=train.loan_labels, k=i)
  k.optm[i] <- 100 * sum(test.loan_labels == knn.mod)/NROW(test.loan_labels)
  k = i
  cat(k, '=', k.optm[i], '\n')
}
1 = 60.33333
2 = 58.33333
3 = 60.33333
4 = 61
5 = 62.33333
6 = 62
7 = 63.33333
8 = 63.33333
9 = 63.33333
10 = 64.66667
11 = 64.66667
12 = 65.33333
13 = 66
14 = 64
15 = 66.66667
16 = 67.66667
17 = 67.66667
18 = 67.33333
19 = 67.66667
20 = 67.66667
21 = 66.33333
22 = 67
23 = 67.66667
24 = 67
25 = 68
26 = 67.66667
27 = 67.33333
28 = 66.66667

From the output you can see that for K = 25, we achieve the maximum accuracy, i.e. 68%. We can also represent this graphically, like so:

#Accuracy plot
plot(k.optm, type="b", xlab="K- Value",ylab="Accuracy level")
Accuracy Plot — KNN Algorithm In R — Edureka

The above graph shows that for ‘K’ value of 25 we get the maximum accuracy. Now that you know how to build a KNN model, I’ll leave it up to you to build a model with ‘K’ value as 25.
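If you want to check your result afterwards, one possible solution simply follows the same pattern as the earlier models:

#One possible solution for K = 25, following the same pattern as before
knn.25 <- knn(train=train.loan, test=test.loan, cl=train.loan_labels, k=25)
ACC.25 <- 100 * sum(test.loan_labels == knn.25)/NROW(test.loan_labels)
ACC.25 #should come out around 68, matching the loop output above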

This brings us to the end of this article, where we have learned how the KNN algorithm works and how to implement it in R. I hope you are clear with all that has been shared with you in this tutorial.

If you wish to check out more articles on the market’s most trending technologies like Python, DevOps, Ethical Hacking, then you can refer to Edureka’s official site.

Do look out for other articles in this series which will explain the various other aspects of Data Science.

1.Data Science Tutorial

2.Math And Statistics For Data Science

3.Linear Regression in R

4.Machine Learning Algorithms

5.Logistic Regression In R

6.Classification Algorithms

7.Random Forest In R

8.Decision Tree in R

9.Introduction To Machine Learning

10.Naive Bayes in R

11.Statistics and Probability

12.How To Create A Perfect Decision Tree?

13.Top 10 Myths Regarding Data Scientists Roles

14.Top Data Science Projects

15.Data Analyst vs Data Engineer vs Data Scientist

16.Types Of Artificial Intelligence

17.R vs Python

18.Artificial Intelligence vs Machine Learning vs Deep Learning

19.Machine Learning Projects

20.Data Analyst Interview Questions And Answers

21.Data Science And Machine Learning Tools For Non-Programmers

22.Top 10 Machine Learning Frameworks

23.Statistics for Machine Learning

24.Breadth-First Search Algorithm

25.Linear Discriminant Analysis in R

26.Prerequisites for Machine Learning

27.Interactive WebApps using R Shiny

28.Top 10 Books for Machine Learning

29.Supervised Learning

30.10 Best Books for Data Science

31.Machine Learning using R

Originally published at https://www.edureka.co on April 16, 2019.


Sahiti Kappagantula is a Data Science and Robotic Process Automation enthusiast and a technical writer at Edureka.