
K-Nearest Neighbour (KNN) Implementation in Python

Harshita Yadav
Machine Learning with Python

--

In the last blog, we learned about Logistic Regression and its implementation in Python.

In this blog, we will learn about KNN and its implementation in Python.

K-Nearest Neighbour

K-Nearest Neighbour is a supervised learning technique. It can be used for both classification and regression problems, but it is mainly used for classification. It is a non-parametric algorithm, which means it makes no assumptions about the distribution of the data.

The KNN algorithm assumes that similar things exist in close proximity; in other words, similar things are near each other. It stores all the available data and classifies a new data point based on similarity.

It is a lazy learner algorithm because it does not learn from the training data immediately. During the training phase, KNN simply stores the dataset; when it receives new data, it classifies that data into the category most similar to it.
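To make the "lazy" idea concrete, here is a minimal from-scratch sketch (illustrative only, not the scikit-learn implementation we use later in this post): training just stores the data, and all the distance computation happens at prediction time.

#Minimal KNN sketch: "training" is just storing the data
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    #Euclidean distance from the new point to every stored point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    #Indices of the k closest points
    nearest = np.argsort(distances)[:k]
    #Majority vote among the k nearest labels
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

#Toy example: two small clusters
X = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
y = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X, y, np.array([2, 2])))  #-> 0, the point is nearest the first cluster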

Example: Suppose we have an image of a creature that looks similar to both a cat and a dog, and we want to know whether it is a cat or a dog. For this identification, we can use the KNN algorithm, as it works on a similarity measure. Our KNN model will compare the features of the new image with those of the cat and dog images and, based on the most similar features, put it in either the cat or the dog category.

The parameter K in KNN refers to the number of nearest neighbours of a data point to include in the decision-making process. This is the core deciding factor, as the classifier output depends on the class to which the majority of these neighbouring points belong.

If the value of K is 5, the algorithm takes the five nearest neighbouring data points into account when determining the class of the new object. Choosing the right value of K is termed parameter tuning. As the value of K increases, the decision boundary becomes smoother.
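As a hedged sketch of what parameter tuning can look like in practice (the synthetic dataset and the K values below are illustrative choices, not from this post's data), cross-validation can be used to compare several values of K:

#Comparing a few values of K with 5-fold cross-validation
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

#Synthetic binary classification data, just for illustration
X, y = make_classification(n_samples=300, random_state=0)
for k in (1, 5, 15, 45):
    score = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    print(k, round(score, 3))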

KNN Implementation in Python

Problem statement: The aim is to identify the customer segments to whom the loan can be granted.

Since this is a binary classification problem, KNN can be used to build the model.

Dataset source: https://www.kaggle.com/burak3ergun/loan-data-set

Importing the Libraries

#Importing the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

numpy: NumPy (Numerical Python) is a Python package for the computation and processing of single- and multi-dimensional array elements.

pandas: Pandas provides high-performance data manipulation in Python.

matplotlib: Matplotlib is a library used for data visualization. It is mainly used for basic plotting, such as bar, pie, line, and scatter plots.

seaborn: Seaborn is a library for making statistical graphics of the dataset. It provides a variety of visualization patterns, uses concise syntax, and comes with attractive default themes. It is used to summarize data visually and show the data’s distribution.

Reading the Dataset

#Reading the dataset
dataset = pd.read_csv("loanDataset.csv")

The dataset is in the CSV (Comma-Separated Values) format. Hence, we use pd.read_csv() to read the dataset.

dataset.head()
Loan Dataset

Dataset Column Description

  • Loan_ID: A unique Id given to each applicant to identify every individual.
  • Gender: This indicates the gender of the applicant.
  • Married: This indicates whether the applicant is married or not.
  • Dependents: This indicates the number of people who depend on the applicant financially.
  • Education: This indicates the educational status of the applicant.
  • Self_Employed: This indicates whether the applicant is self-employed or not.
  • ApplicantIncome: This indicates the income of the applicant.
  • CoApplicantIncome: This indicates the income of the co-applicant. A co-applicant is a person who applies with the borrower for a joint loan.
  • LoanAmount: This indicates the amount of the loan the applicant borrows from the bank.
  • Loan_Amount_Term: This indicates the loan term for each applicant. A term loan is a loan issued by a bank for a fixed amount, with a fixed repayment schedule and either a fixed or floating interest rate.
  • Credit_History: This indicates the applicant’s credit history, i.e., a record of a borrower’s responsible repayment of debts.
  • Loan_Status: This indicates whether the loan is approved or not (Y for approved and N for not approved).

Data Pre-Processing

  1. Checking for missing values in the dataset
#Checking for missing values
dataset.isnull().sum()
Output

The columns Gender, Married, Dependents, Self_Employed, LoanAmount, Loan_Amount_Term, and Credit_History have missing values.

2. Imputation of missing values

#Filling Gender column by mode
dataset['Gender']=dataset['Gender'].fillna(dataset['Gender'].mode().values[0])
#Filling Married column by mode
dataset['Married']=dataset['Married'].fillna(dataset['Married'].mode().values[0])
#Filling Dependents column by mode
dataset['Dependents']=dataset['Dependents'].fillna(dataset['Dependents'].mode().values[0])
#Filling Self_Employed column by mode
dataset['Self_Employed']=dataset['Self_Employed'].fillna(dataset['Self_Employed'].mode().values[0])
#Filling LoanAmount column by mean
dataset['LoanAmount']=dataset['LoanAmount'].fillna(dataset['LoanAmount'].mean())
#Filling Loan_Amount_Term column by mode
dataset['Loan_Amount_Term']=dataset['Loan_Amount_Term'].fillna(dataset['Loan_Amount_Term'].mode().values[0] )
#Filling Credit_History column by mode
dataset['Credit_History']=dataset['Credit_History'].fillna(dataset['Credit_History'].mode().values[0] )
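
As an aside, the same imputation can be written more compactly with a loop over the mode-filled columns (a sketch equivalent to the code above):

#Equivalent, more compact imputation
mode_cols = ['Gender', 'Married', 'Dependents', 'Self_Employed',
             'Loan_Amount_Term', 'Credit_History']
for col in mode_cols:
    dataset[col] = dataset[col].fillna(dataset[col].mode()[0])
dataset['LoanAmount'] = dataset['LoanAmount'].fillna(dataset['LoanAmount'].mean())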

Now, we check whether the missing values have been filled after imputation.

#Checking missing values after imputation
dataset.isna().sum()
Output

Now, there are no missing values present in the dataset.

3. Dropping unnecessary columns

#Dropping unnecessary columns
dataset.drop('Loan_ID', axis=1, inplace=True)

The column Loan_ID is unnecessary, as it does not affect the target variable, i.e., Loan_Status. Therefore, we can drop it from the dataset.

Final columns after pre-processing

dataset.head()
Loan Dataset

Exploratory Data Analysis

  1. Dataset shape
#Number of rows and columns of the dataset
dataset.shape
Output

There are 614 rows and 12 columns in the dataset.

2. Dataset info

#Dataset info
dataset.info()
Output

The data types of the columns are integer, float, and object.

3. Which gender obtains the most loans?

sns.countplot(y = 'Gender', hue = 'Loan_Status', data = dataset)
dataset['Gender'].value_counts()
Output

The above graph shows that males tend to get more loans than females.

4. Does marital status affect loan approval?

sns.countplot(y= 'Married', hue= 'Loan_Status', data= dataset)
dataset['Married'].value_counts()
Output

The above graph shows that married people tend to get more loans than unmarried people.

5. Does education status affect loan approval?

sns.countplot(y = 'Education', hue = 'Loan_Status', data = dataset)
dataset['Education'].value_counts()
Output

The above graph shows that graduates tend to get more loans.

6. Does employment affect loan approval?

sns.countplot(y= 'Self_Employed', hue= 'Loan_Status', data= dataset)
dataset['Self_Employed'].value_counts()
Output

The above graph shows that more loan applicants are not self-employed than self-employed.

7. Does credit history affect loan approval?

sns.countplot(y= 'Credit_History', hue= 'Loan_Status', data=dataset)
Output

The above graph shows that people with a good credit history tend to get loans approved more often, because they are more likely to pay back their loans.

Model Building

Before building the model, we need to perform label encoding for the categorical variables, because categorical data must be encoded into numbers before it can be used to fit and evaluate a model.

Converting object into int

#Converting some object data types to int
gender = {"Female": 0, "Male": 1}
yes_no = {'No': 0, 'Yes': 1}
dependents = {'0': 0, '1': 1, '2': 2, '3+': 3}
education = {'Not Graduate': 0, 'Graduate': 1}
property_area = {'Semiurban': 0, 'Urban': 1, 'Rural': 2}  #renamed to avoid shadowing Python's built-in property
output = {"N": 0, "Y": 1}
dataset['Gender'] = dataset['Gender'].replace(gender)
dataset['Married'] = dataset['Married'].replace(yes_no)
dataset['Dependents'] = dataset['Dependents'].replace(dependents)
dataset['Education'] = dataset['Education'].replace(education)
dataset['Self_Employed'] = dataset['Self_Employed'].replace(yes_no)
dataset['Property_Area'] = dataset['Property_Area'].replace(property_area)
dataset['Loan_Status'] = dataset['Loan_Status'].replace(output)
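
As an alternative to the manual dictionaries above, scikit-learn’s LabelEncoder can assign the integer codes automatically (a sketch; note that it assigns codes alphabetically, so the exact integers may differ from the mappings above):

#Alternative sketch: automatic encoding with LabelEncoder
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
for col in ['Gender', 'Married', 'Education', 'Self_Employed',
            'Property_Area', 'Loan_Status']:
    dataset[col] = le.fit_transform(dataset[col])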

Dataset after converting object data types into integers

dataset.head()
Output

Setting the values for independent (X) variable and dependent (Y) variable

#Setting the values for the dependent and independent variables
x = dataset.drop('Loan_Status', axis=1)
y = dataset['Loan_Status']

For the independent variables (x), we drop the “Loan_Status” column; that column itself is assigned to the target variable (y).

Splitting the dataset into train and test set

#Splitting the dataset
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test= train_test_split(x, y, test_size= 0.25, random_state=38, stratify = y)

from sklearn.model_selection import train_test_split: It is used for splitting data arrays into two subsets: for training data and testing data. With this function, you don’t need to divide the dataset manually.

We need to split our dataset into training and testing sets. We perform this by importing train_test_split from sklearn.model_selection. It is common to keep around 70–80% of the data in the train set and the rest in the test set; here we use a 75/25 split.

test_size: This parameter specifies the proportion of the dataset to include in the test split. If neither test_size nor train_size is specified, it defaults to 0.25.

random_state: This parameter controls the shuffling applied to the data before the split. Pass an int for reproducible output across multiple function calls.

stratify: The stratify parameter determines whether the train and test sets retain the same proportion of classes as found in the entire original dataset.
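
A quick way to confirm that stratification worked (a sketch using the variables defined above) is to compare the class proportions in the full target and in both splits:

#The approval rate should be roughly equal in all three
print(y.value_counts(normalize=True))
print(Y_train.value_counts(normalize=True))
print(Y_test.value_counts(normalize=True))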

Implementing the KNN Model

#Fitting the KNN model
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 5)
knn.fit(X_train, Y_train)

from sklearn.neighbors import KNeighborsClassifier: It is used to implement the KNN algorithm in Python.

To build a KNN model, we create an instance of the KNeighborsClassifier class (the variable knn) and train it on X_train and Y_train using the fit() method of that class.
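
One caveat worth flagging (an addition, not part of the original pipeline): KNN is distance-based, so features with large ranges, such as ApplicantIncome, can dominate the Euclidean distance. A common refinement is to standardize the features before fitting, for example:

#Optional refinement: standardize features so no single column dominates
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  #fit the scaler on training data only
X_test_scaled = scaler.transform(X_test)        #reuse the training statistics
knn_scaled = KNeighborsClassifier(n_neighbors=5).fit(X_train_scaled, Y_train)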

Prediction on the test set

#Prediction of test set
prediction_knn = knn.predict(X_test)
#Print the predicted values
print("Prediction for test set: {}".format(prediction_knn))
Output

Once we have fitted (trained) the model, we can make predictions using the predict() method. We pass X_test to this method and compare the predicted values, prediction_knn, with the Y_test values to check how accurate our predictions are.
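
For a quick single-number check before the full evaluation below, the classifier’s score() method returns the mean accuracy on the test set:

#Mean accuracy on the held-out test set
print(knn.score(X_test, Y_test))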

Actual values and the predicted values

#Actual value and the predicted value
a = pd.DataFrame({'Actual value': Y_test, 'Predicted value': prediction_knn})
a.head()
Actual and Predicted values

Evaluating the Model

#Confusion matrix and classification report
from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix
matrix = confusion_matrix(Y_test, prediction_knn)
sns.heatmap(matrix, annot=True, fmt="d")
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('True')
print(classification_report(Y_test, prediction_knn))

metrics: This module contains the functions used to evaluate machine learning algorithms in Python.

confusion_matrix(): It is a table that is used to describe the performance of a classification model on a set of test data for which the true values are known.

classification_report(): It is used to measure the quality of predictions from a classification algorithm.

Classification Report and Confusion Matrix

Accuracy: Accuracy represents the number of correctly classified data instances over the total number of data instances. The accuracy obtained from the classification report is 0.66, which indicates that the accuracy of the model is 66%.

Precision: It is the number of correct positive results divided by the number of positive results predicted by the classifier.

Recall: Recall measures how many of the actual positive instances the model correctly identifies.

f1-score: The F1-score is the harmonic mean of precision and recall, summarizing both in a single number.
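
For reference, all four metrics can be recomputed by hand from the confusion matrix (a sketch, taking class 1, i.e., an approved loan, as the positive class):

#Recomputing the metrics from the confusion matrix
tn, fp, fn, tp = confusion_matrix(Y_test, prediction_knn).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f1)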

Conclusion

There were 614 records in the dataset, out of which 75% of the data was used for training the model and the remaining 25%, i.e., 154 records, were used for testing. Out of those 154 records, 53 were misclassified.

Hey guys! I’m Harshita. I’m a Data Science student and trying to contribute a bit to the community by sharing my knowledge. Please share this with someone you know who is trying to learn Machine Learning. I would appreciate your comments, suggestions, or feedback. Thank you.

Email Id: harshita.1128@gmail.com

LinkedIn: www.linkedin.com/in/harshita-11

Github: www.github.com/Harshita0109
