Machine Learning Classification using KNN, Decision Tree, SVM, Logistic Regression

Yash Vardhan Singh
7 min readMar 23, 2021

In this article I will show you how to solve classification problem using machine learning algorithms. Different classification algorithms are used and results are compared to see the best one for the specific dataset by accuracy evaluation methods.

Let’s get started. . .

Import all the required libraries

Dataset description

I came across this problem statement during my online course on Coursera for IBM machine learning certificate.
Dataset is about the past loans. The Loan_train.csv data set includes details of 346 customers whose loan are already paid off or defaulted. It includes following fields:

Let us download the data

!wget -O loan_train.csv https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillsNetwork/labs/FinalModule_Coursera/data/loan_train.csv

Read data from csv file

We see from the data that ‘due_date’ and ‘effective_date’ are not in the correct datetime format. I converted these values to date time object

Data Visualization and pre-processing

Next step is to visualize the dataset to get the better understanding. We will need to do some pre processing on the data before training any machine learning algorithm

Let us check number of examples for each class present in our dataset

260 people have paid off the loan on time while 86 have gone into collection

Plot some graphs to get better understanding of data

Pre-processing and Feature selection

Let us see at the day of the week when people get the loan. This will help us build our features list.

Here, we can see that people who get the loan at the end of the week don't pay it off. Therefore, I have used feature binarization to set the threshold values less than day 4.
Created another column ‘Weekend’ to with value 1 if ‘dayofweek’ is greater than 3 else it is kept as 0

Convert categorical features to numerical values

We see that 86% female pay loans and only 73% of males pay there loan.

Let us convert these values to numeric values. Male as 0 and Female as 1.

One Hot Encoding

In the above dataframe we see that education column has different values like ‘High School or Below’, ‘Bechalor’ etc.
Using one hot encoding we can create separate features for our training set.

Use one hot encoding technique to convert categorical variables to binary variables and append them to the feature set.

Feature selection process

Labels corresponding to each row from our feature set

Normalize Data

Classification

I will demonstrate the use of various classification algorithm to predict if customer is likely to pay the loan.

I have used the following machine algorithms to predict:

  • K Nearest Neighbor(KNN)
  • Decision Tree
  • Support Vector Machine
  • Logistic Regression

K Nearest Neighbor(KNN)

I have split dataset into train and test data.

In KNN algorithm, we need to decide which value of K will give the best result. I ran the KNN classifier for all the k for max value of 19. ‘metric_array’ is created to store the accuracy of the model against the actual values for every value of k.

From the metric array I see that best accuracy came with the k value of 7.
K=7 is the best suited value for the classifier.

Decision Tree

Support Vector Machine

Logistic Regression

Model Evaluation using Test set

We will evaluate each of the machine learning models. I used Jaccard Score, f1_scrore for all the above used algorithms. For logisitic regression, I have used log_loss also.

Let us now download the test data.

!wget -O loan_test.csv https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/loan_test.csv

Load Test set for evaluation

We will also have to build features from the test data

We will also need to do same preprocessing that was performed on the training dataset.

Now, let us report evaluation metrics for each of the used algorithms.

Based on the above evaluation report we can say that SVM performed best for this classification problem.

This problem statement is part of a coursera certification.
Link: https://www.coursera.org/learn/machine-learning-with-python/home/welcome

--

--

Yash Vardhan Singh

Data Scientist | NLP | Computer Vision | Deep Learning | LLMs |