Machine Learning Classification using KNN, Decision Tree, SVM, Logistic Regression
In this article I will show you how to solve a classification problem using machine learning algorithms. I apply several classification algorithms and compare their results with accuracy-based evaluation metrics to see which one works best for this specific dataset.
Let's get started.
Import all the required libraries
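The imports below are a minimal sketch of what the steps in this article would typically need. The exact list is my assumption, and the plotting libraries are left out, so adjust it to whatever you actually use.

```python
# Assumed set of libraries for the steps in this article (not an exhaustive list).
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import jaccard_score, f1_score, log_loss
```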
Dataset description
I came across this problem statement during an online course on Coursera for the IBM machine learning certificate.
The dataset is about past loans. The loan_train.csv data set includes details of 346 customers whose loans are already paid off or defaulted. It includes the following fields:
Let us download the data
!wget -O loan_train.csv https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillsNetwork/labs/FinalModule_Coursera/data/loan_train.csv
Read data from csv file
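Reading the file is a single call to pandas. In the sketch below I use a tiny in-memory CSV as a stand-in so that it runs without the download; the column names and rows are assumptions for illustration, not the real contents of the file.

```python
import io
import pandas as pd

# Toy stand-in for loan_train.csv so the snippet runs offline.
# Column names and values are assumptions made for illustration only.
toy_csv = io.StringIO(
    "loan_status,effective_date,due_date,Gender,education\n"
    "PAIDOFF,9/8/2016,10/7/2016,male,High School or Below\n"
    "PAIDOFF,9/8/2016,10/7/2016,female,Bechalor\n"
    "COLLECTION,9/10/2016,10/9/2016,male,college\n"
)

# With the downloaded file you would call pd.read_csv("loan_train.csv") instead.
df = pd.read_csv(toy_csv)
print(df.shape)
print(df.head())
```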
We see from the data that ‘due_date’ and ‘effective_date’ are not in the correct datetime format, so I converted these values to datetime objects.
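A minimal sketch of that conversion with `pd.to_datetime`, assuming the dates are stored as month/day/year strings (the sample values here are made up):

```python
import pandas as pd

# Hypothetical sample rows; the real file may use a different date format.
df = pd.DataFrame({
    "effective_date": ["9/8/2016", "9/10/2016"],
    "due_date": ["10/7/2016", "10/9/2016"],
})

# Parse the strings into datetime64 values.
df["effective_date"] = pd.to_datetime(df["effective_date"], format="%m/%d/%Y")
df["due_date"] = pd.to_datetime(df["due_date"], format="%m/%d/%Y")
print(df.dtypes)
```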
Data Visualization and pre-processing
The next step is to visualize the dataset to get a better understanding of it. We will need to do some pre-processing on the data before training any machine learning algorithm.
Let us check the number of examples for each class present in our dataset.
260 people have paid off the loan on time, while 86 have gone into collection.
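Class counts like these can be obtained with `value_counts`. The sketch below uses a four-row toy column, and both the column name `loan_status` and the label strings are assumptions, so the printed numbers are not the ones from the real data.

```python
import pandas as pd

# Toy label column; name and label strings are assumed for illustration.
df = pd.DataFrame({"loan_status": ["PAIDOFF", "PAIDOFF", "COLLECTION", "PAIDOFF"]})

# Count how many rows fall into each class.
counts = df["loan_status"].value_counts()
print(counts)
```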
Plot some graphs to get a better understanding of the data
Pre-processing and Feature selection
Let us look at the day of the week on which people get the loan. This will help us build our feature list.
Here, we can see that people who get the loan at the end of the week are less likely to pay it off. Therefore, I used feature binarization with the threshold set at day 4.
I created another column, ‘Weekend’, with value 1 if ‘dayofweek’ is greater than 3 and 0 otherwise.
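The binarization described above can be sketched as follows. I am assuming ‘dayofweek’ comes from the ‘effective_date’ column and uses the pandas convention of Monday = 0; the dates are made up.

```python
import pandas as pd

# Hypothetical dates: a Thursday, a Saturday and a Monday.
df = pd.DataFrame({
    "effective_date": pd.to_datetime(["2016-09-08", "2016-09-10", "2016-09-12"])
})

# Day of the week as an integer, Monday = 0 ... Sunday = 6.
df["dayofweek"] = df["effective_date"].dt.dayofweek

# Binarize: 1 for Friday through Sunday (dayofweek > 3), else 0.
df["Weekend"] = (df["dayofweek"] > 3).astype(int)
print(df)
```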
Convert categorical features to numerical values
We see that 86% of females pay off their loans, while only 73% of males do.
Let us convert these values to numeric values: male as 0 and female as 1.
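Here is one way to compute per-gender payoff rates and then apply the 0/1 mapping. The column names, the lowercase `male`/`female` strings and the toy rows are assumptions, so the rates printed here are not the 86% and 73% from the real data.

```python
import pandas as pd

# Toy rows; column names and category strings are assumed.
df = pd.DataFrame({
    "Gender": ["male", "female", "male", "female"],
    "loan_status": ["PAIDOFF", "PAIDOFF", "COLLECTION", "PAIDOFF"],
})

# Share of each loan status within each gender.
rates = df.groupby("Gender")["loan_status"].value_counts(normalize=True)
print(rates)

# Replace the strings with numbers: male -> 0, female -> 1.
df["Gender"] = df["Gender"].map({"male": 0, "female": 1})
```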
One Hot Encoding
In the above dataframe we see that the education column has different values like ‘High School or Below’, ‘Bechalor’, etc.
Using one hot encoding we can create separate features for our training set.
Use the one hot encoding technique to convert categorical variables to binary variables and append them to the feature set.
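A short sketch of this step with `pd.get_dummies`. The `Principal` column and the name `feature` are placeholders I picked for the example; the education values are the ones mentioned above plus an assumed `college`.

```python
import pandas as pd

# Toy frame; "Principal" is a hypothetical numeric column.
df = pd.DataFrame({
    "Principal": [1000, 800, 1000],
    "education": ["High School or Below", "Bechalor", "college"],
})

# Start from the numeric columns, then append one binary column per category.
feature = df[["Principal"]]
feature = pd.concat([feature, pd.get_dummies(df["education"])], axis=1)
print(feature.head())
```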
Feature selection process
Labels corresponding to each row from our feature set
Normalize Data
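Normalization here could be done with scikit-learn's `StandardScaler`, which rescales each feature to zero mean and unit variance. This is an assumption about the method, and the numbers are made up.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up feature matrix with two columns on very different scales.
X = np.array([[1000.0, 30.0], [800.0, 15.0], [1000.0, 30.0], [300.0, 7.0]])

# Learn per-column mean and standard deviation, then rescale.
scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)
print(X_scaled)
```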
Classification
I will demonstrate the use of various classification algorithms to predict whether a customer is likely to pay off the loan.
I have used the following machine learning algorithms to predict:
- K Nearest Neighbor (KNN)
- Decision Tree
- Support Vector Machine
- Logistic Regression
K Nearest Neighbor (KNN)
I have split the dataset into train and test data.
In the KNN algorithm, we need to decide which value of K gives the best result. I ran the KNN classifier for every k up to a maximum value of 19. ‘metric_array’ is created to store the accuracy of the model's predictions against the actual values for every value of k.
From the metric array I see that the best accuracy came with a k value of 7.
So K=7 is the best suited value for the classifier.
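The search over k described above can be sketched like this. I use a synthetic dataset from `make_classification` so the snippet runs on its own, and the 80/20 split and random seeds are assumptions, so the best k it prints will not necessarily be 7.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the normalized loan features and labels.
X, y = make_classification(n_samples=346, n_features=8, random_state=4)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=4
)

max_k = 19
metric_array = np.zeros(max_k)  # metric_array[k - 1] holds the accuracy for k

for k in range(1, max_k + 1):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    metric_array[k - 1] = accuracy_score(y_test, knn.predict(X_test))

# argmax is zero-based, so add 1 to get the k value.
best_k = int(metric_array.argmax()) + 1
print("best k:", best_k, "accuracy:", metric_array.max())
```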
Decision Tree
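A minimal sketch of fitting a decision tree with scikit-learn. The data is synthetic, and the entropy criterion and depth limit of 4 are assumed settings for illustration, not tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the loan features and labels.
X, y = make_classification(n_samples=346, n_features=8, random_state=4)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=4
)

# Assumed hyperparameters: entropy splits, tree depth capped at 4.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=4, random_state=4)
tree.fit(X_train, y_train)
tree_pred = tree.predict(X_test)
print(tree_pred[:5])
```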
Support Vector Machine
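A comparable sketch for a support vector classifier. The RBF kernel is scikit-learn's default and is only an assumption here; the data is again synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the loan features and labels.
X, y = make_classification(n_samples=346, n_features=8, random_state=4)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=4
)

# Assumed kernel choice; other kernels may suit the real data better.
svm_clf = SVC(kernel="rbf")
svm_clf.fit(X_train, y_train)
svm_pred = svm_clf.predict(X_test)
print(svm_pred[:5])
```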
Logistic Regression
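And a sketch for logistic regression. The regularization strength `C=0.01` and the `liblinear` solver are assumed values. I also call `predict_proba`, since class probabilities are what a log loss calculation needs.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the loan features and labels.
X, y = make_classification(n_samples=346, n_features=8, random_state=4)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=4
)

# Assumed hyperparameters for illustration.
lr = LogisticRegression(C=0.01, solver="liblinear")
lr.fit(X_train, y_train)

lr_pred = lr.predict(X_test)         # hard class labels
lr_proba = lr.predict_proba(X_test)  # one probability column per class
print(lr_pred[:5])
print(lr_proba[:5])
```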
Model Evaluation using Test set
We will evaluate each of the machine learning models. I used the Jaccard score and f1_score for all the algorithms used above. For logistic regression, I have also used log_loss.
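To show how these three metrics are called in scikit-learn, here is a small sketch on hand-made labels. The label strings, the choice of `PAIDOFF` as the positive class and the weighted F1 average are assumptions for the example.

```python
from sklearn.metrics import f1_score, jaccard_score, log_loss

# Hand-made labels and predictions, for illustration only.
y_true = ["PAIDOFF", "PAIDOFF", "COLLECTION", "PAIDOFF", "COLLECTION"]
y_pred = ["PAIDOFF", "PAIDOFF", "PAIDOFF", "PAIDOFF", "COLLECTION"]

# Predicted probabilities, columns ordered as [COLLECTION, PAIDOFF].
y_prob = [[0.2, 0.8], [0.1, 0.9], [0.4, 0.6], [0.3, 0.7], [0.7, 0.3]]

# Jaccard: overlap between predicted and actual PAIDOFF rows.
jaccard = jaccard_score(y_true, y_pred, pos_label="PAIDOFF")

# F1 averaged over both classes, weighted by class frequency.
f1 = f1_score(y_true, y_pred, average="weighted")

# Log loss needs probabilities, so it only applies to probabilistic models.
ll = log_loss(y_true, y_prob, labels=["COLLECTION", "PAIDOFF"])

print(jaccard, f1, ll)
```

Jaccard and F1 work on hard labels, while log loss penalizes confident wrong probabilities, which is why it only fits models that output probabilities.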
Let us now download the test data.
!wget -O loan_test.csv https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/loan_test.csv
Load Test set for evaluation
We will also have to build features from the test data
We will also need to do the same preprocessing that was performed on the training dataset.
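One way to keep the train and test preprocessing identical is to put the steps in a single function and call it on both frames. The helper `build_features`, the column names and the toy rows below are hypothetical. Reusing the scaler fitted on the training data is a design choice I am assuming here, so that both sets are scaled with the same statistics.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler


def build_features(df):
    """Hypothetical helper: apply the same feature steps to any loan frame."""
    out = df.copy()
    out["effective_date"] = pd.to_datetime(out["effective_date"])
    out["Weekend"] = (out["effective_date"].dt.dayofweek > 3).astype(int)
    out["Gender"] = out["Gender"].map({"male": 0, "female": 1})
    return out[["Weekend", "Gender"]]


# Toy train and test frames with assumed column names.
train_df = pd.DataFrame({
    "effective_date": ["2016-09-08", "2016-09-10", "2016-09-11"],
    "Gender": ["male", "female", "male"],
})
test_df = pd.DataFrame({
    "effective_date": ["2016-09-09", "2016-09-12"],
    "Gender": ["female", "male"],
})

X_train = build_features(train_df)
X_test = build_features(test_df)

# Fit the scaler on the training features only, then reuse it on the test set.
scaler = StandardScaler().fit(X_train)
X_test_scaled = scaler.transform(X_test)
print(X_test_scaled)
```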
Now, let us report evaluation metrics for each of the used algorithms.
Based on the above evaluation report we can say that SVM performed best for this classification problem.
This problem statement is part of a Coursera certification.
Link: https://www.coursera.org/learn/machine-learning-with-python/home/welcome