Loan Prediction Using Selected Machine Learning Algorithms
In finance, a loan is the lending of money by one or more individuals, organizations, or other entities to other individuals or organizations. The recipient (the borrower) incurs a debt and is usually liable to pay interest on that debt until it is repaid, as well as to repay the principal amount borrowed. To read more, check out Wikipedia. Manually ascertaining whether a borrower will pay back a loan can be tedious, hence the need to automate the procedure.
In this blog post, I'll walk through loan prediction using a few selected machine learning algorithms.
Source of Dataset: The dataset for this project was retrieved from Kaggle, the home of data science.
The problem at hand: The main aim of this project is to predict which customers will have their loans paid and which will not. This is therefore a supervised classification problem, which can be trained with algorithms like:
- Logistic Regression
- Decision Tree
- Random Forest
Note: The choice of machine learning classifier is not limited to the three above. Other models such as XGBoost and CatBoost could also be used to train the model. These three were chosen to keep the model self-explanatory and because the dataset is small.
Disclaimer:
- In this project, default hyperparameter values are employed.
- More visualization can be done beyond what’s executed in this post.
- Only the provided training dataset is used, because we are not making a submission to Kaggle for scoring. Hence, we split the training data to hold out a validation set on which to estimate our evaluation metrics.
Now! Let’s get to work on the dataset (Data cleaning)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
Missing Values: Check where there are missing values and fix them appropriately
total = df_train.isnull().sum().sort_values(ascending=False)
percent = (df_train.isnull().sum()/df_train.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(20)
Fill missing values
# Categorical columns: fill with the most frequent value (the mode)
df_train['Gender'] = df_train['Gender'].fillna(
    df_train['Gender'].dropna().mode().values[0])
df_train['Married'] = df_train['Married'].fillna(
    df_train['Married'].dropna().mode().values[0])
df_train['Dependents'] = df_train['Dependents'].fillna(
    df_train['Dependents'].dropna().mode().values[0])
df_train['Self_Employed'] = df_train['Self_Employed'].fillna(
    df_train['Self_Employed'].dropna().mode().values[0])
# LoanAmount is continuous: fill with the median
df_train['LoanAmount'] = df_train['LoanAmount'].fillna(
    df_train['LoanAmount'].dropna().median())
df_train['Loan_Amount_Term'] = df_train['Loan_Amount_Term'].fillna(
    df_train['Loan_Amount_Term'].dropna().mode().values[0])
df_train['Credit_History'] = df_train['Credit_History'].fillna(
    df_train['Credit_History'].dropna().mode().values[0])
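The same fill-in rules can also be written as a small loop, which avoids repeating one statement per column. The following is only a sketch: the helper name fill_missing and the toy values are made up for illustration, and the real df_train has more columns.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_train; the values are made up for illustration.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", np.nan, "Male"],
    "Married": ["Yes", np.nan, "No", "Yes"],
    "LoanAmount": [100.0, np.nan, 150.0, 200.0],
})

def fill_missing(df, mode_cols, median_cols):
    """Return a copy with mode-filled and median-filled columns (hypothetical helper)."""
    out = df.copy()
    for col in mode_cols:
        # mode() ignores missing values by default; take the most frequent value
        out[col] = out[col].fillna(out[col].mode().iloc[0])
    for col in median_cols:
        out[col] = out[col].fillna(out[col].median())
    return out

filled = fill_missing(toy, mode_cols=["Gender", "Married"], median_cols=["LoanAmount"])
print(filled)
```

Returning a copy keeps the original frame untouched, which makes it easy to compare the data before and after filling.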
Exploratory Data Analysis: We want to show the power of visualizations
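As one possible starting point for this kind of exploration, a cross-tabulation shows how a single feature relates to the target. This is a minimal sketch on made-up toy values, not the plots from the original analysis; only the column names Credit_History and Loan_Status come from the dataset.

```python
import pandas as pd

# Toy values for illustration only; they are not taken from the real dataset.
toy = pd.DataFrame({
    "Credit_History": [1.0, 1.0, 1.0, 0.0, 0.0, 1.0],
    "Loan_Status": ["Y", "Y", "N", "N", "N", "Y"],
})

# Share of each Loan_Status value within each Credit_History group (rows sum to 1)
rates = pd.crosstab(toy["Credit_History"], toy["Loan_Status"], normalize="index")
print(rates)

# In a notebook, rates.plot(kind="bar", stacked=True) would draw the same table as a chart.
```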
Encoding to numeric data; getting ready for training
code_numeric = {'Male': 1, 'Female': 2,
                'Yes': 1, 'No': 2,
                'Graduate': 1, 'Not Graduate': 2,
                'Urban': 3, 'Semiurban': 2, 'Rural': 1,
                'Y': 1, 'N': 0,
                '3+': 3}
# Replace every value found in code_numeric; leave all other values unchanged.
# Note: on pandas >= 2.1, DataFrame.map is the preferred name for applymap.
df_train = df_train.applymap(lambda s: code_numeric.get(s, s))
df_test = df_test.applymap(lambda s: code_numeric.get(s, s))
# Drop the unique loan ID; it carries no predictive information
df_train.drop('Loan_ID', axis=1, inplace=True)
# Convert Dependents to numeric ('3+' was mapped to 3 above) and re-attach it
Dependents_ = pd.to_numeric(df_train.Dependents)
Dependents__ = pd.to_numeric(df_test.Dependents)
df_train.drop(['Dependents'], axis=1, inplace=True)
df_test.drop(['Dependents'], axis=1, inplace=True)
df_train = pd.concat([df_train, Dependents_], axis=1)
df_test = pd.concat([df_test, Dependents__], axis=1)
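The single dictionary above is applied to every cell of the frame, so a code such as 'Yes' is shared by all columns that contain it. An alternative, sketched below on a toy frame, is to keep one mapping per column. The name column_codes and the toy rows are assumptions made for illustration; the codes themselves mirror the ones used in this post.

```python
import pandas as pd

# Toy rows for illustration; the real df_train has more columns.
toy = pd.DataFrame({
    "Gender": ["Male", "Female"],
    "Dependents": ["0", "3+"],
    "Loan_Status": ["Y", "N"],
})

# Hypothetical per-column mappings, using the same codes as above
column_codes = {
    "Gender": {"Male": 1, "Female": 2},
    "Loan_Status": {"Y": 1, "N": 0},
}

encoded = toy.copy()
for col, codes in column_codes.items():
    # Series.map turns any value missing from the mapping into NaN,
    # which makes unexpected categories easy to spot
    encoded[col] = encoded[col].map(codes)

# '3+' becomes '3', then the string column is converted to numbers
encoded["Dependents"] = pd.to_numeric(encoded["Dependents"].replace("3+", "3"))
print(encoded)
```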
Separating Target from the feature for training
y = df_train['Loan_Status']
X = df_train.drop('Loan_Status', axis=1)
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
Using Logistic Regression
model = LogisticRegression()
model.fit(X_train, y_train)
ypred = model.predict(X_test)
evaluation = f1_score(y_test, ypred)
evaluation
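The other two classifiers can be trained and scored with the same fit, predict, and f1_score pattern. The sketch below runs all three in one loop on synthetic data from make_classification, so its scores say nothing about the loan dataset; the names X_demo, y_demo, and scores are assumptions for illustration, and random_state is set only to make the run repeatable. With the real data, X_train, X_test, y_train, and y_test from the split above would take the place of the synthetic arrays.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X and y (not the loan data)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.20, random_state=0)

# Default hyperparameters, apart from random_state for reproducibility
models = {
    "Logistic Regression": LogisticRegression(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
}

scores = {}
for name, clf in models.items():
    clf.fit(Xtr, ytr)
    scores[name] = f1_score(yte, clf.predict(Xte))

for name, score in scores.items():
    print(f"{name}: F1 = {score:.3f}")
```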
Conclusion
From the exploratory data analysis, we could generate insight from the data, in particular how each of the features relates to the target. The evaluation of the three models also shows that Logistic Regression performed best, followed by Random Forest and then Decision Tree.