Loan Prediction Using Selected Machine Learning Algorithms

Ernest Owojori · Published in devcareers · Nov 30, 2019

In finance, a loan is the lending of money by one or more individuals, organizations, or other entities to other individuals, organizations, etc. The recipient (i.e. the borrower) incurs a debt and is usually liable to pay interest on that debt until it is repaid, as well as to repay the principal amount borrowed. To read more, check out Wikipedia. The whole process of ascertaining whether a borrower will pay back a loan can be tedious, hence the need to automate the procedure.

In this blog post, I'll walk us through loan prediction using some selected machine learning algorithms.

Source of the dataset: The dataset for this project was retrieved from Kaggle, the home of data science.

The problem at hand: The main aim of this project is to predict which customers will have their loans paid back and which will not. This is therefore a supervised classification problem, to be trained with algorithms like:

  1. Logistic Regression
  2. Decision Tree
  3. Random Forest

Note: The machine learning classifiers that can be used are not limited to the ones above. Other models like XGBoost, CatBoost and the like can also be applied in training. These three algorithms were chosen to keep the model self-explanatory, and because the dataset is small.

Disclaimer:

  1. In this project, default hyperparameter values are employed.
  2. More visualization can be done beyond what is executed in this post.
  3. The provided training dataset is the focus because we are not making a submission to Kaggle for scoring. Hence, we split the training data into training and validation sets to estimate our evaluations.
[Table: the variable names and their corresponding descriptions]

Now! Let’s get to work on the dataset (Data cleaning)

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Read the data and check the shape. Oh! It has 614 rows and 13 columns. That's 12 features plus the target.
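The loading step isn't shown in the original snippet; here is a minimal sketch, assuming the Kaggle CSV files are saved locally as train.csv and test.csv (the file names are an assumption):

# read the train and test CSVs (file names assumed)
df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')
print(df_train.shape)  # expect (614, 13)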

Missing Values: Check where there are missing values and fix them appropriately

total = df_train.isnull().sum().sort_values(ascending=False)
percent = (df_train.isnull().sum()/df_train.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(20)

Variables: Credit_History, Self_Employed, LoanAmount, Dependents, Loan_Amount_Term, Gender and Married have missing values.

Fill missing values

# fill categorical features with the mode; fill LoanAmount with the median
df_train['Gender'] = df_train['Gender'].fillna(
    df_train['Gender'].dropna().mode().values[0])
df_train['Married'] = df_train['Married'].fillna(
    df_train['Married'].dropna().mode().values[0])
df_train['Dependents'] = df_train['Dependents'].fillna(
    df_train['Dependents'].dropna().mode().values[0])
df_train['Self_Employed'] = df_train['Self_Employed'].fillna(
    df_train['Self_Employed'].dropna().mode().values[0])
df_train['LoanAmount'] = df_train['LoanAmount'].fillna(
    df_train['LoanAmount'].dropna().median())
df_train['Loan_Amount_Term'] = df_train['Loan_Amount_Term'].fillna(
    df_train['Loan_Amount_Term'].dropna().mode().values[0])
df_train['Credit_History'] = df_train['Credit_History'].fillna(
    df_train['Credit_History'].dropna().mode().values[0])

Yes! Missing values skillfully treated
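As a quick sanity check (a small addition, not in the original post), we can confirm that no missing values remain:

# confirm that every missing value has been filled
print(df_train.isnull().sum().sum())  # expect 0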

Exploratory Data Analysis: We want to show the power of visualizations
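The original plots are embedded images and are not reproduced here; the observations listed next summarize them. As a minimal sketch, here is a count plot of the kind that drives the first observation, using the seaborn import above:

# loan status split by gender (run before the encoding step below)
sns.countplot(x='Gender', hue='Loan_Status', data=df_train)
plt.show()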

More males are on loan than females. Also, those that are on loan outnumber those that are not.
Married people collect more loans than unmarried people.
Fewer of those that take loans are self-employed. That is, those who are not self-employed (probably salary earners) obtain more loans.
According to the credit history, a greater number of people pay back their loans.
Semiurban applicants obtain more loans, followed by Urban and then Rural. This is logical!
An extremely high number of them go for the 360-month loan term, i.e. a 30-year repayment period.
Males generally have the highest income. Explicitly, married males have greater income than unmarried males. Sensible!
A male graduate has more income.
A graduate who is married has more income.
A graduate who is not self-employed has more income.
An unmarried applicant with no dependents has more income. Also, a married applicant with no dependents has greater income, with a decreasing effect as the number of dependents increases.
A self-employed applicant with no dependents has more income.
A male applicant with no dependents has tremendously more income.
A graduate with no dependents has more income.
Applicants with no dependents have more income across urban, rural and semiurban property areas.
A married applicant with a good credit history shows more income. Also, an unmarried applicant with a good credit history follows in the hierarchy.
An educated applicant with a good credit history shows a good income. Also, a non-graduate with a good credit history tends toward a better income than a fellow non-graduate without one.

Encoding to numeric data; getting ready for training

code_numeric = {'Male': 1, 'Female': 2,
                'Yes': 1, 'No': 2,
                'Graduate': 1, 'Not Graduate': 2,
                'Urban': 3, 'Semiurban': 2, 'Rural': 1,
                'Y': 1, 'N': 0,
                '3+': 3}

df_train = df_train.applymap(lambda s: code_numeric.get(s) if s in code_numeric else s)
df_test = df_test.applymap(lambda s: code_numeric.get(s) if s in code_numeric else s)

# drop the unique Loan_ID column
df_train.drop('Loan_ID', axis=1, inplace=True)

Oops! We need to convert the 'Dependents' feature to numeric using pd.to_numeric ('3+' was mapped to 3 above, but the remaining values are still digit strings).

Dependents_ = pd.to_numeric(df_train.Dependents)
Dependents__ = pd.to_numeric(df_test.Dependents)

df_train.drop(['Dependents'], axis=1, inplace=True)
df_test.drop(['Dependents'], axis=1, inplace=True)

df_train = pd.concat([df_train, Dependents_], axis=1)
df_test = pd.concat([df_test, Dependents__], axis=1)

Yes! Converted successfully.

Heatmap: showing the correlations of the features with the target. No correlations are extremely high, and the correlation between LoanAmount and ApplicantIncome can be explained (higher incomes support larger loans).
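The heatmap itself is an image in the original post; a minimal sketch of how such a plot is typically produced with seaborn:

# correlation heatmap of the (now fully numeric) training data
plt.figure(figsize=(10, 8))
sns.heatmap(df_train.corr(), annot=True, cmap='coolwarm')
plt.show()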

Separating the target from the features for training

y = df_train['Loan_Status']
X = df_train.drop('Loan_Status', axis=1)

from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)

Using Logistic Regression

model = LogisticRegression()
model.fit(X_train, y_train)
ypred = model.predict(X_test)

# evaluate with the F1 score on the held-out validation split
evaluation = f1_score(y_test, ypred)
evaluation
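The Decision Tree and Random Forest runs appear as images in the original post; here is a minimal sketch of the same evaluation using the disclaimer's default hyperparameters (variable names are my own):

# Decision Tree with default hyperparameters
tree_model = DecisionTreeClassifier()
tree_model.fit(X_train, y_train)
print('Decision Tree F1:', f1_score(y_test, tree_model.predict(X_test)))

# Random Forest with default hyperparameters
rf_model = RandomForestClassifier()
rf_model.fit(X_train, y_train)
print('Random Forest F1:', f1_score(y_test, rf_model.predict(X_test)))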

Conclusion

From the Exploratory Data Analysis, we could generate insights from the data and see how each of the features relates to the target. Also, it can be seen from the evaluation of the three models that Logistic Regression performed better than the others, and Random Forest did better than Decision Tree.

Ernest Owojori
Product Manager | Data Analyst | Statistician | Community Manager