Prediction of Loan Approval with Machine Learning

Loan Approval Data set from GitHub: classification with various models.

What is Machine Learning?

Machine learning (ML) is a category of algorithms that allows software applications to become more accurate at predicting outcomes without being explicitly programmed. The basic premise of machine learning is to build algorithms that receive input data and use statistical analysis to predict an output, updating those outputs as new data becomes available.

Types of Machine Learning

  1. Supervised Learning
  2. Unsupervised Learning
  3. Reinforcement Learning

Overview of Supervised Learning Algorithm

The goal is to approximate the mapping function so well that, when you have new input data (x), you can predict the output variable (Y) for that data.

As shown in the above example, we first take some mails and mark them as ‘Spam’ or ‘Not Spam’. This labeled data is then used to train the supervised model.

Once the model is trained, we can test it on some new mails and check whether it predicts the right output.
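
To make the train-then-test idea concrete, here is a minimal sketch using scikit-learn. The mails, the labels, and the choice of a word-count plus naive Bayes classifier are all assumptions made for illustration; they are not taken from the example above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy labeled mails (made up for illustration).
train_mails = [
    "win a free prize now",
    "claim your free money today",
    "meeting moved to monday morning",
    "please review the attached report",
]
train_labels = ["Spam", "Spam", "Not Spam", "Not Spam"]

# Turn each mail into word counts, then fit a classifier on the labeled data.
vectorizer = CountVectorizer()
X_train_counts = vectorizer.fit_transform(train_mails)
classifier = MultinomialNB()
classifier.fit(X_train_counts, train_labels)

# Predict labels for mails the model has not seen during training.
new_mails = ["free prize money", "monday report meeting"]
predictions = classifier.predict(vectorizer.transform(new_mails))
for mail, label in zip(new_mails, predictions):
    print(f"{mail!r} -> {label}")
```

The same pattern (fit on labeled data, predict on unseen data) is what the loan models later in this post follow.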

Types of Supervised learning

  • Regression: A regression problem is when the output variable is a real value, such as “dollars” or “weight”.

Overview of Unsupervised Learning Algorithm

Example of Unsupervised Learning

In the above example, we have given our model some characters, which are ‘Ducks’ and ‘Not Ducks’. In the training data, we don’t provide any labels. The unsupervised model is able to separate the two kinds of characters by looking at the data itself, modeling the underlying structure or distribution in the data in order to learn more about it.

Types of Unsupervised learning

  • Association: An association rule learning problem is where you want to discover rules that describe large portions of your data, such as people that buy X also tend to buy Y.
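
A rule such as "people that buy X also tend to buy Y" is usually judged by its support and confidence. The sketch below computes both in plain Python; the baskets and the helper names `support` and `confidence` are made up for illustration.

```python
# Toy shopping baskets (made up for illustration).
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    hits = sum(1 for basket in baskets if itemset <= basket)
    return hits / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Estimated share of baskets with `antecedent` that also hold `consequent`."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)


# Rule under test: people who buy bread also tend to buy butter.
rule_support = support({"bread", "butter"}, baskets)
rule_confidence = confidence({"bread"}, {"butter"}, baskets)
print(f"support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

Here three of the five baskets contain both items, and three of the four bread baskets also contain butter, so the rule holds for most bread buyers in this toy data.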

Overview of Reinforcement Learning

Example of Reinforcement Learning

In the above example, the agent is given two options: a path with water or a path with fire. A reinforcement algorithm works on a reward system: if the agent takes the fire path, points are subtracted from its reward and the agent learns that it should avoid the fire path. If it had chosen the water path, the safe one, points would have been added to its reward. Over time the agent learns which path is safe and which isn’t.

In short, the agent leverages the rewards it obtains to improve its knowledge of the environment and select its next action.
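
The reward loop can be sketched in a few lines of plain Python. This is a toy illustration only: the reward values, the learning rate, and the exploration rate are assumptions, and a real reinforcement learning setup would also model states and transitions.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical rewards for the two paths in the example.
REWARDS = {"fire": -1.0, "water": 1.0}
ACTIONS = list(REWARDS)

q_values = {action: 0.0 for action in ACTIONS}  # the agent's reward estimates
learning_rate = 0.5
exploration_rate = 0.1

for _ in range(50):
    # Mostly pick the path that currently looks best; sometimes explore at random.
    if random.random() < exploration_rate:
        action = random.choice(ACTIONS)
    else:
        action = max(ACTIONS, key=q_values.get)
    reward = REWARDS[action]
    # Move the estimate for the chosen path toward the reward just received.
    q_values[action] += learning_rate * (reward - q_values[action])

best_action = max(ACTIONS, key=q_values.get)
print(q_values, "-> preferred path:", best_action)
```

Because the fire path only ever lowers its estimate and the water path only ever raises it, the agent ends up preferring the water path.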

Introduction of our Problem

In this blog post, I’ll walk through loan prediction using a few selected Machine Learning algorithms.

Source of Dataset: The dataset for this project is retrieved from GitHub, the home of Data Science.

The problem at hand: The major aim of this project is to predict, for each customer, whether the loan will be approved or not (the Loan_Status target). Therefore, this is a supervised classification problem, to be trained with algorithms like:

  1. Logistic Regression
  2. Decision Tree
  3. Random Forest

Note: The machine learning classifiers that can be used are not limited to the aforementioned. Other models like XGBoost, CatBoost and the like can also be applied. These three algorithms were chosen because we want the models to stay easy to interpret and because the data set is small.

Disclaimer:

  1. In this project, default hyper-parameter values are employed.
  2. More visualization can be done beyond what’s executed in this post.
  3. The training data set provided is the focus because we are not making a submission to Kaggle for scoring. Hence, we split the training set into training and validation parts to estimate our evaluation scores.

This table shows the variable names and their corresponding descriptions

Now! Let’s get to work on the dataset (Data cleaning)

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Read the data and check the shape. Oh! It has 614 rows and 13 columns. That’s 12 features plus the target.
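
The loading step is not shown in this post, so here is a minimal sketch of what it could look like. The inline CSV is a made-up stand-in with only a few of the columns, and "train.csv" in the comment is a hypothetical file name.

```python
import io

import pandas as pd

# Stand-in for the real file: three made-up rows with a few of the columns.
# With the actual data this would be something like pd.read_csv("train.csv"),
# where "train.csv" is a hypothetical file name.
csv_text = """Loan_ID,Gender,Married,LoanAmount,Loan_Status
LP001,Male,No,,Y
LP002,Male,Yes,128,N
LP003,Female,No,66,Y
"""
df_train = pd.read_csv(io.StringIO(csv_text))

# shape is a (rows, columns) tuple
n_rows, n_columns = df_train.shape
print(f"{n_rows} rows, {n_columns} columns")
print(df_train.dtypes)
```

On the full data set, the same `shape` check is what reports 614 rows and 13 columns.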

Missing Values: Check where there are missing values and fix them appropriately

# Count the missing values per column, largest first
total = df_train.isnull().sum().sort_values(ascending=False)
# Share of missing values per column (missing count / number of rows)
percent = (df_train.isnull().sum() / df_train.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(20)

Variables with missing values: Credit_History, Self_Employed, LoanAmount, Dependents, Loan_Amount_Term, Gender and Married

Fill missing values

# Categorical and discrete columns: fill with the most frequent value (mode)
df_train['Gender'] = df_train['Gender'].fillna(
    df_train['Gender'].dropna().mode().values[0])
df_train['Married'] = df_train['Married'].fillna(
    df_train['Married'].dropna().mode().values[0])
df_train['Dependents'] = df_train['Dependents'].fillna(
    df_train['Dependents'].dropna().mode().values[0])
df_train['Self_Employed'] = df_train['Self_Employed'].fillna(
    df_train['Self_Employed'].dropna().mode().values[0])
# LoanAmount is continuous: fill with the median
df_train['LoanAmount'] = df_train['LoanAmount'].fillna(
    df_train['LoanAmount'].dropna().median())
df_train['Loan_Amount_Term'] = df_train['Loan_Amount_Term'].fillna(
    df_train['Loan_Amount_Term'].dropna().mode().values[0])
df_train['Credit_History'] = df_train['Credit_History'].fillna(
    df_train['Credit_History'].dropna().mode().values[0])
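
The same fill-in logic can be written more compactly as a loop. This is only a sketch of an alternative: the helper `fill_missing` and the toy frame are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd


def fill_missing(df, mode_columns, median_columns):
    """Return a copy of `df` with NaNs filled by the column mode or median."""
    filled = df.copy()
    for column in mode_columns:
        # mode() ignores NaN and returns the most frequent value(s); take the first
        filled[column] = filled[column].fillna(filled[column].mode()[0])
    for column in median_columns:
        filled[column] = filled[column].fillna(filled[column].median())
    return filled


# Toy frame with one gap per column (made up for illustration).
toy = pd.DataFrame({
    "Gender": ["Male", "Male", np.nan, "Female"],
    "LoanAmount": [100.0, np.nan, 120.0, 200.0],
})
toy_filled = fill_missing(toy, mode_columns=["Gender"], median_columns=["LoanAmount"])
print(toy_filled)
```

Returning a copy keeps the raw frame around, which makes it easy to compare the data before and after imputation.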

Yes! Missing values skillfully treated

Exploratory Data Analysis: We want to show the power of visualizations

More males than females apply for loans. Also, applicants who got a loan outnumber those who did not.

Married people take more loans than unmarried people.

Self-employed people make up the smaller share of loan takers. That is, those who are not self-employed, probably salary earners, obtain more loans.

According to the credit history, a greater number of people pay back their loans.

Semi-urban applicants obtain the most loans, followed by urban and then rural applicants. This is logical!

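An extremely high number of them go for a loan term of 360. Note that Loan_Amount_Term is usually documented in months for this dataset, which would make this a 30-year term rather than a payback within a year.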
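
The plots behind the observations above are not reproduced here, but the underlying counts can be recovered with a cross-tabulation. The sketch below uses a made-up toy frame; on the real data the same call would run on `df_train`, and a bar chart of the same counts could plausibly be drawn with `sns.countplot(x='Gender', hue='Loan_Status', data=df_train)`.

```python
import pandas as pd

# Toy data (made up for illustration).
toy = pd.DataFrame({
    "Gender": ["Male", "Male", "Male", "Female", "Female", "Male"],
    "Loan_Status": ["Y", "N", "Y", "Y", "N", "Y"],
})

# Rows: Gender, columns: Loan_Status, cells: number of applicants
counts = pd.crosstab(toy["Gender"], toy["Loan_Status"])
print(counts)
```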

Males generally have the highest income. More specifically, married males have a greater income than unmarried males. Sensible!

A male graduate has more income.

A married graduate has more income.

A graduate who is not self-employed has more income.

Unmarried applicants with no dependents have more income. Married applicants with no dependents also have greater income, with the effect decreasing as the number of dependents increases.

A self-employed applicant with no dependents has more income.

A male with no dependents has tremendously more income.

A graduate with no dependents has more income.

Applicants with no dependents have more income, whether their property is urban, rural or semi-urban.

Being married with a good credit history goes with more income. Unmarried applicants with a good credit history follow next in the hierarchy.

Being a graduate with a good credit history goes with a good income. Also, a non-graduate with a good credit history tends to have a better income than a non-graduate without one.
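
Comparisons like the ones above boil down to an average income per group. A minimal sketch with `groupby`, again on a made-up toy frame rather than the real data:

```python
import pandas as pd

# Toy data (made up for illustration).
toy = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Female", "Male"],
    "Married": ["Yes", "No", "Yes", "No", "Yes"],
    "ApplicantIncome": [6000, 4000, 3500, 3000, 5000],
})

# Average income for every Gender / Married combination
mean_income = toy.groupby(["Gender", "Married"])["ApplicantIncome"].mean()
print(mean_income)
```

Swapping the grouping columns for, say, Education and Credit_History would give the other comparisons in the same way.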

Encoding to numeric data; getting ready for training

# Map every categorical value to a number
code_numeric = {'Male': 1, 'Female': 2,
                'Yes': 1, 'No': 2,
                'Graduate': 1, 'Not Graduate': 2,
                'Urban': 3, 'Semiurban': 2, 'Rural': 1,
                'Y': 1, 'N': 0,
                '3+': 3}

# Replace values found in the mapping; leave everything else unchanged.
# Note: on pandas 2.1+ applymap is deprecated in favour of DataFrame.map.
df_train = df_train.applymap(lambda s: code_numeric.get(s, s))
df_test = df_test.applymap(lambda s: code_numeric.get(s, s))

# Drop the unique loan ID
df_train.drop('Loan_ID', axis=1, inplace=True)

Oops! need to convert ‘Dependents’ feature to numeric using pd.to_numeric

# '0', '1' and '2' are still strings at this point; convert them to numbers
Dependents_ = pd.to_numeric(df_train.Dependents)
Dependents__ = pd.to_numeric(df_test.Dependents)

df_train.drop(['Dependents'], axis=1, inplace=True)
df_test.drop(['Dependents'], axis=1, inplace=True)

# Re-attach the numeric column (it now sits at the end of each frame)
df_train = pd.concat([df_train, Dependents_], axis=1)
df_test = pd.concat([df_test, Dependents__], axis=1)

Yes! converted successfully

Heatmap: Shows the correlations of the features with the target. None of the correlations is extremely high. The correlation between LoanAmount and ApplicantIncome can be explained: applicants with higher incomes plausibly request larger loans.
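
The numbers behind such a heatmap come from a correlation matrix. Here is a minimal sketch on a made-up toy frame; on the real data the matrix would be `df_train.corr()`, which could then be drawn with something like `sns.heatmap(correlations, annot=True)`.

```python
import pandas as pd

# Toy numeric data (made up for illustration).
toy = pd.DataFrame({
    "ApplicantIncome": [2500, 4000, 6000, 8000, 12000],
    "LoanAmount": [60, 110, 150, 210, 300],
    "Loan_Status": [0, 1, 1, 0, 1],
})

# Pairwise Pearson correlations between all numeric columns
correlations = toy.corr()
print(correlations["Loan_Status"].sort_values(ascending=False))
```

Reading one column of the matrix, as in the last line, is a quick way to see how each feature relates to the target.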

Separating Target from the feature for training

y = df_train['Loan_Status']
X = df_train.drop('Loan_Status', axis=1)

from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)

Using Logistic Regression

model = LogisticRegression()

model.fit(X_train, y_train)

ypred = model.predict(X_test)

evaluation = f1_score(y_test, ypred)
evaluation
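
Only the Logistic Regression run is shown above. The sketch below shows how all three models could be trained and scored in one loop. It runs on synthetic data from `make_classification`, not on the loan data, so the scores it prints say nothing about the real ranking; `max_iter=1000` and the fixed `random_state` values are my assumptions, added only to avoid convergence warnings and to keep the run repeatable.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the loan data (614 rows, 11 features); not the real dataset.
X, y = make_classification(n_samples=614, n_features=11, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
}

# Fit each model on the training part and score it on the held-out part.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = f1_score(y_test, model.predict(X_test))

for name, score in scores.items():
    print(f"{name}: F1 = {score:.3f}")
```

Keeping the models in a dictionary makes it easy to add XGBoost or CatBoost later without touching the evaluation loop.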

Conclusion

From the Exploratory Data Analysis, we could generate insight from the data: how each of the features relates to the target. Also, the evaluation of the three models shows that Logistic Regression performed better than the others, and Random Forest did better than Decision Tree.

Summary