A Study on Loan Prediction

Sudipta Banerjee
7 min read · Aug 13, 2021


INTRODUCTION

A company wants to automate its loan eligibility process in real time, based on the details customers provide while filling in the online application form: gender, marital status, education, number of dependents, income, loan amount, credit history and others. To automate this process, the task is to identify the customer segments that are eligible for a loan, so that the company can target those customers specifically.

This is a classification problem: we have to predict whether a person is eligible for a loan based on the information provided about that person.

Data Source

Loan Predication (Kaggle)

METHODOLOGY

We build two classifiers for this data set: a decision tree and a random forest. We then choose a suitable metric to compare the performance of these models. An appropriate metric, in this case, is precision. Suppose a customer is classified as 'Yes', i.e. eligible for a loan, whereas in reality the person is not (a false positive). In another scenario, a person who is eligible for a loan is classified as not eligible (a false negative). The former scenario is the more dangerous one from a bank's perspective, and precision directly penalises false positives.

Later (in Data Visualization), we'll see that this dataset is imbalanced. For the decision tree, we pass 'balanced' to the class_weight argument to account for the skewness in the data. To avoid overfitting and obtain the best results, we tune the hyperparameters with RandomizedSearchCV. The best parameters are: max_depth: 8, max_leaf_nodes: 49 and min_samples_leaf: 12.
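
For reference, scikit-learn derives the 'balanced' weights as n_samples / (n_classes * np.bincount(y)). A minimal sketch of what that yields, using made-up labels with roughly the 69:31 split we will see later (not the actual data):

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([1] * 69 + [0] * 31)  # hypothetical labels mimicking the 69:31 imbalance
classes = np.unique(y)
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y)
print(dict(zip(classes, weights)))  # the minority class (0) receives the larger weight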

For the random forest, we use a Balanced Random Forest, which under-samples the majority class within each bootstrap sample to offset the imbalance. The best parameters obtained are criterion: entropy, max_depth: 7, max_leaf_nodes: 31, min_samples_leaf: 9 and n_estimators: 132.

The precision of a model is given by

Precision = TP / (TP + FP)

and the recall is given by

Recall = TP / (TP + FN)

where TP, FP and FN denote true positives, false positives and false negatives respectively.
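
To make the formulas concrete, here is a toy example with made-up labels (not from this dataset), computing both metrics by hand from the confusion matrix and then via scikit-learn:

from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1, 0, 1]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp / (tp + fp), precision_score(y_true, y_pred))  # 0.6 0.6
print(tp / (tp + fn), recall_score(y_true, y_pred))     # 0.75 0.75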

Necessary Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import preprocessing
from scipy.stats import chi2_contingency
from scipy.stats import chi2
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn import tree  # used later for the decision tree model
from sklearn.metrics import (confusion_matrix, accuracy_score, recall_score,
                             precision_score, f1_score, roc_auc_score)
from scipy.stats import randint
from imblearn.ensemble import BalancedRandomForestClassifier

We read the 'Loan Prediction' data in as a CSV file into an object named df. Since the column 'Loan_ID' plays no role in our analysis, we drop it and call the resulting dataframe 'data'.

df = pd.read_csv("credit_default.csv")
data = df.drop('Loan_ID', axis=1)

Now, let us look at the data type of the variables.

data.info()

Next, we look for whether there are missing values in the data.

feature_nan = [features for features in data.columns
               if data[features].isnull().sum() > 1 and data[features].dtypes == 'O']
for features in feature_nan:
    print("{} has {}% missing values".format(features, np.round(data[features].isnull().mean() * 100, 4)))

feature_nan_numerical = [features for features in data.columns
                         if data[features].isnull().sum() > 1 and data[features].dtypes != 'O']
for features in feature_nan_numerical:
    print("{} has {}% missing values".format(features, np.round(data[features].isnull().mean() * 100, 4)))

Missing value treatment

We have missing values in both categorical and numerical variables. For the categorical variables, we fill the missing values with their respective modes (the most frequent value); we treat 'Credit_History' and 'Loan_Amount_Term' the same way, since although numeric they take only a few discrete values. For 'LoanAmount' we fill the missing values with the median, since the column contains outliers.

feature_nan_numerical.pop(0)  # remove 'LoanAmount'; it is handled separately below
feature_nan_numerical
for feature in feature_nan:
    data[feature] = data[feature].fillna(data[feature].mode()[0])
for feature in feature_nan_numerical:
    data[feature] = data[feature].fillna(data[feature].mode()[0])
data['LoanAmount'] = data['LoanAmount'].fillna(data['LoanAmount'].median())
data.isnull().sum()

Data Visualization

We first check the proportion of 'Y' and 'N' in the column Loan_Status, which is our target variable.

x = data['Loan_Status'].value_counts() / len(data['Loan_Status'])
lst1 = [x['Y'], x['N']]  # index by label rather than by position
lst2 = ['Y', 'N']
plt.bar(lst2, lst1)

The ratio of 'Y' to 'N' in the column 'Loan_Status' is roughly 0.69 to 0.31. Since this column is our target variable, the dataset is imbalanced.

Next, we look at the column Loan_Amount_Term, which is the term of the loan in months.

# use value_counts() for both labels and heights so the bars align with their labels
counts = data['Loan_Amount_Term'].value_counts()
plt.bar(counts.index.astype(str), counts.values, width=0.8)
plt.xlabel('Loan_Amount_Term')
plt.ylabel('count')
plt.show()

Most borrowers have a loan term of 360 months, i.e. 30 years.

Next, we visualize the categorical variables against Loan_Status.

categorical = [feature for feature in data.columns if data[feature].dtypes == 'O']
categorical

Output: ['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed', 'Property_Area', 'Loan_Status']

data_copy = data.copy()
data_copy['Gender'] = np.where(data_copy['Gender'] == 'Male', 'M', 'F')
data_copy['Married'] = np.where(data_copy['Married'] == 'Yes', 'Y', 'N')
data_copy['Education'] = np.where(data_copy['Education'] == 'Graduate', 'G', 'N.G')
data_copy['Property_Area'].replace(to_replace=['Rural', 'Semiurban', 'Urban'], value=['R', 'S.U', 'U'], inplace=True)
fig, axs = plt.subplots(2, 3, figsize=(15, 10))
i = 0
j = 0
for feature in categorical[:-1]:
    # row-normalised crosstab: proportion of each Loan_Status within each category
    table = pd.crosstab(data_copy[feature], data_copy['Loan_Status']).apply(lambda x: x / x.sum(), axis=1)
    table.plot(kind='bar', stacked=True, ax=axs[i][j])
    j = j + 1
    if j % 3 == 0:
        i = i + 1
        j = 0
plt.show()

Based on the graphs we can say that:

1. More males than females have been approved for a loan, though the difference is not substantial.

2. Married people have been more likely to get a loan.

3. People with 2 dependents have been the most likely to get a loan, followed by people with no dependents, and then people with 1 or 3+ dependents.

4. The proportion of approved loans is higher for graduates than for non-graduates.

5. Loan status is almost the same for self-employed and non-self-employed people.

6. Loan approval has been highest for semiurban applicants, followed by urban and rural applicants respectively.

Test for Proportion

We perform tests to check whether there is a significant difference between the categories of each categorical variable. First we test the columns Gender, Married, Education and Self_Employed. Since Loan_Status is coded 0/1, a two-sample Welch t-test on these outcomes amounts to comparing approval proportions between the two groups.

(The level of significance for all tests is 0.05.)

from scipy import stats

data1 = data.copy()
data1['Loan_Status'] = np.where(data1['Loan_Status'] == 'Y', 1, 0)
_, gender = stats.ttest_ind(a=data1['Loan_Status'][data1['Gender'] == 'Male'],
                            b=data1['Loan_Status'][data1['Gender'] == 'Female'],
                            alternative='greater', equal_var=False)
_, married = stats.ttest_ind(a=data1['Loan_Status'][data1['Married'] == 'Yes'],
                             b=data1['Loan_Status'][data1['Married'] == 'No'],
                             alternative='greater', equal_var=False)
_, education = stats.ttest_ind(a=data1['Loan_Status'][data1['Education'] == 'Graduate'],
                               b=data1['Loan_Status'][data1['Education'] == 'Not Graduate'],
                               alternative='greater', equal_var=False)
_, self_employed = stats.ttest_ind(a=data1['Loan_Status'][data1['Self_Employed'] == 'No'],
                                   b=data1['Loan_Status'][data1['Self_Employed'] == 'Yes'],
                                   alternative='greater', equal_var=False)
lst = [gender, married, education, self_employed]
for i in lst:
    if i < 0.05:
        print('the difference is significant')
    else:
        print('the difference is insignificant')

Output:
the difference is insignificant
the difference is significant
the difference is significant
the difference is insignificant

Based on the tests performed above we conclude:

  1. There is no significant difference in the proportion of loan approvals for males and females.
  2. The proportion of loan approvals is higher for married people.
  3. The proportion of loan approvals is higher for graduates than for non-graduates.
  4. There is no significant difference in loan approval between the self-employed and those who are not.
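
Since Loan_Status is binary, each Welch t-test above effectively compares two approval proportions. As a cross-check, a two-sample proportion z-test should lead to the same conclusions; a sketch for the Married column, assuming statsmodels is available:

from statsmodels.stats.proportion import proportions_ztest

married = data1['Loan_Status'][data1['Married'] == 'Yes']
unmarried = data1['Loan_Status'][data1['Married'] == 'No']
_, pval = proportions_ztest(count=[married.sum(), unmarried.sum()],
                            nobs=[len(married), len(unmarried)],
                            alternative='larger')
print(pval)  # expected to fall below 0.05, agreeing with the t-test above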

Next, we consider the columns Property_Area and Dependents. These have more than two categories, so we use Tukey's HSD test for pairwise comparisons.

import statsmodels.stats.multicomp as mc

comp = mc.MultiComparison(data1['Loan_Status'], data1['Property_Area'])
post_hoc_res = comp.tukeyhsd()
post_hoc_res.summary()

So, there is no significant difference in the proportion of loan approvals between the rural and urban populations.

comp = mc.MultiComparison(data1['Loan_Status'], data1['Dependents'])
post_hoc_res = comp.tukeyhsd()
post_hoc_res.summary()

There is no significant difference in the proportion of loan approvals across the numbers of dependents.

The following graphs show the distribution of LoanAmount split by the categorical variables Loan_Status, Education and Self_Employed.

fig, axs = plt.subplots(1, 3, figsize=(15, 3))
sns.histplot(data=data, x='LoanAmount', hue='Loan_Status', element='step', bins=100, ax=axs[0])
sns.histplot(data=data, x='LoanAmount', hue='Education', element='step', bins=100, ax=axs[1])
sns.histplot(data=data, x='LoanAmount', hue='Self_Employed', element='step', bins=100, ax=axs[2])
plt.show()

a. The distribution of LoanAmount is almost the same for the 'Y' and 'N' categories of Loan_Status.

b. Loan amounts for graduates have been greater than for non-graduates.

c. Loan amounts for self-employed people are higher than for the non-self-employed, with more outliers.

Another interesting variable for our study is credit history, and we can check how it affects loan status. We turn Loan_Status into a binary variable and then compute its mean for each value of Credit_History; this mean is the approval rate within each group.

data1 = data.copy()
data1['Loan_Status'] = np.where(data1['Loan_Status'] == 'Y', 1, 0)
data2 = data1.groupby('Credit_History')['Loan_Status'].mean()
data2

People with a credit history are far more likely to have their loan approved. This suggests that credit history will be an influential variable in our model.

Model Fitting:

Before modelling we need to turn all the categorical variables into numbers.

Label Encoding:

categorical = [feature for feature in data.columns if data[feature].dtypes == 'O']
for feature in categorical:
    label_encoder = preprocessing.LabelEncoder()
    data[feature] = label_encoder.fit_transform(data[feature])
    print('{} has unique values {}'.format(feature, data[feature].unique()))
data.info()
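
Note that label encoding imposes an arbitrary numeric order on nominal categories such as Property_Area. Tree-based models usually tolerate this, but one-hot encoding is a common alternative; a minimal sketch for illustration only, applied to a fresh, pre-encoding copy of the data:

# illustration: one-hot encode the nominal column instead of label encoding it
data_onehot = pd.get_dummies(df.drop('Loan_ID', axis=1), columns=['Property_Area'], drop_first=True)
data_onehot.filter(like='Property_Area').head()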

Train-Test split

Y = data['Loan_Status'].values
X = data.drop('Loan_Status', axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
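
Given the imbalance noted earlier, it can be worth preserving the 'Y'/'N' ratio in both splits. This is a variant of the split above (not the one the results below are based on), using the stratify argument:

# stratified variant: both splits keep the original class proportions
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(X, Y, test_size=0.2, random_state=42, stratify=Y)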

Next, we create a helper function that takes a fitted model and prints its precision, recall, F-score and accuracy on both the training and the test data.

def result_disp(model, X_train, X_test, y_train, y_test, name):
    # scores on the training data
    y_pred = model.predict(X_train)
    print('{}'.format(name))
    print('\nTraining Data Scores:')
    print('Precision: {:0.3f}'.format(metrics.precision_score(y_train, y_pred)))
    print('Recall: {:0.3f}'.format(metrics.recall_score(y_train, y_pred)))
    print('F-Score: {:0.3f}'.format(metrics.f1_score(y_train, y_pred)))
    print('Accuracy: {:0.3f}'.format(metrics.accuracy_score(y_train, y_pred)))

    # scores on the held-out test data
    y_pred = model.predict(X_test)
    print('\nTest Data Scores:')
    print('Precision: {:0.3f}'.format(metrics.precision_score(y_test, y_pred)))
    print('Recall: {:0.3f}'.format(metrics.recall_score(y_test, y_pred)))
    print('F-Score: {:0.3f}'.format(metrics.f1_score(y_test, y_pred)))
    print('Accuracy: {:0.3f}\n\n'.format(metrics.accuracy_score(y_test, y_pred)))

DECISION TREE

model = tree.DecisionTreeClassifier(class_weight='balanced')
search = RandomizedSearchCV(model,
                            {
                                'max_depth': randint(2, 10),
                                'max_leaf_nodes': randint(30, 50),
                                'min_samples_leaf': randint(5, 15)
                            },
                            scoring='roc_auc', random_state=42)
search.fit(X_train, y_train)
print('best parameters:', search.best_params_)
print('best score: %.4f' % search.best_score_)

Output:
best parameters: {'max_depth': 8, 'max_leaf_nodes': 49, 'min_samples_leaf': 12}
best score: 0.7596

result_disp(search, X_train, X_test, y_train, y_test, 'Decision_Tree')

Output:

Decision_Tree

Training Data Scores:
Precision: 0.910
Recall: 0.772
F-Score: 0.835
Accuracy: 0.788

Test Data Scores:
Precision: 0.756
Recall: 0.738
F-Score: 0.747
Accuracy: 0.675

RANDOM FOREST

model = BalancedRandomForestClassifier()
search = RandomizedSearchCV(model,
                            {
                                'n_estimators': randint(100, 150),
                                'criterion': ['gini', 'entropy'],
                                'max_depth': randint(2, 10),
                                'max_leaf_nodes': randint(30, 50),
                                'min_samples_leaf': randint(5, 15)
                            },
                            scoring='roc_auc', random_state=42)
search.fit(X_train, y_train)
print('best parameters:', search.best_params_)
print('best score: %.4f' % search.best_score_)

Output:
best parameters: {'criterion': 'entropy', 'max_depth': 7, 'max_leaf_nodes': 31, 'min_samples_leaf': 9, 'n_estimators': 132}
best score: 0.7754

result_disp(search, X_train, X_test, y_train, y_test, 'Random_Forest')

Output:

Random_Forest

Training Data Scores:
Precision: 0.860
Recall: 0.912
F-Score: 0.885
Accuracy: 0.835

Test Data Scores:
Precision: 0.766
Recall: 0.900
F-Score: 0.828
Accuracy: 0.756
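
Earlier we conjectured that credit history would be an influential variable. With the tuned forest in hand, one way to check this is via impurity-based feature importances. A minimal sketch, assuming search.best_estimator_ is the refitted BalancedRandomForestClassifier from the cell above:

importances = pd.Series(search.best_estimator_.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))  # Credit_History is expected to rank at or near the top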
