Bank Loan Approval Prediction using different Machine Learning Algorithms

Manoj Kumal
20 min read · Jul 8, 2023


Introduction

A bank loan is money that a bank offers to lend to customers for a certain period of time. As a condition of the loan, the borrower has to pay a certain amount of interest per month or per year. Banks provide loans for different purposes, such as business loans, home loans, education loans, and so on. One of the most challenging problems for a bank is determining whether a customer will pay the money back with interest, because most of a bank's income comes from that interest. To address this problem, we use machine learning models that help us predict accurately whether a customer will default or not.

Table of Contents

  1. Exploratory Data Analysis
  2. Visualizing Outliers
  3. Looking at Unique Values in the Data
  4. Handling Missing/Null Values
  5. Handling Imbalanced Data
  6. Feature Selection
  7. Splitting the Data for Training and Testing
  8. Preprocessing the Data for the Model
  9. Evaluation Metrics
  10. Learning Algorithms

Exploratory Data Analysis

Whenever we start a data science project, we need to understand what lies inside our data. Understanding the features/attributes is the very first step of the project. It helps determine how best to manipulate the data sources to get the answers we need, making it easier to discover patterns or check assumptions. The main purpose of EDA is to look at the data before making any assumptions. It can help identify obvious errors, better understand patterns within the data, detect outliers or anomalous events, and find interesting relations among the variables. We will start by looking at the features of our dataset.
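Throughout the post we use pandas, seaborn and matplotlib. Here is a minimal setup sketch; the CSV file name below is an assumption, not the exact path used for this project.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# 'credit_train.csv' is a placeholder name for the loan dataset
df = pd.read_csv('credit_train.csv')
df.shape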

df.columns

We have 18 features, of which 17 are independent variables and 1 is the dependent variable, i.e., Loan Status. The feature names are self-explanatory.

df.dtypes

As we can see from the output above, there are three datatypes in our dataset: float64, int64 and object.

  • int64: represents integer features in the dataset. There are 5 such variables.
  • float64: represents features with a fractional part, i.e., numbers containing one or more decimals. There are 8 float variables.
  • object: represents features with mixed datatypes, including strings. There are 6 object variables.

There are a variety of descriptive statistics. Numbers such as the mean, median, mode, skewness, kurtosis, standard deviation, first quartile and third quartile, to name a few, each tell us something about our data.

  • The minimum — this is the smallest value in our data set.
  • The first quartile — this number is denoted Q1 and 25% of our data falls below the first quartile.
  • The median — this is the midway point of the data. 50% of all data falls below the median.
  • The third quartile — this number is denoted Q3 and 75% of our data falls below the third quartile.
  • The maximum — this is the largest value in our data set.
df.describe()

From the above table we can spot a data entry/measurement error in Credit Score: the maximum value is 7510, which is impossible, since credit scores lie within 250–850 (or 300–900 on some scales). So we have to deal with those wrong values (discussed later). The maximum value of Current Loan Amount is also far higher than its median, which points to potential outliers in the data. We can also represent the data graphically using a histogram and a pair plot.

df.hist(bins=50,figsize=(25,16))

A histogram is one of the most frequently used data visualization techniques in machine learning. It represents the distribution of a continuous variable over a given interval or period of time. Histograms plot the data by dividing it into intervals called ‘bins’.

  • It provides us with a count of the number of observations in each bin created for the visualization.
  • From the shape of the bins, we can easily observe the distribution, i.e. whether it is Gaussian, skewed or exponential.
  • Histograms also help us to see possible outliers.
sns.pairplot(df, hue='Loan Status')  # or a subset of columns, to keep the grid readable

Pair plots are also very useful for visualizing a single variable or the relationship between a pair of variables. They provide important insight into which variables separate our classes. However, pair plots stop being useful when we have a large number of features, because the number of plots grows quickly and it becomes difficult to go through all of them. We can use the seaborn library to draw pair plots.

Visualizing Outliers

An outlier is an object that deviates significantly from the rest of the objects. Outliers represent measurement errors, bad data collection, or simply variables not considered when collecting the data. Most data mining methods discard outliers as noise or exceptions; however, in some applications such as fraud detection the rare events can be more interesting than the regularly occurring ones, and outlier analysis becomes important. Visualizing the data with a boxplot is a quick way to glance at the outliers. A boxplot is based on the five-number summary (minimum, maximum, median, first quartile and third quartile). There are also whiskers on both sides of the box, which extend up to 1.5 times the IQR (the difference between the first and third quartile) above the upper quartile and 1.5 times the IQR below the lower quartile. All data points that don't fall within the whiskers are considered outliers.

sns.boxplot(x='Loan Status', y='Credit Score', data=df)

We see lots of outliers in Credit Score. As discussed earlier, credit scores should lie below 900. We could drop every row with a credit score greater than 900, but that would shrink the dataset, which is not the optimal way of dealing with outliers, or we could replace the outliers with the mean credit score. Looking carefully at the values higher than 900, we can see a 0 in the rightmost position that seems to have been added accidentally while entering the value, so we will remove that trailing 0 from the affected values and redraw the box plot. Similarly, we can also see outliers in Current Loan Amount, Monthly Debt and Maximum Open Credit.
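A minimal sketch of that fix, assuming (as described above) that the affected scores simply carry an extra trailing zero, so dividing them by 10 restores the intended value:

# Credit scores above 900 appear to have an accidental trailing zero
mask = df['Credit Score'] > 900
df.loc[mask, 'Credit Score'] = df.loc[mask, 'Credit Score'] / 10
sns.boxplot(x='Loan Status', y='Credit Score', data=df)  # redraw the box plot after the fix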

sns.boxplot(x='Loan Status', y='Monthly Debt', data=df)
sns.boxplot(x='Loan Status', y='Maximum Open Credit', data=df)

We can see that a lot of the outliers in Maximum Open Credit are present in only a single class, Fully Paid, i.e., customers who successfully paid back the loan, while Monthly Debt also has some outliers in the Charged Off class. We remove those outliers from the dataset.

IQR for detecting outliers

A box plot tells us, more or less, how the data is distributed. It gives a sense of how spread out the data is, what its range is, and how skewed it is. In the above figure:

  • minimum is the minimum value in the dataset,
  • and maximum is the maximum value in the dataset. So the difference between the two tells us about the range of dataset.
  • The median is the median (or center point) of the data, also called the second quartile.
  • Q1 is the first quartile of the data, i.e., to say 25% of the data lies between minimum and Q1.
  • Q3 is the third quartile of the data, i.e., to say 75% of the data lies between minimum and Q3. The difference between Q3 and Q1 is called the Inter-Quartile Range or IQR.
  • To detect outliers using this method, we define a new range, let's call it the decision range, and any data point lying outside this range is considered an outlier. The range is: Lower Bound = Q1 - 1.5 * IQR and Upper Bound = Q3 + 1.5 * IQR.
  • Any data point less than the Lower Bound or more than the Upper Bound is considered an outlier. Now we find the lower and upper bounds of Monthly Debt and Maximum Open Credit to remove their outliers.
min_threshold, max_threshold = df['Maximum Open Credit'].quantile([0.001, 0.999])
min_threshold, max_threshold
df = df[(df['Maximum Open Credit'] < max_threshold) & (df['Maximum Open Credit'] > min_threshold)]
min_threshold, max_threshold = df['Monthly Debt'].quantile([0.001, 0.999])
min_threshold, max_threshold
df = df[(df['Monthly Debt'] < max_threshold) & (df['Monthly Debt'] > min_threshold)]

Let's talk about Current Loan Amount: its lower and upper bound values are 21538.0 and 999999999.0, so we only keep the data that falls between the lower and upper bound. This is how we remove outliers from the dataset.
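For comparison, the IQR-based decision range from the list above can be computed directly. The helper below is a sketch; its name and the choice of Current Loan Amount as the example column are assumptions.

def iqr_bounds(series):
    # Decision range: [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

lower, upper = iqr_bounds(df['Current Loan Amount'])
outliers = df[(df['Current Loan Amount'] < lower) | (df['Current Loan Amount'] > upper)]
outliers.shape  # how many rows fall outside the decision range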

Looking at unique values in our data

df_dup = df[df.duplicated()]
df_dup.shape
df.drop(df_dup.index, inplace=True)

There are 10215 duplicated rows in our dataset, which we have to remove. If we look at the unique values of the different categorical features, we find that the Purpose and Home Ownership features contain duplicate categories: for example, the 'other' category appears twice, once starting with a lowercase 'o' and once with an uppercase 'O'. We can fix those values with a find-and-replace (in pandas, or in Excel). We will also merge minority categories whose share is less than 0.5% into the 'Other' category, as sketched below.

df['Purpose'].value_counts()
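A pandas sketch of that cleanup; the exact category spellings 'other'/'Other' and the 0.5% cut-off follow the description above, so adapt them to the actual values in the column.

# Merge the duplicated spelling of the 'other' category
df['Purpose'] = df['Purpose'].replace({'other': 'Other'})

# Fold categories that make up less than 0.5% of the rows into 'Other'
freq = df['Purpose'].value_counts(normalize=True)
rare = freq[freq < 0.005].index
df['Purpose'] = df['Purpose'].replace(list(rare), 'Other')
df['Purpose'].value_counts()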

Handling Missing/Null Values

We can find the total number of missing values in each feature using pandas functions.

df.isna().sum()
sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='viridis')

The percentage of missing values in Months since last delinquent is 48%. Since it is greater than 30%, we will not use this feature in our model. Removing every row with a missing value would result in a loss of data, which we don't want. Instead, we can impute the missing values with the mean, median or mode (KNN imputation is another option, but here we stick with simple imputation). If a feature is categorical, we use the mode (the most frequent value) to fill missing values. For a numerical feature we use the mean, but if there are outliers we use the median, which isn't sensitive to outliers.

df['Credit Score'] = df['Credit Score'].fillna(df['Credit Score'].median())
df['Annual Income'] = df['Annual Income'].fillna(df['Annual Income'].mean())
df['Years in current job'] = df['Years in current job'].fillna(df['Years in current job'].mean())
df['Months since last delinquent'] = df['Months since last delinquent'].fillna(df['Months since last delinquent'].median())
df['Maximum Open Credit'] = df['Maximum Open Credit'].fillna(df['Maximum Open Credit'].median())
df['Bankruptcies'] = df['Bankruptcies'].fillna(df['Bankruptcies'].mode()[0])
df['Tax Liens'] = df['Tax Liens'].fillna(df['Tax Liens'].mode()[0])
sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='viridis')

Handling the Imbalanced Data

Let's look at the percentage of each class in the dataset.

df['Loan Status'].value_counts(normalize=True)
df['Loan Status'].value_counts(normalize=True).plot.bar()

We see that our dataset is not balanced. We need to balance it, otherwise our model will be biased towards the majority class. There are two techniques to handle an imbalanced dataset.

  1. Under Sampling
  2. Over Sampling

In under-sampling we reduce the number of data points of the majority class to balance the dataset, and in over-sampling we increase the number of data points of the minority class. Most of the time we use over-sampling because we don't want to lose data, so that is what we will do here. Simply duplicating minority data points doesn't add any new information to the model; instead, new data points are synthesized from the existing ones. This method is referred to as the Synthetic Minority Oversampling Technique (SMOTE). SMOTE first selects a minority class instance a at random and finds its k nearest minority class neighbours. A synthetic instance is then created by choosing one of those k nearest neighbours b at random and picking a point on the line segment connecting a and b in feature space. The best way of applying SMOTE is not to apply it to all the data at once: after we split the data into train and test sets, we apply SMOTE only on the training data, because synthetic samples should only be used for training, not for validation.

Feature Selection

Feature selection is the process of automatically selecting the features that contribute most to our model. Irrelevant features do not increase the accuracy of the model, so we want to remove the features that don't have a significant effect. We can use the seaborn library to plot a heat map and study the correlation between features.

plt.figure(figsize=(12,10))
df_corr = df.select_dtypes(include='number')  # numeric columns only (assumed definition of df_corr)
sns.heatmap(df_corr.corr(), annot=True, cmap='RdYlGn')
plt.show()

We see that most of our features are not highly correlated with each other. Two features, Number of Credit Problems and Bankruptcies, are correlated with each other. If two features are perfectly correlated, one of them doesn't add any additional information, so removing either of them doesn't affect the accuracy of the model. We can also use the Variance Inflation Factor (VIF) for feature selection. VIF is a measure of the amount of multicollinearity in a set of features; multicollinearity occurs when two or more independent variables are highly correlated with one another in a regression model. We can compute the VIF of each feature and remove features with a VIF greater than 10. The VIF score of an independent variable represents how well that variable is explained by the other independent variables: we treat one independent variable as the dependent variable, fit a linear regression model on the remaining independent variables, and calculate the coefficient of determination R², which represents the proportion of the variation in the chosen variable that is explained by the others. A high R² means the variable is highly correlated with the other variables. After calculating the coefficient of determination, the VIF is given by the formula VIF = 1 / (1 - R²).

from statsmodels.stats.outliers_influence import variance_inflation_factor

# df1 is assumed to be a DataFrame holding the numeric candidate features
vif = [variance_inflation_factor(df1.values, i) for i in range(df1.shape[1])]
for i in range(df1.shape[1]):
    print("The VIF for {} is {}".format(df1.columns[i], vif[i]))

Splitting the data

The fundamental goal of an ML model is to make accurate predictions on future data instances beyond those used for training. Before using an ML model to make predictions, we need to evaluate its predictive performance. To estimate the quality of a model's predictions on data it has not seen, we split our data into training and testing sets. This allows us to evaluate the model on unseen data.

from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(
df_final_feature,y,test_size=0.1,stratify=df['Loan Status'])

Preprocessing the data

Data preprocessing is crucial in any data mining process, as it directly impacts the success rate of the project. First of all, we remove unnecessary features like Loan ID and Customer ID, which are unique values assigned to each customer and don't add any information to the model. Before feeding our data into a model, we also need to encode categorical values into numerical values. A categorical value refers to information that takes specific categories within the dataset; there are four categorical features in our dataset: Loan Status, Term, Home Ownership and Purpose. We use One Hot Encoding to treat the categorical variables. One Hot Encoding creates additional features based on the number of unique values in a feature, and every unique value is treated as a binary feature containing either 0 or 1.

One_hot = pd.get_dummies(df[['Term','Home Ownership', 'Purpose']],drop_first=True)
One_hot.head()

Tree-based models don't perform well with one-hot encoding if a feature has too many unique values. This is because they pick a subset of features while splitting the data; if there are many unique values, the chosen features will be mostly zero, which doesn't produce informative splits. So we don't perform one-hot encoding when training tree-based models. Linear models don't suffer from this problem.

The next step of preprocessing is feature scaling. There are various methods to scale the data, such as standardization and min-max scaling. With a min-max scaler, we simply subtract the minimum value of the feature from the current value and divide by the difference between the maximum and minimum values; this rescales the feature to the range 0 to 1. The other technique is to calculate the mean and standard deviation of the attribute values, subtract the mean from each value, and divide the result by the standard deviation. This process is called standardizing a variable and results in a set of values whose mean is zero and whose standard deviation is one.
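A minimal sketch of both scalers with scikit-learn, fitting on the training split only so that no information from the test set leaks into the transformation (variable names follow the split above):

from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Min-max scaling: (x - min) / (max - min), rescales each feature to [0, 1]
mm = MinMaxScaler()
X_train_mm = mm.fit_transform(X_train)
X_test_mm = mm.transform(X_test)

# Standardization: (x - mean) / std, gives zero mean and unit standard deviation
sc = StandardScaler()
X_train_sc = sc.fit_transform(X_train)
X_test_sc = sc.transform(X_test)

The SMOTE resampling discussed earlier is then applied to the training split only: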

from imblearn.over_sampling import SMOTE

# Oversample the minority class up to 80% of the majority class size
os = SMOTE(sampling_strategy=0.8, random_state=42)
X_train_res, y_train_res = os.fit_resample(X_train, y_train)
X_train_res.shape, y_train_res.shape

Evaluation Metrics

Evaluation metrics are used to measure the performance of a machine learning model, which is an integral part of any data science project. They aim to estimate the generalization performance of a model on future (unseen/out-of-sample) data. The most frequently used classification metric is accuracy. You might believe that a model is good when the accuracy is 99%; however, this is not always true, and accuracy can be misleading, for example on imbalanced datasets.

Confusion Matrix

Evaluation of the performance of a classification model is based on the counts of test records correctly and incorrectly predicted by the model. The confusion matrix provides a more insightful picture: not only the overall performance of a predictive model, but also which classes are being predicted correctly and incorrectly, and what types of errors are being made. The four counts (TP, FP, FN, TN) compare the predicted value with the actual value, as presented in the confusion matrix table below.

  • True Positive (TP): predicted positive and positive in reality, i.e., the model predicts the customer will repay the loan and they actually repay it.
  • True Negative (TN): predicted negative and negative in reality, i.e., the model predicts the customer will not repay the loan and they actually do not repay it.
  • False Positive (FP): predicted positive but negative in reality, i.e., the model predicts the customer will repay the loan but they actually do not.
  • False Negative (FN): predicted negative but positive in reality, i.e., the model predicts the customer will not repay the loan but they actually do.
  • Accuracy: the most intuitive performance measure, the ratio of correctly predicted observations to total observations, (TP + TN) / (TP + TN + FP + FN). One may think that high accuracy means the model is good; accuracy is a useful measure, but only on symmetric datasets where false positives and false negatives are roughly equally frequent and equally costly, so we also have to look at other metrics.
  • Precision: the ratio of correctly predicted positive observations to all predicted positive observations, TP / (TP + FP).
  • Recall (Sensitivity): the ratio of correctly predicted positive observations to all observations that are actually positive, TP / (TP + FN).
  • F1 score: the harmonic mean of precision and recall, 2 * (Precision * Recall) / (Precision + Recall), so it takes both false positives and false negatives into account. It is not as intuitive as accuracy, but it is usually more useful, especially with an uneven class distribution. Accuracy works best when false positives and false negatives have similar cost; when their costs differ, look at both precision and recall. A short sketch of computing these metrics follows this list.
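A minimal sketch of computing these metrics with scikit-learn; y_pred is a hypothetical vector of test-set predictions from one of the models in the next section, and the target is assumed to be encoded as 0/1 (pass pos_label=... if the labels are kept as strings).

from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_score, recall_score, f1_score)

print(confusion_matrix(y_test, y_pred))  # rows: actual, columns: predicted
print('Accuracy :', accuracy_score(y_test, y_pred))
print('Precision:', precision_score(y_test, y_pred))
print('Recall   :', recall_score(y_test, y_pred))
print('F1 score :', f1_score(y_test, y_pred))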

Learning Algorithm

Algorithms are designed to solve specific problems, so it is important to know what type of problem we are dealing with and what kind of algorithm works best for it. Choosing the right algorithm is an important phase of a data science project. Since our project is a classification problem, we can choose from a variety of classification algorithms. Some of the algorithms we have used are listed below:

  1. Logistic Regression
  2. K Nearest Neighbors
  3. Random Forest Classifier
  4. Support Vector Machine

Logistic Regression

Logistic Regression is a supervised machine learning algorithm for classification problems. It is mainly used for binary classification, i.e., when the target variable has two classes, and is usually applied to problems where the two classes are roughly linearly separable. Logistic regression is named after the logistic function, also called the sigmoid function, sigma(z) = 1 / (1 + e^(-z)). The logistic function gives an "S"-shaped curve that maps any real value into a value between 0 and 1. If the output of the function is greater than 0.5 we classify the example as 1, and if the output is less than 0.5 we classify it as 0. Below is the figure of the sigmoid function.
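For reference, the sigmoid is a one-liner; this tiny sketch just shows how real values are squashed into (0, 1).

import numpy as np

def sigmoid(z):
    # Maps any real value into the range (0, 1)
    return 1 / (1 + np.exp(-z))

sigmoid(-4), sigmoid(0), sigmoid(4)  # approx. 0.018, 0.5, 0.982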

We will use the sklearn library to implement logistic regression. Since our dataset is imbalanced, evaluating the model on accuracy alone gives misleading results, so we use the confusion matrix to show the correct and incorrect predictions. We want to decrease the number of false positives while keeping the number of false negatives acceptable. We will train with different upsampling ratios and compare the results.
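A minimal training-and-evaluation sketch with scikit-learn; the train_and_report helper is an assumption introduced here for brevity and reused for the other models below (it trains on the SMOTE-resampled data and reports test-set results).

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

def train_and_report(model, X_train, y_train, X_test, y_test):
    # Fit on the (resampled) training data and report accuracy plus the confusion matrix
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print('Accuracy:', accuracy_score(y_test, y_pred))
    print(confusion_matrix(y_test, y_pred))
    return model

log_reg = train_and_report(LogisticRegression(max_iter=1000),
                           X_train_res, y_train_res, X_test, y_test)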

The accuracy using an upsampling ratio of 0.8 is 65% and the confusion matrix is shown in the figure below.

The accuracy using an upsampling ratio of 0.9 is 57% and the confusion matrix is shown in the figure below.

The accuracy using an upsampling ratio of 1.0 is 51% and the confusion matrix is shown in the figure below.

We can see that as the upsampling ratio increases, the accuracy of the model decreases. The number of false positives also decreases with a higher upsampling ratio, while the number of false negatives increases. In such a case we can set a threshold on the acceptable number of false negatives and choose the best-performing model according to our needs.

K Nearest Neighbors

KNN is a non-parametric method used for both classification and regression problems, and it is one of the simplest algorithms. The key idea of KNN is that nearby points belong to the same class, and it makes no assumptions about the data. It computes the distance from the test point to every training example, selects the k closest instances and their labels, and outputs the class that is most frequent among those labels. The training time of KNN is essentially zero, while prediction can become very slow with a large number of data points, because we need to store all training examples and compute the distance to each of them for a single prediction. The figure below shows how the KNN algorithm works.
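Using the same helper sketched in the logistic regression section (k = 5 is an assumed default, not necessarily the value behind the reported results; as a distance-based model, KNN benefits from the scaled features):

from sklearn.neighbors import KNeighborsClassifier

knn = train_and_report(KNeighborsClassifier(n_neighbors=5),
                       X_train_res, y_train_res, X_test, y_test)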

The accuracy using an upsampling ratio of 0.8 is 59% and the confusion matrix is shown in the figure below.

The accuracy using an upsampling ratio of 0.9 is 59% and the confusion matrix is shown in the figure below.

The accuracy using an upsampling ratio of 1.0 is 57% and the confusion matrix is shown in the figure below.

Random Forest Classifier

Random Forest is an ensemble learning method for classification and regression problems. It uses the bagging technique: we have multiple base learner models, which in this case are decision trees. We take a sample of rows and a sample of features, with replacement, so some samples are repeated but not all, and give that sample to the first decision tree; we repeat the same row and feature sampling for every base learner. Each base learner is then trained on its own sample; this sampling with replacement is called bootstrapping. When the test data comes in, every model gives an output, and according to bagging we aggregate the outputs using a majority vote: the class predicted by the most models is the final result.

When we grow a single decision tree to its full depth, it has low bias and high variance. That means the tree fits the training data almost perfectly, so the training error is very low, but the high variance means that on new test data the tree produces a much larger error. In other words, growing a decision tree to its full depth leads to over-fitting. When we combine multiple decision trees, the high variance is converted into low variance, which mitigates the over-fitting.

We can use the same method described above.
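As a sketch with the same helper (the hyperparameters shown are illustrative defaults, not the exact settings behind the reported results):

from sklearn.ensemble import RandomForestClassifier

rf = train_and_report(RandomForestClassifier(n_estimators=200, random_state=42),
                      X_train_res, y_train_res, X_test, y_test)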

The accuracy using an upsampling ratio of 0.8 is 71% and the confusion matrix is shown in the figure below.

The accuracy using an upsampling ratio of 0.9 is 70% and the confusion matrix is shown in the figure below.

The accuracy using an upsampling ratio of 1.0 is 69% and the confusion matrix is shown in the figure below.

Support Vector Machine

Support Vector Machine is a supervised learning model that analyzes data for classification and regression. It constructs a hyperplane in a high-dimensional space which can be used for classification or regression, choosing the hyperplane that has the largest distance to the nearest training data points. The closest points are called support vectors, hence the name Support Vector Machine. The distance from the nearest training point to the hyperplane is called the margin, and SVM tries to maximize it, so SVMs are also called max-margin classifiers. A max-margin boundary is less affected by mislabelled data than a min-margin one and generalizes better to unseen data. SVMs are effective in high-dimensional spaces because data are more likely to be linearly separable in high dimensions than in low dimensions. The figure below illustrates the SVM algorithm.
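A sketch with the same helper, assuming an RBF-kernel SVC with default regularization; SVM training can be slow on large datasets, so feature scaling matters here.

from sklearn.svm import SVC

svm = train_and_report(SVC(kernel='rbf', C=1.0),
                       X_train_res, y_train_res, X_test, y_test)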

We can use the same method described above. The accuracy using an upsampling ratio of 1.0 is 45% and the confusion matrix is shown in the figure below.

Analyzing the above confusion matrices, we can see that the performance of the different algorithms varies significantly. SVM doesn't perform well, as it predicts only a single class for all of the data. At an upsampling ratio of 1.0, logistic regression has the lowest number of false positives compared to the other algorithms, which is the desired objective of our problem; however, the decrease in false positives came at the cost of an increase in false negatives.

Conclusion

This article aimed to explore, analyze, and build a machine learning model to correctly identify whether a person, given certain attributes, has a high probability of defaulting on a loan. This type of model could be used by banks and financial institutions to identify financial traits of future borrowers that have the potential to default and not pay back their loan by the designated time. The Random Forest Classifier provided an accuracy of 72% while Logistic Regression provided an accuracy of 65%; taking domain knowledge and the confusion matrices into account, the Random Forest model appears to be the better option for this kind of data. We hope such a system can be deployed in a real-world setting after further refinement, where it could be very useful for the financial sector.

Links

The code for the project can be found Here

The original dataset can be found Here
