Bank Loan Status Prediction

Weichen Lu
Data Science is life
5 min readApr 18, 2018

Recently, I got a chance to work on the data related to bank loan. Differ from other datasets, the bank loan data are raw and unprepared. In other words, it requires me to explore the relationship between each tables and prepare the data in MySql at first. Once the datasets are prepared, I can import the data to python and complete the machine learning part.

Background

We should always aware of the business objective of the task before start coding and analysis. The banks are trying to predict the status of default customer as well as the importance features related to default status. For this task, the goal is to predict the customers who are more likely to default on loan and find the key predictors which have major impact on default.

Data Exploration

There are totally eight datasets where each of the dataset contains different information and can be connected together.

Datasets relations

You can always join all the tables together as your final dataset and explore the features later on. However, the goal is to predict the loan status so that the loan table should be our primary focus. I did some case analysis on each tables as well as feature transformation. For transaction table, I retrieved the number of orders, last transaction date, number of each operation types and average transaction amount of each types.

Modify transaction table

Furthermore, I created a temporary table which includes balance so that I can join to the transaction table later on.

Adding balance

Joining tables together based on account_id

Joining order, trans, and card_info with loan

Adding balance to complete the dataset

Final dataset

Once the dataset is prepared, we can move on to the machine learning section.

Import data to python

Firstly, we have to set up the connection between MySQL and python.

Configuration

After that, let’s see how data looks like

dataset

Data Preprocessing and Feature Engineering

For this task, the status column is our response variable. It contains four characters where each of them represent a meaning.

Status

Our goal is to predict the customer who are more likely to default. In this case, I set A and C to 0, B and D to 1.

Set up response variable

On the other hand, the account_id contains duplicate value since they made more than one transaction in the last day. However, the data does not have the exact date time of these transactions. In this case, I took the average of the balance and drop the duplicate values as well as dates.

Remove duplicates and redefine balance

Dataset contains 682 records, I split the data into 80 % training data and 20 % testing data by using train/test split. Then, I check if there is any missing value in my dataset. Unfortunately, I got two columns with missing values which are collection_from_another_bank_amount, and credit_card_withdrawal_amount. By exploring the features, I realize that the missing values might means no transaction for that account and I basically just filled them with zero. Moreover, I have also checked the size of my class.

Training data

Although the size of these two classes are not extreme imbalance, I still used SMOTE to balance the size of two classes in order to see if there is any improvement to the model.

Modeling and Evaluation

The dataset is not huge and I only ran two algorithms which are random forest and logistic regression. I used 10 fold cross validation to compare the accuracy of two models as well as predicting the customer who is most likely to become default.

Before sampling
After sampling

As we can see from the outputs, after using SMOTE to resample the data, the accuracy for Random Forest model has increased from 0.91 to 0.95 where the default probability of customer with account_id equal 1244 have 100 % chance to be default. On the other hand, the accuracy for Logistic Regression has significantly decreased from 0.90 to 0.78. In a nutshell, resampling the data size might not always increase the accuracy.

Feature Importance

By implementing the random forest model, we are able to check the importance of each features.

The above graph shows that balance has most impact on the performance of model which also indicate that it is the key predictor. In other words, the more balance you have in your account, the less chance for you to become a default customer.

Conclusion

Overall, it is depends on the business requirement to set the threshold for customer who are more likely to become default. In general, 0.5 might be the case. For those customer who are 50 % more likely to become the default should be the focus. Moreover, company should pay more attention on the balance of each customer since it is a key predictor to determine whether the customer are likely to become default or not.

--

--