Bank Loan Default Prediction with Machine Learning

Loan defaults cause huge losses for banks, so banks pay close attention to this issue and apply various methods to detect and predict default behaviour among their customers. In this blog, I am going to walk through the basic process of loan default prediction with machine learning algorithms.

Project Motivation

Loans are among the most important products in banking. Every bank tries to figure out effective business strategies to persuade customers to apply for its loans. However, some customers behave negatively after their applications are approved. To prevent this situation, banks have to find methods to predict customers' behaviour. Machine learning algorithms perform quite well for this purpose and are widely used across the banking industry. Here, I will work on loan behaviour prediction using machine learning models.

Data Exploration and Preprocessing

The data set I use contains several tables with plenty of information about the bank customers' accounts, such as loans, transaction records and credit cards. My main purpose is to predict each account's loan behaviour, so the most important table is "loan". After checking the descriptions of all the features, I concluded that "order", "trans" and "card" also contain useful information for this purpose, and that the "account" and "disposition" tables are needed to join everything together. The required tables are highlighted in the following figure.

Figure-1 Data Sets Structure

The column "status" in the "loan" table is the target variable; it represents the customers' loan behaviour. Next, I select useful features for the model, based on their relevance to the target and on business sense. For example, the columns "bank to" and "account to" have nothing to do with loan behaviour, so I drop them.

The next step is data preprocessing. First, I have to convert the categorical variables into numeric ones. Usually this is part of feature engineering in Python, but since I need to aggregate the data in MySQL and aggregation functions cannot be applied to categorical variables, I conduct this step in MySQL.

Figure-2 Categorical Variable Conversion with SQL
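For readers who would rather do this conversion on the Python side, here is a rough pandas equivalent of the SQL step: one-hot encode the categorical column so it can be aggregated numerically. The table and column names are illustrative, not the real schema.

```python
import pandas as pd

# Illustrative transaction table; column names are assumptions, not the real schema
trans = pd.DataFrame({
    "account_id": [1, 1, 2],
    "type": ["credit", "withdrawal", "credit"],
    "amount": [700.0, 300.0, 500.0],
})

# One-hot encode the categorical column so numeric aggregation becomes possible
trans = pd.get_dummies(trans, columns=["type"], dtype=int)

# Numeric aggregation per account now works on the former categorical column
agg = trans.groupby("account_id").sum()
print(agg)
```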

Then I join all the useful tables together based on the keys that connect them. Since feature engineering and model fitting are done in Python, I connect Python to the MySQL database and import the tables into pandas data frames.

Figure-3 Import data in MySQL into Python
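The import step can be sketched with `pandas.read_sql`. In practice you would open a MySQL connection (e.g. with pymysql or SQLAlchemy); an in-memory SQLite database stands in here so the sketch runs anywhere, and the table contents are made up.

```python
import sqlite3

import pandas as pd

# In practice: a MySQL connection, e.g. via pymysql. SQLite stands in here
# so the example is self-contained.
conn = sqlite3.connect(":memory:")

# Pretend this table was already prepared by the SQL preprocessing step
pd.DataFrame({"loan_id": [1, 2], "amount": [80952.0, 30276.0]}).to_sql(
    "loan", conn, index=False)

# Import the prepared table into a pandas DataFrame
loan = pd.read_sql("SELECT * FROM loan", conn)
print(loan.shape)
```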

After loading all the data into Python, I need to separate a holdout test set from the rest of the data, which guards against overfitting. Since the data set is small, I use a simple stratified holdout split here for better generalization, which randomly divides the data set into two parts.
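The holdout split can be sketched with scikit-learn's `train_test_split` on toy data (the feature names and the 20% split size are illustrative):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy feature matrix and binary target; the real data comes from the joined tables
X = pd.DataFrame({"amount": range(100), "duration": range(100, 200)})
y = [0] * 80 + [1] * 20

# Hold out 20% as an untouched test set; stratify keeps the class ratio,
# which matters because defaults are the minority class
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
print(len(X_rest), len(X_test))  # 80 20
```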

Feature Engineering

There are missing values in the "loan monthly payment" column, which mean the customers did not make payments on their loans. In this situation, the missing values should be imputed with zero rather than with the mean or median.
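The zero imputation is a one-liner with `fillna`; the column name below is illustrative:

```python
import numpy as np
import pandas as pd

# "payments" mirrors the "loan monthly payment" column; values are made up
df = pd.DataFrame({"payments": [3373.0, np.nan, 2523.0, np.nan]})

# A missing value means no payment was made, so zero is the faithful fill --
# mean/median imputation would invent payments that never happened
df["payments"] = df["payments"].fillna(0)
print(df["payments"].tolist())  # [3373.0, 0.0, 2523.0, 0.0]
```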

Figure-4 Target Variable

In the original data, the target variable is categorical, grouped into four classes from A to D. To run the prediction, I need to encode it as 1 or 0, representing the two binary classes.
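A minimal encoding sketch with a pandas `map`. The exact grouping below is an assumption (B and D treated as default, A and C as good); verify it against the data dictionary before reusing the mapping.

```python
import pandas as pd

status = pd.Series(["A", "B", "C", "D", "A"])

# Map the four status classes to a binary target. This grouping is an
# assumption: B and D are treated as default (1), A and C as good (0).
target = status.map({"A": 0, "B": 1, "C": 0, "D": 1})
print(target.tolist())  # [0, 1, 0, 1, 0]
```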

Model Validation and Selection

Here I use K-fold cross validation to split the data (excluding the holdout part) into training and validation sets, and then fit the models. Since this is a classification problem, I choose logistic regression, random forest and XGBoost. To compare the performance of the three models, I plot their ROC curves and calculate their AUC scores.
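The cross-validated comparison can be sketched as follows. Synthetic imbalanced data stands in for the joined loan features, and scikit-learn's `GradientBoostingClassifier` stands in for XGBoost so the sketch needs only one library; in the real pipeline you would swap in `xgboost.XGBClassifier`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic imbalanced data stands in for the real loan features
X, y = make_classification(n_samples=400, n_features=10, weights=[0.85],
                           random_state=42)

# GradientBoostingClassifier is a stand-in for XGBoost in this sketch
models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(random_state=42),
    "gradient boosting": GradientBoostingClassifier(random_state=42),
}

# Mean 5-fold cross-validated AUC for each candidate model
for name, model in models.items():
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name}: AUC = {auc:.3f}")
```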

Figure-5 ROC Curve

According to Figure-5, the random forest has the best performance, so I choose it as the loan default prediction model. I then use grid search to tune its hyper-parameters.
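A minimal grid search sketch with scikit-learn's `GridSearchCV` on synthetic data; the grid values below are illustrative, not the ones actually tuned in this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=8, random_state=42)

# A small illustrative grid; a real search would cover more values
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}

# Exhaustively try every combination with 3-fold cross-validated AUC
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, cv=3, scoring="roc_auc")
search.fit(X, y)
print(search.best_params_)
```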

Random forest has the additional advantage of exposing feature importances, which show how much impact each feature has on the final outcome.
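The importances come from the fitted model's `feature_importances_` attribute; a sketch on synthetic data, with made-up feature names:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=5, random_state=42)
cols = ["amount", "duration", "payments", "balance", "n_trans"]  # illustrative names

rf = RandomForestClassifier(random_state=42).fit(X, y)

# feature_importances_ sums to 1; a higher value means the feature drives
# more of the trees' split decisions
importance = pd.Series(rf.feature_importances_, index=cols).sort_values(ascending=False)
print(importance)
```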

Figure-6 Feature Importance

Model Evaluation

Now I combine the training and validation data to fit the random forest, and use the holdout test set for prediction.

Figure-7 Classification Report

Here I generate a classification report to check the performance of the model. The random forest predicts "0" very well, which means it can confidently identify the good customers. By contrast, it has a fairly low recall when predicting loan default behaviour. In layman's terms, recall measures how many of the actual positive cases are predicted correctly. So although all the predictions of "1" are correct, they cover only a small fraction of the customers who actually default.
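The final fit-and-report step can be sketched like this, again on synthetic imbalanced data that mimics the rarity of defaults:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Imbalanced synthetic data mimics the rarity of defaults
X, y = make_classification(n_samples=500, weights=[0.9], random_state=42)
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Refit on everything outside the holdout, then score the holdout once
rf = RandomForestClassifier(random_state=42).fit(X_rest, y_rest)
print(classification_report(y_test, rf.predict(X_test)))
```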

Conclusion and Discussion

In fact, most binary classification models first predict a probability and then assign it to 1 or 0 based on the default threshold of 0.5. To improve the recall of the model, we can take the probabilities predicted by the model and set the threshold ourselves. The threshold depends on several factors, such as business objectives, and differs case by case. In bank loan behaviour prediction, for example, banks want to keep losses at an acceptable level, so they may use a relatively low threshold. This means more customers will be flagged as "potential bad customers" whose profiles will later be checked carefully by the credit risk management team. In this way, banks can detect default behaviour at an earlier stage and take the corresponding actions to reduce potential losses.
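Custom thresholding can be sketched with `predict_proba`. The 0.3 threshold below is illustrative; in practice it would come from the business constraints discussed above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, weights=[0.9], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

rf = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
proba = rf.predict_proba(X_te)[:, 1]  # predicted probability of default

# Lowering the threshold below the default 0.5 flags more accounts as risky,
# trading precision for recall on the default class
for threshold in (0.5, 0.3):
    pred = (proba >= threshold).astype(int)
    print(threshold, recall_score(y_te, pred))
```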