Credit Default Prediction based on Machine Learning Models
Link to all code used in this essay:
Banks around the world receive countless loan applications every day. Some of them are good and will be repaid, but there is still a high risk that a borrower defaults on his or her loan. How could we prevent this problem from happening? Or, in other words, how could we know in advance which borrowers are trustworthy? In the following passage, I provide a machine learning solution to this problem.
Data Set Introduction
I’ve used the Default of Credit Card Clients Dataset provided by UCI Machine Learning. This dataset includes 24 features, ranging from basic information such as sex to bill and repayment statements, for around 30,000 borrowers. The features and their descriptions are listed below (similar names with different indexes are abbreviated using parentheses), followed by a short sketch of loading the data.
- ID: ID of each client
- LIMIT_BAL: Amount of given credit in NT dollars (includes individual and family/supplementary credit)
- SEX: Gender (1=male, 2=female)
- EDUCATION: Education level (1=graduate school, 2=university, 3=high school, 4=others, 5=unknown, 6=unknown)
- MARRIAGE: Marital status (1=married, 2=single, 3=others)
- AGE: Age in years
- PAY_(0–6): Repayment status from September to April 2005 (-1=pay duly, 1=payment delay for one month, 2=payment delay for two months, … 8=payment delay for eight months, 9=payment delay for nine months and above)
- BILL_AMT(1–6): Amount of bill statement from September to April 2005 (NT dollar)
- PAY_AMT(1–6): Amount of previous payment from September to April 2005 (NT dollar)
- default.payment.next.month: Default payment (1=yes, 0=no)
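To make the later code sketches concrete, the data can be loaded roughly as follows (a minimal sketch; the file name UCI_Credit_Card.csv is an assumption about how the Kaggle CSV is saved locally):

```python
import pandas as pd

# Load the credit card clients dataset (the file name is an assumption
# about how the Kaggle CSV was saved locally).
df = pd.read_csv("UCI_Credit_Card.csv")

# Quick sanity checks: shape, feature names, and the class balance of the target.
print(df.shape)
print(df.columns.tolist())
print(df["default.payment.next.month"].value_counts(normalize=True))
```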
Data Exploration and Preparation
If we classify the data by default and non-default only, we discover that the data is highly unbalanced, as shown in figure 1. Around 78% of the data is non-default, while only around 22% is classified as default. This leads to a very high baseline accuracy and could mislead other predictive models (a meaningless dummy classifier that labels every record as non-default would still reach an accuracy of about 78%). Thus, to avoid this problem, we need to undersample the dataset so that default and non-default borrowers carry roughly equal weight, which keeps our predictive models meaningful. As shown in figure 2, I employed the NearMiss algorithm to obtain training and testing partitions with the same number of default and non-default borrowers.
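As a rough sketch of this undersampling step (figure 2 has the exact code; the variable names below are illustrative and assume the raw data sits in the DataFrame df):

```python
from imblearn.under_sampling import NearMiss
from sklearn.model_selection import train_test_split

# Separate features and target (ID carries no predictive signal, so drop it).
X = df.drop(columns=["ID", "default.payment.next.month"])
y = df["default.payment.next.month"]

# NearMiss undersamples the majority (non-default) class so that both
# classes end up with the same number of rows.
X_balanced, y_balanced = NearMiss().fit_resample(X, y)

# Split the balanced data into training and testing partitions.
X_train, X_test, y_train, y_test = train_test_split(
    X_balanced, y_balanced, test_size=0.3, random_state=42
)
```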
Having covered the general characteristics of the dataset, let’s dive deeper into the relationships between individual features and the target variable.
From this graph, we can roughly see that non-default borrowers tend to have higher credit limits (LIMIT_BAL) and tend to be older. However, the effect is not very obvious because of the scale of figure 3. To get a clearer view, I enlarged the distributions of LIMIT_BAL split by default and non-default, shown in figure 4. We can gauge that non-default borrowers have roughly NT$50,000 higher credit limits than default borrowers.
To obtain a more precise, mathematical measure of the relationship between each feature and the target variable, I decided to use a correlation matrix to quantify the linear relationships that exist. Figure 5 is the result I got after running a correlation matrix over all the columns in the dataset. Two parts are worth noticing: 1. the features’ correlations with the target variable, and 2. the highly correlated BILL_AMT(1–6) and PAY_(0–6) columns.
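A minimal sketch of how such a correlation matrix can be computed and plotted (a seaborn heatmap is one illustrative choice, not necessarily the exact code behind figure 5):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Pairwise Pearson correlations between all columns, including the target.
corr = df.corr()

# A heatmap makes the blocks of highly correlated BILL_AMT and PAY_ columns
# easy to spot at a glance.
plt.figure(figsize=(12, 10))
sns.heatmap(corr, cmap="coolwarm", center=0)
plt.show()
```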
Let’s talk about point 1 first. No feature has a very high correlation with the target variable “default payment next month”. This implies that most of the relationships embedded in the dataset are non-linear, and we need non-linear models to capture them.
Secondly, we should always try to avoid highly correlated variables coexisting inside a model, because they flood the model with redundant information and blur the real patterns behind the scenes. Thus, we need to transform those highly correlated BILL_AMT(1–6) and PAY_(0–6) columns. The strategy I deployed is to create new variables that contain the difference between every two consecutive highly correlated variables (code attached below in figure 6). In this way, the redundant information is eliminated while the month-to-month changes remain for analysis.
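Figure 6 contains the exact code; as a hedged sketch of the idea, the differencing might look like this (the pay_diff naming matches the feature names discussed later, while bill_diff and pay_amt_diff are illustrative; how PAY_0 is paired with PAY_2 through PAY_6 is an assumption):

```python
# Replace the highly correlated monthly columns with month-to-month differences.
# Column names follow the Kaggle version of the dataset, which skips PAY_1;
# the exact pairing used in figure 6 may differ slightly.
pay_cols = ["PAY_0", "PAY_2", "PAY_3", "PAY_4", "PAY_5", "PAY_6"]
bill_cols = [f"BILL_AMT{i}" for i in range(1, 7)]
amt_cols = [f"PAY_AMT{i}" for i in range(1, 7)]

for i in range(len(pay_cols) - 1):
    # e.g. pay_diff1 = PAY_0 - PAY_2: the change in repayment status between months.
    df[f"pay_diff{i + 1}"] = df[pay_cols[i]] - df[pay_cols[i + 1]]
    df[f"bill_diff{i + 1}"] = df[bill_cols[i]] - df[bill_cols[i + 1]]
    df[f"pay_amt_diff{i + 1}"] = df[amt_cols[i]] - df[amt_cols[i + 1]]

# Drop the original, highly correlated columns so only the differences remain.
df = df.drop(columns=pay_cols + bill_cols + amt_cols)
```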
After creating those new variables, I reran the correlation matrix, which produced the following chart and confirms that the problem has been corrected.
Modeling
From the descriptions above, we know that our problem is a binary classification problem with many non-linear relationships. Thus, I believe the following models fit our question: 1) Logistic Regression, 2) Decision Tree, 3) Random Forest, 4) AdaBoost.
I also developed the following procedure for each model to follow: 1. model initiation, 2. hyperparameter tuning, 3. cross-validation of the model on both the training and testing partitions, 4. checking for and fixing overfitting. Let's take the random forest as an example of this process (the other models’ code appears at the end of the document).
First, I initialize a random forest classifier with some basic starting parameter settings.
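A minimal sketch of this initialization (the starting values are placeholders, and X_train/y_train are the balanced partitions from the earlier sketch):

```python
from sklearn.ensemble import RandomForestClassifier

# Start from a basic configuration; these values are only a starting point
# and will be revisited during hyperparameter tuning.
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
```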
Then, I employed grid search, which helps us find the best combination of hyperparameters from a large set of candidates. Within the process, I initially tried very large hyperparameter values, because I wanted the model to fit the training set as well as possible first and then simplify it to improve its generalization to the testing set. However, as I enlarged hyperparameter values, for example n_estimators for the random forest classifier, the performance of the model didn’t improve at all and actually started to decrease gradually.
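The grid search might look roughly like the sketch below (the parameter grid shown is illustrative, not the exact grid I searched):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Candidate values for a few key hyperparameters (an illustrative grid).
param_grid = {
    "n_estimators": [100, 200, 400],
    "max_depth": [4, 8, 12, None],
    "min_samples_leaf": [1, 5, 10],
}

# Evaluate every combination with 5-fold cross-validation and keep the one
# with the best mean accuracy on the training partition.
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring="accuracy",
    n_jobs=-1,
)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)
```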
Thus, instead of trying ever larger hyperparameter values, I decided to follow the trend I learned from trials of hyperparameter tuning and gradually narrow the search down to the best set of parameters.
Here is the best set of parameters for the Random Forest Model.
Finally, I used cross-validation to avoid data selection bias and get the “real” accuracy scores for how well our model fits the training and testing partitions. The reason I also want to see how accurately the model fits the training partition is that I use it to detect whether a model is overfitting. If a model overfits the training data, it performs fantastically on the training dataset but poorly on the testing data. Thus, a large difference between the two accuracy scores indicates that the model is overfitting; otherwise, it is not. If we apply this test to the results of the random forest model, we find the difference between the two accuracy scores is rather trivial, only around 0.006. Thus, we can conclude that the random forest model is not overfitting.
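As a sketch of this check, assuming grid_search from the sketch above (GridSearchCV refits the tuned model on the full training partition and stores it in best_estimator_):

```python
from sklearn.model_selection import cross_val_score

# Tuned model from the grid search.
best_rf = grid_search.best_estimator_

# Cross-validated accuracy on the training and the testing partitions.
train_acc = cross_val_score(best_rf, X_train, y_train, cv=5, scoring="accuracy").mean()
test_acc = cross_val_score(best_rf, X_test, y_test, cv=5, scoring="accuracy").mean()

# A large gap between the two scores would indicate overfitting.
print(f"Train accuracy: {train_acc:.3f}")
print(f"Test accuracy:  {test_acc:.3f}")
print(f"Difference:     {train_acc - test_acc:.3f}")
```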
Because the random forest model shows no signs of overfitting, I would like to use the decision tree model as an example of how I fix overfitting problems. We can see, in the following two graphs, that the initial difference between the decision tree model’s accuracies on the training and testing datasets is around 3%. This indicates that the initial decision tree is likely overfitting the training data. To reduce this disparity, I tried different combinations of smaller hyperparameters such as max_depth and min_samples_leaf. After a number of trials, a better set of hyperparameters was found that reduces the disparity between the accuracy scores to around 1%. This implies the problem of overfitting is resolved.
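A hedged sketch of that adjustment (the specific max_depth and min_samples_leaf values below are illustrative, not the final ones):

```python
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# A shallower tree with larger leaves has less capacity to memorize the
# training set, which narrows the gap between training and testing accuracy.
tree = DecisionTreeClassifier(max_depth=6, min_samples_leaf=50, random_state=42)

train_acc = cross_val_score(tree, X_train, y_train, cv=5).mean()
test_acc = cross_val_score(tree, X_test, y_test, cv=5).mean()
print(f"Gap between train and test accuracy: {train_acc - test_acc:.3f}")
```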
For other models, there may be better ways to fix overfitting than trying different sets of hyperparameters. For example, logistic regression can use L1 or L2 regularization to reduce overfitting simply by changing its ‘penalty’ hyperparameter. However, because the logistic model in our case doesn’t overfit, I will not cover that process here.
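For reference, a minimal sketch of what switching the penalty looks like in scikit-learn (the C values are illustrative):

```python
from sklearn.linear_model import LogisticRegression

# L2 regularization (scikit-learn's default) shrinks all coefficients, while
# L1 can zero some of them out entirely; smaller C means stronger regularization.
logit_l2 = LogisticRegression(penalty="l2", C=0.1, max_iter=1000)
logit_l1 = LogisticRegression(penalty="l1", C=0.1, solver="liblinear", max_iter=1000)
```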
Modeling Comparison
To compare all of the models we developed, I made a table displaying the accuracy scores on both the training and testing datasets, shown in figure 8. The table is sorted by “Accuracy for test data”. Compared with the accuracy score of the baseline classifier, our models increase the accuracy by at least 14% and at most 35%. We can also see that the random forest model is the most accurate model we have, so it is our final model. However, this is not the only reason we choose the random forest as our final model.
The other reason we choose the random forest as our final model is that it makes fewer false-negative mistakes than the other models. A false negative in this case means failing to detect a default, which is very costly and is the outcome we most want to avoid. The AdaBoost model is the second-best model we have; however, by comparing the confusion matrices of the two, we can see that the random forest produces fewer false negatives and is thus the better choice.
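A sketch of that comparison, assuming the tuned AdaBoost model is stored in best_ada (an assumed name) and the random forest in best_rf:

```python
from sklearn.metrics import confusion_matrix

# best_rf and best_ada are the tuned random forest and AdaBoost models
# (best_ada is an assumed name for an AdaBoost model tuned the same way).
cm_rf = confusion_matrix(y_test, best_rf.predict(X_test))
cm_ada = confusion_matrix(y_test, best_ada.predict(X_test))

# Rows are true classes and columns are predicted classes, so cm[1, 0]
# counts actual defaults that the model predicted as non-default.
print("Random forest false negatives:", cm_rf[1, 0])
print("AdaBoost false negatives:     ", cm_ada[1, 0])
```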
To see how the random forest model works in practice, I found the record of a borrower who defaulted (target variable = 1) and fed that borrower’s data into our model; the prediction was [1.0], which correctly matches the truth.
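A quick sketch of that spot check (the chosen row is arbitrary; any test row with a true label of 1 works):

```python
# Pick one borrower from the test partition whose true label is 1 (default);
# keep the row 2-D so predict() accepts it.
defaulted_rows = X_test[y_test == 1]
sample = defaulted_rows.iloc[[0]]

# A prediction of 1 means the model flags this borrower as a default.
print(best_rf.predict(sample))
```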
Insights from Modeling
From the random forest model, we can also derive the most important features. As these statistics show, the changes in repayment status from August to September and from July to August are very important predictors. This makes sense: if pay_diff1 and pay_diff2 for a borrower grow larger, meaning the borrower keeps delaying repayments, the borrower is more likely to default, and vice versa. AGE and LIMIT_BAL are the other two important predictors, as discussed in the data preparation part. This also makes sense: as one gets older, one is more likely to accumulate resources and to care more about one’s reputation, which makes default less likely. Also, if one and one’s family are given more credit, the person is more likely to live in a wealthier environment, which also makes default less likely. These insights from analyzing the data could directly help bankers judge whether a borrower is trustworthy.
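For completeness, a minimal sketch of extracting these importances from the fitted forest (best_rf from the earlier sketches):

```python
import pandas as pd

# feature_importances_ reports each feature's average contribution to the
# impurity reduction across all trees in the fitted forest.
importances = pd.Series(best_rf.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(10))
```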
Reflection
One important way the predictive model could be improved is by enhancing the data source. There is still a lot of information that the data does not cover: for example, a borrower’s current economic conditions, such as income and job, or the amount of non-liquid assets the borrower owns. This imperfection of the dataset means the model loses some predictive power and faces more uncertainty.
Code for Other Models