Credit Scoring using Random Forest with Cross Validation

Paul Wanyanga
Published in Analytics Vidhya · 5 min read · Feb 5, 2021

Introduction

Credit risk is one of the most pressing issues in any lending institution. It is the ultimate goal of these institutions to reduce credit risk, even if only by a small margin. This need pushes for the adoption of machine learning techniques to improve credit scoring methods. In addition, it is becoming increasingly hard to rely on traditional credit scoring methods given the influx of credit-invisible clients who barely fit into traditional consumer groups and are easily misclassified by traditional scoring methods.

Given these challenges, this is a knowledge-based article that demonstrates how a lending institution can leverage the power of machine learning to predict clients' credit scores.

In this article, we use several classification models to predict the likelihood of a customer defaulting on a loan based on past data. The outcome variable is binary, with "good" and "bad" as the possible outcomes. The features used include latitude, longitude, bank branch, employment status, level of education, and variables relating to the client's past loan history.

The data used has been derived from a knowledge hackathon from the Zindi portfolio of challenges (All rights reserved). You can have a look at the complete code here.

Data Exploration

Data exploration refers to the process of using visualization techniques to better understand the data. This process can involve anything from identification of extreme values using box-plots to use of heat-maps to check for cases of perfect correlation. A quick look at the data using box-plots reveals that longitude and latitude variables are highly affected by extreme values while the rest of the continuous variables are pretty much okay.
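The box-plot check described above can be done programmatically as well: a box-plot flags points outside the interquartile-range (IQR) fences, so the same rule can count extreme values per column. The sketch below uses toy coordinates standing in for the Zindi data (the values and column names are illustrative, not from the actual data set):

```python
import numpy as np
import pandas as pd

def flag_outliers_iqr(series, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] -- the same fences a box-plot draws."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Toy data: one extreme coordinate hidden among typical ones
df = pd.DataFrame({
    "longitude": [36.8, 36.9, 37.0, 36.7, 36.8, 120.0],
    "latitude":  [-1.3, -1.2, -1.3, -1.4, -1.3, 45.0],
})

outliers = df.apply(flag_outliers_iqr)
print(outliers.sum())  # number of flagged extreme values per column
```

On the real data, the same check would confirm that longitude and latitude carry far more flagged points than the other continuous variables.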

Multiple histograms are used to show the distribution of different variables with respect to the outcome variable:

The number of good loans is greater than that of bad loans, and the distribution of total loans is positively skewed, with the majority of clients having fewer than 10 loans in total.

The amount of loan due is also positively skewed, with over 2,500 clients having around 10,000 due.

Feature Engineering

In this step, date objects are converted and new variables such as age are derived from 'birth-date'. The repayment period is also derived from 'first due date' and 'first repaid date'. The categorical variables are encoded.
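A minimal sketch of these derivations with pandas follows. The column names, dates, and branch values are hypothetical stand-ins for the real data, and the reference date for computing age is fixed arbitrarily:

```python
import pandas as pd

df = pd.DataFrame({
    "birthdate":       ["1985-04-12", "1992-11-03"],
    "firstduedate":    ["2020-02-01", "2020-03-15"],
    "firstrepaiddate": ["2020-01-28", "2020-04-02"],
    "bank_branch":     ["Nairobi", "Mombasa"],   # hypothetical values
})

# Convert the date strings into datetime objects
for col in ["birthdate", "firstduedate", "firstrepaiddate"]:
    df[col] = pd.to_datetime(df[col])

# Derive age (whole years relative to a fixed reference date)
reference = pd.Timestamp("2021-02-05")
df["age"] = (reference - df["birthdate"]).dt.days // 365

# Derive the repayment period in days (negative means repaid before the due date)
df["repayment_period"] = (df["firstrepaiddate"] - df["firstduedate"]).dt.days

# Encode the categorical variable as integer codes
df["bank_branch_enc"] = df["bank_branch"].astype("category").cat.codes
```

Integer category codes are one simple encoding choice; one-hot encoding would be an equally reasonable alternative for a non-ordinal variable like branch.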

Lastly, the latitude and longitude data are log-transformed to take care of the extreme outliers.
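Since coordinates can be negative, a plain logarithm is undefined for some values; one way to apply the transform, sketched below on toy values, is a signed log1p that compresses extreme magnitudes while preserving sign (an assumption about the implementation, not the article's exact code):

```python
import numpy as np
import pandas as pd

coords = pd.DataFrame({"longitude": [36.8, 120.0], "latitude": [-1.3, 45.0]})

# Signed log1p: log(1 + |x|) with the original sign restored
for col in ["longitude", "latitude"]:
    coords[col + "_log"] = np.sign(coords[col]) * np.log1p(np.abs(coords[col]))
```

After the transform, the gap between typical and extreme coordinates shrinks dramatically, which is exactly the effect wanted for the outliers spotted earlier.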

Cross Validation

Cross-validation is applied to compare the candidate models and select the best one. Three models are evaluated with cross-validation: Random Forest, Logistic Regression, and Decision Trees. Random Forest has the best average score of 0.92 and is selected for building the final model.
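The comparison can be sketched with scikit-learn's `cross_val_score`. Synthetic data stands in for the loan features here, and the default hyperparameters are an assumption, so the scores will differ from the 0.92 reported on the real data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for the loan features
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

models = {
    "Random Forest": RandomForestClassifier(random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
}

# 5-fold cross-validated accuracy for each candidate model
scores = {name: cross_val_score(m, X, y, cv=5).mean() for name, m in models.items()}
best = max(scores, key=scores.get)
print(best, scores)
```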

Feature Importance

Having chosen random forest as the model of choice, feature importance analysis is conducted to select the most relevant features to include in the model. Any feature with a score of 0 or less is dropped.
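A sketch of this step using the forest's impurity-based `feature_importances_` is below; again the data and feature names are synthetic placeholders, not the article's features:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=0)
X = pd.DataFrame(X, columns=[f"feat_{i}" for i in range(8)])

rf = RandomForestClassifier(random_state=0).fit(X, y)

# Impurity-based importances, one non-negative score per feature, summing to 1
importances = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)

# Drop anything with an importance of 0 or less, as in the article
kept = importances[importances > 0].index.tolist()
X_reduced = X[kept]
```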

Random Forest with Cross Validation

With the irrelevant variables dropped, cross-validation is used to measure the performance of the random forest model. An average score of 0.923 is obtained.

The final model

We use grid-search cross-validation (GridSearchCV) to obtain the best random forest model, and with it we make predictions on the test data.
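A minimal sketch of this step with scikit-learn's `GridSearchCV` follows. The parameter grid is an assumption for illustration (the article does not list the tuned hyperparameters), and synthetic data again stands in for the real features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=400, n_features=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

# Hypothetical grid -- the real project would tune whichever parameters overfit
param_grid = {"n_estimators": [100, 200], "max_depth": [None, 5, 10]}
search = GridSearchCV(RandomForestClassifier(random_state=1), param_grid, cv=5)
search.fit(X_train, y_train)

# best_estimator_ is refit on the full training data by default (refit=True)
best_model = search.best_estimator_
preds = best_model.predict(X_test)
```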

Using the resulting best model, we predict client status on a new test data set.

The final prediction data is converted to a .csv file.
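Exporting the predictions is a one-liner with pandas; the ID and label column names below are hypothetical, since the article does not show the submission format:

```python
import pandas as pd

# Hypothetical submission frame: customer IDs paired with predicted labels
submission = pd.DataFrame({
    "customer_id": ["c1", "c2"],
    "good_bad_flag": ["good", "bad"],
})
submission.to_csv("submission.csv", index=False)
```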

Conclusion

And that is a quick look at what a credit scoring project looks like. The project is not without its challenges, but with them come opportunities. For example, the decision to use cross-validation was adopted after the model overfitted.

Sources

Hyperparameter tuning

Random forest using K-fold validation

On feature importance

On Getting the best model after cross validation using gridsearchCV
