Credit Scoring using Random Forest with Cross Validation

Paul Wanyanga
Published in Analytics Vidhya · 5 min read · Feb 5, 2021

Introduction

Credit risk is one of the most pressing issues in any lending institution. It is the ultimate goal of these institutions to reduce credit risk, even if only by a small margin. This need pushes for the adoption of machine learning techniques to improve credit scoring methods. In addition, it is becoming increasingly hard to rely on traditional credit scoring methods given the influx of credit-invisible clients who barely fit into traditional consumer groups and are easily misclassified by traditional scoring methods.

Given these challenges, this is a knowledge-based article that demonstrates how a lending institution can leverage the power of machine learning to predict clients' credit scores.

In this article, we use several classification models to predict the likelihood of a customer defaulting on a loan based on past data. The outcome variable is binary, with "good" and "bad" as the possible outcomes. The features used include latitude, longitude, bank branch, employment status, level of education, and variables relating to the client's past loan history.

The data used has been derived from a knowledge hackathon from the Zindi portfolio of challenges (All rights reserved). You can have a look at the complete code here.

Data Exploration

Data exploration refers to the process of using visualization techniques to better understand the data. This process can involve anything from identification of extreme values using box-plots to use of heat-maps to check for cases of perfect correlation. A quick look at the data using box-plots reveals that longitude and latitude variables are highly affected by extreme values while the rest of the continuous variables are pretty much okay.
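The box-plot check described above can be done programmatically as well: a box-plot flags points outside the interquartile-range (IQR) fences, so the same rule can count extreme values per column. The sketch below uses toy coordinates standing in for the Zindi data (the values and column names are illustrative, not from the actual data set):

```python
import numpy as np
import pandas as pd

def flag_outliers_iqr(series, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] -- the same fences a box-plot draws."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Toy data: one extreme coordinate hidden among typical ones
df = pd.DataFrame({
    "longitude": [36.8, 36.9, 37.0, 36.7, 36.8, 120.0],
    "latitude":  [-1.3, -1.2, -1.3, -1.4, -1.3, 45.0],
})

outliers = df.apply(flag_outliers_iqr)
print(outliers.sum())  # number of flagged extreme values per column
```

On the real data, the same check would confirm that longitude and latitude carry far more flagged points than the other continuous variables.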

Multiple histograms are used to show the distribution of different variables with respect to the outcome variable:

The number of good loans is greater than that of bad loans, and the distribution of total loans is positively skewed, with the majority of clients having fewer than 10 loans in total.

The amount of loan due is also positively skewed, with over 2,500 clients having around 10,000 due.

Feature Engineering

In this step, date objects are converted and new variables such as age are derived from 'birth-date'. The repayment period is also derived from 'first due date' and 'first repaid date'. The categorical variables are encoded.
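A minimal sketch of these derivations with pandas follows. The column names, dates, and branch values are hypothetical stand-ins for the real data, and the reference date for computing age is fixed arbitrarily:

```python
import pandas as pd

df = pd.DataFrame({
    "birthdate":       ["1985-04-12", "1992-11-03"],
    "firstduedate":    ["2020-02-01", "2020-03-15"],
    "firstrepaiddate": ["2020-01-28", "2020-04-02"],
    "bank_branch":     ["Nairobi", "Mombasa"],   # hypothetical values
})

# Convert the date strings into datetime objects
for col in ["birthdate", "firstduedate", "firstrepaiddate"]:
    df[col] = pd.to_datetime(df[col])

# Derive age (whole years relative to a fixed reference date)
reference = pd.Timestamp("2021-02-05")
df["age"] = (reference - df["birthdate"]).dt.days // 365

# Derive the repayment period in days (negative means repaid before the due date)
df["repayment_period"] = (df["firstrepaiddate"] - df["firstduedate"]).dt.days

# Encode the categorical variable as integer codes
df["bank_branch_enc"] = df["bank_branch"].astype("category").cat.codes
```

Integer category codes are one simple encoding choice; one-hot encoding would be an equally reasonable alternative for a non-ordinal variable like branch.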

Lastly, the latitude and longitude data are log-transformed to take care of the extreme outliers.
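Since coordinates can be negative, a plain logarithm is undefined for some values; one way to apply the transform, sketched below on toy values, is a signed log1p that compresses extreme magnitudes while preserving sign (an assumption about the implementation, not the article's exact code):

```python
import numpy as np
import pandas as pd

coords = pd.DataFrame({"longitude": [36.8, 120.0], "latitude": [-1.3, 45.0]})

# Signed log1p: log(1 + |x|) with the original sign restored
for col in ["longitude", "latitude"]:
    coords[col + "_log"] = np.sign(coords[col]) * np.log1p(np.abs(coords[col]))
```

After the transform, the gap between typical and extreme coordinates shrinks dramatically, which is exactly the effect wanted for the outliers spotted earlier.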

Cross Validation

Cross-validation is applied to compare the candidate models and select the best one. Three models are evaluated with cross-validation: Random Forest, Logistic Regression, and Decision Trees. Random Forest has the best average score of 0.92 and is selected for building the final model.
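The comparison can be sketched with scikit-learn's `cross_val_score`. Synthetic data stands in for the loan features here, and the default hyperparameters are an assumption, so the scores will differ from the 0.92 reported on the real data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for the loan features
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

models = {
    "Random Forest": RandomForestClassifier(random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
}

# 5-fold cross-validated accuracy for each candidate model
scores = {name: cross_val_score(m, X, y, cv=5).mean() for name, m in models.items()}
best = max(scores, key=scores.get)
print(best, scores)
```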

Feature Importance

Having chosen random forest as the model of choice, feature importance analysis is conducted to select the most relevant features to include in the model. Any feature with a score of 0 or less is dropped.
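A sketch of this step using the forest's impurity-based `feature_importances_` is below; again the data and feature names are synthetic placeholders, not the article's features:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=0)
X = pd.DataFrame(X, columns=[f"feat_{i}" for i in range(8)])

rf = RandomForestClassifier(random_state=0).fit(X, y)

# Impurity-based importances, one non-negative score per feature, summing to 1
importances = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)

# Drop anything with an importance of 0 or less, as in the article
kept = importances[importances > 0].index.tolist()
X_reduced = X[kept]
```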

Random Forest with Cross Validation

With the irrelevant variables dropped, cross-validation is used to measure the performance of the random forest model. An average score of 0.923 is obtained.

The final model

We use grid-search cross-validation (GridSearchCV) to obtain the best random forest model, and with it we make predictions on the test data.
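A minimal sketch of this step with scikit-learn's `GridSearchCV` follows. The parameter grid is an assumption for illustration (the article does not list the tuned hyperparameters), and synthetic data again stands in for the real features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=400, n_features=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

# Hypothetical grid -- the real project would tune whichever parameters overfit
param_grid = {"n_estimators": [100, 200], "max_depth": [None, 5, 10]}
search = GridSearchCV(RandomForestClassifier(random_state=1), param_grid, cv=5)
search.fit(X_train, y_train)

# best_estimator_ is refit on the full training data by default (refit=True)
best_model = search.best_estimator_
preds = best_model.predict(X_test)
```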

Using the resulting best model, we predict client status on a new test data set.

The final prediction data is converted to a .csv file.
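Exporting the predictions is a one-liner with pandas; the ID and label column names below are hypothetical, since the article does not show the submission format:

```python
import pandas as pd

# Hypothetical submission frame: customer IDs paired with predicted labels
submission = pd.DataFrame({
    "customer_id": ["c1", "c2"],
    "good_bad_flag": ["good", "bad"],
})
submission.to_csv("submission.csv", index=False)
```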

Conclusion

And that is a quick look at what a credit scoring project looks like. The project is not without its challenges, but with them come opportunities. For example, the decision to use cross-validation was adopted after the model overfitted.

Sources

Hyperparameter tuning

Random forest using K-fold validation

On feature importance

On Getting the best model after cross validation using gridsearchCV
