Classification Model for Loan Default Risk Prediction

Tanmoy Banerjee
Published in Analytics Vidhya · 10 min read · Dec 21, 2019
Image source: http://www.communitylinkfcu.com/secure-online-loan/

In finance, a loan is the lending of money by one or more individuals, organizations, or other entities to other individuals or organizations. The recipient (the borrower) incurs a debt and is usually liable to pay interest on that debt until it is repaid, in addition to repaying the principal amount borrowed.

In the real world, a timely loan enables the borrower to meet financial goals. At the same time, the interest associated with the loan generates revenue for the lender.

However, there is always a risk associated with lending, especially in the case of customers having insufficient or non-existent credit histories. If the borrower defaults on a loan, that loan becomes a non-performing asset (or NPA) for the lender. Any NPA hits the bottom line of the lending organization.

Therefore, every lending organization strives to assess the risk associated with a loan. Primarily, they want to assess their clients’ repayment abilities well in advance of deciding on the approval and disbursement of loans.

In this blog, we shall build a supervised classification model to predict the risk of loan default.

Now, when we talk about building a supervised classifier catering to a certain use case, for example classifying the risk of loan default, the following three things come to mind:

  1. Data appropriate to the business requirement or use case we are trying to solve
  2. A classification model that we think (or rather, assess) to be the best for our solution
  3. Optimization of the chosen model to ensure the best performance

Let us start with the data

Certainly, we need loan-related data from a lending organization. Home Credit shared their historical data on loan applications for this Kaggle competition, and a subset of that data was used to build our classifier. The icing on the cake is that Home Credit also shared a data dictionary for this data, which helped in understanding it to a great extent.

Understanding the binary classification labels

Training labels are stored in the ‘TARGET’ variable, with the following values:

‘0’: will repay the loan on time

‘1’: will have difficulty repaying the loan, or will default

We shall build a supervised binary classifier for our purpose.

A little bit of pre-processing to use this data

The following steps were performed on the data:

  1. Handling outliers
  2. Handling missing values
  3. Encoding categorical variables
  4. Feature scaling
  5. Reducing dimension of the data

Handling outliers: Here, I checked for values that are functionally impossible. These outliers were replaced with NaN.

For example, a person can’t be employed for 365243 days!

Code snippet showing handling outlier values for DAYS_EMPLOYED
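The original snippet is an image. A minimal sketch of the idea in pandas, assuming the application data sits in a hypothetical DataFrame `app_df` loaded from `application_train.csv`, could look like this:

```python
import numpy as np
import pandas as pd

# Hypothetical file and variable names; the notebook may organise this differently
app_df = pd.read_csv('application_train.csv')

# 365243 days of employment (roughly 1000 years) is functionally impossible,
# so treat it as a missing value instead of a real observation
app_df['DAYS_EMPLOYED'] = app_df['DAYS_EMPLOYED'].replace(365243, np.nan)
```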

In a similar way, some other columns were checked. Details of this work can be found here.

Handling missing values: This analysis was a bit interesting.

Percentage of missing data

As we can see, there were columns where more than 60% of the data was missing. However, none of these columns were dropped.

Instead, I imputed the missing values with the median.

Code snippet imputing missing data
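The snippet itself is an image. A rough sketch of median imputation, continuing the hypothetical `app_df` from the earlier sketch (the notebook may impute differently), might be:

```python
from sklearn.impute import SimpleImputer

# Impute each numeric column's NaNs with that column's median
numeric_cols = app_df.select_dtypes(include='number').columns
imputer = SimpleImputer(strategy='median')
app_df[numeric_cols] = imputer.fit_transform(app_df[numeric_cols])
```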

Encoding categorical variables: There were 16 categorical variables in the data set.

List of categorical variables

The following two strategies were used for encoding:

  1. Apply label encoding if the number of categories in a categorical variable is equal to 2
  2. Apply one-hot encoding if the number of categories in a categorical variable is greater than 2

As expected, one-hot encoding increased the dimensionality of the data. We shall reduce the dimensionality soon to avoid the curse of dimensionality.
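A minimal sketch of this two-pronged encoding strategy, again using the hypothetical `app_df` and not necessarily the exact code from the notebook:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Binary categorical columns: label-encode in place to 0/1
le = LabelEncoder()
for col in app_df.select_dtypes(include='object').columns:
    if app_df[col].nunique() == 2:
        app_df[col] = le.fit_transform(app_df[col].astype(str))

# Remaining multi-category columns: one-hot encode
app_df = pd.get_dummies(app_df)
```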

Feature scaling: Feature scaling ensures that no feature dominates the others in a supervised classifier simply because of its units or range. MinMaxScaler was used to scale all features in the data.

Code snippet using MinMaxScaler
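The snippet is an image; the scaling step itself is straightforward. A sketch, assuming the hypothetical `app_df` from above:

```python
from sklearn.preprocessing import MinMaxScaler

# Separate the label, then squeeze every feature into the [0, 1] range
labels = app_df['TARGET']
features = app_df.drop(columns=['TARGET'])

scaler = MinMaxScaler(feature_range=(0, 1))
scaled_features = scaler.fit_transform(features)
```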

Reducing the dimensionality of the data: After encoding, the dimensionality of the data was 240, excluding the output label. Working with data in a high-dimensional space often leads to the problem known as the curse of dimensionality.

sklearn’s PCA was used to apply principal component analysis to the data. This helped in finding the directions of maximal variance in the data. The corresponding scree plot is as follows:

Scree plot

From this scree plot, we can see that about 40 principal components capture more than 80% of the variance in the data. Accordingly, we performed PCA:

Code snippet showing applying PCA to reduce dimension
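The snippet is an image. A sketch of the PCA step, keeping the roughly 40 components suggested by the scree plot (variable names are again hypothetical):

```python
from sklearn.decomposition import PCA

# Project the scaled features onto 40 principal components
pca = PCA(n_components=40)
pca_features = pca.fit_transform(scaled_features)

print(pca_features.shape)                    # (n_samples, 40)
print(pca.explained_variance_ratio_.sum())   # fraction of variance retained, ~0.82
```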

In this way, we reduced the dimensionality by 83.33% while still retaining 82.41% of the variability in the data.

For more details, please see here.

Let us explore our data a bit more

Class imbalance problem

The first challenge we hit upon while exploring the data is the class imbalance problem.

As we can see, only about 8% of the labels in the data are 1 and about 92% are 0.

pie-chart showing distribution of ‘TARGET’
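A quick way to confirm the imbalance, assuming the `labels` Series from the earlier sketches:

```python
import matplotlib.pyplot as plt

# Relative frequency of each class in the training labels
print(labels.value_counts(normalize=True))   # roughly 0: 0.92, 1: 0.08

# Quick pie chart of the class distribution
labels.value_counts().plot.pie(autopct='%1.1f%%')
plt.show()
```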

A few more observations

We computed Pearson correlation coefficient between every variable and the target.
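As a rough sketch, assuming the encoded `app_df` from the pre-processing steps, the correlations can be computed like this:

```python
# Pearson correlation of every encoded column with the target
correlations = app_df.corr()['TARGET'].sort_values()

print(correlations.tail())   # strongest positive correlations (TARGET itself is last)
print(correlations.head())   # strongest negative correlations
```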

Top 4 positive correlations with ‘TARGET’: DAYS_EMPLOYED, REGION_RATING_CLIENT_W_CITY, REGION_RATING_CLIENT, NAME_INCOME_TYPE_Working

Top 4 negative correlations with ‘TARGET’: EXT_SOURCE_3, EXT_SOURCE_2, EXT_SOURCE_1, DAYS_BIRTH

The distribution of DAYS_EMPLOYED shows that people who are newly employed are more likely to apply for a loan.

Kernel Density Estimation (or, KDE) plots of EXT_SOURCE_1 and EXT_SOURCE_3

The plots above show that the risk of payment difficulty is high when the ratings from external sources 1 and 3 are low. Similarly, the chances of repayment are higher when the ratings from these sources are higher.

Kernel Density Estimation (or, KDE) plots of DAYS_BIRTH

The plots above show that younger applicants have a slightly higher tendency to face difficulties in repaying the loan.
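The plots themselves are images. A sketch of how such a KDE plot can be produced with seaborn, assuming DAYS_BIRTH is recorded as negative days relative to the application date (as described in the Home Credit data dictionary):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Age distribution in years, split by repayment outcome
plt.figure(figsize=(8, 5))
sns.kdeplot(app_df.loc[app_df['TARGET'] == 0, 'DAYS_BIRTH'] / -365, label='repaid (TARGET = 0)')
sns.kdeplot(app_df.loc[app_df['TARGET'] == 1, 'DAYS_BIRTH'] / -365, label='difficulty (TARGET = 1)')
plt.xlabel('Age (years)')
plt.ylabel('Density')
plt.legend()
plt.show()
```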

We are good with our loan application data. Now, it’s time to select our classification model.

We need our metric

Selection of a model requires evaluation, and evaluation requires a good metric. This is indeed important: if we optimize a model based on an incorrect metric, our model might not be suitable for the business goals.

We have a number of metrics to choose from, for example accuracy, recall, precision, F1 score, and the area under the receiver operating characteristic curve.

For classification of loan default risk, we shall choose the area under the receiver operating characteristic curve (the ROC AUC score) as our metric.

Now, the question is: why AUC? I would rather say, why not?

Let’s start with our problem context and confusion matrix:

Confusion matrix

In our context,

output = 1 means client will have payment difficulties

output = 0 means client will repay in time

ROC curve and AUC

The ROC is a probability curve, and the AUC represents the degree or measure of separability. It tells how capable the model is of distinguishing between the classes.

TPR, or true positive rate, is another name for recall or sensitivity. FPR, or false positive rate, measures how many false positives we can expect from the model.

TPR = TP / (TP + FN)

FPR = 1 - specificity = 1 - TN / (TN + FP) = FP / (TN + FP)

AUC ranges in value from 0 to 1. The higher the AUC, the better the model is at predicting who will repay the loan in time and who will have difficulties in repaying it.

AUC is an appropriate metric for our problem for the following reasons:

  1. AUC provides an aggregate measure of performance across all possible classification thresholds.
  2. AUC is scale-invariant. It measures how well predictions are ranked, rather than their absolute values.
  3. AUC is classification-threshold-invariant. It measures the quality of the model’s predictions irrespective of what classification threshold is chosen.

Let us select our classification model

We have chosen our metric. It is time to choose our model.

The following models were evaluated:

  1. LogisticRegression
  2. RandomForestClassifier
  3. GradientBoostingClassifier
  4. DecisionTreeClassifier
  5. GaussianNB
  6. XGBClassifier
  7. LGBMClassifier

Their ROC curves and AUC scores were as follows:

ROC curves and AUC scores for each of the seven models listed above
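As a sketch of how such a comparison could be run (the train/test split, hyperparameters, and plotting code in the notebook may differ):

```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

X_train, X_test, y_train, y_test = train_test_split(
    pca_features, labels, test_size=0.25, random_state=42, stratify=labels)

models = [
    LogisticRegression(max_iter=1000),
    RandomForestClassifier(n_estimators=100),
    GradientBoostingClassifier(),
    DecisionTreeClassifier(),
    GaussianNB(),
    XGBClassifier(),
    LGBMClassifier(),
]

for model in models:
    model.fit(X_train, y_train)
    # ROC AUC needs probability scores for the positive class, not hard labels
    scores = model.predict_proba(X_test)[:, 1]
    print(type(model).__name__, roc_auc_score(y_test, scores))
```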

Quite a few classification models scored well on our data set. However, LGBMClassifier was the best performer.

Therefore, we shall use the light gradient boosting model (LGBMClassifier) to classify risk for our loan application data.

Let us now optimize our chosen model

We have selected our classifier, LGBMClassifier, based on its ROC AUC score on our loan application data. However, we need to tune the model to make it perform even better.

Let us see the list of hyper-parameters that can be tuned:

List of hyper-parameters of LGBMClassifier

The objective function performs K-fold cross-validation for a given set of hyperparameter values and returns the ROC AUC score; the goal is to maximize this score.

We also use early stopping in the objective function to avoid overfitting.

Hyperparameter tuning involves selecting a set of values for these parameters. I implemented the following two strategies for this selection process:

Grid search: The grid search method tries all combinations of values in the grid, evaluates each combination of hyperparameters with the objective function, and records the set along with its score in the ‘record history’.

Random search: As the name suggests, this algorithm randomly selects the next set of hyperparameters from the grid. This is an uninformed search, which means the next selection of hyperparameters does not rely on past evaluation results. The rest of the process is the same as in the grid search method.
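A minimal sketch of such an objective function and a random search loop, using a purely illustrative hyperparameter grid (the notebook's actual grid and early-stopping setup may differ):

```python
import random
from sklearn.model_selection import cross_val_score
from lightgbm import LGBMClassifier

# Illustrative grid only; the real search space in the notebook is larger
param_grid = {
    'num_leaves': [31, 63, 127],
    'learning_rate': [0.01, 0.05, 0.1],
    'n_estimators': [100, 500, 1000],
    'subsample': [0.8, 1.0],
}

def objective(params):
    """K-fold cross-validated ROC AUC for one hyperparameter combination."""
    model = LGBMClassifier(random_state=42, **params)
    return cross_val_score(model, X_train, y_train, scoring='roc_auc', cv=5).mean()

# Random search: sample combinations uniformly at random and keep a record history
history = []
for _ in range(20):   # limited number of iterations to keep the search tractable
    params = {name: random.choice(values) for name, values in param_grid.items()}
    history.append((objective(params), params))

best_score, best_params = max(history, key=lambda item: item[0])
print(best_score, best_params)
```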

We did observe an improvement in the ROC AUC score, however marginal, after optimizing our LGBMClassifier model.

Details of this optimization can be found here.

Is our chosen model good for business?

This is a logical question for any bank or financial institution approving the budget to embark on a costly data science journey in order to build a classifier model. Let us see.

How robust is our model?

Here, the model has been tested on multiple splits of the data, and the ROC AUC score has been computed for each split.

code snippet showing running LGBMClassifier on data split by KFold
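The snippet is an image. A sketch of the idea with sklearn's KFold (the tuned hyperparameters from the optimization step would be plugged into the classifier):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score
from lightgbm import LGBMClassifier

# Evaluate the model on 5 different train/test splits
kf = KFold(n_splits=5, shuffle=True, random_state=42)
auc_scores = []

X, y = pca_features, labels.values
for train_idx, test_idx in kf.split(X):
    model = LGBMClassifier(random_state=42)   # tuned hyperparameters would go here
    model.fit(X[train_idx], y[train_idx])
    probs = model.predict_proba(X[test_idx])[:, 1]
    auc_scores.append(roc_auc_score(y[test_idx], probs))

print('mean ROC AUC:', np.mean(auc_scores))
print('variance    :', np.var(auc_scores))
```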

We checked the mean and variance of the scores to see whether the model is stable.

We observed that the final LGBMClassifier model is quite robust: tested over 5 splits of the data, it achieved a mean ROC AUC score of 0.680479 with a variance of 0.000613.

Interpretation of AUC score

An AUC score of nearly 0.7 is not that bad.

Distribution of positive and negative curves

The AUC represents the degree or measure of separability: the higher the AUC, the better the model is at predicting 0s as 0s and 1s as 1s.

When the AUC is 0.7, there is a 70% chance that the model will rank a randomly chosen defaulter higher than a randomly chosen non-defaulter. In other words, our model distinguishes a borrower at risk of defaulting from a safe one correctly about 70% of the time.

Life is not a bed of roses

This is indeed true! There were multiple challenges faced in this project.

The class imbalance problem was present in the chosen data set.

Optimization of the chosen LGBMClassifier turned out to be very computationally intensive.

Just as a hint, please see this:

Code snippet showing number of combinations for grid search
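The snippet is an image, and the real grid is far larger than the toy `param_grid` sketched earlier, but the counting itself is just a product of the grid sizes:

```python
import numpy as np

# Number of grid-search combinations = product of the number of values per hyperparameter;
# it multiplies quickly with every hyperparameter and every candidate value added
total_combinations = np.prod([len(values) for values in param_grid.values()])
print(total_combinations)
```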

This is the number of combinations for just the subset of hyperparameters I chose. Even so, it would take years for an ordinary computer to finish cross-validation for all of these combinations!

To tackle that, only a subset of the data and a subset of the hyperparameters could be used, and only a limited number of iterations could be performed. That, in turn, resulted in only a marginal improvement of the model.

Nevertheless, the chosen model and the related strategies will work on the full data if the required hardware resources can be procured.

Summary so far

The purpose was to build a classifier that can predict loan default risk based on loan application data.

We got data from Home Credit which could be used for our project.

We could evaluate various classifiers and choose the most appropriate one for our data set.

Finally, we could demonstrate multiple optimization strategies for our model.

As we have seen in this blog and the related work, the key aspects of building a successful classifier are:

  1. selecting correct data according to the purpose or problem statement,
  2. proper processing and understanding of the data,
  3. selecting the model, and,
  4. optimizing the model.
  5. And yes, please do not underestimate the need for enormous computing power

Improvements

A few aspects of the implementation in the Jupyter notebook can be improved.

Improvement 1: imputing missing values

When imputing missing values, NaNs were filled with the column median. The median is a simple and robust choice, but a better approach would be to use a supervised learning model to ‘predict’ the NaN values.

However, any imputation introduces values produced by the analysis itself, which always adds some bias, however negligible its effect. Certain models, like XGBoost, can handle NaN values in the data natively.

Improvement 2: selection of hyperparameters for LGBMClassifier

We selected a subset of hyperparameters, on which we performed grid search and random search to select the best set of values.

However, a smarter technique would be to combine random search and grid search as follows (see the sketch after this list):

  1. Use random search with a large hyperparameter grid
  2. Use the results of the random search to build a focused hyperparameter grid around the best-performing values
  3. Run grid search on this reduced hyperparameter grid, limiting the number of iterations or not, depending on the hardware resources available and the maximum time allowed
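A sketch of that two-stage strategy with sklearn's built-in search helpers, using illustrative grids and a hypothetical narrowing rule around the best values found:

```python
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from lightgbm import LGBMClassifier

# Stage 1: coarse random search over a large (illustrative) grid
wide_grid = {
    'num_leaves': [15, 31, 63, 127, 255],
    'learning_rate': [0.005, 0.01, 0.05, 0.1, 0.2],
    'n_estimators': [100, 300, 500, 1000],
}
random_search = RandomizedSearchCV(
    LGBMClassifier(random_state=42), wide_grid,
    n_iter=20, scoring='roc_auc', cv=3, random_state=42)
random_search.fit(X_train, y_train)
best = random_search.best_params_

# Stage 2: focused grid search around the best values from stage 1
narrow_grid = {
    'num_leaves': [max(2, best['num_leaves'] // 2), best['num_leaves'], best['num_leaves'] * 2],
    'learning_rate': [best['learning_rate'] / 2, best['learning_rate'], best['learning_rate'] * 2],
    'n_estimators': [best['n_estimators']],
}
grid_search = GridSearchCV(
    LGBMClassifier(random_state=42), narrow_grid, scoring='roc_auc', cv=3)
grid_search.fit(X_train, y_train)

print(grid_search.best_params_, grid_search.best_score_)
```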

References

GitHub repository containing the Jupyter notebook for the related work

Data provided by Home Credit
