
Mastering Loan Default Prediction: Tackling Imbalanced Datasets for Effective Risk Assessment

Using Synthetic Minority Oversampling Technique (SMOTE)

Utkarsh Lal · Published in Geek Culture · 10 min read · Apr 14, 2023

Problem Summary

A large portion of a retail bank’s profits comes from the interest earned on home loans. This, however, depends on clients making their loan repayments on time. If a client defaults on the loan, the bank incurs a substantial loss. It is therefore very important for banks to choose their clients carefully.

The loan approval process is complex and has multiple stages. The bank assesses the creditworthiness of the client through various methods. One of the most important is the analysis of the client’s credit history, from which several key indicators of the likelihood of loan default can be derived. Another is the computation of credit risk through parameters such as expected credit loss (ECL), probability of default (PD), loss given default (LGD) and exposure at default (EAD), which are used to assess the impact of a defaulted loan on the bank.
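For reference, these quantities are commonly combined in the standard credit-risk relationship (not specific to this dataset):

ECL = PD × LGD × EAD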

Conducting these processes manually is extremely tedious and time-consuming. Most importantly, it is subject to human error arising from unconscious bias and technical mistakes. There is therefore a growing need for automated methods that can learn the various processes involved, perform them in parallel and evolve with changing data. With the advent of Data Science and Machine Learning, it is now possible to build models that identify the key parameters in the data for predicting loan default. Because these models are driven by the data and mathematics rather than individual judgement, they greatly reduce the scope for human bias and are highly efficient. They can also be tuned to mitigate underlying biases in the data and modified as per the business requirements.

The data received by the bank is often severely imbalanced, with only a small minority of the records corresponding to loan default. This is problematic for training a machine learning model, since most models require reasonably balanced datasets to make accurate predictions. There is also a need to minimize false negative predictions by the classifier: if a client who actually defaults is misclassified as a non-defaulter, the model cannot prevent the default and the bank incurs huge losses. Therefore, there is a need for a model that is robust to these pitfalls.

The data dictionary of this particular dataset cannot be released in its entirety; only the part sufficient for understanding the methodology is disclosed. The first 5 rows of the data used in this project are given below:

First five rows of the data (source: author)

The target variable in this data is BAD, which contains the binary values 1 and 0, indicating whether the client defaults on their loan (1) or does not (0).

Upon investigating the proportion of unique values in each categorical column (as shown in the image below), it was observed that the BAD column contained 80% zeros and 20% ones.

Proportion of classes in the BAD column (source: author)
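A minimal sketch of how this check can be reproduced with pandas, assuming the raw data has been loaded into a DataFrame named df and that the categorical columns are stored with object dtype:

# df is assumed to be the raw data loaded into a pandas DataFrame
# Print class proportions for each categorical column and for the target BAD
for col in df.select_dtypes(include='object').columns.tolist() + ['BAD']:
    print(df[col].value_counts(normalize=True), "\n")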

As evident from the above image, there is a large imbalance in the dataset, due to which most machine learning algorithms would not be able to accurately learn which features distinguish a client who defaults on their loan from a client who doesn’t. Different oversampling and undersampling techniques can be used to augment such data and make it suitable for use by a machine learning model.

This project proposes an XGBoost + SMOTE model as an automated method for predicting loan default among home loan applicants. As the historical data illustrated, the ratio of defaulters to non-defaulters was extremely disproportionate, inducing an imbalance in the dataset. Consequently, the Synthetic Minority Oversampling Technique (SMOTE) was implemented to balance the dataset.

Choice of Primary Evaluation Metric

Formulae for Recall and Precision. (source: author)
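For reference, the standard definitions shown in the figure are:

Recall = TP / (TP + FN)
Precision = TP / (TP + FP)

where TP, FP and FN denote true positives, false positives and false negatives respectively.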

Out of all the clients who actually default on their loans, if some are misclassified as non-defaulting clients (false negatives), the bank stands to incur a very large loss when those clients do default. This can be mitigated by maximizing recall, which reduces the number of false negatives.

Out of all the clients who are predicted to default on their loans, some may actually be non-defaulters (false positives). Rejecting such clients costs the bank far less than an actual default would. Therefore, maximizing recall should be given higher priority than precision or accuracy, and recall was chosen as the main metric for evaluating and tuning the model. The XGBoost model trained on the SMOTE-balanced dataset achieved a recall greater than 90%, even when applied to the original imbalanced test dataset.

Solution Design

Multiple algorithms were employed and a comparative performance analysis was conducted to find the optimal model. Since the dataset was highly imbalanced, SMOTE was implemented to synthetically balance it. Based on the comparison, the XGBoost algorithm yielded the best performance with respect to recall, precision, F1-score and accuracy.

The figure below shows the performance of XGBoost with and without SMOTE across 10 folds of cross-validation. The x-axis represents the fold index and the y-axis represents the recall score.

XGBoost performance across 10-fold validation. (Source: author)
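A comparison like the one above can be produced with scikit-learn’s cross_val_score. The sketch below is illustrative (not the exact notebook code) and assumes the preprocessed features X and target Y used later in this article, with SMOTE applied inside an imblearn Pipeline so that oversampling happens only on the training folds:

from xgboost import XGBClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

# 10-fold stratified cross-validation, scored on recall for the positive class (BAD = 1)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=123)

# XGBoost on the original imbalanced data
xgb_plain = XGBClassifier(random_state=123)
recall_plain = cross_val_score(xgb_plain, X, Y, scoring='recall', cv=cv)

# XGBoost with SMOTE applied only to the training folds of each split
xgb_smote = Pipeline([
    ('smote', SMOTE(random_state=123)),
    ('xgb', XGBClassifier(random_state=123)),
])
recall_smote = cross_val_score(xgb_smote, X, Y, scoring='recall', cv=cv)

print("Recall per fold without SMOTE:", recall_plain)
print("Recall per fold with SMOTE:   ", recall_smote)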

The figure below represents the overall solution design of the model implemented in this project. The raw data contained several outliers and missing values, which were rectified in the pre-processing step (Fig. 2A). Categorical variables were converted into dummy variables to make them compatible as inputs to a machine learning model.

Overall Methodology. (Source: author)
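As a minimal sketch of the dummy-variable step mentioned above (assuming pandas; JOB appears in this dataset, while REASON is included only as an illustrative example of another categorical column):

import pandas as pd

# Categorical columns to one-hot encode; JOB is present in the data,
# REASON is assumed here for illustration
categorical_cols = ['REASON', 'JOB']

# Create dummy variables such as JOB_Office, JOB_Sales, ...
df = pd.get_dummies(df, columns=categorical_cols)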

The bank requires a model that yields robust performance even on severely imbalanced datasets, since loan defaults make up only a small minority of the data the bank receives. There is also a need to minimize the number of false negatives, as mentioned in the Problem Summary.

The model implemented in this project is robust to imbalanced datasets, yielding 90% recall on an unseen imbalanced dataset (as described in the next section). It also maximizes recall, thereby reducing the loss due to false negatives, and meets the bank’s requirements for accurate prediction of loan default.

For the detailed EDA and preprocessing steps, refer to the Jupyter notebook available at the GitHub link.

One of the main steps in preprocessing is outlier removal. Boosting algorithms such as XGBoost can be sensitive to outliers, so it is good practice to treat them beforehand. The code for this step is presented below:

import numpy as np

def treat_outliers(df, col):
    '''
    Treats outliers in a numerical variable using the IQR method.
    df: pandas DataFrame
    col: str, name of the numerical column
    '''
    Q1 = np.nanquantile(df[col], 0.25)  # 25th percentile
    Q3 = np.nanquantile(df[col], 0.75)  # 75th percentile
    IQR = Q3 - Q1                       # interquartile range
    Lower_Whisker = Q1 - 1.5 * IQR      # define lower whisker
    Upper_Whisker = Q3 + 1.5 * IQR      # define upper whisker
    # Values below Lower_Whisker are clipped to Lower_Whisker,
    # and values above Upper_Whisker are clipped to Upper_Whisker
    df[col] = np.clip(df[col], Lower_Whisker, Upper_Whisker)
    return df

def treat_outliers_all(df, col_list):
    '''
    Treats outliers in all numerical variables.
    df: pandas DataFrame
    col_list: list of numerical column names
    '''
    for c in col_list:
        df = treat_outliers(df, c)
    return df
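For example, the helper might then be applied to every numerical column except the target, along the lines of this sketch:

# Treat outliers in all numerical columns except the target BAD
numerical_cols = [c for c in df.select_dtypes(include='number').columns if c != 'BAD']
df = treat_outliers_all(df, numerical_cols)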

XGBoost with SMOTE (code)

Let’s get started with balancing the dataset using SMOTE. In the code snippet below, the imblearn Python package is used to implement SMOTE.

from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn import metrics
from sklearn.metrics import confusion_matrix, classification_report,accuracy_score,precision_score,recall_score,f1_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from imblearn.over_sampling import SMOTE

sm = SMOTE(random_state = 123)  # oversample the minority class (BAD = 1) to balance the data
X_res, Y_res = sm.fit_resample(X, Y)  # X and Y are the preprocessed features and target
X_train_res, X_test_res, y_train_res, y_test_res = train_test_split(X_res, Y_res, test_size = 0.30, random_state = 1)


print("Shape of the training set: ", X_train_res.shape)

print("Shape of the test set: ", X_test_res.shape)

print("Percentage of classes in the training set:")

print(y_train_res.value_counts(normalize = True))

print("Percentage of classes in the test set:")

print(y_test_res.value_counts(normalize = True))
Output of the above code. (source: author)

As seen in the above output, the class proportions in the training and test sets are balanced (roughly 50/50) after applying SMOTE.

Now we will create a function, metrics_score, for evaluating the machine learning models trained on the SMOTE-balanced dataset (see the code snippet below). It reports model performance using scikit-learn’s classification_report and plots a confusion matrix. After that, GridSearchCV is employed to find the best estimator over the specified parameter grid, using 10-fold cross-validation. As discussed earlier, recall is the most important metric for classifying loan defaulters, so recall_score is specified as the scoring function to be maximized.

import matplotlib.pyplot as plt
import seaborn as sns

# Creating metric function
def metrics_score(actual, predicted):
    print(classification_report(actual, predicted))
    cm = confusion_matrix(actual, predicted)
    plt.figure(figsize=(8, 5))
    sns.heatmap(cm, annot=True, fmt='.2f',
                xticklabels=['Not Default', 'Default'],
                yticklabels=['Not Default', 'Default'])
    plt.ylabel('Actual')
    plt.xlabel('Predicted')
    plt.show()


# Training XGBoost across 10 folds with GridSearchCV
xgb_estimator_tuned_smote = XGBClassifier(booster = "gbtree", random_state = 123, n_jobs=4)

# Grid of parameters to choose from
parameters = {
    'max_depth': range(3, 10, 2),
    'min_child_weight': range(1, 6, 2),
    'learning_rate': [0.30000012],
    'max_bin': [256],
    'n_estimators': [100, 200]
}

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(recall_score, pos_label = 1)

# Run the grid search
grid_obj = GridSearchCV(xgb_estimator_tuned_smote, parameters, scoring = scorer, cv = 10)

#fit the GridSearch on train dataset
grid_obj = grid_obj.fit(X_train_res, y_train_res)

# Set the clf to the best combination of parameters
xgb_estimator_tuned_smote = grid_obj.best_estimator_

# Fit the best algorithm to the data.
xgb_estimator_tuned_smote.fit(X_train_res, y_train_res)

# evaluating the model on the testing set
y_pred_test = xgb_estimator_tuned_smote.predict(X_test_res)
metrics_score(y_test_res, y_pred_test)
Output of the above code. (source: author)

As observed in the above output, the model gives 94% recall for BAD = 1 (i.e. clients who default on their loans). However, this recall is measured on the balanced test set created after applying SMOTE. In the real world, the new and unseen data that banks encounter tends to be severely imbalanced, so it is important to evaluate how the XGBoost model trained on balanced data performs on unseen imbalanced data. The code below tests exactly that.

print("Percentage of classes in the smote test set:")
print(y_test_res.value_counts(normalize = True))

print("\nPercentage of classes in the original imbalanced test set:")
print(y_test.value_counts(normalize = True))
output of the above code. (source: author)

As seen in the output above, the original data is imbalanced with ~80% being non-defaulters and only ~20% constituting loan defaulters.


# Predicting values from the imbalanced X_test (i.e X_test without SMOTE)
y_pred_test = xgb_estimator_tuned_smote.predict(X_test)
metrics_score(y_test, y_pred_test)
output of the above code. (source: author)

The model gives 90% recall for BAD = 1 on the original imbalanced test set, which is fairly decent.

After obtaining and evaluating the results of the supervised classification, the feature importance array was derived from the XGBoost model to find the most important features identified by the classifier. Fig 6 presents the feature importance bar plot derived from the XGBoost classifier.

Feature Importance of the XGBoost classifier (source: author)
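A bar plot like the one above can be generated from the fitted model. The sketch below is illustrative and assumes that X_train_res is a DataFrame whose columns carry the feature names:

import pandas as pd
import matplotlib.pyplot as plt

# Pair the model's importance scores with the feature names and sort them
importances = pd.Series(
    xgb_estimator_tuned_smote.feature_importances_,
    index=X_train_res.columns
).sort_values()

importances.plot(kind='barh', figsize=(10, 8))
plt.xlabel('Feature importance')
plt.show()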
  • As represented in the above figure, among the continuous variables in the dataset, NINQ (number of recent credit inquiries) and DEBTINC (debt-to-income ratio) were two of the most important features identified by the classifier, followed by CLAGE (age of the oldest credit line). Among the categorical variables, the client’s employment also showed positive feature importance, as illustrated by the ‘JOB_’ dummy variables in the figure.
All metrics of all classifiers implemented in this project in decreasing order of Test Recall (source: author)
  • The above figure illustrates the performance of all models implemented in this project, in decreasing order of recall on the test dataset.

Limitations and Recommendations for Further Analysis

There is a significant trade-off between interpretability and model performance. Although XGBoost yields very high performance, it is much less interpretable than the tree models implemented in this project. Model interpretability provides insight into the relationship between the inputs and the outputs. The Decision Tree classifier implemented in this project can output a decision tree whose decision rules can be read and followed step by step by any human being. XGBoost, being an ensemble of many boosted trees, offers no single equivalent set of rules. Therefore, the main limitation of the proposed model is its reduced interpretability.
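To illustrate the contrast, the rules of a single decision tree can be printed directly with scikit-learn. The sketch below is illustrative and uses a hypothetical shallow tree fitted on the same training data:

from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical shallow decision tree fitted on the SMOTE-balanced training data
dt_estimator = DecisionTreeClassifier(max_depth=3, random_state=123)
dt_estimator.fit(X_train_res, y_train_res)

# Human-readable decision rules, one branch per line
print(export_text(dt_estimator, feature_names=list(X_train_res.columns)))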

Some recommendations for further analysis are mentioned below:

  1. Instead of using a synthetic oversampling method like SMOTE, an under-sampling approach can also be taken to balance the dataset. Methods like Near-Miss and Condensed Nearest Neighbour can be used for this (see the sketch after this list). The main challenge with under-sampling is avoiding the loss of relevant data, so the algorithm must be chosen carefully.
  2. Deep learning algorithms could be used to potentially improve performance. The main challenge here would be implementing a deep learning model with high performance but low complexity: a model with many hidden layers and high time complexity would not be as efficient as the machine learning models designed in this project. Deep learning is therefore only feasible if the added complexity remains manageable.
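As a rough illustration of recommendation 1, imblearn also provides under-sampling methods such as Near-Miss. The sketch below assumes the same preprocessed X and Y used earlier:

from collections import Counter
from imblearn.under_sampling import NearMiss

# Under-sample the majority class (non-defaulters) instead of over-sampling the minority
nm = NearMiss(version=1)
X_under, y_under = nm.fit_resample(X, Y)

print("Class counts after Near-Miss under-sampling:", Counter(y_under))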
