Credit Risk Modeling Handbook

Credit Scoring: Going Beyond the Modeling Basics (Part 7)

Advanced techniques for model validation and dealing with unbalanced data in credit risk modeling.

Natasha Mashanovich
Published in DataDrivenInvestor
May 17, 2023


Image by juicy_fish on Freepik

To satisfy the main hallmarks of scientific model development — rigor, testability, replicability, precision, and confidence — it is important to consider model validation and how to deal with unbalanced data. This article outlines an advanced validation framework that can be utilized to satisfy those hallmarks and provides a brief overview of methodologies frequently applied when dealing with unbalanced data.

Validation Framework

Too good to be true.

Any predictive model that fits the data too well should be treated with suspicion. When building complex, high-performance predictive models, data scientists often introduce a modeling error referred to as overfitting. Overfitting occurs when a model fits the training dataset almost perfectly but fails to generalize to unseen data; it is a fundamental issue and the biggest threat to predictive models. The consequence is poor prediction on new (unseen, holdout) datasets.

ROC chart indicating model overfitting, created with Altair Analytics Workbench (image by author)

A number of validation frameworks exist for the purpose of detecting and minimizing overfitting. They differ in terms of algorithm complexity, computational power, and robustness. Two simple and common techniques are:

Simple validation — random or stratified partitioning into train and test partitions.

Nested holdout validation — random or stratified partitioning into train, validation, and test partitions. Candidate models are trained on the training partition and compared on the validation partition, and the champion model is then validated on the unseen test partition.
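As a rough sketch of nested holdout partitioning (the synthetic data and split ratios below are illustrative, not a recommendation), two stratified calls to scikit-learn's train_test_split produce the three partitions:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Toy data standing in for a scored credit portfolio (hypothetical).
X, y = make_classification(n_samples=10_000, weights=[0.9], random_state=42)

# 60/20/20 stratified split: first carve out the test partition,
# then split the remainder into train and validation partitions.
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_tmp, y_tmp, test_size=0.25, stratify=y_tmp, random_state=42)
```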

The main drawback of these two approaches is that the model fitted to a subset of the available data could still be subject to overfitting. This is particularly true with datasets containing a small number of observations.

Another problem with simple validation arises when adjusting model parameters and repeatedly testing model performance on the same test sample. This leads to data leakage, as the model effectively “learns” from the test sample, meaning the test sample is no longer a true holdout sample and overfitting may become a problem. Nested holdout validation can resolve the problem to a certain extent; however, this approach requires a large amount of data, which can be an issue.

Bootstrapping and cross-validation are two validation frameworks specifically designed to overcome problems with overfitting and more thoroughly capture sources of variation.

Bootstrapping is sampling with replacement. The standard bootstrap validation process randomly draws M different samples from the original data, each of the same size as the original. The model is fitted on each bootstrap sample and subsequently tested on the entire dataset to measure performance.
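A minimal sketch of this bootstrap validation loop on synthetic data; the number of bootstrap samples, the logistic regression model, and the AUC metric are all arbitrary choices for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.utils import resample

X, y = make_classification(n_samples=5_000, weights=[0.9], random_state=0)

M = 50  # number of bootstrap samples (arbitrary)
aucs = []
for m in range(M):
    # Sample with replacement, same size as the original data.
    X_boot, y_boot = resample(X, y, replace=True, n_samples=len(y), random_state=m)
    model = LogisticRegression(max_iter=1000).fit(X_boot, y_boot)
    # Test each bootstrap model on the entire original dataset.
    aucs.append(roc_auc_score(y, model.predict_proba(X)[:, 1]))

print(f"mean AUC {np.mean(aucs):.3f} +/- {np.std(aucs):.3f}")
```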

Cross-validation (CV) makes use of the entire dataset by systematically swapping the samples used for training and testing. Cross-validation has many forms, including:

  • K-fold — partitioning the population into K equal-sized folds and iterating K times, each time training on K-1 folds and testing on the remaining fold
  • Leave-one-out
  • Stratified
  • Nested cross-validation

Nested cross-validation is required if we want to validate the model in addition to parameter tuning and/or variable selection. It consists of an inner and an outer CV. The inner CV is used for either parameter tuning or variable selection while the outer CV is used for model validation.
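In scikit-learn terms, nested CV can be sketched by wrapping a grid search (the inner CV, here tuning a purely illustrative regularization grid) inside cross_val_score (the outer CV):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=5_000, weights=[0.9], random_state=0)

# Inner CV: parameter tuning (the grid values are illustrative only).
inner = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=StratifiedKFold(n_splits=5),
    scoring="roc_auc")

# Outer CV: unbiased performance estimate of the whole tuning procedure.
outer_scores = cross_val_score(inner, X, y, cv=StratifiedKFold(n_splits=5),
                               scoring="roc_auc")
print(outer_scores.mean())
```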

With some modifications, both bootstrapping and cross-validation can simultaneously achieve three different objectives:

  1. model validation
  2. variable selection, and
  3. parameter tuning, using grid search or random search, for example.

Grid-search and CV for validation, selection, and tuning (image by author)
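Extending the previous sketch, variable selection can be folded into the same procedure by placing a selector inside a pipeline, so the outer CV validates the complete selection-plus-tuning process; the selector, grid values, and model below are illustrative assumptions rather than a prescribed setup:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=5_000, n_features=30, weights=[0.9], random_state=0)

pipe = Pipeline([("select", SelectKBest()),                    # variable selection
                 ("clf", LogisticRegression(max_iter=1000))])  # model

# Grid search tunes both the number of selected variables and the
# regularization strength (objectives 2 and 3).
grid = GridSearchCV(pipe,
                    param_grid={"select__k": [5, 10, 20],
                                "clf__C": [0.1, 1.0, 10.0]},
                    cv=5, scoring="roc_auc")

# Outer CV validates the full selection + tuning procedure (objective 1).
scores = cross_val_score(grid, X, y, cv=StratifiedKFold(n_splits=5), scoring="roc_auc")
print(scores.mean())
```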

Modeling Unbalanced Data

When good isn’t good enough.

Model accuracy, defined as the ratio of correct predictions to the total number of cases, is a typical measure used to assess model performance. However, assessing model performance solely by accuracy can be misleading, as we may run into the accuracy paradox. As an example, assume we have an unbalanced training dataset in which the target population we want to predict, fraud or catastrophic events, makes up only 1% of cases. Even without a predictive model, simply guessing “no fraud” or “no catastrophe” for every case achieves 99% accuracy. Impressive! However, such a strategy has a 100% miss rate, meaning we still need a predictive model to either reduce the miss rate (false negatives, “type II errors”) or reduce false alarms (false positives, “type I errors”).
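A tiny worked example of the paradox on made-up labels: always predicting the majority class scores 99% accuracy yet detects nothing:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

y_true = np.array([1] * 10 + [0] * 990)  # 1% minority class (fraud)
y_pred = np.zeros_like(y_true)           # always predict "no fraud"

print(accuracy_score(y_true, y_pred))    # 0.99 -- looks impressive
print(recall_score(y_true, y_pred))      # 0.0  -- i.e. a 100% miss rate
```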

The right performance measure depends on business objectives. Some cases require minimizing the miss rate, while others are more focused on minimizing false alarms, especially if customer satisfaction is the primary aim. Based on the overall objective, data scientists need to identify the best methodology to build and evaluate a model using unbalanced data.

Unbalanced data may be a problem when using machine-learning algorithms as these datasets could have insufficient information about the minority class. This is because algorithms based on minimizing the overall error are biased towards the majority class, neglecting the contribution of the cases we are more interested in.

Two general techniques used to combat unbalanced data modeling issues are sampling and ensemble modeling.

Sampling methods are further classified into undersampling and oversampling techniques. Undersampling involves removing cases from the majority class and keeping the complete minority population. Oversampling is the process of replicating the minority class to balance the data. Both aim to create balanced training data so the learning algorithms can produce less biased results. Both techniques have potential disadvantages: undersampling may lead to information loss while oversampling can lead to overfitting.
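As an illustration on a hypothetical training frame with a binary “bad” flag, both techniques can be expressed in a few lines of pandas:

```python
import pandas as pd

# Hypothetical training frame with a binary "bad" flag (1 = minority class).
df = pd.DataFrame({"bad": [1] * 50 + [0] * 950,
                   "feature": range(1000)})

bads = df[df["bad"] == 1]
goods = df[df["bad"] == 0]

# Undersampling: keep all bads, sample goods down to the same size.
under = pd.concat([bads, goods.sample(n=len(bads), random_state=1)])

# Oversampling: keep all goods, replicate bads up to the goods' size.
over = pd.concat([goods, bads.sample(n=len(goods), replace=True, random_state=1)])
```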

A popular modification of the oversampling technique, developed to minimize overfitting, is the synthetic minority oversampling technique (SMOTE), which creates synthetic minority cases by interpolating between existing minority cases and their nearest neighbors, typically found with the k-nearest-neighbors (KNN) algorithm. As a rule of thumb, if a large number of observations is available, use undersampling; otherwise, oversampling is the preferred method.
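If the imbalanced-learn package is available, SMOTE takes a couple of lines; the neighbor count below is simply its default:

```python
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=5_000, weights=[0.99], random_state=0)

# SMOTE synthesizes new minority cases by interpolating between a minority
# observation and one of its k nearest minority neighbors (k=5 is the default).
X_res, y_res = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
```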

The steps below outline a simple development workflow using the undersampling technique; a code sketch follows the list.

  1. Create a balanced training view by selecting all “bad” cases and a random sample of “good” cases in proportion, for example, 35% and 65%, respectively. If there is a sufficient number of “bad” cases, undersample from an unbalanced training partition; otherwise, use the entire population for undersampling.
  2. Using the standard modeling steps (i) select the best set of model predictors, followed by (ii) fine classing, (iii) coarse classing with optimal binning, (iv) weight of evidence or dummy transformations, and (v) a stepwise logistic regression model.
  3. If not created in step 1, partition the full unbalanced dataset into train and test partitions, for example, 70% for the train partition, and 30% for the test partition. Keep the ratio of the minority class the same in both partitions.
  4. Train the model with the model variables selected by the stepwise method in step 2-v using the train partition.
  5. Validate the model on the test partition.
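A compressed sketch of these steps on synthetic data, with fine/coarse classing and WoE transformations omitted for brevity and scikit-learn's SequentialFeatureSelector standing in for stepwise selection (which scikit-learn does not provide):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20_000, n_features=25, weights=[0.97], random_state=0)

# Step 3: stratified 70/30 split of the full, unbalanced data.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, stratify=y, random_state=0)

# Step 1: balanced training view -- all bads plus a sample of goods (~35/65).
bad_idx = np.where(y_tr == 1)[0]
good_idx = np.random.RandomState(0).choice(np.where(y_tr == 0)[0],
                                           size=int(len(bad_idx) * 65 / 35),
                                           replace=False)
bal = np.concatenate([bad_idx, good_idx])

# Step 2: variable selection on the balanced view (stand-in for stepwise).
sfs = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                n_features_to_select=8).fit(X_tr[bal], y_tr[bal])
cols = sfs.get_support()

# Steps 4-5: train on the unbalanced train partition, validate on the test partition.
model = LogisticRegression(max_iter=1000).fit(X_tr[:, cols], y_tr)
print(roc_auc_score(y_te, model.predict_proba(X_te[:, cols])[:, 1]))
```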

Ensemble Modeling

Ensemble modeling is an alternative approach to handling unbalanced data. Bagging and boosting are typical techniques used to build stronger predictors and overcome overfitting without undersampling or oversampling. Bagging (bootstrap aggregation) creates different bootstrap samples with replacement, trains a model on each bootstrap, and averages the prediction results. Boosting gradually builds a stronger predictor by learning, in each iteration, from the errors made in the previous one.
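Both techniques are available off the shelf; a minimal sketch using scikit-learn's implementations, with arbitrary ensemble sizes:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=5_000, weights=[0.95], random_state=0)

# Bagging: average many trees fitted on bootstrap samples.
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=0)

# Boosting: build the ensemble sequentially, each stage correcting the last.
boost = GradientBoostingClassifier(n_estimators=100, random_state=0)

for name, model in [("bagging", bag), ("boosting", boost)]:
    print(name, cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean())
```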

As discussed above, accuracy is not the preferred metric for unbalanced data, as it counts only the overall proportion of correct predictions. By considering correct and incorrect results for each class, we gain more insight into the classification model. Useful performance measures here are sensitivity (also known as recall, hit rate, probability of detection, or true positive rate), specificity (true negative rate), and precision.
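All three can be read straight off the confusion matrix; a short sketch on made-up predictions:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)   # recall / hit rate / true positive rate
specificity = tn / (tn + fp)   # true negative rate
precision   = tp / (tp + fp)   # share of flagged cases that are truly "bad"
print(sensitivity, specificity, precision)
```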

In addition to these three scalar metrics, another popular measure that dominates across the industry is the ROC curve. The ROC curve is invariant to the proportion of “bad” vs. “good” cases, an important property when working with unbalanced data. Where there is a sufficient number of “bad” cases, rather than using unbalanced-data methods, the standard modeling methodology can be applied and the resulting model assessed using the ROC curve.
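With scikit-learn, the ROC curve and its area take one call each on held-out scores; the data and model below are placeholders:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5_000, weights=[0.95], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

scores = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
fpr, tpr, _ = roc_curve(y_te, scores)   # points of the ROC curve
print(roc_auc_score(y_te, scores))      # area under the curve
```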

