Model Validation: A Science — and an Art

Learnings from model validation in credit risk for other domains

Karabirupa Dutta
GAMMA — Part of BCG X
8 min read · May 7, 2020


When data scientists consider the topic of model validation, a few names typically come to mind, such as Adjusted R-Square, Accuracy, Precision, Recall, Concordance/Discordance, GINI and Kolmogorov-Smirnov (KS). In this article, I will address the common pitfalls of GINI, a widely used measure for assessing the performance of binary classification models. I will conclude with a discussion of why data scientists should adopt a holistic view of model validation instead of relying on a single metric.

What is GINI?

GINI is calculated from the AUC-ROC curve, a graph on which the cumulative percentage of the binary target variable (sorted from 1 to 0) is plotted against the cumulative percentage of the corresponding population. The business world is replete with such binary target variables: patients with communicable diseases who need contact tracing, churners in telco, customers interested in a loyalty program, bank-loan defaulters, fraudulent transactions, and leads for cross-selling.
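For reference, GINI relates directly to the area under the ROC curve: GINI = 2 × AUC − 1. A minimal sketch, assuming scikit-learn and hypothetical arrays `y_true` (0/1 labels) and `y_score` (predicted probabilities of default):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical example data: 1 = "Bad" (defaulter), 0 = "Good"
y_true = np.array([0, 0, 1, 0, 1, 0, 0, 1, 0, 0])
y_score = np.array([0.10, 0.20, 0.85, 0.05, 0.60, 0.15, 0.30, 0.70, 0.25, 0.40])

auc = roc_auc_score(y_true, y_score)   # area under the ROC curve
gini = 2 * auc - 1                     # GINI coefficient

print(f"AUC = {auc:.2f}, GINI = {gini:.2f}")
```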

The example below is from a BCG engagement with a South-East Asian bank. The purpose of the engagement was to help the bank develop credit scorecards for its portfolio of small-to-medium enterprise (SME) loan applications. Due to sub-optimal data collection processes and local regulations, most of the applicants had thin bureau files and non-audited financials. To establish a solid data foundation, the first six months of our BCG team’s work with the bank were invested in data extraction, data collection and manual digitization of reports. (Even with the use of sophisticated OCR on the available scanned PDFs, only 45–50% of the content of these documents was retrieved, despite painstaking correction of human error and elimination of corrupted PDFs.) As a result, our team had to start modelling with a very small sample (N<2K) of bureau, financials and application data. The major challenges we faced were:

1. The small sample size

2. The limited overlap between data sources

3. The low absolute number of “Bad” events available to train the models

Exhibit 1: Venn diagram showing population overlap between data sources

Modelling Methodology

Given the lack of reliable data, our team built customized or “bespoke” models to tap into all available data sources and minimize missing value imputation. We developed the models on the entire available population, using k-fold cross-validation. The final model achieved a mean GINI of approximately 40% in in-sample validation. These results were not as good as we might have hoped, but were sufficient given the situation we faced.
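As an illustration of this validation setup (not the actual engagement code), here is a minimal sketch of k-fold cross-validation reporting mean GINI; the synthetic data, the logistic regression model and the 3% event rate are hypothetical stand-ins:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

# Hypothetical stand-in for the bureau / financials / application features
X, y = make_classification(n_samples=2000, n_features=15, weights=[0.97], random_state=0)

ginis = []
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X, y):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    pd_scores = model.predict_proba(X[test_idx])[:, 1]   # predicted probability of default
    ginis.append(2 * roc_auc_score(y[test_idx], pd_scores) - 1)

print(f"Mean GINI across folds: {np.mean(ginis):.1%}")
```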

Exhibit 2: Ensemble model results

Out-of-Time Validation

The task of out-of-time (OOT) validation was given to the client’s internal validation team. When the OOT GINI dropped by approximately 50% compared to development, the internal team concluded that the models were ineffective. Given that we had anticipated an OOT GINI in the range of 35–45%, we did not expect this result. So we did some digging.

What Went Wrong: The Math Behind GINI

GINI is essentially a rank-ordering metric that measures how well a model or scorecard separates “default” events from non-defaults. The typical approach is to divide the population into 10 bins (deciles) after sorting by descending probability of default (PD), and then calculate the cumulative percentage of “Good” (0) and “Bad” (1) captured as we move from rating classes 1 to 10.

Exhibit 3: AUC-ROC-GINI

In the example below, we score 1,500 customers (N) with a Bad rate of 3% (in other words, there are 45 “Bad” customers). Without any model, applying the base default rate alone, we would expect 4–5 defaulters in every 150 customers. When we train the model, we can calculate that the capture rate is 60% in the top 3 buckets and that the GINI is 42%.

Exhibit 4: How propensity scores work

Here are the three steps involved in the calculation:

Step 1: Sort by descending order of PD and make deciles/bins.

Step 2: Calculate the number of Good and Bad in each bin.

Step 3: Calculate the cumulative percentages of Good and Bad for each bin, then the AUC (and GINI) and KS (the maximum difference between the cumulative percentages of Bad and Good across bins).
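A minimal sketch of these three steps in pandas; the column names `pd_score` and `bad` and the DataFrame `scored_df` are hypothetical:

```python
import pandas as pd

def decile_table(df, pd_col="pd_score", bad_col="bad", n_bins=10):
    # Step 1: sort by descending PD and cut into deciles/bins
    df = df.sort_values(pd_col, ascending=False).reset_index(drop=True)
    df["bin"] = pd.qcut(df.index, n_bins, labels=list(range(1, n_bins + 1)))

    # Step 2: count Good and Bad in each bin
    table = df.groupby("bin", observed=True).agg(
        n=(bad_col, "size"), bad=(bad_col, "sum"))
    table["good"] = table["n"] - table["bad"]

    # Step 3: cumulative percentages and per-bin KS
    table["cum_bad_pct"] = table["bad"].cumsum() / table["bad"].sum()
    table["cum_good_pct"] = table["good"].cumsum() / table["good"].sum()
    table["ks"] = table["cum_bad_pct"] - table["cum_good_pct"]
    return table

# Usage: decile_table(scored_df)["ks"].max() gives the KS statistic;
# cum_bad_pct in the top 3 bins is the capture rate quoted in the text.
```

The AUC and GINI themselves can be computed from the raw scores with `roc_auc_score`, as in the earlier sketch.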

Table 1: Development (Train) model results

In the top 3 deciles, 27 of the total 45 Bad customers are captured, i.e., 60% of the defaulters. When you do the math, the model GINI comes to 42%.

Fig 1: GINI curve for development model

For OOT validation, there are 1,200 customers to be scored using the same model. Once we run the model and sort by descending PD, the results are:

Table 2: Validation (Test) model results
Fig 2: GINI curve for validation model

One might immediately notice that the GINI drops by almost 50% and hence be inclined to reject the model, as our bank client initially did. Interestingly, if the model were doing proper justice to the last 2–3 buckets, we would expect no or very few defaulters there. Instead, we see default rates in the last 2 deciles similar to the population default rate (3–4 defaulters for every 120 randomly selected customers). This could be due to a number of reasons:

1. The model is not working well with new data.

2. Defaulters have counterintuitive parameters, perhaps a combination of low utilization and high past-delinquency rates.

3. Only partial information is available for some features, causing the model to generate probabilities from imputed or treated values (for example, missing-value imputation or Winsorization).

4. The variables may show new trends, leading to fluctuations in the population stability index (PSI); see the sketch below.
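A minimal PSI sketch, assuming two hypothetical arrays holding the same score or variable from the development (expected) and OOT (actual) samples:

```python
import numpy as np

def psi(expected, actual, n_bins=10):
    """Population stability index between a development (expected)
    and a validation (actual) distribution of a score or variable."""
    # Bin edges taken from the development sample
    edges = np.percentile(expected, np.linspace(0, 100, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf

    exp_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    act_pct = np.histogram(actual, bins=edges)[0] / len(actual)

    # Avoid log(0) / division by zero for empty bins
    exp_pct = np.clip(exp_pct, 1e-6, None)
    act_pct = np.clip(act_pct, 1e-6, None)
    return np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct))

# Common rule of thumb: PSI < 0.1 is stable, 0.1-0.25 a moderate shift,
# > 0.25 a significant shift warranting investigation.
```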

This is, of course, a model and not a crystal ball. As such, we cannot predict each and every default correctly. On further investigation we found that of the 8 defaulters classified into the last 2 buckets (the lowest risk classes), 6 either had incomplete data or an error in the calculation of a transformed variable. If we re-examine the metrics after removing these 6 cases from the evaluation population, we would have only 2 defaulters in the last 2 bins, with a revised sample size of N = 1,200 − 6 = 1,194 and a Bad rate of 30/1,194 = 2.5%. In this case, the results would be:

Table 3: Revised Validation model results
Fig 3: GINI curve for revised validation model

Following this re-examination, the GINI is back above 40%. One must ask: is it fair to conclude that the model passed the validation test when we discarded only 0.5% of the test dataset?

What We Can Infer

This question brings up my final point: clearly, GINI should not be the only metric on which you evaluate model performance. In general, data scientists understand that while certain metrics may be optimal in some scenarios, they may be suboptimal in others. For example:

1. When dealing with imbalanced datasets, such as credit card fraud or patients with chronic diseases, where the majority class (0) makes up ~99% of the population, “accuracy” is a naïve measure: even a blunt model classifying everyone as “0” would have 99% accuracy. False negatives (FN), especially when they concern someone’s health, can have serious consequences, so the modeller should aim to increase “recall.”

2. Consider another example, this one involving a test in which predicted criminals are punished. Here, our null and alternative hypotheses are H0: the person is innocent, and HA: the person committed the crime. Given that punishing an innocent person (type-I error; rejecting the null hypothesis when it is true) is a far worse outcome than setting a criminal free (type-II error; accepting the null hypothesis when it is false), the modeller must focus on increasing “precision”, that is, reducing false positives (FP).

These two errors trade off against each other, so one should fix a threshold for the type-I error (≤ α, the level of significance or size of the test) and then choose a test that minimizes the type-II error (β, where power = 1 − β).
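As an illustrative sketch (not part of the original engagement), one way to operationalize this with a classifier is to pick the score threshold from the ROC curve so that the false-positive rate stays within a chosen α, and then read off the recall (power) that remains; the arrays `y_true` and `y_score` are hypothetical:

```python
import numpy as np
from sklearn.metrics import roc_curve

def threshold_for_alpha(y_true, y_score, alpha=0.05):
    """Among thresholds whose false-positive rate (type-I error) is at or
    below alpha, pick the one with the highest true-positive rate
    (power = 1 - beta) and return (threshold, power, fpr)."""
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    ok = fpr <= alpha                       # admissible operating points
    best = np.argmax(tpr[ok])               # highest power among them
    return thresholds[ok][best], tpr[ok][best], fpr[ok][best]

# Hypothetical usage:
# thr, power, fpr = threshold_for_alpha(y_true, y_score, alpha=0.05)
```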

Table 4: Confusion matrix

Various measures at play (non-exhaustive):

** TPR: True positive rate; FPR: False positive rate; TNR: True negative rate; FNR: False negative rate
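A minimal sketch of how these rates fall out of the confusion matrix, assuming scikit-learn and hypothetical 0/1 arrays `y_true` and `y_pred`:

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]   # hypothetical labels
y_pred = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]   # hypothetical predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

tpr = recall = tp / (tp + fn)     # TPR / recall / sensitivity
fpr = fp / (fp + tn)              # FPR (false-alarm rate)
tnr = tn / (tn + fp)              # TNR / specificity
fnr = fn / (fn + tp)              # FNR (missed defaults)
precision = tp / (tp + fp)
accuracy = (tp + tn) / (tp + tn + fp + fn)
```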

As we saw when we revised the default model, we could obtain a higher or lower GINI (in this example, the metric roughly doubled) simply by ignoring six customers. Ideally, we should lay out our evaluation criteria at the outset and define what we mean by success, for example, the model capturing at least a certain percentage of defaulters in the top deciles, or a chosen performance metric exceeding an agreed threshold.

Even given the low GINI, when we look at the entire 1,200 OOT population the model still holds value because:

1. Rank ordering holds.

2. There are comparable capture rates in the top deciles for the train and test samples (comparing Tables 1 and 2, the capture rate is still 60% for the top 3 deciles).

3. Recall and risk appetite are balanced.

4. Precision and opportunity cost are balanced.

Consider the Specific Scenario, But Keep Moving Forward

This points to the fact that it is not always necessary to discard results based on a single metric. In the final analysis, having a model in place is far better than having no model at all, and far better than going back to a legacy “Application checklist”. GINI is excellent at detecting rank ordering and giving an indication of performance, but it, like all measures, has its limitations. When it comes to model validation, data scientists need to adopt a holistic view. Finding the right metric is a matter of taking into consideration all scenarios in play (sample size, event rate, fill rate of variables, data availability, reliability, macroeconomic conditions and so on) and finding the best solution to the problem at hand.

If you have questions or comments, please contact me at https://www.linkedin.com/in/karabirupa-dutta-430a7624/.
