Assessing the performance of a predictive model on data censored by the model itself

dimalvovs
Riga Data Science Club
3 min read · Aug 6, 2020

At work, my team and I build pipelines that create and deploy machine learning (ML) models. Very often these models are used to estimate the probability that a loan will be repaid on time by a potential borrower. Since the models have the final say in the important decision of whether to extend credit or not, a large part of the pipeline code base exists solely to perform various checks on the models, both during training and afterwards, when the model is in production. It has always been tempting to use the popular AUC [1] score to measure the predictive accuracy of the model online and understand whether immediate action is needed, but I now tend to think that this is nonsense, at least without accounting for some issues. Here is why.

When we assess the model online, we measure AUC on a censored sample of the population (since the model rejected the applicants with lower scores). This can easily introduce bias and misrepresent the true performance of the model. To study how exactly censoring the predictions biases the AUC estimate, we construct synthetic prediction data for a given AUC and then measure AUC after filtering out the bottom 10%, 20%, and so on of the predictions.

One nice way to synthesize predictions with a given AUC is to create two normal distributions located as far apart as needed to achieve the desired level of accuracy: for example, one distribution with a mean of 0 and the second with a mean equal to the required distance. Luckily, there are ways to convert AUC to various other measures, and we will use a Python implementation of the solution described on stats.stackexchange [2] that converts AUC to a z coefficient and then to Cohen's d [3], the latter serving as the distance between the distribution means that yields the given accuracy.
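A minimal sketch of this step, assuming numpy and scipy are available (the function name `synthesize_predictions` and its parameters are illustrative, not the original code): the target AUC is converted to Cohen's d via d = sqrt(2) * Phi⁻¹(AUC), and scores for the two classes are drawn from N(0, 1) and N(d, 1).

```python
import numpy as np
from scipy.stats import norm

def synthesize_predictions(target_auc, n_good=1000, n_bad=1000, rng=None):
    """Draw synthetic scores whose expected AUC equals target_auc."""
    if rng is None:
        rng = np.random.default_rng()
    # For two unit-variance normals, AUC = Phi(d / sqrt(2)),
    # hence d = sqrt(2) * Phi^-1(AUC) is the required distance between means.
    d = np.sqrt(2) * norm.ppf(target_auc)
    scores_bad = rng.normal(loc=0.0, scale=1.0, size=n_bad)    # label 0 ("bad" outcomes)
    scores_good = rng.normal(loc=d, scale=1.0, size=n_good)    # label 1 ("good" outcomes)
    scores = np.concatenate([scores_bad, scores_good])
    labels = np.concatenate([np.zeros(n_bad), np.ones(n_good)])
    return scores, labels
```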

Next, we create a function that produces a pair of censored/uncensored AUCs. "Censored" means the model's performance is assessed on data where the bottom prediction/outcome pairs have been dropped according to an acceptance rate (ar) parameter, while "uncensored" is the AUC computed on the whole dataset at a given iteration. We also provide controls for the good/bad population structure, but the ratio remains 0.5 in this experiment.
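One way such a function could look, as a sketch building on `synthesize_predictions` above (scikit-learn's `roc_auc_score` is assumed, and the exact signature is illustrative):

```python
from sklearn.metrics import roc_auc_score

def censored_auc(target_auc, ar, n=2000, good_ratio=0.5, rng=None):
    """Return (uncensored AUC, AUC on the top `ar` fraction of scores)."""
    n_good = int(n * good_ratio)
    n_bad = n - n_good
    scores, labels = synthesize_predictions(target_auc, n_good=n_good,
                                             n_bad=n_bad, rng=rng)
    full_auc = roc_auc_score(labels, scores)
    # Keep only the applicants the model would accept:
    # scores above the (1 - ar) quantile of all scores.
    cutoff = np.quantile(scores, 1.0 - ar)
    keep = scores >= cutoff
    if labels[keep].min() == labels[keep].max():
        # Only one class survived censoring; AUC is undefined for this run.
        return full_auc, np.nan
    accepted_auc = roc_auc_score(labels[keep], scores[keep])
    return full_auc, accepted_auc
```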

Finally, we run the simulation multiple times with different censorship (acceptance) coefficients and plot the results.
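A sketch of the simulation driver, assuming matplotlib for plotting (the target AUC, accept rates, and run counts shown here are illustrative, not necessarily those used for the published chart):

```python
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
target_auc = 0.9
accept_rates = np.arange(0.1, 1.01, 0.1)
n_runs = 200

mean_accepted = []
for ar in accept_rates:
    runs = [censored_auc(target_auc, ar, rng=rng)[1] for _ in range(n_runs)]
    mean_accepted.append(np.nanmean(runs))   # skip runs where AUC was undefined

plt.plot(accept_rates, mean_accepted, marker="o", label="accepted AUC")
plt.axhline(target_auc, linestyle="--", label="population AUC")
plt.xlabel("accept rate")
plt.ylabel("AUC")
plt.legend()
plt.show()
```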

Honestly, it was quite astounding to see how much the censored AUC drops as the censorship rate increases. For high-accuracy models (AUC > 0.9) and a high censorship (low acceptance) rate, the measured accuracy dropped by more than 20 percentage points in some simulations.

In the chart: "population AUC" is the metric computed on the whole dataset; "accepted AUC" takes into account only the predictions above a certain quantile (as specified by the accept rate).

The results may depend on the way we simulate the predictions, but that is a topic for another small study.

  1. https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc
  2. https://stats.stackexchange.com/questions/422926/generate-synthetic-data-given-auc
  3. https://en.wikipedia.org/wiki/Effect_size#Cohen's_d
