Model Validation and Monitoring

How to evaluate and monitor the performance of AI models for Financial Risk Management: a practical guide

Let’s talk Precision, Recall, F1 Score, Reject Rate, PSI, CSI, KS and Gini, along with their Python implementation

Indraneel Dutta Baruah
ANOLYTICS


Source: https://unsplash.com/photos/ffH_GkINfyY

One of the most important aspects of using AI models in Financial Risk Management is ensuring their quality at all times. There are two aspects to this. First, we need to ensure the model is accurate at the time of development, i.e. model validation. Second, we need to keep doing the same at regular intervals after deployment to the production environment, i.e. model monitoring. This blog is focused on introducing the most important model validation/monitoring metrics. We will also cover practical matters such as what thresholds to use when deciding whether a model’s performance is good, okay or bad. Finally, there is a separate Jupyter notebook with the Python implementation. Here are the metrics we will be covering:

  1. Accuracy, Precision, Recall and Reject Rate
  2. F1 Score
  3. ROC AUC
  4. Gini Coefficient
  5. Population Stability Index (PSI)
  6. Characteristic Stability Index (CSI)
  7. Kolmogorov-Smirnov Test (KS)

Let the learning begin!

Accuracy, Precision, Recall and Reject Rate

To understand accuracy, recall, precision and the F1 score, we need to understand the confusion matrix first. It can be shown as:

Source: Image By Author

Let’s understand the table above. What do we mean by positive/negative:

  • Positive: A data point comes under the positive category if it belongs to the class we are trying to predict. For example, if we are building a fraud model, all frauds will come in this category
  • Negative: Continuing with the example above, all the non-fraud cases are negative cases

Next, let’s continue with the fraud model example to understand True negative/False Negative/ True Positive/ False Positive:

  1. True negative (TN): A data point which is actually non-fraud and is predicted by the model as non-fraud
  2. False Negative (FN): A data point which is actually fraud and is predicted by the model as non-fraud
  3. True Positive (TP): A data point which is actually fraud and is predicted by the model as fraud
  4. False Positive(FP): A data point which is actually non-fraud and is predicted by the model as fraud

A confusion matrix is a table with the counts of true negatives, false negatives, true positives and false positives. Let’s take an example from our fraud model:

Source: Image By Author
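
As a quick illustration (on hypothetical labels rather than the actual fraud data), a confusion matrix like the one above can be produced with scikit-learn:

```python
# Sketch: building a confusion matrix with scikit-learn on hypothetical labels
import numpy as np
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(42)
y_true = rng.integers(0, 2, size=300)   # 1 = fraud, 0 = non-fraud (toy data)
y_pred = rng.integers(0, 2, size=300)   # stand-in for the model's predicted labels

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TN={tn}  FP={fp}  FN={fn}  TP={tp}")
```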

Now we can discuss the three basic metrics used for evaluating classification ML models.

Accuracy

Accuracy is the most basic metric used for model evaluation. It is calculated as the number of correct predictions over all predictions.

Source: Image by Author

Typically, accuracy below 70% is rejected, between 70–80% is considered acceptable based on business context and above 90% is considered good. Accuracy on the confusion matrix above is calculated as follows:

(TP = 50 + TN = 150) / (Total predictions = 300) = 200/300 ≈ 67%

Accuracy may not be a good measure if the dataset is not balanced (which is common in use cases like fraud). Let’s look at recall and precision now.

Source: Image By Author

Precision tells us how many of the cases our model tagged as fraud are actually frauds. On the other hand, recall tells us how many of the frauds our model was able to capture (hence it is also called capture rate). Recall is also known as sensitivity. Precision below 70% is usually rejected, over 80% is considered acceptable and above 90% is considered good; anything between 70–80% is a grey area depending on business context. For recall, 50–70% is acceptable depending on business context (for fraud cases, for example, 60% is usually acceptable), above 70% is good and below 50% is rejected.

In our example, precision is 67% (TP = 50 / (TP = 50 + FP = 25)) and recall is 40% (TP = 50 / (TP = 50 + FN = 75)).

Precision is the go-to metric when the cost of a false positive is high. For instance, in email spam detection, the user might lose important emails if the spam detection model's precision is not high. In fraud/default models built in financial risk management, recall is more important, as you need to catch the maximum possible number of frauds.

Another key metric used in financial risk management is the reject rate. It tells us what % of the population is tagged as fraud/default. Too high a number indicates that we will be rejecting a large share of customers, leading to customer dissatisfaction.

Source: Image By Author

In our case, we have a reject rate of 25% ((TP = 50 + FP = 25) / (Total predictions = 300)). Typically, a reject rate below 5% is good, between 5–10% is acceptable and anything above 20% is rejected. In some extreme situations, 10–20% might be accepted (for example, when there is a big push from regulatory bodies to prevent fraud in some new products). Finally, the same formula works for multi-class classification problems.
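
Putting the four counts from the confusion matrix above into code, a minimal sketch of these metrics looks like this (plain arithmetic; scikit-learn's accuracy_score, precision_score and recall_score give the same numbers when passed labelled predictions):

```python
# Accuracy, precision, recall and reject rate from the confusion matrix above
TP, TN, FP, FN = 50, 150, 25, 75

accuracy = (TP + TN) / (TP + TN + FP + FN)      # 200/300 ≈ 67%
precision = TP / (TP + FP)                      # 50/75  ≈ 67%
recall = TP / (TP + FN)                         # 50/125 = 40%
reject_rate = (TP + FP) / (TP + TN + FP + FN)   # 75/300 = 25%

print(f"Accuracy: {accuracy:.0%}  Precision: {precision:.0%}  "
      f"Recall: {recall:.0%}  Reject rate: {reject_rate:.0%}")
```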

F1 Score

The F1 score is the preferred choice when we want to balance precision and recall and the dataset is unbalanced. It is calculated as the harmonic mean of recall and precision, since the harmonic mean is well suited to ratios.

Source: Image By Author

As calculated above, our recall is 40% and precision is 67%. Thus the F1 score is 50%. An F1 score above 80% is good, 60–80% is considered acceptable and below 60% is rejected.

For multi-class classification, there are different variations of the F1 score such as macro-average, weighted-average and micro-average F1. For the macro-average F1 score, we first calculate the F1 score for each class separately by treating that class as positive and clubbing all the other classes together as negative. The macro-average F1 score is then the arithmetic mean of the per-class F1 scores. The weighted-average F1 score is calculated similarly, but each per-class F1 score is weighted by the number of data points belonging to that class. The micro-average F1 score uses the global precision and recall to calculate the F1 score using the formula shown above. For more details, please go through this blog.
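
A short sketch of the F1 calculation, with the scikit-learn averaging options for multi-class problems noted in comments (the y_true/y_pred arrays below are hypothetical):

```python
# F1 as the harmonic mean of precision and recall
precision, recall = 0.67, 0.40
f1 = 2 * precision * recall / (precision + recall)   # ≈ 0.50
print(f"F1 score: {f1:.0%}")

# For multi-class problems, scikit-learn exposes the averaging schemes directly
from sklearn.metrics import f1_score
y_true = [0, 1, 2, 2, 1, 0, 2, 1]   # hypothetical multi-class labels
y_pred = [0, 1, 2, 1, 1, 0, 2, 2]
print(f1_score(y_true, y_pred, average="macro"))     # unweighted mean of per-class F1
print(f1_score(y_true, y_pred, average="weighted"))  # weighted by class support
print(f1_score(y_true, y_pred, average="micro"))     # from global precision/recall
```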

Area Under the ROC Curve (AUC)

The Receiver Operating Characteristic (ROC) curve plots the true positive rate (TPR) on the y-axis and the false positive rate (FPR) on the x-axis. They are calculated as follows:

Source: Image By Author

The first thing to note is that it can only be used for models, such as decision trees and neural networks, where a prediction probability is available. Let us understand how the plot is made using our fraud model example. For different thresholds on the probability of being a fraud, we calculate the TPR and FPR. The ROC curve begins at (0, 0), corresponding to a threshold of 100% where all the customers are tagged as non-fraud. Similarly, for a threshold of 0%, all the customers are tagged as fraud, which is represented on the ROC curve as (1, 1). So the threshold goes down as we move from (0, 0) to (1, 1).

Source: Image By Author

The Area Under the Curve (AUC) is a summary of the ROC curve. The larger the area under the ROC curve, the better the ability to demarcate the classes (fraud vs non-fraud in our case). The highest possible score is 1, which is achieved when the model labels each class perfectly (i.e. TPR = 100%, FPR = 0%). AUC under 60% is usually rejected (or considered the red zone), between 60% and 70% is acceptable (amber zone) and above 70% is good (green zone). Key benefits of the AUC score are that it captures both true and false positives and that we can visually compare models as shown in the image above. For more details on what ROC curves look like for different models, read this blog.
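
As a sketch (with synthetic scores rather than a real fraud model), the ROC curve and AUC can be computed with scikit-learn as follows:

```python
# ROC curve points and AUC for a probabilistic classifier (synthetic scores)
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1_000)                           # 1 = fraud
y_score = np.clip(0.3 * y_true + 0.7 * rng.random(1_000), 0, 1)   # toy probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # one (FPR, TPR) point per threshold
print(f"AUC: {roc_auc_score(y_true, y_score):.3f}")
```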

Gini Coefficient

The Gini coefficient used for model performance evaluation is often confused with the Gini index. The Gini index was devised by the Italian statistician Corrado Gini in 1912 and is the most popular measure of socioeconomic inequality, based on the Lorenz curve. The Gini coefficient, in contrast, is a measure of the ordinal relationship between two variables (also known as Somers' D, proposed by Robert H. Somers in 1962). In the context of financial risk management, it measures the ordinal relationship between the predicted probability of fraud/default and the actual outcome.

Source: Image By Author

The Gini coefficient tells us how close our model is to a "perfect model" and how far it is from a "random prediction model." It is calculated as the ratio between the area between the model curve and the random prediction line (A) and the area between the perfect model curve and the random prediction line (A + B). Schechtman & Schechtman showed that AUC and Gini are related by the formula Gini = 2 × AUC − 1. Gini is preferred over AUC as Gini ranges between 0 and 1 while AUC ranges between 0.5 and 1. A Gini above 40% is considered good, between 20% and 40% is acceptable and below 20% is rejected.
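
In code, the conversion is a one-liner; the AUC value below is just a placeholder (for example, the output of roc_auc_score above):

```python
# Gini coefficient from AUC: Gini = 2 * AUC - 1
auc = 0.78                    # placeholder, e.g. from roc_auc_score
gini = 2 * auc - 1
print(f"Gini: {gini:.0%}")    # 56% -> "good" under the thresholds above
```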

To learn more about using Gini Coefficient in financial risk management, kindly go through this blog.

Population Stability Index (PSI)

It is a metric to measure how much a variable has shifted in distribution between two samples over time. It is widely used for monitoring changes in the characteristics of a population and for diagnosing possible problems in model performance — many a times, it’s a good indication if the model has stopped predicting accurately due to significant changes in the population distribution.

The above definition from this research paper is a perfect way to explain the monitoring metric. These are the steps for calculating PSI:

Step 1: Determine which population to use as the reference. The most common choice is the training data used for building the model.

Step 2: Rank-order the predictions (prediction probabilities for classification models) and divide them into bins (usually 10 bins of equal size).

Step 3: Calculate the number of data points in the reference population and the % of data in each reference bin. With equal-sized bins, the reference % distribution will be uniform as well.

Step 4: Store the cutoff points (prediction probabilities for classification models) used for creating the bins in the reference population and apply the same cutoffs to the OOT/production data.

Step 5: Calculate the number of data points in the OOT/production population and the % of data in each of the bins from Step 4.

Step 6: Calculate the index for each bin using the formula:

Source: Image By Author

There are two components in the formula:

1. The difference between the % values from Step 3 and Step 5
2. The natural log of the ratio of the % values from Step 3 and Step 5

Step 7: Sum the index for all the bins to get the PSI

Here is an example of the above steps:

Source: Image By Author

A PSI of less than 10% means there has been only a minor change in the population distribution, between 10–25% is acceptable and above 25% indicates a major shift in the population, meaning the model needs redevelopment and/or fresh variable selection.
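
Here is a minimal PSI sketch following the steps above; the function name and toy data are illustrative and not taken from the linked notebook:

```python
# Minimal PSI sketch: decile bins on the reference scores, same cutoffs on the new scores
import numpy as np

def psi(ref_scores, new_scores, n_bins=10):
    """Population Stability Index between a reference sample and a new sample of scores."""
    ref_scores, new_scores = np.asarray(ref_scores), np.asarray(new_scores)

    # Steps 2 & 4: equal-population bin cutoffs from the reference, reused for the new sample
    cuts = np.quantile(ref_scores, np.linspace(0, 1, n_bins + 1))
    ref_bin = np.clip(np.searchsorted(cuts, ref_scores, side="right") - 1, 0, n_bins - 1)
    new_bin = np.clip(np.searchsorted(cuts, new_scores, side="right") - 1, 0, n_bins - 1)

    # Steps 3 & 5: % of observations per bin in each sample
    ref_pct = np.bincount(ref_bin, minlength=n_bins) / len(ref_scores)
    new_pct = np.bincount(new_bin, minlength=n_bins) / len(new_scores)

    # Guard against empty bins before taking the log
    ref_pct = np.clip(ref_pct, 1e-6, None)
    new_pct = np.clip(new_pct, 1e-6, None)

    # Steps 6 & 7: per-bin index, summed
    return float(np.sum((new_pct - ref_pct) * np.log(new_pct / ref_pct)))

rng = np.random.default_rng(1)
print(psi(rng.random(10_000), rng.random(10_000) ** 1.2))  # small value -> stable population
```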

Characteristic Stability Index (CSI)

If PSI deteriorates, we can identify the causes of poor model performance by checking for distributional changes in the input features. This is exactly what CSI does, by comparing the distribution of an input feature in the production/OOT data with the distribution of the same feature in the development data set.

Calculating CSI involves the same steps as PSI. The only difference is that the bin cutoffs are based on the values of the input feature in the reference population rather than on the model predictions.
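
Under that assumption, CSI can reuse the psi helper sketched above, fed with a feature's values from the development and production samples (the income arrays below are synthetic):

```python
# CSI: the same calculation as PSI, binned on an input feature's values (synthetic example)
rng = np.random.default_rng(2)
income_dev = rng.lognormal(mean=10.5, sigma=0.4, size=5_000)    # development sample
income_prod = rng.lognormal(mean=10.7, sigma=0.4, size=5_000)   # production sample, shifted
print(psi(income_dev, income_prod))   # larger value -> the feature has drifted
```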

Kolmogorov-Smirnov (KS) statistic

The Kolmogorov-Smirnov (KS) statistic measures the discriminatory power of a model, like the Gini coefficient. It is calculated as the maximum difference between the cumulative distribution of events (frauds in our example) and the cumulative distribution of non-events. The KS statistic is applicable to all types of distributions. Another benefit is that, like Gini, it ranges from 0 to 1.

Here is a basic example of how the KS statistic is calculated:

Source: Image By Author

If the maximum KS falls in the top 3 deciles and is above 25%, the model is considered good; between 20% and 25% is acceptable; and below 20% is usually rejected.
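
A small sketch of the decile-table KS calculation described above (the helper name and toy scores are illustrative); for continuous score distributions, scipy.stats.ks_2samp on the event and non-event scores gives the classical two-sample version:

```python
# KS: max gap between cumulative % of events and non-events across score deciles
import numpy as np

def ks_statistic(y_true, y_score, n_bins=10):
    order = np.argsort(-np.asarray(y_score))       # rank-order by descending score
    y_sorted = np.asarray(y_true)[order]
    bins = np.array_split(y_sorted, n_bins)        # ~equal-sized deciles
    events = np.array([b.sum() for b in bins])
    non_events = np.array([len(b) - b.sum() for b in bins])
    cum_events = np.cumsum(events) / events.sum()
    cum_non_events = np.cumsum(non_events) / non_events.sum()
    return float(np.max(np.abs(cum_events - cum_non_events)))

rng = np.random.default_rng(3)
y = rng.integers(0, 2, size=2_000)
score = np.clip(0.4 * y + 0.6 * rng.random(2_000), 0, 1)   # toy model scores
print(f"KS: {ks_statistic(y, score):.0%}")
```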

Python Implementation

I have calculated all the metrics in Python inside this Kaggle notebook. The same code is hosted on GitHub as well. Please feel free to go through it!

Conclusion

In this blog we have covered most of the important model evaluation and monitoring metrics used in financial risk models. In fact, these can be used for any classification model. We have also touched on the practical thresholds used with each metric for judging a model's performance. For their practical implementation, the Jupyter notebook is linked above.

Hope this helps readers pick the best model and maintain it well! In my next blog we will discuss post-hoc model explainability methods like SHAP!


References

[1] Tafvizi, Arya & Avci, Besim & Sundararajan, Mukund. (2022). Attributing AUC-ROC to Analyze Binary Classifier Performance. 10.48550/arXiv.2205.11781.

[2] Somers, R. (1962). A New Asymmetric Measure of Association for Ordinal Variables. American Sociological Review, 27(6), 799–811. Retrieved from www.jstor.org/stable/2090408

[3] Schechtman, E., & Schechtman, G. (2016). The Relationship between Gini Methodology and the ROC curve (SSRN Scholarly Paper No. ID 2739245). Rochester, NY: Social Science Research Network.

[4] Castro Vargas, John Alejandro & Zapata-Impata, Brayan & Gil, Pablo & Rodríguez, José & Medina, Fernando. (2019). 3DCNN Performance in Hand Gesture Recognition Applied to Robot Arm Interaction. 10.5220/0007570208020806.

[5] Fang, Fang & Chen, Yuanyuan. (2018). A new approach for credit scoring by directly maximizing the Kolmogorov–Smirnov statistic. Computational Statistics & Data Analysis. 133. 10.1016/j.csda.2018.10.004.

[6] Gini, C. (1914). Reprinted: On the measurement of concentration and variability of characters (2005). Metron, LXIII(1), 338.

Indraneel Dutta Baruah
ANOLYTICS

Striving for excellence in solving business problems using AI!