Robust Machine Learning Model Evaluation- Part I

Sudhir Kumar Rai
3 min read · Aug 24, 2021


In machine learning projects, one of the biggest problems is a drop in model performance as soon as the model is deployed. Data science teams sometimes go into panic mode because it is quite difficult to find the root cause while the model is in production.

One way to mitigate this risk is to have more robust model evaluation in the lab, so that we already know what could go wrong if the model shows lower accuracy in production immediately after deployment.

One can think of this as the evaluation equivalent of software testing: a streamlined, test-case-based evaluation.

The prerequisite for this evaluation is the model's predicted scores and actual class values for the following datasets:

  1. Test Dataset: the dataset used in-lab for testing and not used in model training.
  2. Out-of-time (OOT) Dataset: a dataset not used in model training or testing at all.

In order to perform a comprehensive evaluation, we will look at 5 different accuracy metrics, namely F1 Score, FPR, FNR, Precision, and Recall. I have created 10 test cases that every model should be tested against. Now let's explore each test case in detail:

Test Case 1: Metric Comparison at Various Probability Thresholds

Generate F1 Score, FPR, FNR, Precision, and Recall for the Test set and the OOT set at various probability thresholds. Create a table as below (a code sketch for producing it follows):

Metric Comparison Table
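To make this concrete, here is a minimal sketch (not the author's original code) of how such a table could be built. It assumes each dataset is a pandas DataFrame with a `y_true` column of actual classes and a `score` column of model-predicted probabilities; those column names, and the 0.1–0.9 threshold grid, are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

def metrics_at_threshold(y_true, score, threshold):
    """F1, FPR, FNR, Precision, and Recall at a given probability threshold."""
    y_pred = (np.asarray(score) >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "F1": f1_score(y_true, y_pred, zero_division=0),
        "FPR": fp / (fp + tn) if (fp + tn) else 0.0,
        "FNR": fn / (fn + tp) if (fn + tp) else 0.0,
        "Precision": precision_score(y_true, y_pred, zero_division=0),
        "Recall": recall_score(y_true, y_pred, zero_division=0),
    }

def metric_comparison_table(test_df, oot_df, thresholds=np.arange(0.1, 1.0, 0.1)):
    """One row per threshold, with Test and OOT values for every metric side by side."""
    rows = []
    for t in thresholds:
        row = {"threshold": round(float(t), 2)}
        for name, df in [("test", test_df), ("oot", oot_df)]:
            m = metrics_at_threshold(df["y_true"], df["score"], t)
            row.update({f"{k}_{name}": v for k, v in m.items()})
        rows.append(row)
    return pd.DataFrame(rows)
```

Calling `metric_comparison_table(test_df, oot_df)` then yields the comparison table whose patterns are discussed below.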

Key patterns to look for:

a) In general, if the different evaluation metrics are close to each other for both datasets, the model is stable and robust.

b) Metric consistency between the two datasets across different probability thresholds is a concrete indication of model robustness.

c) If there are one or more thresholds at which the evaluation metrics differ widely between the two datasets, the model might not do well in a production setting for long.

Test Case 2: Using a T-Test for Model Evaluation on Bootstrapped Samples

Following are the steps to run a T-Test for this test case:

Step 1: Select bootstrapped samples from the Test data and the OOT data. The bootstrapped sample size should be equivalent to the size of the Test and OOT datasets. Let's say we take 5000 different bootstrapped samples, each the size of the Test data.

Step 2: For each bootstrapped sample, calculate F1 Score, FPR, FNR, Precision, and Recall for the Test set and the OOT set.
Step 3: You now have a pair of values (Test and OOT) for each accuracy metric, and 5000 such records.
Step 4: Run a two-tailed t-test to evaluate whether, for example, F1_test and F1_OOT come from the same population, at a 0.05 significance level.

Step 5: Create a table as below (see the code sketch after the tables):

Table 1: Model is consistent
Table 2: Model is inconsistent
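Here is a minimal sketch of Steps 1–4, building on the `metrics_at_threshold` helper from the Test Case 1 sketch. The fixed 0.5 probability threshold and the `y_true`/`score` column names are assumptions for illustration, not part of the original write-up.

```python
import numpy as np
import pandas as pd
from scipy import stats

def bootstrap_metrics(df, n_samples=5000, threshold=0.5, seed=0):
    """Metrics on each bootstrapped sample, drawn with replacement at the dataset's own size."""
    rng = np.random.default_rng(seed)
    records = []
    for _ in range(n_samples):
        idx = rng.integers(0, len(df), size=len(df))  # sample rows with replacement
        sample = df.iloc[idx]
        records.append(metrics_at_threshold(sample["y_true"], sample["score"], threshold))
    return pd.DataFrame(records)  # n_samples rows x 5 metric columns

def t_test_report(test_df, oot_df, n_samples=5000, alpha=0.05):
    """Two-tailed t-test per metric: do Test and OOT metric values come from the same population?"""
    test_boot = bootstrap_metrics(test_df, n_samples, seed=0)
    oot_boot = bootstrap_metrics(oot_df, n_samples, seed=1)
    rows = []
    for metric in test_boot.columns:
        t_stat, p_value = stats.ttest_ind(test_boot[metric], oot_boot[metric])
        rows.append({"metric": metric, "t_statistic": t_stat,
                     "p_value": p_value, "consistent": p_value > alpha})
    return pd.DataFrame(rows)
```

`t_test_report(test_df, oot_df)` returns one row per metric with the p-value and a consistency flag, matching the structure of Tables 1 and 2 above.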

Key patterns to look for:

  1. If, for each evaluation metric, the T-Test p-value > 0.05, the model is robust and consistent, as shown in Table 1 above.
  2. If, for one or more evaluation metrics, the T-Test p-value ≤ 0.05, the model might be inconsistent, as shown in Table 2 above.
  3. Depending on the domain, one can decide whether to go ahead with such a model. For example, in the Cyber Security and Banking domains a high false positive rate is not acceptable, whereas in the Marketing domain it may not be an issue.
