Machine Learning Model Testing for Production

Why is ML testing for production important?

Shivika K Bisen
Bright AI
5 min read · May 20, 2023


Most AI/ML models now power automation and support decision-making, but how do we know an ML model is reliable enough to base decisions on?

Secondly, is good model evaluation/accuracy enough for the real world? How do we test the other metrics that matter, such as business KPIs like response speed (latency) and data load?

The current state of ML testing in most companies

Model development by data scientists may not always follow the best production practices, as they tend to focus on model quality rather than integration needs. Simply handing the trained model off to another team (software engineering) for shipping, as most teams do, isn’t enough.

ML testing flow

Model Evaluation: Important Metrics & Tools

Model evaluation is stage 0 of model testing and is limited to the functionality of the model itself. Each type of model has metrics suited to judging its quality: for imbalanced datasets, F1 and AUC scores are best for classification, and for outlier-heavy data, MAE is better for regression. For ensemble, decision-tree-based supervised models such as XGBoost and Random Forest, which are otherwise black boxes, SHAP can be used to understand the overall logic behind predictions and LIME to inspect specific cases.
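As a rough sketch (assuming an already-trained XGBoost classifier `clf` and held-out test data, which are not defined in this article), the metric and SHAP checks could look like this:

```python
# Minimal stage-0 evaluation sketch; clf, X_test, y_test are assumed to exist.
from sklearn.metrics import f1_score, roc_auc_score, mean_absolute_error
import shap

# Classification on imbalanced data: prefer F1 and AUC over plain accuracy
y_pred = clf.predict(X_test)
y_prob = clf.predict_proba(X_test)[:, 1]
print("F1 :", f1_score(y_test, y_pred))
print("AUC:", roc_auc_score(y_test, y_prob))

# Regression on outlier-heavy data: prefer MAE over squared-error metrics
# print("MAE:", mean_absolute_error(y_test_reg, reg.predict(X_test_reg)))

# Global explanation for tree ensembles (XGBoost, Random Forest) with SHAP
explainer = shap.TreeExplainer(clf)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test)   # overall view of feature impact
```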

Pre-Training Test/ Unit Testing

Whether we take an online-learning or a batch-learning approach, some unit testing is done before training to verify software/model requirements such as correctly formatted data, enough data, and so on, as sketched below.
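A hedged pre-training test sketch in pytest style; the column names, minimum row count, and `load_training_data()` helper are hypothetical stand-ins for whatever your pipeline uses:

```python
# Pre-training unit tests: right data format, enough data, no missing labels.
import pandas as pd

EXPECTED_COLUMNS = {"user_id", "feature_1", "feature_2", "label"}  # assumed schema
MIN_ROWS = 10_000                                                  # assumed threshold

def load_training_data() -> pd.DataFrame:
    return pd.read_csv("training_data.csv")   # placeholder data source

def test_schema_is_correct():
    df = load_training_data()
    assert EXPECTED_COLUMNS.issubset(df.columns), "unexpected training-data schema"

def test_enough_data():
    df = load_training_data()
    assert len(df) >= MIN_ROWS, "not enough rows to train reliably"

def test_no_missing_labels():
    df = load_training_data()
    assert df["label"].notna().all(), "target column contains missing values"
```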

Post-Training Test: Batch vs. Online Learning

The post-training test runs after training in batch learning, but during training in online learning.

  1. Latency test: Check that a prediction is returned within a fraction of a second so the model can scale and handle traffic. If it takes a minute or more, the model design usually needs to change. This test is especially important for an online machine-learning approach. One technique to tackle latency in online learning is to fix hyperparameters as static values instead of running grid-search CV every time new data arrives, which would increase latency. The static hyperparameter values can be found by training on a sampled subset of the data with techniques like random-search CV and keeping the parameters that perform best across most of the validation data (see the latency sketch after this list).
  2. Load test: Check how much test data the model can handle at a time. This is important for both batch and online learning. One technique is to use SQLAlchemy to verify that all databases (DBs) are accessed in parallel, which enables handling large data loads. AWS containers also help with CPU/memory monitoring, and Locust is a tool to automate this test (a Locust sketch follows the list).
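A minimal latency check might time repeated predictions and fail if the 95th percentile exceeds a budget; the model object, sample input, and threshold are assumptions:

```python
# Latency smoke test: fail if p95 single-prediction latency exceeds the budget.
import time
import numpy as np

LATENCY_BUDGET_S = 0.5          # assumed budget: well under a second

def p95_latency(model, x, n_runs: int = 200) -> float:
    timings = []
    for _ in range(n_runs):
        start = time.perf_counter()
        model.predict(x)
        timings.append(time.perf_counter() - start)
    return float(np.percentile(timings, 95))

# assert p95_latency(model, x_sample) < LATENCY_BUDGET_S
```

For the load test, a Locust sketch can simulate many concurrent users hitting a prediction endpoint; the /predict route and payload are assumptions about your service:

```python
# locustfile.py: run with `locust -f locustfile.py --host https://staging.example.com`
from locust import HttpUser, task, between

class PredictionUser(HttpUser):
    wait_time = between(0.1, 1.0)   # simulated think time between requests

    @task
    def predict(self):
        self.client.post("/predict", json={"features": [0.1, 0.2, 0.3]})
```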

A/B testing: Model retraining

In some ML/AI models, the data characteristics change over time, so a model trained on old data might not perform well on new data. This is called data drift. Similarly, concept drift occurs when the assumptions made by a machine learning model no longer hold true in the real-world data it encounters during deployment.

Therefore retraining is required after ML deployment. An A/B test helps decide whether the new version of the model should replace the old one.

In A/B testing (facilitated automatically by AWS SageMaker), 80% of traffic is routed to the current/old ML model while 20% of traffic is routed to the new (challenger) model. Based on the baseline metric(s), the new model replaces the old one if it performs better.
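One way to set up such an 80/20 split on SageMaker is through production variant weights in the endpoint configuration; a minimal boto3 sketch, with hypothetical model, endpoint, and instance names:

```python
# 80/20 traffic split between champion and challenger models on one endpoint.
import boto3

sm = boto3.client("sagemaker")

sm.create_endpoint_config(
    EndpointConfigName="churn-ab-test",
    ProductionVariants=[
        {
            "VariantName": "current-model",        # champion: 80% of traffic
            "ModelName": "churn-model-v1",
            "InstanceType": "ml.m5.large",
            "InitialInstanceCount": 1,
            "InitialVariantWeight": 0.8,
        },
        {
            "VariantName": "challenger-model",     # challenger: 20% of traffic
            "ModelName": "churn-model-v2",
            "InstanceType": "ml.m5.large",
            "InitialInstanceCount": 1,
            "InitialVariantWeight": 0.2,
        },
    ],
)
sm.create_endpoint(EndpointName="churn-endpoint", EndpointConfigName="churn-ab-test")
```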

Stage test/ Shadow test

The stage test is one of the final checks that the model gives the desired output. It happens after Dockerization and AWS containerization, for automatic deployment through pipelines like Bitbucket and GitLab. Diverse test data (covering extensive scenarios) that captures real-world data characteristics is given as input to the model in the stage environment.
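A stage-test sketch might send a battery of representative payloads to the staging endpoint and check the response shape; the URL, payloads, and response schema here are assumptions:

```python
# Stage test: diverse, real-world-like inputs against the staging environment.
import requests
import pytest

STAGE_URL = "https://staging.example.com/predict"   # hypothetical staging endpoint

TEST_CASES = [
    {"features": [0.1, 0.2, 0.3]},      # typical input
    {"features": [0.0, 0.0, 0.0]},      # all-zero edge case
    {"features": [1e6, -1e6, 3.5]},     # extreme values
]

@pytest.mark.parametrize("payload", TEST_CASES)
def test_stage_prediction(payload):
    resp = requests.post(STAGE_URL, json=payload, timeout=5)
    assert resp.status_code == 200
    assert "prediction" in resp.json()   # assumed response schema
```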

Shadow test (a safe way to sanity-check the deployment of large ML/AI models):

  1. Deploy the new model in parallel with the existing model (if one is already present).
  2. For each request, route it to both models to make a prediction, but only serve the existing model’s prediction to the user.
  3. Use the new model’s predictions for evaluation/analysis (a minimal routing sketch follows).
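A minimal sketch of that routing, assuming in-process model objects and a logging sink for later comparison (both are assumptions, not a specific serving framework):

```python
# Shadow deployment: score with both models, serve only the production prediction.
import logging

logger = logging.getLogger("shadow_test")

def handle_request(features, prod_model, shadow_model):
    prod_pred = prod_model.predict([features])[0]     # served to the user

    try:
        shadow_pred = shadow_model.predict([features])[0]
        # Log both predictions for offline evaluation; never serve the shadow one.
        logger.info("prod=%s shadow=%s features=%s", prod_pred, shadow_pred, features)
    except Exception:
        logger.exception("shadow model failed; user traffic is unaffected")

    return prod_pred
```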

API Testing

This is the final test, where we check how the actual user will see the model’s response. It therefore covers all the possible user inputs that could produce an error response. Each error is encoded with a unique ID (status code, error code), and the security of inputs and outputs is checked for each unique customer.
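A sketch of such API tests using FastAPI’s TestClient; the app module, /predict route, and auth behavior are hypothetical, but the pattern of asserting status codes per error case is general:

```python
# API tests: valid input, malformed input, and an unauthenticated request.
from fastapi.testclient import TestClient
from my_service import app          # hypothetical FastAPI app exposing /predict

client = TestClient(app)

def test_valid_request_returns_200():
    resp = client.post("/predict", json={"features": [0.1, 0.2, 0.3]})
    assert resp.status_code == 200

def test_missing_field_returns_422():
    resp = client.post("/predict", json={})   # missing "features"
    assert resp.status_code == 422            # FastAPI validation error

def test_unauthenticated_request_is_rejected():
    resp = client.post(
        "/predict",
        json={"features": [0.1, 0.2, 0.3]},
        headers={"Authorization": "Bearer invalid-token"},
    )
    assert resp.status_code in (401, 403)     # assumed auth behavior
```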

Reference:

My talk at MLOps Conference (Convergence by Comet)

https://www.comet.com/site/about-us/news-and-events/press-releases/responding-to-community-demand-comet-announces-convergence-2022-bringing-together-leaders-to-share-insights-on-how-to-solve-the-enterprises-biggest-challenges-with-machine-learning/
