Model Retraining Bible, Part 2: Model Monitoring

Hrithik Rai Saxena
Mar 22, 2023 · 9 min read


Model Monitoring is an operational stage in the machine learning lifecycle that comes after model deployment. It entails monitoring your ML models for changes such as model degradation, data drift, and concept drift, and ensuring that your model is maintaining an acceptable level of performance.

The following table illustrates briefly what kind of challenges our model will experience during production. Remember that your machine learning application is not just the ML model but everything that supports your model in production. This includes all the relevant resources and infrastructure, the input data, and other services.

(Table courtesy of https://neptune.ai/blog/how-to-monitor-your-models-in-production-guide)

These are the questions that need to be sorted out during production. The most important one is determining how well our model is performing, and for that we need evaluation metrics. Depending on the product, these metrics can be pre-existing or custom-made, but we cannot simply pick the first metric we see. The metric should correspond to the user's perspective: an increase in the metric should mean a better user experience. It should also capture extreme cases, not just typical experiences.

These metrics can either live inside an analysis notebook or be computed by a real-time service. But remember: over time, bugs will creep into your inference service code, client code, configurations, inference input data, training data, the model library, or training code. An analysis notebook only catches them when someone runs it, so a real-time service is usually the better idea.

Implementing the monitoring infrastructure

The first step is to create your metric(s) and evaluate them for several weeks. This analysis can be done in a notebook (if possible) or by implementing the metric(s) in a live service. Are they stable? Do they go down during the times when you know you had a quality issue?

If a metric is relatively stable over time and confirmed (or at least expected) to go down when an issue occurs, you can alert on a reasonable threshold, which you define based on your analysis of the metric's typical level. If the metric is a bit noisy, you can clean it up: apply smoothing, change its definition slightly (e.g., percentage of the total, comparison against the same time yesterday or last week, and so on), or alert only on big differences.
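For instance, a minimal sketch of this kind of smoothing and baseline comparison (with made-up hourly metric values and an arbitrary 10% alert threshold) could look like this:

```python
import numpy as np
import pandas as pd

# Hypothetical hourly metric (e.g., share of accepted recommendations) over two weeks
rng = np.random.default_rng(0)
metric = pd.Series(0.80 + 0.02 * rng.standard_normal(24 * 14))

smoothed = metric.rolling(window=24, min_periods=24).mean()   # smooth out hourly noise
baseline = smoothed.shift(24 * 7)                             # same time last week

# Alert only on large relative drops versus the weekly baseline
relative_drop = (baseline - smoothed) / baseline
alerts = relative_drop > 0.10
print(f"{int(alerts.sum())} alert(s) in the last week of data")
```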

Metric when you have access to ground truth — F1 Score

A common metric used among data scientists to evaluate a model is the F1 score, mainly because it encompasses both the precision and recall of the model: it is the harmonic mean of the two, F1 = 2 * (precision * recall) / (precision + recall).
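As a quick illustration, here is a minimal sketch that computes precision, recall, and the F1 score with scikit-learn; the labels below are invented for the example:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical ground-truth labels and model predictions for a binary task
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

precision = precision_score(y_true, y_pred)   # TP / (TP + FP)
recall = recall_score(y_true, y_pred)         # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                 # harmonic mean of the two

print(f"precision={precision:.2f}, recall={recall:.2f}, f1={f1:.2f}")
```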

Metrics when you do not have access to ground truth.

Monitoring the accuracy of a model isn't always possible; in certain instances it is much harder to obtain paired predicted and actual values. For example, imagine a model that predicts the net income of a public firm: you would only be able to measure the accuracy of its predictions four times a year, from the firm's quarterly earnings reports. When you aren't able to compare predicted values to actual values, there are other alternatives you can rely on:

Kolmogorov-Smirnov (K-S) test: The K-S test is a nonparametric test that compares the cumulative distributions of two data sets, in this case, the training data and the post-training data. The null hypothesis for this test states that the distributions from both datasets are identical. If the null is rejected, then you can conclude that your model has drifted.
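A minimal sketch of this check using scipy's two-sample K-S test, with synthetic training and live feature values standing in for real data:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)

# Hypothetical feature values: training data vs. live (post-training) data
train_feature = rng.normal(loc=0.0, scale=1.0, size=5_000)
live_feature = rng.normal(loc=0.3, scale=1.0, size=5_000)   # slightly shifted

result = ks_2samp(train_feature, live_feature)

# The null hypothesis says the two distributions are identical;
# rejecting it (small p-value) suggests the model's input has drifted
if result.pvalue < 0.05:
    print(f"Drift suspected: KS statistic={result.statistic:.3f}, p={result.pvalue:.4f}")
else:
    print("No significant drift detected")
```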

Population Stability Index (PSI): The PSI is a metric used to measure how a variable's distribution has changed over time. It is a popular metric for monitoring changes in the characteristics of a population and, thus, for detecting model decay.
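There is no standard one-liner for PSI, so here is a rough sketch of one common way to compute it (quantile bins taken from the training data; the 0.1/0.25 cut-offs are the usual rule of thumb, not a hard rule):

```python
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """Rough PSI sketch: bin the training ('expected') data by its own quantiles
    and compare the live ('actual') data bin by bin."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    expected_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    actual_pct = np.histogram(np.clip(actual, edges[0], edges[-1]), bins=edges)[0] / len(actual)

    # Small floor so that empty bins do not blow up the logarithm
    expected_pct = np.clip(expected_pct, 1e-6, None)
    actual_pct = np.clip(actual_pct, 1e-6, None)

    return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))

# Common rule of thumb: PSI < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 major shift
rng = np.random.default_rng(0)
psi = population_stability_index(rng.normal(0, 1, 5_000), rng.normal(0.4, 1, 5_000))
print(f"PSI = {psi:.3f}")
```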

Z-score: Lastly, you can compare the feature distribution between the training data and the live data using the z-score. For example, if a noticeable share of live data points for a given variable have a z-score beyond +/- 3, the distribution of that variable may have shifted.
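A minimal sketch of this z-score check, again with synthetic data and an arbitrary 1% cut-off for how many extreme points we tolerate:

```python
import numpy as np

rng = np.random.default_rng(7)
train_values = rng.normal(loc=50.0, scale=5.0, size=10_000)   # hypothetical training feature
live_values = rng.normal(loc=58.0, scale=5.0, size=1_000)     # live data, shifted for the example

# Z-score the live points against the training mean and standard deviation
z_scores = (live_values - train_values.mean()) / train_values.std()

# Flag the feature if an unusually large share of live points falls beyond +/- 3
outlier_share = np.mean(np.abs(z_scores) > 3)
if outlier_share > 0.01:   # the 1% cut-off is an arbitrary example value
    print(f"Possible shift: {outlier_share:.1%} of live points beyond 3 standard deviations")
```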

Things to monitor after deploying your model

Data Validation: One rule applies here: garbage in, garbage out. Your model is sensitive to the input data it receives. If the statistical distribution of the production data differs from that of the training data, model performance will decline significantly. Both the data schema and data drift have to be taken care of here.
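As an illustration, a very small schema check could look like the sketch below; the expected columns and dtypes are invented for the example:

```python
import pandas as pd

# Hypothetical schema for the model's input features
EXPECTED_SCHEMA = {"age": "int64", "income": "float64", "country": "object"}

def validate_batch(df: pd.DataFrame) -> list:
    """Return a list of schema problems found in an incoming batch of production data."""
    problems = []
    for col, dtype in EXPECTED_SCHEMA.items():
        if col not in df.columns:
            problems.append(f"missing column: {col}")
        elif str(df[col].dtype) != dtype:
            problems.append(f"{col}: expected {dtype}, got {df[col].dtype}")
    extra = set(df.columns) - set(EXPECTED_SCHEMA)
    if extra:
        problems.append(f"unexpected columns: {sorted(extra)}")
    return problems

batch = pd.DataFrame({"age": [34, 41], "income": ["52000", "61000"], "country": ["DE", "FR"]})
print(validate_batch(batch))   # income arrives as strings -> schema violation is reported
```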

Model Validation: It's important to monitor model performance in production; if it falls below the threshold, a retraining job can be triggered and the retrained model tested. After running a series of experiments when retraining your machine learning model, you need to save all the model metadata for reproducibility. Your retraining pipeline should log different model versions and metadata to a metadata store alongside model performance metrics. We also need to check our model infrastructure for security, compatibility, and consistency with the prediction service API before we deploy the model into production.

Approaches for dealing with new data

Retraining the model from scratch

Once you have received enough new data that you expect it to push performance past the threshold, you can combine it with your old data (find the ways of combining them here) and train your model from scratch. This way you capture the new relationships that were missing in the old data.

Batch Update

A periodic batch update is another option: you wait for data to accumulate, form batches, and then pass them through your model. You are still doing offline training, but you keep up with the new data.

Updating the model with each new data point

This is the realm of online and continual learning, where instead of retraining the model from scratch we update it as we come across new data. Find more information on this here.
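As a rough illustration, scikit-learn's partial_fit API supports this kind of incremental updating; the data stream below is synthetic:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Hypothetical data stream: update the model batch by batch instead of retraining from scratch
model = SGDClassifier()
classes = np.array([0, 1])   # every class must be declared on the first partial_fit call

rng = np.random.default_rng(1)
for _ in range(100):                      # pretend each iteration delivers a new mini-batch
    X_batch = rng.normal(size=(32, 5))
    y_batch = (X_batch[:, 0] > 0).astype(int)
    model.partial_fit(X_batch, y_batch, classes=classes)

print(model.predict(rng.normal(size=(3, 5))))
```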

When should the model be retrained?

The question of the hour. When?

This depends significantly on when we receive the new training data. The new data might arrive at a regular interval (say, every hour or every week), or only once a month or once a year. It might come from interactions on the client side, be generated from an experiment, or be fabricated to represent a specific scenario. How frequently this new data comes in, and how much drift it introduces, helps us decide which learning technique to go with (you can find more information on the learning techniques in a separate article here). For model retraining, one option is to schedule it periodically in line with the incoming data, keeping in mind how quickly the distributions change.

For example, the following triggers can be set to initiate model retraining.

Based on output distribution: Suppose your model makes a mean prediction of X. If this mean value increases or decreases by ten percent, you can infer that there has been significant model drift and that retraining on the new data distribution is required.
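A minimal sketch of such a trigger, with made-up reference and live predictions and a 10% tolerance:

```python
import numpy as np

def mean_shift_trigger(reference_preds, live_preds, tolerance=0.10):
    """Return True when the mean prediction moves more than `tolerance` (here 10%)
    away from the reference mean prediction X."""
    ref_mean = np.mean(reference_preds)
    live_mean = np.mean(live_preds)
    return abs(live_mean - ref_mean) / abs(ref_mean) > tolerance

# Reference mean ~100 vs. live mean ~115 -> a 15% shift, so retraining would be triggered
print(mean_shift_trigger([98, 101, 102, 99], [114, 116, 115, 115]))   # True
```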

KPI (Key Performance Indicator) based retraining: Here we need to decide on a KPI and determine the threshold of divergence that will trigger model retraining. There is a catch, though: a threshold that is too low leads to frequent retraining and adds compute cost, while a threshold that is too high holds back retraining and leaves suboptimal models running in production. Find more on this here.

Seasonality: Some data distributions only shift during particular seasons, so scheduling retraining around that time is a good idea. For example, changes in humidity and temperature between seasons can induce unwanted divergence in the predictions of industrial agents deployed for particular tasks.

You can also automate model drift detection to trigger model retraining with the help of tools like Jenkins and Kubernetes Jobs. So, whenever you find yourself in this model retraining situation, just ask yourself:

‘How much new training data can be collected before representing the new state of the environment?’

Automatic Model Testing

If model retraining is triggered automatically, the tests that check whether training was successful should also be at least partly automated. ML projects carry a lot more uncertainty than traditional software. In many cases, we don't even know whether the project is technically possible, so we have to invest some time in research before we can answer that. This uncertainty harms good software practices such as testing, because we don't want to spend time testing an ML project in an early stage of development that might not receive the green light for further continuation.

Neptune — Tool for continuous model monitoring

Neptune is an experiment tracking and model monitoring platform that lets you run a lot of experiments and store all the metadata of your MLOps workflow. With Neptune, your team can track experiments, reproduce promising models, and monitor your model pipeline effectively; it is easy to integrate with other frameworks, and every model experiment/run can be logged.
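As a rough illustration of logging a run (assuming a recent version of the neptune Python client; the exact calls vary between client versions, and the project name and API token below are placeholders):

```python
import neptune

# Placeholders: use your own workspace/project name and API token
run = neptune.init_run(project="my-workspace/my-project", api_token="YOUR_API_TOKEN")

run["parameters"] = {"lr": 0.001, "batch_size": 64}        # log hyperparameters once
for loss in [0.90, 0.71, 0.58, 0.52]:                      # log a metric series over epochs
    run["train/loss"].append(loss)

run.stop()
```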


Some things to care about during the production stage

Smoke test

Smoke testing means running a series of fast, basic tests on the core functionality of your production environment (as well as other environments) to ensure deployments are successful and to catch any major issues before end-users run into them.

ML projects usually include a lot of packages and libraries, and these packages release new updates from time to time. The problem is that an update sometimes changes functionality in the package: even if there is no visible change in the code, there may be changes in the logic that present a more significant issue. We may also want to pin older releases of more stable and well-tested packages.

Because of that, a good practice is to create a requirements.txt file pinning all dependencies and to run a smoke test in a fresh test environment.
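A minimal, hypothetical smoke test might look like this; the my_project.inference module, model path, and feature names are placeholders for your own code:

```python
# smoke_test.py - a minimal, hypothetical smoke test
import pandas as pd

def test_prediction_service_smoke():
    """Fail fast if the core predict path breaks after a dependency or model update."""
    from my_project.inference import load_model, predict   # hypothetical imports

    model = load_model("models/latest.pkl")
    sample = pd.DataFrame({"age": [35], "income": [52_000.0], "country": ["DE"]})

    predictions = predict(model, sample)
    assert predictions is not None
    assert len(predictions) == 1
```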

Unit Testing

After smoke testing, the next logical test type to implement is unit testing. As mentioned above, a unit test isolates one specific component and tests it separately: the idea is to split the code into blocks, or units, and test them one by one.

Unit tests make it easier to find bugs, especially early in the development cycle. Debugging is far more convenient when we can analyze isolated pieces rather than the whole codebase. Unit tests also help us design better code: if it is hard to isolate a piece of code for a unit test, the code is probably not well structured. The rule of thumb is that the best moment to start writing unit tests is when we begin organizing the code into functions and classes.
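For example, a small pytest-style unit test for a hypothetical normalize helper (both the function and the tests are invented for illustration):

```python
# test_features.py - a minimal pytest unit test for one isolated unit of code
import pytest

def normalize(values):
    """Hypothetical unit under test: scale a list of numbers to the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot normalize a constant series")
    return [(v - lo) / (hi - lo) for v in values]

def test_normalize_maps_to_unit_interval():
    assert normalize([2, 4, 6]) == [0.0, 0.5, 1.0]

def test_normalize_rejects_constant_input():
    with pytest.raises(ValueError):
        normalize([3, 3, 3])
```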

Human in the Loop

Human-in-the-Loop aims to achieve what neither a human being nor a machine can achieve on their own. When a machine isn't able to solve a problem, humans step in and intervene. This creates a continuous feedback loop: with constant feedback, the algorithm learns and produces better results over time.

Typically, there are two kinds of machine learning where you can integrate HITL approaches: supervised and unsupervised learning. HITL is used to improve accuracy on rare datasets while taking care of safety and precision.

So, this was some theoretical background on model retraining. In the next blog, we will have a look at different methods to address this problem of model drift.

Till then, Happy Learning. 😊

