Model Monitoring: A Comprehensive Introduction

Mo Basirati
Machine Learning Reply DACH
6 min read · Mar 24, 2022

Machine learning (ML) models are becoming an inseparable part of information systems. They are no longer proofs of concept; they run in production environments, where a live version of the product serves real users. Any component in production must be monitored to make sure it meets the standards and expectations defined by engineers and stakeholders. Every system is built to satisfy a user's need, and ML systems are no exception. There is a very high chance that an ML model will stop delivering value if it does not get updated (Figure 1). The circumstances in which ML models operate change rapidly, and so must the models themselves.

Figure 1 — Monitored vs. Unmonitored Model

Monitoring is even more critical for ML pipelines. Data, the main ingredient of an ML model, is difficult to keep complete and up to date. Moreover, models usually consume more resources, making them expensive components. A faulty model is usually a bigger problem than a faulty piece of software. In addition, the development and production environments differ more widely in ML pipelines than in traditional software (Figure 2).

Figure 2 — Large Difference between Development and Production Environment of ML Systems

What is model monitoring?

Figure 3 — Feedback Loop of Model Monitoring

What do we mean by model monitoring? Model monitoring is the continuous tracking of clues and evidence about how well an ML system is performing, along with visualizing that information and alerting on critical events. It therefore consists of three main activities: tracking evidence, visualizing information, and alerting on critical events. Accordingly, we need to address all three when implementing a monitoring system.

Feedback from model monitoring enables us to continuously adapt and improve our models. Such feedback can provide insight into the platform on which the model is running, the model's performance, and the data. Traditional DevOps practices provide a plethora of methods and tools to monitor the software and hardware platforms. Monitoring model performance and data is the new, more challenging task that we need to tackle.

What should we monitor?

To design the right model monitoring system, we need to ask the right leading questions. The questions should address the two main components under investigation, namely, the model and the data. Monitoring the model is concerned with the verification and validation of the model concept and the model performance; a minimal sketch of such a performance check follows the list below. The main questions to investigate the model are the following:

  • Does the model perform as we expected (similarly to the performance during the training phase)?
  • Is the model solving the correct problem?
  • Is the model built based on the correct concept (the relationship between input and output)?
  • Does the model perform fairly and ethically?
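
Where labelled feedback from production is available, the first question can be checked automatically. Below is a minimal sketch in Python that compares live accuracy on a labelled batch against the accuracy measured during training; the baseline value, the tolerance, and the sample batch are assumptions for illustration, not fixed recommendations.

```python
# A minimal sketch: compare live performance against the training baseline.
# TRAINING_ACCURACY and TOLERANCE are illustrative assumptions.
from sklearn.metrics import accuracy_score

TRAINING_ACCURACY = 0.92   # baseline measured during the training phase
TOLERANCE = 0.05           # acceptable drop before we raise a flag

def check_performance(y_true, y_pred):
    """Return (live_accuracy, degraded?) for one labelled production batch."""
    live_accuracy = accuracy_score(y_true, y_pred)
    degraded = live_accuracy < TRAINING_ACCURACY - TOLERANCE
    return live_accuracy, degraded

# Example: a small labelled batch collected from production
live_acc, degraded = check_performance([1, 0, 1, 1, 0], [1, 0, 0, 1, 0])
if degraded:
    print(f"Model degraded: live accuracy {live_acc:.2f} "
          f"vs. baseline {TRAINING_ACCURACY:.2f}")
```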

Monitoring the data might be an even more challenging task, since there are many factors to investigate and analyze. In general, we face three situations regarding the data. Firstly, production contains new patterns and conditions that the model has not seen and did not expect (unseen data scenario). Secondly, the model was trained on data with inaccurate distributions and characteristics (skewed data scenario). Finally, the circumstances and dynamics of the data have changed since the model was trained (outdated data scenario). In practice, all three scenarios are common. Nevertheless, we can create a monitoring solution led by questions such as the following (see the drift-check sketch after this list):

  • Are our data still relevant?
  • Do we have data drift?
  • Should we include new features?
  • Should we exclude some old features?
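
To illustrate the drift question, here is a minimal sketch that compares the training and production distributions of a single numerical feature with a two-sample Kolmogorov-Smirnov test. The synthetic samples and the 0.05 significance level are assumptions for illustration.

```python
# A minimal sketch: detect drift in one numerical feature by comparing its
# training distribution with a recent production sample (KS two-sample test).
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
train_feature = rng.normal(loc=0.0, scale=1.0, size=5_000)  # training sample
prod_feature = rng.normal(loc=0.4, scale=1.0, size=1_000)   # shifted production sample

statistic, p_value = ks_2samp(train_feature, prod_feature)
if p_value < 0.05:  # illustrative significance level
    print(f"Data drift detected (KS statistic={statistic:.3f}, p={p_value:.4f})")
```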

How should we monitor?

To answer these questions, there are two general approaches to collecting information: collecting metrics and collecting logs. Table 1 presents short definitions and a pros-and-cons analysis of metrics versus logs. Both metrics and logs are necessary for a complete monitoring system. Metrics are numerical measurements of a property of the system, model, or data. We usually collect and store metrics in a time-series format, which is suitable for analyzing trends. Logs are detailed records of events consisting of both numerical and non-numerical data. In most cases, we can answer WHAT and WHEN questions using metrics. However, we need logs to answer WHY questions.

Table 1 — Metric vs. Log
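
To make the distinction concrete, the following sketch emits both a metric (a numeric value with a timestamp) and a structured log (a detailed event record) for a single prediction. The field names and the print-based "sinks" are placeholders; in practice the metric would go to a time-series store and the log to a log aggregator.

```python
# A minimal sketch of the metric/log distinction for one prediction event.
import json
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("ml_system")

def record_prediction(latency_ms, features, prediction):
    # Metric: a numeric value plus a timestamp, cheap to aggregate over time.
    print(f"prediction_latency_ms {latency_ms} {int(time.time())}")

    # Log: a detailed, structured record of the event, useful for answering
    # WHY a particular prediction looked the way it did.
    logger.info(json.dumps({
        "event": "prediction",
        "latency_ms": latency_ms,
        "features": features,
        "prediction": prediction,
    }))

record_prediction(12.5, {"age": 41, "plan": "premium"}, "churn")
```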

There are many standard metrics for evaluating and monitoring models and data. For example, we can monitor a model's accuracy, precision, recall, RMSE, and so on. Similarly, we can monitor the data using the Population Stability Index (PSI), the Characteristic Stability Index (CSI), Kullback-Leibler divergence, and so on.
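
As an example of a data metric, here is a minimal sketch of a PSI computation between the training ("expected") and production ("actual") distribution of one feature. The bin count and the commonly cited rule of thumb that PSI above 0.2 signals significant shift are illustrative choices.

```python
# A minimal sketch of the Population Stability Index (PSI) for one feature.
import numpy as np

def psi(expected, actual, bins=10, eps=1e-6):
    # Bin edges come from the training (expected) distribution.
    edges = np.histogram_bin_edges(expected, bins=bins)
    expected_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    actual_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    expected_pct = np.clip(expected_pct, eps, None)  # avoid log(0)
    actual_pct = np.clip(actual_pct, eps, None)
    return np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct))

rng = np.random.default_rng(0)
score = psi(rng.normal(0, 1, 10_000), rng.normal(0.3, 1, 2_000))
print(f"PSI = {score:.3f}")  # values above ~0.2 are often read as significant drift
```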

How to set up a model monitoring system?

Figure 4 — Model Monitoring System

A monitoring system consists of three main components, namely, a metric/log collector, persistent storage, and a visualization/dashboard component. These components are illustrated in Figure 4. Metrics are either pulled by or pushed to the collector. For monitoring a continuously running job, pulling is the better strategy, because pushing metrics from the ML system can become a bottleneck, while pulling does not affect the flow of the job. However, for the collector to be able to pull the data, the ML system needs to be instrumented. In other words, the ML system must expose the measurements needed to calculate the metrics.
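
As an illustration of instrumenting for the pull model, the sketch below exposes two metrics from a hypothetical prediction service using the prometheus_client library, so a collector such as Prometheus can scrape them; the metric names, port, and stand-in inference are assumptions.

```python
# A minimal sketch: instrument an ML service so a collector can pull metrics.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

PREDICTIONS = Counter("predictions_total", "Number of predictions served")
LATENCY = Histogram("prediction_latency_seconds", "Prediction latency")

@LATENCY.time()  # records how long each call takes
def predict(features):
    PREDICTIONS.inc()
    time.sleep(random.uniform(0.01, 0.05))  # stand-in for model inference
    return 0

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for the collector to pull
    while True:
        predict({"x": 1.0})
```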

Persistent storage keeps track of the data over time. We need to decide how long metrics and logs should be persisted. Metrics are mostly timestamped quantities to which we apply aggregation functions such as sum and average. Therefore, a time-series database is the most efficient way to store them. Although logs also contain information about the time of an event, we usually store them in a NoSQL document/object store (e.g. MongoDB or Amazon S3), which provides flexibility for the varying information we collect for different events.
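
As a small illustration of the document-store side, the sketch below persists one prediction event with pymongo; the connection string, database name, and document schema are assumptions.

```python
# A minimal sketch: persist an event log in a document store (local MongoDB assumed).
from datetime import datetime, timezone

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
events = client["monitoring"]["prediction_events"]

events.insert_one({
    "event": "prediction",
    "model_version": "v1.3.0",
    "latency_ms": 12.5,
    "prediction": "churn",
    "timestamp": datetime.now(timezone.utc),
})
# Flexible schema: later events may carry extra fields (e.g. drift scores)
# without any migration, which is why a document store suits logs.
```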

Finally, we need to analyze the data and visualize the results. To this end, a monitoring system must have a dashboard tool that provides valuable insight in a clear and real-time manner. The dashboard either receives the data in real time from the metric/log collector or retrieves it from storage. Furthermore, there are critical situations where urgent action is needed; for example, when the data shows a strange distribution that might indicate fraud. For such situations, we need an alerting system. The alerting system notifies the technical or business team via mechanisms such as email or instant messages. Accordingly, the team can investigate the circumstances and make a well-timed decision, avoiding a potential loss.
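
A minimal alerting sketch, assuming a hypothetical instant-messaging webhook endpoint and an illustrative PSI threshold of 0.2, might look like this:

```python
# A minimal sketch: threshold-based alerting via a messaging webhook.
import requests

PSI_ALERT_THRESHOLD = 0.2  # illustrative threshold
WEBHOOK_URL = "https://hooks.example.com/monitoring-alerts"  # placeholder

def alert_if_drifted(feature_name, psi_score):
    if psi_score <= PSI_ALERT_THRESHOLD:
        return
    requests.post(WEBHOOK_URL, json={
        "text": (f"Data drift alert: PSI for '{feature_name}' is "
                 f"{psi_score:.3f} (threshold {PSI_ALERT_THRESHOLD})")
    }, timeout=5)

alert_if_drifted("age", 0.34)
```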

Conclusion

Developing an ML model is just the start! Data is ever-changing and evolving; therefore, models must evolve too. To ensure that ML models keep delivering value, they must be monitored. No ML pipeline is complete without a monitoring system, which provides the feedback loop to the model and lets us figure out in which situations our models perform poorly or become costly. Model monitoring enables us not only to keep our models healthy but also to improve them continuously. A complete model monitoring system assures us that our models deliver value and will continue to do so.
