Model monitoring

Theofilos Papapanagiotou
Prosus AI Tech Blog
Dec 31, 2021

In our previous post about the technical capabilities of an ML platform, we highlighted the problem of model degradation over time. At Prosus, we embrace MLOps practices across our businesses to deal with that problem. One of the great things that MLOps took from DevOps is the continuous effort to improve monitoring, so in this post we are going to talk about model monitoring techniques. In other words: metrics, metrics, and metrics!

Our goal is to discuss the relevant metrics that help you monitor your models and ensure that the performance, and thus the impact, of your models stays constant over time. In addition to explaining the main metrics, we’ll list practical tools that allow you to immediately apply model monitoring in your ML workflow.

inspired by the software engineering workflow, monitoring closes the cycle of the continuous training pipeline

Model monitoring extends beyond delivering a model to operation. It closes the circuit of the workflow by connecting the deployment with the next runs of the continuous training pipeline. It is a continuous effort that experienced ML professionals practice in order to maintain high-quality results, especially when dealing with non-stationary data domains.

Meten is weten (Dutch for “to measure is to know”)

Service metrics

Service metrics are the traditional metrics that any application owner and operator should track, such as the latency, the success rate, and the number of invocations of an inference service, also known as model serving.

The latency or response time tells us how fast an inference request is served. This is a metric of the quality of the user experience. It reflects the amount of time it takes for the input data points to be received, transformed, multiplied by the weights of our model, and sent back to the user. It is measured in milliseconds, and lower is better. The response time increases with the number of parameters of the model, as a larger weight matrix has to be multiplied with the input vector. Moving this multiplication to the GPU and performing model quantization are good ways to deal with this problem. If the input vector needs to be transformed or enriched with external features, that also adds to the latency. Scaling out such a transformer component independently from the predictor is a good way to overcome this problem.

latency of a model response observed in different percentiles

The success rate or availability is the percentage of successful requests over all received inference requests. Requests can fail for many reasons, so we strive to maximize this metric, with the tradeoff being the operational cost. Typically the infrastructure design is crucial to optimize this metric, so scaling the components properly can help deal with timeouts thrown when limits are reached across the serving pipeline: timeouts on the load balancer, the service mesh, the input transformer, the third-party components that enrich a request, or the model server itself. Another reason is that input data change over time, and future instances might be outliers with respect to older models. Validating that input data respect the expected data schema, in terms of data types and ranges of values, is a good way to ensure that they are within the expected distribution.

The number of invocations or requests per second is a metric that cannot be optimized, as it reflects the business adoption of the model. It should be tracked, though, next to the previous service metrics, to inform infrastructure design decisions around scalability.

model invocations is a metric usually expressed in requests per second

These three service metrics are fundamental concepts, and having them broken down by model and model version in a time-series fashion will eventually allow us to operate our models better.
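As a minimal sketch (assuming a Python model server instrumented with the prometheus_client library; the metric and label names are our own illustrative choices), these three metrics could be exposed per model and model version like this:

```python
import time
from prometheus_client import Counter, Histogram, start_http_server

# Hypothetical metric names; every observation is labelled with the model name
# and version so the time series can be broken down per deployed model.
REQUEST_LATENCY = Histogram(
    "inference_request_latency_seconds",
    "Latency of inference requests",
    ["model_name", "model_version"],
    buckets=(0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5),
)
REQUEST_COUNT = Counter(
    "inference_requests_total",
    "Number of inference requests",
    ["model_name", "model_version", "status"],
)

def predict(model, features, model_name="cars-price", model_version="v3"):
    """Wrap a model call so latency, invocations and success rate are recorded."""
    start = time.perf_counter()
    try:
        prediction = model.predict(features)
        REQUEST_COUNT.labels(model_name, model_version, "success").inc()
        return prediction
    except Exception:
        REQUEST_COUNT.labels(model_name, model_version, "error").inc()
        raise
    finally:
        REQUEST_LATENCY.labels(model_name, model_version).observe(
            time.perf_counter() - start
        )

if __name__ == "__main__":
    start_http_server(8080)  # expose /metrics for Prometheus to scrape
```

From these series, the success rate is the ratio of the success counter to the total, and latency percentiles like the ones in the figure above can be derived from the histogram buckets, for example with Prometheus’ histogram_quantile function.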

Finally, the space of application performance management, distributed tracing, and observability has tools that offer transaction monitoring capabilities to measure such metrics across the different components of an inference service. Being able to visualize the latency of an individual prediction across the data transformer, the online feature store, and the model server is a powerful root cause analysis technique. Similarly, in a model mesh environment, following the path of a request across the different models of an ensemble, or visualizing the whole routing flow on an inference graph, gives us a clear picture of the model serving traffic.
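As a sketch of what such instrumentation could look like (using the Python OpenTelemetry SDK with a console exporter for illustration; in practice the spans would be exported to a backend such as Jaeger, and the component functions below are placeholders):

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Console exporter for illustration; swap in an OTLP/Jaeger exporter to ship
# spans to a tracing backend.
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(ConsoleSpanExporter())
)
tracer = trace.get_tracer("inference-service")

def transform(x): return x          # placeholder input transformer
def enrich(x): return x             # placeholder online feature store lookup
def model_predict(x): return 0.0    # placeholder model forward pass

def handle_request(raw_input):
    # One parent span per prediction, with child spans per component,
    # so the latency breakdown is visible for an individual request.
    with tracer.start_as_current_span("predict") as span:
        span.set_attribute("model.name", "cars-price")  # hypothetical model name
        with tracer.start_as_current_span("transform"):
            features = transform(raw_input)
        with tracer.start_as_current_span("feature_store_lookup"):
            features = enrich(features)
        with tracer.start_as_current_span("model_server"):
            return model_predict(features)
```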

Key tools in this category include Prometheus, Grafana, Jaeger, and Kiali.

Model server metrics

After training and serializing a model to a model archive format, there are some metrics we can look at when loading it into a model server for serving, especially during the initialization and normal operation phases. A model file can be a few bytes in size (imagine a linear regression model, where we just store the slope and the intercept) or some tens of gigabytes (see the 6-billion-parameter GPT-J model).

When the model server loads the model file from storage, it copies the computation graph and the weight and bias values to the CPU memory of the model server. If a GPU is used, there is an additional operation of copying the model from CPU RAM to GPU RAM. These copy activities can take many seconds and are measured by the model load (init/restore graph) time.

Some patterns allow lazy initialization, where the full model is not loaded into memory up front but is later warmed up with a few normal requests. The speed of the initialization of a model server has a direct impact on the ability of the system to scale out fast enough to meet increasing demand. Warmup latency is the metric that tracks this specific case.
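A minimal sketch of capturing these two metrics, assuming a PyTorch model server where the whole module was serialized to a checkpoint (the file path, input shape and warmup count are made up for illustration):

```python
import time
import torch

start = time.perf_counter()
model = torch.load("model.pt", map_location="cpu")  # storage -> CPU RAM (full nn.Module assumed)
model.to("cuda")                                    # CPU RAM -> GPU RAM (requires a GPU)
model.eval()
load_time = time.perf_counter() - start             # model load (init/restore graph) time

# Warm up the lazily initialized parts (CUDA context, kernels, caches)
# with a few representative requests and track the warmup latency.
dummy_batch = torch.randn(1, 32, device="cuda")     # hypothetical input shape
start = time.perf_counter()
with torch.no_grad():
    for _ in range(5):
        model(dummy_batch)
torch.cuda.synchronize()
warmup_latency = time.perf_counter() - start

gpu_memory_bytes = torch.cuda.memory_allocated()    # GPU memory allocation of this server
print(load_time, warmup_latency, gpu_memory_bytes)
```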

histograms of tensorflow serving computation graph metrics exposed in prometheus buckets

To measure the actual execution time of the computation graph (the forward pass), we have the graph runtime metric, which corresponds to the latency discussed above and is observed as the prediction latency at a higher service abstraction level.

Finally, as a GPU’s memory can be shared in fractions among many model servers, each model server instance should expose the size of its GPU memory allocation.

All these metrics can also be stored in a time-series database, and since model servers usually come in the form of microservices, it is a best practice to expose histograms of these data, preferably in Prometheus buckets.
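For a custom Python model server, exposing them could be sketched as follows with prometheus_client (the metric names, bucket boundaries and recorded values are illustrative):

```python
from prometheus_client import Gauge, Histogram

MODEL_LOAD_TIME = Histogram(
    "model_load_time_seconds", "Time to load the model from storage",
    ["model_name", "model_version"],
    buckets=(0.5, 1, 2, 5, 10, 30, 60),
)
GRAPH_RUNTIME = Histogram(
    "graph_runtime_seconds", "Forward-pass execution time",
    ["model_name", "model_version"],
    buckets=(0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1),
)
GPU_MEMORY = Gauge(
    "gpu_memory_allocated_bytes", "GPU memory allocated by this model server",
    ["model_name", "model_version"],
)

# Recorded once at startup and on every request respectively,
# e.g. with the values measured in the earlier sketch:
MODEL_LOAD_TIME.labels("cars-price", "v3").observe(12.4)
GRAPH_RUNTIME.labels("cars-price", "v3").observe(0.017)
GPU_MEMORY.labels("cars-price", "v3").set(2.1e9)
```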

Payload logging

The result of any inference request is the new knowledge produced by a model, and it encapsulates the value of the business or product that this model supports. Such a treasure should be stored and processed as a first-class citizen. Logging that payload (both request and response) is a fundamental principle, because we can use it to generate new training datasets to refresh our models, understand what we predict, and even explain the behavior of the models. Payload logging is essentially an enabler for downstream monitoring tasks such as data validation, drift detection, and explainability.

the inference request and the response from a model which evaluates the price of cars
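A minimal sketch of wrapping such a payload in a CloudEvent with the Python cloudevents SDK, before handing it to a broker (the event type, source and payload fields below are made up for illustration):

```python
from cloudevents.http import CloudEvent, to_structured

# One event per prediction, carrying both the request and the response,
# so downstream processors can rebuild datasets and compute statistics.
attributes = {
    "type": "org.example.inference.response",  # hypothetical event type
    "source": "serving/cars-price/v3",         # hypothetical model identifier
}
payload = {
    "request": {"instances": [{"year": 2014, "mileage": 81000, "fuel": "petrol"}]},
    "response": {"predictions": [10450.0]},
}
event = CloudEvent(attributes, payload)

# Serialize to a structured HTTP message that can be POSTed to a broker
# (for example a Knative Eventing broker endpoint).
headers, body = to_structured(event)
```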

Technically, such a payload should be treated as an event, channeled through brokers that make sure it is available to downstream components, the processors. Such processors can be tools that generate descriptive statistics about the incoming data and their new distribution, in order to compare it against older versions and calculate drift. The distance metrics produced for the distributions of features can then be used as trigger events for retraining jobs: in practice, Jensen-Shannon divergence for numerical features and L-infinity distance for categorical features. Similarly, there are efforts to provide such metrics for distributions of data in the embedding space.
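As a sketch of those two distance metrics, computed over a window of logged payloads versus the training data (using SciPy’s Jensen-Shannon implementation; the binning and the retraining threshold are illustrative choices):

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def js_distance(train_values, serving_values, bins=30):
    """Jensen-Shannon distance (square root of the divergence) between
    two numerical feature distributions, estimated via shared histogram bins."""
    edges = np.histogram_bin_edges(np.concatenate([train_values, serving_values]), bins)
    p, _ = np.histogram(train_values, bins=edges, density=True)
    q, _ = np.histogram(serving_values, bins=edges, density=True)
    return jensenshannon(p, q)

def linf_distance(train_categories, serving_categories):
    """L-infinity distance between two categorical feature distributions."""
    cats = sorted(set(train_categories) | set(serving_categories))
    p = np.array([np.mean(np.array(train_categories) == c) for c in cats])
    q = np.array([np.mean(np.array(serving_categories) == c) for c in cats])
    return np.max(np.abs(p - q))

# A scheduled processor could compare a window of logged payloads against the
# training snapshot and trigger retraining when a threshold is crossed, e.g.:
# if js_distance(train_df["mileage"], window_df["mileage"]) > 0.1: trigger_retraining()
```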

the inference response of an object detector serialized in a cloudevent

Tools in the space of payload logging include CloudEvents, OpenSearch, and Knative Eventing.

Data validation and drift detection

A data validation component has three roles in an ML pipeline: one in each of the training, serving, and post-serving/retraining phases.

During training, it generates the schema of the data snapshot used for training a particular model version and stores it in the metadata store, attached to that version, next to the data snapshot digest from the input data platform. The schema is a collection of descriptive statistics of the features used for that model version: their distributions, data types, and the ranges of their values.
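With TFDV, for example, this step can be sketched as follows (assuming the training snapshot is available as a pandas DataFrame called train_df, a placeholder; persisting the schema in the metadata store is left out):

```python
import tensorflow_data_validation as tfdv

# Compute descriptive statistics over the training snapshot and infer a schema:
# feature types, value ranges and the expected categorical vocabularies.
train_stats = tfdv.generate_statistics_from_dataframe(train_df)
schema = tfdv.infer_schema(statistics=train_stats)

# The schema protobuf is then stored next to the model version it was derived from.
tfdv.write_schema_text(schema, "schema_v3.pbtxt")
```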

descriptive statistics of categorical features calculated by tfdv and visualised by facets

It is also a consumer of the prediction event at serving time: the transformer component of the inference service can use the schema to validate whether an inference request falls within the value ranges of the training data and respond accordingly. A sample of such a data schema is shown below:

a categorical feature described in a tfx data schema stored in protobuf

Finally, it can also periodically collect windows of inferred data to check whether their distribution is similar to the one used for training. The detection of training/serving skew is a task that depends on such similarity metrics.
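Continuing the earlier TFDV sketch, such a periodic check could look like this (serving_window_df stands for a window of logged payloads, and the feature name and threshold are illustrative):

```python
import tensorflow_data_validation as tfdv

# Compare a window of logged serving payloads against the training schema.
serving_stats = tfdv.generate_statistics_from_dataframe(serving_window_df)
anomalies = tfdv.validate_statistics(statistics=serving_stats, schema=schema)

# For numerical features, a drift comparator with a Jensen-Shannon threshold can
# be attached to the schema before validating against the training statistics.
tfdv.get_feature(schema, "mileage").drift_comparator.jensen_shannon_divergence.threshold = 0.1
drift_anomalies = tfdv.validate_statistics(
    statistics=serving_stats, schema=schema, previous_statistics=train_stats
)
```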

Tools in that space are TFDV, Great Expectations, Deequ, and Alibi Detect.

Trustworthy AI

The use of explainers on model serving is a pattern used for explainability, fairness detection, and adversarial robustness monitoring: the holy trinity for handling the upcoming EU regulations in the space of AI ethics. This guidance should not be seen as a compliance problem, but as a methodology for business opportunities in building user trust, on top of what GDPR-type regulations have brought. The pattern is that these capabilities are deployed as components, in the form of microservices external to the model predictor, used for the downstream tasks mentioned above.

In the case of explainability, different algorithms are available for different modalities, depending on whether we need to understand the data or the model, based on the features or just on samples, with local or global explanations, providing case-based reasoning, interpretability, and the proxy metrics of faithfulness and monotonicity. Faithfulness is the correlation between the attribute importance weights and the corresponding effect on the classifier. Monotonicity measures the effect of individual features by evaluating how model performance changes as each attribute is incrementally added in order of increasing importance. An anchor explainer can report the feature importance in the case of tabular data, or highlight the superpixels of an image to show the region used for the decision.
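For instance, a sketch of an anchor explanation for tabular data with Alibi (predict_fn, feature_names, X_train and the instance x are placeholders for the model’s prediction function, the column names, the training data and the request to explain):

```python
from alibi.explainers import AnchorTabular

# Fit the explainer on (unlabelled) training data so it can sample perturbations.
explainer = AnchorTabular(predict_fn, feature_names)
explainer.fit(X_train)

explanation = explainer.explain(x, threshold=0.95)
print(explanation.anchor)     # the rule that "anchors" the prediction
print(explanation.precision)  # fraction of similar instances with the same prediction
print(explanation.coverage)   # fraction of instances the anchor applies to
```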

For fairness detection, there are algorithms to mitigate bias in datasets and models, as well as a rich list of metrics, including disparate impact to uncover discriminatory intent, group fairness metrics derived from selection and error rates (including rich subgroup fairness), sample distortion metrics, the generalized entropy index, differential fairness, and bias amplification.
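As an example of the first of those metrics, a sketch of computing disparate impact with AIF360 (the DataFrame df, the label column and the protected attribute are placeholders):

```python
from aif360.datasets import BinaryLabelDataset
from aif360.metrics import BinaryLabelDatasetMetric

# df is a pandas DataFrame with a binary label column and a binary protected
# attribute column -- both names are placeholders for illustration.
dataset = BinaryLabelDataset(
    df=df,
    label_names=["approved"],
    protected_attribute_names=["gender"],
)
metric = BinaryLabelDatasetMetric(
    dataset,
    privileged_groups=[{"gender": 1}],
    unprivileged_groups=[{"gender": 0}],
)
# Ratio of favorable outcome rates; values far below 1 indicate disparate impact.
print(metric.disparate_impact())
```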

Finally, there are attack techniques to evaluate and defend models against the adversarial threats of evasion, poisoning, extraction, and inference. Metrics include empirical robustness, loss sensitivity, CLEVER, and the Wasserstein distance.
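And as a sketch of the first of these metrics, empirical robustness can be estimated with the Adversarial Robustness Toolbox by wrapping a fitted classifier (clf and X_test are placeholders, and we assume a model that exposes gradients, such as a logistic regression):

```python
from art.estimators.classification import SklearnClassifier
from art.metrics import empirical_robustness

# Wrap the fitted scikit-learn model so ART can query predictions and gradients.
classifier = SklearnClassifier(model=clf)

# Average minimal FGSM perturbation needed to change the model's predictions;
# higher values suggest a more robust model.
robustness = empirical_robustness(
    classifier, X_test, attack_name="fgsm", attack_params={"eps_step": 0.1}
)
print(robustness)
```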

Tools in the space of explainers are AIX360, AIF360, Adversarial Robustness Toolbox, and Alibi.

Conclusion

This plethora of metrics should be seen as the elegant approach to running models in production. It should not be omitted by AI practitioners who strive for operational excellence, but embraced as a continuous effort of improvement. The level of their adoption mirrors the maturity of MLOps in our group, and ultimately our ability and readiness to use AI to generate business value.

I’d like to thank my colleagues from the Prosus AI team for their suggestions and help in structuring this material into a blog post format. Please feel free to ask questions or provide suggestions in the comments section, or reach out to us at datascience@prosus.com.
