How to evaluate and monitor a Machine Learning model from experiment to production?

Published in VeepeeTech · 14 min read · Apr 12, 2022

This article was written by Benoît LEGUAY and Mohamed Amine ZGHAL.

A/ Introduction

Veepee, the European leader in online flash sales, has invested in a homemade recommender system to rank the 200+ sales on its website and to personalize its communications. The purpose of this Machine Learning algorithm is to display the right offer to each member, thus maximizing the conversion rate (number of buyers divided by number of visitors) for the benefit of all business units and customer segments. This service was built for batch (communications) as well as live (website) recommendations, and meets several technical requirements such as a response time under 100 milliseconds. More information about the context and the algorithm we built is available in this previous article. Since the project kick-off 4 years ago, our team has shipped more than 50 flavors of our recommender system to production, each time adding new features and bringing more value to our customers: members and brands. We’ve been able to iterate frequently and identify valuable new features through constant evaluation and monitoring of our models at every step from training to production.

Our architecture

Our current architecture can be split into 6 main building blocks with slight differences between model training and serving pipelines:

  • Data preprocessing: We consume tracking data (clicks, views, add-to-carts, etc.) and operational data (customers, sales, products, orders, etc.) in batch from our data warehouse and in stream (for serving) through pub/sub events from our tracking tool. During this step, we apply simple transformations, mainly filters.
  • Feature engineering: We transform the customer, sale, product and customer behavior data into meaningful features to feed our ML models.
  • Data ingestion: We populate our production databases with pre-computed and live features for batch and live inference.
  • Experiments: We experiment with new features and architectures by training and evaluating several models.
  • Model serving: We push the best models to production (API) by running frequent A/B tests.
  • Monitoring: We monitor the performance of our models in production with business metrics (such as conversion rate, turnover per visitor, etc.) and technical metrics (such as average latency, fallback ratio, data freshness, etc.).
Figure: Recommender system building blocks

B/ Evaluation

  1. Training evaluation

Before going through the metrics we use, let’s briefly recall our model architecture.

Our recommender system uses a Siamese neural network architecture with a pairwise loss.

The pairwise architecture needs to be fed with 2 examples at the same time. In the Veepee context, 1 example describes a user interacting with a specific sale. An observation is therefore composed of 2 examples: one positive and one negative. Imagine a user visits our website and makes a purchase: the purchased sale is a positive example and the rest of the sales on the homepage are considered negatives. The goal of the Siamese architecture is to maximize the distance between a positive and a negative example. The loss therefore captures the relative difference between the scores assigned to the two examples, and it is used to update the model’s weights through gradient descent.

Figure: Positive and negative examples

This Siamese architecture is relevant for training because the model learns a score based on a relative difference between 2 examples. At inference time we input all examples and rank them by descending output score, which gives a ranked list of sales for each customer. A high output score (high affinity between a member and a sale) results in the sale being ranked high on the homepage; a low output score results in a low rank.

Learn more about our recommender system in the previous article Learning to rank at Veepee.

Figure: Pairwise model architecture
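To make the pairwise setup concrete, here is a minimal sketch of a hinge loss on one (positive, negative) observation. It is illustrative only: the two scores would come from the Siamese network’s branches scoring the purchased sale and a non-purchased sale for the same user, and the margin value is an assumption, not the one we use.

```python
# Minimal sketch: pairwise hinge loss on a (positive, negative) observation.
def hinge_pairwise_loss(pos_score: float, neg_score: float, margin: float = 1.0) -> float:
    """Zero when the positive example is scored at least `margin` above the negative one."""
    return max(0.0, margin - (pos_score - neg_score))

print(hinge_pairwise_loss(2.3, 1.1))  # 0.0 -> this pair is already well separated
print(hinge_pairwise_loss(0.8, 1.1))  # 1.3 -> the gradient will push the scores apart
```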

It’s important to evaluate our model during training for 2 main reasons:

First, it is useful to compare relative performance between models with different hyper-parameters, architectures, loss functions, etc. Since the space of all parameter combinations is too large to explore and pushing a model to production is costly and time consuming, we want to assess model performance as early as possible, i.e. at the training step. This allows us to choose the best sets of parameters and keep improving some models rather than others.

Figure: Loss evolution on an evaluation dataset during training step for several models

For example, in the figure above, we would prefer to continue the training step for the orange and red models and stop it for the blue one that underperforms.

Second, it allows the team to track potential learning failures. Indeed, the number of parameters and layers in deep learning creates a non-convex loss function where it is easy to end up with suboptimal solutions (local optima, saddle points, divergence, etc.). Our production models are usually trained over a full week, so we want to avoid wasting time and computation resources on models that diverge. In our training setting, we save the model state frequently; these training metrics are used to select the best version of a model and to define our early stopping threshold.
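As an illustration of how frequent checkpoints and an early stopping threshold fit together, here is a minimal sketch (not our production code) that keeps the checkpoint with the best evaluation loss and stops once the loss has stopped improving; the patience value is an assumption.

```python
# Minimal sketch: pick the best checkpoint by evaluation loss, stop on stagnation.
def select_checkpoint(eval_losses: list[float], patience: int = 5) -> int:
    """Return the index of the best checkpoint, stopping early when the loss stagnates."""
    best_idx, best_loss, since_best = 0, float("inf"), 0
    for i, loss in enumerate(eval_losses):
        if loss < best_loss:
            best_idx, best_loss, since_best = i, loss, 0
        else:
            since_best += 1
            if since_best >= patience:
                break  # training would be stopped here
    return best_idx

print(select_checkpoint([0.9, 0.7, 0.65, 0.66, 0.7, 0.71, 0.72, 0.73, 0.74]))  # -> 2
```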

We use 2 metrics at training time to achieve relative performance assessment and failure tracking:

  • Loss function

The loss function is the measure that most closely represents how our model performs on the task assigned to it. In our case we use the hinge loss function, but the following applies to any loss. As a general rule, you want your loss to decrease monotonically. Other patterns can indicate various problems: for example, an exploding loss can point to an unsuitable learning rate or to NaNs in the input data, while repetitive patterns in the loss curve might indicate that your input data needs to be shuffled. Comparing the training loss with the evaluation loss can reveal overfitting.

  • Pairwise accuracy

As explained earlier, during training our model takes 2 examples at the same time: a positive and a negative one. We want to train our model so that the positive example gets a greater score than the negative one, i.e. the positive example is ranked before the negative one when we order the sales by decreasing score. This matters because at inference time, for each member, we get the score associated with each sale available on the home page and then use these scores to rank the sales.

We define pairwise accuracy as the percentage of observations where the score of the negative example is lower than that of the positive one. This is equivalent to the percentage of well-ranked pairs, and thus of good recommendations.
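Here is a minimal sketch of this pairwise accuracy, computed on hypothetical arrays of model scores for the positive and negative example of each observation.

```python
# Minimal sketch: share of observations where the positive example outscores the negative one.
import numpy as np

def pairwise_accuracy(pos_scores: np.ndarray, neg_scores: np.ndarray) -> float:
    return float(np.mean(pos_scores > neg_scores))

pos = np.array([2.1, 0.4, 1.7, 3.0])
neg = np.array([1.3, 0.9, 0.2, 2.5])
print(pairwise_accuracy(pos, neg))  # 0.75 -> 3 of the 4 pairs are well ranked
```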

These metrics strongly depend on the model used and should reflect its low-level goal (minimizing the objective function). They are computed on the training dataset at each step (a step being one training batch of observations) and on the evaluation dataset every 20,000 steps.

2. Offline evaluation

Offline evaluation metrics are calculated on the test dataset once the model has been trained. They aim at measuring metrics closer to the business goal in order to anticipate the model’s behavior in production.

Basically, while the training and evaluation datasets are composed of pairs of positive and negative examples, the test set is composed of the exhaustive list of examples, i.e. all sales available on the website at order time. In this list, one example is positive (the sale purchased by the customer) and the rest are negative (available but not purchased at order time). Therefore, we focus on ranking-based metrics.

  • Mean Rank, Mean Reciprocal Rank, Mean Normalized Rank and Distribution

The Mean Rank (MR) is the average rank of the positive (purchased) sale among all the available ones. We aim at minimizing it (i.e. if the algorithm had been in production, the purchased sale would have been ranked at the top of the website). The mean rank is a straightforward way to get a grasp of our model’s ability to predict how likely a member is to purchase a sale. It varies between 1 and the number of available sales on the website: the smaller the mean rank, the better we rank the purchased sale among all available ones.

Figure: Mean rank

The same kind of insight is given by the Mean Reciprocal Rank (MRR), which is the average of the inverse of the positive example’s rank. The MRR varies between 0 and 1: the closer it is to 1, the better the ranking.

Even though they are quite close, using both gives a better view of your model’s performance. In particular, the MRR is less affected by poorly ranked items than the MR; on the other hand, small changes in well-ranked items can impact its final score a lot.

Figure: Different MR and MRR for different rankings.

Comparing figures A and B, we can see that the MRR rewards examples ranked in the first positions more: its value is almost divided by 2 while the MR is unchanged. The difference between A and D shows that the MR is greatly impacted by outlier values. Finally, C shows that improving the MRR is a more complex task, as it requires the positive element to consistently be among the first positions.

Because the number of sales on the home page can vary a lot from one country to another, the Mean Normalized Rank (MNR) is a more appropriate metric when comparing the same model across countries. Since we cannot interfere with business decisions in a country (i.e. the number of sales displayed on the home page), we have to adapt our metrics to fit our comparison needs.

The main drawback of the MR/MRR/MNR is that averages are sensitive to outlier values, which makes them difficult to interpret without additional information. That is why we also look at the distribution function of the rank: it gives a clear idea of the variance and of the presence of outliers.
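To make these definitions concrete, here is a minimal sketch of the MR, MRR and MNR computed from hypothetical ranks of the purchased sale and the number of sales available for each order.

```python
# Minimal sketch: ranking metrics from 1-based ranks of the purchased sale.
import numpy as np

def mean_rank(ranks: np.ndarray) -> float:
    return float(np.mean(ranks))

def mean_reciprocal_rank(ranks: np.ndarray) -> float:
    return float(np.mean(1.0 / ranks))

def mean_normalized_rank(ranks: np.ndarray, n_sales: np.ndarray) -> float:
    # Normalizing by the number of available sales makes countries with
    # different homepage sizes comparable.
    return float(np.mean(ranks / n_sales))

ranks = np.array([1, 3, 10, 2])        # rank of the purchased sale per order
n_sales = np.array([200, 150, 250, 180])  # sales available at order time
print(mean_rank(ranks), mean_reciprocal_rank(ranks), mean_normalized_rank(ranks, n_sales))
```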

  • @K metrics

The @K metrics are quite useful to binarize the information contained in an ordered list. They split the list into 2 groups, the top K and the rest, allowing us to compute any kind of metric based on these groups. With this approach, we assume we don’t care about an item’s exact rank but only about whether or not it is in the top K. Under this assumption, we can obtain multiple insights:

  • Precision@K gives the frequency with which the positive sale is ranked in the top K.
  • Turnover@K captures the turnover generated by sales ranked in the top K. When comparing models, it gives a grasp of how different models tend to recommend different kinds of sales, based on their potential turnover.

At Veepee, we use K = 5 since the first 5 sales are the most viewed. This choice is based on our business rules and should always depend on them: if the structure of our e-commerce website highlighted 10 sales, we would use K = 10.
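Here is a minimal sketch of both @K metrics with K = 5; the input arrays (rank of the purchased sale and its turnover per order) are illustrative.

```python
# Minimal sketch: Precision@K and Turnover@K from per-order ranks and turnovers.
import numpy as np

def precision_at_k(ranks: np.ndarray, k: int = 5) -> float:
    """Share of orders where the purchased sale was ranked in the top K."""
    return float(np.mean(ranks <= k))

def turnover_at_k(ranks: np.ndarray, turnovers: np.ndarray, k: int = 5) -> float:
    """Turnover generated by purchases whose sale was ranked in the top K."""
    return float(np.sum(turnovers[ranks <= k]))

ranks = np.array([2, 7, 4, 1, 12])
turnovers = np.array([80.0, 45.0, 120.0, 60.0, 30.0])
print(precision_at_k(ranks), turnover_at_k(ranks, turnovers))  # 0.6, 260.0
```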

On our homepage, people mostly buy from one sale per session, so we usually have a single positive example per context. This is the reason we cannot use ranking metrics that rely on multiple positive examples, such as NDCG.

3. Online evaluation

This module, also called “Purchase Replay”, gives us feedback on our models in real time using our serving pipelines. Its great advantage is that it allows us to test a model in a production-like environment without having it actually run in production.

Basically, for each purchase made on the home page, we call the models currently in Purchase Replay, i.e. the ones we want to evaluate. We answer the following question: what would the sales ranks on the homepage have been if we had used this new model instead of the one running in production? Thanks to this online evaluation process, we can compute the metrics described in the section above and compare them with the results of the offline evaluation.

Once a model is deployed in Purchase Replay, we have a constant feedback loop with online data.

The second advantage of online evaluation is that it reproduces the production environment in terms of feature computation. This helps verify that our feature engineering process is identical in training/testing and in serving.

Indeed, to train and test a model we build a historical dataset (i.e. we compute the state of the features at the time of purchase), whereas when a model runs in production we compute the real-time state of the features. Because of that, we cannot use the exact same feature engineering process nor the same data sources, and since we are not free from human error, we need this pre-production environment.

For a given model, comparing its offline evaluation with its online evaluation allows us to verify that the feature engineering process is the same in the testing and serving phases. Also, by comparing the Purchase Replay (online) performance of a recently trained model with that of the model in production, we can make sure the new model will perform well once we serve it.
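A minimal, illustrative sketch of such a comparison (not our actual tooling): flag any metric whose Purchase Replay value drifts too far from its offline value; the tolerance and metric names are assumptions.

```python
# Minimal sketch: consistency check between offline and Purchase Replay metrics.
def check_consistency(offline: dict[str, float], replay: dict[str, float],
                      rel_tolerance: float = 0.05) -> dict[str, bool]:
    """True when the online value stays within `rel_tolerance` of the offline one."""
    return {
        name: abs(replay[name] - value) / abs(value) <= rel_tolerance
        for name, value in offline.items()
    }

offline = {"mean_normalized_rank": 0.12, "precision_at_5": 0.41}
replay = {"mean_normalized_rank": 0.13, "precision_at_5": 0.40}
print(check_consistency(offline, replay))
# {'mean_normalized_rank': False, 'precision_at_5': True} -> investigate the MNR features
```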

C/ Monitoring and alerting

Whenever a new model achieves great results at training, offline evaluation and Purchase Replay evaluation, we assess its performance in an A/B test. We split our customer base into multiple groups, and each group is shown a different flavor of our model. During the A/B test, we monitor the performance of the model from both business and technical perspectives.

  1. Business monitoring

We assess the business performance of our models in production by tracking a set of business metrics such as the sales entry rate, site conversion rate, turnover per site visitor and average order value. These metrics are computed at several levels of aggregation (by country, sector, customer segment, etc.).

Statistical significance is calculated automatically in order to estimate the A/B test end date (the number of remaining days needed to reach statistically significant results) and to ensure strong and reliable conclusions.

Figure: Snapshot of the statistical tests page from the business monitoring board
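As an illustration of the kind of significance test involved, here is a minimal sketch using a two-proportion z-test on conversion rates; the figures and the use of statsmodels are assumptions, not a description of our actual tooling.

```python
# Minimal sketch: two-proportion z-test on conversion rates (buyers / visitors)
# for a control and a variation group. Numbers are made up.
from statsmodels.stats.proportion import proportions_ztest

buyers = [5200, 5450]          # buyers in control and variation
visitors = [100_000, 100_000]  # visitors in control and variation

z_stat, p_value = proportions_ztest(count=buyers, nobs=visitors)
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")  # a small p-value supports a real uplift
```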

The metrics are gathered in a dedicated dashboard (DataStudio) that is updated on a daily basis. Looking at this daily snapshot of our production environment is great for analytics purposes but not enough to detect and troubleshoot production issues, so it is complemented with a live technical monitoring board.

2. Technical monitoring and alerting

We use Prometheus to extract technical metrics from our log streams and expose them in Grafana dashboards. We rely on the Service Level Indicator (SLI) and Service Level Objective (SLO) concepts as defined in the Google SRE book.

We monitor a large number of metrics that describe all our serving building blocks and set threshold-based alerts in order to detect any issue and identify its source. Whenever an alert is raised, the team receives a Slack message with useful information and a link to the monitoring board.

Here are some examples from our monitoring board:

  • Volume: Number of calls per variation

Looking at the number of calls per second per variation, we can recognise a daily pattern and use it to adapt our resource usage, for instance by allocating more resources in the morning to handle the connection peak at sales opening hour. It is also very useful for detecting issues: in the graph below, we can see the volume bottoming out on Thursday. Using this 7-day time range, we can spot unusual drops and then diagnose their root cause using complementary monitoring graphs.

Figure: Volume — Number of calls during a whole week (Mon-Sun)
  • Quality: Ratio of personalisation answers (vs popularity) per variation

Whenever our personalisation model is too slow (times out) to return a recommendation, we use a popularity-based fallback algorithm instead, i.e. we rank sales from the most popular to the least popular over the last hour. We monitor the ratio of personalisation-based recommendations over all recommendations; it is usually between 99% and 99.9%.
Whenever the ratio stays below 90% for more than 5 minutes, an alert is raised. This threshold is high enough to detect outages rapidly, but not so high that it triggers frequent false alarms. A sketch of how these metrics can be instrumented follows the examples below.

Figure: Quality — Ratio of personalisation answers (vs popularity) per variation
  • Data freshness: Watermark lag
    We stream purchase events and use them to compute live popularity features that feed our recommender system. We monitor the data freshness of these streaming events via the lag between event creation and consumption: the smaller, the better. This lag is usually under 1 minute, and an alert is raised whenever it stays above 10 minutes for more than 5 minutes, a compromise between fast detection and few false alarms (the sketch after these examples also shows how this lag can be measured).
Figure: Data freshness — Watermark lag
  • Infrastructure: Endpoint memory usage
    We have dissociated the infrastructure of our live service (website recommendations) from that of our batch service (email and push notification recommendations). We scale our resources (the number of pods running in our Kubernetes cluster) according to our needs: for example, communications are sent at specific times of the day, so we scale up during these time windows and scale back down once the communications are sent. We monitor resource usage (vCPU, memory, etc.) in order to ensure a good service level. Since this metric is very narrow, we use it to explore issues rather than detect them, so we did not set any alert on it.
Figure: Infrastructure — endpoint memory usage
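As referenced in the Quality and Data freshness examples above, here is a minimal sketch of how such metrics could be instrumented with the prometheus_client Python library; the metric and label names are illustrative, not our actual ones.

```python
import time
from prometheus_client import Counter, Gauge, start_http_server

# Counts answers by source, so the personalisation ratio can be derived in PromQL
# (e.g. with rate() over the two label values).
RECO_ANSWERS = Counter(
    "reco_answers_total",
    "Recommendation answers served",
    ["source"],  # "personalisation" or "popularity_fallback"
)

# Tracks the delay between purchase-event creation and consumption (watermark lag).
WATERMARK_LAG = Gauge(
    "purchase_stream_watermark_lag_seconds",
    "Delay between purchase event creation and consumption",
)

def record_answer(timed_out: bool) -> None:
    source = "popularity_fallback" if timed_out else "personalisation"
    RECO_ANSWERS.labels(source=source).inc()

def on_purchase_event(event_created_at: float) -> None:
    # event_created_at is a unix timestamp set by the event producer
    WATERMARK_LAG.set(time.time() - event_created_at)

if __name__ == "__main__":
    start_http_server(8000)  # expose /metrics for Prometheus to scrape
```

The threshold-based alerts themselves (ratio below 90% for 5 minutes, lag above 10 minutes for 5 minutes) would then be defined on top of these series in Prometheus.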

The technical monitoring board is quite exhaustive and divided into two pages: the first provides a general overview with global metrics, while the second provides more detailed information about our infrastructure.

We usually rely on alerts set on the general overview page to detect issues and then use the detailed page to isolate and understand the root cause.

3. Data quality monitoring

No automated data quality monitoring is available at the time of writing this article. Whenever we observe discrepancies in performance between the offline evaluation and the Purchase Replay evaluation, we manually activate what we call the “feature logs” and use them to understand and solve the discrepancy.
Feature logs are simply a dump in BigQuery of all the serving features fed to the model in production for every recommendation (1 recommendation = 1 user connection or 1 user communication). This is a huge volume of data, around 1 terabyte per day.

We then run an ad hoc comparison between the feature logs (serving features) and the offline-computed features used for model training and offline evaluation. This analysis is manual, time consuming and very expensive (storage and transformation costs).
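As an illustration, here is a minimal sketch (not our actual tooling) of such an ad hoc comparison, assuming the serving feature logs and the offline training features have been loaded as numeric pandas DataFrames.

```python
# Minimal sketch: per-feature comparison of simple statistics between serving
# feature logs and offline training features (numeric columns only).
import pandas as pd

def compare_features(serving: pd.DataFrame, offline: pd.DataFrame,
                     rel_tolerance: float = 0.1) -> pd.DataFrame:
    """Flag features whose mean or null rate drifts between serving and offline."""
    rows = []
    for col in serving.columns.intersection(offline.columns):
        s_mean, o_mean = serving[col].mean(), offline[col].mean()
        s_null, o_null = serving[col].isna().mean(), offline[col].isna().mean()
        mean_drift = abs(s_mean - o_mean) > rel_tolerance * max(abs(o_mean), 1e-9)
        null_drift = abs(s_null - o_null) > rel_tolerance
        rows.append({"feature": col, "serving_mean": s_mean, "offline_mean": o_mean,
                     "serving_null_rate": s_null, "offline_null_rate": o_null,
                     "suspicious": mean_drift or null_drift})
    return pd.DataFrame(rows)
```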

Our team is currently designing an automatic data quality monitoring and alerting tool to complete our monitoring and alerting system and reduce cost, time and manual work.

Conclusion

Delivering business value through ML requires exhaustive evaluation and monitoring throughout the lifetime of ML models. Evaluation at training time is already a well-known and well-mastered step in data science. Monitoring in production is trickier, although we can rely on the SRE methodology to build a solid monitoring and alerting system. Finally, ensuring data coherence between training and serving remains a fuzzy area where discrepancies are hard to detect and solve without a dedicated data quality monitoring tool.

We’re working on better detection and prevention of data discrepancies by building a dedicated data quality and coherence tool. Stay tuned for more information about this topic in our next articles!

Read previous articles from Veepee’s Customer eXperience Optimisation (CXO) team:
