
Evaluation of classification models on unbalanced production data

Taming production data imbalance when comparing classification models

Gleb Vazhenin
11 min read · Jan 13, 2022


When it comes to benchmarking machine learning models, the reliability of the analysis plays the most important role. To draw the right conclusions, not only do the right metrics need to be selected, but the test dataset also needs to be class-balanced while still reflecting the real production load.

Here at Bumble Inc, the parent company of Bumble and Badoo, two of the world’s highest-grossing dating apps with millions of users worldwide, we realised while working on safety moderation that analysing the production photo distribution directly was not going to give us the insights we wanted, because the data is heavily unbalanced.

In the light of this, we added an artificial bias to our testing dataset and it helped us to reach conclusions that were far from obvious. We conducted an interesting comparative analysis between our internal photo moderation models and those of third-party providers. In this post, we will cover the methodology we used.

Where did the task come from?

Our mission is to create a world where all relationships are healthy and equitable. Therefore, user safety is our highest priority. We are constantly striving to improve our controls and processes around user safety. Along with photo moderation, we also use information from text data to help improve user safety. We have implemented a rude-message detector in one of our apps. Find out more about that here.

A massive number of profile photos are uploaded to the platforms every single day, and we are keen to make the content as safe as possible. Currently, profile photos uploaded to the app undergo a two-step moderation process. The first step is automatic moderation based on various computer vision models, and the second is manual moderation. Photos that fail automatic moderation, and are therefore potentially unsafe, are sent for manual moderation.

Given how sensitive the task is, human moderation is very thorough and time-consuming, which is why we wanted to automate as much of the moderation as possible while keeping its quality high. We are focused on improving our photo moderation algorithms, so we are not only improving the models we built in-house but also trying to leverage open-source frameworks and various third-party models.

Unbalanced production data and how to deal with it

Class balance is a common problem for both training and validating a model. The production distribution of samples is rarely uniform: many common tasks, such as fraud detection, spam filtering and anomaly detection, have a class ratio lower than 1:100. The problem becomes even more severe when the model you want to test has limited request capacity or staging environment constraints. For example, with only 10,000 available requests and a 1:100 class distribution, quality assessment is almost impossible, since the roughly 100 minority-class cases might be very specific and will not generalise to the production distribution. To prepare a solid model quality comparison under these limitations, metrics should be chosen carefully and the dataset collected smartly.
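As a rough, purely illustrative sketch of why so few minority-class samples make quality estimates unreliable, the normal approximation to the binomial shows how wide the uncertainty around a recall estimate is when it is based on only about 100 positives (the function name and numbers below are ours, not production figures):

```python
import math

def recall_interval(recall_estimate, n_positives, z=1.96):
    """Rough 95% confidence interval for a recall estimated from
    n_positives minority-class samples (normal approximation)."""
    se = math.sqrt(recall_estimate * (1 - recall_estimate) / n_positives)
    return recall_estimate - z * se, recall_estimate + z * se

# With only ~100 unsafe photos out of 10,000 requests, an observed recall
# of 0.90 comes with a wide uncertainty band:
print(recall_interval(0.90, 100))    # roughly (0.84, 0.96)

# With 5,000 unsafe photos the band becomes much tighter:
print(recall_interval(0.90, 5000))   # roughly (0.89, 0.91)
```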

Common metrics for comparing classification models

Here follows a brief description of the most popular metrics which are used for comparing classification models and for quality assessment. We will return to the example of a model that outputs the probability of a photo being unsafe. A probability above a certain threshold indicates the photo might be unsafe.

Confusion matrix

The confusion matrix represents all possible classification cases. Leveraging it for the photo moderation task:

  • True Negative (TN) — when both model and moderator decided that the photo is safe.
  • False Negative (FN) — when the model decided that the photo is safe, while in reality, it is not (based on the moderator decision).
  • True Positive (TP) — when both model and moderator agreed that the photo is unsafe.
  • False Positive (FP) — when the model decided that the photo is unsafe, while actually, it is safe (based on the moderator decision).
Confusion matrix layout
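As a minimal sketch of how these four cells can be computed, assuming labels where 1 means “unsafe” and 0 means “safe” (toy data, using scikit-learn):

```python
from sklearn.metrics import confusion_matrix

# 1 = unsafe (positive), 0 = safe (negative); toy labels for illustration
moderator_labels = [0, 0, 0, 1, 1, 0, 1, 0, 0, 1]   # ground truth
model_labels     = [0, 1, 0, 1, 0, 0, 1, 0, 0, 1]   # model decisions

# With labels=[0, 1] the matrix is laid out as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(
    moderator_labels, model_labels, labels=[0, 1]
).ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")   # TN=5, FP=1, FN=1, TP=3
```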

The main advantage of using a confusion matrix is that we can measure the specific algorithm outcome and easily transfer it to the production load. For example, if we’re going to automate all the safety moderation, the FN part of the matrix will show us how many unsafe photos will be passed to the platform and FP will represent the number of unnecessary moderations. Moreover, with these metrics, we can easily estimate moderation costs. For instance, if the model performs well on negative cases (TN+FN with a low FN), and we have decided not to apply manual moderation for them, we can calculate how many moderators we will need to handle all the positive (TP+FP) cases.

Dependence on a model threshold can be considered the main disadvantage of the confusion matrix for model benchmarking: we would have N confusion matrices for the N thresholds we are testing, which increases the complexity of the analysis. To provide comprehensive and more digestible insights, the confusion matrix elements are usually combined into aggregated measures. The most common metrics based on combinations of confusion matrix outputs are Accuracy, Precision, Recall and F-score.

Accuracy, Precision, Recall, F-score

Accuracy, precision, recall and F-score are combinations of the confusion matrix outputs, each of which emphasises a specific aspect of the model’s behaviour.

Confusion matrix based metrics

Because all these metrics are derived from the confusion matrix, they inherit its dependence on a specific threshold. They yield useful business insights: for example, the rate of unsafe photos we would pass to the platform if we started using a model without manual moderation (1-recall), or the rate of unnecessary moderation we would incur (1-precision).

However, class imbalance can lead to wrong decisions. For example, with a production load of 97% safe photos and 3% unsafe photos, a dummy model that always outputs 0 (safe) achieves an accuracy of 0.97, which looks high. And the threshold dependency does not allow us to compare models’ performance until we have tried all possible thresholds. The latter problem is solved by calculating the area under the curve that represents the precision-recall dependency over a range of thresholds.
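Before moving on, here is a small sketch with toy numbers (not our production data) that makes the accuracy trap concrete and shows how the business rates above can be read from recall and precision:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Toy ground truth with a 97/3 split: 1 = unsafe, 0 = safe
y_true = np.array([1] * 300 + [0] * 9700)

dummy_pred = np.zeros_like(y_true)            # always predicts "safe"
print(accuracy_score(y_true, dummy_pred))     # 0.97, yet it catches nothing
print(recall_score(y_true, dummy_pred))       # 0.0 - no unsafe photo caught

# For a real model's predictions (placeholder values here), the business
# rates discussed above would be:
model_pred = np.random.default_rng(0).integers(0, 2, size=y_true.shape)
print("unsafe photos passed to platform:", 1 - recall_score(y_true, model_pred))
print("unnecessary moderation rate:", 1 - precision_score(y_true, model_pred))
```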

AUC-ROC, AUC-PR

The problem of a threshold dependency is solved by the AUC (area under the curve) approach. For a range of thresholds, specific metrics are calculated:

  • Precision and Recall for the PR curve
  • True Positive Rate (sensitivity) and False Positive Rate (1-specificity) for the ROC curve.
AUC-ROC, AUC-PR
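A minimal sketch with scikit-learn, assuming the model exposes a continuous “unsafe” score rather than a hard label (toy values below):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

# Toy data: 1 = unsafe, 0 = safe; scores are the model's "unsafe" probabilities
y_true  = np.array([0, 0, 1, 0, 1, 1, 0, 0, 1, 0])
y_score = np.array([0.1, 0.4, 0.8, 0.2, 0.7, 0.3, 0.05, 0.6, 0.9, 0.15])

print("AUC-ROC:", roc_auc_score(y_true, y_score))
# average_precision_score is the usual single-number summary of the PR curve
print("AUC-PR :", average_precision_score(y_true, y_score))
```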

The AUC approach deals with the problem of threshold specificity, which is one of its greatest and most useful advantages. On the other hand, it cannot give us the business insights we need. The ROC AUC can be interpreted as the probability that a randomly chosen unsafe photo is scored higher than a randomly chosen safe one, but that is still not enough to understand how the model will perform in production. Because of this, the lack of business insights can be considered the main disadvantage of the AUC-ROC and AUC-PR metrics.

How to deal with production imbalance?

Let’s go back to the photo moderation example. Imagine we have a binary classification Model A in production, and we want to switch it to binary classification Model B, which is not deployed on a production-load infrastructure and therefore has limited request capacity. The main question is — how will Model B perform in production?

Imagine we have 97% safe photos in production. In terms of automation, we want to avoid sending photos with a “Negative” (0, TN+FN) model output to moderators, since that saves moderation time. On the other hand, we are keen to reduce the number of unsafe photos passed to the platform as far as possible, so the number of FN should be considered as well. In short, we want to increase automation (TN+FN) with as few FN cases as possible (safety).

Possible production distribution of cases (numbers are invented to demonstrate the example)

The diagram above shows a possible production distribution of model outputs against manual moderation results. We can see that there are only 300 unsafe cases here, which makes the ability of the data to generalise questionable. We could take ten thousand photos from production and pass them to Model B, but this approach could lead us to the wrong conclusion, because the positive cases could be very specific and might not represent the real model behaviour.

Instead, we can artificially bias the distribution of the dataset we are going to use, to make it balanced. There are two ways of doing that, both of which reduce the probability of the dataset being unrepresentative and therefore boost the reliability of the comparison analysis. This is especially important for the positive cases, where we want to be confident about the model’s performance.

Collecting the dataset in a “smart” way

The first option is “photo-based”. We can create a dataset that is balanced in terms of safe and unsafe photos, while preserving the production distribution of confusion matrix cases within each class. In other words, we pick 5,000 safe and 5,000 unsafe photos at random. The distribution should look like this:

Artificially-biased distribution of cases, preserving the production distribution of confusion matrix cases. “Photo-based” distribution (numbers are invented to demonstrate the example)

Selecting this option allows us to compare models on a more general dataset, since we have an equal quantity of unsafe and safe photos. To get accurate business insights into safety and automation, we would then need to transfer the error from the confusion matrix back to the production distribution.
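A minimal pandas sketch of the “photo-based” split, assuming a dataframe of production photos with a moderator_label column where 1 means unsafe (column names and defaults are ours, for illustration only):

```python
import pandas as pd

def photo_based_sample(df: pd.DataFrame, n_per_class: int = 5000,
                       label_col: str = "moderator_label",
                       seed: int = 42) -> pd.DataFrame:
    """Pick an equal number of safe (0) and unsafe (1) photos at random,
    keeping whatever confusion-matrix distribution they naturally have."""
    safe = df[df[label_col] == 0].sample(n_per_class, random_state=seed)
    unsafe = df[df[label_col] == 1].sample(n_per_class, random_state=seed)
    return pd.concat([safe, unsafe]).sample(frac=1, random_state=seed)
```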

The second approach is “model-based”. We can collect output-specific photos (in terms of Model A), making the confusion matrix cases balanced.

Artificially-biased distribution of cases concerning Model A output. “Model-based” distribution.

If we collect a dataset in which each of the cases (TN/FP/FN/TP) makes up 25%, we can compare the performance of Model A and Model B in production, as well as find out in further analysis whether Model B could help us with misclassified samples. This approach is very similar to what happens in the ‘boosting’ technique. Boosting is an ensembling algorithm that combines multiple weak learners into a strong one, so understanding the strengths and weaknesses of both algorithms could help us create a much better model overall. Having this kind of split is beneficial not only for comparing models but also for gathering more insight into using the new model in an ensemble with the old one.
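A similar sketch for the “model-based” split, this time sampling an equal number of photos from each of Model A’s confusion-matrix cells (again, column names are illustrative assumptions):

```python
import pandas as pd

def model_based_sample(df: pd.DataFrame, n_per_cell: int = 2500,
                       label_col: str = "moderator_label",
                       pred_col: str = "model_a_label",
                       seed: int = 42) -> pd.DataFrame:
    """Sample the same number of photos from each Model A cell:
    TN, FP, FN and TP."""
    cells = [(0, 0), (0, 1), (1, 0), (1, 1)]   # (moderator, Model A) pairs
    parts = [
        df[(df[label_col] == truth) & (df[pred_col] == pred)]
        .sample(n_per_cell, random_state=seed)
        for truth, pred in cells
    ]
    return pd.concat(parts).sample(frac=1, random_state=seed)
```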

Let’s look closer at how we can compare models using the “model-based” biasing approach.

Measuring quality and transferring errors

After running Model B over the “model-based” biased dataset, we get the following confusion matrix:

Looking at the confusion matrix above, one could wrongly conclude that Model B is better than Model A, as there are more TN and TP cases. But we can see that this is not the case when we take a closer look at the error transfer:

Error transfer chart

Using the transfer chart above, we can analyse how Model B performs on specific Model A cases. Since we added an artificial bias to the dataset to make it balanced, we need to re-weight the samples back, according to the way the dataset was collected, in order to read off the production performance.

TN was 90% of the production load. Of the whole load, 77.47% stayed TN (2152/2500 = 86.08% of TN, and 86.08% × 0.9 = 77.47% of all production samples), while 12.53% turned into FP (348/2500 = 13.92% of TN, and 13.92% × 0.9 = 12.53% of all production samples).

FN was 0.1% of the production load. Of the whole load, 0.08% stayed FN, while 0.02% turned into TP (i.e. 80% and 20% of the FN cases respectively).

Doing similar calculations for the rest of the cases we see:

Error transfer chart as percentages of the production distribution. Since TN made up 90% of production, a fairly large part of the production distribution (12.53%) was converted to FP, which seriously compromised the overall model quality.
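A sketch of the re-weighting itself. The production shares are taken or derived from the numbers quoted in this post (90% TN, 0.1% FN, 97% safe, 3% unsafe), as are the TN and FN rows of the transfer; the FP and TP rows are placeholders we picked so that the totals roughly match the figures discussed next, since the full transfer chart is only shown as an image:

```python
# Share of each Model A case in the production load (from the example above)
production_share = {"TN": 0.90, "FP": 0.07, "FN": 0.001, "TP": 0.029}

# Out of the 2,500 photos sampled per Model A cell, how many did Model B
# label as negative (safe)?  TN and FN rows follow the text; FP and TP
# rows are illustrative placeholders.
model_b_negative = {"TN": 2152, "FN": 2000, "FP": 1000, "TP": 250}
n_per_cell = 2500

automation = 0.0      # share of production Model B would not send to moderators
unsafe_passed = 0.0   # share of production that is unsafe but auto-approved
for cell, share in production_share.items():
    negative_rate = model_b_negative[cell] / n_per_cell
    automation += share * negative_rate
    if cell in ("FN", "TP"):              # photos that are actually unsafe
        unsafe_passed += share * negative_rate

print(f"automation:    {automation:.2%}")     # ~80.6%, vs 90.1% for Model A
print(f"unsafe passed: {unsafe_passed:.2%}")  # ~0.37%, vs 0.10% for Model A
```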

Even though the confusion matrix was showing better results for Model B, we can now see that the production stats are worse, both for:

  • automation (TN+FN) — 90.1% turned out to be 80.65%
  • safety (FN) — 0.1% turned out to be 0.37%.

To see more clearly what happened, let’s look at a possible, simplified classification example:

Model A classification example. Red dots are the safe (negative) photos, blue ones are unsafe (positive) photos. The orange line (model A) separates the photos with some FP and FN cases.

The orange line here is Model A which separates negative (red) and positive (blue) samples. The plot above represents the production distribution of samples. When we balance our dataset in terms of confusion matrix cases, we are increasing the importance of FN, FP, and TP, and reducing the weight of TN:

Model A classification example with weighted confusion matrix cases. TN dots became thinner, as we’ve reduced the weight for TN, while FN, FP, and TP dots became thicker.

Now, let’s try to draw possible Model B behaviour. It might differ from the one presented, but this illustration can help us understand the underlying pattern in a simple example. Given what we’ve seen in the confusion matrix, Model B should work better on the emphasised FN and FP. But it might still have larger errors on the non-emphasised samples (those which were TP and TN according to Model A).

Model B classification example with weighted confusion matrix cases. We can see that, while we reduce the impact of the previously highlighted FP and FN, new FP and FN cases appear.

And that is indeed the case here: new FN and FP appeared on the previous TP and TN respectively. Finally, un-weighting the samples back to the production distribution:

Model B classification example on the production distribution. We see that, while we’ve dealt with the previously highlighted FP and FN cases, new ones have appeared and become a large part of the whole production load.

Consequently, we can see that Model B is going to perform worse than Model A in production, so the decision is to stay with Model A for now. However, further investigation into a model that identifies some complicated cases (with respect to Model A) better could lead us to a more solid overall solution that uses the strengths of both Model A and Model B. So the next step would be to try to combine these models in order to get better overall performance.

Conclusion

Keeping user safety up to a decent level has its challenges, especially when it comes to machine learning. We remain keen to improve our photo safety moderation systems and are constantly trying the latest solutions and models.

To come to the right conclusions when comparing model performance, not only should insightful metrics be selected, but the dataset should also be carefully prepared and the results transferred to the production load.

The growing popularity of AI is bringing lots of new solutions to the market. Being able to compare similar solutions accurately is key to success.
