Consolidating AI Performance Measurement in Production

Embracing a Single Holistic Metric

Omri Bar
Theator Tech
8 min read · Mar 3, 2024


This blog post is all about metrics. Metrics are key to a successful machine learning (ML) project. Success itself is tricky to define: outperforming the state of the art on an interesting benchmark or publishing a paper on a new method are significant milestones, but in the scope of this post, success means a project that reaches the production environment and impacts users in a meaningful way.

The main takeaway I will try to convey is that metrics used during development phases might not be an optimal representation of production performance. I will also suggest a more consolidated and streamlined approach to quantifying the impact of an ML system on the users in the production environment.

The transition to a single holistic metric discussed in this blog post is mainly suited to a system that provides many prediction outputs for each input sample, since that setting makes it harder to measure the end performance.

So what is the problem with the metrics I know and love?

When starting to work on a machine learning project, we usually begin with defining the project goal and determining what we want to achieve by setting the main key performance indicators (KPIs). With this goal in mind, we can start exploring two main questions of a new problem: what data do I need, and what metrics should I use?

Focusing on metrics, we can roughly state that each machine learning task usually requires 1–3 main measurements of success. Those metrics serve as the lighthouse the researcher optimizes for.


However, lighthouse metrics can be confusing and lead to different research directions. For example, a standard metric in classification problems is accuracy. But during training, as part of the backpropagation process, a loss function such as cross-entropy is used to adjust the model weights based on gradient information. Is cross-entropy always a good proxy for accuracy?
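As a toy illustration of how the two can disagree, here is a small numpy sketch (with made-up numbers, not from any real experiment) in which cross-entropy prefers one model while accuracy prefers the other:

import numpy as np

def accuracy(probs, labels):
    # fraction of samples where the highest-probability class is correct
    return np.mean(np.argmax(probs, axis=1) == labels)

def cross_entropy(probs, labels):
    # mean negative log-probability assigned to the true class
    return -np.mean(np.log(probs[np.arange(len(labels)), labels]))

labels = np.array([0, 0, 0, 0])

# Model A is barely wrong on every sample.
probs_a = np.array([[0.49, 0.51]] * 4)
# Model B is confidently right on three samples and confidently wrong on one.
probs_b = np.array([[0.95, 0.05], [0.95, 0.05], [0.95, 0.05], [0.02, 0.98]])

print(accuracy(probs_a, labels), cross_entropy(probs_a, labels))  # 0.00, ~0.71
print(accuracy(probs_b, labels), cross_entropy(probs_b, labels))  # 0.75, ~1.02

Cross-entropy ranks model A higher while accuracy ranks model B higher. The loss we optimize during training is only a proxy for the metric we report, and that metric is itself only a proxy for the value the model delivers in production.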

The byproduct of this process yields a model that can operate — at best — on a specific dev dataset in a somewhat sterilized dev environment.

Note: to avoid confusion, I'm not referring to a generalization or overfitting issue, as these can be measured and tackled, but rather to how a model will behave in production when in-the-wild scenarios occur.

Now let's consider a complex ML-based system built as a chain of models, where one model's output is the input of one or many other models. In this setup, the system can fail if one link of the chain fails. This is not a failure in the sense of raising an error, but a failure mode in the quality of the end results. Such degradation impacts the end-user experience in a way that is hard to measure using commonly used metrics.

With this in mind, I would like to show how to transition to a single holistic metric, called an impact metric, that represents the end performance of your entire AI system as viewed by your designated users.

Learning by example

To dive into the development of the impact metric, I need an example of data and related annotations. I'll use the following data and annotation structure:

Given a video as input, a team of annotators reviews the video and tags/marks specific key moments based on designated instructions.

A single action annotation uses a structure of three elements: [action type, start time, end time]. That is, classifying the action into a specific type while marking its start and end times (in seconds).

For example, in the following two videos, taken from the Kinetics 400 Dataset, the expected annotations are [throwing_discus, 6, 8] and [throwing_discus, 0, 2].

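Throughout the rest of this post, I'll add small code sketches to make the ideas concrete. Assume a minimal annotation representation like the following (the names are illustrative, not taken from a real codebase):

from typing import List, Tuple

# a single action annotation: (action type, start time in seconds, end time in seconds)
Annotation = Tuple[str, float, float]

# the two Kinetics-400 examples above, one annotation per video
video_1: List[Annotation] = [("throwing_discus", 6, 8)]
video_2: List[Annotation] = [("throwing_discus", 0, 2)]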

Note that although this is a relatively specific example common in video understanding tasks, the same paradigm works in any setting and setup you might have in your domain, whether video-related or not, as long as the system provides multiple predictions for a given input.

Now imagine manually annotating a large-scale dataset as described above, marking [type, start, end] for every action, and how lengthy this process would be. It becomes even harder for videos with extended durations, for example, in self-driving cars, sports, or the surgical domain.

To make the annotation process faster, more effective, cheaper, and scalable, a computer vision system can be developed to assist the annotators' workflow. This system can predict these actions as an initial guess, and the annotators can then review, fix, add, or delete. As the system improves over time, the entire annotation process can transition into a validation-only process.

From a researcher’s point of view, during the development of the model, the focus would be on optimizing specific metrics such as accuracy per second or mean average precision (mAP) — but does this represent well enough the experience that annotators would have once this model is deployed?

Developing a single holistic metric - Impact Metric

Our goal was to develop a metric that could digest two sets of inputs:
(1) annotations done manually by annotators and (2) the initial prediction of annotations done by the AI system.

For simplicity, consider the following two sets of input as an example:

Intuitively, we needed to measure how many changes the user made over the AI’s initial predictions and find a way to quantify each type of change. Let’s say that only adjusting an annotation’s end time (1st row above) is less ‘severe’ than deleting a complete annotation that is wrong (4th row above). Also, adding a full annotation that the system missed requires much more manual work than just deleting an incorrect annotation.

To mimic the annotator's work and quantify each adjustment, we used a proxy score proportional to the number of clicks required to make the adjustment. For example, a simple value fix equals 1 penalty point, a delete equals 2, and adding a new annotation equals 3. Obviously, the penalty mapping depends a lot on your product and application.
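As a minimal sketch, and assuming the click-based values above, the penalty mapping can be as simple as a lookup table:

# penalty per type of manual adjustment, roughly proportional to the clicks it takes;
# the exact values should be tuned to your product and annotation UI
PENALTY = {
    "fix": 1,     # adjust a single value (type, start, or end) of an existing annotation
    "delete": 2,  # remove an annotation the system predicted incorrectly
    "add": 3,     # create from scratch an annotation the system missed
}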

You’ve got a match!

If we look at the example above, we can see that some annotations are of the same type with different timing (1), and some have the exact timing but a different type (2). A tricky point here is how to match pairs, one from each set of annotations (algo and manual), and measure the proper penalty.

To achieve this, we used the Hungarian algorithm: “a combinatorial optimization algorithm that solves the assignment problem”. This method solves the matching problem using bipartite graphs.

We first needed to define three cost matrices:

  • C_type - a cost matrix for the type value - a binary matrix with 0 when the algo and manual types matched and 1 otherwise
  • C_start - a cost matrix for the start value - the absolute difference between the start times of all algo vs. manual (normalized by the video duration)
  • C_end - a cost matrix for the end value - the absolute difference between the end times of all algo vs. manual (normalized by the video duration)
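Here is a minimal sketch of how these three matrices could be built with numpy, assuming the (type, start, end) tuples defined earlier; the helper name build_cost_matrices is illustrative, not the production implementation:

import numpy as np

def build_cost_matrices(algo, manual, duration):
    """Build the three len(algo) x len(manual) cost matrices.

    `algo` and `manual` are lists of (type, start, end) tuples and
    `duration` is the video length in seconds.
    """
    n, m = len(algo), len(manual)
    c_type = np.zeros((n, m))
    c_start = np.zeros((n, m))
    c_end = np.zeros((n, m))
    for i, (a_type, a_start, a_end) in enumerate(algo):
        for j, (m_type, m_start, m_end) in enumerate(manual):
            c_type[i, j] = 0.0 if a_type == m_type else 1.0    # 0 when types match
            c_start[i, j] = abs(a_start - m_start) / duration  # timing gap, normalized
            c_end[i, j] = abs(a_end - m_end) / duration
    return c_type, c_start, c_end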

For the two sets in the example above, the resultant cost matrices were:

C_type = [[0 1 0]
          [0 1 0]
          [1 1 1]
          [0 1 0]]

C_start = [[ 0.  7. 16.]
           [ 7.  0.  9.]
           [16.  9.  0.]
           [20. 13.  4.]]

C_end = [[ 1.  9. 17.]
         [ 8.  0.  8.]
         [15.  7.  1.]
         [21. 13.  5.]]

The final cost matrix was the weighted sum of these three matrices:
C = W_type * C_type + W_start * C_start + W_end * C_end
where each C[i,j] was the cost of matching algo annotation i with manual annotation j, and all W's were set to 1.0.

Note that you can adjust the cost matrix definition and the weights depending on what is more important in your application.

Once we had the final cost matrix, the Hungarian algorithm found a complete assignment of algo annotations to manual annotations that achieved the minimal total cost. Given the matched pairs, we could start accumulating the penalty values:

  • For matched pairs - accumulating the fix penalty for each value the annotator had to adjust
  • For algo annotations without a match - applying the delete penalty
  • For manual annotations without a match - applying the add penalty

Summing up all the penalties gave us the penalty score.
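Putting the pieces together, here is a hedged sketch of the matching and penalty accumulation, using scipy's linear_sum_assignment as the Hungarian solver and reusing the PENALTY mapping and build_cost_matrices helper from the earlier sketches (the per-field fix counting is one possible interpretation, not necessarily the exact production logic):

from scipy.optimize import linear_sum_assignment

def penalty_score(algo, manual, duration, w_type=1.0, w_start=1.0, w_end=1.0):
    c_type, c_start, c_end = build_cost_matrices(algo, manual, duration)
    cost = w_type * c_type + w_start * c_start + w_end * c_end

    # Hungarian matching: minimal-cost assignment of algo to manual annotations
    algo_idx, manual_idx = linear_sum_assignment(cost)

    total = 0
    for i, j in zip(algo_idx, manual_idx):
        # matched pair: count one fix penalty per field the annotator had to adjust
        a_type, a_start, a_end = algo[i]
        m_type, m_start, m_end = manual[j]
        fixes = (a_type != m_type) + (a_start != m_start) + (a_end != m_end)
        total += fixes * PENALTY["fix"]

    # algo annotations left unmatched would be deleted by the annotator
    total += (len(algo) - len(algo_idx)) * PENALTY["delete"]
    # manual annotations left unmatched had to be added by the annotator
    total += (len(manual) - len(manual_idx)) * PENALTY["add"]
    return total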

Bear with me; we’re almost there

Ok, so we had a penalty score, but what does it mean?

A perfect score of zero is achieved when no changes are made and the manual annotations equal the algo results. But how should we interpret this value and compare samples that have different numbers of annotations?

To solve this, we also measured the ‘amount’ of work needed if the ML system didn’t provide any initial guess. Let’s call this value the scratch score.

Dividing the penalty score by the scratch score yielded the final impact metric value:

impact metric = penalty score / scratch score

Now, this final score is a very valuable measurement of our model's 'impact' on users in production:

  • impact metric > 1 means the AI is only interfering with the work process, as more changes were needed than annotating the sample from scratch.
  • 0 < impact metric < 1 means the AI provides value and help, as some changes were needed to adjust the algo results, but fewer than annotating the sample from scratch.
  • impact metric = 0 means a perfect match between all algo and manual annotations.
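To make the whole flow concrete, here is a final sketch that builds on the previous ones; it assumes, as a simplification, that annotating from scratch means adding every manual annotation by hand:

def impact_metric(algo, manual, duration):
    # scratch score: work needed with no AI assistance, i.e. every manual
    # annotation is an "add" (a simplifying assumption for this sketch)
    scratch_score = len(manual) * PENALTY["add"]
    return penalty_score(algo, manual, duration) / scratch_score

# toy usage: the AI got the action and end time right but the start time wrong
algo = [("throwing_discus", 5, 8)]
manual = [("throwing_discus", 6, 8)]
print(impact_metric(algo, manual, duration=10))  # ~0.33: one fix vs. one add from scratch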

Final note: The impact metric for error analysis at a large scale

The impact metric results are a good proxy for hunting down problems in the data or annotations. Extreme, out-of-distribution values can imply that something is wrong with a specific sample or its annotations. The metric can also help maintain a data-engine loop by surfacing hard or more important samples that address the model's main failure modes.

At scale, accumulating the impact metric values over large amounts of data is meaningful for analyzing and understanding the system's performance. Using these results as part of the system's error analysis can shed light on the main issues that disrupt the annotation process or the users' experience. In addition, a periodic impact metric analysis on a known, fixed subset of the data can act as a gatekeeper in the CD process, similar to how regression tests are used to evaluate new models before deploying them to production.
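For example, a gatekeeper check over a fixed evaluation subset could look something like the sketch below; the threshold, aggregation, and numbers are placeholders, not values from a real pipeline:

import numpy as np

def passes_release_gate(impact_values, threshold=0.5):
    """Return True if the median impact metric over a fixed evaluation subset
    stays below an agreed threshold (the 0.5 here is only a placeholder)."""
    return float(np.median(impact_values)) < threshold

# impact metric per sample on the fixed evaluation subset (illustrative numbers)
candidate_model_scores = [0.21, 0.35, 0.18, 0.90, 0.27]
assert passes_release_gate(candidate_model_scores), "impact metric regression, do not deploy"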

Transitioning from many task-specific machine learning metrics to a single holistic measurement allows a unified evaluation of AI performance in production, making it easier to track overall success and the impact on the system's end users. Such a transition is not always trivial, but it contributes significantly to many aspects of the organization, from the individual researcher to the end users.
