AUPIMO: Redefining Visual Anomaly Detection Benchmarks

Google Summer of Code 2023 @ OpenVINO™ (Intel)

Joao P C Bertoldo
OpenVINO-toolkit
10 min read · Jan 22, 2024


TL;DR

This post is an overview of AUPIMO [1], a new performance metric born during GSoC 2023, and of how it gives a new perspective on the state of the art in anomaly detection.

If you didn’t understand what I meant by “AUPIMO”, “GSoC”, “anomaly detection”, or “performance metric”, the next section will clarify that 😉.

After the overview, a deeper dive follows; it goes like this:
- we start by defining AUROC; it’s a classic and the parent of our metric;
- then we define AUPIMO, our contribution, and explain its rationale;
- and finally reveal its outcomes, demonstrating how it offers a fresh perspective on the current state of visual anomaly detection.

TL;ADR (A is for “Almost”)

GSoC stands for…

…“Google Summer of Code”. No one better than Google itself to define it:

Fig 1. Google Summer of Code (GSoC). From summerofcode.withgoogle.com on 2023-12-14.

If you’re a student and like to code, you should check it out!

I had the pleasure of working with @samet-akcay and @djdameln, from OpenVINO (Intel), during GSoC 2023, contributing to Anomalib, a library of anomaly detection models.

Anomaly detection is…

…an unsupervised machine learning task.

And what is so special about this one?

Models are trained using a single class (i.e. without annotations), and the challenge is to detect unusual events (anomalies).

For instance, the dataset Transistor (Fig 2) from MVTecAD [2] has images of functional transistors at the same position — think of an industrial production line where this is a requirement.

Fig 2. Images from the dataset “Transistor” from MVTec AD [2] (https://www.mvtec.com/company/research/datasets/mvtec-ad).

At inference time, a model detects deviations like a misplacement or a missing lead (a “leg”) without ever seeing such cases before!

By the way, since we’re talking about images, we’ll refer to “visual anomaly detection” from here on.

Anomalib has a (growing) collection of models capable of detecting not only the presence but also the location of the anomalies (Fig 3).

Fig 3. Predictions from the model PatchCore [3] from Anomalib.

How can you choose one?
Anomalib provides the machinery to benchmark and compare them!

Performance is…

… commonly measured with two metrics: AUROC and AUPRO (Fig 4).

The Area Under the Receiver Operating Characteristic curve (AUROC) is a classic. If you haven’t heard of it, scikit-learn’s short intro [4] can help, or you can check the more complete intro from Tom Fawcett (2006) [5].

Disclaimer AUPRO
We won’t cover the Area Under the PRO curve (AUPRO), but we give several hints when necessary, so don’t panic 😅.
For now, all you need to know is that it’s between 0 and 1 and higher is better.

Fig 4. Average performance on the 15 datasets from MVTecAD. All models indexed in Papers With Code’s benchmark [6].

The good news is: research is moving forward😃!
The bad news is: the benchmarks are misleading 🤨.

Notice how a plateau is being reached. It seems like MVTecAD is “solved”.

The model from Fig 3 has 98.3% AUROC. Yet, Prediction 2 is practically missed while being the most obvious one (there is no transistor!).

AUPIMO is…

… the Area Under the Per-Image Overlap (AUPIMO), a new metric we propose in [1] to address this issue. Put simply, it measures the recall in each image when the model is constrained to have nearly no false positives.

We benchmarked 8 models on 27 datasets using AUPIMO, and the results told a whole new story about the state of the art (SOTA)!

Fig 5. Benchmark of the Transistor dataset. AUPIMO (one score per image) as boxplots; diamond is the average.

In the Transistor dataset (Fig 5), the best (PatchCore [3]) and the worst (PaDiM [7]) models differ by 1.3% AUROC (and 2.7% AUPRO).

With AUPIMO (ours), the difference is more dramatic: PatchCore misses (<50% in-image recall) 1/4 of the anomalies, but PaDiM misses 3x as many.

What’s next

If you read until here, you’re on board!
The upcoming part will dive into the definition of AUPIMO — but first, we define its mother metric, AUROC, to make a comparison between them.
Then we will try to convince you why our modifications make sense.
Finally, we showcase how it tells a new story about the state of the art in visual anomaly detection.

Preliminaries

Let’s first establish some vocabulary.

Visual anomaly detection models commonly output an anomaly score map (a 1-channel image) where higher scores mean “more anomalous”; a binary mask can then be generated by applying a threshold to it.

Fig 6. Image from MVTecAD’s Hazelnut dataset. (a) input image, (b) ground truth annotation, (c) anomaly score map, (d) thresholded prediction. Source: Anomalib.

Fig 6 shows an example from another dataset in the MVTecAD collection — the stamp is not supposed to be there, so it’s an anomaly.

Our goal is to measure whether a model is capable of finding all the anomalous pixels in the images, so a metric’s job is to compare the ground truth annotation (1, in white, means “anomalous”) against the anomaly score map.
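To make this concrete, here is a minimal NumPy sketch (with made-up arrays, not Anomalib’s actual API) of turning a score map into a mask and comparing it to the ground truth pixel by pixel:

```python
import numpy as np

# Toy inputs; in practice these come from the model and the dataset annotations.
score_map = np.random.rand(256, 256)        # anomaly score map: higher = more anomalous
gt_mask = np.zeros((256, 256), dtype=bool)  # ground truth: True = anomalous pixel
gt_mask[100:140, 80:120] = True             # a fake anomalous region

threshold = 0.95                            # some operating point
pred_mask = score_map >= threshold          # thresholded prediction (Fig 6d)

# Pixel-wise confusion counts for this single image.
tp = np.sum(pred_mask & gt_mask)
fp = np.sum(pred_mask & ~gt_mask)
fn = np.sum(~pred_mask & gt_mask)
tn = np.sum(~pred_mask & ~gt_mask)

tpr = tp / (tp + fn)  # recall: overlap with the ground truth region
fpr = fp / (fp + tn)  # fraction of normal pixels wrongly flagged
print(f"TPR (recall) = {tpr:.3f}, FPR = {fpr:.3f}")
```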

From [AU]ROC to [AU]PIMO

AU = Area Under the curve

In simple terms, the PIMO curve is an adapted version of the ROC curve. So let’s recap how the ROC (Receiver Operating Characteristic) curve works.

More precisely, we’re talking about the pixel-wise ROC curve, which means the task is considered a binary classification (normal vs. anomalous) of pixels.

ROC

Fig 7 shows a toy dataset of anomaly score maps from two normal and two anomalous images. At each possible score threshold, the False Positive Rate (FPR) and True Positive Rate (TPR, or recall) are measured over the entire set (all images pooled together). The ROC curve is the set of all FPR/TPR pairs.

In other words, the ROC curve traces the trade-off between true and false detections.

Fig 7. How the ROC curve is built from the anomaly score maps.

The area under the curve (AUC) summarizes the curve in a score (Fig 8).

Fig 8. AUROC’s definition.

A perfect model has 100% AUROC and a random model (randomly predicting 0 and 1 according to their ratio) has 50% AUROC.

Notice how the integral is defined in terms of FPR levels (from 0 to 100%): F⁻¹, the inverse of the FPR function, maps each FPR level back to the threshold that yields it.

As a consequence, the integral accounts for all thresholds between the lowest and highest scores among the normal pixels. This corresponds to scanning all segmentation masks (illustrated by the blue contours in Fig 7).
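In code, the pixel-wise AUROC boils down to flattening every score map and ground-truth mask in the test set into one long vector of pixels. A quick sketch with scikit-learn (toy arrays standing in for the benchmark data):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(0)

# Toy test set: 2 normal and 2 anomalous images (as in Fig 7), 64x64 pixels each.
gt_masks = np.zeros((4, 64, 64), dtype=bool)
gt_masks[2, 20:30, 20:30] = True   # anomalous region in image 2
gt_masks[3, 40:50, 10:20] = True   # anomalous region in image 3

score_maps = rng.normal(size=(4, 64, 64))
score_maps[gt_masks] += 2.0        # anomalous pixels tend to score higher

# All images pooled together: one binary classification over all pixels.
y_true = gt_masks.ravel()
y_score = score_maps.ravel()

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # the ROC curve of Fig 7
auroc = roc_auc_score(y_true, y_score)              # the area of Fig 8
print(f"pixel-wise AUROC = {auroc:.3f}")
```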

PIMO

Now, here is another interpretation: AUROC is the mean of the recall function (y-axis) over the domain of thresholds indexed by the FPR (x-axis).

Based on this observation, we designed the PIMO (Per-IMage Overlap) curve to make the metric more meaningful to visual anomaly detection.

Fig 9. PIMO curve’s axes definition.

Notice that each image has its own curve (subscript i in Fig 9), and the x-axis is represented in a log scale (see Fig 10).

Fig 10. How the PIMO curves are built from the anomaly score maps.

Thus each image has its own area under the curve (AUC), computed with the x-axis in log scale and bounded (the default bounds are 1e-5 and 1e-4):

Fig 11. AUPIMO’s definition.

The term “1/log(U/L)”, where L and U are the lower and upper FPR bounds, is a normalization factor such that 0 ≤ AUPIMO ≤ 1.
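To make the definition concrete, here is a simplified NumPy sketch of AUPIMO as I read the definitions above. It is not the official implementation (that lives in the repository linked at the end); in particular, I assume the shared x-axis is the per-image FPR averaged over the normal images, and I simply drop operating points outside the bounds instead of matching them exactly.

```python
import numpy as np

LBOUND, UBOUND = 1e-5, 1e-4  # default FPR integration bounds


def shared_fpr(thresholds, normal_score_maps):
    """Per-image FPR averaged over the *normal* images, at each threshold."""
    # In a fully normal image, every pixel above the threshold is a false positive.
    per_image_fprs = [
        (smap[None, :, :] >= thresholds[:, None, None]).mean(axis=(1, 2))
        for smap in normal_score_maps
    ]
    return np.mean(per_image_fprs, axis=0)  # shape: (num_thresholds,)


def per_image_tpr(thresholds, score_map, gt_mask):
    """In-image recall (overlap with the ground-truth region) at each threshold."""
    anomalous_scores = score_map[gt_mask]
    return (anomalous_scores[None, :] >= thresholds[:, None]).mean(axis=1)


def aupimo(thresholds, score_map, gt_mask, normal_score_maps):
    """AUPIMO of one anomalous image (cf. Fig 11)."""
    fpr = shared_fpr(thresholds, normal_score_maps)
    tpr = per_image_tpr(thresholds, score_map, gt_mask)
    # Keep only the operating points whose shared FPR lies inside the bounds.
    keep = (fpr >= LBOUND) & (fpr <= UBOUND)
    fpr, tpr = fpr[keep], tpr[keep]
    order = np.argsort(fpr)  # integrate from low to high FPR
    area = np.trapz(tpr[order], np.log(fpr[order]))
    return area / np.log(UBOUND / LBOUND)  # normalization: 0 <= AUPIMO <= 1
```

A faithful implementation would also pick thresholds so that the FPR bounds are matched exactly (e.g. by interpolation) instead of discarding points, but the structure is the same.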

What’s different?

AUPIMO brings several modifications relative to AUROC:
1. Pixel ratios are computed per image;
2. The x-axis is built only from normal images;
3. The x-axis is in log-scale and the AUC is bounded;
4. Each image has its own score.

Why is that better?

Let’s go through the list above one by one.

1. Pixel ratios are computed per image

The FPR/TPR metrics in the ROC curve pool all the normal/anomalous pixels from all the images in the test set, completely disregarding the images’ structures.
In PIMO, the analogous metrics are computed within each image’s scope, treating the images as independent.
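A tiny made-up example of what pooling hides: if one image has a huge anomaly and another a tiny one, the pooled recall is dominated by the big region, while per-image recall exposes that the tiny anomaly was missed.

```python
import numpy as np

# Ground-truth anomalous pixels and correctly detected (true positive) pixels per image.
gt_pixels = np.array([10_000, 50])  # image A: huge anomaly; image B: tiny anomaly
tp_pixels = np.array([9_500, 5])    # image B is almost entirely missed

pooled_recall = tp_pixels.sum() / gt_pixels.sum()  # ~0.946: looks great
per_image_recall = tp_pixels / gt_pixels           # [0.95, 0.10]: image B clearly failed
print(pooled_recall, per_image_recall)
```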

Fig 12. Tiny anomalous regions in ground truth annotation (pink) in the dataset Macaroni 2 in VisA [8]. These noisy regions are given too much weight in AUPRO but neglected in AUPIMO.

One could compute the TPR (recall) in the scope of each connected component (i.e. blob, region), which is how AUPRO works. However, finding the regions makes the computation (much!) slower.

Besides, AUPRO is sensitive to annotation mistakes because small regions are given too much weight. Turns out, VisA (Visual Anomaly Dataset) [8], another important collection of datasets, is full of noisy annotations (Fig 12)!

2. The x-axis is built only from normal images

Building the x-axis only with normal images conceptually makes a big difference because the normal test images can be assumed to come from the same distribution as the training set.

The metric in the x-axis is crucial because it acts like a threshold indexation function. Using a metric sampled from the same distribution as the training set makes PIMO more representative and not biased by the available anomalies — the goal is to detect any type of anomaly!

Also, anomalous images have both normal and anomalous pixels, and AUROC confounds them. However, models often assign higher scores to normal pixels near the anomalies. That behavior should not be punished by the metric because the boundary between normal and anomalous regions may be unclear.

3. The x-axis is in log-scale and the AUC is bounded

By scanning the entire range of the FPR function F (i.e. 0 to 1, see Fig 8), AUROC accounts for useless operating points. For instance, see how the contours at the 33% and 67% FPR levels in Fig 7 detect far too many pixels.

Like AUPRO [9] (not covered here, check our paper [1]), we impose an upper bound on the FPR (1e-4) to limit the metric to “useful enough” thresholds (i.e. ones that raise very few false positives). As for the lower bound (1e-5), it is necessary because of the log scale on the x-axis, which is useful to “zoom in” on the low FPR levels.

With this strategy, it’s easier to build a visual intuition of what the integration bounds represent. Fig 13 shows how AUPIMO’s bounds (1e-5 and 1e-4) yield tiny false positive regions compared to the object.

Fig 13. Visual intuition of in-image False Positive Rate. Examples of normal images in the test set of the dataset Screw from MVTecAD.
Each color corresponds to a threshold at a different FPR level: darker blue is 1e-2, lighter blue is 1e-3, white is 1e-4, and black is 1e-5.
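To see why those bounds correspond to tiny regions, note that in a normal image the threshold at a given FPR is just a score quantile. A quick toy sketch (random scores on a hypothetical 1024x1024 map, not the actual Screw images):

```python
import numpy as np

rng = np.random.default_rng(0)
normal_score_map = rng.normal(size=(1024, 1024))  # a normal image: every pixel is normal

for target_fpr in (1e-2, 1e-3, 1e-4, 1e-5):
    # The threshold whose in-image FPR equals target_fpr is the (1 - target_fpr) quantile.
    thresh = np.quantile(normal_score_map, 1.0 - target_fpr)
    n_false_positives = int((normal_score_map >= thresh).sum())
    print(f"FPR = {target_fpr:.0e} -> ~{n_false_positives} falsely flagged pixels")
```

At 1e-4, only on the order of a hundred pixels out of roughly a million are allowed above the threshold, which matches the tiny contours in Fig 13.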

3.5 Validation-evaluation framework

Since the x-axis is from the same distribution as the training set, the FPR-based bounds act like a model validation scheme!

The y-axis, on the other hand, measures the pixel-wise recall of anomalous pixels in each image, which is used for evaluation.

We coin this 2-in-1 scheme the “validation-evaluation framework”.

4. Each image has its own score

Assigning individual scores to the images offers two important advantages (sketched below):
1. It’s possible to compare images (e.g. worst and best cases);
2. Benchmarks become more detailed because the whole distribution of scores is available.
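For instance (with made-up scores, just to illustrate the two points above), per-image scores let you pull out the worst-case image for a model and plot the whole distribution as in Fig 5:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Pretend per-image AUPIMO scores for two models on the same anomalous images.
scores = {
    "model_a": rng.beta(8, 2, size=40),  # mostly high recall, a few failures
    "model_b": rng.beta(4, 2, size=40),  # lower and more spread out
}

# 1. Compare images: which image does model_a handle worst?
worst = int(np.argmin(scores["model_a"]))
print(f"worst image for model_a: index {worst}, AUPIMO = {scores['model_a'][worst]:.2f}")

# 2. Benchmark with the whole distribution, not just the average (cf. Fig 5 / Fig 14).
plt.boxplot(list(scores.values()), labels=list(scores.keys()), vert=False)
plt.xlabel("AUPIMO (per-image)")
plt.show()
```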

Let’s take a look at this in practice in the next section!

Benchmark

We benchmarked the 27 datasets from MVTecAD and VisA with 13 models, and the results reveal insights previously hidden by the lack of detail.

If you are not familiar with the models, it’s ok, we won’t focus on their details!

Fig 14. Benchmarks in three datasets: Transistor and Screw from MVTecAD and Macaroni 2 from VisA. Diamonds are average AUPIMO scores.

Notice how the AUROC (blue) and AUPRO (red) barely show the differences between models (especially the best ones), but with AUPIMO it’s possible to see the shift in performance even when the averages are close.

AUPIMO also reveals two facts about the state of the art:
1. Cross-image recall varies a lot in most cases;
2. Even the best models fail to recall some anomalies (left tails in Fig 14)!

Go to Fig 4 and notice how this contradicts the perception we started from!

This is further confirmed in Fig 15, where we show a summary of all models and datasets in our benchmark.

Fig 15. Summary plot of our benchmarks (8 models X 27 datasets).

Another lesson here is that there is no go-to model. Even the overall best (PatchCore WR101) fails miserably on some datasets. Conversely, the worst model (PaDiM R18) achieves SOTA performance on one dataset.

Conclusion

Summing up, we introduced AUPIMO, a novel metric cooked up during GSoC 2023 at OpenVINO (Intel).

Unlike current anomaly detection metrics, AUPIMO scores each individual image, offering a nuanced evaluation of model performance. Besides, its validation-evaluation framework gives a task-meaningful interpretation:

AUPIMO is the average segmentation recall on an image given that the model is constrained to raise (nearly) no false alarms.

We put it to the test across 27 datasets and 15 models, and the whole idea of “solved” datasets got a reality check. By the way, we almost forgot to mention: AUPIMO is much faster to compute than AUPRO!

Make sure to check our paper and repository!

Links

  1. My GSoC 2023 project page: summerofcode.withgoogle.com/programs/2023/projects/SPMopugd
  2. Our paper in arXiv: arxiv.org/abs/2401.01984
  3. Standalone Code: github.com/jpcbertoldo/aupimo
  4. OpenVINO: github.com/openvinotoolkit/openvino
  5. Anomalib: github.com/openvinotoolkit/anomalib

Integration with anomalib is coming in the anomalib v1 release!

References

  1. J. P. C. Bertoldo, D. Ameln, A. Vaidya, and S. Akçay, “AUPIMO: Redefining Visual Anomaly Detection Benchmarks with High Speed and Low Tolerance.” arXiv, Jan. 03, 2024. doi: 10.48550/arXiv.2401.01984.
  2. P. Bergmann, M. Fauser, D. Sattlegger, and C. Steger, “MVTec AD — A Comprehensive Real-World Dataset for Unsupervised Anomaly Detection,” CVPR, 2019.
  3. K. Roth, L. Pemula, J. Zepeda, B. Schölkopf, T. Brox, and P. Gehler, “Towards Total Recall in Industrial Anomaly Detection,” CVPR, 2022.
  4. scikit-learn.org/stable/modules/model_evaluation.html#receiver-operating-characteristic-roc
  5. T. Fawcett, “An introduction to ROC analysis,” Pattern Recognition Letters, 2006, doi: 10/bpsghb.
  6. paperswithcode.com/sota/anomaly-detection-on-mvtec-ad
  7. T. Defard, A. Setkov, A. Loesch, and R. Audigier, “PaDiM: A Patch Distribution Modeling Framework for Anomaly Detection and Localization,” ICPR, 2021.
  8. Y. Zou, J. Jeong, L. Pemula, D. Zhang, and O. Dabeer, “SPot-the-Difference Self-Supervised Pre-training for Anomaly Detection and Segmentation,” arXiv, 2022, doi: 10.48550/arXiv.2207.14315.
  9. P. Bergmann, K. Batzner, M. Fauser, D. Sattlegger, and C. Steger, “The MVTec Anomaly Detection Dataset: A Comprehensive Real-World Dataset for Unsupervised Anomaly Detection,” IJCV, 2021, doi: 10/gjp8bb.
