Spotting Defects! — Deep Metric Learning Solution For MVTec Anomaly Detection Dataset

daisukelab · Published in Analytics Vidhya · 9 min read · Sep 16, 2019
Spotted defects in the screw images from the MVTec Anomaly Detection dataset.

Every day we check whether anything looks wrong, and it happens naturally in our lives. For example, when you take food out of the fridge, you unconsciously glance at it to see whether it is OK or not.

In business, visual inspection is widely performed at the final stage of production, and this is likely to be a major application of machine-learning-based anomaly detection.

MVTec Anomaly Detection Dataset (MVTec AD)

https://www.mvtec.com/company/research/datasets/mvtec-ad/

The German company MVTec Software GmbH recently released the novel MVTec Anomaly Detection dataset[1], which contains realistic data from 15 categories.

Figure from MVTec AD website: Good (green) and Bad (red) examples from 6 categories.

It is a great dataset: categories ranging from industrial to agricultural, defects from each of these different domains, various object alignments in the images, and even segmentation annotations of the defect areas.

“To the best of our knowledge, no comparable dataset exists for the task of unsupervised anomaly detection.”

MVTec AD is introduced to play the role that MNIST, CIFAR-10, or ImageNet play for the unsupervised anomaly detection (and segmentation) research area.

This dataset comes with a paper[1] that not only introduces the dataset but also evaluates baseline methods such as GANs, autoencoders, and other traditional approaches.

Deep Metric Learning

This article reports some experiments that apply deep metric learning to anomaly detection on this dataset. Major deep metric learning methods such as ArcFace[3] and CosFace[4] are popular in face verification/recognition tasks, and they learn to measure the distance between data samples. The distance is then used, for example, to determine whether the faces in two photos belong to the same identity.

A big benefit of these deep metric learning methods is their simplicity.

  • They simply reuse conventional CNN models with small additional plug-in layers.
  • The modified CNN can be trained on a classification task as usual, even without any change to the training process, and then it's done.
  • The body of the CNN is now ready to output features that can be used to calculate the distance between data samples (a minimal sketch follows this list).
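To make this concrete, below is a minimal PyTorch sketch of what such a plug-in layer can look like: an ArcFace-style additive angular margin on top of a ResNet18 body. This is an illustrative sketch, not the exact code used in these experiments; the class name ArcMarginHead and the hyperparameters (scale=30, margin=0.5) are example choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

class ArcMarginHead(nn.Module):
    """ArcFace-style plug-in head (illustrative): cosine logits with an
    additive angular margin, trained with ordinary cross-entropy."""
    def __init__(self, in_features, n_classes, scale=30.0, margin=0.5):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(n_classes, in_features))
        nn.init.xavier_uniform_(self.weight)
        self.scale, self.margin = scale, margin

    def forward(self, features, labels):
        # cos(theta): similarity between L2-normalized features and class weights
        cosine = F.linear(F.normalize(features), F.normalize(self.weight))
        theta = torch.acos(cosine.clamp(-1 + 1e-7, 1 - 1e-7))
        # add the margin m to the angle of the target class only: cos(theta + m)
        target = F.one_hot(labels, cosine.size(1)).bool()
        logits = torch.where(target, torch.cos(theta + self.margin), cosine)
        return logits * self.scale

# Reuse a conventional CNN body; ResNet18 outputs 512-D features before its FC layer.
backbone = models.resnet18(pretrained=True)
backbone.fc = nn.Identity()                 # the body now outputs embeddings

head = ArcMarginHead(in_features=512, n_classes=4)
# A training step then looks like ordinary classification:
#   loss = F.cross_entropy(head(backbone(images), labels), labels)
```

After training, the head is discarded and only the backbone's embeddings are used for distance measurement.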

Tested deep metric learning methods

Figure 1. from CosFace[4]: An overview of the proposed CosFace framework.

In all the experiments, the following methods are tested:

  • L2-constrained Softmax Loss[2]
  • ArcFace[3]
  • CosFace[4]
  • SphereFace[5]
  • CenterLoss[6]
  • And a conventional CNN, whose features also work for measuring distances (the loss forms are summarized just below)
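For reference, these methods differ mainly in how they shape the target-class logit fed to the ordinary softmax cross-entropy. A rough summary of the cited papers (with s a scale factor, m a margin, and θ_y the angle between the feature x and its target-class weight):

```latex
% Rough per-method form of the target-class logit
% (CenterLoss instead adds a penalty term to the plain softmax loss)
\begin{aligned}
\text{L2-constrained Softmax [2]:}\quad & W_y^\top(\alpha \hat{x}) + b_y
  && \text{(feature norm fixed to } \alpha\text{)} \\
\text{SphereFace [5]:}\quad & \lVert x\rVert \cos(m\,\theta_y) \\
\text{CosFace [4]:}\quad & s\,(\cos\theta_y - m) \\
\text{ArcFace [3]:}\quad & s\,\cos(\theta_y + m) \\
\text{CenterLoss [6]:}\quad & \text{softmax loss} \;+\; \tfrac{\lambda}{2}\,\lVert x - c_y\rVert_2^2
\end{aligned}
```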

Experiments

Three different classification problem setups were tested. The first failed, the second failed, and the final one succeeded with a newly devised technique; one more attempt then pushed AUC over 90% in all categories.

  1. Experiment 1: Traditional supervised multi-class classification.
  2. Experiment 2: Binary classification between normal and one of anomaly (defect) classes.
  3. Experiment 3: Self-supervised to generate anomaly samples from normal samples.
  4. And here’s one more try: Towards AUC over 90% in all categories.

Experiment 1: Traditional supervised multi-class classification

As the MVTec AD paper points out, many prior works evaluate on general-purpose image datasets such as MNIST or CIFAR-10. The typical problem setting is:

  • Before training, assign some of the original classes as normal, while the others become anomalies for the anomaly detection task.
Figure: an example class split for CIFAR-10.
  • Train only on samples from the classes assigned as normal. No anomaly samples are used; the model is trained to discriminate among the normal classes.
  • After training, evaluate on all test samples by measuring distances.
    (This is described later.)

A similar setting is applied here:

  • The training set consists of defect-free (normal) samples from the following 4 categories: ‘capsule’, ‘carpet’, ‘leather’, and ‘cable’. Training is therefore a multi-class classification problem over these 4 classes (a small sketch of this setup follows the figure below).
  • The test is performed per category. All test samples in each category are evaluated, and AUC is calculated to assess performance.
  • Note that the test classes differ from the training classes. Test samples consist of normal ‘good’ samples as well as various types of defective (anomaly) samples.
Figure: an example training batch for the 4-class classification problem.
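As a concrete sketch of this setup, assuming the standard MVTec AD folder layout (defect-free training images under `<category>/train/good/`), the 4-class training list could be built roughly like this; paths and variable names are illustrative:

```python
from pathlib import Path

# Standard MVTec AD layout: <root>/<category>/train/good/*.png are defect-free images.
root = Path('mvtec_ad')
categories = ['capsule', 'carpet', 'leather', 'cable']

# (image path, class index) pairs for the 4-class training problem
train_items = [(path, label)
               for label, cat in enumerate(categories)
               for path in sorted((root / cat / 'train' / 'good').glob('*.png'))]
```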

Training finishes successfully; it is an easy problem to classify images that look very different from each other.

But unlike usual CNN classification, this actually trains a metric-learning-enabled CNN to learn a metric that measures the distance between samples.

Measuring distance and anomaly discrimination

In the test phase, the distance of every test sample from the normal class is measured with the following steps:

  1. Cut the final layer off the trained model so that it outputs 512-D features. (ResNet18 is used here; it outputs 512-D features right before the final FC layer.)
  2. Get features e_n for all N training examples x_n in advance.
    (Feed all x_n to the model and take the features from its output.)
  3. Get the features e_m of the m-th test sample x_m, then calculate the cosine distance to every training feature e_n. Now we have N distances from the test sample x_m to all the training samples.
  4. Pick the minimum of the N distances. This minimum is the distance d_m of the test sample x_m from normal.
  5. Repeat steps 3 and 4 until we have all M test sample distances.
  6. Finally, set a threshold on the distance to divide samples into normal and anomaly classes. Samples with shorter distances are normal, and samples with longer distances are detected as anomalies. (A minimal code sketch of these steps follows this list.)
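Here is a minimal PyTorch sketch of steps 1–6, assuming `backbone` is the trained model with its final FC layer cut off, `train_loader` iterates over the normal training samples, `test_loader` over the test samples, and `y_true` holds the test labels (1 = anomaly, 0 = normal); all of these names are assumptions for the sketch:

```python
import torch
import torch.nn.functional as F
from sklearn.metrics import roc_auc_score

@torch.no_grad()
def embed(model, loader, device='cuda'):
    """Steps 1-2: run the headless model and collect L2-normalized 512-D features."""
    model.eval().to(device)
    feats = [F.normalize(model(x.to(device))).cpu() for x, _ in loader]
    return torch.cat(feats)

train_emb = embed(backbone, train_loader)     # e_n for all N normal training samples
test_emb = embed(backbone, test_loader)       # e_m for all M test samples

# Steps 3-4: cosine distance from each test sample to every training sample,
# then keep the minimum as that sample's distance d_m from "normal".
cos_sim = test_emb @ train_emb.T              # (M, N) cosine similarities
d = (1.0 - cos_sim).min(dim=1).values.numpy()

# Steps 5-6: a larger d_m means more anomalous; AUC sweeps over all thresholds.
print('AUC =', roc_auc_score(y_true, d))
```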

Result of traditional supervised multi-class classification

AUC result of models trained in multi-class classification setup.

Training the model to learn a metric failed (AUC less than 0.5, i.e. worse than random guessing); the measured distances are mostly incorrect. This is natural, because the model was:

  • Trained to discriminate very different objects like ‘capsule’ vs. ‘cable’.
  • Tested on very similar objects: a normal ‘capsule’ vs. an anomalous ‘capsule’ with defects, which look far closer to each other than ‘capsule’ does to ‘cable’.

So the traditional problem setting for evaluating metric learning doesn’t work in this realistic scenario…

We need to motivate the model to learn to measure small differences.
It’s not the difference between a cat and a car, nor even between a black cat and a gray cat.
It’s the difference between a clean screw and a screw with a tiny scratch!

Experiment 2: Binary classification between normal and one of anomaly (defect) classes

Models are supposed to find tiny differences from normal samples. To do that,

  • Pick one defect class out of the test set (only the test set has defect classes), then move its samples into the training set as the anomaly class.
  • Now training the models is a normal/anomaly binary classification task.
Figure: an example training batch containing both “good” (normal) and “broken_large” (anomaly) samples.

But this also failed, with AUC around 0.5, close to random guessing.

AUC result of models trained in binary classification setup.

Basically, the anomaly sample size is too small: there are roughly 200+ normal samples but only about 10 defect samples, which is very imbalanced. The following was tried to mitigate the problem, though it didn’t work:

  • Oversampling the defect class, with augmentations and with training techniques like mixup (a rough sketch follows below).
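For reference, one common way to set up such oversampling in PyTorch is a WeightedRandomSampler. This is only a sketch of the idea (`train_ds` and `train_labels` are assumed to exist), not necessarily how the original experiments implemented it:

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

labels = np.asarray(train_labels)             # 0 = normal (~200), 1 = defect (~10)
class_count = np.bincount(labels)
sample_weight = 1.0 / class_count[labels]     # rare defect samples get drawn more often

sampler = WeightedRandomSampler(torch.as_tensor(sample_weight, dtype=torch.double),
                                num_samples=len(labels), replacement=True)
loader = DataLoader(train_ds, batch_size=32, sampler=sampler)
```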

The journey toward training models that can discriminate a tiny “something wrong” continues…

Experiment 3: Self-supervised to generate anomaly samples from normal samples

This is based on a simple, Kaggle-ish idea.

What we need now are anomaly samples that differ only slightly from the normal ones. So we can simply generate anomaly samples from the normal samples.

Once the anomaly samples are ready, we can train models to distinguish these tiny differences from the original normal samples. The training problem is binary classification, and because only normal samples are used as input, it is also self-supervised training.

For this experiment, a new dataset class was created (actually an ImageList class for the fast.ai library[7]) that:

  • Doubles the normal training samples.
  • Assigns normal labels to the even-numbered samples and anomaly labels to the odd-numbered samples, so every original sample now has an “anomaly twin”.
  • When an image labeled as an anomaly is loaded, a random line is drawn on it. It is a tiny line that creates the difference from the normal samples, and the randomness of the line also contributes to data augmentation (see the sketch after the figure below).
Figure: an example training batch containing both “normal” and “anomaly” samples; all anomaly samples are generated by drawing a colored line as a defect “scar”.
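A minimal sketch of the “anomaly twin” generation with PIL is shown below; the actual ImageList subclass in the repository may draw the line differently, and the size/color ranges here are illustrative:

```python
import random
from PIL import Image, ImageDraw

def make_anomaly_twin(img: Image.Image) -> Image.Image:
    """Return a copy of a normal image with one short random line drawn as a 'scar'."""
    twin = img.copy()
    w, h = twin.size
    x1, y1 = random.randint(0, w - 1), random.randint(0, h - 1)
    x2 = min(w - 1, x1 + random.randint(5, max(6, w // 10)))   # keep the scar tiny
    y2 = min(h - 1, y1 + random.randint(5, max(6, h // 10)))
    color = tuple(random.randint(0, 255) for _ in range(3))
    ImageDraw.Draw(twin).line([(x1, y1), (x2, y2)],
                              fill=color, width=random.randint(1, 3))
    return twin
```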

The results make sense.

AUC result of models trained in self-supervised setup. Improved.
Table of the AUC result details. CosFace and ArcFace are stable in this case.

Let’s check some samples with Grad-CAM activation heatmaps (a minimal Grad-CAM sketch appears after the examples below). Successful cases show that the heatmap captures the defective part of the image:

Example result of ArcFace with hazelnut at AUC 99%. Test sample and its heatmap (top and middle) and the closest counterpart training sample (bottom) for distance calculation.

The failure cases below show that the model is not looking at the defect; it did not learn to find these types of defects correctly. There are more examples like this, showing that there is still room for improvement.

Example result of ArcFace with a screw at AUC 62.3%; the heatmap shows that the model is distracted by the background and fails to find the defects at the tip of the screws.
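For completeness, here is a minimal Grad-CAM sketch of the kind used for such heatmaps, assuming `model` maps an image batch to class logits and `target_layer` is its last convolutional block (e.g. `layer4` of a ResNet18); the article’s actual implementation may differ:

```python
import torch
import torch.nn.functional as F

def grad_cam(model, x, target_layer):
    """Weight the target layer's activations by the gradient of the top score,
    then ReLU, upsample to the input size, and normalize to [0, 1]."""
    acts, grads = [], []
    h1 = target_layer.register_forward_hook(lambda m, i, o: acts.append(o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))
    try:
        model.zero_grad()
        model(x).max(dim=1).values.sum().backward()   # top logit as the "score"
    finally:
        h1.remove(); h2.remove()
    w = grads[0].mean(dim=(2, 3), keepdim=True)       # channel importance weights
    cam = F.relu((w * acts[0]).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=x.shape[-2:], mode='bilinear', align_corners=False)
    return cam / cam.amax(dim=(2, 3), keepdim=True).clamp(min=1e-8)

# Hypothetical usage: heat = grad_cam(classifier, batch, classifier.layer4)
```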

One More: Towards AUC over 90% in all categories

As a final result, the following Grad-CAM heatmaps were obtained after tuning the model to achieve AUC over 90%. It was fairly easy to tune: we can tweak how the anomaly twin samples are created so that they simulate the actual defect modes (a rough sketch follows the list below).

But what does this mean? It was done by using knowledge of the defect modes that appear in the test samples, which is arguably cheating.

  • So this approach does not apply to use cases where no prior knowledge of defect modes is available.
  • But in many cases, most defect modes are known. (On a production line it is usual that major defects are tracked with their percentage or ppm.)
    So this would still be useful for automating the detection of known or expected failure modes.
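As an illustration of that kind of tweak (the parameters and the TWIN_HINTS table below are purely hypothetical; real adjustments depend on each category’s known defect modes), the twin generator could be biased per category, for example restricting the artificial scar to the region where defects actually occur:

```python
import random
from PIL import ImageDraw

# Hypothetical per-category hints: where to place the scar (relative box:
# left, top, right, bottom) and its maximum length, mimicking known defect modes.
TWIN_HINTS = {
    'screw': dict(box=(0.35, 0.7, 0.65, 1.0), max_len=0.06),   # scratches near the tip
    'grid':  dict(box=(0.0, 0.0, 1.0, 1.0),  max_len=0.15),    # defects anywhere
}

def make_tuned_twin(img, category):
    hint = TWIN_HINTS[category]
    twin = img.copy()
    w, h = twin.size
    l, t, r, b = hint['box']
    x1 = random.randint(int(l * w), max(int(l * w), int(r * w) - 1))
    y1 = random.randint(int(t * h), max(int(t * h), int(b * h) - 1))
    length = random.randint(2, max(3, int(hint['max_len'] * w)))
    x2, y2 = min(w - 1, x1 + length), min(h - 1, y1 + length)
    color = tuple(random.randint(0, 255) for _ in range(3))
    ImageDraw.Draw(twin).line([(x1, y1), (x2, y2)], fill=color, width=1)
    return twin
```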
Example result of ArcFace with a screw at AUC 91%, improved from 62.3%. The model finds defects at the tip of screws.
Example result of ArcFace with transistor at AUC 95.8%, improved from 89%. The model finds defective legs.
Example result of ArcFace with the grid at AUC 99.9%, improved from 79.8%. Heatmap shows defect parts.
Example feature (embedding) distributions of ArcFace with the grid at AUC 99.9%. “Good” (normal) features form a circle in the bottom-left corner, far apart from the defect samples in the top-right corner.
Example result of ArcFace with the pill at AUC 93.5%, improved from 70%. Some defects are found correctly, but the two examples on the right show that the model is distracted (a risk of failure on future examples).

Comparison with paper

All the results in the original MVTec AD paper[1] are based on segmentation output, so a direct comparison is basically difficult.

  • Some AUC results in the paper are lower than 90%, for categories such as ‘carpet’, ‘cable’, ‘metal nut’, and ‘zipper’.
  • But those are calculated from per-pixel segmentation TPR/FPR, which is likely a harder problem than image-level, distance-based detection.
  • For simply judging whether an image is normal or anomalous, deep metric learning might be attractive for its simplicity, compared with segmentation-based detection, which requires the complicated threshold determination discussed in the paper.

Final thoughts

Experiments in this article showed that:

  • Deep metric learning methods can be used for self-supervised anomaly detection on the MVTec AD dataset;
  • Their Grad-CAM heatmaps can point to the defective part in successful cases,
  • Or at least they can be used to distinguish defective samples from normal ones with a performance of AUC 90% or more.
  • A combination/ensemble of segmentation-based and deep-metric-learning-based approaches could yield better performance, or open up new possibilities for potential applications.

Source code

Find example code here: https://github.com/daisukelab/metric_learning/tree/master/MVTecAD

Many thanks to the fast.ai library[7] for minimizing the time needed to develop these experiments.

Resources
