Add ensembling methods for tiling to Anomalib — GSoC 2023 @ OpenVINO™

Blaz Rolih · Published in OpenVINO-toolkit · Dec 19, 2023

When detecting defects in high-resolution images, we encounter many challenges. One of them is that models are hard to train due to memory constraints, and by down-sampling we lose information, causing small defects to be overlooked.

One way of avoiding such issues is by using a tiled ensemble mechanism. In this case, we split the image into smaller tiles, and process those using a separate model for each tile position. This strategy enhances localization and improves detection of small anomalies, while also ensuring that the models remain within the memory constraints.

In this blog, we present how the tiled ensemble mechanism was developed within Anomalib — an anomaly detection library utilizing deep learning models. The project is part of Google Summer of Code for Intel’s OpenVINO™ Toolkit organization. The final result is a pipeline from data to final predictions, using an arbitrary dataset and any model already implemented in Anomalib. Join us in this blog, as we explore the design, evolution, and successful integration of this mechanism.

This blog contains a lot of diagrams that explain the procedure and results, so you can get a good idea by just glancing over all the images if you don’t have time to read all the text.

Google Summer of Code project page
Anomalib (GitHub) | OpenVINO (GitHub)

Many thanks to @samet-akcay and @djdameln for being such amazing mentors during this GSOC period.

Introduction

Detecting anomalies in images is an important task in both research and industrial applications. In recent years, deep learning methods have greatly improved performance on this task.

Nonetheless, due to the constraints of real-world applications, models are usually limited by memory usage and throughput. Such models are typically deployed in production quality control, where both high throughput and high accuracy are expected, since a missed defect could cause significant financial loss or even harm human health.

One of the examples of anomaly detection on capsules can be seen in Figure 1 below.

Figure 1: Anomaly detection example from VisA dataset. In this case, we are trying to find a defect in the capsules.

In industrial applications, we often want good localization and detection accuracy on high-resolution images, which quickly leads to memory issues with regular approaches.

This is where the idea of the tiled ensemble, introduced earlier in this blog, comes into play. As shown in Figure 2 below, the process applies to individual images but works the same way across the dataset. Images are divided into a chosen number of tiles, and separate models are trained for tiles in specific positions.

Finally, the predictions from all models are combined and processed. This results in a high-resolution anomaly mask and a score indicating whether the object is anomalous or not.

Figure 2: High-level overview of the tiled ensemble approach. Each image is split into tiles, then a separate model is trained for each tile position. Finally, the predictions are joined back together and post-processed, resulting in a score and predicted anomaly mask.

Anomalib currently has a basic tiler. However, the benefits of locally-aware models, like Padim, are diminished when tiling is used without an ensemble. By just tiling the image and feeding all tiles to a single model, the same pixel location in the model input corresponds to multiple locations in the original image. This approach is also not particularly memory efficient.

To address this problem, we present a tiled ensemble implementation that preserves the advantages of local awareness while also enhancing performance. It works with any dataset and any model already implemented in Anomalib.

Design

Our approach aims to incorporate Anomalib’s existing features (e.g. training configuration, standard post-processing, visualization, and metric calculation), while providing users the flexibility to tailor specific steps within our pipeline, leading to results that match the format of standard Anomalib predictions. All the while, we ensure that our approach remains memory-efficient.

In Figure 3, we can see all the building blocks of our approach. For simplicity, we describe the method as if processing one image at a time, but during training everything is done in batches.

Figure 3: Diagram of building blocks that combine into a tiled ensemble mechanism. Images are first tiled, then a separate model is trained on each tile position. Tile-level predictions are then joined and finally sent through the post-processing pipeline.

Tiling and training

The input image is split into equally sized tiles, which can also overlap. All tiles from one position are then fed into a separate model. This is repeated until an individual model has been trained for every tile position. For a better representation of tiling, check Figure 4.

Figure 4: Visualization of tiling and training on tiled data. Each model is only trained on tiles from one position.
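To make the tiling step concrete, here is a minimal sketch of splitting an image into an overlapping grid of tiles keyed by their grid position. This is an illustrative helper, not Anomalib's actual tiler API; the function name and signature are assumptions.

```python
import numpy as np

def tile_image(image: np.ndarray, tile_size: int, stride: int) -> dict:
    """Split an HxWxC image into overlapping tiles, keyed by (row, col) grid position."""
    h, w = image.shape[:2]
    tiles = {}
    for row, y in enumerate(range(0, h - tile_size + 1, stride)):
        for col, x in enumerate(range(0, w - tile_size + 1, stride)):
            tiles[(row, col)] = image[y : y + tile_size, x : x + tile_size]
    return tiles

# A 512x512 image with 256x256 tiles and stride 128 yields a 3x3 grid.
image = np.zeros((512, 512, 3))
tiles = tile_image(image, tile_size=256, stride=128)
print(len(tiles))  # 9
```

In the tiled ensemble, each of these grid positions would then get its own dedicated model.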

The result of inference is an anomaly map, which assigns an anomaly score to every pixel. It is shown in the “Predicted Heat Map” tab in Figure 5, alongside its corresponding segmentation mask.

The anomaly score is determined as the maximum value of the anomaly map. If this value surpasses a certain threshold, we label the entire image anomalous.
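In code, this decision rule is a one-liner. The sketch below uses a made-up map and an illustrative threshold value.

```python
import numpy as np

# Minimal sketch of turning a pixel-level anomaly map into an image-level
# decision. The map values and the 0.5 threshold are illustrative assumptions.
anomaly_map = np.array([[0.1, 0.2],
                        [0.9, 0.3]])

image_score = float(anomaly_map.max())   # image score = maximum pixel score
is_anomalous = image_score > 0.5         # label the whole image as anomalous

print(image_score, is_anomalous)  # 0.9 True
```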

Anomalib also supports the task of detection. This involves generating bounding boxes around anomalous regions from the anomaly map, as visible in the last tab of Figure 5.

Figure 5: Example of results. An anomaly map, here called a heat map, shows the anomaly score per pixel. The mask is produced from the anomaly map using the determined threshold. In the segmentation result, we see the parts of the original image that are anomalous based on our mask. In the last tab, we have an example of detection, where we mark the anomalous region using a bounding box instead of a segmentation outline.

The example above shows predictions for the entire image, but in the case of the tiled ensemble, we get predictions for each tile separately. To get the above result, we first need to join everything into a full image representation.

Joining

To get our predictions from tiles back into a full-image representation, we use a joiner, which is responsible for assembling all kinds of predictions: anomaly maps, scores, and predicted boxes. The following happens, as illustrated in Figure 6:

  • Tiled data such as anomaly maps are assembled back into full-scale images,
  • scores are averaged over all the tiles,
  • boxes are collected into a single list.
Figure 6: Joining is performed on three levels. Tiled data like anomaly maps and images is untiled, scores are averaged across tiles, and boxes are stacked from all tiles.
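The three joining levels can be sketched as follows. This is an illustrative re-implementation under assumed data layouts (dicts keyed by grid position), not Anomalib's actual joiner, and averaging overlapping pixels is just one possible stitching strategy.

```python
import numpy as np

def join_tile_predictions(tile_maps, tile_scores, tile_boxes,
                          image_size, tile_size, stride):
    """Join per-tile predictions into image-level predictions:
    - anomaly maps are stitched back, averaging overlapping pixels
    - image scores are averaged over all tiles
    - bounding boxes are collected into a single list
    """
    full_map = np.zeros((image_size, image_size))
    counts = np.zeros_like(full_map)
    for (row, col), tile_map in tile_maps.items():
        y, x = row * stride, col * stride
        full_map[y : y + tile_size, x : x + tile_size] += tile_map
        counts[y : y + tile_size, x : x + tile_size] += 1
    full_map /= counts  # average where tiles overlap

    score = float(np.mean(list(tile_scores.values())))
    boxes = [box for tile in tile_boxes.values() for box in tile]
    return full_map, score, boxes

# Toy example: a 2x2 grid of 2x2 tiles with stride 1 covering a 3x3 image.
tile_maps = {(r, c): np.ones((2, 2)) for r in range(2) for c in range(2)}
tile_scores = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.75, (1, 1): 0.75}
tile_boxes = {(0, 0): [[0, 0, 1, 1]], (0, 1): [], (1, 0): [], (1, 1): []}
full_map, score, boxes = join_tile_predictions(
    tile_maps, tile_scores, tile_boxes, image_size=3, tile_size=2, stride=1)
print(full_map.shape, score, boxes)  # (3, 3) 0.5 [[0, 0, 1, 1]]
```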

After joining, we have all results in an image-level representation, the same format predictions take when no ensemble is used. As a last step, predictions go through a post-processing pipeline, introduced at the start of the blog and shown again in Figure 7.

Figure 7: Post-processing pipeline used for tiled ensemble predictions. First, we apply post-processing to our data, then visualize and calculate metrics.

Post-processing

To improve the result, we apply various post-processing steps, including normalization and thresholding.

Both normalization and thresholding can be done for each tile separately or on the joined images. This can be selected by the user, but by default, we do both once the images are joined, as this creates a nicer visualization.
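As a sketch of what normalization can look like: one common scheme (similar in spirit to Anomalib's min-max normalization, though the exact details may differ) rescales scores so that the learned threshold lands at a fixed value of 0.5, which makes visualizations comparable across images.

```python
import numpy as np

def normalize(anomaly_map, threshold, v_min, v_max):
    """Rescale anomaly scores into [0, 1] so the threshold maps to 0.5.
    Illustrative sketch; v_min/v_max are the observed score extremes."""
    normalized = (anomaly_map - threshold) / (v_max - v_min) + 0.5
    return np.clip(normalized, 0.0, 1.0)

scores = np.array([0.2, 0.6, 1.0])
out = normalize(scores, threshold=0.6, v_min=0.2, v_max=1.0)
# maps to [0.0, 0.5, 1.0]: the threshold score 0.6 becomes exactly 0.5
```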

At this stage, we have predictions prepared for final evaluation. This part consists of visualizing and calculating metrics.

When visualizing the results, all parts of the prediction, as well as the ground truth, are nicely displayed, as depicted in Figure 8.

Figure 8: Example of a full visualization result. Here we can see each predicted element in more detail, as well as compare it to the ground truth, which shows us the area of the image that is anomalous.

To quantitatively evaluate the performance of our approach, we finally calculate metrics on the predicted data. This usually consists of image and pixel metrics. Image metrics treat the entire image as the unit of prediction, while pixel metrics treat each pixel as a separate prediction unit. Both offer insight into the workings of our method: image metrics usually tell us how good we are at finding anomalous objects, while pixel metrics tell us how good we are at finding the part of the object that is anomalous. We will present the results in the next section.
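The difference between image-level and pixel-level evaluation can be illustrated with a tiny pure-Python AUROC via its pairwise-ranking interpretation. All numbers here are made up, and note that the pixel metric used later in the blog is AUPRO, not AUROC.

```python
def auroc(labels, scores):
    """AUROC as the probability that a randomly chosen anomalous sample
    scores higher than a randomly chosen normal one (ties count half)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Image-level: each image is one prediction unit.
image_auroc = auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])

# Pixel-level: each pixel of the flattened masks is one prediction unit.
pixel_auroc = auroc([0, 0, 1, 1], [0.2, 0.1, 0.9, 0.3])

print(image_auroc, pixel_auroc)  # 0.75 1.0
```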

Experiments and results

We evaluated the approach on two datasets: MVTec AD and VisA. MVTec AD, further referred to as MVTec, is a well-known, standard visual anomaly detection dataset. It consists of 15 different categories, with some examples shown in Figure 9. The objects are positioned nicely in each image, and anomalies vary in scale. While it’s not the best fit to properly show the benefits of our tiled ensemble, it’s still included to present the performance even in such cases.

Figure 9: Examples of MVTec AD dataset. Image source: https://www.mvtec.com/company/research/datasets/mvtec-ad/

The VisA dataset, on the other hand, is a bit more challenging. Its 12 categories sometimes include multiple instances of an object in the same image. Many of its defects are small, displaying the benefit of using a tiled ensemble. We can see examples of each category in Figure 10.

Figure 10: Examples of the VisA dataset. Image source: https://paperswithcode.com/dataset/visa

We evaluated our tiled ensemble against a single model at two different resolutions. The tiled ensemble used an image size of 512x512 pixels, split into 9 overlapping tiles of 256x256 resolution with a stride of 128. For each architecture, one single model was trained with an input size of either 256x256 or 512x512. Note that for Patchcore, we couldn’t train the model at 512x512 due to high memory usage. All other parameters were the same for all approaches: a resnet18 backbone, a batch size of 32, and a seed fixed to 42.
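As a quick sanity check of that configuration, the tile-grid arithmetic works out as follows:

```python
# 512x512 image, 256x256 tiles, stride 128 -> an overlapping 3x3 grid.
image_size, tile_size, stride = 512, 256, 128

positions_per_axis = (image_size - tile_size) // stride + 1
num_tiles = positions_per_axis ** 2

print(positions_per_axis, num_tiles)  # 3 9
```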

We trained five different architectures to see how the tiled ensemble performs across different approaches to anomaly detection. These models are Padim, Patchcore, FastFlow, Reverse Distillation, and STFPM.

We used AUROC as our image-level metric (where we decide if the image is anomalous or not) and AUPRO as our pixel-level metric (where we classify each pixel as anomalous or not). The choice of these metrics follows the latest papers on anomaly detection as well as their properties. AUPRO is selected over AUROC for pixel-level localization as it weighs small and large anomalies equally.

We can see the results obtained on the entire MVTec dataset in Figure 11.

Figure 11: A tiled ensemble can improve the performance of some models, but that’s not always the case. We also notice that improvement doesn’t come just from larger resolution.

One of the first things we notice is that a tiled ensemble can improve the performance of some models, but in other cases the performance degrades.

We can also notice that the performance gain does not come only from increased resolution, as a single model with a resolution of 512x512 does not always perform as well as an ensemble, or even as well as the same model at a lower resolution. This might occur due to limitations of the pretrained backbones.

While MVTec is a fairly standard dataset, it does not represent samples where a tiled ensemble offers the most benefits. That changes when we look at the VisA dataset, where there are more objects present in a single image, and anomalies can be really small and complex. We can see the results in Figure 12.

Figure 12: The tiled ensemble outperforms, or at least matches, a single model whose input size equals the ensemble tile size. In some cases the higher resolution greatly improves the result, and the tiled ensemble mechanism can improve it further.

In this case, we can see that the tiled ensemble performs better than, or at least on par with, a single model whose input size matches the ensemble tile size. Again, the performance gain is not equal across all architectures.

Again, the performance improvement is not just due to the resolution increase. For Reverse Distillation and STFPM, a single model with increased resolution does outperform the ensemble in the localization task, but for Reverse Distillation this comes at the cost of greatly degraded detection performance.

While this performance increase from a larger input size is good to see, we should note that the tiled ensemble, by design, can process larger resolutions with lower memory usage: each model in the ensemble only needs as much memory as a single model with a resolution of 256x256. This is what enables the good results observed for the tiled Patchcore ensemble, whereas a single 512x512 Patchcore model could not be trained due to memory usage.

As the metrics show, a tiled ensemble has the advantage of detecting smaller anomalies and, in some cases, improving overall performance. There are also cases where it does not help and instead reduces performance. To better illustrate these findings, we will now also look at qualitative results.

In Figure 13, we can see an example from the MVTec toothbrush category, where the ensemble most notably outperforms the single model. As we can see, localization of smaller anomalies works much better, though some noise is introduced.

Figure 13: (Padim, MVTec Toothbrush) The MVTec toothbrush category sees the most performance gain with the ensemble. Small defects are better detected and localized.

Another example where the ensemble improves localization is the Grid category from MVTec, seen in Figure 14.

Figure 14: (Padim, MVTec Grid) Another example where the ensemble localizes anomalies that are missed by one model.

In Figure 15, we can see another example of a Toothbrush, where even increased resolution doesn’t help in localizing two smaller anomalies.

Figure 15: (FastFlow, MVTec Toothbrush) Improvement in the case of a tiled ensemble does not come only from effective increased resolution.

But there are also cases where the tiled ensemble performs worse for all architectures, most notably the Cable category of MVTec. From Figure 16 we can see that localization degrades significantly.

Figure 16: (Padim, MVTec Cable) Failure case for the ensemble. The MVTec Cable category sees the worst degradation overall.

As described before, VisA anomalies are often much smaller. This causes a single model to miss a defect that the ensemble localizes, as depicted in Figure 17.

Figure 17: (Padim, VisA Pcb3) The VisA dataset has many smaller anomalies that a single model easily misses but a tiled ensemble can detect them.

With many cases of small anomalies, we also have additional examples of such behavior, shown in Figure 18.

Figure 18: (Padim, VisA Pcb1) Another example of a small anomaly missed by a single model but detected by the tiled ensemble.

When increasing the resolution of a single model, it is capable of detecting smaller anomalies. We can see an example where the defect was missed before but is detected with increased resolution in Figure 19.

Figure 19: (Padim, VisA Pcb3) Increasing the resolution of a single model can improve performance to match a tiled ensemble.

But there are cases where increasing resolution just introduces more noise into predictions, such as in Figure 20.

Figure 20: (Padim, VisA Pcb1) We don’t always match the performance of the tiled ensemble by simply increasing resolution.

The ensemble can’t fix problems that are inherent to the model itself. As displayed in Figure 21, the noise is still present, although not in the same shape.

Figure 21: (Padim, VisA Pcb1) Failure case of the tiled ensemble. False positives are still present in a tiled ensemble, as we don’t fix the shortcomings of the architecture itself.

It also happens that the ensemble sometimes not only misses the anomalous region but also introduces many false positive predictions. This can be observed in Figure 22.

Figure 22: (Reverse Distillation, VisA Chewing gum) While a single model doesn’t detect anything, the tiled ensemble produces false positive areas.

Conclusion

The tiled ensemble enables anomaly detection in higher-resolution images while still satisfying memory constraints. This technique involves dividing the image into smaller tiles and training a separate model for each tile position. It was implemented as part of Anomalib within the scope of the Google Summer of Code 2023 project at the OpenVINO organization.

This approach demonstrates the potential for improvement in the localization of smaller defects in research and industrial-level applications, opening new possibilities for improving existing and new approaches, all while satisfying memory requirements.
