Anomaly detection in brightfield microscopy images

Nurlan Kerimov
13 min read · Jun 14, 2020


Disclaimer: This project was developed by Kaspar Hollo and Nurlan Kerimov for the Neural Networks course at the University of Tartu. The data and the code used in this project are not public, and only a few examples from the dataset are shown in this blog post. The data was provided by PerkinElmer.

Introduction

Nowadays, microscopy images are often used for medical diagnosis. For example, in this paper, a deep learning model was developed to count mitotic cells to help diagnose breast cancer. There is a problem though — the captured microscopy images may contain so-called anomalies, which can be considered noise. Cell count and position predictions (cell segmentation) have been found to perform badly in areas with anomalies. In our project, we tried to predict the pixels belonging to anomalies (instance segmentation) in the microscopy images. If the anomaly segmentation is successful, problematic areas in the images can be avoided and better performance in downstream image processing can be achieved.

For more information about the different computer vision tasks (classification, object detection, semantic segmentation, instance segmentation), feel free to take a look at the following link:

Data Description

The brightfield images in our dataset are 1080×1080 pixel grayscale images, and the dataset itself was split into 3 parts:

  • Training: 2016 images (1080×1080×1)
  • Validation: 504 images (1080×1080×1)
  • Test: 504 images (1080×1080×1)
Examples of brightfield microscopy images with anomalies.

We carefully selected the images with potential anomalies and the balance of anomaly/normal images in our dataset turned out to be:

  • Training: 196/2016 ~ 9.7%
  • Validation: 71/504 ~ 14%
  • Test: 101/504 ~ 20%

The dataset contains images of nuclei from seven different cell lines, namely human lung carcinoma (A549), canine kidney epithelial cells (MDCK), human cervical adenocarcinoma (HeLa), human breast adenocarcinoma (MCF7), mouse fibroblasts (NIH3T3), human hepatocellular carcinoma (HepG2) and human fibrosarcoma (HT1080) (for more info about the cell types see this article). The images in our dataset also contain several types of anomalies (dark spots, scratches, bacterial colonies, hair, dust, etc.). As can be seen from the example images, each contaminated image has at least one anomaly, and more often than not there were multiple anomalies per image.

First Considerations

Since we needed to apply instance segmentation, we considered multiple existing models for transfer learning (i.e. reusing a pre-trained model as a feature extractor). The finalists turned out to be YOLO and Mask R-CNN. According to our literature review, YOLO was considered faster since it only looks at each image once. However, Mask R-CNN was considered to have better performance and to be more flexible to adapt to a custom task. For these reasons, we decided to use Mask R-CNN in our project.

Annotation

As the data was unannotated, we had to annotate the anomalies manually in all of the contaminated images. We hear what you’re saying: ”You are not experts in the field, you should not do that!”. Yes, we totally agree. But, you know:

Image is taken from https://ryuutei.wordpress.com/2012/07/23/futurama-poster-set/ . Thanks :D

We decided to annotate all of the anomalies as one big base class — "anomaly" (in total we had 2 classes — background and anomaly). As expected, annotating the anomalies turned out to be a pretty gruelling and time-consuming task, even though we only had one class to assign the annotations to. Fortunately, we found a tool which somewhat alleviated the pain through its ease of use — VGG Image Annotator (VIA). Even though the tool was easy and comfortable to use, it still had some little bugs: from time to time the exported JSON file didn't include some of the annotations, or the format of the attribute (in our case "anomaly") was slightly off. For these reasons, the JSON file had to be checked manually before it was used; a sketch of such a check is shown below.

Screenshot from VGG Image Annotator (VIA)
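
For illustration, here is a minimal sketch of the kind of sanity check we mean, assuming the VIA 2.x annotation export format (a dictionary keyed by image, with polygon regions under "regions"); the file name and the specific checks are just examples:

import json

# Load the VIA export (the file name is a placeholder)
with open("via_export.json") as f:
    annotations = json.load(f)

for key, entry in annotations.items():
    filename = entry.get("filename", key)
    regions = entry.get("regions", [])
    if not regions:
        print(filename, "has no exported annotations")
    for region in regions:
        shape = region.get("shape_attributes", {})
        attrs = region.get("region_attributes", {})
        # Every region should be a polygon and carry the "anomaly" attribute
        if shape.get("name") != "polygon":
            print(filename, "has an unexpected shape:", shape.get("name"))
        if "anomaly" not in attrs:
            print(filename, "has a region without the 'anomaly' attribute")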

The sheer amount of images (and remember, each contaminated image had at least one anomaly) and the somewhat buggy tool weren't the only problems which accompanied the annotation task. In particular, there were 3 more problems:

  • The minimum size of the anomalies

There just had to be a minimum size threshold for the anomalies, otherwise we would still be annotating the images, as there were loads of small scratches on many of them. But this, in turn, raised a question — how big should an anomaly be to be classified as an anomaly? As a rule of thumb, we chose to annotate anomalies only if they were at least twice as big as their surrounding cells. But this was also not an ideal method, because the images were not captured at the same zoom level and therefore the apparent cell sizes varied.

  • The borders of the anomalies

In some cases, the borders of the anomalies caused a bit of a headache, because they were not clear-cut or were even blurry. So these anomalies were mainly annotated based on our subjective gut feeling.

  • Contextual differences between images

As normal human beings, while annotating images, we are only able to spot anomalies in the context of a specific image. What we mean by this is that there were multiple cases where we would classify an object as an anomaly in one specific image context but not in another. Simply put, we were able to detect an anomaly in an image in comparison to the rest of the same image.

Environment

Initially, we thought about using the university's HPC for training the network. However, setting up the environment there took some effort, so we decided to use Google Colab's free 12GB GPU environment instead. Since we had a relatively small dataset and Colab needed hardly any configuration, it was the better choice in our opinion. Having a premium GDrive account (which offered 100GB of storage) came in handy here: the weights saved after one training epoch take ~250MB, so without a premium account it would not have been possible to train for (and store the weights of) many epochs.

Mask R-CNN

Mask R-CNN is a model for the instance segmentation task. It is an extension of the Faster R-CNN model, which itself is the evolved version of Fast R-CNN. You can find in-depth information about Mask R-CNN and the other models in the following articles:

We did not see the point of re-explaining the model, because the creators of the model have probably done it better than we ever could :). However, we will briefly explain what kind of problems we encountered while implementing the Mask R-CNN model.

Problems we encountered with the model

The widely available implementation of Mask R-CNN (the matterport repository) is developed to work with regular 3-channel (RGB) images. Although some instructions are provided on how to make the model work with grayscale images, implementing all of them was not nearly enough to actually run the model on grayscale images. After multiple changes in the source code and dozens of WTHs, we managed to run the training with our grayscale images. Ultimately, we had trained weights and we were very proud of this:

Then it turned out that our problems were far from solved — we found out that we had to make even more changes in the source code to be able to predict with the trained model. Because, you know, the prediction path was implemented (hard-coded) to work only with RGB images (and no instructions were provided on how to make this part work :) ).

When we were trying to come up with a solution for this issue, we were like:

“Wait, actually converting our images to RGB will probably not affect the performance too much. Yes, we will probably lose some time in training, but this change is only related to the very first layer in the multi-dozen layered model. So, we should be ok!”.

So, we started to train the Mask R-CNN model on our RGB-converted grayscale images. We used cell-nuclei weights with ResNet101 as the backbone and re-trained only the heads (the layers closest to the output):

"heads": r"(conv1_.*)|(mrcnn_.*)|(rpn_.*)|(fpn_.*)",
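
In the matterport implementation, the string passed as the layers argument of model.train is looked up in a dictionary of such layer regexes, so re-training only the heads looks roughly like the sketch below (the config object, paths, datasets and epoch count are placeholders, not our exact setup):

import mrcnn.model as modellib

# Build the model in training mode (config and MODEL_DIR are placeholders)
model = modellib.MaskRCNN(mode="training", config=config, model_dir=MODEL_DIR)

# Start from the pre-trained cell-nuclei weights, matching layers by name
model.load_weights(NUCLEUS_WEIGHTS_PATH, by_name=True)

# Re-train only the layers matched by the "heads" regex above
model.train(dataset_train, dataset_val,
            learning_rate=config.LEARNING_RATE,
            epochs=20,
            layers="heads")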

Another problem we encountered with Mask R-CNN was that it does not work with the latest versions of TensorFlow; it uses TensorFlow 1.13.1. Nowadays, if you install tensorflow e.g. with pip, it can automatically use your GPU if you have an eligible one, so you don't need to install tensorflow-gpu explicitly. But, guess what: IT WAS NOT LIKE THAT IN VERSION 1.13.x. So we thought that we were training on the GPU, but apparently we were wrong:

One epoch was taking around 4 hours. After figuring out our dumb mistake, each epoch took less than 3 minutes.

Magic :D

To sum it up: we wouldn't suggest making the model work with grayscale images specifically (just let Mask R-CNN convert the images to RGB), and make sure you are actually using the GPU while training your model. A quick sanity check for the latter is shown below.
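
For TensorFlow 1.13.x this means installing the GPU build explicitly (pip install tensorflow-gpu==1.13.1) and then verifying that a GPU is actually visible. A minimal check using the TF 1.x API could look like this:

import tensorflow as tf
from tensorflow.python.client import device_lib

# True only if TensorFlow was built with GPU support and can see a GPU
print(tf.test.is_gpu_available())

# Lists all devices TensorFlow can use; a GPU shows up with device_type 'GPU'
print(device_lib.list_local_devices())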

Augmentation

In order to fight overfitting and effectively increase our training data, we also used augmentation. Since we had grayscale images, we used only very basic augmentation techniques, such as rotation, mirroring, Gaussian blur and multiplication (shifting the intensities towards brighter or darker tones):

import imgaug.augmenters as iaa

# Apply between 0 and 2 of the listed augmenters to each training image
augmentation = iaa.SomeOf((0, 2), [
    iaa.Fliplr(0.5),                       # horizontal flip with 50% probability
    iaa.Flipud(0.5),                       # vertical flip with 50% probability
    iaa.OneOf([iaa.Affine(rotate=90),      # rotate by 90, 180 or 270 degrees
               iaa.Affine(rotate=180),
               iaa.Affine(rotate=270)]),
    iaa.Multiply((0.8, 1.5)),              # darken or brighten the whole image
    iaa.GaussianBlur(sigma=(0.0, 5.0))     # blur with a randomly chosen sigma
])
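
In the matterport implementation, this augmenter can then be passed to model.train through its augmentation argument, so the augmentations are applied on the fly during training rather than by enlarging the dataset on disk.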

Training/Testing Parameters

The Mask R-CNN model has dozens of parameters, and it takes quite a long time to understand the functionality and/or meaning of all of them. Although we kept most of the parameters at their default values, we were strongly guided by the parameters of the cell-nuclei detection model.

Results

In order to give some intuition of what the predictions of our model look like, we added some masked images below. The left column shows only the original images. The right column shows the ground truths (highlighted in green — the annotations we made manually) and the predictions of the model (highlighted in red). The model also predicts bounding boxes for the anomalies, but we are not specifically interested in those in this project.

Region Proposal Network (RPN) parameters

As you can see from the predicted images above, there are multiple overlapping predictions/masks in the areas of the ground truths (annotated areas). Mask R-CNN basically scans the whole image with windows of these (somewhat arbitrarily chosen) sizes and looks for anomalies in each window.

Image is taken from here. Thanks :)

As we set multiple values for the window (anchor) sizes, there are also multiple smaller, overlapping predictions instead of one big prediction. In addition, we think that the bigger predictions were often excluded because we set a minimum confidence threshold of 0.7 for the predictions, and these bigger predictions just didn't reach it. A sketch of the relevant configuration values is given below.
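
For reference, in the matterport implementation these settings live in the model's Config class. A minimal sketch, assuming that implementation (the values shown are its defaults, not necessarily the exact ones we ended up with):

from mrcnn.config import Config

class AnomalyConfig(Config):
    NAME = "anomaly"
    NUM_CLASSES = 1 + 1                          # background + anomaly
    RPN_ANCHOR_SCALES = (32, 64, 128, 256, 512)  # anchor (window) sizes in pixels
    DETECTION_MIN_CONFIDENCE = 0.7               # detections below this score are discarded

config = AnomalyConfig()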

Numeric results (Metrics)

In order to measure the metrics, we first merged (took the union of) the predictions wherever they overlapped. Then we calculated the Intersection over Union (IoU) for each of the merged predictions.

The image is taken from here. Thanks to you too :D
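
Concretely, the per-mask IoU can be sketched as follows (a minimal sketch assuming boolean NumPy masks; the step that groups overlapping predictions together is omitted, and the variable names are illustrative):

import numpy as np

def mask_iou(pred_mask, gt_mask):
    # Pixelwise Intersection over Union of two boolean masks
    intersection = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return intersection / union if union > 0 else 0.0

# Union of a group of overlapping predicted instance masks (H x W x N boolean array),
# compared against one annotated region
merged_pred = np.any(pred_masks, axis=-1)
iou = mask_iou(merged_pred, gt_mask)
is_true_positive = iou >= 0.5   # the threshold we used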

So, if the IoU is 0, it means the model predicted an anomaly where we did not annotate one (a.k.a. a false positive). If the IoU is equal to or greater than 0.5 (IoU >= 0.5), we counted it as a correct prediction (a.k.a. a true positive); if not (IoU < 0.5), it is a false positive. After computing the pixelwise overlaps, we counted the metrics per merged prediction mask. In this setup we cannot have any true negatives, but we do have false negatives whenever an annotated region has not been predicted as an anomaly. The numeric results are as follows:

Validation accuracy:  0.309
Validation precision: 0.323
Validation recall: 0.871
Validation F1 score: 0.472
-----------------------------
Test accuracy: 0.38
Test precision: 0.399
Test recall: 0.89
Test F1 score: 0.551
-----------------------------
Train accuracy: 0.509
Train precision: 0.52
Train recall: 0.961
Train F1 score: 0.675

Discussion of the results

Problematic image type

One thing we wanted to address is that our model had some rather poor results in one specific image type — HepG2. We think that the background structure of HepG2 and the patterns of some of the anomalies were too similar and that is why our model made some pretty bad predictions in this specific image type.

In order to see the effect of this problematic image type, we excluded those images and compared the metrics.

Performance before filtering problematic type
Validation accuracy: 0.309
Validation precision: 0.323
Validation recall: 0.871
Validation F1 score: 0.472
----------------------------------------------------
Performance after filtering problematic type
Validation filtered accuracy: 0.349
Validation filtered precision: 0.368
Validation filtered recall: 0.869
Validation filtered F1 score: 0.517

As can be seen from the results above, this problematic image type causes only small differences in the validation metrics (~4 percentage points).

Performance before filtering problematic type
Test accuracy: 0.38
Test precision: 0.399
Test recall: 0.89
Test F1 score: 0.551
----------------------------------------------------
Performance after filtering problematic type
Test filtered accuracy: 0.448
Test filtered precision: 0.463
Test filtered recall: 0.931
Test filtered F1 score: 0.618

In the test dataset the difference was a little higher (~7 percentage points), because it contained more images of this type.

Unclear and inadequate annotation

We think that the main reason for the low accuracy is the imprecision of our annotations. Since we are not experts in the field and were annotating for the first time, all the decisions made in the annotation process were entirely subjective. We decided not to annotate small anomalies (anomalies with the size of one cell or less in that particular image), yet our model did a good job predicting smaller anomalies which we did not annotate. Secondly, we overlooked some (relatively big) anomalies which were detected by the model. When we look into what drags the numeric performance of our model down, we can see that most of the damage is done by false positive predictions (see no_overlap_count — simply, the anomalies which we did not care to annotate but which were predicted by the model, plus some genuinely wrong predictions):

total_predicted_count:  167
over_threshold_count: 54
no_overlap_count: 88
not_sign_overlap: 25
false_negative_count: 8
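
For context, these counts are consistent with the validation metrics reported above (assuming the counts refer to the validation set): precision = over_threshold_count / total_predicted_count = 54 / 167 ≈ 0.323, and recall = over_threshold_count / (over_threshold_count + false_negative_count) = 54 / (54 + 8) ≈ 0.871. The 88 no-overlap predictions and the 25 predictions with insignificant overlap together make up the 113 false positives.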

Overall model performance

Although we got rather poor metric results, we think that our model generalized well and was even able to make predictions for regions which were definitely anomalies but weren't annotated (e.g. considered too small to annotate, or simply overlooked by the human eye). We are happy with the results, and we think that if the images were annotated by specialists, the model would achieve better metric results.

Recommendations and Summary

So, what could be done to improve the results even further? Luckily, we have some ideas that might help to do just that.

The first idea would be to add more annotation classes. As we are dealing with grayscale images, it can be beneficial to classify anomalies based on their patterns (e.g. black spots, bacteria, etc.). This way it would be possible to identify which anomaly types are predicted better and apply more data/augmentation to the more problematic classes.

The second idea would be to train a model with each image type separately. As we described earlier, we suspect that the background structure of some image types and the patterns of some of the anomalies were very similar and that caused some confusion in the model. Therefore separating the different image types would probably eliminate this problem.

The last idea would be to create an ensemble model (Mask R-CNN and U-Net) to get better results. In the following paper, it was claimed that this kind of ensemble model achieved far better results than either of the models separately. It would be interesting to see what results the ensemble model would achieve on our dataset.

Despite the rather poor metric results, we think that our model generalized well and was even able to find anomalies which were definitely anomalies but weren’t annotated.

Acknowledgements

We would like to thank Dmytro Fishman and Mohammed Ali for their excellent supervision. Additionally, we would like to thank the University of Tartu and the teaching staff of the Neural Networks course for their great effort.

References

https://arxiv.org/abs/1901.10170

https://github.com/matterport/Mask_RCNN/tree/master/samples/nucleus

https://www.nature.com/articles/s41374-019-0275-0

https://engineering.matterport.com/splash-of-color-instance-segmentation-with-mask-r-cnn-and-tensorflow-7c761e238b46

http://www.robots.ox.ac.uk/~vgg/software/via/

https://www.biorxiv.org/content/10.1101/764894v1.full.pdf

https://www.pyimagesearch.com/2016/11/07/intersection-over-union-iou-for-object-detection/

https://towardsdatascience.com/faster-r-cnn-object-detection-implemented-by-keras-for-custom-data-from-googles-open-images-125f62b9141a

https://machinelearningmastery.com/object-recognition-with-deep-learning/

https://github.com/matterport/Mask_RCNN/wiki#training-with-rgb-d-or-grayscale-images
