Weakly Supervised Transfer Learning in Medical Imaging

Kerem Turgutlu
May 7, 2019 · 10 min read

Also part 2 of Semantic Segmentation — U-Net

Why Transfer Learning?

It is now a well-known fact that bigger models combined with immense amounts of data beat the previous state of the art at any given task. Unfortunately, in some tasks or domains it’s not trivial to obtain that amount of high-quality data for training purposes. That’s why we still put faith in techniques which don’t rely on huge amounts of data but still generate competitive results.

The first technique that comes to mind in such a setting is transfer learning. Transfer learning is the general idea of using pretrained models and finetuning them on a downstream target task. In general, Imagenet pretrained models for computer vision and language models for NLP are the most popular and widely used choices.
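To make this concrete, here is a minimal finetuning sketch using torchvision (the number of downstream classes is a made-up placeholder): load an Imagenet-pretrained backbone, replace its classification head, and train on the target data.

```python
import torch.nn as nn
import torchvision.models as models

num_classes = 5  # hypothetical number of classes in the downstream task
model = models.resnet34(pretrained=True)                  # Imagenet weights
model.fc = nn.Linear(model.fc.in_features, num_classes)   # new task-specific head

# A common recipe: freeze the pretrained backbone and train only the new head first,
# then unfreeze everything and finetune with a smaller learning rate.
for p in model.parameters():
    p.requires_grad = False
for p in model.fc.parameters():
    p.requires_grad = True
```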

In my experience, transfer learning is almost always helpful when there is not enough data for a given task. The usefulness of transfer learning is correlated with the features learned by the pretrained network and the transferability of these features to the target downstream task. This also explains why Imagenet pretrained models are more helpful with natural images than with medical images.

One should know that it is not an immutable truth that all kinds of transfer learning help under all circumstances. There are papers stating that transfer learning was not helpful for some tasks in their experiments, and there are arguments that with enough data transfer learning is not necessary. All of these claims might be true, but I believe that every task-data-sample size trio is a unique case and we should conduct ablation studies without making any prior assumptions. What I am sure of, though, is that transfer learning can’t hurt a model’s performance: after all, the worst case scenario would be the same as randomly initializing the model weights.

Lately, I conducted some experiments for my work investigating the effects of task-dependent and task-independent transfer learning for semantic segmentation. What I mean by task-independent transfer learning is using a pretrained model that was trained on a different task than semantic segmentation, such as Imagenet classification. Conversely, when I refer to task-dependent transfer learning I mean using a pretrained model which was trained on the same task as the downstream task, for example the COCO semantic segmentation task.

Below, the blue curve represents an Imagenet pretrained Dynamic Unet model, the green curve represents a randomly initialized Dynamic Unet model trained from scratch, and the orange curve represents a Fully Convolutional Network with Inception blocks, called DeepCut in the graph, again trained from scratch. All of the models are trained with varying sample sizes to see the effect of training sample size. We can see that the Imagenet pretrained model (blue curve) outperforms at all sample sizes, and with only 100 samples it gives results very close to training on the 10x larger full dataset.

No Pretraining
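For reference, a rough fastai v1 sketch of the two setups compared above (not the exact training code from these experiments; the segmentation `data` object is assumed to already be built): the same Dynamic Unet, once with an Imagenet-pretrained encoder and once from random initialization.

```python
from fastai.vision import unet_learner, models

# data: a fastai DataBunch built from the segmentation dataset (assumed to exist)
learn_pretrained = unet_learner(data, models.resnet34, pretrained=True)   # blue curve setup
learn_scratch = unet_learner(data, models.resnet34, pretrained=False)     # green curve setup

learn_pretrained.fit_one_cycle(10)
learn_scratch.fit_one_cycle(10)
```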

Here we see the exact same models, but this time all of them are pretrained on COCO segmentation data and later finetuned. Even though the Imagenet->COCO->Downstream model outperforms at all sample sizes, the other models also perform quite well with just 100 samples thanks to transfer learning.

Pretrained on COCO

These experiments made me realize that models benefit from transfer learning as long as the learned features are meaningful for the downstream task, even when the pretraining task is different from the downstream task. For example, we know that image classification outputs class label probabilities, but in fact it inherently learns object localization as well. Class activation maps are an excellent way to prove this point. These features provide powerful priors for finetuning on localization tasks such as semantic segmentation.
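For the curious, here is a minimal class activation map sketch (Zhou et al., 2016) for a ResNet-style classifier: the final convolutional feature maps, weighted by the fc weights of a class, localize that class. The input tensor below is a random stand-in for a real preprocessed image.

```python
import torch
import torch.nn.functional as F
import torchvision.models as models

model = models.resnet34(pretrained=True).eval()

# Grab the final convolutional feature maps with a forward hook.
feats = {}
model.layer4.register_forward_hook(lambda m, i, o: feats.update(maps=o))

x = torch.randn(1, 3, 224, 224)        # stand-in for a preprocessed image
with torch.no_grad():
    logits = model(x)
cls = logits.argmax(dim=1).item()

# Weight the (1, C, H, W) feature maps by the fc weights of the chosen class.
w = model.fc.weight[cls]                                        # (C,)
cam = (w[None, :, None, None] * feats["maps"]).sum(dim=1)       # (1, H, W)
cam = F.relu(cam)
cam = F.interpolate(cam[None], size=x.shape[-2:], mode="bilinear",
                    align_corners=False)[0, 0]                  # upsample to input size
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)        # normalize to [0, 1]
```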

Why Weakly Supervised Learning?

Now that we know transfer learning is great in all circumstances, it’s time to give the bad news: unfortunately you won’t have a pretrained model for every deep learning task you would like to tackle. In my opinion there are two main reasons why a pretrained model might not be available:

  1. Your task might require a new/custom/unique architecture which is not yet pretrained on a large dataset or
  2. It might be very expensive to collect a dataset for pretraining purposes

For my project in collaboration with UCSF, we were lucky enough to have both of these problems :)

The goal of the project was to segment the brain for skull stripping and the ventricles for pre- and post-surgery volume calculations. You may search Google for brain ventricles. For ethical reasons I won’t be posting any sample images, even though all patients were anonymized. Volumetric changes in the ventricles are important since they indicate whether the patient should be released after a surgery. The decision process is a very manual task which requires careful assessment by looking at both scans (before and after surgery) side by side. Another option is to use a commercial software package such as MIM and manually segment the ventricles, then compare by calculating both volumes. Manual segmentation for just a single patient takes a couple of hours even in the hands of experts, depending on how much accuracy you need.

So, what we wanted to accomplish with this project was to come up with a deep learning model which would automatically segment the ventricles. Throughout the experiments both 2D and 3D CNN models were used, and 3D models gave superior results as they also incorporate contextual information from multiple slices simultaneously. This unique architectural choice brings us to the first problem of not having a pretrained model. As a solution we might have trained our own pretrained models, but in the field of medicine it’s impossible or too costly to obtain a dataset at Imagenet scale, and this brings us to the second problem.

Since transfer learning was not an option, a lot of time was spent acquiring data for a couple of months. Without using transfer learning I conducted experiments with different variants of 3D CNN models. As promised in a preceding blog post, I will explain each model in detail.

3D Unet [paper]

In the preceding blog post I explained the Unet and Dynamic Unet architectures in depth. Here, 3D Unet is the exact same model, but instead of using 2D operations such as 2D convolutions, 2D pooling and 2D dropout we use their 3D counterparts.

3D Unet
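As a small illustration of the 2D-to-3D swap (a sketch, not the exact blocks from my implementation), here is the familiar U-Net double-convolution stage written with the 3D counterparts of each operation:

```python
import torch.nn as nn

def conv_block_3d(in_ch, out_ch):
    # The usual U-Net double conv, with Conv3d/BatchNorm3d instead of the 2D versions
    return nn.Sequential(
        nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm3d(out_ch),
        nn.ReLU(inplace=True),
        nn.Conv3d(out_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm3d(out_ch),
        nn.ReLU(inplace=True),
    )

# Downsampling and upsampling move to 3D as well
down = nn.MaxPool3d(kernel_size=2)
up = nn.ConvTranspose3d(64, 32, kernel_size=2, stride=2)
```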

MeshNet [paper]

MeshNet is an architecture that I came across after searching for “brain segmentation deep learning models”. Sometimes you go with the most trivial way of doing research and see what others have done in a similar context :) It is a fully convolutional model which uses dilated convolutions to reduce the number of parameters while maintaining the same effective receptive field.

MeshNet
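To show the idea (a MeshNet-flavoured sketch, not the exact published configuration or my implementation), here is a shallow stack of dilated 3x3x3 convolutions: increasing the dilation grows the effective receptive field without adding parameters or downsampling.

```python
import torch.nn as nn

def dilated_layer(in_ch, out_ch, dilation):
    # padding = dilation keeps the spatial size unchanged for a 3x3x3 kernel
    return nn.Sequential(
        nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=dilation, dilation=dilation),
        nn.BatchNorm3d(out_ch),
        nn.ReLU(inplace=True),
    )

meshnet_like = nn.Sequential(
    dilated_layer(1, 21, dilation=1),
    dilated_layer(21, 21, dilation=1),
    dilated_layer(21, 21, dilation=2),
    dilated_layer(21, 21, dilation=4),
    dilated_layer(21, 21, dilation=8),
    dilated_layer(21, 21, dilation=1),
    nn.Conv3d(21, 2, kernel_size=1),   # per-voxel class scores
)
```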

3D ResUnet [paper]

This model is based on the 3rd place solution from the BraTS brain segmentation challenge. It encapsulates ideas from both Unet and PreAct ResNet.

3D ResUnet
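The PreAct ResNet part can be sketched as a 3D pre-activation residual block (a rough illustration with my own choice of InstanceNorm, not the exact block from the BraTS solution): normalization and activation come before each convolution, and the input is added back through an identity shortcut.

```python
import torch.nn as nn

class PreActResBlock3d(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.norm1 = nn.InstanceNorm3d(channels)
        self.conv1 = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        self.norm2 = nn.InstanceNorm3d(channels)
        self.conv2 = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        self.act = nn.ReLU()

    def forward(self, x):
        out = self.conv1(self.act(self.norm1(x)))
        out = self.conv2(self.act(self.norm2(out)))
        return x + out   # identity shortcut
```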

I am also providing code for all of the above models, implemented by myself in PyTorch using fast.ai. In general I experimented with different combinations of normalizations, activations, width and depth within these architectures.

Experiments

In our training sets we had human-annotated MRI scans of 112 patients and CT scans of 107 patients. For each of the MRI and CT datasets, 1 validation and 2 test sets were also constructed. Sample sizes for these test sets were [15, 20] for MRI and [9, 18] for CT respectively. Test 2 has more challenging and unique cases with more abnormalities, as well as pre- and post-surgery scans. Having test sets of different difficulty is beneficial for understanding how our models generalize to rare edge cases.

At the beginning of the study we didn’t know how much data would be sufficient, and we didn’t start any experiments before the human annotation was complete. So we didn’t have the chance to say, “This is enough data now, thank you :)” to the experts. All we had as a dataset was what I explained in the previous paragraph.

Now, here comes the cool part. We realized that we can use machine-generated weak labels for pretraining and human-annotated strong labels for finetuning. I learned that it’s possible to label a couple of images and then ask the commercial software to create similar masks for a bulk of MRI scans. The software probably uses some sort of heuristic algorithm (or maybe knn based??) to come up with these masks, because they are not as good as the human annotator’s and the quality is visibly worse. After all, if the commercial software was good enough we wouldn’t be working on this project to start with. As a result we gathered a 10x larger dataset of machine-generated weak labels.

These kinds of imperfect labels are called “weak labels” because you are not 100% sure they are correct, and the process of training with weak labels is called “weakly supervised training” (a nice post from FAIR). I will call our method “weakly supervised transfer learning”: we are not only training a model with coarse labels but also finetuning it with more accurate ones.

I created 11 baseline models with combinations of different architectures, normalization layers, activation functions, dropout rates and so on, all of which can be accessed from this gist. For each dataset I trained pretrained networks using the weak labels. Later I used these pretrained networks without needing to change the architecture (since the task remained the same) and finetuned them on our strong, human-annotated labels. We only had weak labels for the MRI modality for both brain (skull stripping) and ventricle labels, so the same MRI-pretrained models were used for CT finetuning as well.
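In sketch form, the recipe looks like this (a minimal outline; `model`, `weak_loader`, `strong_loader` and `dice_loss` are placeholders for the actual architectures, data loaders and loss used): pretrain on the large weak-label set, save the weights, then reload them and finetune on the small strong-label set with a lower learning rate.

```python
import torch

def train(model, loader, epochs, lr, loss_fn, device="cuda"):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    model.to(device).train()
    for _ in range(epochs):
        for vol, mask in loader:
            vol, mask = vol.to(device), mask.to(device)
            opt.zero_grad()
            loss = loss_fn(model(vol), mask)
            loss.backward()
            opt.step()
    return model

# 1) Pretrain on the 10x larger machine-generated weak-label dataset.
model = train(model, weak_loader, epochs=50, lr=1e-3, loss_fn=dice_loss)
torch.save(model.state_dict(), "weak_pretrained.pth")

# 2) Finetune on human-annotated strong labels; the architecture is unchanged,
#    so the pretrained weights load directly.
model.load_state_dict(torch.load("weak_pretrained.pth"))
model = train(model, strong_loader, epochs=30, lr=1e-4, loss_fn=dice_loss)
```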

Below are the results for brain and ventricle segmentation tasks.

NOTL — represents no transfer learning and training with human annotated data directly from scratch.

WEAK — represents only training with weak labels from scratch.

TL — represents weakly supervised transfer learning from weak MRI labels to human annotated modality labels.

Numbers are from the best models on test1 and test2 independently.

Even though at first glance applying transfer learning seems somewhat beneficial for brain segmentation and does not seem to harm the ventricle segmentation task, I couldn’t find any statistically significant evidence that weakly supervised transfer learning works in this setting.
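(For what it’s worth, the kind of check I mean is a paired test on per-case scores, along these lines; the score lists below are placeholders, not real numbers from the experiments.)

```python
from scipy.stats import wilcoxon

dice_notl = [...]   # per-case Dice scores of the best model trained from scratch
dice_tl = [...]     # per-case Dice scores of the weakly supervised transfer model

# Paired, non-parametric comparison on the same test cases
stat, p = wilcoxon(dice_tl, dice_notl)
print(f"Wilcoxon p-value: {p:.4f}")    # p >= 0.05 -> no significant difference
```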

Then I started thinking deeply and asking myself questions:

  • Maybe the number of samples we gathered from human annotators is already enough?
  • Maybe the given tasks are too easy to be solved with such a small labeled dataset?
  • Maybe transfer learning from the MR to the CT modality is not that good of an idea, even though we normalized them in a similar fashion?
  • Maybe differences in modality distributions are not something learnable by transfer learning?

In the end, for our primary research goal we beat the commercial software in all test cases and the human annotators in some test cases.

But I was still curious to see how this weakly supervised transfer learning might help. So I conducted more experiments and varied sample sizes, and here are the final results.

Varying the sample sizes shows that for MRI -> MRI, weakly supervised transfer learning helps substantially, while it is still not clear whether it helps for MRI -> CT. With only 10 samples we perform very close to training with the 10x larger dataset, which has test1 and test2 dice scores of 0.8575 and 0.7939.
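For completeness, the Dice score these numbers refer to can be computed as below (a common formulation for binary volumetric masks, with a small epsilon to avoid division by zero; not necessarily the exact evaluation code used here).

```python
import torch

def dice_score(pred, target, eps=1e-8):
    # pred, target: binary tensors of shape (D, H, W) or (N, D, H, W)
    pred = pred.float().flatten()
    target = target.float().flatten()
    intersection = (pred * target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)
```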

In my opinion, the main takeaway is that each time we are dealing with a data-task-sample size trio we should approach the problem at hand differently, and always think about what we can do differently before going towards the most obvious and expensive solution, such as collecting more labelled data.

Future Work and Goals

Even though weakly supervised transfer learning is very cool and shows its power when training data is very limited, it’s still not always an option, because accessing or generating weak labels might not be possible for some tasks. For example, I don’t think it would be possible to come up with coarse machine annotations for diverse brain tumors, chest X-ray readings, etc. This means we need more general-purpose encoders, like Imagenet pretrained models but for medical imaging. I believe it’s now time that we have a shared open source library with pretrained medical models for as many modalities, body parts and even tasks as possible.

I just registered with UK BioBank with the goal of making this dream come true. I hope that my application will be approved and we can have a leap in medical deep learning. UK BioBank is a great organization with an amazing goal in mind: to support bona fide researchers for the public good. I highly recommend visiting their page to see the amazing work they do.

I hope you enjoyed reading this, let’s keep experimenting to explore other cool stuff together ;)
