Using Generative Adversarial Networks (GANs) for Data Augmentation in Colorectal Images
A new way to generate augmented data for medical datasets could provide new insights for improving deep learning classifiers.
So a pretty recent development in machine learning is the Generative Adversarial Network (GAN), which can generate realistic images (shoutout to my boi Ian Goodfellow for having the big brains to do this). An even more recent development is the use of GANs to generate images that can be used to augment datasets. Over the past few months, I’ve been working on a project that uses CycleGANs (more specific info about CycleGAN is in my previous blog post) to generate synthetic data for a colorectal polyps dataset. I had the privilege of presenting my work at NeurIPS 2019 (I’ll write a blog post on my NeurIPS experience in the near future).
The Problem. Using deep learning models for image classification tasks in a medical context is very popular, but accurately analyzing medical images with deep learning classifiers requires large and balanced datasets. Such datasets aren’t always available, though. For many diseases, the distribution of disease sub-classes in collected datasets is heavily skewed by each class’s prevalence among patients, so detecting rare diseases in medical images with deep learning can be challenging. In other words, it’s hard to get enough training data if the disease is super uncommon. In these situations, a reliable method of data augmentation can mitigate the effects of data imbalance by preventing overfitting and thus improving overall performance.
The (General) Solution. GANs!!! Use GANs to generate images of the rare diseases so that you don’t have that problem anymore. More specifically, CycleGAN might be better for medical images because information from one class can help create synthetic data for another class. Think about it this way: it’s probably easier to generate an image of lung cancer if you’re already given the general structure of a lung than if you have to make it from random noise. For that reason, an image-to-image translation model is well suited to the task.
The Dataset. I used a dataset of colorectal polyp images collected from the Dartmouth-Hitchcock Medical Center in New Hampshire, USA. I had 427 high-resolution whole-slide images, which I split into a training set of 326 whole-slide images and a testing set of 101 whole-slide images. For the training set, I had pathologists annotate all whole-slide images with bounding boxes representing regions of interest, for a total of 3,517 variable-size image crops. Each image crop was labeled with a single class for the polyp type: either benign (normal or hyperplastic) or precancerous (tubular adenoma, tubulovillous/villous adenoma, or sessile-serrated adenoma). That’s a lot of medical jargon; the image below shows the important part, and right after it there’s a rough sketch of how those crops get pulled out of a whole-slide image.
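(For the curious, here’s roughly what extracting those annotated crops looks like. The paper doesn’t prescribe any particular tooling, so treat this as a minimal sketch using OpenSlide, with made-up file names and a made-up (x, y, width, height) box format rather than the actual pipeline.)

```python
# Minimal sketch (not the paper's actual code) of extracting annotated
# regions of interest from a whole-slide image with OpenSlide. Assumes
# bounding boxes are (x, y, width, height) in level-0 coordinates.
import openslide

def extract_crops(slide_path, boxes):
    """Return one RGB PIL image per bounding box."""
    slide = openslide.OpenSlide(slide_path)
    crops = []
    for x, y, w, h in boxes:
        # read_region returns an RGBA PIL image at the requested pyramid level
        crops.append(slide.read_region((x, y), 0, (w, h)).convert("RGB"))
    slide.close()
    return crops

# Hypothetical usage: one slide with two pathologist-annotated boxes
for i, crop in enumerate(extract_crops("slide_001.svs",
                                       [(1000, 2000, 512, 512),
                                        (4096, 1024, 768, 640)])):
    crop.save(f"slide_001_crop_{i}.png")
```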
The Team of Networks. For my project, I used CycleGAN to convert images of normal colon tissue to images of precancerous colon tissue (the real medical lingo is kinda complex so we’ll leave it in that simplified form). But when you try to train CycleGAN with all of your precancerous tissue images, the generated images might look pretty normal. That’s because not all images that are labeled by pathologists as precancerous have strong precancerous features. In other words, some images might just be a little bit precancerous, but they still have to be labeled as precancerous; the images range from slightly precancerous to very precancerous. This is different from your typical GAN that generates pictures of dogs: you don’t have images that are slightly dog vs. very dog. So what’s the fix? Pretrain a classifier on your precancerous images and only use the images that are most precancerous to train your CycleGAN. How do you do that? Run the pretrained classifier on all of your training images and, for each class, keep only the images with the highest output probabilities. Don’t believe this works? I’ll show you the numbers in a second, but first, here’s roughly what that filtering step looks like in code:
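(Just a minimal PyTorch sketch of the idea, not the paper’s actual code; `classifier`, `ta_crops`, and `TA_IDX` are placeholder names for the pretrained polyp classifier, a dataset of tubular adenoma crops, and that class’s index.)

```python
# Sketch of the filtering step: score every candidate crop with a
# pretrained classifier and keep only the most confidently precancerous
# ones for CycleGAN training. All names here are placeholders.
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader

@torch.no_grad()
def rank_by_confidence(classifier, dataset, target_class, device="cuda"):
    """Return dataset indices sorted from most to least confident."""
    classifier.eval().to(device)
    scores = []
    for images, _ in DataLoader(dataset, batch_size=64, shuffle=False):
        probs = F.softmax(classifier(images.to(device)), dim=1)
        scores.append(probs[:, target_class].cpu())
    scores = torch.cat(scores)
    return torch.argsort(scores, descending=True).tolist()

# Keep only the most confident fraction of crops for CycleGAN training
# (this fraction is the "alpha" discussed in the next section)
ranked = rank_by_confidence(classifier, ta_crops, target_class=TA_IDX)
most_adenomatous = ranked[: max(1, len(ranked) // 32)]  # e.g. alpha = 1/32
```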
Proof that this filtering method works! So you’ll see in the table above that I got numbers to back me up. Let me first define alpha for you: I only use the top alpha fraction of images for training CycleGAN. So for alpha = 1/4, I only choose the top 25% most adenomatous images to train CycleGAN. As alpha gets smaller, I’m choosing fewer and fewer images and restricting my training set to only the most precancerous ones. The table above shows the percent of CycleGAN’s generated images that a classifier recognizes as the class they’re supposed to be. Let’s take the TA (tubular adenoma) class as an example. When I train CycleGAN with all TA images, only 35.4% of the generated images are classified as tubular adenoma. But when I train CycleGAN with only the top 1/32 most precancerous TA images, over 90% of the generated images are classified as tubular adenoma. That’s an A! So yeah, use this filtering method to help your pal CycleGAN out. The picture below this is further proof for the h8rs.
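(If you want to compute that kind of recognition rate for your own generated images, it boils down to running a classifier over them and counting how many land in the intended class. A rough sketch, again with illustrative placeholder names like `generated_ta_loader`, not the paper’s code.)

```python
# Rough sketch: fraction of generated images that a classifier assigns
# to the class they are supposed to represent. Names are placeholders.
import torch

@torch.no_grad()
def fraction_recognized(classifier, generated_loader, target_class, device="cuda"):
    classifier.eval().to(device)
    hits, total = 0, 0
    for images in generated_loader:
        preds = classifier(images.to(device)).argmax(dim=1)
        hits += (preds == target_class).sum().item()
        total += images.size(0)
    return hits / total

# e.g. roughly 0.35 for CycleGAN trained on all TA images,
# versus over 0.90 when trained on only the top 1/32 most adenomatous crops
print(fraction_recognized(classifier, generated_ta_loader, TA_IDX))
```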
Quality of Resulting Synthetic Images. Here’s the heavy hitter: I showed my CycleGAN-generated images to expert pathologists, and they had a hard time telling the real images from the fake ones. I gave 4 pathologists 100 real images and 100 fake images and asked them to classify each image as real or fake. No more than half of the pathologists could distinguish real from fake at a statistically significant level. What does this mean? It means the fake images are similar enough to real images that even experts can barely tell them apart. So that’s some cool proof that CycleGAN learned something. Check out the picture below for some more details.
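(For the statistically inclined: one simple way to check whether a pathologist’s real-vs-fake calls beat chance is a binomial test against p = 0.5 over the 200 images. The paper’s exact test may differ, and the count of correct calls below is made up purely for illustration.)

```python
# Did this pathologist do better than guessing? A two-sided binomial test
# against chance (p = 0.5). The number of correct calls is hypothetical.
from scipy.stats import binomtest

n_images = 200    # 100 real + 100 CycleGAN-generated images shown
n_correct = 112   # hypothetical count of correct real/fake calls

result = binomtest(n_correct, n_images, p=0.5, alternative="two-sided")
print(f"accuracy = {n_correct / n_images:.2f}, p-value = {result.pvalue:.3f}")
# A p-value >= 0.05 means this pathologist's calls are not statistically
# distinguishable from random guessing.
```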
Improving Classifier Performance. Now that you have realistic images of your rare classes, you can augment your dataset! I tossed some generated images into my dataset and trained a ResNet classification model to identify images of sessile-serrated adenomas, the smallest class in the original dataset. When I compared this ResNet to a ResNet trained using the original dataset alone, the differences were clear (there’s a rough sketch of the setup after the list below):
- ResNet trained using real data only achieved an AUC of 0.78, similar to ResNets trained using real data and augmented data from DCGAN and DiscoGAN baselines.
- ResNet trained using real data and augmented data from CycleGAN achieved an AUC of 0.89.
- ResNet trained on only synthetic data from CycleGAN outperformed ResNets trained on only synthetic data from DCGAN and only synthetic data from DiscoGAN.
- ResNets trained using only synthetic data did not achieve performance comparable to ResNets trained using real data.
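Here’s the promised sketch of the augmentation experiment: mix the CycleGAN-generated images into the real training set, fine-tune a ResNet, and score it with AUC on the held-out real test set. The dataset objects, the specific ResNet variant (ResNet-18 here), and the hyperparameters are placeholders, not the paper’s exact configuration.

```python
# Sketch of training a binary SSA vs. not-SSA classifier on real data
# augmented with CycleGAN images, then computing AUC on the real test set.
# Dataset objects and hyperparameters are placeholders.
import torch
import torch.nn as nn
from torch.utils.data import ConcatDataset, DataLoader
from torchvision.models import resnet18
from sklearn.metrics import roc_auc_score

train_data = ConcatDataset([real_train_set, cyclegan_synthetic_set])
train_loader = DataLoader(train_data, batch_size=32, shuffle=True)
test_loader = DataLoader(real_test_set, batch_size=32)

model = resnet18(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, 2)  # SSA vs. everything else
model = model.cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

for epoch in range(10):
    model.train()
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images.cuda()), labels.cuda())
        loss.backward()
        optimizer.step()

# Evaluate: AUC over predicted SSA probabilities on the held-out real data
model.eval()
scores, targets = [], []
with torch.no_grad():
    for images, labels in test_loader:
        probs = torch.softmax(model(images.cuda()), dim=1)[:, 1]
        scores.extend(probs.cpu().tolist())
        targets.extend(labels.tolist())
print("AUC:", roc_auc_score(targets, scores))
```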
Closing thoughts. I’d say this was a pretty cool project (although I am biased). The implications are definitely big, since class imbalance is a problem all over machine learning, and being able to use GANs to address it would be a major development. If you have an unbalanced medical dataset and want to improve the performance of your classifiers, I’d definitely suggest trying this method out. Your results might get much better!
The full paper for the work can be found here.
Source code is available on GitHub at this link.
If you happen to use these methods, feel free to cite this paper as
Jerry Wei, Arief Suriawinata, Louis Vaickus, Bing Ren, Xiaoying Liu, Jason Wei, Saeed Hassanpour, “Generative Image Translation for Data Augmentation in Colorectal Histopathology Images”, Full Paper in Machine Learning for Health Workshop at NeurIPS, In Press, 2019.