Artificial Intelligence in Healthcare Part II

Sandra Carrasco
11 min read · Mar 14, 2022


A Tutorial on generating synthetic skin lesions using PyTorch

Written by Sandra Carrasco and Sylwia Majchrowska

Sharing medical data is one of the main challenges in the AI healthcare industry today. Although there are some high quality open dataset initiatives, such as the MIMIC or ISIC datasets, we need much more diverse and complex data in order to learn effectively. We need open access to high quality datasets in order to advance and build the best possible machine learning models for healthcare. One way healthcare institutions have shared data is by releasing de-identified or anonymized data. But this approach has its limitations: it is often not truly private, and numerous privacy breaches and successful re-identification attacks on such datasets have been reported.

Synthetic data, artificial data generated from scratch based on the real data, can serve as a proxy dataset for research, since it resembles real data but contains no real samples from any specific individual. In this way, data holders can generate high quality synthetic data to share with the machine learning community. Data scientists can then work with this data to create ML models which can be used by the original data holders (hospitals).

In our previous post we listed many advantages of using synthetic data for such purposes. Today we want to show you how to prepare your own synthetic dataset of skin lesion images using Generative Adversarial Networks (GANs).

StyleGAN2-ADA Architecture

The first step is to select a proper architecture for the task. Our research is based on the StyleGAN2-ADA network, the latest version of the ever popular StyleGAN from NVIDIA. Adaptive discriminator augmentation (ADA) is an augmentation process applied only to the discriminator to overcome the problem of overfitting. With small datasets (fewer than 30k images) the discriminator sees the same images over and over again. This can lead to the discriminator memorizing the real images, which can in turn diminish the generator's ability to create new images. The solution is to apply classical augmentations to the images that the discriminator sees.

Augmentations are used all the time in classifier networks, but with generative networks, if we augment the original dataset, the model can learn to mimic the augmentations as well, which can result in, e.g., violet skin, which is not very realistic.

In the ADA procedure these augmentations are applied during training, but only at the point where the model decides whether samples are real or fake, augmenting both the generated and the real images. This leaves the generator itself unaffected by the augmentations. The key here is that the augmentations are differentiable, so the generator can still be trained through them (the -f(x) branch in Figure 1).

Figure 1. Overview of ADA mechanism. The blue elements highlight operations related to augmentations, while the rest implement standard GAN training. The orange elements indicate the loss function and the green boxes mark the network being trained. A diverse set of augmentations are applied to every image that the discriminator sees, controlled by an augmentation probability p. Image by NVIDIA Research.

Figure 1 shows an overview of how the stochastic discriminator augmentations work. The parameter p is a probability that controls how strongly and how often these augmentations are applied: with p=0.1 the images are almost untouched by the preprocessing, while at p=0.8 they become quite unrealistic. What is important to note is that this probability is applied to each augmentation in the pipeline independently, so in some cases an image receives rotation, shifting, coloring and more all at once, while in others just one or two of those augmentations. This parameter is really important to tune: if you use a high probability, the generator can start reproducing these augmentations in the generated images, which is what we call leakage. In general, ADA offers better results for datasets with fewer than 30k training images, and training is about 1.6x faster while requiring roughly 1.5x less GPU memory.
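To make the mechanism a bit more concrete, here is a conceptual sketch of the feedback loop that tunes p (not the official implementation; the overfitting target of 0.6 comes from the paper, while the adjustment speed is illustrative):

import torch

# Conceptual sketch of ADA: the same differentiable augmentation pipeline is applied,
# with probability p, to both real and generated images before the discriminator sees
# them, and p itself is adjusted during training based on how much the discriminator overfits.
def adjust_p(p, d_real_logits, target=0.6, speed=0.01):
    # r_t = E[sign(D(augmented reals))]: values close to 1 mean the discriminator
    # separates real images too confidently, i.e. it is starting to overfit.
    r_t = torch.sign(d_real_logits).mean().item()
    # Raise p when overfitting is detected, lower it otherwise, clamped to [0, 1].
    p = p + speed if r_t > target else p - speed
    return float(min(max(p, 0.0), 1.0))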

Getting hands-on

Now, let's take a look at how we can get this up and running. First of all, let's talk about the system specifications and requirements. As you may already know, GANs are hungry for computing power, so you need at least one high-end GPU. You can either use the provided Dockerfile to build an image with all the required dependencies, or install the necessary Python libraries yourself with Python 3.7, PyTorch 1.7.1 and CUDA toolkit >= 11.0. You can find detailed instructions in the official repository.
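If you go the manual route, the setup might look roughly like this (the package list follows the official stylegan2-ada-pytorch repository; install PyTorch 1.7.1 with a CUDA 11 build separately for your system):

git clone https://github.com/NVlabs/stylegan2-ada-pytorch.git
cd stylegan2-ada-pytorch
pip install click requests tqdm pyspng ninja imageio-ffmpeg==0.4.3 psutil scipy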

ISIC 2020 dataset preparation

The next thing you need to do is prepare your custom dataset to fit the provided dataloader. In this particular case, we will walk you through the tutorial using the open-source International Skin Imaging Collaboration (ISIC) 2020 database [1] as an example, but the process would be the same for images of, let's say, brain tumors, where the data is more limited and hence more prone to overfitting during generative training. In such cases of rare diseases, the ADA mechanism is particularly interesting.

This dataset consists of more than 33 thousand dermoscopic training images of unique benign and malignant skin lesions from over 2 thousand patients. Although it is a large dataset, it is highly unbalanced, containing only about 2% melanomas, as well as gender and age biases. The dataset was created for the SIIM-ISIC Melanoma Classification Challenge hosted on Kaggle during the summer of 2020 [2], where you can find a plethora of classification implementations as well as data curation approaches. We described the dataset in detail in our previous post. Additionally, in our experiments we made use of 4 thousand external images of melanomas coming from the ISIC 2019 dataset with a resolution of 256x256 [3].

The dataset provided to train.py must be stored as an uncompressed ZIP archive containing uncompressed PNG files and a metadata file dataset.json for labels. To create such a ZIP archive from your custom dataset with optimal performance, you can use dataset_tool.py, specifying the source path of your data and the desired width and height (if your data is not resized yet).
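For example, assuming the ISIC images live in a local folder and we want 256x256 outputs, the call might look like this (paths are illustrative):

python dataset_tool.py --source ~/isic2020/train --dest ~/datasets/isic2020.zip --width 256 --height 256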

In the case of training conditional models we need class annotations for each image: in our particular case, melanoma (1) and non-melanoma (0) samples. The class labels are stored in a file called dataset.json placed inside the dataset root folder. This file has the following structure:
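In the official implementation it is a single JSON object whose "labels" entry pairs each image file name (relative to the dataset root) with its integer class index; the file names below are purely illustrative:

{
    "labels": [
        ["ISIC_0000000.png", 0],
        ["ISIC_0000001.png", 1],
        ...
    ]
}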

We can easily create this file from our ISIC dataset with the following lines of code:
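The snippet below is a minimal sketch of how this could be done, assuming the ISIC 2020 ground-truth CSV (train.csv) with its image_name and target columns, and images stored as <image_name>.png in the dataset root:

import json
import pandas as pd

# Read the ISIC 2020 ground truth: "target" is 1 for melanoma and 0 otherwise.
df = pd.read_csv("train.csv")

# Pair every image file name with its class index, as expected by the dataloader.
labels = [[f"{name}.png", int(target)] for name, target in zip(df["image_name"], df["target"])]

# Write dataset.json next to the images, inside the dataset root folder.
with open("dataset.json", "w") as f:
    json.dump({"labels": labels}, f)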

StyleGAN2-ADA Training

Now we are almost ready to start the training. In its most basic form, training new networks boils down to running the following command:

python train.py --outdir ~/training-runs --data ~/mydataset.zip --gpus 1

You can use the --dry-run option to validate the arguments and print out the training configuration.

The training exports network pickle snapshots and example images at regular intervals controlled by the --snap option. At each of these intervals it also evaluates the Fréchet Inception Distance (FID) (or any of the metrics specified by --metrics) and logs the resulting scores in a JSON file as well as in TFEvents files.

The training configuration can be customized with additional command line arguments. Here, we will cover some of the most important ones that haven’t been mentioned yet:

--cond: flag to enable class conditional training, which requires a dataset with labels, like the one we just created.

--mirror: flag which doubles the number of images by flipping them left-to-right during training. You can also set --mirrory to flip them top-to-bottom. It is important to note that these flips are independent from the ADA augmentations, so it is fine for the generator to produce flipped images.

--metrics: you can specify which metrics to compute during training as a list, or turn metric calculation off altogether (--metrics none), since it adds extra time to training. By default it computes the fid50k_full score (although the paper suggests that kid50k_full is better suited for limited datasets).

--gamma: the R1 regularization weight. Basically, it stabilizes the network, allowing higher learning rates to be used with a less uniform dataset while avoiding mode collapse; the default value is 10.

--kimg: specifies the training length in thousands of real images shown to the discriminator; once this limit is reached, training stops. Personally, we believe it is better to decide when to stop based on what you see in the generated examples at each snapshot.

--resume: use this if you are resuming training from a previous run or want to transfer-learn from another model of yours, indicating the path to a .pkl snapshot.

--aug: can be set to noaug to test StyleGAN2 without ADA. One thing to note here is that the paper states that augmentations work well for small datasets, but for larger ones they can be detrimental.

--p: fixes the augmentation probability to a constant value. By default p fluctuates during training, adjusted according to an overfitting heuristic computed from the discriminator outputs.

--augpipe: This is a really important argument that sets which augmentation types are used in the augmentation pipeline. In our experience, leakage occurs more frequently than expected, and one way to fix it is to change which augmentations are available. The available augmentation categories and their combinations are listed below:
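At the time of writing, the official implementation exposes the categories blit (x-flips, 90-degree rotations, integer translation), geom (isotropic and anisotropic scaling, arbitrary rotation, fractional translation), color (brightness, contrast, luma flip, hue, saturation), filter (image-space filtering), noise (additive noise) and cutout, as well as the cumulative combinations bg, bgc, bgcf, bgcfn and bgcfnc.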

Figure 2. Impact of p for different augmentation categories and dataset sizes on FFHQ dataset. The dashed line indicates baseline FID without augmentations. (d) Convergence curves for selected values of p using geometric augmentations. Image by NVIDIA Research.

From Figure 2 we can see that blit and geom are by far the most important augmentation categories. Color sometimes helps, while the remaining categories contribute little, so by default augpipe is bgc (blit, geometry and color). However, we have noticed that color has a tendency to leak into the output, so we removed it from the pipeline, using the code bg.
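Putting the pieces together, a conditional training run similar to ours might look like the command below (paths, GPU count and snapshot interval are illustrative):

python train.py --outdir ~/training-runs --data ~/datasets/isic2020.zip --gpus 2 --cond 1 --mirror 1 --augpipe bg --snap 10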

Watching a model train

During model training you should pay special attention to several things. Let’s discuss some of them.

Overfitting

The first pitfall you can come across is when your generated images look exactly the same as your dataset. This means that your model is essentially just memorizing the training data. Hence, the latent space won't be regularized; it will be very jumpy, meaning that you won't get a smooth interpolation between images. Briefly, the latent space is just a hypothetical space that contains the latent (hidden) representations of the images, in such a way that the generator knows how to convert a point from the latent space into an image. You can find the intuition behind latent space explained in much more detail in the article by Ekin Tiu [4].
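A quick sanity check is to interpolate between two random latent codes and look at how smoothly the generated images change. Below is a minimal sketch, assuming a trained conditional snapshot from the stylegan2-ada-pytorch codebase and run from inside the cloned repository (so its dnnlib and torch_utils modules can be unpickled); the file name and class index are illustrative:

import pickle
import torch

# Load the trained generator (G_ema is the moving average of the generator weights).
with open("network-snapshot.pkl", "rb") as f:
    G = pickle.load(f)["G_ema"].cuda()

# One-hot class label for a conditional model; index 1 = melanoma in our dataset.json.
c = torch.zeros([1, G.c_dim], device="cuda")
c[:, 1] = 1

# Linearly interpolate between two random latent codes and generate an image per step.
z0, z1 = torch.randn([2, G.z_dim], device="cuda")
for t in torch.linspace(0, 1, steps=8):
    z = ((1 - t) * z0 + t * z1).unsqueeze(0)
    img = G(z, c)  # NCHW float32, values roughly in [-1, 1]
    img = (img.permute(0, 2, 3, 1) * 127.5 + 128).clamp(0, 255).to(torch.uint8)

A memorizing model tends to snap abruptly between training-like images along this path instead of morphing gradually.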

If you are wondering how you can check for overfitting in your model: most metrics only measure fidelity or diversity, not generalization or authenticity. It is an open question in the field (and a topic for our next post, stay tuned!).

Overfitting can happen for several reasons, mainly because your dataset is not diverse or large enough, or because it contains a lot of duplicates. This is why it is so important to properly prepare your data before training the GAN. Some options to address it are to add extra images to your dataset, apply some augmentations, or play with the gamma value.

Figure 3. Simple 2D example of underfitting and overfitting. Image by geeksforgeeks.org.

Mode collapse

The second thing you should watch out for is mode collapse. This occurs when the generator finds a couple of solutions that fool the discriminator but do not cover the whole distribution of the data (see Figure 4).

To address this issue you can alter the learning rate and double the batch size: the batch size determines how many samples pass through the model on each tick, so a larger batch exposes the model to a wider variety of the data at once.

Figure 4. Example of GAN generated data (blue) for data samples coming from a 2D mixture of 8 isotropic Gaussian distributions (red). In the first image the synthetic data covers the real distribution, while the second image showcases an example of mode collapse. Image by [5].

Gradient explosion

Gradient explosion is a pretty common problem for all deep neural networks. Large error gradients accumulate and result in very large updates to the model weights during training, making the network unstable and unable to learn from the training data. In generative models, this manifests as images that look very noisy, shattered and chaotic.

Some things you can do are to restart from a previous good .pkl snapshot, decrease the learning rate, or increase the minibatch size.

Leakage

Finally, the last thing to watch out for is leakage, which happens when the augmentations leak into the generated images.

To prevent this leakage you can remove an augmentation category from --augpipe and/or lower the --p value.
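For example, if color artifacts start showing up in the output, one option is to resume from the last clean snapshot with a reduced pipeline and a fixed, lower augmentation probability (paths and values are illustrative):

python train.py --outdir ~/training-runs --data ~/datasets/isic2020.zip --gpus 2 --cond 1 --augpipe bg --aug fixed --p 0.2 --resume ~/training-runs/00000-isic2020/network-snapshot-001000.pkl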

Fake or real — do you see a difference?

After going through the whole training procedure, we can finally enjoy our artificial images of moles. The repository contains a script called generate.py that can be used to generate a series of images from specific seeds. You just need to specify the path to the network .pkl, the output directory for the generated images, the seeds (a list or a range) and the label of the class to be generated. Again, it can easily be called from the command line, as below for the melanoma class (label 1):

python generate.py --outdir out --seeds 0-35 --class 1 --network /path/network.pkl

Of course the most important part is to make the images as realistic as possible without leaking sensitive information. For now, just enjoy your work and try to guess whether the pictures below are real or fake (do not read the description first!). We will publish a third post on the evaluation of GANs, stay tuned!

Figure 5. Examples of real and synthetic images. In the first row the images are real, synthetic, real; in the second row, synthetic, real, synthetic. Image by authors based on [1].

Literature

  1. Rotemberg, V., Kurtansky, N., Betz-Stablein, B., Caffery, L., Chousakos, E., Codella, N., Combalia, M., Dusza, S., Guitera, P., Gutman, D., Halpern, A., Helba, B., Kittler, H., Kose, K., Langer, S., Lioprys, K., Malvehy, J., Musthaq, S., Nanda, J., Reiter, O., Shih, G., Stratigos, A., Tschandl, P., Weber, J. & Soyer, P. A patient-centric dataset of images and metadata for identifying melanomas using clinical context. Sci Data 8, 34 (2021). https://doi.org/10.1038/s41597-021-00815-z
  2. SIIM-ISIC-melanoma-classification, Kaggle, retrieved 28.12.2021 from https://www.kaggle.com/c/siim-isic-melanoma-classification (2020).
  3. Processed ISIC2020 Dataset with external malignant examples: melanoma external malignant 256 | Kaggle (2020)
  4. Understanding Latent Space in Machine Learning | by Ekin Tiu | Towards Data Science (2020)
  5. Quan Hoang, Tu Dinh Nguyen, Trung Le, Dinh Phung, MGAN: Training Generative Adversarial Nets with Multiple Generators | OpenReview (2018)
  6. Karras, Tero, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen and Timo Aila. “Analyzing and Improving the Image Quality of StyleGAN.” 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2020): 8107–8116.


Written by Sandra Carrasco

AI Scientist @ AI Sweden as part of the Eye For AI Talent Program, working on applied AI problems at Astrazeneca, Zenseact and Sahlgrenska University Hospital.