Tutorial: Abdominal CT Image Synthesis with Variational Autoencoders using PyTorch

Lasse Hansen
MICCAI Educational Initiative
7 min read · Nov 18, 2019

By Lasse Hansen, Maximilian Blendowski and Mattias P. Heinrich — Institute of Medical Informatics at the University of Lübeck, Germany

This is a short introduction on how to make CT image synthesis with variational autoencoders (VAEs) work using the excellent deep learning framework PyTorch¹. We will tackle common issues and their solutions and provide code examples that enable you to run your own synthesis on real abdominal CT images. This blog post is accompanied by two GitHub repositories containing an exercise notebook and a corresponding runnable solution. In addition, both repositories contain code versions that run directly in Google Colab² with GPU support. If you are interested in a better understanding of the conceptual insights behind VAEs, their application to medical image synthesis and the implementation details in PyTorch, we recommend working through the exercises first and only then coming back to this post.

Synthetically generated abdominal CT image slices.

Why do we need synthesis in medical imaging?
Data scarcity and privacy concerns are common in medical image analysis and prevent researchers from accessing large annotated datasets. Common data augmentation, including geometric and intensity transformations and distortions, provides only limited realism and cannot fill the gaps in the widely distributed and densely sampled space of training images that would be required. Synthesis of “new” medical images is one way to overcome this issue.

What are VAEs and how can we use them for image synthesis?
Classical autoencoders are not suitable for synthetic image generation: when training an encoder and decoder net with a small but unrestricted latent space, sampling from this latent space does not produce realistic images. To overcome this problem, a variational autoencoder³ aims to learn a probability density function and restrict it to a multivariate normal distribution. To this end, the encoder net predicts the distribution parameters (μ and σ), and the decoder net maps encodings from this interpretable latent distribution to a hidden representation and finally to the image space. To learn the network parameters we optimize a reconstruction loss together with the Kullback-Leibler divergence, which measures how well the learned density function approximates a normal distribution.
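Sampling from the predicted distribution is made differentiable with the so-called reparameterization trick, which we will also use in the implementation later on. Here is a minimal sketch, assuming (as is common, but an assumption on our part) that the encoder predicts the log-variance rather than σ directly:

```python
import torch

# Reparameterization trick: instead of sampling z ~ N(mu, sigma^2) directly
# (which would block gradient flow), sample eps ~ N(0, I) and shift/scale it
# with the predicted parameters, so gradients can reach mu and logvar.
def reparameterize(mu, logvar):
    std = torch.exp(0.5 * logvar)  # sigma = exp(0.5 * log(sigma^2))
    eps = torch.randn_like(std)    # eps ~ N(0, I)
    return mu + eps * std
```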

What are common issues when training VAEs and how to solve them?
Generative models such as GANs enable impressive image synthesis but are hard to train. VAEs are mathematically elegant architectures that are more robust, but they produce synthetic images of limited spatial sharpness. Perceptual losses, which penalise deviations not only in the output images themselves but also in their early feature representations, can be included when training VAEs to reduce blurry outcomes and obtain sharper predictions. However, ImageNet-pretrained networks commonly used in computer vision (e.g. VGG-16) are incapable of supporting medical VAE training, so a task-specific pre-training for perceptual feature extraction is necessary. Finally, architectural choices in the expanding decoder have a strong influence on the obtained outcome. We show how replacing commonly used transposed convolutions with bilinear interpolation layers can improve the visual quality of VAEs for medical image synthesis.

In the following sections we describe how to train VAEs for the synthesis of abdominal CT images, solving common problems such as blurry image outcomes and explicitly addressing implementation details in PyTorch.

Training Data
As training data we sampled approximately 2,500 2D slices from 43 patients of the TCIA pancreas data set⁴ and randomly cropped patches of 256x256 pixels. We restricted the sampling to the region of the pancreas and additionally extracted two further slices, one from above and one from below the current image slice, to provide more contextual information. Thus, a single training image has dimensions 3x256x256.
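The actual preprocessing is part of the accompanying repositories; as a rough illustration (the volume layout and function name below are our own assumptions), a single training sample could be drawn like this:

```python
import torch

def sample_training_patch(volume, z_range, patch_size=256):
    """Draw one 3x256x256 training image from a CT volume.

    volume:  DxHxW tensor of CT slices (assumed layout)
    z_range: (z_min, z_max) slice indices covering the pancreas region
    """
    # pick a random slice inside the pancreas region, leaving room for neighbours
    z = torch.randint(z_range[0] + 1, z_range[1] - 1, (1,)).item()
    img = volume[z - 1:z + 2]  # stack the slices below and above -> 3xHxW
    _, H, W = img.shape
    # random 256x256 crop
    y = torch.randint(0, H - patch_size + 1, (1,)).item()
    x = torch.randint(0, W - patch_size + 1, (1,)).item()
    return img[:, y:y + patch_size, x:x + patch_size]
```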

Exemplary abdominal CT image slices from the TCIA pancreas data set.

VAE implementation
The complete implementation of the VAE in PyTorch is contained in the accompanying notebooks. The encoder takes image batches of size Bx3x256x256 and produces two 512-dimensional latent vectors (μ and σ). It consists of nine blocks of Conv2d>BatchNorm>LeakyReLU operations with a kernel size of 3x3 and an increasing number of filter channels from 16 to 64. Every other layer halves the feature map resolution by using convolutions with stride 2. A fully-connected layer (implemented as Conv2d) with 1024 channels and a kernel size of 8x8 is followed by two further fully-connected layers with 512 channels, each predicting one of the latent vectors (μ and σ). Next, we use the reparameterization trick, which moves the sampling step to an external noise source so that gradients can backpropagate through μ and σ.

The decoder part takes a Bx512x1x1 latent sample and generates a full-sized (Bx3x256x256) output image. A fully-connected 1x1 Conv2d layer maps the interpretable 512-dimensional latent vector to a hidden state with 1024 channels, followed by a transposed convolution with kernel size 8 and 64 channels. Then again, blocks of Conv2d>BatchNorm>LeakyReLU with kernel size 3x3 are used, but they are alternated with blocks of ConvTranspose2d>BatchNorm>LeakyReLU with kernel size 4x4 and stride 2 to increase the feature map resolution. The network architecture is finished with a Conv2d with 3 output channels and a tanh activation. In addition to the generated output image, the network returns the predicted latent vectors μ and σ that are needed for the loss calculations.
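As an orientation, the condensed sketch below follows the description above. The exact channel progression, the placement of the stride-2 layers, the LeakyReLU slope and the use of the log-variance instead of σ are assumptions chosen to reproduce the stated tensor shapes, not a verbatim copy of the original code:

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, stride=1):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.2, inplace=True),
    )

def up_block(in_ch, out_ch):
    # ConvTranspose2d with kernel 4, stride 2 doubles the spatial resolution
    return nn.Sequential(
        nn.ConvTranspose2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.2, inplace=True),
    )

class VAE(nn.Module):
    def __init__(self, latent_dim=512):
        super().__init__()
        # encoder: nine 3x3 conv blocks (every other one with stride 2), 256 -> 8
        self.encoder = nn.Sequential(
            conv_block(3, 16, stride=2), conv_block(16, 16),
            conv_block(16, 32, stride=2), conv_block(32, 32),
            conv_block(32, 48, stride=2), conv_block(48, 48),
            conv_block(48, 64, stride=2), conv_block(64, 64),
            conv_block(64, 64, stride=2),           # 64x8x8
            nn.Conv2d(64, 1024, kernel_size=8),     # "fully connected", 1024x1x1
            nn.LeakyReLU(0.2, inplace=True),
        )
        self.fc_mu = nn.Conv2d(1024, latent_dim, kernel_size=1)
        self.fc_logvar = nn.Conv2d(1024, latent_dim, kernel_size=1)
        # decoder: 1x1 conv to 1024 channels, transposed conv back to 8x8,
        # then alternating 3x3 conv / 4x4 stride-2 transposed conv blocks to 256
        self.decoder = nn.Sequential(
            nn.Conv2d(latent_dim, 1024, kernel_size=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.ConvTranspose2d(1024, 64, kernel_size=8),   # 1x1 -> 8x8
            nn.LeakyReLU(0.2, inplace=True),
            conv_block(64, 64), up_block(64, 64),          # 16x16
            conv_block(64, 48), up_block(48, 48),          # 32x32
            conv_block(48, 32), up_block(32, 32),          # 64x64
            conv_block(32, 16), up_block(16, 16),          # 128x128
            conv_block(16, 16), up_block(16, 16),          # 256x256
            nn.Conv2d(16, 3, kernel_size=3, padding=1),
            nn.Tanh(),
        )

    def reparameterize(self, mu, logvar):
        std = torch.exp(0.5 * logvar)
        return mu + torch.randn_like(std) * std

    def forward(self, x):
        h = self.encoder(x)                       # Bx1024x1x1
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        z = self.reparameterize(mu, logvar)       # Bx512x1x1
        return self.decoder(z), mu, logvar
```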

VAE training with KLD loss
The VAE network is trained in an unsupervised fashion by optimizing an L1 reconstruction loss between the input and the generated images. An additional loss term is given by the Kullback-Leibler divergence, which ensures that the learned density function follows a normal distribution. For the full training code we again refer to the provided notebooks.
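Because the Kullback-Leibler divergence between the predicted diagonal Gaussian and the standard normal distribution has a closed form, the combined objective fits in a few lines. The sketch below again assumes a log-variance parameterization and a mean reduction; the relative weighting of the two terms is a hyperparameter and the default value here is only a placeholder, not necessarily the one used in the notebooks:

```python
import torch
import torch.nn.functional as F

def vae_loss(recon, target, mu, logvar, kld_weight=1.0):
    # L1 reconstruction loss between generated and input images
    recon_loss = F.l1_loss(recon, target, reduction='mean')
    # KL divergence between N(mu, sigma^2) and N(0, I), averaged over the batch
    kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_loss + kld_weight * kld
```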

Visual results of synthetic images sampled randomly from the learned latent space are shown in the figure below. The VAE trained with the reconstruction and KLD loss already generates anatomically reasonable results. Outlines of different organs and anatomical structures such as liver, kidneys and vertebra are clearly visible. However, despite their overall realism, images synthesized with this VAE variant are blurry and miss finer details.

Synthetically generated abdominal CT image slices. Top: VAE. Middle: VAE with additional perceptual loss. Bottom: VAE with additional perceptual loss and bilinear upsampling instead of transposed convolutions.

VAE Training with additional perceptual loss
To improve detail preservation, VAEs can be trained with an additional perceptual loss. The perceptual loss (first introduced for style transfer and super-resolution) adapts the reconstruction criterion towards a more visually meaningful similarity. Commonly, networks pre-trained on ImageNet (e.g. VGG-16) are used for perceptual feature extraction, but due to the different nature of the image domains they are incapable of supporting medical VAE training. Thus, we trained a fully-convolutional network, which roughly follows the VGG architecture, for CT segmentation of the pancreas. This pre-trained model is then used for feature extraction in the perceptual loss calculation. Here, mid-level features are extracted after the ReLUs of layers 2, 5 and 9 by registering forward hooks for the corresponding layers. After a forward pass of the input images through the VGG model, the feature output tensors can be copied to a list. The same procedure is repeated after passing the reconstructed images through the network, and an L1 loss is computed between each pair of corresponding feature maps from all three layers (see the sketch below). The images generated after training with the additional perceptual loss are less blurry, and more details (e.g. of the kidneys or the vertebral canal) are recognizable. However, we now observe pattern-like image artifacts, which brings us to the last part of this tutorial.
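A sketch of such a hook-based perceptual loss is given below. The pre-trained segmentation network is referred to as `vgg` and is assumed to expose its layers as an nn.Sequential under `vgg.features` (as in torchvision's VGG); the hooked indices follow the layer numbers mentioned above, but the exact attribute names and indices are assumptions:

```python
import torch
import torch.nn.functional as F

activations = []  # filled by the forward hooks below

def save_activation(module, inp, out):
    activations.append(out)

# register hooks once on the ReLUs of layers 2, 5 and 9 (assumed indices)
for idx in (2, 5, 9):
    vgg.features[idx].register_forward_hook(save_activation)
vgg.eval()  # the feature extractor stays fixed during VAE training

def perceptual_loss(recon, target):
    activations.clear()
    with torch.no_grad():
        vgg(target)                        # hooks collect the target features
    target_feats = [a.detach() for a in activations]
    activations.clear()
    vgg(recon)                             # hooks collect the reconstruction features
    return sum(F.l1_loss(a, t) for a, t in zip(activations, target_feats))
```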

VAE implementation addressing checkerboard pattern
As described in the excellent interactive blog post by Odena et al.⁵, the observed checkerboard pattern stems from the use of the transposed convolution operator in the decoder part of the VAE and can easily be reduced by replacing transposed convolutions with a bilinear upsampling layer and a subsequent convolution. We can therefore replace every transposed convolution in our VAE model with such an upsampling module (a sketch is given below) and retrain our network. The results show the effectiveness of this small modification: the generated images look even sharper and no more artifacts can be observed.
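A minimal sketch of such a replacement module, assuming the same BatchNorm/LeakyReLU wrapper as the decoder blocks above (kernel size and slope are assumptions):

```python
import torch.nn as nn

class UpsampleConv(nn.Module):
    """Bilinear upsampling followed by a regular convolution: a drop-in
    replacement for a stride-2 ConvTranspose2d block."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False),
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.LeakyReLU(0.2, inplace=True),
        )

    def forward(self, x):
        return self.block(x)
```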

Where to go from here?
To keep the provided code for the exercises lightweight and fast, we only used relatively few training images and small VAE and VGG models. But we hope this tutorial sparked someone's interest in synthetic medical image generation, and we are excited to see more detailed, sharper and more realistic images from the community using more training images, better data augmentation techniques, bigger models, improved perceptual loss weighting, new network architectures, etc. For state-of-the-art methods using VAEs, a number of this year's MICCAI papers are excellent starting points for further reading, with applications ranging from joint modality completion and segmentation⁶ and frame rate up-conversion in echocardiography⁷ to unsupervised anomaly localization⁸.

[1] Paszke, Adam, et al. “Automatic differentiation in PyTorch.” (2017). https://pytorch.org

[2] Google Colab. https://colab.research.google.com/notebooks/welcome.ipynb

[3] Kingma, Diederik P., et al. “Auto-encoding variational bayes.” arXiv preprint arXiv:1312.6114 (2013). https://arxiv.org/pdf/1312.6114.pdf

[4] Roth, Holger R., et al. Data From Pancreas-CT. The Cancer Imaging Archive (2016). http://doi.org/10.7937/K9/TCIA.2016.tNB1kqBU

[5] Odena, et al., “Deconvolution and Checkerboard Artifacts”, Distill (2016). https://distill.pub/2016/deconv-checkerboard/

[6] Dorent, Reuben, et al. “Hetero-Modal Variational Encoder-Decoder for Joint Modality Completion and Segmentation.” arXiv preprint arXiv:1907.11150 (2019). https://arxiv.org/pdf/1907.11150

[7] Dezaki, Fatemeh T., et al. “Frame Rate Up-Conversion in Echocardiography Using a Conditioned Variational Autoencoder and Generative Adversarial Model.” (2019).

[8] Zimmerer, David, et al. “Context-encoding Variational Autoencoder for Unsupervised Anomaly Detection.” arXiv preprint arXiv:1812.05941 (2018). https://arxiv.org/pdf/1812.05941.pdf
