Applying a GAN-based classifier to improve transcriptome-based prognostication in breast cancer

Abish Pius
Computational Biology Papers
6 min read · May 14, 2023
The T-GAN-D robustly stratifies low- and high-risk breast cancer patients.

Guttà, Cristiano, Christoph Morhard, and Markus Rehm. “Applying a GAN-based classifier to improve transcriptome-based prognostication in breast cancer.” PLOS Computational Biology 19.4 (2023): e1011035.


Overview

Researchers proposed a classifier for breast cancer risk stratification based on a data augmentation pipeline using a deep learning algorithm called a generative adversarial network (GAN). The classifier, called T-GAN-D, outperformed established biomarkers in identifying high-risk patients in a breast cancer cohort. Importantly, T-GAN-D also performed well when applied to independent and combined transcriptome datasets, improving patient stratification.

Background

Breast cancer is the most common tumor in women, with a high incidence and mortality rate worldwide. Current clinical practice involves determining the expression of estrogen receptor (ER), progesterone receptor (PR), and human epidermal growth factor receptor 2 (HER2) to classify breast cancer into molecular subtypes. However, there are variations in treatment response, highlighting the need for additional prognostic markers. Transcriptome-based multi-gene activity tests have been developed to assist in breast cancer management, but they are limited to specific patient subsets. The use of deep learning (DL) methods, such as convolutional neural networks (CNN), has the potential to extract features from large-scale transcriptome data and improve diagnosis, prognosis, and treatment prediction in cancer.

DL models originally designed for image analysis can be repurposed for transcriptome data. However, the number of measured mRNAs far exceeds the number of patients, which encourages overfitting and risks capturing patterns that do not hold in larger populations. Strategies such as feature selection, under- and over-sampling, and data augmentation with generative adversarial networks (GANs) can help mitigate these issues. GANs, typically used for imaging data, generate synthetic samples that enrich the source dataset and can improve classifier performance. They have been successfully applied to transcriptome data for cancer diagnosis, staging, and subtyping.

The METABRIC and TCGA-BRCA cohorts are extensive breast cancer datasets with comprehensive patient information. Although not directly interoperable due to different sequencing technologies, these datasets serve as test cases for DL-based prognostication approaches. The study aimed to develop a prognostication framework using a GAN architecture’s trained discriminator (T-GAN-D) as a standalone classifier. Transcriptome profiles were converted to images for input into DL architectures. In a transfer learning approach, T-GAN-D was independently used to predict the risk category of breast cancer patients in the METABRIC cohort. The framework’s robustness was assessed by integrating patient profiles from the independent TCGA cohort in the training set. The performance of stratification was compared to classical machine learning algorithms, a CNN, and commonly used breast cancer biomarkers. Finally, the transferability of the framework was tested by applying T-GAN-D to the smaller and imbalanced TCGA cohort.
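
To make the "profiles converted to images" step concrete, here is a minimal sketch of one way to fold a patient's expression vector into a square grid. It is not the authors' code: the function name `profile_to_image`, the min-max scaling, the zero-padding of unused pixels, and the toy data are all assumptions made for illustration. Only the 144x144 image size comes from the article.

```python
import math
from typing import List, Sequence


def profile_to_image(expression: Sequence[float], side: int) -> List[List[float]]:
    """Scale one expression vector to [0, 1] and fold it into a side x side grid."""
    if len(expression) > side * side:
        raise ValueError("profile has more genes than the image has pixels")
    lo, hi = min(expression), max(expression)
    span = (hi - lo) or 1.0  # avoid division by zero for a constant profile
    scaled = [(value - lo) / span for value in expression]
    # Assumption: trailing pixels without a gene are zero-padded.
    padded = scaled + [0.0] * (side * side - len(scaled))
    return [padded[row * side:(row + 1) * side] for row in range(side)]


# Toy profile with 20,000 made-up values (not real expression data).
toy_profile = [math.sin(i) for i in range(20_000)]
image = profile_to_image(toy_profile, side=144)
print(len(image), len(image[0]))  # 144 144
```

In this sketch genes are placed row by row in their original order, so neighbouring pixels carry no biological meaning. A different gene ordering would change the image but not the idea.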

Results

The GAN architecture described in the paper combines several techniques to improve training stability and performance. Its key components are:

  1. Wasserstein GAN (WGAN): The architecture is based on the WGAN framework, which uses the Wasserstein distance as the loss function instead of the traditional Jensen-Shannon divergence. The Wasserstein loss helps mitigate issues like vanishing gradients and mode collapse during training.
  2. Gradient Penalty (GP): The gradient penalty is a regularization technique introduced in the WGAN-GP paper by Gulrajani et al. It adds a penalty term to the loss function to enforce Lipschitz continuity, which further improves training stability.
  3. Auxiliary Classifier: An auxiliary classifier network is implemented in the conditional GAN (cGAN) fashion. This approach involves supplying labels to both the discriminator and the generator during training, which helps stabilize the training process and reduce mode collapse.
  4. Z-vector Input: The generator takes a z-vector of size 250 as input. This vector serves as a latent representation that is used to generate synthetic samples.
  5. Architecture Design: The model uses strided convolutions with a stride of 2, batch normalization, and LeakyReLU activations, all common choices in GANs. The discriminator and generator are shallow networks with only two layers each, which keeps training stable and reduces the number of trainable parameters.
  6. Training Process: The model is trained for 1000 epochs on the training dataset. Before each full network training run, three “discriminator-only” training runs are performed. The generated images are smoothed with a final convolution layer. The model generates expression profiles of size 144x144 or 120x120, depending on the dataset used.
  7. Data Augmentation and Transfer Learning: The trained GAN discriminator (T-GAN-D) is used as a standalone classifier to stratify low- and high-risk breast cancer patients. It is trained on one set of patients and then tested on independent patient samples. Because the discriminator learns from both real profiles and synthetic samples produced by the generator, patient stratification improves.
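
For reference, the Wasserstein loss with gradient penalty from items 1 and 2 is usually written as below. This is the standard critic objective from Gulrajani et al., not a formula quoted from the article. D is the discriminator (critic), P_r and P_g are the real and generated data distributions, and lambda is the penalty weight, whose value for T-GAN-D is not given in this summary.

```latex
L_D = \mathbb{E}_{\tilde{x} \sim P_g}\left[ D(\tilde{x}) \right]
    - \mathbb{E}_{x \sim P_r}\left[ D(x) \right]
    + \lambda \, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}\left[ \left( \lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1 \right)^2 \right],
\qquad \hat{x} = \epsilon x + (1 - \epsilon) \tilde{x}, \quad \epsilon \sim U[0, 1]
```

The last term pushes the norm of the critic's gradient toward 1 at points interpolated between real and generated samples, which acts as a soft version of the Lipschitz constraint.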

Overall, this architecture combines various techniques to address the challenges of limited patient data and improve the performance of prognostic classifiers for breast cancer patients.
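
The alternating schedule from the training step above (three discriminator-only runs before each full run) can be sketched as a small generator. This only illustrates the ordering of the runs. The names `training_schedule`, `"discriminator_only"` and `"full_network"` are hypothetical, and treating one cycle as one epoch is an assumption, since the summary does not say whether the cycle repeats per epoch or per batch.

```python
from typing import Iterator, Tuple


def training_schedule(
    n_epochs: int, discriminator_only_runs: int = 3
) -> Iterator[Tuple[int, str]]:
    """Yield (epoch, phase) pairs in the order the runs would execute."""
    for epoch in range(n_epochs):
        # Assumption: the discriminator-only runs precede every full run.
        for _ in range(discriminator_only_runs):
            yield epoch, "discriminator_only"
        yield epoch, "full_network"


# Two epochs give 2 x (3 + 1) = 8 scheduled runs.
schedule = list(training_schedule(n_epochs=2))
for epoch, phase in schedule:
    print(epoch, phase)
```

Giving the discriminator extra updates is a common pattern in Wasserstein GAN training, because the generator's gradient is only informative when the critic is reasonably well trained.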

Discussion

This study developed a deep learning-based tool called T-GAN-D to stratify breast cancer patients into high and low-risk categories based on their transcriptome profiles. The tool converted gene expression data into images and used a trained discriminator of a generative adversarial network (GAN) as a prognostic classifier. The T-GAN-D performed better than traditional outcome predictors and maintained robust performance when merging two independent cohorts.

Previous studies have also applied artificial intelligence (AI) to breast cancer using different types of data, such as mammography images and transcriptome data, for diagnosis, treatment planning, and prognosis. However, small or imbalanced datasets remain a challenge, and deep learning algorithms trained on them are prone to overfitting. In this study, the T-GAN-D addressed these challenges by drawing on the data augmentation and generalization capabilities of GANs.

The study showed that the T-GAN-D classifier could integrate and analyze transcriptome data from different cohorts, which is typically challenging due to differences in sequencing technologies and protocols. By training the network with a subset of one cohort and the entire other cohort, the classifier outperformed clinical biomarkers and established gene expression signatures in predicting risk categories.

The T-GAN-D also demonstrated its robustness when dealing with imbalanced datasets, and it captured relevant risk patterns even when one cohort was heavily underrepresented in the training dataset. This suggests that the framework can be used to generate personalized outcome predictions for new, smaller datasets.

To improve the performance of the prognostic framework, future strategies could involve integrating feature selection as a preprocessing step and structuring transcriptome profiles into biologically meaningful arrays of pixels. Additionally, the T-GAN-D could be tested on multi-omics data to analyze multiple -omics domains simultaneously and capture hidden inter-omics relationships.

Overall, this study represents a scalable approach for using data augmentation to develop a tool for individualized prognosis in breast cancer. As genomic data generation continues to increase, GAN-based approaches could play a significant role in leveraging such data for patient benefit. Moreover, other -omics domains, such as proteomics and metabolomics, could also be integrated with clinical information using GAN-based approaches to improve patient-tailored interventions and prognostication.

