Clustering and Visualizing Cancer Types from Gene Expression Data using Variational Autoencoders (VAEs)

Shruthi R · Published in The Startup · Jun 29, 2019

The Goal

Cancer diagnosis is difficult, and with new cancer subtypes still being discovered, doctors are burdened with the task of correctly identifying the specific subtype. Traditional classification is generally based on the organ of origin, the pattern of spread, and other factors, but this has limitations: the same organ can host different types of cancer, and similar cancers can arise in different organs. Drugs and treatment plans conditioned on the organ may not be effective against a different cancer type in that same organ. There is therefore a strong need for another way to identify cancer types. Cancer cells are known to express genes quite differently from regular somatic cells, so using the gene expression profile of cancer cells as a clue lends itself to easier diagnosis of the specific cancer type.

Table of Cancer Types

The Challenge

The difficulty in using gene expression data is its size. The human genome contains tens of thousands of genes. In a biopsy of cancer tissue, anywhere from 2,000 to 6,000 genes may be expressed, of which many may be normal, but some abnormal. Human perception and reasoning cannot detect patterns in such a high-dimensional observation across hundreds of patients. If we can train a computer, specifically an unsupervised model, to cluster the gene profiles, it can discover patterns in the data itself, telling us more about the gene expression of cancer as well as the relationships between cancer types.

In other words, the same cancer type occurring in different organs can be evidenced by the same gene expression profile. Likewise, different cancers that appear in the same organ, and look identical to oncologists even under the microscope, might reveal their differences in their gene expression profiles.

Even for a machine to be able to detect patterns, a 5,000-dimensional profile of individual genes is too large. Genes do not appear and work in isolation; there are natural cohorts. If we find a way to reduce a 5,000-gene profile into a profile of 100 or 50 gene sets, the information becomes more amenable to both machine and human pattern recognition. That is the goal of this work.
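To make the idea of dimensionality reduction concrete, here is a minimal sketch using PCA, a simpler linear analogue of the VAE approach described below. The patient matrix here is synthetic random data, purely for illustration; the real data would be the TCGA expression matrix.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in: 200 hypothetical patients x 5,000 gene expression values
rng = np.random.default_rng(0)
expression = rng.standard_normal((200, 5000))

# Reduce the 5,000 gene dimensions to 50 summary dimensions
pca = PCA(n_components=50)
reduced = pca.fit_transform(expression)

print(reduced.shape)  # (200, 50)
```

PCA can only capture linear combinations of genes; the point of the VAE introduced next is to learn nonlinear summaries of the same kind of matrix.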

The Approach

Fortunately, recent advances in deep learning have enabled automated learning of such high-level abstractions from the data itself. This technique is called “unsupervised learning”.

Specifically, an unsupervised technique, autoencoding, can be used for this dimensionality reduction. By design, an autoencoder can take a 5,000-dimensional data set, reduce it to a 36-dimensional representation, and reconstruct the original 5,000 dimensions from that representation, hence the name “autoencoding”. Because the reduced representation is able to reconstruct the data, it is expected to contain all and only the information relevant in the original data (complete and concise). A regular autoencoder, however, can simply memorize the data and fail to generalize. A technique called variational autoencoding (VAE) has been developed to overcome this issue. In a nutshell, a VAE learns to associate slightly perturbed versions of each example with the example itself. This ensures that the model it learns is ‘continuous’ and generalizes over nearby samples.
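The two ingredients that distinguish a VAE from a plain autoencoder can be sketched in a few lines of NumPy: the “reparameterization” step, which produces the slightly perturbed latent samples, and the KL-divergence term, which keeps the latent space smooth. This is an illustrative sketch only, not the Tybalt implementation; the encoder outputs (`mu`, `log_var`) are hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(42)

def reparameterize(mu, log_var):
    # z = mu + sigma * eps with eps ~ N(0, I): the "slight perturbation"
    # that forces nearby latent points to decode to similar outputs
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    # KL(N(mu, sigma^2) || N(0, I)), the regularizer that keeps the
    # latent space smooth; summed over the latent dimensions
    return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var), axis=1)

# 4 hypothetical patients encoded to a 36-dimensional latent space
mu = np.zeros((4, 36))
log_var = np.zeros((4, 36))

z = reparameterize(mu, log_var)
print(z.shape)  # (4, 36)
print(np.allclose(kl_to_standard_normal(mu, log_var), 0.0))  # True: mu=0, log_var=0 matches N(0, I) exactly
```

In training, the KL term is added to the usual reconstruction loss, so the encoder is pushed toward latent codes that are both informative and well-behaved.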

The Greene Lab at the University of Pennsylvania has successfully applied variational autoencoding to cancer data and has generously open-sourced its excellent code base (Tybalt) in a series of well-annotated Jupyter notebooks. I have tried to reproduce and extend the application in a few new directions, yielding some exciting results.

The Data and the Experiment

The gene expression data set comes from The Cancer Genome Atlas (TCGA). A dataset of 11,000 patients across 33 different cancer types was retrieved. The VAE was then used to compress 5,000 of the original gene dimensions into 36 latent dimensions, with the expectation that these dimensions would be clinically/oncologically meaningful.

But how can we test this?

We can do this in two ways: visualization and classification.

The visualization approach requires us to ask: does this representation reveal new insights about similarities and differences between cancer types?

Pursuing this idea, we would remap the 36-dimensional data into a two-dimensional space in order to see whether it can reveal that two specific cancers occurring in different organs are closely related and, conversely, that two different cancer types in the same organ are unrelated and therefore separated out.

A technique called t-SNE (t-distributed stochastic neighbor embedding) can remap the 36-dimensional data into a two-dimensional map that approximately preserves neighborhoods.
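With scikit-learn, the t-SNE step is a one-liner. The latent matrix below is synthetic (three shifted Gaussian blobs standing in for the real 36-dimensional VAE features), so t-SNE has some structure to find:

```python
import numpy as np
from sklearn.manifold import TSNE

# Synthetic stand-in: 150 patients x 36 VAE latent features,
# drawn from three shifted blobs so t-SNE has structure to recover
rng = np.random.default_rng(0)
latent = np.vstack([rng.normal(loc=c, size=(50, 36)) for c in (-5, 0, 5)])

# Map 36 dimensions down to 2 while approximately preserving neighborhoods
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(latent)

print(embedding.shape)  # (150, 2)
```

One caveat worth keeping in mind: t-SNE preserves local neighborhoods well, but distances between far-apart clusters in the 2-D map are not reliable.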

The other way is using a classification model. This method requires us to ask: does this reduced representation learned by an unsupervised mechanism contain enough information to identify the correct class of cancer?

Here, we can use a supervised training method, random forest classification, to learn which of the thirty-three cancer types a patient has, based only on the 36-dimensional representation.
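A sketch of that classification step, again on synthetic data (scikit-learn's `make_classification` standing in for the real 36-dimensional VAE features and 33 TCGA labels):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real task: 36 latent features, 33 cancer types
X, y = make_classification(
    n_samples=3300, n_features=36, n_informative=12,
    n_classes=33, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)

accuracy = clf.score(X_test, y_test)
print(accuracy > 1 / 33)  # True: far above the ~3% chance level
```

If the unsupervised features were uninformative, accuracy would hover near the 1-in-33 chance level; test accuracy well above that is the evidence that the representation is complete enough to recover the diagnosis.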

The Classification Results

The essence of the classification experiment is asking: is the representation complete and reliable?

The following confusion matrix shows the results of random forest classification on a test data set. The classifier used the reduced representation built by the VAE as features to detect cancer types. The rows represent the correct cancer types and the columns represent the predicted cancer types. Diagonal items show correct classifications and the off-diagonal items are the misclassifications.
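Reading a confusion matrix this way can be illustrated on a toy example with three of the cancer-type codes (the labels and predictions below are made up for illustration):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels for three cancer types
y_true = ["ACC", "ACC", "BRCA", "BRCA", "DLBC", "DLBC"]
y_pred = ["ACC", "ACC", "BRCA", "DLBC", "DLBC", "DLBC"]

labels = ["ACC", "BRCA", "DLBC"]
cm = confusion_matrix(y_true, y_pred, labels=labels)

# Rows are true types, columns are predicted types:
# [[2 0 0]
#  [0 1 1]   <- one BRCA case misclassified as DLBC
#  [0 0 2]]
print(cm)
print(np.trace(cm))  # 5 correct classifications out of 6
```

The diagonal sum (the trace) counts correct classifications; each off-diagonal cell names a specific confusion, such as the single BRCA-predicted-as-DLBC case here.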

We can notice that most cancer types are classified with no or low confusion (e.g., ACC, BRCA, DLBC).

Confusion matrix showing correct classification and misclassification of cancer types.

The Visualization Results

The essence of the visualization experiment is asking: is the representation useful and meaningful?

Cancer gene expression profiles visualized in two dimensions based on their VAE latent representations. Each point represents a patient diagnosed with the labeled cancer type shown by its color. Click here to access the interactive version.

The figure shows the 11,000 patients plotted by their reduced gene expression profiles, with each patient’s point colored by the diagnosed cancer type.

Here are my observations:

  1. The patients fall into clearly identifiable clusters, even though cancer-type information was not used in training the VAE. This is impressive!
  2. In general, all cases belonging to the same diagnosed cancer type are in the same cluster or in close proximity.
  3. We also see that diseases such as glioblastoma (black, bottom right) and lower-grade glioma (grey, bottom right), which occur in the same organ, show substantial overlap in gene expression.
  4. Cancer types occurring in different organs, such as lung squamous cell carcinoma and esophageal carcinoma, seem to have similar gene expression, as shown by their overlapping clusters. I’m not sure whether this is already common knowledge among oncologists, but I think the VAE representation could reveal similar hidden connections not yet known to us and is worthy of further study. Such discoveries could lead to shared diagnostic methods, treatments, and drug design approaches.
  5. This visualization, or an improved way to present the VAE latent features, may provide a way to locate patients with similar gene expression, thereby enabling case-based review of past experience (symptoms, treatments that worked and did not, how the disease progressed, etc.).

My plan is to follow up with an oncologist and researchers in the field to explore these ideas and get their expert opinions. If you have interest or experience in any of these fields, please feel free to contact me on LinkedIn. That would be a great privilege for me!

The plot was made using the excellent Highcharts JavaScript library and I have hosted it here. As you hover over a point, the patient’s TCGA identifier, their gender, and the diagnosis are displayed in a tooltip.

Final Thoughts

The following are some of the ideas I’m pursuing:

  • Understanding the constituent (related) genes in various gene clusters
  • Analyzing the classifier coefficients to correlate cancer types to specific gene expressions
  • Subdividing cancer subtypes occurring in the same organ
  • Listing similar cancers occurring in different organs
  • Building an easy diagnostic tool based on gene expression profiles summarized by VAEs and other unsupervised techniques

I’m pursuing research along all these lines. Again, if you’re interested (or you know someone who would be interested) in any of these or have questions or suggestions, please connect with me on LinkedIn.

References

Greene Lab Paper: Way G, Greene C. Extracting a Biologically Relevant Latent Space from Cancer Transcriptomes with Variational Autoencoders. 2017

Greene Lab Tybalt Code: https://github.com/greenelab/tybalt

Highcharts Javascript Library: https://www.highcharts.com/

Background on Variational Autoencoding: https://jaan.io/what-is-variational-autoencoder-vae-tutorial/

Introduction to Variational Autoencoders paper by Diederik P. Kingma and Max Welling: https://arxiv.org/pdf/1906.02691.pdf
