Revolutionizing Cancer Detection with DNA Methylation and Machine Learning

Adrien Galamez
Published in Slalom Data & AI
4 min read · Dec 3, 2020
https://unsplash.com/photos/eXoXJrOGqG4

In 2017, 8.8 million people died from cancer worldwide. According to the World Health Organization (WHO), the main difficulty is that many cancers are diagnosed too late. Earlier detection makes treatment easier and less expensive, and ultimately saves many lives.

Partnering with Caltech and Dr. Noah Ollikainen, we developed and trained a machine learning model that predicts cancer from a tissue sample’s DNA Methylation values. The trained model was able to predict, with over 99% accuracy, whether the patient had been diagnosed with cancer. How did we make this happen?

Modeling the link between DNA Methylation and cancer

To make our model as accurate as possible, we needed as much data as possible on different cancer types, so we used data from The Cancer Genome Atlas (TCGA). TCGA is a joint program of the National Cancer Institute and the National Human Genome Research Institute that began in 2006 and has generated over 2.5 petabytes of publicly available data. From this raw data, we created a dataset of tissue observations from 15,000 patients. Each observation had more than 500,000 DNA Methylation measurements. That’s more than 7.5 billion rows (15,000 observations × 500,000 measurements)!

DNA is the human encyclopedia containing our genes. DNA Methylation can be viewed as DNA metadata. It describes the methyl groups attached to the DNA strand, which act as signals along the DNA and regulate the degree to which a gene is expressed. Intuitively, it can be linked to cancer because the less a gene is methylated, the more the gene can be expressed and replicated, thus increasing the risk of creating tumorous tissues.

Methyl groups attached to DNA strands act as signals regulating gene expression. Source: University of Delaware (link)

Once our dataset was created, we tried multiple machine learning models to accurately predict the observation label: either cancerous or not. We settled on an XGBoost model for two main reasons.

First, because it implements gradient boosting, this model achieved higher accuracy on our dataset than the other models we tried. Second, because it is a tree-based model, it makes it possible to extract the DNA sites that are the most useful in detecting cancer.
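To make those two properties concrete, here is a minimal sketch of gradient boosting with one-split trees (stumps) in plain Python. It is a toy stand-in, not XGBoost and not the model from the project: XGBoost adds regularization, second-order gradient information, and deeper trees. The data values, the three-site layout, and every function name below are assumptions made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_stump(X, residuals):
    """Find the single (site, threshold) split that best fits the residuals."""
    mean = sum(residuals) / len(residuals)
    total_sse = sum((r - mean) ** 2 for r in residuals)
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [r for row, r in zip(X, residuals) if row[j] <= thr]
            right = [r for row, r in zip(X, residuals) if row[j] > thr]
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if best is None or sse < best[0]:
                best = (sse, j, thr, lm, rm)
    sse, j, thr, lm, rm = best
    return j, thr, lm, rm, total_sse - sse  # last value: error reduction ("gain")

def boost(X, y, n_rounds=10, lr=0.5):
    """Gradient boosting for log loss: each stump fits y - predicted probability."""
    p0 = sum(y) / len(y)
    base = math.log(p0 / (1 - p0))
    scores = [base] * len(X)
    stumps, importance = [], [0.0] * len(X[0])
    for _ in range(n_rounds):
        residuals = [yi - sigmoid(s) for yi, s in zip(y, scores)]
        j, thr, lm, rm, gain = fit_stump(X, residuals)
        stumps.append((j, thr, lm, rm))
        importance[j] += gain  # credit the site used by this split
        scores = [s + lr * (lm if row[j] <= thr else rm) for row, s in zip(X, scores)]
    return (base, stumps, lr), importance

def predict_proba(model, row):
    base, stumps, lr = model
    return sigmoid(base + sum(lr * (lm if row[j] <= thr else rm) for j, thr, lm, rm in stumps))

# Made-up methylation values for 3 hypothetical sites; only site 1 separates the classes.
X = [[0.52, 0.10, 0.81], [0.47, 0.15, 0.33], [0.55, 0.08, 0.60], [0.49, 0.20, 0.45],
     [0.51, 0.85, 0.70], [0.48, 0.90, 0.38], [0.53, 0.78, 0.52], [0.50, 0.88, 0.66]]
y = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = cancerous, 0 = healthy

model, importance = boost(X, y)
ranked_sites = sorted(range(len(importance)), key=lambda j: importance[j], reverse=True)
predictions = [1 if predict_proba(model, row) > 0.5 else 0 for row in X]
print("sites ranked by importance:", ranked_sites)
```

Each round fits a small tree to what the current ensemble still gets wrong, which is where the accuracy comes from. Because every split names a specific site, summing the gain per site gives a ranking of the sites the model relied on.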

As extracting this list of DNA sites was one of the most valuable parts of the project for our client, we decided to host and publish our model and our results on an interactive web application. This application was designed for researchers to interact with our model and get the list of the most impactful DNA sites to investigate, in order to study the link between DNA Methylation and cancer.

Let’s explore our ML model in more detail

We learned a few lessons from creating this model.

First, we had to create the target label to predict. We leveraged the TCGA barcode to identify observations that came from cancerous or healthy tissues. The image below shows the observation barcode format that the TCGA project uses. We learned that the Sample ID is the part that distinguishes a cancerous observation from a healthy one.

TCGA Barcode Format. We used the Sample ID to label our observations as cancerous or healthy. Source: National Cancer Institute (link)
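The labeling step can be sketched in a few lines of Python. This is an illustration rather than the project's actual code. It assumes the standard TCGA convention in which the fourth dash-separated field starts with a two-digit sample type code: 01 to 09 for tumor samples, 10 to 19 for normal samples, and 20 to 29 for controls. The function name and the returned label strings are hypothetical.

```python
def label_from_barcode(barcode):
    """Return 'cancerous', 'healthy', or None from a TCGA barcode.

    Assumes the fourth dash-separated field begins with the two-digit
    sample type code (01-09 tumor, 10-19 normal, 20-29 control).
    """
    parts = barcode.split("-")
    if len(parts) < 4 or parts[0] != "TCGA":
        raise ValueError(f"not a sample-level TCGA barcode: {barcode!r}")
    sample_type = int(parts[3][:2])
    if 1 <= sample_type <= 9:
        return "cancerous"
    if 10 <= sample_type <= 19:
        return "healthy"
    return None  # control samples or unknown codes are left unlabeled

# Illustrative barcodes; only the sample type code differs between them.
tumor_label = label_from_barcode("TCGA-02-0001-01C-01D-0182-01")
normal_label = label_from_barcode("TCGA-02-0001-11A-01D-0182-01")
print(tumor_label, normal_label)
```

Returning None for control samples keeps them out of both classes instead of silently counting them as healthy.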

Second, because it was not possible to train our model on 7.5 billion rows, we needed to narrow the 500,000 available features down to the best 5,000. For this, we leveraged a Dataproc job on Google Cloud Platform that selected the 5,000 features that maximize variance across the two groups of patients (cancerous and healthy).
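One way to read "maximize variance across the two groups" is to score each site by its between-group variance and keep the top scorers. The sketch below takes that reading, which is an assumption: the exact criterion the Dataproc job used is not stated here. It runs in plain Python on made-up values rather than on Spark, and the site names and function names are hypothetical.

```python
import heapq

def between_group_variance(values, labels):
    """Variance of the group means around the overall mean, weighted by group size."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    overall = sum(values) / len(values)
    weighted = sum(len(g) * (sum(g) / len(g) - overall) ** 2 for g in groups.values())
    return weighted / len(values)

def select_top_k(columns, labels, k):
    """Keep the k sites whose methylation differs most between the groups."""
    scores = {site: between_group_variance(vals, labels) for site, vals in columns.items()}
    return heapq.nlargest(k, scores, key=scores.get)

# Made-up methylation values: one column per hypothetical site, one entry per observation.
labels = [1, 1, 1, 0, 0, 0]  # 1 = cancerous, 0 = healthy
columns = {
    "site_A": [0.10, 0.12, 0.08, 0.90, 0.88, 0.92],  # strongly different
    "site_B": [0.50, 0.52, 0.48, 0.51, 0.49, 0.50],  # nearly identical
    "site_C": [0.30, 0.35, 0.32, 0.60, 0.58, 0.62],  # moderately different
}

top_sites = select_top_k(columns, labels, k=2)
print(top_sites)
```

Each site is scored independently of the others, which is what makes this kind of filter easy to distribute across a cluster.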

GCP Architecture diagram. The raw datasets were hosted in BigQuery. We prepared the data through Dataproc and our training dataset was hosted in Cloud Storage. We then leveraged AI Platform for training and prediction.

Finally, because TCGA mainly collects data from cancer patients, most of our observations came from cancerous tissue and very few from healthy tissue. To ensure that our model would generalize well in real life, we applied a technique called Synthetic Minority Oversampling Technique (SMOTE). SMOTE rebalances a dataset by generating synthetic examples of the minority class. We used it to make sure at least 30% of the observations in our training data were healthy, making our model more viable and sustainable beyond this dataset.
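The idea behind SMOTE can be sketched in plain Python: pick a minority observation, pick one of its nearest minority neighbours, and create a new point somewhere on the line between them. This is a simplified illustration under assumed values, not the project's pipeline. In practice a library implementation such as the `SMOTE` class in imbalanced-learn would normally be used. The counts, coordinates, and function names below are made up.

```python
import math
import random

def synthetic_needed(n_minority, n_majority, target=0.30):
    """Smallest number of synthetic minority samples to reach the target share."""
    needed = (target * n_majority - (1 - target) * n_minority) / (1 - target)
    return max(0, math.ceil(needed))

def smote(minority, n_new, k=3, seed=0):
    """Create n_new points by interpolating between minority points and their neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen point (excluding itself)
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic

# Made-up example: 4 healthy observations (2 sites each) versus 36 cancerous ones.
healthy = [[0.80, 0.20], [0.85, 0.25], [0.78, 0.22], [0.90, 0.18]]
n_cancerous = 36

n_new = synthetic_needed(len(healthy), n_cancerous)
synthetic = smote(healthy, n_new)
healthy_share = (len(healthy) + n_new) / (len(healthy) + n_cancerous + n_new)
print(n_new, round(healthy_share, 3))
```

Because each synthetic point lies between two real healthy observations, the oversampled class stays inside the region the real data already covers instead of repeating exact copies.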

What’s next?

To the best of our knowledge, this is the first time DNA Methylation has been used to diagnose cancer. It will first be used by researchers who want to understand the biological impact of genes on cancer development.

Furthermore, the greatest advancement this novel method introduces is early-stage cancer diagnosis. Rather than undergoing an invasive and painful biopsy, a patient could undergo a simple blood test to obtain DNA Methylation measures. These values could then easily be fed into our model, resulting in a streamlined and accurate cancer diagnosis.

As machine learning becomes a more mainstream method that scientists like Dr. Ollikainen embrace and rely on for day-to-day research and experimentation, we are confident that statistical disease detection, medical scenario modeling, and intuitive data visualization will be key components in helping humanity cure our most stubborn maladies. This is just the beginning!
