Cancer Genomics I: An introduction to working with gene expression data using Python

Published in Adventures In Hacking Healthcare, Jan 8, 2020 (7 min read)

By Nicholas Giangreco [LinkedIn][GitHub]
Contributing Authors: Matthew Eng [LinkedIn] [GitHub], Jordi Frank [LinkedIn][GitHub], and Sohil Shah [LinkedIn][GitHub]

A genomic perspective is needed to find new treatments for liver cancer

Liver cancer is the 3rd deadliest cancer in the world

Each year in the United States, about 33,000 people are diagnosed with liver cancer, and about 27,000 people die from the disease. The percentage of Americans diagnosed with liver cancer has been rising for several decades, and worldwide it is among the leading causes of cancer death.

While surgery can be an effective treatment for early-stage liver cancer, there are few options for more advanced and metastatic tumors. Only 26% of people with early-stage liver cancer were alive 5 years after diagnosis, and only 4% of those with metastatic liver cancer survived that long. Many potential therapeutics have failed clinical trials, and there are few approved treatments for managing advanced hepatocellular carcinoma (HCC), the most common type of liver cancer.

In addition to DNA mutations, different biological processes occurring in the tumor microenvironment can influence disease progression. These dynamic changes are reflected in the expression of genes. We can better understand the underlying biological pathways contributing to tumorigenesis by mining gene expression data from tumor tissues at various stages of liver cancer progression.

To find and dissect patterns of gene expression, we can leverage tools from machine learning and AI. But for the average data scientist or clinician who wants to work on this problem, it is not obvious how to approach it or where to obtain the data. To address pressing and complex problems like this one, we need to make them accessible and provide a collaborative space where ideas and approaches can be developed.

Hacknights is an experimental space to learn and apply machine learning and AI

As part of the New York Health Artificial Intelligence Society, Hacknights is a workshop event series that introduces a pressing problem within healthcare and examines it over one or more hands-on workshops.

A 3-part Hacknights series was held in 2019 introducing cancer genomics and examining liver cancer genomics to understand progression of this deadly form of cancer at a molecular level.

Hacknights only uses publicly-available data (read below!) and open source software (Python). Additionally, we needed a software environment that was accessible, was easy to use, and allowed users to get up and running exploring and analyzing data.

Spell provides a free community platform for addressing problems using data, machine learning, and AI

This 3-part Hacknights series used the data science platform from Spell.

Spell is a platform used to build and manage machine learning projects. We chose to use Spell for its ease in setting up a dev environment quickly and familiar notebook-like structure. On the platform, you can choose to use any environment (e.g. R/Python or CPU/GPU) that works for you.

Some advantages of using Spell for Hacknights were:

Simple and intuitive user interface,
The synchronization with a GitHub repository, and
The package manager for handling dependencies via pip and apt-get.

In the Spell platform, you create a workspace that is synced with the tree of a Git repository. Once you enter a workspace, it looks exactly like a Jupyter notebook interface, but with some powerful features underneath.

We found that using the Spell platform made it easier to get up and running, so we could focus on the goals of the workshop series: working with data and solving problems collaboratively.

You can get started on Spell by visiting the website and registering for an account. The community version is free and has enough to conduct the analyses in this series.

To run the analyses from Hacknights and the accompanying GitHub repo on Spell, follow the steps below:

Set up a Spell account: https://web.spell.run/register

Verify your account by confirming your email.

Enter this URL into your browser: https://web.spell.run/{{username}}/workspaces (go through the help tabs; you do not have to enter credit card information when prompted, just click ‘skip for now’)

Click ‘Workspaces’ in the left tab bar then click ‘New Workspace’

Enter a workspace name and the GitHub repository URL (https://github.com/nyhais/hacknights), and continue.

Keep the default settings. When you reach the pip packages section, paste in the package names listed in the file requirements.txt (https://github.com/nyhais/hacknights/blob/master/requirements.txt). Leave the apt packages section empty. Click continue.

Setting dependencies and hardware for your workspace

Click continue again (the data is available in the repository, so nothing needs to be added at this step for now). Spell also provides this publicly-available data within the public/ directories!

If all settings are correct, start the server

The server should be starting!

And now our environment looks just like a Jupyter notebook environment that is ready for you to use!

A workspace with all necessary dependencies on the Spell platform

Now you are ready to go!
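Once the workspace is running, a quick sanity check can confirm that the pip packages from the setup step were actually installed. The snippet below is a minimal sketch using only the Python standard library. The helper names and the example requirement lines are hypothetical; in the workspace you would read the repository's requirements.txt instead of the made-up list.

```python
from importlib import metadata
import re


def parse_requirement_names(lines):
    """Extract bare package names from requirements.txt-style lines."""
    names = []
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        # The name is everything before a version specifier, extra, or marker.
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match:
            names.append(match.group(0))
    return names


def installed_versions(names):
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Hypothetical example lines, not the contents of the real requirements.txt.
example_lines = ["numpy>=1.16", "pandas==0.25.3  # data frames", "", "# a comment"]
example_names = parse_requirement_names(example_lines)
for name, version in installed_versions(example_names).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

To check the real environment, you could pass the lines of requirements.txt (for example via `open("requirements.txt").read().splitlines()`, assuming the notebook runs from the repository root) to `parse_requirement_names`.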

But before working with the data, we explained how we obtained it and what it actually represents.

(A brief) genomics primer to understand liver cancer genomics data

In biology, the genes in our DNA encode different biological products that give life, health, and sometimes disease.

Our DNA is first transcribed into RNA. The collection of RNA in our cells tells us how genes are expressed to ultimately give rise to function. Unlike DNA, which is relatively static, RNA can change dynamically depending on environmental factors and stimuli from the body. RNA can provide a view into how cancer functions and progresses. One method of obtaining gene expression data is through RNA sequencing (RNA-seq for short).

In RNA-seq, RNA is extracted from a sample and reverse transcribed into a more stable DNA copy (known as complementary DNA, or cDNA, because it is the ‘photo-copy’ of the RNA sample). Sequencing works by attaching one end of each cDNA molecule to a surface and repeatedly cycling through a copying process of attaching colored probes that bind to the molecule, imaging the probes, and washing off the probes. Rinse and repeat.
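To make the idea of complementary DNA concrete, here is a small illustrative sketch of the base-pairing rule behind that copy. The function name and the example sequence are made up for illustration; real reverse transcription is a laboratory step, not something done in code.

```python
# Base-pairing rules for copying RNA into DNA (RNA uses U where DNA uses T).
RNA_TO_DNA_COMPLEMENT = {"A": "T", "U": "A", "G": "C", "C": "G"}


def reverse_transcribe(rna):
    """Return the complementary DNA strand of an RNA sequence, written 5'->3'."""
    complement = [RNA_TO_DNA_COMPLEMENT[base] for base in rna.upper()]
    # The two strands run antiparallel, so reverse to report the cDNA 5'->3'.
    return "".join(reversed(complement))


print(reverse_transcribe("AUGGCC"))  # prints GGCCAT
```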

Following this process, we now have many strings of T’s, C’s, G’s, and A’s that have been barcoded on either side of the sequence. Since each barcode attaches to a known and unique sequence of an RNA, we can match these sequences to the genes in the human genome.

Once the sequences have been mapped to their corresponding genes, we now have a quantification of gene expression for our samples. Depending on the question we are looking to answer, comparisons of gene expression between experimental conditions can be used to understand influential genetic components and molecular pathways. You can read more about RNA-Seq from this online resource.
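As a rough illustration of what "quantification" means here, the sketch below counts how many reads were assigned to each gene in one sample and rescales the counts to counts per million (CPM). CPM is one common normalization and is our assumption for this example, not necessarily the one used for the data in this series. The read assignments are made up.

```python
from collections import Counter


def count_reads_per_gene(read_assignments):
    """Count how many reads were assigned to each gene."""
    return Counter(read_assignments)


def counts_per_million(counts):
    """Rescale raw counts so samples with different sequencing depth are comparable."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no reads to normalize")
    return {gene: count * 1_000_000 / total for gene, count in counts.items()}


# Made-up read-to-gene assignments for a single sample.
toy_assignments = ["ALB", "ALB", "ALB", "TP53", "AFP", "ALB", "TP53", "ALB"]
toy_counts = count_reads_per_gene(toy_assignments)
toy_cpm = counts_per_million(toy_counts)
print(toy_counts)
print(toy_cpm)
```

Real pipelines do this over millions of reads and tens of thousands of genes, but the resulting table has the same shape: one expression value per gene per sample.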

For instance, comparing RNA-seq data between normal and tumor liver tissue can show differences in actively expressed (or suppressed) genes and provide insight into the role of genetic components in influencing tumor progression.
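One simple way to express such a comparison is a log2 fold change per gene: the log2 ratio of average expression in tumor samples to average expression in normal samples. The sketch below is a toy version under stated assumptions. The gene labels and values are made up, and the pseudocount of 1 (to avoid dividing by zero) is a common convention we chose for illustration.

```python
import math
from statistics import mean


def log2_fold_change(tumor_values, normal_values, pseudocount=1.0):
    """log2 ratio of mean tumor expression to mean normal expression."""
    tumor_mean = mean(tumor_values) + pseudocount
    normal_mean = mean(normal_values) + pseudocount
    return math.log2(tumor_mean / normal_mean)


# Made-up expression values (arbitrary units), three samples per condition.
toy_expression = {
    "GENE_A": {"normal": [7, 7, 7], "tumor": [31, 31, 31]},
    "GENE_B": {"normal": [15, 15, 15], "tumor": [3, 3, 3]},
    "GENE_C": {"normal": [9, 10, 11], "tumor": [10, 10, 10]},
}
for gene, values in toy_expression.items():
    lfc = log2_fold_change(values["tumor"], values["normal"])
    print(f"{gene}: log2 fold change = {lfc:+.2f}")
```

A positive value means the gene is more highly expressed in tumor tissue, a negative value means it is suppressed, and a value near zero means little difference. A real analysis would also test whether the difference is statistically meaningful.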

By using approaches to classify and predict how our genes influence tumor progression, we can better understand how liver cancer functions and progresses so we can develop more intelligent therapeutics for this deadly disease.

Coming up:

This is the first post of a four-post series titled Cancer Genomics.

In the following posts, we’ll walk through liver cancer gene expression (RNA-seq) data from The Cancer Genome Atlas (TCGA), one of the most comprehensive publicly available datasets for cancer research. Datasets available through TCGA contain data on both biological (e.g., gene expression) and clinical (e.g., tumor progression) factors.

This data is available from many sources, but we obtained the genomic and clinical data from the R package RTCGA. The liver cancer data has already been exported to the accompanying GitHub repository for easy loading into a Python environment.
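As a rough idea of what loading such a table into Python can look like, here is a minimal sketch using only the standard library. The column names, sample IDs, and values are made up, and an in-memory string stands in for a file; the actual file names and layout in the repository are covered in the next post and may differ.

```python
import csv
import io


def load_expression_table(handle, id_column="sample_id"):
    """Read a comma-delimited table into {sample_id: {gene: expression_value}}."""
    reader = csv.DictReader(handle)
    table = {}
    for row in reader:
        sample = row.pop(id_column)
        table[sample] = {gene: float(value) for gene, value in row.items()}
    return table


# Stand-in for a data file; the header and values here are hypothetical.
toy_csv = io.StringIO(
    "sample_id,GENE_A,GENE_B\n"
    "sample_1,12.5,0.0\n"
    "sample_2,3.0,8.25\n"
)
expression = load_expression_table(toy_csv)
print(expression["sample_1"]["GENE_A"])  # prints 12.5
```

With a real file you would pass an open file handle instead of the `io.StringIO` object. In a notebook, `pandas.read_csv` is a common alternative that returns a data frame directly.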

In the following posts we will:

(Part II) Introduce and explore the publicly available liver cancer genomic data,

(Part III) Detect patterns of liver cancer progression from genomic data, and

(Part IV) Predict liver cancer progression from genomic data.

See you in the next post!