Dash Apps that Analyze RNA Sequencing Data for Drug & Vaccine Development
Author: Alyssa Morrow
Vaccines are powering the world’s news cycles in 2021. You may have heard professionals explaining terms like “spike proteins” or “herd immunity”. You may have even heard, in high school, that the mitochondria is the powerhouse of the cell. Put simply, biology and the field of bioinformatics have never been more crucial to the advancement of humanity.
The mitochondria may be the powerhouse of the cell, but what makes cells different? How can we measure those differences? And how are leading experts leveraging these tools to power medical breakthroughs?
In this blog post, we’ll show you how experts use Dash and Pileup.js to explore differentially expressed genes of cells that have been subject to different conditions.
Your body’s skin cells and neurons have the same DNA, but there are many additional factors that define the roles of cells within your body.
One of the ways that skin cells and neurons can perform such vastly different functions is through the expression of different genes. Genes are encoded in our DNA, and can be transcribed from our DNA to ultimately make proteins. However, before proteins are made, these genes can express themselves at different levels. These levels of gene expression, for all of the tens of thousands of genes in a cell, are referred to as the transcriptome.
So if the transcriptome can help us determine how cells develop differently, then what does it look like? The transcriptome is made up of many molecules of RNA, including messenger RNA, or mRNA for short. mRNA is transcribed from genes. Therefore, the number of mRNA molecules that we can count in a given cell gives us an idea of the level of expression for different genes. These mRNA molecules are eventually translated into amino acid sequences, which ultimately make proteins. The quantity and variety of proteins that are available in a cell ultimately help it perform a specific function, whether it be producing melanin, stimulating an electric current, or fighting off those suspicious sniffles.
Because the transcriptome varies so much between cells, it can tell us with precision how cells are different. We can use information about the transcriptome to determine exactly how skin cells differ from neuron cells (for example).
So, how do we measure and track the transcriptome? RNA sequencing (“RNA-seq”) is a technique that allows us to measure the quantity of mRNA in a biological sample. RNA-seq gives us an idea of the level of expression for each gene. We then use these different levels of expression for all known genes and compare them between different types of cells.
RNA-seq can be used to answer important questions about the transcriptome in biological, clinical, and drug development research:
- In biological research, we might investigate the effect of knocking out, or removing, the expression of a specific gene.
- In a clinical setting, we might identify genes that either inhibit or activate signaling pathways that are mis-regulated in cancer.
- Finally, RNA-seq can be used in drug development, where we can monitor how the transcriptome changes after drug interference.
However, before we can analyze RNA-seq data, we must first process it. First, we align sequenced reads to the genome, using tools such as STAR or Rsubread. Once these reads are aligned, individual genes can be quantified using tools for transcript quantification, such as RSEM. Other quantification tools, such as kallisto, allow genes to be quantified without the need for alignment. These quantification tools produce count data that indicate the level of expression for each gene, and ultimately all known genes, in an RNA-seq sample.
One common task in the analysis of count data from RNA-seq is the detection of genes that are differentially expressed between two conditions. As mentioned above, example analyses that evaluate changes in gene expression include:
- Comparison of gene expression between a CRISPR knockout and a corresponding control
- Identification of genes that are differentially expressed between a tumor sample and a healthy control sample
- Identification of changes in gene expression that result from the interference of a drug
In order to detect genes that differ between conditions, many tools, such as DESeq2 and edgeR, estimate the magnitude by which a given gene changes expression between two conditions. This magnitude is referred to as the effect size. Although these tools allow for initial identification of genes that are different between two conditions, visualization of RNA-seq data can help us better identify genes that are truly differentially expressed in our dataset (true positives).
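To make the idea of effect size concrete, here is a minimal sketch of a log2 fold change computed from mean counts in two conditions. The counts and the pseudocount are hypothetical values chosen for illustration, and the function is a naive calculation, not the statistical model that DESeq2 or edgeR actually fit.

```python
import math


def log2_fold_change(mean_a, mean_b, pseudocount=1.0):
    """Naive effect size of condition B relative to condition A (log2 scale).

    The pseudocount is an illustrative choice that avoids dividing by zero
    for genes with no counts in one condition.
    """
    return math.log2((mean_b + pseudocount) / (mean_a + pseudocount))


# Hypothetical normalized mean counts for a single gene.
basal_mean = 99.0
luminal_mean = 399.0

lfc = log2_fold_change(basal_mean, luminal_mean)
print(lfc)  # 2.0, i.e. roughly a four-fold increase in the luminal cells
```

A log2 fold change of 1 corresponds to a doubling of expression, and -1 to a halving, which is why an absolute value of 1 is a common starting threshold.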
In today’s example, we have borrowed from the CRUK CI Bioinformatics Core tutorial, which demonstrates how to process, align, and quantify RNA-seq samples from a mouse mammary dataset (SRA accession SRP045534). For the purposes of today’s tutorial, we assume that the samples have already been processed, and we would now like to analyze results to identify genes that are differentially expressed between two conditions. Here, we are particularly interested in identifying differentially expressed genes between two different cell types. Specifically, we will evaluate changes in gene expression between basal (B) and luminal (LUM) cell types in mouse mammary cells.
Visualizing and selecting differentially expressed genes with Dash Bio
First, we would like to visualize differentially expressed genes between our conditions of interest. Here we’ll use the VolcanoPlot from Dash Bio to read and visualize differentially expressed genes. A volcano plot visualizes the effect size of gene expression between two conditions on the x-axis, compared to the statistical significance of that gene (measured in -log10(p-value)) on the y-axis. You can read more about interactive volcano plots in this blog post.
Depending on the dataset, we may want to vary the threshold at which we consider a gene to be significant. Sometimes, we consider genes with an absolute effect size, or log2 fold change, greater than 1 to be of interest. By using a RangeSlider, we can vary the threshold at which we consider a gene to be strongly differentially expressed, and see which genes we gain or lose from updating this threshold.
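The thresholding described above can be sketched in plain Python. The gene names, fold changes, and p-values below are made up for illustration, and `select_genes` is a hypothetical helper, not part of Dash Bio. The `effect_range` argument mirrors the two-element value that a RangeSlider would hand to a callback.

```python
import math

# Hypothetical results: (gene, log2 fold change, p-value).
results = [
    ("geneA", 2.5, 1e-8),
    ("geneB", -1.8, 1e-4),
    ("geneC", 0.4, 1e-6),
    ("geneD", 3.1, 0.2),
]


def volcano_point(log2_fc, p_value):
    """Return the (x, y) position of a gene on a volcano plot."""
    return log2_fc, -math.log10(p_value)


def select_genes(results, effect_range=(-1.0, 1.0), p_cutoff=0.05):
    """Keep genes outside the effect-size range that also pass the p-value cutoff."""
    low, high = effect_range
    return [
        gene
        for gene, lfc, p in results
        if (lfc < low or lfc > high) and p < p_cutoff
    ]


print(select_genes(results))                            # ['geneA', 'geneB']
print(select_genes(results, effect_range=(-2.0, 2.0)))  # ['geneA']
```

Widening the range from ±1 to ±2 drops geneB, which is the kind of gain or loss we would watch for while moving the slider. In a Dash app, the slider value could be forwarded to the volcano plot instead; to our knowledge the VolcanoPlot component takes a two-element `effect_size_line` argument for this, but check the Dash Bio documentation for the exact parameters.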
Visualizing RNA-seq data with Pileup.js
To visualize aligned reads from our RNA-seq data in a genome browser, we use Dash and the Pileup.js Dash component (if you haven’t used Dash for Python, R or Julia, please see Introduction to Dash). The Pileup.js component is just one of the many open-source components available in Dash Bio. We can use Dash so that any gene selected in our volcano plot is automatically visualized in the Pileup.js component. As shown below, we can view various genes, which have varying levels of differential expression between the basal and luminal cell types.
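One way to connect the two views is to translate the selected gene into a genomic window for the browser. The sketch below is an assumption-laden illustration: `GENE_COORDS` is a hypothetical lookup table with made-up coordinates, `range_for_gene` is a hypothetical helper, and the returned keys (`contig`, `start`, `stop`) follow what we understand the Pileup component's `range` property to expect.

```python
# Hypothetical annotation table: gene name -> (contig, start, end).
GENE_COORDS = {
    "geneA": ("chr1", 10_000, 12_500),
    "geneB": ("chr2", 500, 1_500),
}


def range_for_gene(gene, padding=1_000):
    """Return a padded genomic window around a gene for the genome browser."""
    contig, start, end = GENE_COORDS[gene]
    return {
        "contig": contig,
        "start": max(0, start - padding),  # never go below the contig start
        "stop": end + padding,
    }


print(range_for_gene("geneA"))
# {'contig': 'chr1', 'start': 9000, 'stop': 13500}
```

In a Dash callback, the gene picked in the volcano plot would be the input and this dictionary would be the output that updates the viewer's range.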
We can first explore the number of RNA-seq reads that are aligned to a given gene by looking at a coverage track in Pileup.js, which shows us the counts of reads at every position in the genome. For some genes, the counts may be very different between conditions. For other genes, however, the counts may be very similar, which may indicate a false positive in our dataset. We can also use our visualization of RNA-seq data to confirm that reads aligned to a given gene actually belong there. Sometimes, reads can align to a gene without actually belonging there. This is similar to forcing a puzzle piece into the wrong spot of a puzzle: it would stick out like a sore thumb! Luckily, we can use pileup tracks to visualize individual RNA-seq reads. Reads that have many mismatches (shown as red, green, blue, and yellow ticks) may not be well aligned. This suggests that these genes are hard to evaluate, and may need additional attention.
We may also want to use the Pileup.js component to view other types of data collected from the same set of conditions alongside our RNA-seq data. For example, we may have matching ATAC-seq, DNase-seq, or ChIP-seq data, and may be interested in how these datasets change between conditions in the regions surrounding differentially expressed genes. In these cases, we can simply replace the RNA-seq data in the Pileup.js component with any other file of interest.
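To illustrate what the coverage and pileup tracks summarize, here is a small sketch that computes per-position read depth and flags reads with many mismatches. The reads are hypothetical tuples, and the 20% mismatch cutoff is an arbitrary value for illustration, not a rule used by Pileup.js.

```python
from collections import Counter

# Hypothetical aligned reads: (start, end, mismatches), with end exclusive.
reads = [
    (100, 105, 0),
    (102, 107, 1),
    (104, 109, 4),
]


def coverage(reads):
    """Count how many reads overlap each genomic position."""
    depth = Counter()
    for start, end, _ in reads:
        for pos in range(start, end):
            depth[pos] += 1
    return depth


def poorly_aligned(reads, max_mismatch_fraction=0.2):
    """Return reads whose mismatch fraction exceeds the cutoff."""
    return [
        read
        for read in reads
        if read[2] / (read[1] - read[0]) > max_mismatch_fraction
    ]


depth = coverage(reads)
print(depth[104])             # 3, since all three reads overlap position 104
print(poorly_aligned(reads))  # [(104, 109, 4)]
```

Comparing depth profiles like this between two conditions is what the coverage track lets us do by eye, and the flagged read is the kind that would show up covered in colored ticks.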
Using the Pileup.js Dash component for your own analysis
In this tutorial, we leveraged the Pileup.js component to visualize RNA-seq data. Dash Bio also supports Igv.js (https://dash.plotly.com/dash-bio/igv), which similarly allows visualization of multiple types of genomic data in a genome browser format. Igv.js supports additional file formats that are not supported in Pileup.js, including bigWig, CRAM, and TDF files. Although Pileup.js and Igv.js can be used interchangeably, we find that Pileup.js works well when your data is not stored in traditional bioinformatics file formats. Pileup.js supports the visualization of data through JSON. This makes it easier for users to convert and visualize data that may be stored in a dataframe or matrix, instead of bioinformatics formats such as BAM, BED, or VCF files. A comparison of the cases in which you can use Igv.js and Pileup.js is shown in the table below.
The example below shows how we can load in a JSON file, containing alignment information in GA4GH format, into a Pileup.js component. Creating a Dash app with the Pileup genome viewer is as simple as this:
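A minimal sketch of the configuration for such a viewer follows. It only assembles the properties as plain Python data, so it runs without Dash installed. The property names (`range`, `reference`, `tracks`), the track keys (`viz`, `source`, `sourceOptions`), and the `alignmentJson` source type reflect our reading of the Dash Bio Pileup documentation and should be checked against it. The component id, the reference label, and the reference URL are placeholders, and `build_pileup_props` is a hypothetical helper.

```python
import json


def build_pileup_props(alignment_json, contig, start, stop):
    """Assemble keyword arguments for a Pileup genome viewer.

    alignment_json is a JSON string of GA4GH-formatted alignments,
    e.g. the contents of a file read from disk.
    """
    return {
        "id": "rna-seq-pileup",  # placeholder component id
        "range": {"contig": contig, "start": start, "stop": stop},
        "reference": {
            "label": "mm10",                         # placeholder genome label
            "url": "https://example.org/mm10.2bit",  # placeholder URL
        },
        "tracks": [
            {"viz": "scale", "label": "Scale"},
            {"viz": "location", "label": "Location"},
            {
                "viz": "coverage",
                "label": "Coverage",
                "source": "alignmentJson",
                "sourceOptions": alignment_json,
            },
            {
                "viz": "pileup",
                "label": "Reads",
                "source": "alignmentJson",
                "sourceOptions": alignment_json,
            },
        ],
    }


# Stand-in for the contents of a GA4GH alignment JSON file.
alignment_json = json.dumps({"alignments": []})
props = build_pileup_props(alignment_json, "chr1", 9_000, 13_500)
print([track["viz"] for track in props["tracks"]])
# ['scale', 'location', 'coverage', 'pileup']
```

In an app, these properties would be unpacked into the component, along the lines of `dashbio.Pileup(**props)` placed inside the app layout, with `dashbio` being the usual alias for the `dash_bio` package.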
In summary, the Pileup.js component supports the following types of visualizations:
- Genome coverage
If you would like to learn more about how this works in practice, watch this recorded webinar ➡️ go.plotly.com/bioinformatics
H. Martinez et al., Concurrent and Accurate Short Read Mapping on Multicore Processors, IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 12, no. 5. IEEE, pp. 995-1007, 01-Sep-2015.
Y. Liao, G. K. Smyth, and W. Shi, The R package Rsubread is easier, faster, cheaper and better for alignment and quantification of RNA sequencing reads, Nucleic Acids Research, vol. 47, no. 8. Oxford University Press, 07-May-2019.
M. D. Robinson, D. J. McCarthy, and G. K. Smyth, edgeR: a Bioconductor package for differential expression analysis of digital gene expression data, Bioinformatics, vol. 26, no. 1. Oxford University Press, pp. 139-140, 11-Nov-2009.