How to Build Volcano Plots Using Plotly for Quantitative Analysis of Omics Data
Volcano plots are one of the first and most important graphs to plot for an omics dataset analysis. By plotting a scatterplot of -log10(Adjusted p-value) against log2(Fold change) values, users can quickly visualise the distribution and identity of genes or proteins that are most differentially expressed.
The genes with the greatest fold changes and significant p-values are also ideal targets for validation and have been consistently shown to be highly reproducible across multiple platforms (Leiming Shi et al., BMC Bioinformatics, 2008).
The volcano plot is comprised of a two-step procedure:
- First, log2 fold-change (log2FC) is determined by taking the ratio of the gene abundance in the treatment group to the control group, followed by a log2 transformation to obtain a normal or near-normal distribution. Values > 0 are considered as upregulated genes, whereas values < 0 are downregulated.
- Second, an adjusted p-value (or q-value), corrected for multiple corrections, is used to calculate if the gene expression changes between the treatment and control groups are significantly different. This is then followed by a -log10 transformation to obtain the -log10(adjusted p-value).
To illustrate how to build volcano plots with Python, we will analyse a transcriptomics dataset published by Zak et al., PNAS, 2012. In this study, seronegative and seropositive subjects were given the MRKAd5/HIV vaccine, and the transcriptomic responses in peripheral blood mononuclear cells (PBMC) were measured at 6 hours, 1 day, 3 day and 7 days post-vaccination.
On my end, I have processed and compiled the fold-change, ratio, p-values and adjusted p-values (q-values) of the seronegative subjects with respect to time = 0 using the Partek Genomics Suite, and the processed data is available in GitHub.
I strongly recommend using JupyterLab or Jupyter Notebook, which is a web-based interactive development environment for storing and executing Python codes. You may visit here for specific instructions on how you can install them on your computer.
We first load the required packages (pandas, NumPy and Plotly) to plot our volcano plot using Python. Note that you will need to download the packages before you can start using them.
I have chosen to use the Plotly graphing library for data visualisation, as Plotly allows users to hover over data points to query data point attributes. The commands are as follows:
import numpy as np
import pandas as pd
import plotly.graph_objects as go
import plotly.express as px
Next, we will load and inspect the processed dataframe from GitHub. It is also important to label the gene column as the index column so that we can reference the specific points to gene names. Commands are as follows:
df = pd.read_csv('https://raw.githubusercontent.com/kuanrongchan/vaccine-studies/main/Ad5_seroneg.csv',index_col=0)
The output file shows the values of p-value (pval), adjusted p-values (qval), ratio, and fold change (fc) for 6 hours, 1-day, 3-day and 7-day time-points compared to baseline (timepoint = 0):
The next step will be to create new columns for log2FC and -log10(adjusted p-value). The commands are as follows:
df['log2FC_6h'] = np.log2(df['ratio_6h'])
df['log2FC_1d'] = np.log2(df['ratio_1d'])
df['log2FC_3d'] = np.log2(df['ratio_3d'])
df['log2FC_7d'] = np.log2(df['ratio_7d'])df['negative_log_pval_6h'] = np.log10(df['qval_6h']) * (-1)
df['negative_log_pval_1d'] = np.log10(df['qval_1d']) * (-1)
df['negative_log_pval_3d'] = np.log10(df['qval_3d']) * (-1)
df['negative_log_pval_7d'] = np.log10(df['qval_7d']) * (-1)
Now we are ready to plot the volcano plots for the different time points. The advantage of using Python is that we can overlay the volcano plots of the different time points all in one graph. The commands are as follows:
fig = go.Figure()trace1 = go.Scatter(
)trace2 = go.Scatter(
trace3 = go.Scatter(
trace4 = go.Scatter(
fig.add_trace(trace4)fig.update_layout(title='Volcano plot for seronegatives')
I will give a brief description of the above code. We first tell Plotly that we are going to plot a figure using the go.Figure() command. Next, we overlay the different scatterplots using the different traces (one for each time-point). In each trace, we specify (i) the columns for the x and y-axis, (ii) indicate that we only want to plot points using the mode= ‘markers’, (iii) indicate the figure legends for the different traces and (iv) indicate the text labels when we hover over the data points. Under the ‘fig.update_layout’, we also added the title of the plot. Output for the graph is thus as follows:
At one glance, you can quickly appreciate that day 1 has the most differentially expressed genes as compared to the other time points. You can also mouse over the data points to examine which of these genes are most differentially expressed. An example I have provided is CXCL10, which is one of the genes that is induced to a great extent after one day, post-vaccination.
Now that we have established that day 1 is the most interesting time-point, we may want to zoom into the genes that are at the edges of the volcano plot. We can type in the below command to specifically plot the day 1 volcano plot. Note that I have added the text in the volcano plot this time (using the ‘textposition’ command) so that we can quickly visualise the genes that are most differentially expressed:
fig_d1 = px.scatter(df, x='log2FC_1d', y='negative_log_pval_1d', text=df.index)fig_d1.update_traces(textposition='top center')fig_d1.update_layout(
title_text='Volcano plot for seronegatives (day 1)'
Output file is as follows:
Now, we can quickly visualise that the top upregulated genes at day 1 include the interferon-related genes, such as IDO1, CXCL10, RSAD2, CCL2, CCL8 and LAMP3. As you can begin to appreciate, a volcano plot allows users to have quick sensing of the data and visualisation of the most pivotal genes and proteins that are most differentially expressed.
For easy referencing, the full set of codes can be found below:
And there you have it. Thanks for reading.