How to Optimize Selection Criteria Using ipywidgets

Kuan Rong Chan, Ph.D.
Omics Diary
Published in
7 min readOct 28, 2021

--

Multi-omics quantification analysis allows scientists to uncover the molecular mechanisms underlying biological phenomena. Analysing of omics data begins with volcano plot and differentially expressed gene (DEGs) analysis

Gene expression (or transcriptomics) profiling is the most common type of omics data. The identification of differentially expressed genes (DEGs) from transcriptomics data is critical to understanding the molecular driving forces or identifying the molecular biomarkers behind biological phenotypes.

The comparison is usually performed with respect to controls or baseline conditions, to identify the gene transcripts that are differentially modulated by treatment conditions. To facilitate the identification of DEGs, the fold-change values and p-values (or adjusted p-values) of transcripts in treated versus control conditions will have to be calculated. The selection criteria for DEGs will then be based on fold change and p-values cutoffs.

The fold change cut-off is often selected arbitrarily to be either 1.5 or 2. This cut-off serves to remove genes with smaller differences, as these are deemed to be biologically unimportant changes. However, depending on the treatment conditions and the research question, these default cutoffs may or may not be suitable.

Consider a scenario where a fold-change cutoff 2.0 identified 15 only genes, whereas a fold-change cutoff 1.5 identified 200 DEGs. If the research question is to identify biomarkers, perhaps a fold-change cutoff of 2 is acceptable as this ensures the identification of genes that are most dramatically altered by infection.

However, if the question is to characterise the biological processes associated with infection, then a cutoff of 2, in this case, will be too stringent, which will cause large voids in data analysis. Another elegant example is demonstrated by Mark Dalman et al., 2012 where he showed that choosing the inappropriate fold change and statistical cutoffs can lead to erroneous interpretation of biological pathways.

Statistical significance is usually based on a p-value of 0.05. A student t-test can be used if the researcher is asking if a particular gene is changed as compared to control conditions. However, if the research question is to investigate which transcripts among the 10,000 transcripts are differentially modulated, then the student t-test is over-simplified. This is because of the multiple-testing problem, which states that the more inferences made in a dataset, the more likely a significant gene may be identified by coincidence. Hence, to correct for multiple comparisons, statistical tests such as SAM (significance analysis of microarray), Bonferroni correction, Benjamini–Hochberg procedure etc, can be used.

However, as described in the previous section, it is critical to ensure that the statistical test used should not lead to high false-positive rates that may lead to misinterpretation of the data. Based on my experience, I recommend using a false discovery rate adjusted p-value < 0.05 using the Benjamini-Hochberg Step-Up FDR-controlling Procedure (α=0.05).

Since we need to find out the most appropriate cut-off to identify DEGs, interactive Python codes that allow users to manipulate thresholds will be the most suitable to define the selection criteria for DEGs. In this blog entry, we will use IPython widgets (ipywidgets), which can generate sliders that allow users to set cutoffs or thresholds interactively. The output based on the selected cutoffs will then be generated in real-time for users to evaluate their suitability.

As a proof of concept, we will use the transcriptomics dataset published by Zak et al., PNAS, 2012, examining how seropositive and seronegative subjects respond to the Ad5 vaccine across various time points.

The summary of the study can be found here, and the processed dataset, which was analysed by Partek Genomics Suite can be found in GitHub. The fold change, ratio, p-value and adjusted p-values (q-value) are calculated with respect to baseline (timepoint = 0).

To install ipywidgets, simply type in the below command (you only need to do this once):

pip install ipywidgets

We can execute the ipywidgets within JupyterLab or Jupyter Notebooks. The instructions for downloading them can be found here. Next, we import the packages required for analysis:

import numpy as np
import pandas as pd
import ipywidgets as widgets
import plotly.express as px
import plotly.graph_objects as go

Next, we will load and inspect the processed dataframe from GitHub. It is also important to label the gene column as the index column so that we can refer to the column for DEG counts. Commands are as follows:

df = pd.read_csv('https://raw.githubusercontent.com/kuanrongchan/vaccine-studies/main/Ad5_seroneg.csv',index_col=0)
df.head()

The output file shows the values of the p-value (pval), adjusted p-values (qval), ratio, and fold change (fc) for 6 hours, 1-day, 3-day and 7-day time points compared to baseline (timepoint = 0):

Next, we will execute the ipywidgets command:

from ipywidgets import interact, interact_manual

@interact
def show_DEGs(column1=['fc_6h','fc_1d','fc_3d', 'fc_7d'], x=(1,3,0.1),column2=['qval_6h','qval_1d','qval_3d','qval_7d']):
return print('Total DEGs:', len(df[(abs(df[column1]) > x) & (df[column2] < 0.05)].index))

The above code may seem intimidating, but you can break this down into bite sizes.

First, we import the tools from ipywidgets required to build the interactive features. The @interact decorator automatically creates a text box and a slider for choosing values from a designated column and number.

In this case, we created 2 columns (column1 and column2) to represent values in fold-change and adjusted p-values (q-values) respectively. The function x=(1,3,0.1) means that column1 values (which is fc values in this case) can be changed from 1 to 3, at intervals of 0.1. Finally, we will then print out the total number of DEGs (includes both up and downregulated DEGs) more than the value of x, and with a q-value < 0.05.

The output is as follows:

I have taken multiple snapshots so that we can appreciate the number of DEGs across different days. In this case, the total number of DEGs for each time-point with fold-change > 2 and adjusted p-values < 0.05 are shown. Consistent with the volcano plot, we see that the greatest number of DEGs are in day 1.

We can change the x value to 1.5 to visualise the number of DEGs with fold-change value > 1.5, adjusted p-values < 0.05. The output is as follows:

As expected, we see more DEGs as the fold-change cut-off is less stringent. Note that the number of DEGs in day 1 increased by approximately 3-fold when we reduced the fold-change value from 2 to 1.5, which is quite a big difference! It is thus more reasonable to use a fold-change of 1.5, in this case, to avoid losing too much data.

After we decide that a fold-change value of 1.5 and q-value < 0.05 is most appropriate for our analysis, we will plot these details in a stacked bar. We first calculate the number of upregulated (fold-change > 1.5, adjusted p-value < 0.05) and number of downregulated (fold-change < -1.5, adjusted p-value < 0.05) for each time-point. The commands are as shown:

DEGs_up_6h = len((df[(df['fc_6h'] > 1.5) & (df['qval_6h'] < 0.05)]).index)
DEGs_down_6h = len((df[(df['fc_6h'] < -1.5) & (df['qval_6h'] < 0.05)]).index)
DEGs_up_1d = len((df[(df['fc_1d'] > 1.5) & (df['qval_1d'] < 0.05)]).index)
DEGs_down_1d = len((df[(df['fc_1d'] < -1.5) & (df['qval_1d'] < 0.05)]).index)
DEGs_up_3d = len((df[(df['fc_3d'] > 1.5) & (df['qval_3d'] < 0.05)]).index)
DEGs_down_3d = len((df[(df['fc_3d'] < -1.5) & (df['qval_3d'] < 0.05)]).index)
DEGs_up_7d = len((df[(df['fc_7d'] > 1.5) & (df['qval_7d'] < 0.05)]).index)
DEGs_down_7d = len((df[(df['fc_7d'] < -1.5) & (df['qval_7d'] < 0.05)]).index)

Finally, we can plot the results in a stacked bar format using the Plotly graphing library, as Plotly allows users to hover over data points to query data point attributes. The commands are as follows:

days = ['6hrs', 'day 1', 'day 3', 'day 7']fig = go.Figure(data=[
go.Bar(name='downregulated', x=days, y=[DEGs_down_6h, DEGs_down_1d, DEGs_down_3d, DEGs_down_7d]),
go.Bar(name='upregulated', x=days, y=[DEGs_up_6h, DEGs_up_1d, DEGs_up_3d, DEGs_up_7d])
])
fig.update_layout(
barmode='stack',
title = 'DEGs in seronegative subjects',
yaxis_title='Number of DEGs',
font=dict(
family='Arial', size=18))
fig.show()

I have done a few more customisations in the graph, including adding a title, y-axis titles and changing the font to make the graph look more professional. The output is as shown:

At one glance, you can quickly appreciate that day 1 has the most DEGs as compared to the other time points, and there are more upregulated compared to downregulated DEGs. You can also mouse over the data points to obtain the precise number of up and downregulated DEGs at every time point.

For easy referencing, the full set of codes can be found below:

And there you have it. Thanks for reading.

More content at plainenglish.io

--

--

Kuan Rong Chan, Ph.D.
Omics Diary

Kuan Rong Chan, PhD, Senior Principal Research Scientist in Duke-NUS Medical School. Virologist | Data Scientist | Loves mahjong | Website: kuanrongchan.com