Open source bioinformatics tutorials

Abhilash Dhal
8 min readNov 14, 2023

--

Chapter 1: Analysis of RNA-Seq data

Background:

The global bioinformatics market size is expected to explode 3X from 2019 to 2027. This coupled with the infusion of AI will inherently lead to:

  • new omics based technologies (molecular assays, sequencing technologies, optical high-throughput technologies, etc)
  • plethora of open source tools for enabling disease specific, assay specific computational analysis.
  • demand for more data professionals (data-scientists, analysts and bioinformaticians in the domain.
Global Bio-info market size expansion (https://www.openpr.com/news/2264973/bioinformatics-market-report-2021-market-stood-at-8-5-billion)

Dilemma of domain change to bioinformatics for aspiring data professionals:

Unlike other domains (E-Commerce, Finance, Supply-Chain, Electronics, etc), learning biology and high-throughput technologies have the following complexities:

  • High throughput biological datasets (scRNA, bulk RNA, SNP, WGS, peptide arrays, ect
  • Categorical features (clinical manifests, sample manifests, gene annotations, etc)
  • Lots of documentation (that can be overwhelming to read)

Given the dilemma faced for a transition into bioiformatics domain, I started building a common platform called OmixHub (https://github.com/adhal007/OmixHub) to help students get started on exploring different datasets and applying ML/DL algorithms for analysis.

Here, I present an example of applying differential expression to kidney cancer RNA-Seq dataset with 3 sub-types of cancer using a wrapper with PyDeSeq utilities (https://github.com/adhal007/OmixHub/blob/main/OmicsUtils/pydeseq_utils.py)

0. Libraries needed

import OmicsUtils.pydeseq_utils as pydeseq_utils
import pandas as pd

I. Data Processing

1.1 Read in counts data and metadata

count_data_kidney_cancer = pd.read_csv('./Transcriptomics/data/processed_data/stranded_first_data_with_labels.csv')
count_data_kidney_cancer
count_data_kidney_cancer[['Case ID', 'Project ID']]
kidney_cancer_count_data = count_data_kidney_cancer.iloc[:, :60660].T
counts = kidney_cancer_count_data.copy().reset_index()
counts = counts.set_index('index')
counts
Figure1.1: Counts dataframe for kidney cancer (1643 samples and 60660 genes)

1.2. Transpose the data for deseq2

counts = counts.T
counts = pd.concat( [count_data_kidney_cancer[['Case ID']], counts],axis=1)
counts.rename(columns={'Case ID':'Geneid'}, inplace=True)
counts.set_index('Geneid', inplace=True)
counts.head()
Figure 1.2. Transposed dataframe

II. Metadata pre-processing for PyDeSeq

2.1. Create a metadata dataframe with Case ID and Cancer subtype

Only 3 cancers from TCGA (Renal Cell Carcinoma) are included in this analysis (Single Factor analysis). But it is fairly straightforward to apply the class in the OmicsHub module forhttps://github.com/adhal007/OmixHub/blob/main/OmicsUtils/pydeseq_utils.py for multifactor analysis.

Figure 2.1. Metadata for PyDeSeq

2.2. Subset the counts data to only include the 3 cancers

The original dataframe has 1643 samples, belonging to other cancer conditions as well that we were not intereseted in. Hence we will subset for 3 classes of kidney cancer (TCGA-KIRP, TCGA-KIRC, TCGA-KICH)

counts_rcc = counts[counts.index.isin(TCGA_rcc_metadata.index.values)]

III. Running Deseq2

from importlib import reload
import OmicsUtils.pydeseq_utils as pydeseq_utils
reload(pydeseq_utils)
<module 'OmicsUtils.pydeseq_utils' from '/Users/abhilashdhal/Projects/OmicsUtils/pydeseq_utils.py'>

## Initialize the pydeseq_utils object
pydeseq_obj = pydeseq_utils.PyDeSeqWrapper(count_matrix=counts_rcc, metadata=TCGA_rcc_metadata, design_factors='Condition', groups = {'group1':'TCGA-KIRC', 'group2':'TCGA-KIRP'})
design_factor = 'Condition'
result = pydeseq_obj.run_deseq(design_factor=design_factor, group1 = 'TCGA-KIRC', group2 = 'TCGA-KIRP')

Here you can see that the design factor takes only a single factor ‘Condition’ for performing differential analysis. If we set design_factor to

['Condition', 'Tumor']

Then our analysis would incorporate multi-factor inference by PyDeSeq.

13/11//2023 05:59:1699878581 PM - INFO - PyDeSeqWrapper.run_deseq: Running DESeq2 for groups: {'group1': 'TCGA-KIRC', 'group2': 'TCGA-KIRP'}
13/11//2023 05:59:1699878581 PM - INFO - PyDeSeqWrapper.run_deseq: Running DESeq2 for groups: {'group1': 'TCGA-KIRC', 'group2': 'TCGA-KIRP'}
13/11//2023 05:59:1699878581 PM - INFO - PyDeSeqWrapper.run_deseq: Running DESeq2 for groups: {'group1': 'TCGA-KIRC', 'group2': 'TCGA-KIRP'}
13/11//2023 05:59:1699878581 PM - INFO - PyDeSeqWrapper.run_deseq: Running DESeq2 factor analysis with design factor: C and o
13/11//2023 05:59:1699878581 PM - INFO - PyDeSeqWrapper.run_deseq: Running DESeq2 factor analysis with design factor: C and o
13/11//2023 05:59:1699878581 PM - INFO - PyDeSeqWrapper.run_deseq: Running DESeq2 factor analysis with design factor: C and o
13/11//2023 05:59:1699878581 PM - INFO - PyDeSeqWrapper.run_deseq: Statistical analysis of TCGA-KIRC vs TCGA-KIRP in {'group1': 'TCGA-KIRC', 'group2': 'TCGA-KIRP'}
13/11//2023 05:59:1699878581 PM - INFO - PyDeSeqWrapper.run_deseq: Statistical analysis of TCGA-KIRC vs TCGA-KIRP in {'group1': 'TCGA-KIRC', 'group2': 'TCGA-KIRP'}
13/11//2023 05:59:1699878581 PM - INFO - PyDeSeqWrapper.run_deseq: Statistical analysis of TCGA-KIRC vs TCGA-KIRP in {'group1': 'TCGA-KIRC', 'group2': 'TCGA-KIRP'}


/opt/homebrew/anaconda3/envs/umap-env/lib/python3.11/site-packages/anndata/_core/anndata.py:1897: UserWarning: Observation names are not unique. To make them unique, call `.obs_names_make_unique`.
utils.warn_names_duplicates("obs")
Fitting size factors...
... done in 0.71 seconds.

Fitting dispersions...
... done in 12.99 seconds.

Fitting dispersion trend curve...
... done in 4.57 seconds.

Fitting MAP dispersions...
... done in 14.78 seconds.

Fitting LFCs...
... done in 30.33 seconds.

/opt/homebrew/anaconda3/envs/umap-env/lib/python3.11/site-packages/anndata/_core/anndata.py:1897: UserWarning: Observation names are not unique. To make them unique, call `.obs_names_make_unique`.
utils.warn_names_duplicates("obs")
Refitting 10827 outliers.

/opt/homebrew/anaconda3/envs/umap-env/lib/python3.11/site-packages/anndata/_core/anndata.py:1897: UserWarning: Observation names are not unique. To make them unique, call `.obs_names_make_unique`.
utils.warn_names_duplicates("obs")
/opt/homebrew/anaconda3/envs/umap-env/lib/python3.11/site-packages/anndata/_core/anndata.py:1897: UserWarning: Observation names are not unique. To make them unique, call `.obs_names_make_unique`.
utils.warn_names_duplicates("obs")
Fitting size factors...
... done in 0.08 seconds.

Fitting dispersions...
... done in 2.75 seconds.

Fitting MAP dispersions...
... done in 2.71 seconds.

Fitting LFCs...
... done in 5.80 seconds.

Then to retrieve the result object we need to additionally run the summary function like this on the result variable from the output of PyDeSeq module.

result.summary()
Running Wald tests...
... done in 26.52 seconds.

Log2 fold change & Wald test p-value: Condition TCGA-KIRC vs TCGA-KIRP
Figure 3.1 Result dataframe of differential gene expression

IV. Data Interpretation/Visualization of the results

4.1. Merging gene annotations

In order to better visualize the results we need to merge the gene annotations

In the dataframes above, the ```ENSG***``` are the gene_ids from the emsemble database for each gene. Corresponding to each gene_id there maybe multiple gene_names as well as other annotation variables. In a typically RNA-Seq experiment, we want to be able to showcase some of the top genes that are identified after our statistical analysis. For that purpose we will only be merging gene names with the gene_ids.

gene_annotation = pd.read_csv('./Transcriptomics/data/gene_annotation/gene_id_to_gene_name_mapping.csv')

results_df = result.results_df
results_df_filtered = results_df.dropna()
results_df_filtered = results_df_filtered.reset_index()
results_df_filtered['nlog10'] = -1*np.log10(results_df_filtered.padj)

results_df_filtered = results_df_filtered.merge(gene_annotation, left_on='index', right_on='gene_id')
results_df_filtered = results_df_filtered .replace([np.inf, -np.inf], 300)
results_df_filtered.sort_values('padj', ascending=False)

NOTE: gene_annotations are always available as GTX files and (OmixHub) https://github.com/adhal007/OmixHub/ has a method to directly obtain the .tsv version from a .gtx file. (Please check out the documentation and leave a comment)

4.2. Screening differentially expressed genes

Applying outlier threshold for log2FoldChange and nlog10 padj using tukey’s fences

k = 3.0
log2_threshold = k*(np.percentile(results_df_filtered['log2FoldChange'], 75) - np.percentile(results_df_filtered['log2FoldChange'], 25)) + np.percentile(results_df_filtered['log2FoldChange'], 75)
log10_threshold = k*(np.percentile(results_df_filtered['nlog10'], 75) - np.percentile(results_df_filtered['nlog10'], 25)) + np.percentile(results_df_filtered['nlog10'], 75)
print(f"The threshold for log2FoldChange is {log2_threshold}")
print(f"The threshold for nlog10 is {log10_threshold}")
The threshold for log2FoldChange is 3.0146824737036013
The threshold for nlog10 is 61.22039206963072
#picked1 and picked2 simulate user lists of genes to label by color
picked1 = random.choices(population=results_df_filtered.gene_name.tolist(), weights = results_df_filtered.nlog10.tolist(), k = 250)
picked2 = random.choices(population=results_df_filtered.gene_name.tolist(), weights = results_df_filtered.nlog10.tolist(), k = 300)
picked2 = [x for x in picked2 if x not in picked1]

def map_color(a):
log2FoldChange, gene_name, nlog10 = a

if abs(log2FoldChange) < log2_threshold or nlog10 < log10_threshold:
return 'Not significant'
if gene_name in picked1:
return 'picked1'
if gene_name in picked2:
return 'picked2'

return 'Cherry picked'

results_df_filtered['color'] = results_df_filtered[['log2FoldChange', 'gene_name', 'nlog10']].apply(map_color, axis = 1)

#picked3 and picked24 simulate user lists of genes to label by shape
df = results_df_filtered.copy()
picked3 = random.choices(df.gene_name.tolist(), weights = df.nlog10.tolist(), k = 250)
picked4 = random.choices(df.gene_name.tolist(), weights = df.nlog10.tolist(), k = 300)
picked4 = [x for x in picked4 if x not in picked3]

def map_shape(symbol):
if symbol in picked3:
return 'picked3'
if symbol in picked4:
return 'picked4'

return 'not_important'

df['shape'] = df.gene_name.map(map_shape)
df.head()

Here in the above code, we have created

  • automated thresholding using outlier sum statistics for log2fold change and -log(P_adj) to screen for highly expressed and significant genes in a data driven manner.
  • some mappings for different significant genes that were picked so we can visualize them via a scatter plot.
Figure 4.2. Shows the color and shape column used for plotting

4.3. Plot differentially expressed genes

Full volcano plot: To see the scale of fold change and significance of p-values

ax1 = sns.scatterplot(data = df, x = 'log2FoldChange', y = 'nlog10',
hue = 'color', hue_order = ['Not significant', 'picked1', 'picked2', 'Cherry picked'],
palette = ['lightgrey', 'orange', 'purple', 'grey'],
style = 'shape',
style_order = ['picked3', 'picked4', 'not_important'],
markers = ['^', 's', 'o'],
size = 'baseMean', sizes = (40, 400),
)

Zoomed in and well annotated plot

plt.figure(figsize = (6,6))
ax = sns.scatterplot(data = df, x = 'log2FoldChange', y = 'nlog10',
hue = 'color', hue_order = ['Not significant', 'picked1', 'picked2', 'Cherry picked'],
palette = ['lightgrey', 'orange', 'purple', 'grey'],
style = 'shape',
style_order = ['picked3', 'picked4', 'not_important'],
markers = ['^', 's', 'o'],
size = 'baseMean', sizes = (40, 400))

ax.axhline(log10_threshold, zorder = 0, c = 'k', lw = 2, ls = '--')
ax.axvline(log2_threshold, zorder = 0, c = 'k', lw = 2, ls = '--')
ax.axvline(-log2_threshold, zorder = 0, c = 'k', lw = 2, ls = '--')



## print texts for only very top genes
texts = []
for i in range(len(df)):
if df.iloc[i].nlog10 > 150 and abs(df.iloc[i].log2FoldChange) > 6.5:
texts.append(plt.text(x = df.iloc[i].log2FoldChange, y = df.iloc[i].nlog10, s = df.iloc[i].gene_name,
fontsize = 8, weight = 'bold'))

# adjust_text(texts, arrowprops = dict(arrowstyle = '-', color = 'k'))

plt.legend(loc = 1, bbox_to_anchor = (1.4,1), frameon = False, prop = {'weight':'bold'})

for axis in ['bottom', 'left']:
ax.spines[axis].set_linewidth(2)

ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

ax.tick_params(width = 2)
ax.set_xlim(-10, 10)
ax.set_ylim(100, 400)
plt.xticks(size = 12, weight = 'bold')
plt.yticks(size = 12, weight = 'bold')

plt.xlabel("$log_{2}$ fold change", size = 15)
plt.ylabel("-$log_{10}$ FDR", size = 15)
plt.title('Differential gene expression in TCGA-KIRC vs TCGA-KIRP', weight='bold', size = 15)
plt.savefig('volcano.png', dpi = 300, bbox_inches = 'tight', facecolor = 'white')

plt.show()

CONCLUDING REMARKS:

In this article, we have learnt have to use PyDeSeq quickly for analysis a real dataset available on NIH’s Genomic Data Commons (https://gdc.cancer.gov/developers/gdc-application-programming-interface-api). Additionally, I’ve also provided exposure into how to structure your analysis, streamline your code and build open-source tools for bioinformatics related software.

STAY TUNED:

  • Please leave a like if you found this useful, and reach out if you want to contribute to the OmixHub (https://github.com/adhal007/OmixHub).
  • In the next article, I will showcase how to apply Supervised models to RNA-Seq datasets from one of the modules on OmixHub.

--

--

Abhilash Dhal

Data Scientist| IITian| Bioinformatics| ML and AI Enthusiast