How to use GSEApy for pathway enrichment analysis
An Enrichr pathway analysis package within the Python wrapper
We have used volcano plots, identified the differentially expressed genes (DEGs) and plotted in clustergrams to show the relative expression of these DEGs. The next logical step now will be to query the function of these DEGs. There are a myraid of web tools, such as Database for Annotation, Visualization and Integrated Discovery (DAVID), Ingenuity Pathway Analysis (IPA), Metacore, PandaOmics, Enrichr, GSEA, STRING, Panther, Reactome and KEGG that you can consider as query databases. In this blog post, we will focus on using Enrichr to query on the functionality and localisation of DEGs.
Enrichr is user-friendly, free to use and allows querying of gene lists against a repertoire of gene-set libraries. Moreover, there is GSEApy, which is a Python wrapper for Enrichr, allowing users to query the characteristics of your DEGs within Python. I strongly recommend using JupyterLab or Jupyter Notebook for visualising the output of GSEApy, which is a web-based interactive development environment for storing and executing Python codes. You may visit here for specific instructions on how you can install them on your computer.
If you have not downloaded GSEApy, you can download by entering this command:
pip install gseapy
We first load the required packages and run matplotlib inline using the following codes:
import numpy as np
import pandas as pd
import gseapy as gp
import matplotlib.pyplot as plt
The advantage of using Enrichr is that users can query across multiple databases. To understand the gene set libraries that are available for data query, you can type in the following command:
names = gseapy.get_library_name()
This produces an output file showing the 190 different databases (and possibly more in the future) that users can potentially query against:
As a proof of concept, we will use the transcriptomics dataset published by Zak et al., PNAS, 2012, examining how seropositive and seronegative subjects respond to the Ad5 vaccine across various time points. The summary of the study can be found here, and the processed dataset, which was analysed by Partek Genomics Suite can be found in GitHub. The fold change, ratio, p-value and adjusted p-values (q-value) are calculated with respect to baseline (timepoint = 0). We will filter the upregulated and downregulated DEGs based on fold-change > 1.5, q-value < 0.05 and fold-change < -1.5, q-value < 0.05 respectively at day 1 post-vaccination using the commands as follows:
df = pd.read_csv('https://raw.githubusercontent.com/kuanrongchan/vaccine-studies/main/Ad5_seroneg.csv',index_col=0)DEGs_up_1d = (df[(df['fc_1d'] > 1.5) & (df['qval_1d'] < 0.05)]).index.tolist()
DEGs_down_1d = (df[(df['fc_1d'] < -1.5) & (df['qval_1d'] < 0.05)]).index.tolist()
My favourite list of databases are Gene Ontology (GO) and Reactome. GO Biological Processes (GOBP), GO Molecular Functions (GOMF) and GO Cellular Components (GOCC) usually provide users a rough idea of what the DEGs do, and their localisation within the cell. To query the upregulated DEGs against all of these 4 databases, the following commands can be executed:
enr_GOBP_up = gp.enrichr(gene_list=DEGs_up_1d ,
)enr_GOMF_up = gp.enrichr(gene_list=DEGs_up_1d ,
)enr_GOCC_up = gp.enrichr(gene_list=DEGs_up_1d ,
)enr_Reactome_up = gp.enrichr(gene_list=DEGs_up_1d ,
These codes may look difficult to understand, so let’s break them down individually to understand the codes. Firstly, gp.enrichr calls Enrichr to query against the gene list that is assigned to DEGs_up_1d in this case. We then specify the gene sets to query against GOBP, GOMF, GOCC or Reactome, which is why the syntaxes are broken into 4 distinct segments. Since the data is from Humans, we assign the organism as Human and add a description. Next we assign the output path in ‘outdir.’ In this example, we specify the creation of a folder named “test” and store the output with the specified file name. Finally, we set a cutoff of p=0.5 for analysis.
To have a sensing of the output table within Jupyter Notebook, we can type this command as an example (showing only the top 5 significant pathways):
Output is as follows:
Note that the “genes” column may look truncated within Jupyter Notebook, so you might have to go to your specified path from ‘outdir’ to find out the full genes list. But nonetheless, the table provides important statistics that allows users to have a quick sensing of the pathways that are enriched.
To show the results graphically, we can key in the following code:
from gseapy.plot import barplot, dotplotbarplot(enr_GOBP_up.res2d,title='GO Biological Processes seroneg day 1 (up)',color = 'r')
We import barplot from GSEApy to allow us to see the output graphically. The barplot command then plots the top 10 enriched molecular pathways with adjusted p-value < 0.05. Since the genes are upregulated, we assigned the colour as red (‘r’). Output is as follows:
At one glance, we can quickly ascertain that the top pathways induced by the Merck Ad5/HIV are related to innate immune responses and type I IFN response. To determine the enriched pathways in from the other databases, we can use the following code:
barplot(enr_GOMF_up.res2d,title='GO Molecular Function seroneg day 1 (up)',color = 'r')barplot(enr_GOCC_up.res2d,title='GO Cellular Component seroneg day 1 (up)',color = 'r')barplot(enr_Reactome_up.res2d,title='Reactome seroneg day 1 (up)',color = 'r')
Output files are as follows:
When we query more databases, we are beginning to see a consensus, where the chemokines, interferon and inflammation pathways are most positively enriched.
Now we will query the pathways that are negatively enriched. The concept is similar with what we did for the upregulated DEGs. We can use the following codes:
enr_GOBP_down = gp.enrichr(gene_list=DEGs_down_1d ,
)enr_GOMF_down = gp.enrichr(gene_list=DEGs_down_1d ,
)enr_GOCC_down = gp.enrichr(gene_list=DEGs_down_1d ,
)enr_Reactome_down = gp.enrichr(gene_list=DEGs_down_1d ,
To visualise the downregulated pathways, we can enter the following codes:
barplot(enr_GOBP_down.res2d,title='GO Biological Processes seroneg day 1 (down)',color = 'b')barplot(enr_GOMF_down.res2d,title='GO Molecular Function seroneg day 1 (down)',color = 'b')barplot(enr_GOCC_down.res2d,title='GO Cellular Component seroneg day 1 (down)',color = 'b')barplot(enr_Reactome_down.res2d,title='Reactome seroneg day 1 (down)',color = 'b')
Output is as follows:
A common theme emerges. We can quickly ascertain that protein translation and genes related to T-cell receptors are the most significantly downregulated pathways after 1 day post-vaccination.
For easy referencing, the codes are as follows:
And there you have it. Thanks for reading.
Kuan Rong Chan
If you enjoy this article and would like to be notified of future posts, please follow me in medium. Thanks!