GSoC’23 @ NRNB : Week 1 Experience
AIM : Creating support for Pathway Commons in clusterProfiler
Mentors : Guangchuang Yu, Augustin Luna
Week — 1 : May 29 - Jun 4
Introduction :
Welcome to this blog post about my first week in the Google Summer of Code (GSoC) program. This week, I focused on designing the functions that fetch the GMT files and list the data sources available in Pathway Commons. I spent some time decoding the archive URL to find the parts that are relevant for the analysis, and on understanding the input format of the data, which is GMT (Gene Matrix Transposed). Let’s dive into what I accomplished and my plans for the upcoming week.
Progress Made :
I started implementing the functions that read the list of GMT files from the relevant URL and then extract the unique sources from it.
The get_pc_gmtfile function :
- Reads the names of the GMT files available in the archive, at present only from v12
- grep() finds the lines that mention a GMT (Gene Matrix Transposed) file, and sub() extracts the file name from each of them
- No parameters are involved
The get_pc_source function :
1. Constructed to extract the source name from each GMT file name identified.
2. The source of each file distinguishes one GMT file from another.
3. Returns the unique sources supported by Pathway Commons.
Implementation :
1. get_pc_gmtfile :
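# list the GMT files available in the Pathway Commons v12 archive
get_pc_gmtfile <- function() {
    pcurl <- 'https://www.pathwaycommons.org/archives/PC2/v12/'
    # read the HTML of the archive's directory listing
    x <- readLines(pcurl)
    # keep the lines that mention a .gmt.gz file (both dots escaped)
    y <- x[grep('\\.gmt\\.gz', x)]
    # extract the file name itself from each line
    sub(".*(PathwayCommons.*\\.gmt\\.gz).*", "\\1", y)
}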
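To see what the grep() and sub() steps do without touching the network, here is a minimal sketch that runs them on a few made-up lines. The lines in `listing` are hypothetical and only imitate a typical directory listing; the real markup of the archive page may look different.

```r
# Hypothetical lines imitating an HTML directory listing (not the real page).
listing <- c(
  '<html><body><h1>Index of /archives/PC2/v12</h1>',
  '<a href="PathwayCommons12.kegg.hgnc.gmt.gz">PathwayCommons12.kegg.hgnc.gmt.gz</a>',
  '<a href="PathwayCommons12.reactome.hgnc.txt.gz">PathwayCommons12.reactome.hgnc.txt.gz</a>',
  '<a href="PathwayCommons12.All.uniprot.gmt.gz">PathwayCommons12.All.uniprot.gmt.gz</a>'
)

# Step 1: keep only the lines that mention a .gmt.gz file
hits <- listing[grep("\\.gmt\\.gz", listing)]

# Step 2: replace each whole line with just the captured file name
gmt_files <- sub(".*(PathwayCommons.*\\.gmt\\.gz).*", "\\1", hits)
print(gmt_files)
```

The header line and the .txt.gz line are dropped in step 1, so only the two GMT file names remain after step 2.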
2. get_pc_source :
# list supported data sources of Pathway Commons
get_pc_source <- function() {
    gmtfile <- get_pc_gmtfile()
    # the first captured field of the file name is the data source
    pattern <- "PathwayCommons\\d+\\.([_A-Za-z]+)\\.([_A-Za-z]+)\\.gmt\\.gz"
    unique(sub(pattern, "\\1", gmtfile))
}
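The pattern can also be tried offline on a handful of file names. This is only a sketch: the names in `gmt_files` are hypothetical examples that follow the naming scheme, and reading the second captured field as the identifier type is my assumption, not something the function relies on.

```r
# Hypothetical file names shaped like PathwayCommons<version>.<source>.<field 2>.gmt.gz
gmt_files <- c(
  "PathwayCommons12.kegg.hgnc.gmt.gz",
  "PathwayCommons12.kegg.uniprot.gmt.gz",
  "PathwayCommons12.reactome.hgnc.gmt.gz",
  "PathwayCommons12.All.hgnc.gmt.gz"
)

pattern <- "PathwayCommons\\d+\\.([_A-Za-z]+)\\.([_A-Za-z]+)\\.gmt\\.gz"

# "\\1" keeps the first captured field: the data source
data_source <- unique(sub(pattern, "\\1", gmt_files))

# "\\2" keeps the second captured field (assumed here to be the identifier type)
id_type <- unique(sub(pattern, "\\2", gmt_files))

print(data_source)
print(id_type)
```

Because unique() is applied at the end, a source that ships several files (one per second field) is listed only once.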
Next Week Plan :
In the coming week, I intend to write the following functions for extracting and preparing the data:
- get_pc_data : Downloads and reads the corresponding GMT file from Pathway Commons.
- prepare_PC_data : Retrieves the Pathway Commons data for the specified organism, selects the relevant columns, and returns them as a list of data frames.
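As a rough idea of what "a list of data frames" could look like, here is a sketch that parses GMT records, where each line holds a gene set name, a description, and then the genes, all separated by tabs. Everything here is an assumption for illustration: `parse_gmt_lines` is a hypothetical helper, the two records are made up, and the final functions may be organised quite differently.

```r
# Two hypothetical GMT records: name <TAB> description <TAB> gene1 <TAB> gene2 ...
gmt_lines <- c(
  "pathway_A\tExample pathway A\tTP53\tMDM2\tCDKN1A",
  "pathway_B\tExample pathway B\tEGFR\tGRB2"
)

# Hypothetical helper (not part of clusterProfiler): turn GMT lines into
# a term-to-gene table and a term-to-name table.
parse_gmt_lines <- function(lines) {
  fields <- strsplit(lines, "\t", fixed = TRUE)
  # one row per (gene set, gene) pair
  term2gene <- do.call(rbind, lapply(fields, function(f) {
    genes <- f[-(1:2)]  # everything after name and description
    data.frame(term = rep(f[1], length(genes)), gene = genes,
               stringsAsFactors = FALSE)
  }))
  # one row per gene set
  term2name <- data.frame(
    term = vapply(fields, function(f) f[1], character(1)),
    name = vapply(fields, function(f) f[2], character(1)),
    stringsAsFactors = FALSE
  )
  list(TERM2GENE = term2gene, TERM2NAME = term2name)
}

pc <- parse_gmt_lines(gmt_lines)
print(pc$TERM2GENE)
print(pc$TERM2NAME)
```

I named the list elements TERM2GENE and TERM2NAME because those are the argument names that enricher() in clusterProfiler expects, which should make the result easy to plug in later.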
The Repository :
https://github.com/YuLab-SMU/clusterProfiler/blob/devel/R/pathwayCommons.R
Conclusion :
As time passes and I delve deeper into the implementation of the project, my interest and enthusiasm continue to grow. This GSoC project has proven to be a significant challenge, pushing me to expand my skills and learn new concepts. One particular aspect that has captivated my attention is the implementation of the roles module.
The underlying biology has pushed me to explore bioinformatics, conduct thorough research, and seek guidance from my mentors. Through this process, I am not only honing my technical skills but also developing my problem-solving and critical-thinking abilities.