Gene ID mapping using R

How to overcome problems faced during mapping gene IDs?

Published in The Computational Biology Magazine, Aug 21, 2020

Interconversion of gene IDs is an essential step in genomic and proteomic data analysis. Multiple tools are available, each with its own drawbacks. While performing enrichment analysis on mass spectrometry datasets, I always struggled to prepare the input files required by each R package. It takes some data tweaking and cleanup before the R tools or packages will accept them as input. The struggle is greater with UniProt IDs, as very few applications accept them as input. Although UniProt provides the Retrieve/ID mapping tool, it does not preserve the number of rows: any protein or gene ID that cannot be mapped is simply omitted from the output file. This makes combining the datasets difficult.

There are numerous tools available for this kind of ID mapping. Here I lay out a few R packages that I have used and that worked smoothly.

1. AnnotationDbi package

Use the org.Hs.eg.db package for human and the org.Mm.eg.db package for mouse. mapIds() accepts several key types, such as UniProt accessions, gene symbols (HGNC symbols for human), Ensembl IDs and Entrez IDs, and converts between them.

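library(org.Mm.eg.db)  # also attaches AnnotationDbi, which provides mapIds()
# df is assumed to be a data frame whose row names are gene symbols
ensembl <- mapIds(org.Mm.eg.db, keys = rownames(df), column = "ENSEMBL", keytype = "SYMBOL", multiVals = "first")
entrez <- mapIds(org.Mm.eg.db, keys = rownames(df), column = "ENTREZID", keytype = "SYMBOL", multiVals = "first")
# stored under its own name so the Entrez result is not overwritten
uniprot <- mapIds(org.Mm.eg.db, keys = rownames(df), column = "UNIPROT", keytype = "SYMBOL", multiVals = "first")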

mapIds() returns a named vector of IDs, with the input keys as names.

The output can be attached to the original dataset with `cbind` for downstream analysis. The main advantage I have noticed with mapIds() is that it matches the IDs row by row and inserts NA when it cannot find a match, for example when no gene name or symbol exists for a given UniProt ID. This is a lifesaver when working with large datasets.
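
As a rough illustration of the `cbind` step, the sketch below uses base R only. The vector `entrez` is a hand-written stand-in for what mapIds() would return, and the gene names and IDs are placeholders rather than looked-up values.

```r
# Toy expression table; row names play the role of gene symbols.
df <- data.frame(
  log2fc = c(1.2, -0.8, 0.3),
  row.names = c("Trp53", "NotAGene", "Actb")
)

# Stand-in for a mapIds() result: names are the input keys, values are the
# mapped IDs, and keys without a match are NA. IDs are placeholders.
entrez <- c(Trp53 = "E0001", NotAGene = NA, Actb = "E0003")

# Guard against silent misalignment before binding column-wise.
stopifnot(identical(names(entrez), rownames(df)))

annotated <- cbind(df, entrez = unname(entrez))

# Unmapped rows stay in the table and can be listed or filtered later.
unmapped <- rownames(annotated)[is.na(annotated$entrez)]
print(annotated)
print(unmapped)
```

Checking the names before binding is cheap insurance: `cbind` pairs values by position, not by name.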

2. biomaRt package

library(biomaRt)
# data is assumed to be a character vector of MGI gene symbols
# useMart() already selects the dataset, so a separate useDataset() call is not needed
mart <- useMart(biomart = "ensembl", dataset = "mmusculus_gene_ensembl")
mapping <- getBM(attributes = c("mgi_symbol", "ensembl_gene_id", "entrezgene_id"), filters = "mgi_symbol", values = data, mart = mart, uniqueRows = TRUE, bmHeader = TRUE)

Use hgnc_symbol for human and mgi_symbol for mouse.

Generally, biomaRt requires extra work after the initial mapping. Note that biomaRt does not return the genes in the same order in which they were submitted.
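
One possible form of that extra work is sketched below in base R. The data frame `mapping` is a hand-written stand-in for a getBM() result, with placeholder IDs, and it assumes the columns are named after the attributes (as they would be with bmHeader = FALSE). Keeping only the first hit per symbol is one policy among several.

```r
# Symbols in the order they appear in the original dataset.
query <- c("Trp53", "NotAGene", "Actb", "Gapdh")

# Stand-in for a getBM() result: rows are reordered, the unmapped symbol is
# absent, and one symbol has two hits. IDs are placeholders.
mapping <- data.frame(
  mgi_symbol      = c("Gapdh", "Actb", "Trp53", "Gapdh"),
  ensembl_gene_id = c("ENS_G1", "ENS_A1", "ENS_T1", "ENS_G2"),
  stringsAsFactors = FALSE
)

# Keep the first hit per symbol (similar in spirit to multiVals = "first").
mapping_unique <- mapping[!duplicated(mapping$mgi_symbol), ]

# For each query symbol, match() gives its row in the mapping table, or NA.
idx <- match(query, mapping_unique$mgi_symbol)

# Rebuild the table in the original order, with NA for unmapped symbols.
aligned <- data.frame(
  mgi_symbol      = query,
  ensembl_gene_id = mapping_unique$ensembl_gene_id[idx],
  stringsAsFactors = FALSE
)
print(aligned)
```

After this step `aligned` has one row per submitted symbol, so it can be bound to the original dataset in the same way as a mapIds() result.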

3. bitr from clusterProfiler package

The clusterProfiler package was developed by Guangchuang Yu for statistical analysis and visualization of functional profiles for genes and gene clusters. As above, use org.Hs.eg.db for human and org.Mm.eg.db for mouse. The available key types can be listed with keytypes(org.Mm.eg.db).

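# Usage: bitr(geneID, fromType, toType, OrgDb, drop = TRUE)
library(clusterProfiler)
library(org.Mm.eg.db)
# data is assumed to be a character vector of gene symbols
ids <- bitr(data, fromType = "SYMBOL", toType = c("UNIPROT", "ENSEMBL", "ENTREZID"), OrgDb = "org.Mm.eg.db")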
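
As far as I can tell, bitr() returns a data frame in which a symbol with several matches occupies several rows, and with drop = TRUE the unmapped symbols are left out. A minimal base R sketch of bringing such a table back to one row per input symbol follows. The data frame `ids` is a hand-written stand-in for bitr() output with placeholder accessions, and joining multiple hits with ";" is just one convention.

```r
# Input symbols in their original order.
symbols <- c("Trp53", "NotAGene", "Actb")

# Stand-in for bitr() output: one symbol has two accessions and the
# unmapped symbol is missing. Accessions are placeholders.
ids <- data.frame(
  SYMBOL  = c("Trp53", "Trp53", "Actb"),
  UNIPROT = c("UP_T1", "UP_T2", "UP_A1"),
  stringsAsFactors = FALSE
)

# Collapse all accessions of a symbol into one ";"-separated string.
# The result is a character vector named by symbol.
collapsed <- vapply(
  split(ids$UNIPROT, ids$SYMBOL),
  paste, character(1), collapse = ";"
)

# Index by name in the original order; symbols without a match become NA.
result <- data.frame(
  SYMBOL  = symbols,
  UNIPROT = unname(collapsed[symbols]),
  stringsAsFactors = FALSE
)
print(result)
```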

Apart from the R functions listed above, non-programmers can use various other tools for gene ID conversion, such as DAVID and the UCSC gene ID converter.

Written by Gitanjali Roy