Genomic coordinates to gene lists and vice versa — Annotating gene coordinates and gene lists

Pubudu samarakoon
Published in Into the genomics
Dec 25, 2018

Why do we need to annotate gene coordinates and gene lists?

Since the advent of genetic tests, geneticists have used various techniques to describe the location of genes (and other genetic elements). One of the earliest was the cytogenetic location, which characterises genes by the bands visible in stained chromosomes. With the availability of high-resolution genomic techniques, scientists can now describe the location of genes and other elements far more specifically. This molecular location, also known as genomic coordinates, gives the precise start and end position of a gene (or genetic element) on a chromosome.

Today the vast majority of genetic tests and high-throughput genomic techniques report their findings as genomic coordinates. One of the first steps in analysing these reports is to annotate the coordinates with gene information. Although standard genomic pipelines can perform this task, scientists with specific requirements may end up with lists of genomic coordinates at various stages of their workflows (e.g. lists of CNV regions or lists of co-localized genomic elements). Conversely, some bioinformatics software produces lists of genes that then need to be mapped back to genomic coordinates for visualisation (e.g. DESeq2 differential gene expression analysis). Annotating coordinate lists with genes, and gene lists with coordinates, is therefore an essential step in current genomic studies.
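
At its core, annotating a list of coordinates means finding every gene whose interval overlaps each query region. The short Python sketch below illustrates only that overlap test. The gene table, its coordinates, and the function names are made-up placeholders for illustration, not real annotations or any particular tool's API.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive
    name: str


# Hypothetical gene table: names and coordinates are placeholders, not real annotations.
GENES = [
    Interval("1", 1000, 5000, "GENE_A"),
    Interval("1", 4500, 9000, "GENE_B"),
    Interval("2", 1000, 3000, "GENE_C"),
]


def overlaps(a, b):
    """Return True when two intervals share at least one base on the same chromosome."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


def annotate(regions, genes=GENES):
    """Map each query region name to the names of the genes it overlaps."""
    return {r.name: [g.name for g in genes if overlaps(r, g)] for r in regions}


# Two hypothetical query regions, e.g. CNV calls.
regions = [
    Interval("1", 4800, 4900, "cnv_1"),
    Interval("2", 5000, 6000, "cnv_2"),
]
result = annotate(regions)
print(result)
```

A linear scan like this is adequate for short lists. For genome-scale inputs, a sorted index or an interval tree would probably be a better fit.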

What are the commonly used methods?

  1. UCSC Genome Browser
  2. Ensembl BioMart
  3. R Bioconductor packages
  4. Python solution (gffutils)

How to?

1. UCSC Table Browser (https://genome.ucsc.edu/)
Genes in a list of genomic coordinates: https://genome.ucsc.edu/training/

2. Ensembl BioMart (https://www.ensembl.org/biomart)

Genes in a list of genomic coordinates
Gene list to genomic coordinates

UCSC Table Browser and Ensembl BioMart are web services that users interact with to retrieve the information they need. However, their main disadvantage is that they do not scale well and cannot be automated robustly. Bioinformaticians have therefore turned to R and Python solutions in their bioinformatics pipelines.

3. R Bioconductor package — biomaRt

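library(biomaRt)
# Connect to the Ensembl BioMart and select the human gene dataset
mart <- useMart("ensembl")
mart <- useDataset("hsapiens_gene_ensembl", mart)
# Fields to return; "entrezgene" was renamed "entrezgene_id" in newer Ensembl releases
attributes <- c("ensembl_gene_id", "start_position", "end_position", "strand", "hgnc_symbol", "chromosome_name", "entrezgene_id", "ucsc", "band")
# Filter by region; values must be given in the same order as the filters
filters <- c("chromosome_name", "start", "end")
values <- list(chromosome = "1", start = "1783590", end = "2597819")
# Retrieve the genes in region 1:1783590-2597819
all.genes <- getBM(attributes = attributes, filters = filters, values = values, mart = mart)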
ref: https://www.biostars.org/p/44426/

4. Python solution: gffutils

Python programmers can use the gffutils package to annotate lists of genes and gene coordinates. gffutils builds a database from a GFF3 file, and this database can then be queried for genes and their coordinates. A description of the package, with examples, is available in the gffutils documentation.
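
As a rough illustration of that workflow, the sketch below builds an in-memory database with the gffutils package (`pip install gffutils`) and queries it in both directions. It assumes the GFF3 records carry a `Name` attribute. The toy GFF3 text, the gene names, and the helper function names are hypothetical, so treat it as a starting point rather than a reference implementation.

```python
# Toy GFF3 content: gene names and coordinates are placeholders, not real annotations.
GFF3_TEXT = """##gff-version 3
chr1\texample\tgene\t1000\t5000\t.\t+\t.\tID=gene1;Name=GENE_A
chr1\texample\tgene\t4500\t9000\t.\t-\t.\tID=gene2;Name=GENE_B
chr2\texample\tgene\t1000\t3000\t.\t+\t.\tID=gene3;Name=GENE_C
"""


def build_db(gff3_text):
    """Create an in-memory gffutils database from GFF3 text."""
    import gffutils  # third-party dependency, imported here so the helpers load without it

    return gffutils.create_db(gff3_text, dbfn=":memory:", from_string=True)


def genes_in_region(db, seqid, start, end):
    """Coordinates -> genes: sorted names of gene features overlapping the region."""
    hits = db.region(seqid=seqid, start=start, end=end, featuretype="gene")
    return sorted(f.attributes["Name"][0] for f in hits)


def gene_coordinates(db, names):
    """Gene list -> coordinates: (seqid, start, end, strand) for each requested name found."""
    wanted = set(names)
    return {
        f.attributes["Name"][0]: (f.seqid, f.start, f.end, f.strand)
        for f in db.features_of_type("gene")
        if f.attributes["Name"][0] in wanted
    }


def demo():
    """Run both lookups against the toy data (requires gffutils to be installed)."""
    db = build_db(GFF3_TEXT)
    print(genes_in_region(db, "chr1", 4800, 4900))
    print(gene_coordinates(db, ["GENE_C"]))
```

Calling `demo()` runs both lookups. For a real annotation file, the path to the GFF3 file would be passed to `gffutils.create_db` instead of a string, with `from_string` left at its default.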

gffutils is developed mainly for a generic task: reading GFF files that contain a wide range of genomic data. When users only need to access genes and gene coordinates (rather than work with large GFF files), gffutils may not be the optimal solution. I have addressed this issue in my post “Annotating gene coordinates and gene lists — The python way”.
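
When only gene names and coordinates are needed, one lightweight alternative is to read just the gene rows of a GFF3 file into a dictionary. The sketch below illustrates that general idea using only the Python standard library. It is not the implementation from the linked post, and the toy GFF3 text, gene names, and function names are hypothetical.

```python
import io

# Toy GFF3 content: gene names and coordinates are placeholders, not real annotations.
GFF3_TEXT = """##gff-version 3
chr1\texample\tgene\t1000\t5000\t.\t+\t.\tID=gene1;Name=GENE_A
chr1\texample\tmRNA\t1000\t5000\t.\t+\t.\tID=tx1;Parent=gene1
chr1\texample\tgene\t4500\t9000\t.\t-\t.\tID=gene2;Name=GENE_B
"""


def parse_attributes(field):
    """Turn 'ID=gene1;Name=GENE_A' into {'ID': 'gene1', 'Name': 'GENE_A'}."""
    return dict(item.split("=", 1) for item in field.split(";") if "=" in item)


def load_gene_index(handle):
    """Index gene rows by their Name attribute: {name: (seqid, start, end, strand)}."""
    index = {}
    for line in handle:
        row = line.rstrip("\n").split("\t")
        # Skip directives, comments, malformed rows, and non-gene features.
        if line.startswith("#") or len(row) != 9 or row[2] != "gene":
            continue
        name = parse_attributes(row[8]).get("Name")
        if name:
            index[name] = (row[0], int(row[3]), int(row[4]), row[6])
    return index


def coordinates_for(genes, index):
    """Gene list -> coordinates; genes missing from the index map to None."""
    return {gene: index.get(gene) for gene in genes}


index = load_gene_index(io.StringIO(GFF3_TEXT))
lookup = coordinates_for(["GENE_B", "UNKNOWN"], index)
print(lookup)
```

For a file on disk, an open file handle can be passed to `load_gene_index` in place of the `io.StringIO` wrapper.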
