Annotating gene coordinates and gene lists — The python way

Published in

Into the genomics

6 min readDec 25, 2018

As described in my post “Genomic coordinates to gene lists and vice versa”, python programmers need to rely on a generic package — gffutil when annotating lists of genes and gene coordinates. Gffutil package may not be the optimum solution for bioinformaticians with specific requirements. For example, the gffutil package may not be the best option when your pipeline only needs to map a set of genomic intervals to GENCODE protein-coding genes which were prioritised based on their GENCODE level information. Therefore, when we have specific requirements, we’ll have to write custom programs that solve those needs. Here, I’ll describe how to

Create a source dataset from GENCODE Gene Set release (section A)
Find genes in a list of input genomic coordinates using the source dataset (section B)
Find genomic coordinates of a gene list using the source dataset (section C)

Creating the GENCODE source dataset and using it to annotate lists of genes and coordinates

Section A: Creating a source dataset from GENCODE Gene Set release

STEP 01: Read the gff3 file into a pandas dataframe

gencode = pd.read_table(<path to .gff3>, comment="#", sep = "\t", names = ['seqname', 'source', 'feature', 'start' , 'end', 'score', 'strand', 'frame', 'attribute'])gencode.head() 
gencode.info()

There are nine columns in a gff3 file. They are seqname, source, feature, start, end, score, strand, frame, attribute. Comments in gff3 files start with “#”.

STEP 02: Extract Genes in the gff3 file “feature = gene”

gencode_genes = gencode[(gencode.feature == "gene")][['seqname', 'start', 'end', 'attribute']].copy().reset_index().drop('index', axis=1)

This code shows how to extract GENCODE genes. However, if you need to extract other GENCODE feature types like transcripts, exons, CDSs and UTRs, this can be changed accordingly. For example, if you need to access transcripts, you can replace “gene” with “ transcript” (gencode.feature == “transcript”).

STEP 03: Extract gene_name, gene_type, gene_status, level of each gene

def gene_info(x):
# Extract gene names, gene_type, gene_status and level
    g_name = list(filter(lambda x: 'gene_name' in x,  x.split(";")))[0].split("=")[1]
    g_type = list(filter(lambda x: 'gene_type' in x,  x.split(";")))[0].split("=")[1]
    g_status = list(filter(lambda x: 'gene_status' in x,  x.split(";")))[0].split("=")[1]
    g_leve = int(list(filter(lambda x: 'level' in x,  x.split(";")))[0].split("=")[1])
    return (g_name, g_type, g_status, g_leve)gencode_genes["gene_name"], gencode_genes["gene_type"], gencode_genes["gene_status"], gencode_genes["gene_level"] = zip(*gencode_genes.attribute.apply(lambda x: gene_info(x)))

This function reads GENCODE “attribute” column and access all key-value pairs. Then the filter command extracts relevant values using the gene_name, gene_type, gene_status and level keywords. Depending on the previous step (step 02), you can change the code. For example, you might need to access transcript_type, transcript_status and tag if you extracted all the transcripts in step 02. Similarly, you can change step 04 and 05 accordingly.

STEP 04: Extract all known protein_coding genes

gencode_genes = gencode_genes[gencode_genes['gene_status'] == 'KNOWN'].reset_index().drop('index', axis=1)
gencode_genes = gencode_genes[gencode_genes['gene_type'] == 'protein_coding'].reset_index().drop('index', axis=1)

STEP 05: Remove duplicates — Prioritize verified and manually annotated loci over automatically annotated loci

gencode_genes = gencode_genes.sort_values(['gene_level', 'seqname'], ascending=True).drop_duplicates('gene_name', keep='first').reset_index().drop('index', axis=1)

Level attribute in the GENCODE gff3 file provides an assessment of the reliability of a given feature. Here, level 01 indicates that the element is manually annotated and experimentally verified (Eg. genes and transcripts by RT-PCR-seq and pseudogenes by three different methods). Level 02 means that the feature is annotated manually by HAVANA and merged with features predicted by ENSEMBL automated pipeline. Lastly, level 03 indicates that the element is predicted only by the ENSEMBL automated pipeline.

As described above, step 04 identified all the known protein-coding genes. GENCODE contains multiple records for the same gene when they belong to more than one level. Above code remove all the duplicates while prioritising level-01 over -02 and level-02 over -03. As shown in the GitHub .gist, step 05 removed 46 such duplicated genes.

STEP 06: Generating the indexed GENCODE gene source dataset

%%bash -s gencode.v19.annotation.gff3_all_known_genes.txt
cut -f 1,2,3,5 $1 | sortBed -i > gencode.v19.annotation.gff3_all_known_genes.txt.sorted.formatted.bed
bgzip gencode.v19.annotation.gff3_all_known_genes.txt.sorted.formatted.bed
tabix -p bed gencode.v19.annotation.gff3_all_known_genes.txt.sorted.formatted.bed.gz

Samtools “tabix” is a generic indexer for TAB-delimited genome position files. gencode_genes dataframe created in step 05 can be saved as a tab-delimited text file (gencode.v19.annotation.gff3_all_known_genes.txt) and then index it using tabix as shown in the above bash command.

Section B: Using the indexed GENCODE gene source dataset to annotate genomic coordinates

Step 01 and 02: Defining relevant functions

def overlap(q_st, q_end, res_st, res_end):
    o  = min(q_end, res_end)-max(q_st, res_st)
    return odef gencode_all_known_genes(a, tb):
    genes = []    try:
        for region in tb.fetch(a['chr'], int(a['start']), int(a['end'])):
            if region:
                r = region.split('\t')
                overlap_len = overlap(int(a['start']), int(a['end']), int(r[1]), int(r[2]))
                ret_val = '{}({})'.format(r[3], np.round(overlap_len/float(int(a['end'])-int(a['start']))*100, 2)) ### Percentage of the input interval that overlap with the gene
                genes.append(ret_val)        if len(genes)>0:
            return ";".join(genes)
        else:
            return "NA(0)"
    except ValueError:
        return "NA(0)"

Above code defines two functions to find genes that overlap with genomic intervals (gencode_all_known_genes) and then to calculate overlapping base-pairs (overlap). Here, the gencode_all_known_genes function takes genomic coordinates (a row in a dataframe indicating the chromosome, start and end positions) and a tabix indexed GENCODE gene file. Then “tb.fetch” command reads the tabix file and finds all the genes that overlap with the given genomic interval. Next, overlap function calculates the overlap between the input and tb.fetch results. Finally, the gencode_all_known_genes function returns the gene and the overlap percentage.

Here, the overlap percentage is the fraction of the input interval that overlaps with the GENCODE gene. You can change the code accordingly if you want to find the fraction of the GENCODE gene that overlaps with the input interval (e.g. overlap_len/float(int(r[2])-int(r[1]))*100).

Following code calls the gencode_all_known_genes function and accepts return values into a new genes column in the dataframe df.

import pysam
gencode_v19 = pysam.TabixFile('gencode.v19.annotation.gff3_all_known_genes.txt.sorted.formatted.bed.gz')df['genes'] = df.apply(lambda x: gencode_all_known_genes(x[['chr', 'start', 'end']], gencode_v19), axis=1)

When a genomic interval overlaps with more than one gene, the gencode_all_known_genes function returns a single string with all the overlapping genes that are separated by semicolons (e.g. SLC2A7(23.04);SLC2A5(53.37)). If you need to apply various filters to these genes, this may not be the optimum format for your pipeline. In such instances, you can transform the dataframe using the following code.

new_rows = []
for i,r in df.iterrows():
    g_list = r['genes'].split(";")
    for g in g_list:
        g = g.replace(" ","")
        new_rows.append(np.append(r[['chr', 'start', 'end', 'name', 'score', 'strand', 'genes']].values, g))
        
df_perGene = pd.DataFrame()
df_perGene = df_perGene.append(pd.DataFrame(new_rows, columns=['chr', 'start', 'end', 'name', 'score', 'strand', 'genes', 'gene_ID'])).reset_index().drop('index', axis=1)df_perGene['gene_name'] = df_perGene['gene_ID'].apply(lambda x: x.split("(")[0])
df_perGene['gene_coverage'] = df_perGene['gene_ID'].apply(lambda x: x.split("(")[1].replace(")", ""))df_perGene = df_perGene.drop(["genes", "gene_ID"], axis=1)
df_perGene.head()

Here, the resulting dataframe — df_perGene will contain a single row for each gene. For genomic intervals with multiple overlapping genes, df_perGene dataframe will include duplicated rows that differ only from the contents of the columns — gene_name and gene_coverage.

Section C: Find coordinates of a gene list

As shown in the previous section, GENCODE source dataset can be used to find genes in a set of genomic coordinates. In this section, we’ll use the same source dataset to find genomic coordinates when we have a list of genes.

gencode_genes = gencode_genes.set_index('gene_name')def fetch_gene_coords(g):if gencode_genes.index.contains(g): 
        return gencode_genes.loc[g]['seqname'], gencode_genes.loc[g]['start'], gencode_genes.loc[g]['end'] 
    else:
        return "NA", "NA", "NA"df_perGene['g_chr'], df_perGene['g_start'], df_perGene['g_end'] = zip(*df_perGene['gene_name'].apply(lambda x: fetch_gene_coords(x)))

Here, we’ll use the gencode_genes dataframe from section A to find genomic coordinates of an input gene list. First, we’ll set the gene_name column as the index of the gencode_genes dataframe. Next, we can write a function — fetch_gene_coords, that takes an input gene set. This function can search for those genes in gencode_genes dataframe and returns the coordinates when there is a match.

PyScripts

I wrote two python scripts that can run above-discussed tasks. These scripts will be useful to those who need to perform these tasks but not interested in writing python programs. You can run them as python command line programs (usage information is available in comments sections).

Annotating gene coordinates and gene lists — The python way

Section A: Creating a source dataset from GENCODE Gene Set release

Section B: Using the indexed GENCODE gene source dataset to annotate genomic coordinates

Section C: Find coordinates of a gene list

PyScripts

References

Written by Pubudu samarakoon