GSoC Journey Part 3- Data Analysis

Aug 26, 2019

Feature selection and model training

Please check out Parts 1 and 2 of the GSoC series if you haven’t already done so. All of the code can be found here.

Now that we have collected data from both Ensembl and RefSeq, it is time to generate the features that will be the input to our machine learning models.

A) Generating features

We created 12 or more features in total; here I’ll talk about the most important ones:

1. Overlap Function: An overlapping gene is a gene whose expressible nucleotide sequence partially overlaps with the expressible nucleotide sequence of another gene. In this way, a nucleotide sequence may make a contribution to the function of one or more gene products.
→ To explain this: say there are two genes, GeneA and GeneB, with start and end coordinates (startA, endA) and (startB, endB).
→ If the start (or end) coordinate of one gene lies between the start and end coordinates of the other gene, then the two genes are said to be overlapping genes!
→ Overlap function main approach/logic:
a) Clean the data: sort by chromosome region in ascending order and remove the NW and NT sequence regions for RefSeq and the M, K and G sequence regions for Ensembl.
b) Look for overlaps only between genes lying in the same chromosome region.
c) For gene1 lying in chromosome1, find all overlaps with gene2…geneN lying in chromosome1.
d) Repeat step c) for all genes in all chromosome regions.
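
The steps above can be sketched roughly as follows. This is a minimal illustration and not the project's actual code: the record layout `(gene_id, seq_region, start, end)`, the helper names, and the toy coordinates are all assumptions, and region names are assumed to have already been mapped to a common label in both datasets.

```python
from collections import defaultdict

# Assumed record layout: (gene_id, seq_region, start, end).
# Prefixes follow the cleaning step described above; names are illustrative.
REFSEQ_SKIP = ("NW", "NT")
ENSEMBL_SKIP = ("M", "K", "G")


def clean(genes, skip_prefixes):
    """Drop unwanted sequence regions, then sort by region and start."""
    kept = [g for g in genes if not g[1].startswith(skip_prefixes)]
    return sorted(kept, key=lambda g: (g[1], g[2]))


def overlaps(start_a, end_a, start_b, end_b):
    """True if the two closed intervals share at least one position."""
    return start_a <= end_b and start_b <= end_a


def find_overlaps(ensembl_genes, refseq_genes):
    """Return (ensembl_id, refseq_id) pairs overlapping on the same region."""
    by_region = defaultdict(list)
    for gene in refseq_genes:
        by_region[gene[1]].append(gene)

    pairs = []
    for ens_id, region, start, end in ensembl_genes:
        # Only compare against genes on the same region (step b).
        for ref_id, _, ref_start, ref_end in by_region.get(region, []):
            if overlaps(start, end, ref_start, ref_end):
                pairs.append((ens_id, ref_id))
    return pairs


# Toy data, made up for illustration only.
ensembl = clean([("ENSG_A", "1", 100, 500), ("ENSG_B", "1", 900, 1200),
                 ("ENSG_C", "MT", 10, 50)], ENSEMBL_SKIP)
refseq = clean([("REF_X", "1", 450, 800), ("REF_Y", "2", 100, 500),
                ("REF_Z", "NW_0001", 1, 99)], REFSEQ_SKIP)
print(find_overlaps(ensembl, refseq))
```

The nested loop is quadratic per region, which is fine for a sketch; since the records are already sorted by start, a sweep over both lists would scale better on whole-genome data.
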
2. ORF (Open Reading Frame): In molecular genetics, an open reading frame (ORF) is the part of a reading frame that has the ability to be translated. An ORF is a continuous stretch of codons that begins with a start codon (usually AUG) and ends at a stop codon (usually UAA, UAG or UGA).

Here, we found the ORFs of the genes that overlap between RefSeq and Ensembl and saved those with the maximum ORF length.
→ We also counted the number of amino acids in the longest ORF.

Approach/logic for finding ORFs:

a) Used an ORF function from the Biopython library.

b) Calculated the number of amino acids present in an ORF.

c) Calculated the GC% of the whole sequence (Ensembl and RefSeq) wherever there was an overlap between an Ensembl gene and a RefSeq gene.
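
To make these three quantities concrete, here is a small pure-Python sketch. It is not the Biopython routine used in the project: the function names are hypothetical, it works on the DNA alphabet (ATG, TAA, TAG, TGA rather than the RNA codons quoted above), it scans all six reading frames, and it counts the stop codon in the ORF length. All of these are assumptions on my part.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def longest_orf(seq):
    """Longest ATG...stop stretch over six frames (stop codon included)."""
    seq = seq.upper()
    best = ""
    for strand in (seq, reverse_complement(seq)):
        for frame in range(3):
            start = None  # first ATG seen since the last stop codon
            for i in range(frame, len(strand) - 2, 3):
                codon = strand[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    orf = strand[start:i + 3]
                    if len(orf) > len(best):
                        best = orf
                    start = None
    return best


def amino_acid_count(orf):
    """Translated residues: codons in the ORF minus the stop codon."""
    return len(orf) // 3 - 1 if orf else 0


def gc_percent(seq):
    """Percentage of G and C bases in the whole sequence."""
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


# Toy sequence, made up for illustration only.
dna = "CCATGAAATTTGGGTAACCATGCCCTAG"
orf = longest_orf(dna)
print(orf, len(orf), amino_acid_count(orf), round(gc_percent(dna), 1))
```

Here the longest ORF is the 15-base stretch starting at the first ATG, which translates to 4 amino acids before the stop codon.
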

3. Sequence Alignment: In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences.
→ To understand this clearly, place 2 gene sequences side by side (parallel to each other) and check whether the nucleotides at each position match. That is, A should match with A, G with G and so on.
→ Gene pairs that align well get a high sequence alignment score.
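
As an illustration of how such a score can be computed, below is a minimal global alignment (Needleman-Wunsch) scorer. The post does not say which aligner or scoring scheme the project used, so the scheme here (match +1, mismatch -1, gap -1) and the function name are assumptions chosen only to show the idea.

```python
def global_alignment_score(seq_a, seq_b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch score, keeping only two rows of the DP table."""
    # Row 0: aligning an empty prefix of seq_a against prefixes of seq_b.
    prev = [j * gap for j in range(len(seq_b) + 1)]
    for i, base_a in enumerate(seq_a, start=1):
        curr = [i * gap]
        for j, base_b in enumerate(seq_b, start=1):
            diagonal = prev[j - 1] + (match if base_a == base_b else mismatch)
            curr.append(max(diagonal, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


print(global_alignment_score("GATTACA", "GATTACA"))  # identical sequences: 7
print(global_alignment_score("GATTACA", "GATCACA"))  # one mismatch: 5
```

Identical sequences get the maximum score (one point per base), and every mismatch or gap lowers it, which is the sense in which well-aligned gene pairs score higher.
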

4. Max exonic count and length: Between 2 genes, find the maximum exonic overlap count across their transcripts, i.e., whichever transcript has the largest number of overlapping exons.

a) Find the seq_align score of the transcripts with the max exonic overlap count.

b) Also save the max exon overlap length.
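
One possible reading of this feature, sketched below: for every pair of transcripts (one from each gene), count the exon pairs that overlap and sum the overlapping bases, then keep the pair with the highest count. The data layout (a dict of transcript id to a list of `(start, end)` exons), the tie-breaking rule, and the function names are assumptions for illustration, not the project's code.

```python
from itertools import product


def exon_overlap(exons_a, exons_b):
    """Return (number of overlapping exon pairs, total overlapping bases)."""
    count, length = 0, 0
    for (start_a, end_a), (start_b, end_b) in product(exons_a, exons_b):
        # Closed intervals, so touching coordinates share one base.
        shared = min(end_a, end_b) - max(start_a, start_b) + 1
        if shared > 0:
            count += 1
            length += shared
    return count, length


def best_transcript_pair(gene_a, gene_b):
    """Transcript pair with the most overlapping exons (ties: longer overlap)."""
    best = (0, 0, None, None)
    for (id_a, exons_a), (id_b, exons_b) in product(gene_a.items(), gene_b.items()):
        count, length = exon_overlap(exons_a, exons_b)
        if (count, length) > best[:2]:
            best = (count, length, id_a, id_b)
    return best


# Toy transcripts, made up for illustration only.
gene_a = {"tx_a1": [(100, 200), (300, 400)], "tx_a2": [(100, 150)]}
gene_b = {"tx_b1": [(180, 320)], "tx_b2": [(50, 120), (390, 500)]}
print(best_transcript_pair(gene_a, gene_b))
```

The returned transcript ids are the ones whose sequences would then be passed to the alignment step in a).
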

For the rest of the feature selection, please check out this.

B) Machine Learning

Running various classifiers

Now that we collected all of our features, it is time to apply some ML!

The image above shows the histograms for all of our features. We can clearly observe that some of the features are approximately normally distributed, while others are skewed.

→ Our final training dataset had shape [635 × total features], i.e., 635 datapoints. Since the output is binary (0 or 1), we can apply scikit-learn classifiers here.
NOTE: Since the dataset is small, deep learning is unlikely to work well here.
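
For readers new to scikit-learn, a training run of this kind looks roughly like the sketch below. It uses synthetic data, not our dataset: only the 635 rows come from the post, while the 12 columns, the label rule, the 80/20 split and the hyperparameters are assumptions for illustration. Only Random Forest is shown; XGBoost lives in a separate package.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in: 635 rows as in the post; 12 features is an assumption.
X = rng.normal(size=(635, 12))
# Made-up binary labels that depend on the first two features plus noise.
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=635) > -0.3).astype(int)

# stratify=y keeps the class ratio the same in the train and test splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"test accuracy on synthetic data: {accuracy:.2f}")
```

The printed number says nothing about the real features; it only shows the shape of the workflow.
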

The image below shows the overall class distribution:

There were more positive cases than negative ones (class imbalance was observed).
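
With imbalanced classes, it helps to compare any accuracy figure against what a trivial model would get. The sketch below uses toy labels (not our data) and hypothetical helper names to show the majority-class baseline and balanced accuracy, i.e., the mean of the per-class recalls.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most common class."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so each class counts equally."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, label in enumerate(y_true) if label == cls]
        recalls.append(sum(y_pred[i] == cls for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


# Toy labels, NOT the project's data: 7 positives and 3 negatives.
y_true = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
y_pred = [1] * 10  # a "model" that always predicts the majority class

print(majority_baseline_accuracy(y_true))  # 0.7
print(balanced_accuracy(y_true, y_pred))   # 0.5
```

A classifier is only doing useful work to the extent that it beats the first number, and the second number shows how a majority-only predictor looks once both classes are weighted equally.
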

Results:

XGBoost and Random Forest (RF) gave the highest accuracy, around 88%!

We were getting an accuracy of ~87% on our test dataset.

With this, we concluded our GSoC work.

These 3 months have absolutely been a life-changing experience for me. I feel extremely grateful to the whole ‘Genes, Genomes and Variation’ community, especially to my mentor, Daniel Zerbino.

Connect with me on:

1. LinkedIn : vermasrijan

2. GitHub : vermasrijan

3. Medium : @verma.srijan

4. Personal Website : srijanverma

Written by Srijan Verma