GSoC Journey Part 2: The Problem Statement

Jul 4, 2019

Applying machine learning techniques to characterizing and naming lncRNA genes

Advances in RNA sequencing technologies have revealed the complexity of our genome. Long non-coding RNAs (lncRNAs) make up the majority of the non-coding transcriptome. Understanding the significance of this RNA world is one of the most important challenges in biology today, and the lncRNAs within it represent a gold mine of potential new biomarkers and drug targets. Their discovery is still at a preliminary stage.

To date, very few lncRNAs have been characterized in detail. However, it is clear that lncRNAs are important regulators of gene expression, and they are thought to have a wide range of functions in cellular and developmental processes. Many databases catalogue lncRNAs (for example RefSeq, GENCODE, Ensembl, SGD and TAIR).
Our aim is to use machine learning techniques to compare two sets of lncRNA calls (from Ensembl and RefSeq), highlight where they differ, and determine which calls are incorrect.

Work done so far: Data Acquisition

Before running any ML model, the first step is to collect data. I started by collecting human genome data from two databases: Ensembl and RefSeq (NCBI).

All my current work can be found at https://github.com/EnsemblGSOC/srijan-gsoc-2019

A) Data Collection from Ensembl

Accessing Ensembl Data

Ensembl data is available through a number of routes.

1. Small quantities of data:
Many of the pages displaying Ensembl genomic data offer an export option, suitable for small amounts of data, e.g. a single gene sequence.
2. Large datasets/complex analyses:
If you require larger amounts of data (e.g. all the genes on a chromosome) or are conducting a more detailed analysis, it is recommended to use the publicly accessible MySQL server.
A Perl API is available for these databases as well, so they can be scripted against without needing to know the database schema.
If you are not familiar with Perl, the REST server (which can be accessed from languages such as Python, C and C++) works best. Its REST endpoints provide access to vast amounts of Ensembl data.
3. Complex cross-database queries:
More complex datasets can be retrieved using the BioMart data-mining tool.
4. Whole databases:
If required, entire databases can be downloaded from the FTP site in a variety of formats, from flat files to MySQL dumps.

Which one did I choose?

I used the REST API (2) to extract the numeric data and the FTP site (4) to extract the sequence data (from a FASTA file).

Now, what do I mean by numeric data?

The Gene ID column holds the Ensembl gene ID, which has the prefix ‘ENSG’.
The Biotype column indicates the overall biotype of the gene annotation at that location.
The Start and End columns are the coordinates of the feature, in 1-based coordinates. They can also be read as the boundaries (range) of the sequence.
In the Strand column, ‘1’ indicates the positive strand and ‘-1’ indicates the negative strand.
Seq region Name is the sequence ID of the genomic sequence.
No. of Transcripts is the total number of transcripts present for that gene.

And what do I mean by sequence data?

[Figure: an example of a nucleotide sequence.]

My approach for collecting the numeric data —

Extract all Ensembl gene IDs from the GFF3 file (given here).
Use the REST API POST service to pass all the IDs and fetch the data in JSON format. (Python scripts can be found here).
Extract the required data from the JSON file.
And voila! You now have all the numeric data!
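
The steps above can be sketched roughly as follows. This is a minimal sketch, not the scripts from the repository. It assumes the public Ensembl REST `POST lookup/id` endpoint at `https://rest.ensembl.org`, that `expand=1` can be passed as a query parameter to include a `Transcript` list, and that each record carries keys such as `biotype` and `seq_region_name`. The helper names (`chunked`, `fetch_batch`, `to_row`, `collect`) are made up for illustration.

```python
import json
import urllib.request

# Assumed endpoint: Ensembl REST "POST lookup/id"; expand=1 is assumed to
# add a "Transcript" list to each gene record.
LOOKUP_URL = "https://rest.ensembl.org/lookup/id?expand=1"


def chunked(items, size=100):
    """Yield consecutive batches of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_batch(gene_ids):
    """POST one batch of IDs and return the decoded JSON (a dict keyed by ID)."""
    request = urllib.request.Request(
        LOOKUP_URL,
        data=json.dumps({"ids": list(gene_ids)}).encode("utf-8"),
        headers={"Content-Type": "application/json", "Accept": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def to_row(record):
    """Keep only the fields described in this post."""
    return {
        "gene_id": record["id"],
        "biotype": record["biotype"],
        "start": record["start"],
        "end": record["end"],
        "strand": record["strand"],
        "seq_region_name": record["seq_region_name"],
        "n_transcripts": len(record.get("Transcript", [])),
    }


def collect(gene_ids, fetch=fetch_batch, batch_size=100):
    """Fetch all IDs in batches and return one row per gene that was found."""
    rows = []
    for batch in chunked(gene_ids, batch_size):
        for record in fetch(batch).values():
            if record:  # assumption: unknown IDs come back as null
                rows.append(to_row(record))
    return rows
```

Passing `fetch` as a parameter keeps the batching logic testable without touching the network; a real run would simply call `collect(all_gene_ids)`.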

My approach for collecting the sequence data —

Download the ncrna FASTA file from here.
Extract all lincRNA (long intergenic non-coding RNA) IDs from the JSON file.
If the lincRNA IDs are present in the FASTA file, extract their sequences.
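
A minimal sketch of this matching step is shown below. It assumes that each header line in the Ensembl ncrna FASTA file starts with the transcript ID and contains a `gene:ENSG...` token (possibly with a version suffix such as `.2`); check your file before relying on that. The function names and the IDs in the demo are hypothetical.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def gene_id_from_header(header):
    """Return the unversioned gene ID from a 'gene:...' token, or None."""
    for token in header.split():
        if token.startswith("gene:"):
            return token[len("gene:"):].split(".")[0]
    return None


def sequences_for_genes(handle, wanted_gene_ids):
    """Map transcript ID -> sequence for records whose gene ID is wanted."""
    wanted = set(wanted_gene_ids)
    result = {}
    for header, sequence in read_fasta(handle):
        if gene_id_from_header(header) in wanted:
            transcript_id = header.split()[0]
            result[transcript_id] = sequence
    return result


# Tiny in-memory demo with made-up IDs; a real run would open the FASTA file.
demo = io.StringIO(
    ">ENST0001.1 ncrna gene:ENSG0001.2 gene_biotype:lincRNA\nACGT\nTTGA\n"
    ">ENST0002.1 ncrna gene:ENSG0002.1 gene_biotype:snRNA\nGGCC\n"
)
demo_sequences = sequences_for_genes(demo, {"ENSG0001"})
```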

Why did I use the REST API, instead of the GFF file, to extract the numeric data?
- Inspecting the data in a JSON response (received from the API) is somewhat easier than reading it from a GFF file.
- The display name and biotype stored in the GFF file are of variable length for different Ensembl IDs.
The API returns the same fields (display name and biotype) as lists and dictionaries (key: value), which makes them easier to extract.
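
To make the contrast concrete, here is a small sketch of what reading the same fields from a GFF3 line involves. It relies only on the general GFF3 layout (nine tab-separated columns, with `key=value` pairs separated by `;` in the last one). The example line and its values are invented, and the function names are hypothetical.

```python
from urllib.parse import unquote


def parse_attributes(column9):
    """Turn 'key=value;key=value' into a dict, decoding %-escapes."""
    attributes = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attributes[key] = unquote(value)
    return attributes


def parse_gff3_line(line):
    """Return a dict for one feature line, or None for comments and blanks."""
    if not line.strip() or line.startswith("#"):
        return None
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 9:
        return None
    seqid, _source, feature_type, start, end, _score, strand, _phase, attrs = columns
    return {
        "seqid": seqid,
        "type": feature_type,
        "start": int(start),
        "end": int(end),
        "strand": strand,
        "attributes": parse_attributes(attrs),
    }


# Invented example line; real IDs and names will differ.
example = "1\thavana\tgene\t100\t900\t.\t+\t.\tID=gene:ENSG_EXAMPLE;Name=DEMO1;biotype=lincRNA"
feature = parse_gff3_line(example)
```

With the JSON response, the equivalent of `feature["attributes"]["biotype"]` is available directly as a dictionary key, with no splitting required.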

B) Data Collection from RefSeq

Accessing RefSeq Data

RefSeq data is available mainly through three routes.

BLAST
Entrez
FTP site

Which one did I choose?

Both the numeric and the sequence data were collected using the FTP service.
The GFF file (found here) was used for extracting the numeric data.
The FASTA file (found here) was used for extracting the sequence data.

My approach for collecting the numeric data (code found here) —

Record the line index of every RefSeq gene entry in the GFF file.
Between two consecutive gene indexes lie the transcript and exon records for that gene; all transcript records have an ID starting with ‘rna-’ and all exon records have an ID starting with ‘exon-’.
Use string manipulation in Python to extract the rest of the data.
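
The grouping idea can be sketched as below. This is an illustration rather than the code linked above: it assumes gene records have `gene` in the third GFF column and that every record carries an `ID=` attribute, and it streams through the lines instead of storing indexes. The sample lines and accessions are made up.

```python
def attribute(column9, key):
    """Return the value of `key` from a GFF3 attribute column, or None."""
    for field in column9.split(";"):
        name, _, value = field.partition("=")
        if name == key:
            return value
    return None


def group_by_gene(lines):
    """Group transcript and exon IDs under the gene line that precedes them."""
    genes = []
    current = None
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 9:
            continue  # skip malformed lines
        feature_id = attribute(columns[8], "ID") or ""
        if columns[2] == "gene":
            current = {
                "gene_id": feature_id,
                "start": int(columns[3]),
                "end": int(columns[4]),
                "strand": columns[6],
                "transcripts": [],
                "exons": [],
            }
            genes.append(current)
        elif current is not None and feature_id.startswith("rna-"):
            current["transcripts"].append(feature_id)
        elif current is not None and feature_id.startswith("exon-"):
            current["exons"].append(feature_id)
    return genes


# Made-up sample in RefSeq-like layout.
sample = [
    "##gff-version 3\n",
    "chrA\tRefSeq\tgene\t10\t500\t.\t-\t.\tID=gene-DEMO1;Name=DEMO1\n",
    "chrA\tRefSeq\tlnc_RNA\t10\t500\t.\t-\t.\tID=rna-NR_000001.1;Parent=gene-DEMO1\n",
    "chrA\tRefSeq\texon\t10\t200\t.\t-\t.\tID=exon-NR_000001.1-1;Parent=rna-NR_000001.1\n",
    "chrA\tRefSeq\texon\t300\t500\t.\t-\t.\tID=exon-NR_000001.1-2;Parent=rna-NR_000001.1\n",
    "chrA\tRefSeq\tgene\t900\t1200\t.\t+\t.\tID=gene-DEMO2;Name=DEMO2\n",
]
grouped = group_by_gene(sample)
```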

My approach for collecting the sequence data —

Download the rna FASTA file from here.
Get all lnc_RNA transcript IDs from the GFF file.
Check whether these IDs are present in the FASTA file.
If they match, get their sequences from the FASTA file.
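
A rough sketch of these four steps follows. It assumes that lnc_RNA records have `lnc_RNA` in the third GFF column, that their ID has the form `rna-<accession>`, and that the first token of each FASTA header is that same accession. These are assumptions about the file layout, so verify them against your download. Function names and sample data are hypothetical.

```python
def lnc_rna_accessions(gff_lines):
    """Collect accessions of records typed 'lnc_RNA' (ID assumed 'rna-<accession>')."""
    accessions = set()
    for line in gff_lines:
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 9 or columns[2] != "lnc_RNA":
            continue
        for field in columns[8].split(";"):
            if field.startswith("ID=rna-"):
                accessions.add(field[len("ID=rna-"):])
    return accessions


def matching_sequences(fasta_lines, accessions):
    """Return {accession: sequence} for FASTA records whose first header token matches."""
    sequences = {}
    keep = None  # accession of the record being read, if it is wanted
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            tokens = line[1:].split()
            keep = tokens[0] if tokens and tokens[0] in accessions else None
            if keep is not None:
                sequences[keep] = ""
        elif keep is not None:
            sequences[keep] += line
    return sequences


# Made-up sample data.
gff_sample = [
    "chrA\tRefSeq\tlnc_RNA\t10\t500\t.\t-\t.\tID=rna-NR_000001.1;Parent=gene-DEMO1\n",
    "chrA\tRefSeq\tmRNA\t900\t1200\t.\t+\t.\tID=rna-NM_000002.1;Parent=gene-DEMO2\n",
]
fasta_sample = [
    ">NR_000001.1 demo long non-coding RNA\n", "ACGTAC\n", "GGTT\n",
    ">NM_000002.1 demo mRNA\n", "TTTT\n",
]
wanted = lnc_rna_accessions(gff_sample)
lnc_sequences = matching_sequences(fasta_sample, wanted)
```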

All of the data has finally been collected!

What did I learn from working with these databases?

Working with Ensembl:
- Accessing data from Ensembl is pretty straightforward. There are plenty of routes available, and the right one depends on the amount of data you need to extract.
- The documentation is well written and easy to understand, and it comes with code snippets.
Working with RefSeq:
- Accessing data from RefSeq can be a bit tricky. Some of the documentation is not clearly written.
- Getting the numeric data from the correct GFF file took me a while, especially the string manipulation.

Challenging and fun parts:

Ensembl: using the REST API to look up ~59,000 gene IDs!
- The REST API limit (for the POST service) is 1,000 IDs per call. In my experience, though, the server hung or became very slow even when I sent ~600 IDs at once.
- To deal with this, I sent 100 IDs per call. The whole process took around 4 hours.
RefSeq: extracting numeric data from the GFF file
- Many IDs have inconsistent data (in my view, some of the stored values look random or like garbage).
- Cleaning out those IDs was challenging until I spotted a pattern in them.

Coming up next —

Feature selection and creation, along with model training, will be done next!

Written by Srijan Verma