Auto-Annotating Microbial DNA Sequences Using Predictive Modeling
While I was working on the Autodesk Life Sciences team this summer, I was tasked with building a method of automatic annotation that could be applied to DNA sequences in Genetic Constructor. The idea was to create a structured pipeline that would identify Open Reading Frames (ORFs) within segments of genetic material.
This is a particularly useful operation, since reportedly less than 2% of the human genome is actively coding. The larger the organism, the less certain we are about which parts of its genome are relevant: C. elegans is approximately 25% actively coding, and yeast is around 70%. To find that small coding portion and synthetically modify DNA at all, the ORFs must be mapped out and easily viewable so scientists know which sequences should be adjusted.
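To make the idea of an ORF concrete, here is a minimal sketch of a naive forward-strand ORF scan. It is not how Glimmer works; it only assumes the textbook definition (an ATG start codon followed in the same reading frame by a stop codon), and the function name and the `min_codons` threshold are made up for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"


def find_orfs(seq, min_codons=3):
    """Return (start, end) pairs, 0-based and end-exclusive, for forward-strand ORFs.

    An ORF here runs from the first ATG in a reading frame to the next
    in-frame stop codon, and its length counts both of those codons.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START_CODON:
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # look for the next ORF in this frame
    return sorted(orfs)


print(find_orfs("CCATGAAATAGCC"))
```

A scan like this reports every start-to-stop stretch, coding or not, which is why a statistical model is needed to decide which candidates are likely to be real genes.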
Here is where Glimmer (Gene Locator and Interpolated Markov ModelER) comes in: a tool that uses predictive mathematics, built on a training set of prominent eukaryotic data, to make speedy, accurate (97%+) guesses about what is an ORF and what is not.
The standard protocol for identifying an unknown sequence is usually BLAST, but what if the sequence can’t be compared to known sequences? Glimmer fills that gap by using a learned model to find new genetic features in data that would otherwise go unrecognized.
Glimmer is based on an interpolated Markov model that is trained against a defined standard set of eukaryotic genetic data. More specifically, Glimmer is trained against the human, rice, and Arabidopsis thaliana genomes, three fairly well-studied model genomes that give it a comparatively strong grasp of eukaryotic data sets. Glimmer crawls through every sequence it is given in reverse order, looking for stop codons. For each stop codon, it then identifies the corresponding start codon using a probability model built from the training data, which scores how closely the candidate region resembles a recognized ORF.
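The general idea of an interpolated Markov model can be sketched in a few lines. This is an illustrative toy, not Glimmer's implementation: the maximum context length, the `total / (total + 4)` weighting, and all function names are assumptions chosen for brevity, and Glimmer's own weighting scheme is more sophisticated.

```python
import math
from collections import defaultdict

MAX_ORDER = 2  # illustrative only; real models use longer contexts


def train_counts(sequences, max_order=MAX_ORDER):
    """Count how often each base follows each context of length 0..max_order."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        seq = seq.upper()
        for i, base in enumerate(seq):
            for k in range(min(i, max_order) + 1):
                counts[seq[i - k:i]][base] += 1
    return counts


def base_probability(counts, context, base, max_order=MAX_ORDER):
    """Blend estimates from short to long contexts, trusting well-observed ones more."""
    prob = 0.25  # uniform fallback over A, C, G, T
    for k in range(min(len(context), max_order) + 1):
        seen = counts.get(context[len(context) - k:])
        if not seen:
            break  # no training data for this context or any longer one
        total = sum(seen.values())
        weight = total / (total + 4.0)  # hypothetical confidence weight
        prob = weight * seen.get(base, 0) / total + (1.0 - weight) * prob
    return prob


def log_score(counts, seq, max_order=MAX_ORDER):
    """Sum of log-probabilities of each base given its preceding context."""
    seq = seq.upper()
    return sum(
        math.log(base_probability(counts, seq[max(0, i - max_order):i], seq[i], max_order))
        for i in range(len(seq))
    )


model = train_counts(["ATGATGATG"])
print(log_score(model, "ATGATG") > log_score(model, "CCCCCC"))
```

The interpolation is what lets the model fall back on shorter contexts when a longer one was rarely seen in training, instead of committing to a single fixed order.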
This technique is what makes Glimmer3 special: ideally it never annotates an ORF past the bounds of the sequence, and it gets a firmer grip on the input by locking down the stop codon before guessing the start codon. Research at Johns Hopkins spurred the change away from reading start codon to stop codon, which is how the first two implementations of Glimmer worked. Searching in the forward direction proves to be the less accurate method, since more often than not the stop codon does not appear on the sequence in focus but falls outside Glimmer’s field of view. Reading backwards fixes this, since the start of the sequence itself is always known to the system when it reads the bases in reverse.
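The stop-first strategy described above can be sketched as follows. This is a simplified illustration, not Glimmer3's code: it handles only the forward strand, the set of start codons is an assumption, and the default `score` (sequence length) is a placeholder for the probability model that would rank candidate starts.

```python
STOPS = {"TAA", "TAG", "TGA"}
STARTS = {"ATG", "GTG", "TTG"}  # assumed start codons for illustration


def orfs_from_stops(seq, score=len):
    """Anchor on each stop codon, then walk upstream in frame to choose a start.

    `score` ranks candidate ORF sequences; the default simply prefers the
    longest one and stands in for a trained model.
    Returns (start, end) pairs, 0-based and end-exclusive.
    """
    seq = seq.upper()
    results = []
    for stop in range(len(seq) - 3, -1, -1):  # scan from the end backwards
        if seq[stop:stop + 3] not in STOPS:
            continue
        best = None
        i = stop - 3
        # Walk upstream codon by codon until the previous in-frame stop.
        while i >= 0 and seq[i:i + 3] not in STOPS:
            if seq[i:i + 3] in STARTS:
                if best is None or score(seq[i:stop + 3]) > score(seq[best:stop + 3]):
                    best = i
            i -= 3
        if best is not None:
            results.append((best, stop + 3))
    return results


print(orfs_from_stops("ATGCCCATGAAATAA"))
```

Because every candidate is anchored on a stop codon that is actually present, the reported end of an ORF never falls outside the input sequence.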
In this way, Glimmer’s interpolated Markov model is able to find the vast majority (>97%) of long protein-coding genes, even in bacteria, archaea, and viruses.
Docker: Standardizing Computing Environments
Using Glimmer comes with a catch: it’s only able to operate on Ubuntu, and on a relatively old version at that. Since I needed to feed sequences to Glimmer and pipe its outputs to the Genetic Constructor UI, running Ubuntu locally would not be a practical or universal solution. Docker to the rescue.
The project’s Dockerfile initializes an image that is booted up by the backend. It provides a standardized build of Glimmer layered on Ubuntu that does not depend on any local machine, so every user accesses the same environment.
Cloud Computing & Job Processing
Ubiquity is hard to achieve, and the cloud is here to help. Autodesk is a digital subscription company that aims to have all its tools in the cloud, so it was only fitting that we follow suit and use a cloud computing model to fit Glimmer into our development.
The overarching architecture of the tool used to send and receive Glimmer inputs and outputs was the Cloud Compute Cannon. Essentially, it sends the Docker image and the sequence selected in the UI, converted to the FASTA format for genetic data, up in a payload to be processed, and fetches the outcome of the Glimmer job via POST as JSON.
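As a rough illustration of the conversion step, the sketch below turns a selected sequence into FASTA text and wraps it in a JSON payload. The FASTA layout (a `>` header line followed by wrapped sequence lines) is standard, but the payload field names (`image`, `inputs`), the file name, and the image name are hypothetical placeholders, not the Cloud Compute Cannon's actual job schema.

```python
import json


def to_fasta(name, sequence, width=70):
    """Format a sequence as FASTA: a '>' header line, then fixed-width lines."""
    seq = "".join(sequence.split()).upper()  # drop whitespace, normalise case
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


def build_job_payload(image, fasta_text):
    """Bundle the image name and FASTA input as JSON (field names are hypothetical)."""
    return json.dumps({"image": image, "inputs": {"input.fasta": fasta_text}})


fasta = to_fasta("selected_block", "atgaaatag")
print(build_job_payload("example/glimmer", fasta))
```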
The results fetched from the Cloud Compute Cannon (CCC) are passed to a Node.js script that functions as the job processor. It queues the selected Glimmer jobs on sequences and gives the UI status output showing that a job is resolving and whether it has finished successfully. This is the middleware that connects the CCC backend, which loads Docker images to run Glimmer, to the front-end UI, which lets you click on a sequence, run a Glimmer job, and retrieve the job results.
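The job processor's role can be sketched as a small queue with per-job status. The real middleware is a Node.js script that is not shown here, so this Python version is only an assumed outline: the class name, the status strings, and the `run_job` callback are all illustrative.

```python
from collections import deque


class JobQueue:
    """Queue jobs, run them one at a time, and track a status the UI could poll."""

    def __init__(self, run_job):
        self.run_job = run_job  # callable that takes FASTA text and returns a result
        self.pending = deque()
        self.status = {}
        self.results = {}

    def submit(self, job_id, fasta_text):
        self.pending.append((job_id, fasta_text))
        self.status[job_id] = "queued"

    def process_next(self):
        """Run the oldest queued job; return its id, or None if the queue is empty."""
        if not self.pending:
            return None
        job_id, fasta_text = self.pending.popleft()
        self.status[job_id] = "running"
        try:
            self.results[job_id] = self.run_job(fasta_text)
            self.status[job_id] = "succeeded"
        except Exception as err:  # surface the failure instead of crashing the queue
            self.results[job_id] = str(err)
            self.status[job_id] = "failed"
        return job_id


queue = JobQueue(run_job=lambda fasta: {"orf_count": fasta.count("ATG")})
queue.submit("job-1", ">seq\nATGAAATAG\n")
queue.process_next()
print(queue.status["job-1"], queue.results["job-1"])
```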
Interface & Experience
The idea was to package everything in the backend out of sight, so that someone can simply click on the Feature Lookup panel, select Advanced (since running Glimmer counts as a job and requires processing), and find ORFs using Glimmer. The job runs via CCC and Docker and produces a checklist of retrieved features that can be applied to the sequences within the selected genetic constructs. Whichever features are selected and added should show up alongside the sequence as annotations, indicating that the enclosed base pairs make up an ORF.
All of this work put together marks the completion of my shipped feature, start to finish. Developing the Glimmer integration was hugely formative for me, and it will hopefully serve the Genetic Constructor community in their bioinformatics-powered search for ORFs.
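To show how predictions could become annotations, here is a minimal parsing sketch. It assumes the job returns text in the tabular layout of Glimmer3's prediction files (a `>` header line, then one line per ORF with an ID, start, end, reading frame, and score); the dictionary keys are hypothetical and are not Genetic Constructor's actual annotation schema, so check both against real output before relying on them.

```python
def parse_predictions(text):
    """Turn prediction lines into annotation dicts with normalised coordinates."""
    annotations = []
    sequence_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            sequence_id = parts[0] if parts else None
            continue
        fields = line.split()
        if len(fields) < 5:
            continue  # skip lines that do not match the assumed layout
        orf_id, start, end, frame, score = fields[:5]
        start, end = int(start), int(end)
        annotations.append({
            "sequence": sequence_id,
            "name": orf_id,
            # Reverse-strand ORFs are listed with start > end, so normalise.
            # This ignores genes that wrap around a circular genome.
            "start": min(start, end),
            "end": max(start, end),
            "strand": "-" if frame.startswith("-") else "+",
            "score": float(score),
        })
    return annotations


sample = """>example_sequence
orf00001      337     2799  +1     8.84
orf00003     5310     3734  -1     9.05
"""
for annotation in parse_predictions(sample):
    print(annotation)
```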
Check out ORF searches with Glimmer at Genetic Constructor
Contact me: udaysuresh.com
Originally published on the Autodesk Life Sciences Blog