Auto-Annotating Microbial DNA Sequences Using Predictive Modeling

Uday Suresh
Jan 6, 2018 · 5 min read

While working on the Autodesk Life Sciences team this summer, I was tasked with building an automatic annotation method for DNA sequences in Genetic Constructor. The idea was to create a structured pipeline that identifies Open Reading Frames (ORFs) within segments of genetic material.

This is a particularly useful operation, since reportedly less than 2% of the human genome is actively coding. The larger the organism, the more uncertain we are about internal genetic relevance: C. elegans is approximately 25% actively coding, and yeast is around 70%. To find that minute portion and synthetically modify DNA at all, the ORFs must be mapped out and easily viewable so scientists know which sequences should be adjusted.

Here is where Glimmer (Gene Locator and Interpolated Markov ModelER) comes in: a tool that uses predictive mathematics, scaffolded on a training set of prominent eukaryotic data, to make speedy, accurate (97%+) guesses about what is an ORF and what is not.

Why Glimmer?

The standard protocol for recognizing an unknown sequence is usually BLAST, but what if the sequence can’t be compared to known sequences? That’s where Glimmer comes in: it uses learned models to find new genetic features in data that is usually beyond common recognition.

Zero order nucleotide Markov Model

Glimmer is based on an interpolated Markov Model that is trained against a defined standard data set of eukaryotic genetic data. More specifically, Glimmer is trained against the human, rice, and Arabidopsis thaliana genomes, three fairly well-studied model genomes that give it a comparatively strong grasp of eukaryotic data sets. Glimmer crawls through every sequence it is given in reverse order, looking for stop codons and then identifying the corresponding start codon by using a probability model to match the candidate ORF against previously collected training data.
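To make the idea of a trained Markov model concrete, here is a minimal sketch of a fixed-order nucleotide model that scores how "training-like" a sequence looks. It is not Glimmer's implementation: an interpolated Markov model blends several context lengths, while this sketch uses a single order for simplicity. The function names, the pseudocount, and the uniform background are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def train_markov(sequences, order=2, pseudocount=1.0):
    """Estimate P(next base | previous `order` bases) from training sequences."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        seq = seq.upper()
        for i in range(order, len(seq)):
            context, base = seq[i - order:i], seq[i]
            if base in "ACGT":
                counts[context][base] += 1

    def prob(context, base):
        # Pseudocounts keep unseen contexts from producing zero probabilities.
        ctx = counts.get(context, {})
        total = sum(ctx.values()) + 4 * pseudocount
        return (ctx.get(base, 0.0) + pseudocount) / total

    return prob


def log_odds(seq, prob, order=2):
    """Sum of log2(P_model / P_uniform); higher means more like the training data."""
    seq = seq.upper()
    score = 0.0
    for i in range(order, len(seq)):
        score += math.log2(prob(seq[i - order:i], seq[i]) / 0.25)
    return score


# Toy usage: a model trained on a repetitive "coding-like" sequence.
model = train_markov(["ATG" * 20], order=2)
print(log_odds("ATGATGATG", model), log_odds("CCCCCCCCC", model))
```

A sequence that resembles the training data scores above zero, while one made of unseen contexts scores at the background level.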

This technique is what makes Glimmer3 special: ideally it never annotates an ORF past the bounds of the sequence, and it gets a firmer grip on the input by locking down the stop codon before guessing the start codon. Research at Johns Hopkins spurred the change away from start-to-stop reading, which was how the first two implementations of Glimmer worked. Searching in the forward direction proves to be less accurate because, more often than not, the stop codon does not appear on the sequence in focus but falls outside Glimmer’s field of view. Reading backwards fixes this, since the system always knows where the sequence itself starts when it reads the bases backwards.
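The stop-first idea can be sketched with a naive scanner: find each in-frame stop codon, then walk backward to the furthest upstream start codon that comes after the previous stop. This is only an illustration of the reading direction. It covers the forward strand only, always picks the longest candidate, and uses no probability model, whereas Glimmer scores candidates against its trained model. The function name and the `min_len` parameter are hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}
START = "ATG"


def orfs_from_stops(seq, min_len=30):
    """Return (start, end) pairs, 0-based and half-open, on the forward strand."""
    seq = seq.upper()
    found = []
    for frame in range(3):
        prev_stop_end = frame  # first in-frame position after the previous stop
        for pos in range(frame, len(seq) - 2, 3):
            if seq[pos:pos + 3] not in STOPS:
                continue
            # Walk backward from the stop, codon by codon, remembering the
            # furthest upstream start codon before the previous stop.
            start = None
            for p in range(pos - 3, prev_stop_end - 1, -3):
                if seq[p:p + 3] == START:
                    start = p
            if start is not None and pos + 3 - start >= min_len:
                found.append((start, pos + 3))
            prev_stop_end = pos + 3
    return sorted(found)


print(orfs_from_stops("CCATGAAAAAAAAATAAGG", min_len=9))
```

Because the scan is anchored on stop codons, every reported ORF is guaranteed to end inside the sequence.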

In this way, Glimmer’s interpolated Markov Model is able to find the vast majority (>97%) of long protein-coding genes, even in bacteria, archaea, and viruses.

Docker — Standardizing Computing Environments:

Using Glimmer comes with a catch: it only runs on Ubuntu, and a relatively old version at that. Since I needed to feed sequences to Glimmer and pipe its outputs to the Genetic Constructor UI, running Ubuntu locally would not be an apt or universal solution. Docker to the rescue.

I created a Dockerfile and pushed the resulting image to Docker Hub as a public repository that can be pulled to run Glimmer in this remote Ubuntu environment.

This Dockerfile defines an image that the backend boots up, providing a standardized, non-local build of Glimmer layered on Ubuntu so that every user accesses the same environment.
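As a rough illustration of what invoking such an image might look like, the sketch below builds a `docker run` command that mounts a working directory and calls `glimmer3` with a sequence file, a trained model file, and an output tag. The image name `example/glimmer3:latest`, the `/data` mount point, and the file names are placeholders, not the actual published image, and the sketch assumes `glimmer3` is on the container's PATH.

```python
import shlex


def glimmer_docker_command(host_dir, fasta_name, icm_name, tag,
                           image="example/glimmer3:latest"):
    """Build an argv list that runs glimmer3 inside a container.

    `image` is a hypothetical name; substitute the real repository.
    """
    return [
        "docker", "run", "--rm",       # remove the container when it exits
        "-v", f"{host_dir}:/data",     # share input and output files with the host
        "-w", "/data",                 # run from the mounted directory
        image,
        "glimmer3", fasta_name, icm_name, tag,
    ]


cmd = glimmer_docker_command("/tmp/job", "input.fasta", "model.icm", "run1")
print(shlex.join(cmd))  # pass `cmd` to subprocess.run(cmd, check=True) to execute
```

Building the command as a list rather than a shell string avoids quoting problems when file names contain spaces.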

Cloud Computing & Job Processing

Ubiquity is tough to achieve, and the cloud is here to help. Autodesk is a digital subscription company that aims to have all its tools in the cloud, so it was only fitting that we follow suit and use a cloud computing model to fit Glimmer into our development.

The overarching architecture of the tool used to send and receive Glimmer inputs and outputs was the Cloud Compute Cannon. Essentially, it sends the Docker image and the sequence selected in the UI (converted to the FASTA format for genetic data) up in a payload to be processed, then fetches the result of the Glimmer job via POST as JSON.
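The FASTA conversion and payload step might look something like the sketch below. The FASTA layout (a `>` header line followed by wrapped sequence lines) is standard, but the payload field names `image` and `inputs`, the file name `input.fasta`, and the image name are assumptions for illustration, not the Cloud Compute Cannon's actual request schema.

```python
import json


def to_fasta(name, sequence, width=70):
    """Format a sequence as a single FASTA record with fixed-width lines."""
    seq = "".join(sequence.split()).upper()  # drop any whitespace in the input
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return f">{name}\n" + "\n".join(lines) + "\n"


def build_job_payload(name, sequence, image="example/glimmer3:latest"):
    """Bundle the image name and the FASTA input into a JSON string.

    The keys used here are hypothetical placeholders.
    """
    return json.dumps({
        "image": image,
        "inputs": {"input.fasta": to_fasta(name, sequence)},
    })


print(to_fasta("construct_1", "atgaaacccgggtttTAA", width=10))
```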

The results fetched from the Cloud Compute Cannon (CCC) are passed to a node.js script that functions as the job processor: it queues the selected Glimmer jobs and reports to the UI whether each job is still resolving and whether it finished successfully. This is the middleware that connects the CCC backend, which loads Docker images to run Glimmer, to the front-end UI, which lets you click on a sequence, run a Glimmer job, and retrieve the job results.
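The queue-and-report behavior of such a job processor can be sketched in a few lines. The original was a node.js script whose code is not shown here, so this Python version is only a stand-in for the pattern: the class name, the status strings, and the injected `run_job` callable (which would wrap the real call to the compute backend) are all hypothetical.

```python
from collections import deque


class JobProcessor:
    """Queue jobs, run them one at a time, and track a status the UI can poll."""

    def __init__(self, run_job):
        self._run_job = run_job  # callable that takes the input and returns a result
        self._queue = deque()
        self.status = {}
        self.results = {}

    def submit(self, job_id, job_input):
        self._queue.append((job_id, job_input))
        self.status[job_id] = "queued"

    def process_next(self):
        """Run the oldest queued job; return its id, or None if the queue is empty."""
        if not self._queue:
            return None
        job_id, job_input = self._queue.popleft()
        self.status[job_id] = "running"
        try:
            self.results[job_id] = self._run_job(job_input)
            self.status[job_id] = "succeeded"
        except Exception as exc:  # record the failure instead of crashing the queue
            self.results[job_id] = str(exc)
            self.status[job_id] = "failed"
        return job_id
```

Injecting `run_job` keeps the queue logic testable without a live backend: any function can stand in for the remote call.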

Interface & Experience

The idea was to package everything in the backend out of sight so that someone can seamlessly click on the Feature Lookup panel, select Advanced (since running Glimmer counts as a job and requires processing), and find ORFs using Glimmer. This computes, via CCC and Docker, a checklist of retrieved features that can be applied to the sequences within the selected genetic constructs. Whichever features are selected and added show up alongside the sequence as annotations, indicating that the enclosed base pairs make up an ORF.
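Turning Glimmer's predictions into a checklist of annotations could look roughly like this. The sketch assumes the five-column layout of a Glimmer3 `.predict` file (ORF id, start, end, reading frame, score) under a `>` header line. The annotation dictionary keys are hypothetical, not Genetic Constructor's actual schema, and genes that wrap around the origin of a circular genome are not handled.

```python
def parse_predictions(text):
    """Convert prediction lines into a list of annotation dictionaries."""
    annotations = []
    for line in text.splitlines():
        fields = line.split()
        if not fields or line.startswith(">") or len(fields) < 5:
            continue  # skip blank lines, sequence headers, and malformed rows
        orf_id, start, end, frame, score = fields[:5]
        start, end = int(start), int(end)
        annotations.append({
            "name": orf_id,
            "start": min(start, end),  # reverse-strand rows list the larger coordinate first
            "end": max(start, end),
            "strand": "-" if frame.startswith("-") else "+",
            "score": float(score),
        })
    return annotations


sample = """>seq1
orf00001      337     2799  +1     9.59
orf00004     5310     5020  -3     3.01
"""
for annotation in parse_predictions(sample):
    print(annotation)
```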

All of this work put together marks the completion of my shipped feature, start to finish. Developing the Glimmer integration was hugely formative for me, and it will hopefully serve the Genetic Constructor community in their bioinformatics-powered search for ORFs.

Check out ORF searches with Glimmer at Genetic Constructor

Contact me: udaysuresh.com

Originally published on the Autodesk Life Sciences Blog


Built with the help & support of Autodesk Life Sciences Team + the instruction & guidance of Maxwell Bates, Florencio Mazzoldi, Cornelia Scheitz, and more — Thank you

Written by Uday Suresh

Bioengineering + Creative Writing @ UC Berkeley ’18
