Designing Customizable Nucleotide Sequence Auto-Annotation in Genetic Constructor
This summer, I worked on Genetic Constructor, the Autodesk Life Sciences team's molecular biology design tool and the first step toward building the ultimate cloud-based, CAD-inspired genetic engineering toolkit.
Genetic Constructor is poised to prove itself useful to geneticists in academia and industry alike, but one user need identified early on was local auto-annotation lookup for sequences. The feature speeds up recognition of particular sequences: people build a robust personal library of sequences they are familiar with, and the program extends that familiarity into an instantaneous lookup against any sequence constructed within it. The annotation library a person builds is completely customizable, and its matches appear as fully auto-annotated blocks, based on nucleotide sequence, within the Genetic Constructor canvas.
At this stage in development, Autodesk Life Sciences ships Genetic Constructor with a standard set of library annotations for sequences that molecular biologists regularly study and recognize. In the future, the Genetic Constructor UI will let scientists add their own annotated sequences to the library, so that imported sequences can be decorated with custom features. For now, the backend is complete enough that a user with basic coding knowledge could generate and add their own library elements.
Creating a curated & personalized library:
Parsing is the name of the game. I had to write a series of parsing scripts that take raw data in various forms (GenBank/FASTA or CSV/TXT), extract the desired information, and place it neatly into a library (as either a JSON file or a .js file).
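As a rough illustration of the CSV side of that pipeline, here is a minimal Python sketch. The column names (name, type, sequence), the `csv_to_library` helper, and the top-level "features" key are assumptions made for the example, not the actual Genetic Constructor library format.

```python
import csv
import io
import json

def csv_to_library(csv_text):
    """Turn CSV rows (assumed columns: name, type, sequence) into library entries."""
    entries = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Normalize the sequence: drop internal whitespace and upper-case it.
        sequence = "".join(row["sequence"].split()).upper()
        if not sequence:
            continue  # skip rows that carry no usable sequence
        entries.append({
            "name": row["name"].strip(),
            "type": row["type"].strip(),
            "sequence": sequence,
        })
    return entries

# Made-up sample data, purely for illustration.
sample_csv = (
    "name,type,sequence\n"
    "example_promoter,promoter,ttgaca tataat\n"
    "empty_row,cds,\n"
)
library = csv_to_library(sample_csv)

# Serialize the library as JSON (the "features" key is an assumed layout).
library_json = json.dumps({"features": library}, indent=2)
print(library_json)
```

Normalizing case and whitespace at parse time keeps the later lookup a plain string comparison.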
To parse GenBank and FASTA files, I used Biopython's SeqIO module to extract the name, sequence, and type of each entry. Each entry was placed into a dictionary, each dictionary was added to a list of features, and this list was appended to the JSON output of my CSV and text parser.
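The dictionary-per-entry flow can be sketched as follows. To keep the example dependency-free, a small hand-written FASTA reader stands in for Biopython's `SeqIO.parse`; the `read_fasta` and `fasta_to_features` helpers, the "features" key, and the sample data are all assumptions for illustration. FASTA carries no feature type, so the sketch takes it as a parameter.

```python
import io
import json

def read_fasta(handle):
    """Yield (name, sequence) pairs from FASTA text (stand-in for SeqIO.parse)."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            # Use the first word of the header as the entry name.
            name, chunks = (header[0] if header else "unnamed"), []
        elif name is not None:
            chunks.append(line.upper())
    if name is not None:
        yield name, "".join(chunks)

def fasta_to_features(handle, feature_type):
    """Build one dictionary per entry and collect them into a list of features."""
    return [
        {"name": name, "sequence": seq, "type": feature_type}
        for name, seq in read_fasta(handle)
    ]

# Made-up FASTA input.
fasta_text = ">part_A demo entry\nATGGCC\nTAA\n>part_B\nggatcc\n"
features = fasta_to_features(io.StringIO(fasta_text), feature_type="unknown")

# Append the features to JSON produced earlier by a CSV/text parser (assumed layout).
existing_json = '{"features": [{"name": "from_csv", "sequence": "TTGACA", "type": "promoter"}]}'
library = json.loads(existing_json)
library["features"].extend(features)
merged_json = json.dumps(library, indent=2)
print(merged_json)
```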
Recognizing library elements and preserving genomic data:
At a high level, we want to preserve the identity of any given feature in the form of an annotation and recognize it instantly, even when it sits in the midst of other nucleotides. For instance, LEU2 (YCL018W) is a frequently studied gene whose open reading frame (ORF) encodes an enzyme of the leucine biosynthesis pathway in Saccharomyces cerevisiae. This species, more commonly known as brewer's yeast, is one of the most intensely studied eukaryotic model organisms and is therefore included in Genetic Constructor's default library; brewer's yeast is nearly always relevant.
The idea is to distinctly annotate the LEU2 ORF despite the background noise of the surrounding nucleotides. To do this, a script crawls through the sequences in focus on the Genetic Constructor canvas, in both the forward and reverse directions, checking for matches against the sequences stored in the custom library. I wrote Node.js scripts that search linearly: they find instances where a library sequence matches the input, add those library entries to a list, and retrieve each entry's metadata (name, sequence length, start codon, stop codon, description, direction). Finally, the scripts package the metadata as annotations and present them in the sequence viewer within the Genetic Constructor canvas, highlighting the matches in a visually intuitive way.
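The search can be sketched roughly as below. This is a Python approximation of the idea, not the Node.js scripts themselves. It assumes that the "reverse direction" means matching the reverse complement of each library sequence, and the `annotate` function, the annotation fields, and the toy 9 bp ORF are all made up for the example.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string (non-ACGT symbols pass through)."""
    return seq.translate(COMPLEMENT)[::-1]

def find_all(haystack, needle):
    """Yield every start index of needle in haystack, including overlapping hits."""
    start = haystack.find(needle)
    while start != -1:
        yield start
        start = haystack.find(needle, start + 1)

def annotate(sequence, library):
    """Linearly search a sequence for each library entry on both strands."""
    sequence = sequence.upper()
    annotations = []
    for entry in library:
        target = entry["sequence"].upper()
        if not target:
            continue
        strands = [("forward", target)]
        rc = reverse_complement(target)
        if rc != target:  # avoid reporting palindromic sequences twice
            strands.append(("reverse", rc))
        for direction, pattern in strands:
            for start in find_all(sequence, pattern):
                annotations.append({
                    "name": entry["name"],
                    "start": start,                # 0-based, inclusive
                    "end": start + len(pattern),   # 0-based, exclusive
                    "length": len(pattern),
                    "direction": direction,
                    "description": entry.get("description", ""),
                })
    return sorted(annotations, key=lambda a: a["start"])

# Toy library entry and canvas sequence (not real LEU2 data).
library = [{"name": "toy_orf", "sequence": "ATGAAATAA", "description": "made-up 9 bp ORF"}]
canvas_sequence = "CC" + "ATGAAATAA" + "GG" + reverse_complement("ATGAAATAA") + "CC"

hits = annotate(canvas_sequence, library)
for hit in hits:
    print(hit["name"], hit["direction"], hit["start"], hit["end"])
```

A linear scan like this costs time proportional to the library size for each sequence, which is reasonable for a personal library; a much larger library might call for an index instead.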
Creating annotations in the genetic alignment UI:
Once a sequence is loaded into the Genetic Constructor canvas, the Instant Feature Search functions run automatically and produce a list of features from the local library that could be present. In our example, a sample sequence has been loaded into the canvas and can be inspected using the sequence viewer. The codons are displayed automatically, and the Feature Search column indicates that somewhere within this string of nucleotides, a LEU2 marker 1107 base pairs long has been recognized from the local library.
When the blue call-to-action button for adding the selected features is clicked, the matches found in the local library are applied within the sequence viewer as annotations (shown in yellow) alongside the given sequence, specifying exactly where the recognized ORF is. The amino acid sequence is overlaid in parallel with the nucleotides and codons of the sample sequence, so people can easily see the direction of the ORF and effortlessly pick out the start and stop codons.
The UI was built to frame the Genetic Constructor canvas in a way that could be understood instinctively, and the CSS and HTML had to fit design concepts grounded in an ideology of organic interfacing. One example of such an interaction between a person and the interface is the initial preview of the local annotations found in the sequence, shown before the person is prompted to select which feature annotations to apply to the sample sequence. This workflow reinforces an understanding of the sequence in focus, removing the abstraction that comes with staring at a string of nucleotides.
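The codon and amino acid overlay rests on a simple computation, sketched here in Python under the assumption that the standard genetic code is used. The `split_codons`, `translate`, and `describe_orf` helpers and the toy ORF are illustrative and are not taken from the Genetic Constructor code.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
STOP_CODONS = {"TAA", "TAG", "TGA"}

def split_codons(orf):
    """Split a sequence into complete codons, dropping any trailing partial codon."""
    usable = len(orf) - len(orf) % 3
    return [orf[i:i + 3] for i in range(0, usable, 3)]

def translate(orf):
    """Translate codon by codon; unknown codons become 'X', stops become '*'."""
    return "".join(CODON_TABLE.get(codon, "X") for codon in split_codons(orf.upper()))

def describe_orf(orf):
    """Report the first and last codons so start and stop are easy to pick out."""
    codons = split_codons(orf.upper())
    return {
        "protein": translate(orf),
        "has_start": bool(codons) and codons[0] == "ATG",
        "has_stop": bool(codons) and codons[-1] in STOP_CODONS,
    }

summary = describe_orf("ATGAAATAA")  # toy ORF: start codon, one lysine, stop codon
print(summary)
```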
Validation of concept:
In a literal sense, the instant lookup feature for local library annotations was tested with a series of describe statements written in Mocha.js, which simply check that the code runs end to end and does what was intended. For example, one simple check confirms that a library element is built when it should be.
But in a more conceptual sense, the purpose of the local library for instant annotation is rooted in improving the user experience in both research and development of genetic material. Wet-lab work and life science data management could see a real gain in computational capability if people are able to extend their own memory into a digital library. Processing genetic data also often involves the same ORFs over and over, since many academics spend years, if not their entire careers, focused on one organism, or even on one set of genes within an organism.
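A check in that spirit can be sketched as follows. This is a Python analog using plain assertions rather than Mocha.js describe statements, and `build_library_element` is a hypothetical helper invented for the example, not a function from Genetic Constructor.

```python
def build_library_element(name, sequence, description=""):
    """Hypothetical builder: return a library entry, or None for invalid input."""
    cleaned = "".join(sequence.split()).upper()
    # Reject missing names, empty sequences, and symbols outside A, C, G, T, N.
    if not name or not cleaned or set(cleaned) - set("ACGTN"):
        return None
    return {
        "name": name,
        "sequence": cleaned,
        "length": len(cleaned),
        "description": description,
    }

def test_builds_element_for_valid_sequence():
    element = build_library_element("toy_orf", "atg aaa taa")
    assert element is not None
    assert element["sequence"] == "ATGAAATAA"
    assert element["length"] == 9

def test_rejects_invalid_input():
    assert build_library_element("bad_symbols", "ATGXYZ") is None
    assert build_library_element("", "ATG") is None

# Run the checks directly; a test runner such as pytest would also collect them.
test_builds_element_for_valid_sequence()
test_rejects_invalid_input()
```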