Novel algorithms help gene engineers battle novel coronavirus

Purdue College of Engineering
Purdue Engineering Review
Jan 19, 2021

The mapping of the human genome — the complete code of instructions that enables us to develop and function — is vital in the fight against the coronavirus pandemic. Genome engineering, or genome editing, essentially alters an organism’s genetic code, and was recognized with the Nobel Prize in Chemistry in 2020. Recently, labs have turned to gene-based technologies to develop vaccines in record time compared with the traditional approach, in which weakened viruses are grown in mammalian or insect cells and the desired pieces are extracted to inject into humans.

Today, a staggering 94 vaccines to combat the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) are undergoing clinical evaluation. The initial three vaccines being administered, starting in late December 2020, are glorious examples of genetic engineering being used to safeguard the world.

The first two vaccines, from Pfizer/BioNTech and Moderna, use messenger RNA (mRNA) encapsulated in lipid nanoparticles so that it can be taken up by our cells, which then follow the vaccines’ instructions to make a harmless piece of the spike protein found on SARS-CoV-2, the virus that causes Covid-19, triggering an immune response. The third vaccine (from Oxford/AstraZeneca) uses a common-cold virus, an adenovirus, which has been disabled by removing its replication gene and other harmful genes and splicing in the spike protein gene instead, in an example of recombinant gene technology.

Along with the mRNA coronavirus vaccines, RNA-based therapeutics that blunt the full force of the novel coronavirus once someone has been infected have risen to the forefront. My lab, the Innovatory for Cells and Neural Machines (ICAN), has worked on developing algorithms to reduce the side effects of RNA-powered therapies that can occur through “off-targeting” to unwanted regions of the genome. We are developing simple, LEGO®-style building blocks for these algorithms, called “kernels”, that can be mixed, matched, and stitched together with programming language constructs and an associated compiler to accelerate the development cycle of new computational genomics algorithms.
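
To make the kernel idea concrete, here is a minimal, purely illustrative sketch in Python; the kernel names and the compose helper are hypothetical stand-ins, not ICAN’s actual kernel library or compiler. Each kernel is a small, reusable step, and composition does the stitching.

```python
from functools import reduce
from collections import Counter

# Hypothetical "kernels": small, reusable steps for genomics pipelines.
def quality_filter(min_len):
    # Keep only reads at least min_len bases long (a stand-in for a real quality check).
    return lambda reads: [r for r in reads if len(r) >= min_len]

def mask_ambiguous(reads):
    # Drop ambiguous 'N' bases from each read.
    return [r.replace("N", "") for r in reads]

def kmer_count(k):
    # Count all k-mers across the surviving reads.
    return lambda reads: Counter(r[i:i + k] for r in reads for i in range(len(r) - k + 1))

def compose(*kernels):
    # Stitch kernels into one pipeline, applied left to right.
    return lambda data: reduce(lambda acc, kernel: kernel(acc), kernels, data)

# Mix and match: swap, reorder, or insert kernels without rewriting the pipeline.
pipeline = compose(quality_filter(8), mask_ambiguous, kmer_count(3))
print(pipeline(["ACGTNACGTACG", "ACG", "TTACGTTACG"]).most_common(3))
```

Swapping, reordering, or adding kernels changes the algorithm without rewriting the pipeline, which is the kind of development-cycle speedup the compiler-backed approach aims for.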

Our lab also has come up with a Natural Language Processing (NLP)-inspired technique to decode the language of cells probabilistically, to detect and correct errors in sequenced reads. Underlying this approach is a perplexity metric: an indication, based on the currently observed sequence, of what comes next, with probability scores for the differing outcomes. For an analogy, think of smartphone software autocorrecting your typing: a lower perplexity score for a word means that the software is more likely to suggest that word.
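
As a rough illustration of the metric itself, here is a simple k-mer (Markov) model in Python; this is a toy stand-in, not the lab’s actual NLP model. A read that follows the patterns learned from trusted reads gets a low perplexity, while a surprising base drives the score up and can flag a likely sequencing error.

```python
import math
from collections import defaultdict

def train_kmer_model(reads, k=3):
    """Count (k-mer context -> next base) transitions from trusted reads."""
    counts = defaultdict(lambda: defaultdict(int))
    for read in reads:
        for i in range(len(read) - k):
            counts[read[i:i + k]][read[i + k]] += 1
    return counts

def perplexity(read, counts, k=3, alpha=1.0):
    """Perplexity of a read under the k-mer model, with add-alpha smoothing."""
    log_prob, n = 0.0, 0
    for i in range(len(read) - k):
        context, nxt = read[i:i + k], read[i + k]
        total = sum(counts[context].values()) + 4 * alpha   # 4 possible bases
        p = (counts[context][nxt] + alpha) / total
        log_prob += math.log(p)
        n += 1
    return math.exp(-log_prob / max(n, 1))

# Toy usage: a read that follows the learned pattern scores low; a read with a
# surprising base scores higher, hinting at a sequencing error.
model = train_kmer_model(["ACGTACGTACGTACGT", "ACGTACGTACGT"])
print(perplexity("ACGTACGTACGT", model))   # low perplexity
print(perplexity("ACGTACGGACGT", model))   # higher perplexity
```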

Such advances will lead to greater adoption and effectiveness of precision medicine. Consider the current coronavirus pandemic, in which some people are asymptomatic while others show varying degrees of resilience to the disease, possibly due to variations in their genetic makeup. I have used machine learning (ML) techniques, specifically neural networks and support vector machines, to identify patterns in the epigenome (a set of chemical compounds that tell the genome what to do) that result in different phenotypes in different humans. My work has enabled the identification of regions of the genome, called enhancers, that boost the expression of their target genes.
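
As a hedged sketch of the general recipe (using scikit-learn and synthetic data, not the lab’s models or real epigenomic tracks), candidate genomic regions can be described by epigenomic signal features, with a support vector machine trained to separate enhancers from non-enhancers:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Hypothetical data: one row per candidate region; columns stand in for epigenomic
# signals averaged over the region (e.g., histone marks, chromatin accessibility).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = (X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# An RBF-kernel support vector machine, one of the model families mentioned above.
clf = SVC(kernel="rbf", probability=True).fit(X_train, y_train)
print("held-out AUC:", roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))
```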

Figure: Bayesian optimization process for predicting RNA-gene interactions in the TCGA-BRCA dataset. Panel [C] shows a heatmap of regulatory RNA-gene interactions, as predicted by our algorithm (regressor), Tiresias. Tiresias uses an ensemble of two simple neural networks connected by a Bayesian loop to predict interactions, and holds up even when the sample dataset violates normality. The regression error is then reduced further by switching to a non-linear model. For more, read https://pubmed.ncbi.nlm.nih.gov/29290807/.
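
As a loose illustration only, not the published Tiresias algorithm (see the linked paper for that), the general pattern of pairing a Bayesian linear stage with a non-linear second stage that absorbs the remaining regression error can be sketched like this:

```python
import numpy as np
from sklearn.linear_model import BayesianRidge
from sklearn.neural_network import MLPRegressor

# Hypothetical inputs: per-sample covariates X (e.g., regulatory-RNA expression and
# epigenomic features) and the expression y of a putative target gene.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5))
y = np.sin(X[:, 0]) + 0.3 * X[:, 1] + rng.normal(scale=0.1, size=500)

# Stage 1: a Bayesian linear regressor gives a first-pass prediction with uncertainty.
stage1 = BayesianRidge().fit(X, y)
mean1, std1 = stage1.predict(X, return_std=True)

# Stage 2: a small neural network models the non-linear residual structure the linear
# stage missed, which is where the remaining regression error gets reduced.
stage2 = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000, random_state=1).fit(X, y - mean1)

print("linear-only MSE:", np.mean((y - mean1) ** 2))
print("two-stage MSE:  ", np.mean((y - (mean1 + stage2.predict(X))) ** 2))
```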

I am leveraging the power of neural networks to extract patterns in the genomic code in order to decipher the computation of cells for precision mRNA therapeutics, and to correct errors in sequenced genomic codes. With precision-centric RNA technologies, I aim to make the translation of RNA therapeutics (RNA-based drugs or CRISPR-Cas9-based genome editing) more specific, by decreasing the incidence of off-targeting and increasing the robustness of on-target activity. This interplay between ML and data engineering, combined with genomics and cell engineering, will speed the translation of lab research into clinical treatments.

It all comes down to tuning the “music of the cell.” Think of the epigenome and genome as governing the “computation” of life: the letters of the genome are like the notes of the music, and if those letters are modified or edited by natural mutations or genome editing, the music can become discordant. In disease, the music gets distorted and needs to be fixed. That is why we named one of our deep neural network (DNN)-based algorithms “Aikyatan,” from the Sanskrit, meaning “one harmony.”

We are also working with experts in genome assembly to translate our discoveries in epigenomics into cell engineering solutions for regenerative medicine. Our lab has had early success demonstrating how to create pluripotent stem cells (PSCs) that can differentiate into target cell types, such as cardiac cells, to help cure heart disease. The current state of the art produces stem cells that too often have deleterious side effects, like tumor formation due to incorrect epigenetic wiring. Our data-driven approach generates “recipes” for creating stem cells that our cell engineering collaborators then put into practice in their lab-scale experiments and clinical translation process.

A huge number of cloud computing cycles are required to crunch this data. Our technologies optimize compute power for genomics research, and we have filed for two patents on the use of ML to improve cloud-based and on-premise distributed computation for genomics.

The common chord in this research is applied ML, particularly approximate neural networks for pattern matching, used to decipher both the genome and computer-vision data workloads for the Internet of Things (IoT). This gives my laboratory its name, the Innovatory for Cells and Neural Machines: the use of machine learning, primarily neural networks, to understand and then program the computation of living cells, and also to approximate the energy-guzzling computation of the ever-growing IoT world.

Somali Chaterji, PhD
Assistant Professor, Department of Agricultural and Biological Engineering
Director, Innovatory for Cells and Neural Machines

Leadership team member, Wabash Heartland Innovation Network (WHIN) for Digital Agriculture
College of Engineering, Purdue University

Related Links

Innovatory for Cells and Neural Machines (ICAN)

ICAN research thrust: Computational genomics

Purdue Engineering Review: Bringing the cloud back down to earth

Purdue Engineering Review: Engineering and biology join forces for healing

NIH research project will upgrade ‘metagenomics’ system

Technology aims to provide cloud efficiency for databases during data-intensive COVID-19 pandemic
