Published in The Startup

Machine Learning in RNA Bioinformatics

If you haven’t already read my earlier articles, please read those first for background information that will help you follow this one.

Two decades ago, we got the entire human genome sequenced. We had just found out the exact position of each of the three billion nucleotides that make up our DNA, our biological code. But while the Human Genome Project cost billions of dollars, getting a readout of your own genetic data now takes no more than paying about $100, spitting into a test tube, and sending it off to 23andMe.

The cost of sequencing a human genome has dropped dramatically, from $100M in 2001 to $1k in 2019, greatly exceeding the rate of decrease predicted by Moore’s law.
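To make that comparison concrete, here is a quick back-of-the-envelope sketch that uses only the two cost figures quoted above. The two-year halving period for Moore’s law is my assumption (the commonly quoted rule of thumb), and the variable names are just for illustration.

```python
import math

# Figures quoted above: roughly $100M per genome in 2001, roughly $1k in 2019.
COST_2001 = 100_000_000
COST_2019 = 1_000
YEARS = 2019 - 2001

# Assumption: Moore's law read as "cost halves every two years".
MOORE_HALVING_YEARS = 2.0


def fold_decrease(start_cost, end_cost):
    """How many times cheaper the end cost is than the start cost."""
    return start_cost / end_cost


def halving_time(years, fold):
    """Years per halving, assuming a steady exponential decline."""
    return years / math.log2(fold)


observed_fold = fold_decrease(COST_2001, COST_2019)
moore_fold = 2 ** (YEARS / MOORE_HALVING_YEARS)
observed_halving = halving_time(YEARS, observed_fold)

print(f"Observed decrease: {observed_fold:,.0f}-fold over {YEARS} years")
print(f"Moore's-law decrease: {moore_fold:,.0f}-fold over {YEARS} years")
print(f"Observed halving time: {observed_halving:.2f} years")
```

Under these assumptions the quoted costs imply a 100,000-fold drop, against roughly a 512-fold drop for a two-year halving schedule over the same 18 years.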

But aside from the more trivial uses for genetic information, like determining whether you have a widow’s peak or how your genetic profile predetermines your height, genetic information could be used to save lives on a large scale.

Imagine a baby, just taken from its mother’s womb, whose crying draws happiness and joy from everyone waiting in the hospital room. For two minutes, happiness floods the room. The new mother looks gratefully at her child, knowing that for all the pain in the world, her experience was worth it. But then the crying stops and all you can hear are the beeps from the heart monitor.

The baby starts to have convulsive seizures, and its arms and legs flail in asynchronous patterns. All happiness disappears as fast as it had appeared, and the room fills with a sense of pure anxiety and distress.

Now with today’s technologies, it’s debatable whether we’d be able to save this newborn child. For all the time we’ve spent and money we’ve invested in thousands of drugs, we’d be powerless, staring anxiously as the doctor frantically attempts to save the baby’s life, unsure about how to treat a condition that they have never encountered before.

Conversely, imagine a world where, after the baby starts to have seizures, the doctor knows exactly what to do: feed genetic data into machine learning algorithms that could save the baby’s life. The doctor quickly retrieves a cell sample from the baby and sequences its genome, recording each of the three billion nucleotides to find the hidden bug that could take its life.

The machine learning algorithm identifies the variant as part of a noncoding locus within the baby’s genome, instantaneously deduces the mechanism by which the variant induces epilepsy, and suggests that the baby needs to be dosed with Vitamin B.

This sounds like a far-off fantasy, and to be honest, right now it probably is. Here’s why: of the three billion nucleotides we’ve sequenced, scientists have long considered about 90% of the DNA to be “junk DNA” and the remaining 10% to be the portion that’s directly relevant to disease diagnosis and drug development.

The problem with this interpretation is that the “junk DNA” isn’t actually junk. If I gave you a list of all the RGB values for each image frame that makes a Hollywood movie, you’d probably discard it as trash instead of trying to reconstruct the movie from them. That’s the exact way we’re treating noncoding DNA: as data jargon that can be discarded without negative consequences. But just because the “junk DNA” isn’t useful in its present form, that doesn’t mean that it isn’t useful in constructing a larger story. In fact, it’s probably the most important element, playing a greater but more subtle role than the protein-coding genes that code for the large macromolecules that carry out most of the chores in our cells.

Imagine watching a theatrical production with backstage production managers maneuvering various colored spotlights. These managers are the ones who put everything in perspective, just like the noncoding DNA within our cells.

The good news is that we’re heading in the right direction. By analyzing the hidden layers of data that represent much of the backstage management within our genome (such as splicing and RNA-protein binding), we’re able to construct a broader picture of our genome that lends itself to the creation of novel therapeutics for thousands of rare diseases, conditions, and side effects that affect a large majority of the human population. We’re beginning to see that the noncoding region of our genome isn’t merely a scramble of As, Gs, Cs, and Ts, but rather a secret code which, if interpreted correctly, allows us to develop more efficacious and impactful medicine, thereby lengthening the human healthspan.

In this article, I first go over a recently discovered riboSNitch that induces a rare disease. Then, I go over some prominent efforts to understand the hidden patterns lurking within the noncoding region, and how state-of-the-art machine learning techniques such as convolutional neural networks and reinforcement learning can be used to identify small bugs that are hiding in the code for life.

When I started learning about this topic, one doubt I had was whether every novel noncoding variant that gets discovered actually penetrates through to the phenotype. In other words, how can we know for sure whether a mutation in noncoding locus X affects a person’s phenotype and causes visible symptom Y?

It turns out that while the severity of a deleterious mutation depends in part on the gene it’s located in, it’s reasonable to assume that all mutations will have some amount of “penetrance” and will subtly affect the phenotype in at least a small way, whether that’s via a small side effect to a drug or a chronic life-threatening disease.

While that’s one approach to thinking about the problem, it’s often more logical to think of each multifactorial trait as stemming from a large variety of mutations each responsible for a different type of cellular or external phenotype, whether it’s a loss of genetic regulatory function within the cell or a defect in protein transport mechanisms. Each noncoding variant that we study acts as a small puzzle piece that contributes to our larger understanding of a particular condition.

In fact, a study conducted in 2018 showed that a riboSNitch (a noncoding single-nucleotide mutation that changes the three-dimensional structure of an RNA molecule) increases vulnerability to post-traumatic pain disorders. The mutation sits in what is called the “untranslated region” of a gene coding for a binding protein, and it dramatically changed the structure of that region.

MicroRNA molecules (miRNAs), which normally act to repress the translation of the gene, could no longer bind. Because of that, the gene was overexpressed, leading to a detrimental phenotype.

This study shows us how one small mutation, by changing the interaction between a gene and regulatory counterparts, can cause a drastic phenotype.
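To get a feel for how a single base change can reshape an RNA, here is a minimal sketch of a Nussinov-style dynamic program that counts the maximum number of nested base pairs a sequence can form. This is a toy stand-in of my own choosing, not the method used in the study: real folding tools score thermodynamic energies, and the two short sequences below are made up for illustration.

```python
# Watson-Crick pairs only. Real folding tools also model G-U wobble pairs
# and use thermodynamic energies rather than simple pair counts.
PAIRS = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A")}
MIN_LOOP = 3  # assumption: a hairpin loop needs at least 3 unpaired bases


def max_base_pairs(seq):
    """Maximum number of nested base pairs (Nussinov-style recursion)."""
    n = len(seq)
    # best[i][j] holds the best pair count for the subsequence seq[i..j].
    best = [[0] * n for _ in range(n)]
    for span in range(MIN_LOOP + 1, n):
        for i in range(n - span):
            j = i + span
            # Case 1: position j stays unpaired.
            score = best[i][j - 1]
            # Case 2: position j pairs with some k, leaving a valid loop.
            for k in range(i, j - MIN_LOOP):
                if (seq[k], seq[j]) in PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    inner = best[k + 1][j - 1]
                    score = max(score, left + 1 + inner)
            best[i][j] = score
    return best[0][n - 1] if n else 0


# Hypothetical toy sequences: the "mutant" differs by one base (C -> A).
wild_type = "GGGAAACCC"
mutant = "GGGAAACAC"

print(max_base_pairs(wild_type))  # 3
print(max_base_pairs(mutant))     # 2
```

In this toy case one substitution removes a base pair from the hairpin stem, which is the kind of structural shift that could hide or expose a binding site.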

The problem is, in a 3-billion-nucleotide genome, it would take centuries to search through and find all of these bugs. How can we leverage machine learning techniques, including convolutional neural networks and reinforcement learning, to accelerate this process?

A group of scientists from Illumina’s Artificial Intelligence Lab recently built a neural network called SpliceAI, which can sift through pre-mRNA transcript sequences and identify splice sites. It assigns scores to new variants and, through a novel neural network architecture, estimates how those variants affect splicing and intron inclusion.
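The general idea of scoring a variant can be sketched as follows: run a splice-site predictor on the reference sequence and on the mutated sequence, then report the largest change near the variant. This is my simplified reading of the approach, not SpliceAI’s actual code. The `toy_donor_scores` function is a hypothetical stand-in for a trained network, and the 50-base window is an assumption.

```python
def toy_donor_scores(seq):
    """Hypothetical stand-in for a trained model: flags every 'GT' dinucleotide.

    A real network would output a learned probability per position instead.
    """
    return [1.0 if seq[i:i + 2] == "GT" else 0.0 for i in range(len(seq))]


def apply_snv(seq, pos, alt):
    """Return a copy of seq with the base at 0-based index pos replaced by alt."""
    return seq[:pos] + alt + seq[pos + 1:]


def delta_scores(ref_seq, pos, alt, scorer, window=50):
    """Largest gain and largest loss in score within `window` bases of the variant."""
    alt_seq = apply_snv(ref_seq, pos, alt)
    ref_scores = scorer(ref_seq)
    alt_scores = scorer(alt_seq)
    lo = max(0, pos - window)
    hi = min(len(ref_seq), pos + window + 1)
    diffs = [alt_scores[i] - ref_scores[i] for i in range(lo, hi)]
    gain = max(0.0, max(diffs))
    loss = max(0.0, -min(diffs))
    return gain, loss


# Made-up sequence: changing the G at index 3 destroys the only 'GT'.
print(delta_scores("CAGGTAAGA", 3, "A", toy_donor_scores))  # (0.0, 1.0)
```

A high loss suggests the variant weakens an existing splice site, while a high gain suggests it creates a new (possibly cryptic) one.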

The SpliceAI-10k model (named for the 10,000 nucleotides of flanking sequence it takes in, 5,000 on either side of the position being scored) receives input sequences with splice annotations from the GENCODE database. It then feeds the data into a convolutional neural network that predicts the splice donor site and the splice acceptor site (the starting and ending points of an intron, respectively).
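Before a convolutional network can read a sequence, the letters have to become numbers. Here is a minimal sketch of how a flanking window could be cut out and one-hot encoded. The A, C, G, T channel order, the all-zero padding for unknown bases, and the function names are my assumptions, not details taken from the SpliceAI code.

```python
ONE_HOT = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}
PAD = (0, 0, 0, 0)  # used for 'N' and for positions past the sequence ends


def context_window(seq, center, flank):
    """Return `flank` bases on each side of `center`, padded with 'N' at the ends."""
    left = max(0, center - flank)
    right = min(len(seq), center + flank + 1)
    pad_left = "N" * (flank - (center - left))
    pad_right = "N" * (flank - (right - 1 - center))
    return pad_left + seq[left:right] + pad_right


def one_hot_encode(seq):
    """Turn a nucleotide string into a list of 4-element one-hot tuples."""
    return [ONE_HOT.get(base, PAD) for base in seq.upper()]


window = context_window("ACGT", center=1, flank=3)
print(window)                   # NNACGTN
print(one_hot_encode(window))   # 7 rows, one per position in the window
```

The window always has length `2 * flank + 1`, so every position the network scores sees the same amount of context regardless of where it sits in the transcript.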

The study used machine learning to predict the effect of splice mutations on alternative splicing in different types of tissue and achieved 95% accuracy in identifying these “cryptic” variants. The team then supported the claim that these variants are truly detrimental by sequencing people with autism and finding these variants in their genes. In addition, they were able to show how these mutations act differently in different types of cells using what’s called the GTEx database. Although every cell in your body shares the vast majority of the same DNA, genes are expressed differently from cell to cell, just as people differ from one another, and this can create a need for different types of therapeutic targeting and treatment.

This kind of technology, if scaled up, could be what turns the fantasy I cooked up at the beginning of this article into a reality.

If I told you that 250,000 people spread across the globe playing a video game were actually helping us solve this problem of noncoding variants, drug targets, and accelerating healthcare, you probably wouldn’t believe me. But that’s exactly what’s happening with the public online game EteRNA.

Here’s what EteRNA looks like if you’re curious. Each pair of circles is a nucleotide base pair. Red-Green represents Guanine — Cytosine and blue-yellow represents Uracil — Adenine.

EteRNA is a game created and managed by researchers at Stanford Medical School. Its 250,000 players around the world solve puzzles in which they mutate bases to try to match a given target RNA structure. Players who propose sequences that share enough structural similarity with the target go on to work with people at Stanford’s medical school to determine the in vitro structure of their proposed sequence and how they can iterate to improve it. So far, humans have been shown to be more successful than RNA inverse folding algorithms (such as MODENA, NUPACK, and RNAinverse). You can learn more about EteRNA .
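The core check behind such a puzzle can be sketched in a few lines: given a target structure and a candidate sequence, see whether each pair of positions that should be bonded holds bases that can actually pair. I am assuming the common dot-bracket notation for the target (matching parentheses mark paired positions, dots mark unpaired ones). Whether EteRNA stores puzzles this way is not something the sources above confirm.

```python
# G-C and A-U are the standard pairs; G-U is the weaker "wobble" pair.
ALLOWED_PAIRS = {"GC", "CG", "AU", "UA", "GU", "UG"}


def parse_dot_bracket(structure):
    """Return the (i, j) index pairs encoded by a dot-bracket string."""
    stack, pairs = [], []
    for idx, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(idx)
        elif symbol == ")":
            if not stack:
                raise ValueError("unbalanced structure: extra ')'")
            pairs.append((stack.pop(), idx))
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r}")
    if stack:
        raise ValueError("unbalanced structure: extra '('")
    return pairs


def compatibility(sequence, structure):
    """Fraction of target pairs the sequence could form (1.0 if there are none)."""
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    pairs = parse_dot_bracket(structure)
    if not pairs:
        return 1.0
    ok = sum(1 for i, j in pairs if sequence[i] + sequence[j] in ALLOWED_PAIRS)
    return ok / len(pairs)


print(compatibility("GGGAAACCC", "(((...)))"))  # 1.0
```

Passing this check is necessary but not sufficient: a compatible sequence may still prefer to fold into some other shape, which is why proposed designs still need a folding model or a lab experiment.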

One group of scientists has come up with a reinforcement learning approach to “inverse RNA folding”: predicting an RNA sequence that will fold into a given target structure. In reinforcement learning, the computer is rewarded for each step it takes toward its assigned goal, so it learns to keep going in the right direction.
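Here is a deliberately tiny sketch of that reward loop. It is not the published method: it uses an epsilon-greedy bandit at each sequence position, which is a much simpler relative of full reinforcement learning, and the reward only checks whether the target pairs could form instead of actually folding the sequence. All names and hyperparameters are my own assumptions.

```python
import random

BASES = "ACGU"
ALLOWED_PAIRS = {"GC", "CG", "AU", "UA", "GU", "UG"}


def reward(sequence, target_pairs):
    """Toy reward: fraction of target pairs the sequence could form.

    Assumption: a real system would fold the sequence and compare structures.
    """
    if not target_pairs:
        return 1.0
    ok = sum(1 for i, j in target_pairs
             if sequence[i] + sequence[j] in ALLOWED_PAIRS)
    return ok / len(target_pairs)


def train(length, target_pairs, episodes=2000, epsilon=0.2, lr=0.1, seed=0):
    """Learn per-position base preferences from rewards; return the best design seen."""
    rng = random.Random(seed)
    # value[pos][base]: running estimate of the reward seen when pos holds base.
    value = [{b: 0.0 for b in BASES} for _ in range(length)]
    best_seq, best_reward = None, -1.0
    for _ in range(episodes):
        chosen = []
        for pos in range(length):
            if rng.random() < epsilon:
                # Explore: try a random base at this position.
                chosen.append(rng.choice(BASES))
            else:
                # Exploit: use the base with the highest estimated reward.
                chosen.append(max(BASES, key=lambda b: value[pos][b]))
        seq = "".join(chosen)
        r = reward(seq, target_pairs)
        # Nudge the estimate for every choice made toward the reward received.
        for pos, base in enumerate(chosen):
            value[pos][base] += lr * (r - value[pos][base])
        if r > best_reward:
            best_seq, best_reward = seq, r
    return best_seq, best_reward


# Target: a 9-base hairpin whose stem pairs positions 0-8, 1-7, and 2-6.
target = [(0, 8), (1, 7), (2, 6)]
design, score = train(length=9, target_pairs=target)
print(design, score)
```

Choices that were followed by a high reward get picked more often in later episodes, which is the "keep going in the right direction" behavior described above.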

This machine learning technique ultimately proved to be more effective than all of the other RNA inverse folding algorithms, and it goes to show the potential of machine learning to aid areas like RNA bioinformatics and to revolutionize healthcare.

Key Takeaways and Summary:

  • We need to make sure we know what we don’t know about the human genome. Until we do that, all of the genome data we have coupled with the decrease in cost won’t really mean much.
  • By looking at specific types of noncoding variants, such as those that affect RNA alternative splicing or those that change the three-dimensional structure of RNAs, we can get closer to achieving this goal.
  • Machine learning techniques such as convolutional neural networks (CNNs) and reinforcement learning are helping us identify more detrimental variants and begin to supply hypotheses for the mechanisms by which these variants act.

Four papers I referenced for this article:

Thanks for reading! If you’d like to discuss any of the topics above, I’d love to get in touch with you!



Mukundh Murthy

Innovator passionate about the intersection between structural biology, machine learning, and cheminformatics. Currently @ 99andbeyond.