COSSMO — Competitive Splice Model For RNA Splicing

Laiba Khan
Mar 31, 2019

The process of turning DNA into proteins always seemed so simple. I mean I learned about it almost every year in science class, so I really thought I knew what I was doing.

Turns out, what my science teacher forgot to mention was RNA splicing. Something I knew vaguely about, but never really took the time to explore.

It was when I stumbled upon Deep Genomics’ research paper on their AI model, COSSMO, that I realized how sick RNA splicing is, and how much I had to learn.

You see, we always talk about how important our DNA is and about the effects of changes in its sequence. But what do we really mean by that? What is actually controlling how our genetic sequence comes out? Well, RNA splicing plays a huge role in that. Which is why I was so set on understanding this process.

So, a few days and a couple of research papers later, I can officially say I know how RNA splicing works. (That is, until I read another research paper that brings up another term I’ve never heard of.) But until then, I’m gonna summarize what I learned about Deep Genomics’ COSSMO model.

But before I begin my rant on that and how awesome it is, we need to understand RNA splicing.

An Overview of RNA Splicing

Here’s a quick and simple overview of what RNA splicing is from my last article:

We’re still not ready for translation though; the RNA strand still needs modifications. At this point in the process, the RNA strand is made out of introns and exons. Exons are the sections that code for protein and introns are those that don’t.

To make this piece of RNA readable we need to remove the introns and also add caps to the ends of the RNA sequence. A 5' cap and a 3' poly-A tail need to be added.

This process is called intron splicing. A complex made out of proteins and RNA, called the spliceosome, removes the intron segments and joins the adjacent exons together to produce mature mRNA.

The mature mRNA then leaves the nucleus through the nuclear pores.

As I already mentioned, introns need to be cut out before the RNA is ready to go through translation and be turned into a protein.

This is because introns are part of what we call junk DNA. Junk DNA refers to DNA that does not code for protein; instead, it’s in charge of other, less visible tasks.

It actually makes up more than 95% of our DNA, which leaves less than 5% (the exons) that codes for proteins, the structures that control the majority of our body’s processes.

The Process of RNA Splicing

It’s officially time for the details. We know what happens during RNA splicing, but not too much on how.

The spliceosome is in charge of cutting these introns out. It’s similar to a ribosome in that it’s made out of both RNA and protein: 5 RNA strands and around 150 proteins.

The 5 RNA strands are U1, U2, U4, U5, and U6 (and yes, whoever named these did not care about the number 3). These strands are called small nuclear RNAs, or snRNAs. They range from 100 to 300 nucleotides long, and each one teams up with several proteins to form a small nuclear ribonucleoprotein, or snRNP.

The snRNPs recognize and attach themselves to different parts of the pre-mRNA sequence, eventually forming a spliceosome.

Step One: snRNPs bind to the 5' splice site and the branch site

The 5' splice site sits at the end of Exon 1, where the intron begins, and the 3' splice site sits at the beginning of Exon 2, where the intron ends. These sites are also called the donor site (5'SS) and the acceptor site (3'SS). Their locations are determined by something called the GU/AG mRNA processing rule.

U1 binds to the donor site by using complementary base pairing. The sequence around a splice site follows a consensus: the intron usually starts with the dinucleotide GU at the 5'SS and ends with AG at the 3'SS, hence the name.
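To make the GU/AG rule concrete, here is a minimal Python sketch that lists every position where a donor (GU) or acceptor (AG) dinucleotide shows up in a sequence. The function name and the toy sequence are made up for illustration, and a real spliceosome looks at far more context than two letters.

```python
def find_candidate_sites(pre_mrna):
    """Return (donors, acceptors): 0-based positions of GU and AG dinucleotides.

    Hypothetical helper for illustration only. DNA-style input is accepted
    by converting T to U first.
    """
    seq = pre_mrna.upper().replace("T", "U")
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GU"]
    acceptors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    return donors, acceptors


# Toy pre-mRNA (not a real gene).
toy = "CCAGGUAAGUCUUUUCAGGCC"
donors, acceptors = find_candidate_sites(toy)
print("candidate donor (GU) positions:", donors)
print("candidate acceptor (AG) positions:", acceptors)
```

Even this tiny sequence has several GU and AG hits, which shows that the dinucleotide rule alone can't tell us which sites actually get used.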

U2 performs this same process to connect itself with the branch site.

Step Two: Bring both sites together and cut them

Then U4, U5, and U6 join to create a complex around U1 and U2. U1 then leaves, and the intron is cut away from the exon it was connected to at the 5'SS. This leaves us with a free “G” at the start of the intron. This G bonds with the A at the branch site, which has an open hydroxyl group. This creates a lariat structure (loop).

The spliceosome then cuts off the intron at the 3'SS. The structure then breaks apart, leaving only the exons and lariat loop.

Step Three: Catalyze the RNA cleavage and joining reactions

The lariat loop is degraded, while the exons are joined together and marked near the join by a group of proteins called the exon junction complex, or EJC.

The connected exons are called mature mRNA and go on to the translation process.
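The whole cut-and-join process can be sketched in a few lines of Python. This is a toy model under my own assumptions: the intron coordinates are given up front as (start, end) pairs, the `splice` function is hypothetical, and the only biology it checks is that each intron starts with GU and ends with AG.

```python
def splice(pre_mrna, introns):
    """Remove introns and join the remaining exons.

    introns: list of (start, end) half-open index pairs into pre_mrna.
    Raises ValueError if an intron breaks the GU...AG rule or overlaps
    the previous intron.
    """
    pieces = []
    cursor = 0  # where the current exon begins
    for start, end in sorted(introns):
        if start < cursor:
            raise ValueError("introns overlap")
        intron = pre_mrna[start:end]
        if not (intron.startswith("GU") and intron.endswith("AG")):
            raise ValueError(f"intron at {start}-{end} does not follow GU...AG")
        pieces.append(pre_mrna[cursor:start])  # keep the exon before the intron
        cursor = end                           # skip over the intron
    pieces.append(pre_mrna[cursor:])           # keep the final exon
    return "".join(pieces)


# Toy example: exon 1 + intron + exon 2 (made-up sequences).
exon1, intron, exon2 = "AUGGCC", "GUAAGUCUAACUUUCAG", "GGAUAA"
pre = exon1 + intron + exon2
mature = splice(pre, [(len(exon1), len(exon1) + len(intron))])
print(mature)
```

Here the result is just exon 1 and exon 2 stuck together, which is the mature mRNA from the steps above.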

Oh, wait, one more thing!

In some pieces of RNA, the strands will be spliced the exact same way every time, and the same exons will be kept in the same order, thus producing the same protein. This is called constitutive splicing, and it’s what we just learned about.

But sometimes the RNA strand can be cut in different ways, so that different exons are kept or skipped. This produces different expressions of the gene and creates protein isoforms: proteins that originate from one gene and are structurally similar. A commonly cited example of a protein with several isoforms is transforming growth factor beta (TGF-β), which exists in three versions: TGF-β1, TGF-β2, and TGF-β3. The process for this looks something like the following image:
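One simple kind of alternative splicing is exon skipping. As a rough sketch, assuming the first and last exons are always kept and every internal exon can independently be included or skipped, we can list all the possible exon combinations. The function name and exon labels are hypothetical, and real cells only produce a regulated subset of these.

```python
from itertools import combinations


def exon_skipping_isoforms(exons):
    """List exon combinations where internal exons may be skipped.

    Assumes at least two exons; the first and last are always kept and
    the original exon order is preserved.
    """
    first, *internal, last = exons
    isoforms = []
    # Go from "keep every internal exon" down to "skip them all".
    for k in range(len(internal), -1, -1):
        for kept in combinations(internal, k):
            isoforms.append([first, *kept, last])
    return isoforms


isoforms = exon_skipping_isoforms(["E1", "E2", "E3", "E4"])
for iso in isoforms:
    print("-".join(iso))
```

With two internal exons there are four possible combinations, and the count doubles with every extra internal exon, which is how one gene can end up with many isoforms.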

Now, Let’s Talk About COSSMO

In the genetics world, computational models that can predict how a sequence will be spliced are called splicing codes.

Splicing codes will help us better understand the different sequences being produced and their functions.

Earlier splicing codes include a model that could predict whether an exon was constitutively or alternatively spliced. Deep Genomics went further and created a model capable of predicting the usage distribution of multiple competing splice sites: alternative acceptor sites conditional on a constitutive donor site, and vice versa.

This basically means they could predict which donor and acceptor sites get used when introns are cut out, in both alternative and constitutive splicing.

To do this, though, they had to analyze more than the inherent strength of each splice site. In alternative splicing, there are multiple splice sites competing for the spliceosome’s recognition, so it’s necessary to find how competitive a splice site is against the others, not just how strong it is on its own.

This is why COSSMO predicts the percent-selected index (PSI) for each site, meaning the chance of each competing site being used.
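To get a feel for what a PSI distribution looks like, here is a minimal sketch that turns one score per competing site into percentages that add up to 1. It assumes a softmax normalization, which is a standard way to turn scores into a distribution. The site names and scores are made up, and this is not COSSMO’s actual code or output.

```python
import math


def psi_from_scores(scores):
    """Softmax over per-site scores, giving a PSI-like distribution.

    scores: dict mapping a site name to a real-valued strength score.
    Returns a dict of the same keys whose values sum to 1.
    """
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {site: math.exp(s - top) for site, s in scores.items()}
    total = sum(exps.values())
    return {site: e / total for site, e in exps.items()}


# Hypothetical scores for three acceptor sites competing for one donor.
scores = {"acceptor_A": 2.0, "acceptor_B": 1.0, "acceptor_C": -1.0}
psi = psi_from_scores(scores)
for site, value in psi.items():
    print(f"{site}: {value:.1%}")
```

The key idea is that a site’s PSI depends on its competitors: the same score gives a high PSI next to weak sites and a low PSI next to strong ones.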

COSSMO will allow us to learn more about gene sequences and mis-splicing, which is estimated to cause about 15% of genetic diseases. The impact of this technology is kinda crazy, and what’s crazier is that you can actually try COSSMO right over here.

I’m currently working on replicating this program and a video focusing more on the AI side of COSSMO will be out very soon!

If you want to stay up to date with my work, follow me on LinkedIn for updates. If you’re curious about my other projects, check out my website! Also, claps? :)
