Protein Function Prediction — Motifs

Simon Tse
Learn about Cancer with Code
4 min read · Oct 2, 2023
Courtesy: Design / Science Photo Library / Getty Images

In the last post, I identified a magic number, 36, that gives the maximum number of combinations with 25 amino acids.

Now I am going to explore how the 36-unit k-mers generated from each protein sequence can be used to differentiate between Gene Ontology (GO) groups.

Approach

To do so, I am going to create a graph that captures the following relationship:

GO group => Protein => 36-unit K-mers

This will help me count the number of recurring k-mers in each GO term and then compare the overlapping k-mers between different GO groups.

I use the following Python snippets to analyse the data.
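As an illustration only, here is a minimal sketch of how the GO group => Protein => k-mer relationship could be held in plain Python dictionaries. It assumes overlapping sliding-window k-mers and treats a k-mer as "recurring" when it appears in at least two distinct proteins of the same GO group. Both choices are assumptions, and the names `kmers`, `build_graph`, `recurring_kmers`, and the toy IDs are hypothetical.

```python
from collections import defaultdict

K = 36  # k-mer length used in the article


def kmers(sequence, k=K):
    """Return the distinct overlapping k-mers of a sequence (assumed sliding window)."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def build_graph(go_to_proteins, sequences, k=K):
    """Build GO term -> k-mer -> set of protein IDs containing that k-mer."""
    graph = defaultdict(lambda: defaultdict(set))
    for go_term, protein_ids in go_to_proteins.items():
        for pid in protein_ids:
            for kmer in kmers(sequences[pid], k):
                graph[go_term][kmer].add(pid)
    return graph


def recurring_kmers(graph, go_term, min_proteins=2):
    """Return k-mer -> protein count for k-mers seen in at least min_proteins proteins."""
    return {
        kmer: len(pids)
        for kmer, pids in graph[go_term].items()
        if len(pids) >= min_proteins
    }


# Toy data with k=3 so the result is easy to check by hand (hypothetical IDs).
toy_sequences = {"P1": "ABCDE", "P2": "XBCDY"}
toy_go = {"GO:toy": ["P1", "P2"]}
toy_graph = build_graph(toy_go, toy_sequences, k=3)
toy_recurring = recurring_kmers(toy_graph, "GO:toy")
```

In the toy data only `BCD` is shared by both proteins, so it is the only recurring k-mer. With real sequences, `k` would stay at 36, and sequences shorter than `k` simply contribute no k-mers.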

I have extracted a few GO terms with a moderate number of occurrences, that is, GO terms with between 500 and 600 different protein sequences.
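A selection like this could be sketched as a simple filter over a GO term to protein mapping. The function name `select_go_terms`, the inclusive bounds, and the toy data are assumptions made for illustration.

```python
def select_go_terms(go_to_proteins, low=500, high=600):
    """Return GO term -> distinct protein count for terms within [low, high]."""
    selected = {}
    for go_term, protein_ids in go_to_proteins.items():
        n_proteins = len(set(protein_ids))  # count each protein once
        if low <= n_proteins <= high:
            selected[go_term] = n_proteins
    return selected


# Toy mapping with hypothetical GO labels and protein IDs.
toy = {
    "GO:toy_small": [f"P{i}" for i in range(10)],
    "GO:toy_mid": [f"P{i}" for i in range(550)],
    "GO:toy_large": [f"P{i}" for i in range(5000)],
}
selected = select_go_terms(toy)
```

Only the mid-sized toy term survives the filter here, which mirrors the idea of keeping groups that are neither too sparse nor too large to inspect by hand.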

Prepared by author

The following table summarises the nature of these GO groups.

Prepared by author

I will use the first three GO groups to demonstrate what those Python snippets do and what they produce.

Prepared by author

The second GO group comes from the cellular component class, whereas the first GO group comes from the biological process class.

Prepared by author

Honestly, I am surprised that there is no overlap in motifs at all, even granting that the two GO groups belong to different classes. On the other hand, this is not improbable, because the entire motif space spans over 50 million different motifs. Given that the number of motifs occurring more than once is in the range of thousands to tens of thousands, it is not easy to find overlapping motif sequences when a GO group has only 500–600 proteins. That might be different for GO terms with a higher number of proteins.
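To make the comparison and the "not improbable" argument concrete, here is a rough sketch. `motif_overlap` is a plain set intersection, and `expected_random_overlap` is a back-of-envelope estimate that assumes both motif sets are uniform random draws from the same motif space. That is a strong simplification, since real protein sequences are far from uniform. The function names and the round figure of 50 million are assumptions for illustration.

```python
def motif_overlap(motifs_a, motifs_b):
    """Return the shared motifs and the Jaccard similarity of two motif sets."""
    a, b = set(motifs_a), set(motifs_b)
    shared = a & b
    union = a | b
    jaccard = len(shared) / len(union) if union else 0.0
    return shared, jaccard


def expected_random_overlap(size_a, size_b, space_size):
    """Expected shared motifs if both sets were uniform random subsets of the space."""
    return size_a * size_b / space_size


MOTIF_SPACE = 50_000_000  # "over 50 million" in the article, used as a round figure

# Toy comparison: one shared motif out of three distinct motifs.
shared, jaccard = motif_overlap({"AAA", "BBB"}, {"BBB", "CCC"})

# Thousands vs tens of thousands of recurring motifs per GO group.
few = expected_random_overlap(1_000, 1_000, MOTIF_SPACE)
many = expected_random_overlap(10_000, 10_000, MOTIF_SPACE)
```

Under this crude assumption, two groups with a thousand recurring motifs each would share about 0.02 motifs on average, and two groups with ten thousand each would share about 2. It is only a sanity check, but it suggests that an empty overlap between moderately sized groups is not surprising by chance alone.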

The following is another GO term from the biological process class. It is worth noting that the distribution has a ‘fat’ tail and that the node degrees are exceptionally diverse. That might be an interesting area to explore.
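One way such a distribution could be tabulated is sketched below. It assumes that the "degree" of a k-mer is the number of distinct proteins in the GO group that contain it, which is an interpretation on my part. The names `degree_distribution` and `tail_fraction` are hypothetical.

```python
from collections import Counter


def degree_distribution(kmer_to_proteins):
    """Map degree -> number of k-mers with that degree (degree = distinct proteins)."""
    return Counter(len(set(pids)) for pids in kmer_to_proteins.values())


def tail_fraction(distribution, min_degree):
    """Fraction of k-mers whose degree is at least min_degree."""
    total = sum(distribution.values())
    tail = sum(n for degree, n in distribution.items() if degree >= min_degree)
    return tail / total if total else 0.0


# Toy k-mer -> proteins mapping for a single GO group (hypothetical IDs).
toy = {
    "AAA": {"P1"},
    "BBB": {"P1", "P2"},
    "CCC": {"P3"},
    "DDD": {"P1", "P2", "P3", "P4"},
}
dist = degree_distribution(toy)
heavy = tail_fraction(dist, min_degree=2)
```

Comparing `tail_fraction` across GO groups at a few thresholds would be a quick way to check whether one group really has a fatter tail than another before plotting anything.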

Prepared by author

Again, there is no overlap in motifs, even though both GO groups come from the biological process class.

Intermission

With this preliminary analysis, it is interesting to find that different GO groups can have non-overlapping motifs, which could be used to distinguish one group from another. I am not saying this is a sure thing, as I picked a not-so-random subset of GO groups to analyse. The verdict is pending an exhaustive analysis of all GO groups. However, this seems to be an interesting twist, and I would raise the following hypotheses:

  1. A small group of recurring motifs within a GO group is likely to set the tone for those proteins. It acts like a ‘backbone’ that defines their generic nature.
  2. Motifs that occur only once within a GO group would provide specificity, giving each protein its unique function.
  3. How these motifs relate to the tertiary/quaternary structure, and to how the protein interacts with substrates, is an area for further exploration.
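The first two hypotheses could be probed with a sketch like the following. It splits the motifs of one GO group into a "backbone" set (motifs found in at least two proteins) and per-protein "specific" motifs (found in exactly one protein), then measures how many proteins carry at least one backbone motif. The threshold of two proteins and the names `split_motifs` and `backbone_coverage` are assumptions made for illustration.

```python
from collections import defaultdict


def split_motifs(kmer_to_proteins, min_proteins=2):
    """Split motifs into a shared backbone set and per-protein singleton motifs."""
    backbone = set()
    specific = defaultdict(set)
    for kmer, pids in kmer_to_proteins.items():
        pids = set(pids)
        if len(pids) >= min_proteins:
            backbone.add(kmer)
        elif len(pids) == 1:
            (pid,) = pids  # the only protein carrying this motif
            specific[pid].add(kmer)
    return backbone, dict(specific)


def backbone_coverage(kmer_to_proteins, backbone, all_proteins):
    """Fraction of proteins that contain at least one backbone motif."""
    all_proteins = set(all_proteins)
    covered = set()
    for kmer in backbone:
        covered |= set(kmer_to_proteins[kmer])
    return len(covered & all_proteins) / len(all_proteins) if all_proteins else 0.0


# Toy k-mer -> proteins mapping for a single GO group (hypothetical IDs).
toy = {"AAA": {"P1", "P2"}, "BBB": {"P1"}, "CCC": {"P3"}, "DDD": {"P3"}}
backbone, specific = split_motifs(toy)
coverage = backbone_coverage(toy, backbone, {"P1", "P2", "P3"})
```

If the backbone idea holds, coverage should be high within a GO group even when the backbone itself is small. A low coverage would suggest that the recurring motifs describe only a sub-family rather than the whole group.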

I haven’t figured out what to do next, but I have a few ideas that I would like to try out. Any input is welcome!

So stay tuned!

Simon Tse
Learn about Cancer with Code

Try to apply my ML/NLP knowledge to problems I am interested in and create a narrative with the data. Current Interest: Cancer Biology