Protein Function Prediction — From Motif to Network

Simon Tse
Learn about Cancer with Code
3 min read · May 27, 2024
Courtesy: Design / Science Photo Library / Getty Images

Introduction

In the last post, I covered how to map any protein sequence into a finite number of fixed-length motifs.

In this post, I describe how I convert those non-overlapping motifs into a network for further analysis.

Approach

In the last post, I prepared a data set that stores each protein sequence in the following format.

Created by author

A quick recap of what each column represents: ‘seq’ is the original protein sequence, and ‘ngram’ is one of the 36-aa long motifs produced by splitting that sequence. You will notice that the last motif is shorter than 36 aa; this is typical for natural proteins, whose lengths are rarely an exact multiple of 36. ‘start_pos’ and ‘end_pos’ indicate the position of that ngram within the original protein sequence. ‘weight’ represents the match between the ngram and the standardised 36-aa long Kmer. ‘UniprotID’ is the Primary Accession Number that the Uniprot database assigns to each unique protein sequence. ‘GO’ is the Gene Ontology Number used in the CAFA database. I explain why I include the UniprotID and GO terms in the construction later in this post.
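To make the ‘ngram’, ‘start_pos’ and ‘end_pos’ columns concrete, here is a minimal sketch of a non-overlapping split. It is not the code from the previous post: the function name `split_into_ngrams` is made up for illustration, the positions are assumed to be 0-based with an exclusive end, and the ‘weight’ column is left out because its calculation is not described here.

```python
def split_into_ngrams(seq, size=36):
    """Split seq into consecutive, non-overlapping windows of `size` residues.

    Hypothetical helper for illustration only. Positions are assumed to be
    0-based, with end_pos exclusive; the original data set may use a
    different convention.
    """
    rows = []
    for start in range(0, len(seq), size):
        ngram = seq[start:start + size]  # the last window may be shorter
        rows.append({
            "ngram": ngram,
            "start_pos": start,
            "end_pos": start + len(ngram),
        })
    return rows


# Toy 80-residue string made of the 20 standard amino-acid letters.
# It is not a real protein sequence.
toy_seq = "ACDEFGHIKLMNPQRSTVWY" * 4
rows = split_into_ngrams(toy_seq)

for row in rows:
    print(row["start_pos"], row["end_pos"], len(row["ngram"]))
```

With an 80-residue input, the split gives two full 36-aa motifs and one shorter trailing motif, which mirrors the short last row in the table above.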

To turn the ngrams into a network graph, I use the following helper functions.

Below is the output of running the functions ‘createKmerNetwork’ and ‘plotKmerNetwork’.

Created by author

The structure might not be apparent from this diagram. The construction follows the set of rules spelled out below.

1. Use the UniprotID as the starting point.

2. Connect the UniprotID to the first Kmer/ngram.

3. Connect the first Kmer/ngram to the second Kmer/ngram by position, and so on until the last Kmer/ngram.

4. Connect the last Kmer/ngram back to the UniprotID.

5. Add a bi-directional connection between the UniprotID and each GO term.
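The five rules above can be sketched as a plain list of directed edges. This is a rough illustration under my own assumptions, not the ‘createKmerNetwork’ function itself: the name `build_kmer_edges` and the placeholder IDs (`UNIPROT_X`, `GO_A`, `GO_B`, `kmer_1` and so on) are hypothetical, and each node is identified simply by its label.

```python
def build_kmer_edges(uniprot_id, ngrams, go_terms):
    """Return directed (source, target) edges that follow the five rules.

    Hypothetical helper for illustration only. `ngrams` must already be
    ordered by position in the protein sequence.
    """
    edges = []
    if ngrams:
        # Rules 1 and 2: start from the UniprotID and link it to the first ngram.
        edges.append((uniprot_id, ngrams[0]))
        # Rule 3: chain each ngram to the next one by position.
        for current, following in zip(ngrams, ngrams[1:]):
            edges.append((current, following))
        # Rule 4: close the loop from the last ngram back to the UniprotID.
        edges.append((ngrams[-1], uniprot_id))
    # Rule 5: a bi-directional link is two directed edges per GO term.
    for go in go_terms:
        edges.append((uniprot_id, go))
        edges.append((go, uniprot_id))
    return edges


# Placeholder labels, not real accession numbers or GO terms.
edges = build_kmer_edges(
    "UNIPROT_X",
    ["kmer_1", "kmer_2", "kmer_3"],
    ["GO_A", "GO_B"],
)

for source, target in edges:
    print(source, "->", target)
```

An edge list like this can be loaded into a networkx DiGraph with `add_edges_from`. One caveat of labelling nodes by the ngram string: if the same ngram occurs twice in a protein, both occurrences collapse into a single node, so a real implementation may need position-aware node names.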

The reason for using the UniprotID is that I can use it to look up the structure and function(s) of the sequence. I can then map which point or segment of the protein is, for example, a binding site or a zinc finger, or carries other information that could be used in later analysis. The GO term is used to group different protein sequences that share the same function according to their Gene Ontology annotation.

Intermission

In this post, I have covered how to convert a protein sequence into a DiGraph. In the next post, I will cover how I combine all protein sequences from those 5 GO terms into a single graph for further analysis.

Stay tuned!

Try to apply my ML/NLP knowledge to problems I am interested in and create a narrative with the data. Current Interest: Cancer Biology