Protein Function Prediction — Turning Protein Sequence to Motif
Introduction
In the previous post, I wrote about breaking proteins down into non-overlapping, 36-aa-long chains to use as sequence motifs for further analysis.
In this post, I elaborate on how I structure that analysis: by creating a network that connects the protein sequences from the 5 GO groups to which PIK3AP1 belongs.
Approach
To start, I need some helper functions to match protein sequences against a standardised library of 36-aa-long kmers.
I used the CAFA data set of proteins to create such a library. It contains 53,370,039 unique 36-aa-long kmers. The protein sequence of PIK3AP1 is as below:
I am going to break this protein sequence into non-overlapping segments (ngrams) and match them to the standardised kmer library using the helper function matchInDic. First, I have to break the protein sequence into segments, as depicted in the diagram below.
Note: The sequence in the diagram is not PIK3AP1. It is only there to demonstrate the procedure.
You will notice that the last ngram is shorter than 36 aa. This is the norm for almost all protein sequences, because their lengths are rarely exact multiples of 36, and that is why I need a fuzzy match in my matchInDic helper function to handle such cases.
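The segmentation step described above can be sketched roughly as follows. This is a minimal illustration rather than the code used for the analysis: the function name split_into_ngrams and the toy sequence are made up for this example.

```python
def split_into_ngrams(sequence, k=36):
    """Split a sequence into consecutive, non-overlapping segments of length k.

    The final segment is kept even when it is shorter than k.
    """
    return [sequence[i:i + k] for i in range(0, len(sequence), k)]


# Toy sequence of 91 residues (not a real protein), for illustration only.
toy_sequence = "ACDEFGHIKLMNPQRSTVWY" * 4 + "ACDEFGHIKLM"

ngrams = split_into_ngrams(toy_sequence)
print([len(n) for n in ngrams])  # [36, 36, 19]
```

Because 91 is not a multiple of 36, the last ngram has only 19 residues, which is the situation the fuzzy match has to deal with.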
There are 22,530 distinct protein sequences under the 5 GO groups to which PIK3AP1 belongs. Next, I match their ngrams to the standardised kmer library. The following is the Python script I used.
Below is a snapshot of one of the mapped protein sequences.
The second-to-last column shows the kmer matched to each ngram. You can also see a ‘weight’ column, which gives the fraction of the kmer that the ngram covers. For example, the last ngram ‘HPPDYVIQNQIGMFLNYIC’ has a weight of 0.5278, calculated by dividing the length of the ngram (19 aa) by 36 (the standardised kmer length). The other ngrams have a weight of 1, which means the match is exact.
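To make the weight calculation concrete, here is a small sketch of how such a lookup could work. It is not the matchInDic helper itself. The names match_weight and match_in_library are hypothetical, the library is a two-entry toy, and the rule that a short ngram matches a kmer starting with it is an assumption, since the exact fuzzy-matching rule is not spelled out here.

```python
KMER_LENGTH = 36


def match_weight(ngram, k=KMER_LENGTH):
    """Weight = ngram length divided by the standard kmer length."""
    return round(len(ngram) / k, 4)


def match_in_library(ngram, library, k=KMER_LENGTH):
    """Return (matched_kmer, weight), or (None, 0.0) when nothing matches.

    ASSUMPTION: a short ngram matches any kmer that starts with it.
    """
    if not ngram:
        return None, 0.0
    if len(ngram) == k:
        # Full-length ngram: exact lookup, weight 1.
        return (ngram, 1.0) if ngram in library else (None, 0.0)
    # Short ngram: linear prefix scan (fine for a toy library only).
    for kmer in library:
        if kmer.startswith(ngram):
            return kmer, match_weight(ngram, k)
    return None, 0.0


# Toy library; the padded kmer is invented for this example.
padded_kmer = "HPPDYVIQNQIGMFLNYIC" + "A" * 17
toy_library = {padded_kmer, "M" * 36}

kmer, weight = match_in_library("HPPDYVIQNQIGMFLNYIC", toy_library)
print(weight)  # 0.5278
```

A linear scan like this would be far too slow against 53 million kmers. A real implementation would need an index, such as a prefix tree, to find candidates for short ngrams.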
Intermission
In the next post, I will cover how to construct a graph from these matched kmers.
Stay tuned!