Protein Function Prediction — Ngram

Simon Tse
Learn about Cancer with Code
4 min read · Sep 18, 2023
Courtesy: Laguna Design / Science Photo Library / Getty Images

In the last post, I covered the CAFA train data set used in the Kaggle competition, with an overview of what the data set looks like and what each field describes. In this post, I look at the sequences a little more deeply.

Background

Proteins are encoded by DNA, and DNA follows regular patterns that determine how each protein is manufactured, so it is natural to assume that protein sequences have similar characteristics. Proteins with similar functions should share certain pattern(s).

DNA, as the cookbook for proteins, should be written with a syntax, just like natural language. This is much the same reason why natural language processing (NLP) techniques have been employed to analyse DNA. The n-gram model from NLP is therefore a natural choice for analysing protein sequences.

For those who are not familiar with the n-gram model, an n-gram [1] is a series of n adjacent letters (including punctuation marks and blanks), syllables, or rarely whole words found in a language dataset; or adjacent phonemes extracted from a speech-recording dataset; or adjacent base pairs extracted from a genome. Extended to proteins, it would be n adjacent amino acids extracted from a protein sequence. For example, the sequence MKTAY contains three 3-grams: MKT, KTA and TAY. In computational biology, n-grams over polymers or oligomers of a known size are called k-mers.

[1] Source: https://en.wikipedia.org/wiki/N-gram

Analysis

To analyse the proteins, I use the following code snippet to generate the k-mers.

Prepared by author

The first two functions allow me to control which k-mers to generate. For example, I can generate a list of monomers and a list of trimers while skipping the dimers. This gives me the flexibility to combine different k-mers for downstream processing, instead of using the built-in n-gram option in, for example, scikit-learn, which restricts me to a consecutive range of n-gram sizes.

Next, I analyse whether there is a pattern across different k-mers. This roughly translates to asking whether the number of distinct k-mers has a ceiling. To do so, I use the text vectoriser in the scikit-learn library.
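Since the original snippet is not reproduced here, the following is only a minimal sketch of how such k-mer helpers could look. The function names `kmers` and `kmers_for_sizes` and the toy sequence are my own assumptions for illustration, not the code used in the analysis.

```python
from typing import Iterable, List


def kmers(sequence: str, k: int) -> List[str]:
    """Return all overlapping k-mers of a sequence (empty list if it is shorter than k)."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def kmers_for_sizes(sequence: str, sizes: Iterable[int]) -> List[str]:
    """Concatenate the k-mers for each requested size, in the order given."""
    tokens: List[str] = []
    for k in sizes:
        tokens.extend(kmers(sequence, k))
    return tokens


# Hypothetical toy sequence: monomers and trimers, skipping the dimers.
example = "MKTAY"
tokens = kmers_for_sizes(example, sizes=[1, 3])
print(tokens)
```

Passing an explicit list of sizes such as `[1, 3]` is what makes non-consecutive combinations possible; a (min, max) range cannot express "1 and 3 but not 2".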

Prepared by author

Above is an example of how to obtain the result I intend to collect. It demonstrates how I collect monomers from the train data set. The column dimension of X_train gives me the number I want: the number of distinct k-mers found in the train data set.

The following table summarises the results for the different k-mers.
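As a rough illustration of this step, here is a sketch that assumes scikit-learn's `CountVectorizer` with a custom callable analyzer. The helper `make_kmer_analyzer` and the three toy sequences are hypothetical; the real analysis runs on the CAFA train sequences and may be set up differently.

```python
from sklearn.feature_extraction.text import CountVectorizer


def make_kmer_analyzer(k):
    """Build an analyzer that splits a sequence into overlapping k-mers (hypothetical helper)."""
    def analyzer(sequence):
        return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    return analyzer


# Toy stand-ins for protein sequences, not real CAFA data.
toy_sequences = ["MKTAY", "MKTLL", "AYAYA"]

# With a callable analyzer, CountVectorizer uses the returned tokens as-is.
vectorizer = CountVectorizer(analyzer=make_kmer_analyzer(1))
X_train = vectorizer.fit_transform(toy_sequences)

# Rows are sequences, columns are distinct k-mers (here: monomers).
n_distinct = X_train.shape[1]
print(n_distinct)
```

Changing the argument of `make_kmer_analyzer` to another k repeats the same measurement for that k-mer size.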

Prepared by author

I went through all k-mers from k = 1 to 40. An interesting phenomenon emerges: the number of distinct k-mers first rises and then falls as k grows, tracing a roughly parabolic trend. The number reaches its maximum at 36-mers and then starts to decrease. I would therefore hypothesise that 36-mers capture all the essential information needed to predict a protein's function.

Since 36 is the magic number, it is worth checking whether any proteins have fewer than 36 amino acids. I ran the following code snippet to find out.
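The sweep over k can be sketched in plain Python as below. This is an assumption-laden toy version: `count_distinct_kmers` is a hypothetical helper, and the three short sequences stand in for the train data set, so the numbers it prints say nothing about the real results.

```python
def count_distinct_kmers(sequences, k):
    """Count the distinct k-mers that occur anywhere in the given sequences."""
    distinct = set()
    for seq in sequences:
        # A sequence shorter than k contributes no windows at all.
        for i in range(len(seq) - k + 1):
            distinct.add(seq[i:i + k])
    return len(distinct)


# Toy stand-ins for protein sequences, not real CAFA data.
toy_sequences = ["MKTAYIAK", "MKTLLAYI", "AYAYA"]

# Sweep k and record the number of distinct k-mers for each size.
counts = {k: count_distinct_kmers(toy_sequences, k) for k in range(1, 9)}
for k, n in counts.items():
    print(k, n)
```

On toy data like this, the count eventually drops as k approaches the sequence lengths, simply because each sequence offers fewer windows. Whether the same effect contributes to the peak at 36 in the real data is something this sketch cannot settle.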

Prepared by author

From the output, we know that there are 814 proteins that have fewer than 36 amino acids in their sequence. These 814 proteins cannot yield a 36-mer, so they will not be subject to the analysis I am going to perform.

I also ran another code snippet to determine how many proteins have exactly 36 amino acids.

Prepared by author

From the output, only 24 proteins are exactly 36 amino acids long.

Since the number of proteins with fewer than 36 amino acids is relatively small compared with the total number of proteins, I would assume that future analysis can generalise well to protein sequences that are not covered in the train data set.
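The two length checks above could be done along these lines. This is only a sketch: `length_summary` is a hypothetical helper and the sequences are synthetic placeholders, so the counts it returns are unrelated to the 814 and 24 reported here.

```python
def length_summary(sequences, k=36):
    """Count sequences shorter than k, exactly k long, and in total."""
    lengths = [len(seq) for seq in sequences]
    return {
        "shorter": sum(n < k for n in lengths),
        "exact": sum(n == k for n in lengths),
        "total": len(lengths),
    }


# Synthetic placeholder sequences of lengths 10, 36, 36 and 50.
toy_sequences = ["A" * 10, "C" * 36, "D" * 36, "E" * 50]
summary = length_summary(toy_sequences, k=36)
print(summary)
```

A sequence of exactly k residues still yields one k-mer, which is why only the strictly shorter ones drop out of a 36-mer analysis.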

Intermission

In the next post, I will analyse the 36-mers to see whether a pattern exists for a designated function.

Stay tuned.
