Protein Function Prediction — CAFA Data Set

Simon Tse
Learn about Cancer with Code
3 min read · Sep 4, 2023
Courtesy: Laguna Design / Science Photo Library / Getty Images

In my last post, I wrote about a recent Kaggle competition on protein function prediction. The task is to predict a protein's Gene Ontology (GO) terms from its amino acid sequence. Most entries use some derivative of a large language model (LLM), and these achieve excellent scores on the public leaderboard. If you are interested in the public solutions, you may want to check out the competition website.

However, I am more interested in finding out which combinations of amino acids, and which relative positions, endow a protein with a particular function or functions. So I will take a different path.

To start, I will take a look at the dataset to see what is provided by the organiser.

Train Data Set Profile

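The train file contains 3,754,570 entries, and it includes proteins from humans, viruses, bacteria and other organisms. The sequences use 25 distinct letters rather than the 20 of the standard amino acids. The extra letters are typically rare residues (such as U for selenocysteine and O for pyrrolysine) and ambiguity codes (such as B, Z and X).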

Prepared by author
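As a minimal sketch of how the distinct letters could be counted, the snippet below tallies every character across a handful of sequences and reports those outside the 20 standard amino acids. The three sequences are toy examples made up for illustration, not rows from the competition file.

```python
from collections import Counter

# Toy sequences standing in for the real training data (hypothetical examples).
sequences = [
    "MKTAYIAKQR",            # only standard residues
    "MSUUKLXPQ",             # contains U and X, which are not standard letters
    "ACDEFGHIKLMNPQRSTVWY",  # the 20 standard amino acids, once each
]

# One-letter codes of the 20 standard amino acids.
STANDARD = set("ACDEFGHIKLMNPQRSTVWY")

# Count how often each letter occurs over all sequences.
letter_counts = Counter()
for seq in sequences:
    letter_counts.update(seq)

alphabet = set(letter_counts)
extra_letters = sorted(alphabet - STANDARD)

print(f"{len(alphabet)} distinct letters; non-standard: {extra_letters}")
```

Run against the full set of training sequences, the same loop would show which letters make up the alphabet and how rare the non-standard ones are.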

These 3,754,570 entries contain 29,706 unique GO terms.

Prepared by author

Among these more than three million entries, the shortest sequence has only 3 amino acids while the longest has over 35 thousand.

Prepared by author
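The profile numbers above (row count, unique GO terms, shortest and longest sequence) could be computed with pandas roughly as follows. This is only a sketch: the column names `EntryID`, `term` and `sequence` are assumptions about how the table is laid out, and the four rows are toy data.

```python
import pandas as pd

# Toy stand-in for the training table; column names are assumptions.
train = pd.DataFrame({
    "EntryID": ["P1", "P1", "P2", "P3"],
    "term": ["GO:0005575", "GO:0008150", "GO:0008150", "GO:0003674"],
    "sequence": ["MKT", "MKT", "MKTAYIAKQR", "ACDEFG"],
})

n_entries = len(train)                  # one row per (protein, GO term) pair
n_terms = train["term"].nunique()       # distinct GO terms
lengths = train["sequence"].str.len()   # sequence length per row
shortest, longest = lengths.min(), lengths.max()

print(f"{n_entries} entries, {n_terms} unique GO terms")
print(f"sequence length: min {shortest}, max {longest}")
```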

You may notice a variable named “aspect” in the file. It encodes three broad categories of protein function, namely:

  1. Molecular Function (MFO),
  2. Biological Process (BPO), and
  3. Cellular Component (CCO)

Prepared by author

You will notice that most entries fall under Biological Process (BPO). A protein can be annotated with more than one GO term, and those terms can fall under different aspects. Such a highly skewed data set makes classification harder.
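To make the skew and the multi-label nature concrete, here is a small sketch that computes the share of each aspect and lists the proteins annotated under more than one aspect. The column names `EntryID`, `term` and `aspect` are assumptions, and the rows are toy data rather than the real annotations.

```python
import pandas as pd

# Toy annotation table; column names and rows are illustrative assumptions.
train_terms = pd.DataFrame({
    "EntryID": ["P1", "P1", "P1", "P2", "P2", "P3"],
    "term": ["GO:0008150", "GO:0009987", "GO:0003674",
             "GO:0008150", "GO:0005575", "GO:0008150"],
    "aspect": ["BPO", "BPO", "MFO", "BPO", "CCO", "BPO"],
})

# Fraction of rows per aspect; a dominant class signals imbalance.
aspect_share = train_terms["aspect"].value_counts(normalize=True)

# Number of distinct aspects each protein is annotated under.
aspects_per_protein = train_terms.groupby("EntryID")["aspect"].nunique()
multi_aspect = aspects_per_protein[aspects_per_protein > 1].index.tolist()

print(aspect_share)
print("proteins with more than one aspect:", multi_aspect)
```

In a setting like this, each protein carries a set of labels rather than a single class, so per-aspect or per-term evaluation is usually more informative than plain accuracy.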

To give you an idea of what a GO term looks like, I extract the information with the following code snippet.

Prepared by author
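One possible way to collect this kind of information, assuming the term metadata comes from a Gene Ontology file in OBO format, is sketched below. The helper `parse_obo_terms` is a hypothetical name, and the embedded text is a tiny hand-written excerpt rather than the full ontology file.

```python
import pandas as pd

# A tiny hand-written excerpt in OBO format; the real file is far larger.
OBO_TEXT = """\
format-version: 1.2

[Term]
id: GO:0003674
name: molecular_function
namespace: molecular_function

[Term]
id: GO:0005575
name: cellular_component
namespace: cellular_component

[Term]
id: GO:0008150
name: biological_process
namespace: biological_process

[Typedef]
id: part_of
name: part of
"""

def parse_obo_terms(text):
    """Collect id, name and namespace from every [Term] stanza."""
    records, current = [], None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("["):
            # A new stanza begins; only [Term] stanzas are collected.
            current = {} if line == "[Term]" else None
            if current is not None:
                records.append(current)
        elif current is not None and ": " in line:
            key, value = line.split(": ", 1)
            if key in ("id", "name", "namespace"):
                current[key] = value
    return pd.DataFrame(records, columns=["id", "name", "namespace"])

go_terms = parse_obo_terms(OBO_TEXT)
print(go_terms)
```

With a real ontology file, the text would be read from disk first and the resulting frame could then be joined to the training table on the GO identifier.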

And then I prepare the following pandas data frame to capture the collected information.

You may notice that some GO terms have very similar identifiers, such as GO:0005582, GO:0005590 and GO:0005597, and that they all have to do with collagen trimers. I would hypothesise that the sequences annotated with these three GO terms share similarities in composition and/or physical structure that could be exploited in the prediction.
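One simple way to probe the composition part of this hypothesis is to turn each sequence into a vector of amino acid frequencies and compare the vectors with cosine similarity. The sketch below does that on three made-up sequences: two built from collagen-style Gly-X-Y repeats and one unrelated. The sequences and function names are illustrative assumptions, not real proteins from the data set.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(seq):
    """Relative frequency of each standard amino acid in seq."""
    counts = Counter(seq)
    total = sum(counts[a] for a in AMINO_ACIDS)
    return [counts[a] / total if total else 0.0 for a in AMINO_ACIDS]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Toy sequences (hypothetical): two Gly-X-Y style repeats and one unrelated.
collagen_like_a = "GPPGPAGPKGEP"
collagen_like_b = "GAPGPPGEKGPA"
unrelated = "MKWVTFISLLLLFSSAYS"

sim_related = cosine(composition(collagen_like_a), composition(collagen_like_b))
sim_unrelated = cosine(composition(collagen_like_a), composition(unrelated))

print(f"collagen-like pair: {sim_related:.2f}")
print(f"collagen-like vs unrelated: {sim_unrelated:.2f}")
```

Composition ignores residue order, so it can only capture part of the picture; position-aware features would be needed to test the structural side of the hypothesis.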

Intermission

In the next post, I will dig a little deeper into the sequences to see what they look like in terms of composition.

Stay tuned.
