Protein Function Prediction — CAFA Data Set
In my last post, I wrote about a recent Kaggle competition on protein function prediction. The task is to predict a protein's Gene Ontology (GO) terms from its sequence. Most entries use some derivative of an LLM, and they achieve excellent prediction accuracy on the public leaderboard. If you are interested in the public solutions, you may want to check out the competition website.
However, I am more interested in finding out which combinations of amino acids, and which relative positions, endow a protein with a particular function or functions. So I will take a different path.
To start, I will take a look at the dataset to see what the organiser provides.
Train Data Set Profile
The train file contains 3,754,570 entries, and it includes proteins from humans, viruses, bacteria and other organisms. The sequences use 25 distinct letters rather than the 20 standard amino acids. The extra letters are most likely the less common residues and ambiguity codes of the standard one-letter notation, such as U (selenocysteine), O (pyrrolysine), B, Z and X.
These 3,754,570 entries contain 29,706 unique GO terms.
Among these entries, the shortest sequence has only 3 amino acids, while the longest has over 35 thousand.
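A profile like this can be computed in a few lines. The sketch below is not the code used for this post: it assumes the sequences come in FASTA format, and it runs on three made-up toy sequences rather than the real file. The helper `read_fasta` and the toy data are purely illustrative.

```python
from collections import Counter
from io import StringIO


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Toy stand-in for the training sequences (hypothetical, not competition data).
toy_fasta = StringIO(
    ">P1 toy protein\n"
    "MKV\n"
    ">P2 toy protein\n"
    "MKTAYIAKQR\n"
    "QISFVK\n"
    ">P3 toy protein\n"
    "MXUAB\n"
)

sequences = dict(read_fasta(toy_fasta))
lengths = {name: len(seq) for name, seq in sequences.items()}

# Count every letter across all sequences to see which alphabet is in use.
alphabet = Counter("".join(sequences.values()))

# The 20 standard amino acids in one-letter code.
STANDARD = set("ACDEFGHIKLMNPQRSTVWY")
extra_letters = sorted(set(alphabet) - STANDARD)

print("number of sequences:", len(sequences))
print("shortest / longest:", min(lengths.values()), max(lengths.values()))
print("distinct letters:", len(alphabet))
print("non-standard letters:", extra_letters)
```

To run it on the real data, replace `toy_fasta` with an open file handle to the sequence file.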
You may notice a variable named “aspect” in the file. It encodes the three broad categories of the Gene Ontology, namely:
- Molecular Function (MFO),
- Biological Process (BPO), and
- Cellular Component (CCO)
Most of the entries fall under Biological Process (BPO). A protein can be annotated with more than one GO term, and those terms can fall under different aspects. Such a highly skewed data set makes classification harder.
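The skew and the multi-label nature of the data are easy to check with pandas. The following is a minimal sketch on a toy table; the column names `EntryID`, `term` and `aspect` are assumptions about the layout of the annotation file, and the rows are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the annotation table (hypothetical rows, assumed column names).
terms = pd.DataFrame(
    {
        "EntryID": ["P1", "P1", "P1", "P2", "P2", "P3"],
        "term": [
            "GO:0008150",  # biological_process
            "GO:0005575",  # cellular_component
            "GO:0003674",  # molecular_function
            "GO:0008150",
            "GO:0009987",
            "GO:0008150",
        ],
        "aspect": ["BPO", "CCO", "MFO", "BPO", "BPO", "BPO"],
    }
)

# Share of annotations per aspect: reveals how skewed the labels are.
aspect_share = terms["aspect"].value_counts(normalize=True)

# How many distinct aspects and GO terms each protein is annotated with.
aspects_per_protein = terms.groupby("EntryID")["aspect"].nunique()
terms_per_protein = terms.groupby("EntryID")["term"].nunique()

print(aspect_share)
print(aspects_per_protein)
print(terms_per_protein)
```

On the real table, a protein with `aspects_per_protein` greater than 1 carries labels from more than one aspect, which is why this is a multi-label rather than a single-label problem.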
To give you an idea of what a GO term looks like, I extract the information with the following code snippet.
I then prepare the following pandas data frame to capture the collected information.
You may notice that some GO terms have very similar identifiers, such as GO:0005582, GO:0005590 and GO:0005597, and that they all have to do with collagen trimers. I would hypothesise that the sequences annotated with these three GO terms share similarities in composition and/or physical structure that could be exploited in the prediction.
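As a rough illustration of both steps, here is a minimal sketch, not necessarily the exact code used for this post. It assumes the GO term descriptions come from an ontology file in OBO format (for example a `go-basic.obo` file, which is an assumption here). The helper `parse_obo_terms` is hypothetical, and the two stanzas below are toy input with placeholder definition text.

```python
from io import StringIO

import pandas as pd


def parse_obo_terms(handle, fields=("id", "name", "namespace", "def")):
    """Collect selected tags from each [Term] stanza of an OBO stream."""
    records, current = [], None
    for raw in handle:
        line = raw.strip()
        if line.startswith("["):
            # A new stanza starts; store the previous [Term], if any.
            if current:
                records.append(current)
            # Only [Term] stanzas are collected; others (e.g. [Typedef]) are skipped.
            current = {} if line == "[Term]" else None
        elif current is not None and ": " in line:
            key, value = line.split(": ", 1)
            # Keep the first occurrence of each requested tag.
            if key in fields and key not in current:
                current[key] = value
    if current:
        records.append(current)
    return records


# Toy OBO content; the definition strings are placeholders, not the real ones.
toy_obo = StringIO(
    "format-version: 1.2\n"
    "\n"
    "[Term]\n"
    "id: GO:0003674\n"
    "name: molecular_function\n"
    "namespace: molecular_function\n"
    'def: "Toy definition text." []\n'
    "\n"
    "[Term]\n"
    "id: GO:0005575\n"
    "name: cellular_component\n"
    "namespace: cellular_component\n"
    'def: "Toy definition text." []\n'
    "\n"
    "[Typedef]\n"
    "id: part_of\n"
    "name: part of\n"
)

# One row per GO term, indexed by its identifier.
go_terms = pd.DataFrame(parse_obo_terms(toy_obo)).set_index("id")
print(go_terms[["name", "namespace"]])
```

With the real ontology file, `go_terms.loc["GO:0005582"]` would return the name, namespace and definition of that term, which makes it easy to inspect neighbouring identifiers side by side.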
Intermission
In the next post, I will dig a little deeper into the sequences to see what they look like in terms of composition.
Stay tuned.