Protein Function Prediction

Simon Tse
Learn about Cancer with Code
3 min read · Aug 21, 2023
Courtesy: Laguna Design / Science Photo Library / Getty Images

After a hiatus, it is time to get back to my exploration of cancer biology. In past posts, I explored the mechanisms of cancer development through text published in sources such as KEGG and journal papers. Now I would like to shift the focus to the individual components that contribute to cancer development, such as protein interactions.

Overview

I believe those who have worked in bioinformatics for some time are familiar with the approaches used over the past decade. For example, quantitative structure–activity relationship (QSAR) models were the default approach before deep learning came onto the scene. QSAR models are regression or classification models used in the chemical and biological sciences and engineering. Like other regression models, QSAR regression models relate a set of “predictor” variables (X) to the potency of the response variable (Y), while classification QSAR models relate the predictor variables to a categorical value of the response variable. In QSAR modeling, the predictors consist of physico-chemical properties or theoretical molecular descriptors of chemicals, and the response variable could be a biological activity of those chemicals. QSAR models work in two steps: first, they summarise a supposed relationship between chemical structure and biological activity in a data set of chemicals; second, they predict the activities of new chemicals.
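To make the two steps concrete, here is a minimal sketch of a QSAR-style regression in plain Python. Everything in it is an assumption for illustration: the two descriptors (a logP-like value and a scaled molecular weight), the training values, and the function names are made up, and ordinary least squares stands in for whatever model a real QSAR study would use.

```python
# Toy QSAR regression: activity = w0 + w1 * x1 + w2 * x2, fitted by least squares.
# All descriptor names and values below are invented for illustration.

def solve_linear_system(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], so each row operation updates both sides.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_qsar(descriptors, activities):
    """Step 1: summarise the descriptor-activity relationship as weights."""
    rows = [[1.0] + list(d) for d in descriptors]  # prepend an intercept term
    k = len(rows[0])
    # Normal equations: (X^T X) w = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, activities)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict_activity(weights, descriptor):
    """Step 2: predict the activity of a new chemical from its descriptors."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], descriptor))


# Hypothetical training set: (logP-like value, molecular weight / 100).
train_x = [(1.0, 2.0), (2.0, 1.5), (3.0, 3.0), (4.0, 2.5), (5.0, 4.0)]
# Synthetic activities generated from a known rule, so the fit can recover it.
train_y = [0.5 + 1.2 * a - 0.3 * b for a, b in train_x]

weights = fit_qsar(train_x, train_y)
predicted = predict_activity(weights, (2.5, 2.0))
print(round(predicted, 3))
```

Because the synthetic activities follow an exact linear rule, the fitted weights recover that rule. Real activity data would be noisy, and a real study would hold out chemicals to check how well the predictions generalise.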

These days deep learning has become the de facto approach, ever since AlphaFold 1 placed first in the overall rankings of the 13th Critical Assessment of Structure Prediction (CASP) in December 2018. A team using AlphaFold 2 repeated that placement in the CASP competition in November 2020, achieving a level of accuracy much higher than any other group. It scored above 90 on CASP’s global distance test (GDT) for around two-thirds of the proteins. [1] Although AlphaFold beat the other competitors by a large margin in both competitions, it also drew criticism: its accuracy was not high enough for a third of its predictions, and it did not reveal the mechanism or rules of protein folding, which would be needed for the protein folding problem to be considered solved. So, from the point of view of professionals, the question of what determines protein folding or function remains unanswered.
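For readers unfamiliar with the GDT score, here is a rough sketch of how a GDT_TS-style number can be computed. It averages the percentage of residues whose predicted position lies within 1, 2, 4 and 8 Å of the experimental position. The sketch assumes the two structures have already been superimposed and that the per-residue distances are given. The real GDT procedure searches over many superpositions to maximise each percentage, which this simplification skips. The function name and the example distances are hypothetical.

```python
def gdt_ts(distances):
    """Return a GDT_TS-style score (0 to 100) from per-residue distances in angstroms.

    Assumes the predicted and experimental structures are already superimposed.
    """
    if not distances:
        raise ValueError("need at least one residue distance")
    cutoffs = (1.0, 2.0, 4.0, 8.0)
    # Fraction of residues within each cutoff, then the mean over the cutoffs.
    fractions = [sum(d <= c for d in distances) / len(distances) for c in cutoffs]
    return 100.0 * sum(fractions) / len(cutoffs)


# Hypothetical distances between predicted and experimental C-alpha atoms.
example_distances = [0.4, 0.9, 1.5, 1.8, 3.2, 3.9, 6.0, 12.0]
score = gdt_ts(example_distances)
print(score)
```

Under this simplified definition, a score above 90 requires nearly every residue to sit within the tighter cutoffs, not just within 8 Å.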

Approach

First, I have to declare that I am no expert in bioinformatics models, so I won’t claim that I can tell a different story from the professionals. However, I am still interested in understanding the underlying mechanism, because this is an important step if I am going to explore drug targeting in the future. That is why I will focus on exploring the rules or mechanisms that determine a given protein’s function.

In future articles under this heading, I will cover how I would determine protein function from protein sequence. The data sets come from a recent competition, CAFA 5 Protein Function Prediction, hosted by Kaggle. [2] If you are interested in this topic, you may want to explore the competition a little. (Most people are using large language models repurposed for this task, and those models are hosted on Hugging Face.)
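As a taste of what "function from sequence" means in practice, here is a small sketch that turns protein sequences into numeric features. It assumes the sequences arrive as FASTA-formatted text, and it uses amino-acid composition as a deliberately simple stand-in for the language-model embeddings mentioned above. The identifiers and sequences are made up, not taken from the competition data.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of {identifier: sequence}."""
    records = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the ID.
            current_id = line[1:].split()[0]
            records[current_id] = ""
        elif current_id is not None:
            records[current_id] += line  # sequences may span several lines
    return records


def composition(sequence):
    """Return the fraction of each standard amino acid as a 20-element list."""
    if not sequence:
        return [0.0] * len(AMINO_ACIDS)
    counts = Counter(sequence)
    return [counts[aa] / len(sequence) for aa in AMINO_ACIDS]


# Hypothetical FASTA input; these are not real database entries.
fasta_text = """>prot_A example one
MKVLLA
GGA
>prot_B example two
MMMM
"""

sequences = parse_fasta(fasta_text)
features = {pid: composition(seq) for pid, seq in sequences.items()}
print(sequences["prot_A"], len(features["prot_A"]))
```

A feature vector like this could feed any classifier. A learned embedding would replace `composition` while the rest of the pipeline keeps the same shape.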

So I hope you will enjoy my upcoming articles. Thanks again. Ciao for now.

