Critical Assessment of Functional Annotation (CAFA) Kaggle Competition Review

Abish Pius

Published in

Computational Biology Papers

7 min read · May 9, 2023

Kaggle Competition: CAFA 5 Protein Function Prediction

1. Proteins

Proteins are vital biological molecules that perform various functions in cells, tissues, and organisms. They make up a significant portion of cell weight and are involved in activities such as catalysis, muscle contraction, structural support, defense against pathogens, signaling, regulation, protein folding assistance, and storage. Proteins function within a complex environment that includes other macromolecules like DNA and RNA, as well as small chemical compounds and environmental factors like temperature and pH. The modulation of protein function by these factors has been extensively studied.

Protein structure and dynamics

Proteins are linear chains of amino acids connected by peptide bonds. Their chemical organization is simple: a common backbone with variable side chains. Short chains are usually called peptides and longer ones proteins. The 3D conformation (shape) of a protein is determined by its amino acid sequence and is best described as a probability distribution over conformations, or equivalently an energy landscape. Structures of proteins with a single dominant conformation can be solved using X-ray crystallography or NMR spectroscopy and are available in the Protein Data Bank (PDB). Proteins with quite different sequences can still have highly similar 3D structures (folds). Proteins are not rigid: they undergo conformational changes over time, and these changes are often important for function. Some proteins exhibit intrinsic disorder, meaning that their conformational distribution lacks a single dominant peak. Intrinsically disordered proteins can contain both ordered and disordered regions and are listed in the DisProt database. Proteins can also form complexes of multiple chains, such as hemoglobin.

Ontological annotation of proteins

To facilitate the exchange of knowledge and enable computer processing of protein function data, natural language descriptions of protein function are typically mapped onto ontologies. An ontology is a hierarchical representation, such as a tree or a directed acyclic graph, in which nodes (terms) describe specific activities and links represent relationships between terms. The Gene Ontology (GO) Consortium has developed three independent ontologies: the Molecular Function Ontology (MFO), the Biological Process Ontology (BPO), and the Cellular Component Ontology (CCO). These ontologies are species-independent and aim to unify biology by providing a common knowledge representation. Species-specific ontologies, such as the Disease Ontology (DO) and the Human Phenotype Ontology (HPO), address organismal phenotypes and are used for the phenotypic annotation of proteins. Researchers may mean different things by "protein function" depending on their focus. Biochemists often concentrate on the MFO level, which covers enzymatic activities and structural or mechanistic aspects. Functional genomics researchers focus on the BPO level, which covers pathways and high-level cellular processes. Protein-protein interaction data can provide insights into protein function, but the aspect of function being considered must be specified. Biocurators play a crucial role in annotating protein function: they interpret the experimental literature and summarize it in ontological form. They are essential alongside experimentalists and computer scientists for accurate and comprehensive functional annotation.
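To make the directed acyclic graph idea concrete, here is a minimal Python sketch. The term names and the parent links are hypothetical and are not real GO identifiers or relationships; the point is only to show how annotating a protein with one term implies all of that term's ancestors.

```python
from collections import deque

# Hypothetical toy ontology: each term maps to its direct parents.
# Names and links are illustrative only, not real GO terms.
PARENTS = {
    "molecular_function": [],
    "binding": ["molecular_function"],
    "catalytic_activity": ["molecular_function"],
    "nucleic_acid_binding": ["binding"],
    "dna_binding": ["nucleic_acid_binding"],
    "nuclease_activity": ["catalytic_activity"],
    # Two parents: this is what makes the structure a DAG rather than a tree.
    "dna_nuclease": ["nuclease_activity", "dna_binding"],
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable from `term` by following parent links."""
    seen = set()
    queue = deque(parents[term])
    while queue:
        current = queue.popleft()
        if current not in seen:
            seen.add(current)
            queue.extend(parents[current])
    return seen


def propagate(terms, parents=PARENTS):
    """Close a set of annotated terms under the ancestor relation."""
    closed = set(terms)
    for term in terms:
        closed |= ancestors(term, parents)
    return closed
```

Calling propagate({"dna_binding"}) on this toy graph returns the term itself plus "nucleic_acid_binding", "binding", and the root "molecular_function".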

CAFA Data representation

Proteins can be represented as strings using an alphabet of 20 symbols corresponding to the amino acids they are composed of. This allows for the application of string algorithms and data structures in computational biology, including sequence alignment, which is essential for protein function prediction. Proteins can also be represented as graphs or networks, such as protein-protein interaction networks or gene regulatory networks. Geometric representations using 3D coordinates of atoms in the protein structure are useful for computer vision techniques. Proteins can be viewed as time series data, such as hydropathy profiles or gene expression measurements. Protein function prediction is complex due to events like alternative splicing and post-translational modifications, which result in different protein forms with distinct functions.
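As a rough illustration of the string and time-series views, the sketch below validates a sequence against the 20-letter alphabet and computes a sliding-window hydropathy profile. The choice of the Kyte-Doolittle scale, the window size, and the function names are assumptions made for this example; the competition itself does not prescribe them.

```python
# The 20 standard amino acids, one letter each.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy scale (assumed here as one common choice).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def is_valid_protein(sequence):
    """True if the string is non-empty and uses only the 20 standard letters."""
    return len(sequence) > 0 and set(sequence) <= set(AMINO_ACIDS)


def hydropathy_profile(sequence, window=5):
    """Mean hydropathy over each window of consecutive residues."""
    if not is_valid_protein(sequence):
        raise ValueError("sequence contains non-standard symbols or is empty")
    if window < 1 or window > len(sequence):
        raise ValueError("window must be between 1 and the sequence length")
    values = [KYTE_DOOLITTLE[aa] for aa in sequence]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]
```

A sequence of length n yields n - window + 1 values, so the profile can be treated like any other one-dimensional signal.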

Protein data is stored in various biomedical databases, with Nucleic Acids Research featuring a special issue each year to showcase new and updated databases. Important databases for protein function prediction include UniProtKB for sequence and functional annotation, Pfam for sequence families, PDB for structures, and I2D for protein-protein interactions, among others.

2. Protein Function Prediction Problem

Protein function prediction is the task of predicting, for a given protein, a consistent subgraph of an ontology, that is, a set of terms that also contains every ancestor of each term in it. Predictors are typically learned from a dataset of proteins with known annotations. Candidate gene prioritization, by contrast, aims to rank genes for a specific term. The two problems differ in input, output, methodology, and evaluation.

Why is protein function prediction challenging?

Protein function prediction faces various biological and computational challenges:

Biological Challenges:

Protein function is determined in the context of the organism and often requires multiple experiments, which may not be feasible for all organisms.
Some experiments are performed in vitro and may not accurately reflect the protein’s function in vivo, especially if post-translational modifications are involved.
Existing function data in biological databases is incomplete, biased, and noisy due to errors, curation issues, and experimental limitations.
Biological databases are gene-centric, making it difficult to associate specific functions with individual protein forms.

Computational Challenges:

Protein function prediction can be viewed as a multi-label classification or structured-output learning problem, where the goal is to output a consistent subgraph of the ontology.
Integrating diverse biological data, such as sequence, structure, and interactions, poses a challenge in data analysis and prediction.
Statistical learning on one species and making inferences about another species is challenging due to the lack of sufficient and diverse data for many organisms.
Evaluating performance requires developing similarity functions between pairs of consistent subgraphs in the ontology, considering the relationships and resolution differences between ontology terms.
Ontologies are large, containing thousands of terms, but proteins are typically annotated with only a small number of terms. Understanding the distribution of annotations and the significance of the number of terms is important.
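The requirement to output a consistent subgraph can be sketched as a post-processing step on per-term scores. The approach below, raising each ancestor's score to at least that of its highest-scoring descendant, is one common way to do this and is an assumption of this example rather than something the competition mandates. The function names and the toy parent map are hypothetical.

```python
def make_consistent(scores, parents):
    """Raise every ancestor's score to at least the score of any descendant.

    `scores` maps term -> score; `parents` maps term -> list of direct parents.
    Thresholding the result at any cutoff yields a consistent subgraph.
    """
    consistent = dict(scores)

    def lift(term, score):
        # Push `score` upward until an ancestor already scores at least as high.
        for parent in parents.get(term, []):
            if consistent.get(parent, 0.0) < score:
                consistent[parent] = score
                lift(parent, score)

    for term, score in scores.items():
        lift(term, score)
    return consistent


def is_consistent(scores, parents):
    """True if no term scores higher than any of its direct parents."""
    return all(
        scores.get(parent, 0.0) >= score
        for term, score in scores.items()
        for parent in parents.get(term, [])
    )
```

For example, with parents {"a": ["root"], "b": ["a"]} and raw scores {"b": 0.9, "a": 0.2, "root": 0.5}, both "a" and "root" are lifted to 0.9.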

Why is protein function prediction important?

Protein function prediction is vital for understanding molecular life processes, disease mechanisms, and guiding experimental and therapeutic endeavors. The increasing gap between annotated and unannotated sequences highlights the importance of computational annotation. The field also contributes to statistical and computational advancements and can be applied beyond biology.

Evaluation of protein function prediction algorithms

In the first CAFA experiment (2010–2011), two types of evaluation were used: protein-centric and term-centric. Protein-centric evaluation is relevant to protein function prediction, while term-centric evaluation suits gene prioritization scenarios.

In protein function prediction, the output of a predictor is a score for each term in the ontology. A decision threshold is applied to determine the predicted terms based on the scores. To assess the quality of prediction, a similarity function is calculated between the predicted terms and the experimentally determined terms for each protein in the evaluation set.

For a single protein at a given threshold, precision is the number of true positives (predicted terms that are also experimentally determined) divided by the total number of predicted terms, while recall is the number of true positives divided by the total number of experimentally determined terms. Average precision and recall over a set of proteins are then computed from these per-protein values.
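As a small illustration of these definitions, the following sketch computes precision and recall for one protein. The function name and the dictionary-based input format are assumptions made for this example.

```python
def precision_recall(scores, true_terms, threshold):
    """Precision and recall for one protein at one decision threshold.

    `scores` maps term -> predicted score; `true_terms` is the set of
    experimentally determined terms. Either value is None when its
    denominator is zero (no predicted terms, or no true terms).
    """
    predicted = {term for term, score in scores.items() if score >= threshold}
    true_positives = len(predicted & true_terms)
    precision = true_positives / len(predicted) if predicted else None
    recall = true_positives / len(true_terms) if true_terms else None
    return precision, recall
```

With scores {"a": 0.9, "b": 0.6, "c": 0.3}, true terms {"a", "c", "d"}, and a threshold of 0.5, the predicted set is {"a", "b"}, giving a precision of 1/2 and a recall of 1/3.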

The two averages are taken over different sets of proteins. Average precision is computed only over the proteins for which at least one term is predicted above the threshold, whereas average recall is computed over all test proteins.

A precision-recall curve characterizes the performance of a prediction model across different thresholds. To obtain a single-score evaluation, the maximum F-measure over all thresholds (Fmax) is used. The F-measure is the harmonic mean of precision and recall, so it is high only when both are high.
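The procedure described above can be sketched in a few lines of Python. This is a simplified, unweighted version that assumes every test protein has at least one true term; the function name, the input format, and the default threshold grid are assumptions of this example, not the official evaluation code.

```python
def fmax(predictions, truth, thresholds=None):
    """Maximum F-measure over thresholds, with protein-centric averaging.

    `predictions` maps protein -> {term: score}; `truth` maps
    protein -> non-empty set of experimentally determined terms.
    Returns (best F-measure, threshold at which it is reached).
    """
    if thresholds is None:
        thresholds = [i / 100 for i in range(1, 101)]  # 0.01, 0.02, ..., 1.00
    best_f, best_threshold = 0.0, None
    for threshold in thresholds:
        precisions = []
        recall_sum = 0.0
        for protein, true_terms in truth.items():
            scores = predictions.get(protein, {})
            predicted = {t for t, s in scores.items() if s >= threshold}
            true_positives = len(predicted & true_terms)
            # Precision is averaged only over proteins with a prediction.
            if predicted:
                precisions.append(true_positives / len(predicted))
            # Recall is averaged over all test proteins.
            recall_sum += true_positives / len(true_terms)
        if not precisions:
            continue
        avg_precision = sum(precisions) / len(precisions)
        avg_recall = recall_sum / len(truth)
        if avg_precision + avg_recall > 0:
            f = 2 * avg_precision * avg_recall / (avg_precision + avg_recall)
            if f > best_f:
                best_f, best_threshold = f, threshold
    return best_f, best_threshold
```

Because precision ignores proteins with no predictions while recall does not, a high threshold can raise average precision and still be penalized through lower average recall.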

Protein function at a residue level

Understanding the complete picture of protein function requires considering the mechanistic aspects of its activity and identifying the specific residues involved in particular functions. Assigning ontological terms to individual residues allows us to identify functional residues such as DNA-binding residues, catalytic residues, post-translationally modified sites, ligand-binding residues, protein-protein interaction residues, hot spots, and metal-binding residues. Computational approaches have been developed to predict functional residues, and these predictions can be incorporated into the prediction of protein function at the whole-molecule level.

3. The CAFA challenge

The Critical Assessment of Functional Annotation (CAFA) is a challenge that evaluates computational protein function prediction methods in an unbiased manner. Participants are given a set of unannotated or incompletely annotated proteins and asked to predict their functional annotations. Predictions must be submitted before a deadline; afterwards, experimentally validated annotations accumulate in UniProtKB, and after a few months the methods are evaluated on the newly annotated proteins. Because the ground truth does not exist at submission time, the design provides a fair evaluation of methods and yields insights into their performance and effectiveness. It is an open challenge in which participants may use any data they find useful. More information about CAFA can be found at http://biofunctionprediction.org.

The main outcomes of CAFA 1 (2010–2011)

The conclusions of the experiment were:

Algorithms developed in the 2000s outperformed traditional function transfer using BLAST.
Performance in the MFO category was useful for guiding biological experiments, but performance in the BPO category, especially for eukaryotic species, was below expectations.
Sequence similarity between a target protein and the most similar annotated protein was not a reliable indicator of overall prediction quality.
Evaluating protein function prediction is challenging due to data biases and incomplete annotations.
The best-performing methods typically utilized multiple types of data, but there were exceptions.
Although some good prediction methods exist, there is a lack of regularly updated tools for practical use in experimental biology.
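For context on the first conclusion, "function transfer" means copying annotations from similar, already annotated sequences. The sketch below shows one simple way such a baseline could be scored, assuming sequence search has already been run and its hits are available as (hit identifier, percent identity) pairs. It does not call BLAST itself, and the function name, the input format, and the use of maximum identity as the score are all assumptions of this example.

```python
def transfer_annotations(hits, annotations):
    """Score each term by the best identity among hits annotated with it.

    `hits` is a list of (hit_id, percent_identity) pairs for one target,
    assumed to come from a prior sequence search. `annotations` maps
    hit_id -> set of terms. Scores are scaled to the range [0, 1].
    """
    scores = {}
    for hit_id, percent_identity in hits:
        for term in annotations.get(hit_id, ()):
            scores[term] = max(scores.get(term, 0.0), percent_identity / 100.0)
    return scores
```

With hits [("Q1", 80.0), ("Q2", 45.0)], where the hypothetical protein Q1 carries terms {"a", "b"} and Q2 carries {"b", "c"}, the target receives scores of 0.8 for "a", 0.8 for "b", and 0.45 for "c".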

What is new in CAFA 2 (2013–2014)

CAFA 2 introduces several novel aspects that expand the challenge and allow a more detailed evaluation of protein function prediction tools. Two ontologies are added: CCO and HPO. The evaluation includes partially annotated targets, both to explore whether existing annotations help inference and to evaluate predictions on partially annotated proteins. Methods are re-trained on updated data, which helps distinguish improvements due to better data from improvements due to more powerful algorithms. Predictions from all CAFA experiments are stored for future assessments so that the progress of the field can be tracked. Participating groups may remain anonymous, but the performance of published methods is not kept anonymous, and the top 10 ranked methods are de-anonymized.
