Intro to CASP for Machine Learning Researchers

Published in Deep Learning for Protein Design · 9 min read · Jul 22, 2020

This is the first post in a series on machine learning for protein structure prediction and protein design.

I wrote this to bring machine learning researchers up to speed on protein structure prediction, though it may also be helpful for others new to CASP (the Critical Assessment of protein Structure Prediction). I’ll explain the various problems in CASP13 in terms of their inputs, outputs, and metrics of success. Most problems have multiple metrics, but I will only list the default metric on the CASP13 results page. I chose CASP13 because the evaluation metrics for CASP14 are not yet available. Further challenges have since been introduced, but this post should give you enough background to learn about the changes on your own.

At the end of this post, there is a glossary of chemistry terms and acronyms common in the protein structure prediction vernacular.

A general overview of protein structure prediction is given in DeepMind’s blog post, so I’ll leave that to them and focus on defining the specific CASP problems. Most of the content is compiled from various pages in the CASP website and from research papers.

The CASP13 Problems

In CASP, the goal is to predict protein structures that closely resemble experimentally determined structures. The “ground truth” is determined by methods like X-ray crystallography and cryo-electron microscopy. Though the general purpose of CASP is protein structure prediction, the specific challenges vary from year to year. CASP13 had 7 categories:

High Accuracy Modeling (Template-based Modeling/TBM)

Objective: Precisely predict folded tertiary structure of a protein from its primary sequence, based on a similar template protein with known 3D structure.
Inputs: The primary amino acid sequence of a protein (provided by CASP), and in most cases, a template protein (easily found by sequence homology detection methods like BLAST) and its known 3D structure. Note that the FM/TBM domains are not distinguished on the Target List page, so you should submit predictions for all targets. See an example input here (note: the “template” on this page is a submission template, not a template protein structure).
Outputs: 3D coordinates for all of the non-hydrogen atoms in the folded tertiary structure (in TS format, the standard format for the PDB). See Example 1 on this page.
Metric: Global Distance Test Total Score (GDT_TS)/Z-score. First, the GDT_TS is computed for each model with distance cutoffs at 1, 2, 4, and 8 Å (angstroms). Next, for each target, the mean and standard deviation of the GDT_TS scores are calculated, and each model is assigned a z-score under a normal distribution. Z-scores below a floor (either -2.0 or 0) are raised to that floor. Finally, the z-scores for each group are summed across all targets. This is an improvement over raw GDT_TS because it weights difficult proteins similarly to easy proteins, whereas raw GDT_TS weights scores on easy targets more heavily. See this article for a description of GDT_TS, and Fig. 2 in this article for more info on how z-scores are calculated. CASP13 GDT_TS rankings are here; make sure to select only “TBM-easy” and “TBM-hard”, then click “Show”.
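To make the GDT_TS part of this metric concrete, here is a minimal sketch in Python. It assumes the model and reference C-alpha coordinates are already superimposed and matched residue by residue. The real GDT procedure searches over many superpositions to maximize the fraction of residues under each cutoff, so this single-superposition version should be read as a simplified illustration. The names `gdt_ts_fixed_superposition` and `CUTOFFS` are hypothetical, not part of any CASP tool.

```python
import math

# Distance cutoffs in angstroms, as listed in the metric description.
CUTOFFS = (1.0, 2.0, 4.0, 8.0)


def gdt_ts_fixed_superposition(model_ca, ref_ca, cutoffs=CUTOFFS):
    """Simplified GDT_TS for one fixed superposition (assumption of this sketch).

    model_ca, ref_ca: equal-length lists of (x, y, z) C-alpha coordinates,
    already superimposed and in the same residue order.
    Returns a score between 0 and 100.
    """
    if len(model_ca) != len(ref_ca) or not ref_ca:
        raise ValueError("coordinate lists must be non-empty and equal length")

    # Per-residue distance between the model and the reference.
    dists = [math.dist(m, r) for m, r in zip(model_ca, ref_ca)]

    # Fraction of residues within each cutoff, then the average over cutoffs.
    fractions = [sum(d <= c for d in dists) / len(dists) for c in cutoffs]
    return 100.0 * sum(fractions) / len(fractions)
```

A model identical to the reference scores 100, and a model with no residue within 8 Å scores 0.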

Topology (Free Modeling/FM)

Objective: Predict folded tertiary structure of a protein from the primary sequence without the assistance of template proteins.
Inputs: The primary amino acid sequence of a protein (provided by CASP). See an example input here (note: the “template” on this page is a submission template, not a template protein structure). Note that the FM/TBM domains are not distinguished on the Target List page, so you should submit predictions for all targets.
Outputs: 3D coordinates for all of the non-hydrogen atoms in the folded tertiary structure (in TS format, the standard format for the PDB). See Example 1 on this page.
Metric: Global Distance Test Total Score (GDT_TS)/Z-score. See description above under “High Accuracy Modeling”. CASP13 GDT_TS rankings are here — make sure to select only “FM”, then click “Show”.
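The z-score ranking described under “High Accuracy Modeling” can be sketched as follows. This is a minimal illustration that assumes one GDT_TS score per group per target and uses the population standard deviation. The official procedure may include further steps, such as outlier handling, which are omitted here. The function names and the data layout are hypothetical.

```python
import statistics


def floored_z_scores(scores_by_group, floor=-2.0):
    """Z-score each group's score for a single target, then apply a floor.

    scores_by_group: dict mapping group name -> raw score (e.g. GDT_TS).
    floor: minimum allowed z-score (the post mentions -2.0 or 0).
    """
    values = list(scores_by_group.values())
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # All groups scored the same, so every z-score is zero.
        return {group: max(0.0, floor) for group in scores_by_group}
    return {
        group: max((score - mean) / std, floor)
        for group, score in scores_by_group.items()
    }


def rank_groups(scores_by_target, floor=-2.0):
    """Sum floored z-scores per group across targets, best group first.

    scores_by_target: dict mapping target id -> {group name: raw score}.
    """
    totals = {}
    for target_scores in scores_by_target.values():
        for group, z in floored_z_scores(target_scores, floor).items():
            totals[group] = totals.get(group, 0.0) + z
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

Because each target is standardized separately, a 5-point GDT_TS advantage on a hard target, where scores are tightly clustered, can count as much as a 20-point advantage on an easy one.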

Data Assisted

Objective: Improve protein structure prediction with auxiliary experimental data other than X-ray crystallography data. This can include SAXS, NMR, cross-link, or SANS data.
Inputs: The primary amino acid sequence of a protein (provided by CASP), along with auxiliary data in some form. See an example input here (note: the “template” on this page is a submission template, not a template protein structure).
Outputs: 3D coordinates for all non-hydrogen atoms in the molecular structure (in TS format, the standard format for the PDB).
Metric: GDT_TS/Z-score. See description above under “High Accuracy Modeling”. CASP13 Data Assisted rankings are here.

Refinement

Objective: To improve structure predictions by structure refinement techniques (predominantly, via molecular dynamics simulations).
Inputs: A primary amino acid sequence and 3D coordinates for all of the non-hydrogen atoms from its folded tertiary structure (in TS format, the standard format for the PDB). Note — this is usually an easily-modeled subset of a target sequence which excludes portions of the sequence that are disordered in the crystal structure. See an example here.
Outputs: (Refined) 3D coordinates for all non-hydrogen atoms in the molecular structure (in TS format, the standard format for the PDB).
Metric: GDT_TS/Z-score, Assessors’ Formula/Z-score. For GDT_TS/Z-score, see description above under “High Accuracy Modeling”. There is also the Assessors’ Formula, which uses stricter accuracy scores; see the links on the Assessors’ Formula here for a description. CASP13 Refinement rankings are here and here.

Contact Prediction (Residue to Residue)

Objective: Predict whether residues are in contact with each other in the folded tertiary structure.
Inputs: The primary amino acid sequence of a protein (provided by CASP). See an example input here (note: the “template” on this page is a submission template, not a template protein structure). Contact prediction is only evaluated on the Topology/FM inputs; none of the template-based proteins are considered. However, the FM and TBM domains are not distinguished on the Target List page, so you should submit predictions for all targets.
Outputs: For each pair of residues, a probability (in the range [0, 1]) of whether the C-beta atoms (C-alpha in the case of glycine) of those residues are within 8 angstroms of each other in the folded tertiary structure (in RR format, see Example 3 on this page).
Metric: F1 score/ES (Entropy Score)/Z-score. The metrics for contact prediction are complicated. First, they only consider medium-range, long-range, and extra-long-range contacts, that is, only contacts between residues that are far apart in the sequence (for example, more than 10 residues apart). Then, for a given range (for example, the medium-range pairs), they take the top-N predicted contact probabilities and compare those pairs to the true top-N closest pairs (most commonly N = L/5, where L is the total sequence length) with standard confusion metrics (precision, recall, F1 score, etc.). Then, ES is calculated as described in Assessing the accuracy of contact predictions in CASP13. For each score (F1 and ES), the Z-scores are computed according to the procedure described above in the High Accuracy Modeling category. The final prediction score is 1.0*Z-score(F1) + 0.5*Z-score(ES), and groups are ranked according to the sum of their prediction scores across all target proteins. CASP13 Contact Prediction rankings are here and here.
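The contact definition and the top-L/5 evaluation can be sketched in a few lines of Python. This is a rough illustration under several assumptions: coordinates are C-beta positions (C-alpha for glycine), predictions are given as a dict of pair probabilities already restricted to the sequence-separation range of interest, and the top-N predictions are compared against the set of true contacts under the 8 Å definition. The `min_separation` default is a placeholder, so check the assessment paper for the exact range boundaries. All function names are hypothetical.

```python
import math


def true_contacts(coords, threshold=8.0, min_separation=12):
    """Residue pairs (i, j) closer than `threshold` angstroms.

    coords: list of (x, y, z) for C-beta atoms (C-alpha for glycine).
    min_separation: minimum sequence distance j - i (assumed value).
    """
    contacts = set()
    for i in range(len(coords)):
        for j in range(i + min_separation, len(coords)):
            if math.dist(coords[i], coords[j]) < threshold:
                contacts.add((i, j))
    return contacts


def top_n_metrics(predicted_probs, contacts, seq_len, divisor=5):
    """Precision, recall and F1 for the top L/divisor predicted pairs.

    predicted_probs: dict mapping (i, j) -> predicted contact probability.
    contacts: set of true contact pairs, e.g. from true_contacts().
    """
    n = max(1, seq_len // divisor)
    ranked = sorted(predicted_probs, key=predicted_probs.get, reverse=True)[:n]
    true_positives = sum(pair in contacts for pair in ranked)
    precision = true_positives / len(ranked) if ranked else 0.0
    recall = true_positives / len(contacts) if contacts else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def prediction_score(z_f1, z_es):
    """Final per-target score as given in the post: 1.0*Z(F1) + 0.5*Z(ES)."""
    return 1.0 * z_f1 + 0.5 * z_es
```

The entropy score (ES) itself is not implemented here; `prediction_score` only combines Z-scores that have already been computed.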

Assembly Prediction (aka Multimers)

Objective: Predict the quaternary structure of a multiple-chain protein.
Inputs: Two or more primary sequences (in FASTA format) and a “Stoichiometry” variable giving the expected type of quaternary structure (e.g. A3 for a homotrimer, A3B1 for a tetramer composed of a homotrimer and a monomer). Targets for Assembly Prediction can be identified by a stoichiometry value other than A1 in the Target List. See an example input here (note: the “template” on this page is a submission template, not a template protein structure).
Inputs — Nota Bene: Similarly to the “High Accuracy Modeling” and “Topology” categories, for multimers, some targets have similar template proteins whose tertiary structure is already known. In cases where a template is available, it is another (very useful) input for the prediction model.
Outputs: 3D coordinates for all non-hydrogen atoms in the molecular structure (in TS format, the standard format for the PDB).
Metric: F1/Jaccard/LDDT/GDT_TS/Z-score. First, we find the predicted “interface residues” (residues that are within 5 Å of each other, but which come from two different chains). These are compared with the true interface residues to come up with the F1 score and the Jaccard Score as described in Assessment of protein assembly prediction in CASP12. Next, we calculate the LDDT score for local model quality (described in LDDT: a local superposition-free score for comparing protein structures) and GDT_TS (described here) for global model quality. We then calculate Z-scores for the F1, Jaccard, LDDT, and GDT_TS scores, as described above in the High Accuracy Modeling category. The final prediction score is Z-score(F1) + Z-score(Jaccard score) + Z-score(LDDT) + Z-score(GDT_TS), and groups are ranked according to the sum of their prediction scores across all target proteins. The scoring procedure is described in Assessment of protein assembly prediction in CASP13. CASP13 Assembly Prediction Rankings are here.
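As a rough illustration of the interface part of this metric, the sketch below finds interface residues and compares a predicted set with a true set. It assumes atoms are given as (chain id, residue number, coordinates) tuples and treats a residue as an interface residue if any of its atoms lies within 5 Å of an atom from a different chain. The assessment papers define exactly which sets are compared for each score, so this should not be read as the official scoring code. The function names are hypothetical.

```python
import math


def interface_residues(atoms, cutoff=5.0):
    """Residues with at least one atom within `cutoff` of another chain.

    atoms: list of (chain_id, residue_number, (x, y, z)) tuples.
    Returns a set of (chain_id, residue_number) pairs.
    """
    interface = set()
    for index, (chain_a, res_a, xyz_a) in enumerate(atoms):
        for chain_b, res_b, xyz_b in atoms[index + 1:]:
            if chain_a != chain_b and math.dist(xyz_a, xyz_b) <= cutoff:
                interface.add((chain_a, res_a))
                interface.add((chain_b, res_b))
    return interface


def f1_and_jaccard(predicted, true):
    """F1 and Jaccard scores between two sets of interface residues."""
    shared = len(predicted & true)
    union = len(predicted | true)
    size_sum = len(predicted) + len(true)
    f1 = 2 * shared / size_sum if size_sum else 0.0
    jaccard = shared / union if union else 0.0
    return f1, jaccard
```

The all-pairs loop is quadratic in the number of atoms, which is fine for a demonstration but slow for large complexes, where a spatial index would be the usual choice.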

Accuracy Estimation

Objective: Estimate the quality of protein structure predictions.
Inputs: A target (i.e. the primary sequence of a protein structure that has been determined experimentally) and a structure prediction for that target (3D atomic coordinates in TS format).
Output Option 1: A single number giving the “global quality score” (between 0 and 1, in QA format, see Example 4 on this page). Identify Option 1 by specifying “model index 1”.
Output Option 2: A single number giving the “global quality score” (between 0 and 1) AND error estimates (in angstroms) for each residue (in QA format, see Example 4 on this page). Identify Option 2 by specifying “model index 2”.
Metric: Mean Absolute Difference between the predicted global quality score and the true GDT_TS across all targets. CASP13 Accuracy Estimation rankings are here.
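The metric above is a plain mean absolute difference, sketched below. One assumption to flag: predicted quality scores are in [0, 1], while GDT_TS is commonly reported on a 0 to 100 scale, so this sketch divides GDT_TS by 100 before comparing. The function name and dict layout are hypothetical.

```python
def mean_absolute_difference(predicted_quality, true_gdt_ts):
    """Mean |predicted score - true GDT_TS/100| over shared targets.

    predicted_quality: dict mapping target id -> predicted score in [0, 1].
    true_gdt_ts: dict mapping target id -> GDT_TS on a 0-100 scale (assumed).
    """
    targets = sorted(set(predicted_quality) & set(true_gdt_ts))
    if not targets:
        raise ValueError("no targets in common")
    total = sum(
        abs(predicted_quality[t] - true_gdt_ts[t] / 100.0) for t in targets
    )
    return total / len(targets)
```

Lower is better: a value of 0.1 means the predicted quality is off by 10 GDT_TS points on average.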

Glossary

Primary structure: The amino acid sequence of a protein.
Secondary structure: The local folded structure of a protein: subsequences that fold into (most commonly) alpha-helices and beta-sheets, and (less commonly) beta turns and omega loops.
Tertiary structure: The tertiary structure of a folded protein — the 3D coordinates of the atoms relative to each other.
Quaternary structure: The 3D coordinates of a protein complex made up of two or more tertiary structures with non-covalent interactions between their side chain atoms.
Residue: A single amino acid, consisting of a main chain backbone (made up of a carboxyl group, an alpha-carbon, and an amine group) and a side chain (R-group).
R-group: The R-group is the part of the amino acid that extends off the backbone, and is unique for each different amino acid.
Side Chain: The side chain is the “R-group” that varies from amino acid to amino acid.
Contact Prediction: Prediction of whether each pair of residues is in contact with each other, defined as being within 8 Å (angstroms).
Target: A protein for which researchers can submit structure predictions.
Monomer: A single polypeptide chain interacting only with itself.
Oligomer: A molecule consisting of several units that are linked by non-covalent interactions (i.e. separate proteins interacting in a protein complex). A dimer is an oligomer made up of two monomers, a trimer is an oligomer made up of three monomers, etc.
Homodimer: An oligomer made up of two identical monomers.
Heterodimer: An oligomer made up of two different monomers.
Homotrimer: An oligomer made up of three identical monomers.
Heterotrimer: An oligomer made up of three different monomers.
Homo-oligomer: An oligomer made up of at least two identical monomers.
Hetero-oligomer: An oligomer made up of at least two different monomers.
Heteromer: The same as a hetero-oligomer.
Multimer: The same thing as an oligomer, but used only in the context of proteins.
Peptide: Short chain of amino acids (2–50 residues).
Polypeptide: Medium-length chain of amino acids (15–50 residues).
Protein: Long chain of amino acids (>50 residues).
Protein Complex: Two or more interacting proteins in a quaternary structure.
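The oligomer terms in this glossary map directly onto the stoichiometry strings used for Assembly Prediction targets, such as A1, A3, or A3B1. The following sketch parses such a string and names the assembly. It assumes each subunit type is a single uppercase letter followed by a count. The function names are hypothetical, and the naming only goes up to tetramers before falling back to a generic term.

```python
import re

# Names for small assemblies; larger ones fall back to "oligomer".
SIZE_NAMES = {2: "dimer", 3: "trimer", 4: "tetramer"}


def parse_stoichiometry(text):
    """Turn a string like 'A3B1' into {'A': 3, 'B': 1}."""
    if not re.fullmatch(r"(?:[A-Z]\d+)+", text):
        raise ValueError(f"unrecognized stoichiometry: {text!r}")
    return {letter: int(count) for letter, count in re.findall(r"([A-Z])(\d+)", text)}


def classify_assembly(text):
    """Name the assembly described by a stoichiometry string."""
    counts = parse_stoichiometry(text)
    total = sum(counts.values())
    if total == 1:
        return "monomer"
    # One distinct chain type means all subunits are identical.
    prefix = "homo" if len(counts) == 1 else "hetero"
    if total in SIZE_NAMES:
        return prefix + SIZE_NAMES[total]
    return prefix + "-oligomer"
```

For example, `classify_assembly("A3")` gives "homotrimer" and `classify_assembly("A3B1")` gives "heterotetramer", matching the two examples in the Assembly Prediction inputs.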

Acronyms

PDB: Protein Data Bank
RR: Residue-to-residue
TS: Tertiary structure
CA/CB/CG/CD/CE/CZ/CG1/CG2/CD1/NZ/NE/NE2/NH1/NH2/OE1/OE2/OG1: Abbreviations common in the PDB TS format. Carbon-alpha, carbon-beta, carbon-gamma, carbon-delta, carbon-epsilon, carbon-zeta, carbon-gamma branch 1, carbon-gamma branch 2, carbon-delta branch 1, nitrogen-zeta, nitrogen-epsilon, nitrogen-epsilon branch 2, nitrogen-eta branch 1, nitrogen-eta branch 2, oxygen-epsilon branch 1, oxygen-epsilon branch 2, oxygen-gamma branch 1. Here is a guide for decoding these names.
FM: Free-modeling, also known as ab initio, new fold, or non-template modeling. Refers to the task of predicting structure without the assistance of a template structure from a similar protein.
TBM: Template-based modeling. Refers to the task of predicting protein structure with the assistance of a template structure from a similar protein.
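The atom-name abbreviations listed above (CA, CB, CG1, and so on) appear in a fixed column of each ATOM record in PDB-style coordinate files. Here is a minimal sketch for reading one record and spelling out its atom name. It relies on the fixed-width columns of the PDB ATOM record and only handles names that follow the pattern of an element letter, a Greek-letter code, and an optional branch number. Special names such as the terminal oxygen, and non-protein atoms, are not handled. Both function names are hypothetical.

```python
GREEK = {"A": "alpha", "B": "beta", "G": "gamma", "D": "delta",
         "E": "epsilon", "Z": "zeta", "H": "eta"}
ELEMENTS = {"C": "carbon", "N": "nitrogen", "O": "oxygen", "S": "sulfur"}


def describe_atom_name(name):
    """Spell out a protein atom name, e.g. 'CG1' -> 'carbon-gamma branch 1'."""
    element = ELEMENTS[name[0]]
    if len(name) == 1:
        # Bare N, C, O are the backbone atoms.
        return f"backbone {element}"
    text = f"{element}-{GREEK[name[1]]}"
    if len(name) > 2:
        text += f" branch {name[2:]}"
    return text


def parse_atom_line(line):
    """Parse one fixed-width ATOM record; return None for other records."""
    if not line.startswith("ATOM"):
        return None
    return {
        "atom_name": line[12:16].strip(),   # columns 13-16
        "res_name": line[17:20].strip(),    # columns 18-20
        "chain_id": line[21],               # column 22
        "res_seq": int(line[22:26]),        # columns 23-26
        "xyz": (float(line[30:38]),         # columns 31-38
                float(line[38:46]),         # columns 39-46
                float(line[46:54])),        # columns 47-54
    }
```

Filtering parsed records for `atom_name == "CB"` (or "CA" for glycine) gives the coordinates used in the contact definition.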

I hope this is helpful! If there are any errors or typos, let me know in the comments and I will update the post.