Revolutionizing Protein Structure Prediction: AlphaFold vs. ESMFold - Comparison & Easy Guide

Published in bain-inside-advanced-analytics · 11 min read · Feb 8, 2023

Made with Dall-e 2. Prompt: “Protein folding prediction, as yellow pink blue purple illustration in acuarela, with full blue and pink space-like color background, without words.”

Maybe you’re a researcher interested in exploring the latest trends in protein folding predictions, or maybe you’re just curious about this field and want to try things out at the intersection of AI and biotech. However, navigating the world of AI tools can be overwhelming, and using complex models can be time-consuming and costly.

In this article, we’ll show you how to easily and effectively use the latest AI tools, like Alphafold and ESMFold, to get started in this fascinating and promising space. We will compare the performance of both models, including installation time, prediction time, and error.

These powerful models may seem intimidating at first, but we’ll guide you through using them for protein and multimer folding.

The Role of Protein Structure: Understanding 3D Conformation for New Developments

Proteins are biomolecules that drive nearly every biological process and are essential for the proper functioning and survival of all living organisms. Each protein is made up of a linear chain of amino acid residues. This sequence is encoded by a gene, and it determines how the chain folds, which in turn determines the protein’s chemical properties, stability, and function. Proteins can have many different functions: catalytic, structural, cell signaling, immune response, cell adhesion, cell cycle regulation, and more.

Protein folding matters because a protein’s structure is closely tied to its function. Accurate structure predictions help us understand protein-related diseases and develop or improve drugs to treat them, enzymes that break down waste, and biosensors that detect environmental contaminants.

Traditional experimental methods such as nuclear magnetic resonance or X-ray crystallography are slow and expensive, and they cannot be used to explore theoretical sequences. Computer-based methods, on the other hand, predict the fold from the protein sequence, so they are faster and also work with sequences that do not exist in nature.

AI models for protein folding predictions

DeepMind’s AlphaFold-2 was the first AI model to predict protein structures with accuracy approaching that of experimental methods. Other models, such as those listed on this Github page, have since followed in its footsteps, each with its own advantages and disadvantages. In this article, we will compare AlphaFold-2 with Meta’s ESMFold protein folding prediction model.

AlphaFold-2

AlphaFold-2 can accurately predict protein folding. It is not only able to predict the structure of a single protein, but also the structures of multi-chain protein complexes with AlphaFold-Multimer. This is particularly useful as most proteins function within complexes. This model was used to predict the structure of most of the proteins in the UniProt database. The predictions are also available for bulk download via Google Cloud Public Datasets.

You can read more about AlphaFold-2 on DeepMind’s blog: AlphaFold: a solution to a 50-year-old grand challenge in biology.

Release date: July 2021
GitHub: deepmind/alphafold
Paper: Highly accurate protein structure prediction with AlphaFold
Multimer/Complexes: yes
Templates: yes (slower)
Official online version: DeepMind’s AlphaFold-2 colab
Unofficial online version: Sokrypton’s AlphaFold-2 colab
Generated database: https://alphafold.ebi.ac.uk/

ESMFold

Meta’s Evolutionary Scale Modeling (ESM) project released ESMFold about one year after AlphaFold-2. ESM includes a variety of trained models, such as ESMFold (protein folding), ESM-2 (a general-purpose model to predict structure, function, and other properties from sequences), ESM-IF1 (inverse folding), and others.

While also predicting protein folding, ESM’s goal is to solve a different challenge: generating a Metagenomic Atlas. This means predicting a large number of structures from DNA databases. To achieve this goal, ESMFold was designed to speed up current protein folding prediction times. According to Meta’s blog, ESMFold can make predictions 60 times faster than the state-of-the-art model (AlphaFold-2). In our test it was 55 times faster, using both AlphaFold-2 and ESMFold colab versions.

Release date: August 2022
GitHub: facebookresearch/esm
Paper: Evolutionary-scale prediction of atomic level protein structure with a language model
Multimer/Complexes: kind of, it needs a trick
Templates: no (faster)
Official online version: ESMAtlas
Unofficial online version: Sokrypton’s ESM Advanced Colab
Generated database: https://esmatlas.com/

Multimer Prediction and Template Usage

Both AlphaFold-2 and ESMFold can predict monomer and multimer/complex structures. Multimer prediction allows you to model a complex made up of two or more monomers. AlphaFold-2 added this option with AlphaFold-Multimer. ESMFold does it with a trick: it adds a linker between the proteins in the complex, so that just “one protein” is being modeled.
AlphaFold-2 takes a Multiple Sequence Alignment (MSA) and, optionally, structural templates as part of its input. These are found with sequence search tools such as jackhmmer or mmseqs2. The original AlphaFold-2 pipeline uses jackhmmer, while Sokrypton’s ColabFold uses the faster Many-against-Many sequence searching tool (mmseqs2). According to the ColabFold authors, ColabFold also speeds up batch predictions by ~90-fold by avoiding recompilation and adding an early stop criterion. ESMFold, on the other hand, uses neither templates nor an MSA, so it does not need this search step to model a structure.

AlphaFold-2 vs ESMFold: comparative experiment

Note: we used the default options for both tools and changed only the sequence.

In this experiment, we compare the performance of AlphaFold-2 and ESMFold using a 68-amino-acid protein (6MRR). We also included Sokrypton’s version of AlphaFold-2, as it is faster while still providing similar results. After running the models, we aligned the 3D predictions with the original protein using PyMol and calculated the root mean square deviation (RMSD) in Angstroms (Å). For comparison, the original structure has a resolution of 1.18 Å (X-ray diffraction), so an RMSD below this value indicates a highly accurate prediction.

Experiment sequence: 6MRR, a 68 amino acid long protein.
Experiment Colabs: DeepMind’s AlphaFold colab (templates with jackhmmer), Sokrypton’s AlphaFold-2 colab (templates with mmseqs2), Sokrypton’s ESMFold Advanced Colab (without templates).
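
To make the error metric concrete, here is a minimal sketch of how an RMSD is computed once two structures have already been superimposed (PyMol takes care of the superposition step in our workflow). The function name and the coordinates are made up for illustration; they are not taken from 6MRR or from any of the notebooks.

```python
import math

def rmsd(coords_a, coords_b):
    """RMSD between two equal-length lists of (x, y, z) points, in the input units."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    # Sum of squared distances between matching atoms
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))

# Hypothetical, already superimposed C-alpha coordinates (in Å)
reference = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
predicted = [(0.0, 1.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 1.0)]
print(f"RMSD = {rmsd(reference, predicted):.2f} Å")
```

The value depends on which atoms are compared and on how the structures were superimposed, so numbers from different tools are only comparable when those choices match.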

Experiment takeaways

ESMFold is a faster alternative to the original AlphaFold-2, running 55 times faster with a smaller error (for this particular protein).
The Sokrypton version of AlphaFold-2 has an RMSD similar to the original’s, but runs 9 times faster (with a faster installation, and it outputs 5 predictions).

Your first protein folding using ESMFold Colab

The first thing you need to know is that there is a community working to make these models more accessible. If you are interested in playing with these tools, you should follow Sokrypton’s ColabFold on GitHub and join their Discord community (you can also read this Nature article about ColabFold). You can find Google Colab notebooks for different models in the ColabFold repository, where you just need to follow the instructions to model your own proteins. So, go to ESMFold’s Basic or Advanced Google Colab notebook and have fun!

That’s it! Good luck! 👍

But just in case you need a little help understanding the parameters, keep reading as we explain the best we can how to get the most out of this tool.

Instructions step by step

1. Find your protein

You can search for a protein on https://www.rcsb.org/ and click on the “Display File” button and select “as Fasta sequence” to get the amino acid sequence to be modeled.
You can add or delete letters and see how it affects the predictions.
Recommended maximum sequence length for this Colab is 900 amino acids.
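
If you prefer to prepare the sequence in code, the sketch below parses FASTA text and checks it against the 900 amino acid recommendation above. The helper names and the example record are our own illustration (the sequence is made up, not a real PDB entry); in practice you would paste the text obtained from the “Display File” option.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

MAX_LENGTH = 900  # recommended maximum for this Colab, as stated above

def check_length(sequence, limit=MAX_LENGTH):
    """Return True if the sequence fits within the recommended length."""
    return len(sequence) <= limit

# Made-up FASTA text for illustration only
example = """>EXAMPLE_1|Chain A|hypothetical protein
MKTAYIAKQRQISFVKSHFSRQ
LEERLGLIEVQ
"""
for header, seq in read_fasta(example):
    print(header, len(seq), "OK" if check_length(seq) else "too long")
```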

2. Go to one of Sokrypton’s Colab notebooks

Google Colab notebooks allow for collaborative programming. Many projects publish their code there, together with the installation of all the necessary requirements, so users can easily run it.

Basic ESMFold colab
Advanced ESMFold colab (we will explain this one here)

3. Define the parameters

Please note: This is our best understanding of the parameters. To clarify any doubts, check the Advanced ESMFold colab or ask in their Discord. We recommend keeping the advanced options at their default values unless you have a clear understanding of what they do, as some parameters are intended for development team use only.

The main parameter is the amino acid sequence of your protein. You can run the notebook with the default parameters and just edit your amino acid sequence to explore the tool.

jobname: Name tag for the .zip file generated with the output. Tip: if you re-run the notebook, change the jobname, otherwise, there could be previous files in your new output.
sequence: Sequence to be folded. To model a multimer, separate sequences by “/”. Keep in mind that ESMFold models multimers by joining the sequences with a linker of “X”s. For example, to model the complex of AAAAA and BBBBBB, input the sequence AAAAA/BBBBBB, and ESMFold will model AAAAAXXXXXXXBBBBBB.
copies: If the sequence is homo-oligomeric (a complex formed by the repetition of only one monomer), change this number to the total quantity of copies instead of repeating the sequence.
num_recycles: Number of times to recycle the input sequence in the structure model to compute the output. This means using the predicted output from the previous iteration as input for the current iteration. Using a larger value for num_recycles may result in a more refined prediction, but it will also require more computational resources and take longer to run.
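
The linker trick used for the sequence and copies parameters can be sketched in a few lines. This is our own illustration of the idea, not the notebook’s code: the function name is hypothetical, and the default linker length of 25 is the chain_linker default of the Advanced notebook.

```python
def build_linked_sequence(sequence, copies=1, linker_length=25):
    """Join '/'-separated chains (repeated `copies` times) with a poly-X linker.

    Returns the single linked sequence and the length of each chain.
    """
    chains = [chain for chain in sequence.upper().split("/") if chain]
    if not chains:
        raise ValueError("no sequence given")
    # For homo-oligomers, repeat the chain(s) instead of typing them again
    chains = chains * max(1, copies)
    linker = "X" * linker_length
    return linker.join(chains), [len(chain) for chain in chains]

# Two-chain complex, with a shortened linker so the result is easy to read
linked, lengths = build_linked_sequence("AAAAA/BBBBBB", linker_length=7)
print(linked, lengths)

# Homo-trimer of a made-up three-residue chain with the default linker
trimer, _ = build_linked_sequence("MKV", copies=3)
print(len(trimer))
```

Keeping the per-chain lengths makes it possible to split the prediction back into chains afterwards.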

Define more parameters (Advanced notebook options)

chain_linker: default 25 poly-X (not editable on basic notebook). The length of the linker when working with a multimer complex, to trick the model and make it think it is just one sequence. Has no effect on single chain predictions.
get_LM_contacts: If true, then return_contacts = true, and the notebook generates the output variable lm_contacts, in which the language model estimates amino acid contact probabilities from sequence information alone. By contrast, sm_contacts is generated by the structure model and uses both sequence information and 3D information. Both are 2D matrices whose dimensions equal the length of the input protein sequence, and each element is a floating point value between 0 and 1 indicating the predicted likelihood of a contact between two residues. sm_contacts gives more information, but you can check this box if you also want to see the sequence-only contact prediction. Contact information can be shown later in the “Plot confidence” cell.
samples: Number of predictions to be generated (one seed per sample). There will be one .pdb file generated per sample. To define how to generate different predictions of the same structure, we can use a “random mask” (defined with masking_rate) and/or “dropouts” (defined with stochastic_mode). These regularization techniques can help to reduce overfitting by forcing the model to learn more robust features, but they also help to get different predictions for the same sequence. Note: if we select samples = None, then both techniques are ignored.
-masking_rate: A value between 0 and 1, indicating the fraction of the sequence that should be randomly masked (hidden) for the model to make its predictions. The model is trained using a masking rate of 15%, so it learns to predict the structure even when certain parts of the sequence are hidden. Increasing the masking rate can help reduce overfitting, and when making predictions, the random masking will result in different predictions for the same sequence. This option can only be used when stochastic_mode is set to LM or LM_SM.
-stochastic_mode: Can be set to LM, LM_SM, or SM. If SM or LM_SM is selected, dropout will be used to make predictions by calling train(mode=True), a method of torch.nn.Module that keeps dropout layers active. Dropout involves randomly ignoring certain parts of the input or hidden layers, resulting in different predictions for the same sequence.

In summary, to generate diverse samples with masking_rate and stochastic_mode when predicting multiple samples:

If samples ≠ None:

LM
- masking_rate = [our input]
- train(mode=False)
LM_SM
- masking_rate = [our input]
- train(mode=True)
SM
- masking_rate = 0 (overwrites [our input] value)
- train(mode=True)

If samples = None:

For any stochastic_mode (LM, LM_SM or SM)
- masking_rate = 0 (overwrites [our input] value)
- train(mode=False)
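
The same decision table can be written as a small function. This is a sketch of our reading of the summary above, with a hypothetical function name; it does not call the model, it only returns the masking rate that would actually be applied and whether dropout would be active.

```python
def resolve_sampling(samples, stochastic_mode, masking_rate):
    """Return (effective_masking_rate, dropout_enabled) for the given settings."""
    if stochastic_mode not in ("LM", "LM_SM", "SM"):
        raise ValueError(f"unknown stochastic_mode: {stochastic_mode}")
    if samples is None:
        # No sampling: masking and dropout are both ignored
        return 0.0, False
    if stochastic_mode == "LM":
        return masking_rate, False   # masking only
    if stochastic_mode == "LM_SM":
        return masking_rate, True    # masking and dropout
    return 0.0, True                 # SM: dropout only, masking overwritten

for mode in ("LM", "LM_SM", "SM"):
    print(mode, resolve_sampling(samples=4, stochastic_mode=mode, masking_rate=0.15))
print("no samples", resolve_sampling(None, "LM_SM", 0.15))
```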

4. Running ESMFold for Protein Folding Predictions

At the top of the Colab notebook, select Runtime > Run all and wait for the process to complete. For shorter sequences (<100 amino acids), the process should take less than 1 minute. The Colab will first install any necessary requirements and then run the code with the selected parameters.

Display (optional)

This is not the most beautiful plot you are going to get, but it is interactive and the colors indicate the confidence level of the prediction (if you select the confidence option). If you selected more than one sample, it should display the one with the highest confidence.

color: confidence (blue=high prediction confidence, pLDDT > 0.9 / red = low prediction confidence, pLDDT < 0.5), rainbow, chain (different color for every chain).
show_sidechains: Check the box to show sidechains (cartoon representation simplifies the shapes, but sometimes you want to see more details).
show_mainchains: Check the box to show mainchains.

Plot confidence (optional)

dpi: resolution of the confidence plot, in dots per inch.

Explore Output Plots:

Predicted lDDT: predicted Local Distance Difference Test (pLDDT) per position. pLDDT is a measure of the model’s confidence in its prediction of the 3D structure for each amino acid in the protein. Values range from 0 to 1. The higher the value, the better.
Predicted Aligned Error: Matrix of pairwise errors. Each entry estimates the position error of one amino acid when the predicted and true structures are aligned on another amino acid. The lower the value, the better.
Contacts from LM: Contact probability of each amino acid pair, from the language model (get_LM_contacts = True).
Contacts from Structure Module: Contact probability of each amino acid pair, from the structure module.
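
If you want the per-residue confidence as numbers rather than as a plot, a common convention is that structure predictors store pLDDT in the B-factor column of the .pdb file. The sketch below assumes that convention holds for your file (check it, and check whether the values are on a 0 to 1 or 0 to 100 scale before choosing a threshold). The function names and the demo records are hypothetical.

```python
def ca_bfactors(pdb_text):
    """Return the B-factor value of every C-alpha atom, in file order."""
    values = []
    for line in pdb_text.splitlines():
        # Fixed-width PDB columns (1-based): atom name 13-16, B-factor 61-66
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            values.append(float(line[60:66]))
    return values

def summarize(values, threshold):
    """Mean value and fraction of residues at or above `threshold`."""
    mean = sum(values) / len(values)
    confident = sum(v >= threshold for v in values) / len(values)
    return mean, confident

def atom_line(serial, name, res, chain, resseq, x, y, z, bfactor):
    """Format a minimal PDB ATOM record (used here only to build demo data)."""
    return (f"ATOM  {serial:5d} {name:<4s} {res:3s} {chain}{resseq:4d}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}{1.0:6.2f}{bfactor:6.2f}")

# Made-up records: two residues, confidence stored on a 0 to 1 scale
demo = "\n".join([
    atom_line(1, "N", "MET", "A", 1, 0.0, 0.0, 0.0, 0.91),
    atom_line(2, "CA", "MET", "A", 1, 1.5, 0.0, 0.0, 0.91),
    atom_line(3, "CA", "LYS", "A", 2, 5.3, 0.0, 0.0, 0.45),
])
values = ca_bfactors(demo)
mean, confident = summarize(values, threshold=0.9)
print(values, round(mean, 2), confident)
```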

Download predictions

Run the code to download the .zip file containing the folded .pdb file. The filename will depend on the number of samples specified and other parameters, as shown in the following code fragment:

if samples is None:
    pdb_filename = f"{ID}/ptm{ptm:.3f}_r{num_recycles}_{seed}.pdb"
else:
    pdb_filename = f"{ID}/ptm{ptm:.3f}_r{num_recycles}_seed{seed}_{stochastic_mode}_m{masking_rate:.2f}.pdb"

Animation

If you have specified more than 1 sample, the code will create a simple animation of all the predicted samples.

5. Exploring the Predictions

After obtaining your protein structure prediction, you can further analyze it using PyMol or your preferred software. If you used a known protein with an available .pdb file, you can open both files together in PyMol, align the prediction to the original, and see the calculated Root Mean Square Deviation (RMSD) at the top of the screen. The prediction can also be used for identifying potential binding pockets, conducting protein-ligand docking simulations, and other tasks that require a protein structure as input.

6. Predictions for Larger Sequences

What can you do if you have larger sequences to predict? For larger predictions (>900 amino acids), the ESMFold Colab is not enough, but there are some options you can try if you have access to a Virtual Machine on GCP (which requires a Google Cloud account).

We did a test and predicted a multimer with a total of 1,746 amino acids across its four monomers. To run this prediction, we created a Virtual Machine on Vertex Workbench. We used a GCP VM with an NVIDIA A100 GPU (40GB of GPU memory), 12 CPUs, and 85GB of RAM. This prediction took approximately 10 minutes.

However, when we tested with a multimer > 2,000 amino acids, we were unable to get a prediction using the GPU mode. To resolve the issue, we used the cpu_only and cpu-offload options and reduced the chunk_size to 32, following this script, which took 6 hours and 46 minutes.

Final thoughts

Both AlphaFold-2 and ESMFold are powerful tools for predicting protein folding structures, and they can be easily used with the help of Sokrypton’s ColabFold. We recommend using ESMFold due to its faster prediction time and comparable error to AlphaFold-2. You can simply input a protein sequence into the ESMFold notebook on Colab, adjust a few settings or run it as is. With just a few clicks and a short wait, you’ll get a highly accurate prediction of your protein’s structure that you can use in your research. Give it a try!

Acknowledgements

This article was done with Joaquín Castro and David Ascencios, as part of our Biotechnology R&D initiative in Bain Advanced Analytics Group.