Alternatives to AlphaFold 3 — how they work and when to use them

Falk Hoffmann
15 min read · May 24, 2024


Are you curious about what this text holds for you?

  • What are the current limitations of AlphaFold 3 for protein-ligand docking, and when will they be solved?
  • Which alternatives exist for protein-ligand docking?
  • How does DiffDock improve on previous docking methods using a generative diffusion model and confidence bootstrapping?
  • How do NeuralPLexer 1 and 2 work?
  • What can we expect in the future in the protein-ligand complex structure prediction field?

Current limitations of AlphaFold 3

As mentioned in my last text, AlphaFold 3 (AF3) sets the new standard for protein-ligand structure prediction. However, AlphaFold 3 has several limitations. Those include:

  • The predictions do not always respect the chirality of the molecules.
  • Only static structures are predicted.
  • Conformational changes upon binding are not modelled.
  • Hallucinations are introduced, which is especially relevant for disordered proteins.

Those limitations show a lack of physics (chirality of molecules, dynamics) in the AlphaFold 3 model. We will see below how integrating physics in generative protein structure prediction models can improve accuracy.

However, the most critical issue is that the code and weights of AlphaFold 3 are not publicly available. This limits current usage of AlphaFold 3 to the AlphaFold Server, which is restricted to non-commercial purposes. Furthermore, only a limited set of built-in small molecules can be used for docking, which rules out nearly all current small-molecule drug discovery applications. Google DeepMind has promised to release the code and weights within the next six months, but alternatives are required until then.

In summary, there are several reasons why people in the drug discovery field should look at alternative methods, at least until the code and weights are published. In this text, I will present some of them:

  • DiffDock
  • NeuralPLexer 2
  • DynamicBind

As usual, the focus is more on the architecture of those tools and less on their performance.

DiffDock

The structure of a protein-ligand complex depends on how the protein changes dynamically upon binding the ligand. Those conformational changes influence the protein's final structure in the protein-ligand complex, which can differ from the apo structure, especially in the binding region. Binding regions are, on average, more flexible than the rest of the protein, which allows them to adjust their conformation to several binding partners. As a consequence, such proteins adopt different conformations with different binding partners, and those conformations cannot be extracted from the apo structure.

How do you get those structures? The natural choice would be to run a molecular dynamics (MD) simulation of the protein-ligand complex, starting from the protein's apo structure and placing the ligand close to the proposed binding region. However, the timescales of conformational protein changes upon binding range from nanoseconds to milliseconds. Reaching the slowest transitions with classical MD simulations can take months or require specialised, expensive hardware like the Anton supercomputer. Enhanced sampling techniques (to be presented in a future article) can accelerate those transitions significantly, but they remain computationally expensive. Molecular docking methods like AlphaFill can find native and non-native protein-ligand structures. AlphaFold 3 demonstrates that combining transformers and diffusion models can produce accurate protein structures quickly. However, most of those models do not use physical constraints, even though such constraints can increase the quality of the predictions.

One docking method that uses physical constraints is DiffDock from Tommi Jaakkola's lab at MIT. DiffDock builds on the previous success of deep learning-based docking methods like EquiBind from the same group or TANKBind from Galixir Technologies. Those models use geometric constraints for the prediction of the protein-ligand complex structure: EquiBind via an equivariant neural network, and TANKBind via a trigonometry module consisting of a trigonometry update, a self-attention modulation and a non-linear transition module. EquiBind and TANKBind showed promising results on a PDBBind test set of 369 protein-ligand structures. With a fast one-shot prediction, they beat SOTA docking methods like Vina, which use a physics-based scoring function for the docking prediction. However, they struggled to beat commercial scoring-based software like GLIDE and sometimes produced atom clashes between proteins and small molecules.

Nevertheless, those models laid the foundation for DiffDock. The main difference is that DiffDock is a generative diffusion model, whereas EquiBind and TANKBind are regression models (see also my text about AlphaFlow from the same group for the differences between regression and diffusion models).

DiffDock consists of three parts: an embedding layer, several interaction layers and an output layer.

In the embedding layer, molecules are represented as graphs. The nodes are either heavy atoms in the case of the ligand or entire residues, positioned at the Cα atom, in the case of the protein. Edges are added between ligand atoms within a distance cutoff of 5 Å and between protein residues within a distance cutoff of 15 Å. Edges between protein and ligand are included up to a cutoff that depends on the noise level of the diffusion model, which ensures that messages are passed between protein and ligand in the docked poses the diffusion model generates. Protein residue features are extracted from ESM2, and ligand atom features include the atomic number, chirality, degree, charge, valence state, number of hydrogen atoms and radical electrons, hybridisation state and aromatic-ring membership and properties.

Note that these atom features capture more of the ligand chemistry than the geometrically focused ligand features in AlphaFold 3. This reflects a fundamentally different approach: tools like DiffDock feed explicit physical and chemical features of the ligand to the model, whereas AlphaFold 3 is expected to learn that physics directly from the training data. If the training data is dense enough, a model should be able to learn the physics of the ligands on its own, but embedding physical features can lead to better results (see NeuralPLexer below). Finally, those features are concatenated with sinusoidal encodings, and the result passes through a two-layer multilayer perceptron (MLP) to generate new scalar features for each node and edge of the graph before entering the interaction layers.
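To make the graph construction concrete, here is a minimal sketch in PyTorch of how such a radius graph and the node embeddings could be built. The 5 Å and 15 Å cutoffs, the sinusoidal encoding and the two-layer MLP follow the description above; the feature dimensions, function names and toy inputs are illustrative assumptions, not DiffDock's actual code.

```python
import torch
import torch.nn as nn

def radius_edges(pos_a: torch.Tensor, pos_b: torch.Tensor, cutoff: float) -> torch.Tensor:
    """Return index pairs (i, j) with ||pos_a[i] - pos_b[j]|| < cutoff."""
    dist = torch.cdist(pos_a, pos_b)                 # (Na, Nb) pairwise distances
    src, dst = torch.nonzero(dist < cutoff, as_tuple=True)
    return torch.stack([src, dst], dim=0)            # (2, num_edges), self-edges included

def sinusoidal_encoding(x: torch.Tensor, dim: int = 16) -> torch.Tensor:
    """Sinusoidal embedding of a scalar per node (e.g. the diffusion time)."""
    freqs = torch.exp(torch.linspace(0.0, 4.0, dim // 2))
    angles = x[..., None] * freqs
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)

class NodeEmbedder(nn.Module):
    """Two-layer MLP that merges raw node features with the sinusoidal encoding."""
    def __init__(self, in_dim: int, time_dim: int = 16, out_dim: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim + time_dim, out_dim), nn.ReLU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, feats: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        return self.mlp(torch.cat([feats, sinusoidal_encoding(t)], dim=-1))

# Toy usage: 20 ligand atoms and 100 protein residues (positions at the CA atoms).
lig_pos, prot_pos = torch.randn(20, 3) * 3, torch.randn(100, 3) * 10
lig_edges = radius_edges(lig_pos, lig_pos, cutoff=5.0)      # intra-ligand, 5 Å
prot_edges = radius_edges(prot_pos, prot_pos, cutoff=15.0)  # intra-protein, 15 Å
lig_feats = torch.randn(20, 32)        # stand-in for the chemical atom features
t = torch.full((20,), 0.3)             # stand-in for the diffusion time
lig_emb = NodeEmbedder(32)(lig_feats, t)                    # (20, 64)
```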

The interaction layers pass messages between nodes: for each edge, the edge embedding is concatenated with the scalar features of the source and destination nodes. Different edge types have different sets of weights in edge-type-specific MLPs. After this concatenation, the message vectors from all neighbours of a node are summed and normalised, and the result is used to update the node's previous feature vector. This procedure is repeated over several interaction layers, and the result of the last interaction layer is passed to the output layer.
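The sketch below illustrates one such interaction layer, with one message MLP per edge type and a summed, normalised residual update of the node features. Edge-type names and dimensions are assumptions for illustration; DiffDock's real interaction layers are equivariant tensor-product convolutions rather than plain MLPs.

```python
import torch
import torch.nn as nn

class TypedMessagePassing(nn.Module):
    """One interaction layer: edge-type-specific messages, summed and normalised."""
    def __init__(self, node_dim: int, edge_dim: int, edge_types=("ll", "pp", "lp")):
        super().__init__()
        self.msg_mlps = nn.ModuleDict({
            t: nn.Sequential(nn.Linear(2 * node_dim + edge_dim, node_dim), nn.ReLU(),
                             nn.Linear(node_dim, node_dim))
            for t in edge_types
        })
        self.norm = nn.LayerNorm(node_dim)

    def forward(self, h, edges_by_type, edge_feats_by_type):
        """h: (N, node_dim) node features; edges_by_type: dict of (2, E) index tensors."""
        agg = torch.zeros_like(h)
        for etype, edge_index in edges_by_type.items():
            src, dst = edge_index
            msg_in = torch.cat([h[src], h[dst], edge_feats_by_type[etype]], dim=-1)
            msg = self.msg_mlps[etype](msg_in)
            agg = agg.index_add(0, dst, msg)   # sum the messages arriving at each node
        return h + self.norm(agg)              # residual update of the node features

# Toy usage with the ligand graph from the previous sketch (names are illustrative).
h = torch.randn(20, 64)
edges = {"ll": torch.randint(0, 20, (2, 50))}
edge_feats = {"ll": torch.randn(50, 16)}
h = TypedMessagePassing(64, 16, edge_types=("ll",))(h, edges, edge_feats)
```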

In the output layer, translational, rotational and torsional scores are calculated to predict the position of the ligand and its interaction with the protein. Here, the ligand is treated as a rigid body, and a convolution between the ligand atoms and the ligand's centre of mass yields the translational and rotational scores of the ligand relative to the protein. The torsional score is calculated using a pseudotorque convolution over spherical harmonics and atomic embeddings, giving one scalar per rotatable bond of the ligand.
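Below is a heavily simplified stand-in for this output head: it only mirrors the output shapes (one translational and one rotational score vector for the rigid ligand, one scalar per rotatable bond) using plain linear layers, whereas DiffDock itself uses equivariant tensor-product convolutions centred on the ligand's centre of mass.

```python
import torch
import torch.nn as nn

class SimplifiedScoreHead(nn.Module):
    """Toy output head with DiffDock-like output shapes (not DiffDock's architecture)."""
    def __init__(self, node_dim: int):
        super().__init__()
        self.tr_head = nn.Linear(node_dim, 3)       # translation score of the rigid ligand
        self.rot_head = nn.Linear(node_dim, 3)      # rotation score (axis-angle style vector)
        self.tor_head = nn.Linear(2 * node_dim, 1)  # one scalar per rotatable bond

    def forward(self, h_lig, lig_pos, rotatable_bonds):
        com = lig_pos.mean(dim=0)                              # ligand centre of mass
        weights = torch.softmax(-((lig_pos - com) ** 2).sum(-1), dim=0)
        pooled = (weights[:, None] * h_lig).sum(dim=0)         # distance-weighted pooling
        tr_score, rot_score = self.tr_head(pooled), self.rot_head(pooled)
        i, j = rotatable_bonds                                 # (2, num_rotatable_bonds)
        tor_score = self.tor_head(torch.cat([h_lig[i], h_lig[j]], dim=-1)).squeeze(-1)
        return tr_score, rot_score, tor_score

# Toy usage: 20 atoms, 3 rotatable bonds.
head = SimplifiedScoreHead(64)
tr, rot, tor = head(torch.randn(20, 64), torch.randn(20, 3),
                    torch.randint(0, 20, (2, 3)))
```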

DiffDock showed clear improvements in predicting protein-ligand complexes compared to TANKBind, EquiBind, SMINA, and GLIDE. More importantly for a deep learning-based model using geometric constraints, less than 3% of the highest-confidence structures predicted on the test set had steric clashes, compared to 26% for EquiBind and 6.6% for TANKBind. This improvement is due to the diffusion model's generative approach, which samples individual plausible poses of a ligand instead of regressing towards the mean of all plausible poses.

DiffDock demonstrated that a generative diffusion model can significantly improve protein-ligand structure prediction: 38% of its highest-confidence structures on the PDBBind dataset have an RMSD of less than 2 Å, and most of the generated structures contain no atom clashes. DiffDock-L, which improves on DiffDock using confidence bootstrapping, a training mechanism that refines the diffusion generator based on feedback from a confidence model, increases this success rate to 50%.

The confidence bootstrapping training mechanism works as follows: the confidence of the poses generated by the diffusion process is used to update the weights of the score model in the early diffusion steps. This improves the early steps of the diffusion model, which determine the general docking region within the protein and usually contribute more to wrong predictions than the later steps, which are responsible for fine-tuning the pose. In this way, confidence bootstrapping also improves the model's generalisation to unseen targets, because the general features of the protein-ligand interaction learned in the early steps are improved.
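One bootstrapping round could look roughly like the sketch below. The reverse-diffusion sampler, the noising function and the score-matching loss are passed in as placeholder callables, and the confidence-based weighting of poses is a simplified reading of the mechanism, not the DiffDock-L implementation.

```python
import torch

def confidence_bootstrapping_round(score_model, confidence_model, proteins, optimizer,
                                   sample_pose, add_noise, score_matching_loss,
                                   n_poses=8, t_min=0.7):
    """Sketch of one confidence bootstrapping round (placeholder callables, see lead-in)."""
    for protein in proteins:
        # 1. Sample candidate poses for an unseen target and score them with the
        #    (frozen) confidence model.
        with torch.no_grad():
            poses = [sample_pose(score_model, protein) for _ in range(n_poses)]
            conf = torch.stack([confidence_model(protein, pose) for pose in poses])
        weights = torch.softmax(conf, dim=0)             # trust confident poses more

        # 2. Fine-tune the score model on early (high-noise) diffusion steps only,
        #    weighting each pose by the confidence model's feedback.
        optimizer.zero_grad()
        loss = torch.zeros(())
        for w, pose in zip(weights, poses):
            t = t_min + (1.0 - t_min) * torch.rand(())   # restrict to early steps
            loss = loss + w * score_matching_loss(score_model, add_noise(pose, t), pose, t)
        loss.backward()
        optimizer.step()
```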

Despite all the methodological advances DiffDock brought to protein-ligand complex prediction (the first diffusion model, confidence bootstrapping, etc.), its predictions do not yet reach the accuracy of AlphaFold 3. Are there other methods that introduce physical constraints and can sample the alternative states that result from conformational changes?

NeuralPlexer and DynamicBind

On 12 April 2024, NeuralPLexer was published in Nature Machine Intelligence. NeuralPLexer can predict both the apo and the ligand-bound form of a protein and model the structural differences between the two states. Furthermore, NeuralPLexer outperformed AlphaFold2 at predicting the large structural plasticity associated with ligand binding, both for proteins undergoing significant binding-induced conformational changes and for recently determined ligand-binding proteins. So far, comparisons to AlphaFold 3 are not available, but NeuralPLexer shows improvements for rigid-receptor blind protein-ligand docking on the PDBBind2020 dataset. In contrast to AlphaFold 3, the code is freely available under the BSD-3-Clause Clear license, enabling private and commercial usage.

But how does NeuralPLexer work internally? Let’s have a look at its architecture. If you are not interested in the structure, you can directly skip to the end of the NeuralPLexer part.

NeuralPLexer architecture. Figure extracted from the preprint.

NeuralPLexer takes protein sequences and small molecules as inputs. In addition, features from template structures of the protein and protein language models (PLM) are used. The ligands and protein amino acids are represented as graphs. The inputs enter a contact prediction module, which calculates contact maps. Those results are used in an equivariant structure denoising module (ESDM), which generates new ligand-bound structures and estimates the confidence of the prediction.

This is the algorithm of NeuralPLexer in detail:

Algorithm of NeuralPLexer

As in my last text about AlphaFold 3, let's go through it step by step.

First, embeddings f from the ESM2–650M language model are extracted for all sets of protein sequences, e.g. one sequence for single proteins and multiple chains for complexes of different proteins (Line 1).

Second, if template structures are to be used, they are generated with AlphaFold2 or, if available, extracted from alternative experimental structures, and template features are computed from those structures (Lines 2–4).

After this precomputed feature extraction, NeuralPLexer iterates over all sampled conformations (Lines 5–49). Within each iteration, a diffusion time schedule is first defined (Line 6).

This schedule is used to solve multivariate stochastic differential equations during the diffusion process. Then, initial protein coordinates are sampled from a prior Gaussian distribution (Line 7). Next, a residue-scale and an atomic-scale graph representation are generated from this initial noisy geometry (Line 8). There are five different types of nodes in the generated graph: one for every protein atom (P), one for every ligand atom (L), one for every backbone frame (B), one for every ligand local frame (F) and one for every selected patch (S), a subset of backbone frames. Atom representations are one-hot encodings of the element's group and period indices. Frame representations contain the bond-type features (single bond, double bond, triple bond, bond in an aromatic ring) of all incoming and outgoing bonds and the centre-atom features of the frame, passed through a 2-layer MLP. Stereochemistry edges contain information about the relative topological orientation between two frames (e.g. which node is incoming and which is outgoing), about the presence of ring structures spanning three frames, and about the stereochemistry of polyhedral chiral centres and of potential double and π bonds between two nodes.
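As an illustration of the atom node features, the following sketch builds a one-hot encoding of an element's group and period from its atomic number. The bin layout and the reduced group lookup table are my own simplifications; the exact feature layout in NeuralPLexer may differ.

```python
import numpy as np

# Period boundaries by atomic number; the group lookup is limited to a few
# bio-relevant elements to keep the sketch short.
PERIOD_STARTS = [1, 3, 11, 19, 37, 55, 87]
GROUPS = {1: 1, 6: 14, 7: 15, 8: 16, 9: 17, 11: 1, 12: 2, 15: 15, 16: 16,
          17: 17, 19: 1, 20: 2, 26: 8, 30: 12, 35: 17, 53: 17}

def atom_one_hot(atomic_number: int, n_groups: int = 18, n_periods: int = 7) -> np.ndarray:
    """One-hot encoding of an atom's group and period, as used for the P and L nodes."""
    period = sum(atomic_number >= start for start in PERIOD_STARTS)   # 1..7
    group = GROUPS.get(atomic_number, 0)                              # 0 = 'other'
    vec = np.zeros(n_groups + 1 + n_periods)
    vec[group] = 1.0                          # slots 0..18 encode the group
    vec[n_groups + 1 + period - 1] = 1.0      # slots 19..25 encode the period
    return vec

print(atom_one_hot(6))   # carbon: group 14, period 2
```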

Edges characterise interactions between nodes. Different backbone frames (BB interactions) are initially connected via a randomised k-nearest-neighbour (kNN) scheme, with a probability that decays exponentially with the distance between the nodes. Interactions between backbone frames and selected frames (BS) are encoded similarly via an outer sum of source and destination node features, relative positional encodings of residue indices, and relative geometrical encodings of the residue backbones and, if available, of the template protein structures. The steps are shown in the following algorithm:

Interactions between selected frames (SS) are embedded in the same way. In the next step, 96 backbone frame nodes are generated as described above, and the features of those nodes are set (Line 9).
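The randomised kNN connection scheme for the backbone-backbone edges mentioned above can be sketched as follows; the number of sampled neighbours and the 10 Å length scale of the exponential decay are illustrative assumptions, not the values used in NeuralPLexer.

```python
import numpy as np

def randomised_knn_edges(coords: np.ndarray, k: int = 16, length_scale: float = 10.0,
                         rng=np.random.default_rng(0)) -> np.ndarray:
    """Connect each node to k neighbours drawn with a probability that decays
    exponentially with distance (sketch of the BB edge construction)."""
    n = coords.shape[0]
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    edges = []
    for i in range(n):
        p = np.exp(-dist[i] / length_scale)
        p[i] = 0.0                                    # no self-edges
        p /= p.sum()
        neighbours = rng.choice(n, size=min(k, n - 1), replace=False, p=p)
        edges.extend((i, int(j)) for j in neighbours)
    return np.array(edges).T                          # shape (2, n * k)

# Toy usage on 96 random backbone frame positions.
edges = randomised_knn_edges(np.random.default_rng(1).normal(size=(96, 3)) * 10)
```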

Next, contact maps and block-adjacency matrices are calculated for all ligand graphs (Lines 11–27). Each ligand graph is embedded via a new method called Multi Heat Transformer (MHT) (Line 12), which is effectively an attention-based neural network. The encoder of the MHT consists of 8 blocks. A single block contains a multi-head self-attention using the edge embeddings of the molecular graph, a pair update block for updating pair representations between the atom nodes and the frame nodes, and a node update block. After embedding the graph features, 32 anchor nodes are sampled from all ligand frame nodes, and the MHT embeddings are symmetrised.
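A small sketch of this last step, sampling anchor frame nodes and symmetrising their pair embeddings, is shown below; the tensor shapes and the random anchor selection are assumptions for illustration.

```python
import torch

def sample_anchors_and_symmetrise(frame_emb: torch.Tensor, pair_emb: torch.Tensor,
                                  n_anchors: int = 32):
    """Sample anchor frame nodes and symmetrise the pair embeddings between them."""
    n = frame_emb.shape[0]
    idx = torch.randperm(n)[:min(n_anchors, n)]       # random anchor selection
    sub = pair_emb[idx][:, idx]                        # pair block of the anchors
    sym = 0.5 * (sub + sub.transpose(0, 1))            # enforce (i, j) == (j, i)
    return frame_emb[idx], sym

# Toy usage: 40 ligand frame nodes with 64-dim node and 32-dim pair embeddings.
anchors, pair = sample_anchors_and_symmetrise(torch.randn(40, 64), torch.randn(40, 40, 32))
```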

The embeddings of the sampled ligands then enter the contact prediction module (CPM) of NeuralPLexer (Lines 14–21). The module's architecture is shown in the following figure.

Block of the CPM module of NeuralPLexer. Figure extracted from the preprint.

There are notable similarities to the Evoformer module in AlphaFold2 (AF2). For example, the triangular gated self-attention around the starting and end nodes is adapted from AF2. Note that gated self-attention was removed in AF3. Node and edge representations share information in both directions. This is also similar to AF2, whereas in the Pairformer module of AF3 information flows only from the pair representations to the single representations and not in the other direction. Six blocks of the CPM are passed before the graph network is finally constructed and handed to the next stage.

The output passes through a 3-layer MLP with a GELU activation function and a linear layer (Line 16). The embeddings are then used to calculate a distogram of contact maps with 32 bins between 2 Å and 22 Å (Line 17). From those, one-hot encodings are extracted (Line 19) and used to calculate block-adjacency matrices, which are used in the next iteration (for the next sampled ligand). The distogram of pairwise distances is therefore updated in every iteration.
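The binning convention of this distogram, 32 bins covering 2 Å to 22 Å, can be illustrated with a few lines of PyTorch. Note that in NeuralPLexer the bin probabilities are predicted by the network; this sketch merely bins given distances to show what the bins mean.

```python
import torch
import torch.nn.functional as F

def distance_to_distogram_bin(dist: torch.Tensor, n_bins: int = 32,
                              d_min: float = 2.0, d_max: float = 22.0) -> torch.Tensor:
    """Map pairwise distances (in Å) to one-hot distogram bins.
    Distances outside the range fall into the first or last bin."""
    edges = torch.linspace(d_min, d_max, n_bins + 1)
    bins = torch.bucketize(dist, edges[1:-1])          # bin index 0 .. n_bins-1
    return F.one_hot(bins, n_bins).float()

# Toy usage: distogram one-hots for random pairwise distances.
dist = torch.cdist(torch.randn(10, 3) * 10, torch.randn(10, 3) * 10)
onehot = distance_to_distogram_bin(dist)               # shape (10, 10, 32)
```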

The final distogram is used to calculate the matrix required for solving the stochastic differential equation (Line 22), from which initial ligand conformations can be sampled (Line 23). This is done for all ligand graphs.

After generating those initial ligand coordinates, 3D structures for all heavy atoms are generated using diffusion (Lines 28–38) with different step sizes (Line 29). In every diffusion step, residue and atomic graphs are created (Line 32). The residue graphs are updated by passing through the CPM (Line 33). The atomic and residue graphs are then combined into a graph representation with the interaction features mentioned above (Line 34). This representation enters the Equivariant Structure Denoising Module (ESDM) of NeuralPLexer (Line 35), shown in the following figure.
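Putting the pieces together, the generative stage can be summarised by the following sketch of a reverse-diffusion loop. The CPM, the ESDM and the graph construction are passed in as placeholder callables, and the simple interpolation update stands in for the actual SDE solver used by NeuralPLexer.

```python
import torch

def generate_structure(esdm, cpm, build_graphs, protein, ligands,
                       n_steps=100, sigma_max=1.0):
    """Sketch of the reverse-diffusion loop (Lines 28-38); callables are placeholders."""
    coords = sigma_max * torch.randn(protein.num_heavy_atoms, 3)   # noisy starting point
    ts = torch.linspace(1.0, 0.0, n_steps + 1)                     # diffusion time schedule
    for t_now, t_next in zip(ts[:-1], ts[1:]):
        graphs = build_graphs(coords, protein, ligands, t_now)     # residue + atomic graphs
        graphs = cpm(graphs)                                       # update residue-scale pairs
        denoised = esdm(coords, graphs, t_now)                     # predict the clean structure
        # Move part of the way towards the denoised prediction; the real update rule
        # follows the multivariate SDE mentioned above.
        coords = coords + (denoised - coords) * float((t_now - t_next) / t_now)
    return coords
```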

Block of the ESDM module of NeuralPLexer. Figure extracted from the preprint.

The ESDM's task is to predict denoised structures from the noisy input and the graph representation calculated in the last step. To achieve this, the input passes through a stack of 4 ESDM blocks. Attention weights are computed via Invariant Point Attention (IPA), adapted from AF2.

Finally, the pLDDT score is calculated from the denoised structures (Lines 39–48) using the pre-trained CPM network and a 6-layer MLP to get a confidence estimation for the protein and ligand.

In general, NeuralPLexer's architecture is similar to that of AF2 and adopts several parts of it (e.g. IPA, gated self-attention). Thanks to the introduction of the diffusion model and several other methodological improvements, NeuralPLexer reaches higher TM-scores than AF2 (0.942 vs 0.927) for protein predictions with high confidence (pLDDT > 0.8) on 33 apo-holo pair systems from PocketMiner. Looking more closely, NeuralPLexer generates structures corresponding to the distinct apo and holo states, while AF2 produces a mixture of structures. Those improvements are mainly due to the diffusion model but also show the importance of equivariant neural networks. It will be interesting to see how the diffusion-based AF3 compares to NeuralPLexer.

Iambic Therapeutics, which developed NeuralPLexer together with Caltech and NVIDIA, announced NeuralPLexer 2 on February 12, 2024; it significantly improves on the first version. Changes include hardware- and memory-optimised geometrical attention building blocks for higher inference throughput, an expansion of the training data to protein-nucleic acid complexes and to post-translational modifications (PTMs) and cofactors for more applications (see also the comparison to AF3), and better scaling of the model. Based on the values reported in the press release for the PoseBusters dataset, it reaches a success rate, measured as a predicted ligand pose with an RMSD of less than 2 Å, of 54.9% when pocket residues are not specified and 76.8% when they are. This beats all non-AlphaFold docking methods like Vina, DiffDock and TANKBind but is inferior to the new AF3. However, NeuralPLexer has the advantage of being about 50 times faster than AF2, and a similar speed advantage will probably hold against AF3 as well. Together with its ability to handle conformational changes upon binding, which AF3 cannot, this makes NeuralPLexer the current SOTA for commercial usage in protein-ligand complex structure prediction.

Another alternative is DynamicBind, published in Nature Communications on February 5, 2024 and developed by Galixir Technologies. Like NeuralPLexer, DynamicBind uses a deep equivariant generative diffusion model to predict protein-ligand complexes. The model randomly places the ligand around the protein and then gradually translates and rotates it, with smaller steps in each iteration cycle. After 5 of the 20 iterations, not only the ligand but also the protein, which is initially kept rigid, is translated and rotated, and the side chains are additionally adjusted. DynamicBind also uses an SE(3)-equivariant interaction module. The diffusion does not start from a completely noisy structure; instead, the native conformation is gradually changed in the direction of the AlphaFold-predicted conformation using a morph-like transformation.
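The morph-like transformation can be illustrated with a simple linear interpolation between the native conformation and the AlphaFold2 prediction; the published method uses a more elaborate transformation, so this is only a sketch of the idea.

```python
import torch

def morph_states(native_coords: torch.Tensor, af2_coords: torch.Tensor, n_states: int = 20):
    """Intermediate conformations between the native (holo) structure and the
    AlphaFold2 prediction, used instead of pure Gaussian noise as the diffusion start."""
    lambdas = torch.linspace(0.0, 1.0, n_states)
    return [(1 - lam) * native_coords + lam * af2_coords for lam in lambdas]

# Toy usage on a 120-atom structure.
states = morph_states(torch.randn(120, 3), torch.randn(120, 3))
```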

On the PDBBind dataset, DynamicBind shows a better success rate than DiffDock, TANKBind, GLIDE, or Vina. Comparisons to NeuralPLexer 1 were not reported, and AF3 did not report results on PDBBind. More interestingly, DynamicBind can detect conformational changes, similar to NeuralPLexer.

Summary and outlook

In the last 50 years, protein structure prediction methods have seen significant advancements, marked by periods of rapid progress and occasional slow phases. A significant milestone for the community was the introduction of the Critical Assessment of Protein Structure Prediction (CASP) experiments, starting in 1994. CASP provided a benchmark for evaluating prediction methods, fostering innovation and collaboration within the scientific community. Despite this, progress in the late 1990s and early 2000s was incremental, mainly due to computational limitations and a lack of diverse structural data. The mid-2000s saw the advent of more sophisticated algorithms and increased computational power, leading to improvements in ab initio (de novo) methods, which predict structures from amino acid sequences alone. However, these methods were still less accurate than homology modelling for most proteins.

A significant leap forward occurred in the 2010s through the integration of machine learning and statistical methods. Techniques like Rosetta and the use of co-evolutionary data significantly improved the accuracy of predictions. The introduction of deep learning models, particularly the release of AlphaFold by DeepMind, marked a revolutionary change. AlphaFold demonstrated unprecedented accuracy in predicting protein structures, even outperforming experimental techniques in some cases. AlphaFold 3, introduced by Google DeepMind and Isomorphic Labs in 2024, marks the next step forward in protein structure prediction and extends its capabilities to the structures and interactions of a broad range of biomolecules, including DNA, RNA and small molecules like ligands, which are crucial in drug design.

However, AlphaFold 3 has several limitations. Currently, the code is unavailable, which rules out commercial usage. More generally, AF3 cannot predict conformational changes in protein complexes. Several alternatives were presented here. DiffDock was the first docking program to use a generative diffusion model for protein-ligand structure prediction. Its success inspired others to follow this concept and build better diffusion-based protein-ligand structure prediction tools. While AF3 is the new SOTA, NeuralPLexer 2 and DynamicBind, presented above, are competitive alternatives that can generate correct structures for proteins undergoing slow conformational changes.

Understanding the success of previous tools will help generate better ones in the future. In addition, retraining previous model architectures can lead to valuable insights into how these models learn. OpenFold recently retrained AF2, and we can expect something similar with AF3.

Main resources:

Corso, Gabriele, et al. “Deep Confident Steps to New Pockets: Strategies for Docking Generalization.” arXiv preprint arXiv:2402.18396 (2024).

Qiao, Z., Nie, W., Vahdat, A. et al. State-specific protein–ligand complex structure prediction with a multiscale deep generative model. Nat Mach Intell 6, 195–208 (2024).

Lu, W., Zhang, J., Huang, W. et al. DynamicBind: predicting ligand-specific protein-ligand complex structure with a deep equivariant generative model. Nat Commun 15, 1071 (2024).
