Prediction of Protein Movement Upon Ligand Binding Using Equivariant Graph Neural Networks

Harnessing Graphein to transform protein structures into PyTorch-Geometric-ready residue-level spatial graphs and deploying equivariant graph neural networks (EGNNs) on them.

Aviv Korman
Stanford CS224W GraphML Tutorials
12 min read · May 14, 2023


By Nathaniel Chien, Aviv Korman, and James Swomley*, as part of the Stanford CS224W course project.

*All authors contributed equally and are listed alphabetically.

Please cite this article and the Graphein paper if you make use of the information presented in this article in your work.

Background and Motivation

In essence, proteins are the natural molecular machines that drive the functionality of cells and therefore lay the molecular foundation for all life as we know it. Naturally, understanding the often non-trivial association between these molecules and their molecular and cellular functions is of great interest not only to biologists but also to biomedical professionals who seek to improve human health. Given the large amount of data collected on proteins over the past decades, applying machine learning to learn such associations is a natural step toward major scientific advances with a potential impact on human health. Depending on how we choose to represent proteins computationally, different machine-learning methods can be deployed on them for typical downstream prediction tasks, such as classification or regression. But which computational representation is best suited for typical biologically relevant prediction tasks?

Before we dive into deep learning, let’s first talk biochemistry. Proteins are simply long chains of amino acids (mostly the 20 standard, proteinogenic amino acids), also referred to as polypeptides (Figure 1A). When it comes to their computational representation, however, proteins can be represented in several ways.

One trivial way to represent them is through sequences. By simply encoding amino acids into alphabetical letters, we can easily represent polypeptides and proteins as the sequences of the amino acids that comprise them (Figure 1B), analogously to how sentences are made up of words in natural language.

Figure 1: Proteins and polypeptide chains as sequences and as three-dimensional structures. (A) Proteins can be simply thought of as long polypeptide chains made of sequences of amino acids. (B) Each amino acid can be abbreviated with three letters or a single letter, allowing the encoding of proteins as simple alphabetical sequences. (C) The three-dimensional structure of proteins is the result of protein folding on several levels of organization and scale. The polypeptide chain initially forms secondary structures, such as beta-sheets and alpha-helices. These secondary structures further fold into larger, macromolecular tertiary structures. Finally, protein tertiary structures can also form structural and functional complexes on the level of so-called quaternary protein structures. Adapted from the National Human Genome Research Institute and Biochemistry Sixth Edition, 2007 W.H. Freeman and Company.

On the other hand, physically speaking, molecules are three-dimensional objects made of atoms, and thus proteins can also be represented as three-dimensional structures, known as folds, which can be characterized on multiple organizational levels and spatial scales (Figure 1C). A possible way of computationally encoding these three-dimensional structures is via spatial graphs. However, encoding these structures as spatial graphs leaves us with a certain degree of freedom as to how we construct those graphs. A typical example of such a graph construction is the so-called residue-level protein graph, which encodes amino acid residues as the nodes of the graph and chemical or spatial interactions between these residues as the edges. While other protein graph concepts, such as atom-level and secondary-structure-level graphs, are also valid representations of protein structures, for simplicity this blog post will focus only on residue-level protein graphs.

Figure 2: Representing protein structures as residue-level graphs and defining graph convolution. Amino acid residues constitute nodes and chemical interactions define edges. In graph neural networks, the convolution takes place in form of message passing and aggregating from neighboring nodes within a k-hop receptive field. Source: Fout et al. [1].

For several years now, bioinformaticians have utilized protein sequences and structures for protein function prediction and annotation [2]. However, using graph neural networks (GNNs) to encode protein structural information is a promising and relatively new area of applied deep learning research [3].

Within this context, the work presented in this blog post examines how well various protein graph representations, combined with novel spatially-informed GNN architectures such as EGNNs [4], support a particular task of biological and pharmaceutical relevance: predicting the molecular movement of proteins, such as receptors, in response to ligand binding, based solely on the unbound protein structures.

Data

Dataset

We use PSCDB, a database of protein structural changes upon ligand binding [5], which is also available as a CSV file in the official Graphein repository. The dataset consists of 891 proteins (with names and IDs), the PDB IDs of their free and bound forms, and the ligands that bind to them. The type of movement each protein undergoes in transitioning from the free to the bound form is classified into one of seven characteristic motions and labeled accordingly, as seen in Figure 3.
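For orientation, loading this CSV might look roughly like the sketch below; the file name and column names here are illustrative placeholders only, and the actual schema is shown in the accompanying Colab.

import pandas as pd

# Hedged sketch of loading the PSCDB CSV from the Graphein repository.
# The file name and column names are illustrative placeholders; see the
# accompanying Colab for the actual schema.
pscdb = pd.read_csv("pscdb.csv")

free_pdb_ids = pscdb["free_pdb"].str.lower().tolist()    # unbound-form PDB IDs
bound_pdb_ids = pscdb["bound_pdb"].str.lower().tolist()  # bound-form PDB IDs
labels = pscdb["motion_type"].tolist()                   # one of the seven motion classes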

Figure 3: The seven classes of protein movement defined in the PSCDB dataset available on the Graphein GitHub repository, originally curated by Amemiya et al. Source: Amemiya et al. [5].

Unfortunately, as is often the case with real-world data, a large portion (over a third) of the protein structures in this dataset had to be removed entirely due to duplicate entries and/or PDB structure anomalies that caused downstream processing errors. Moreover, the remaining dataset is highly imbalanced, which we tried to compensate for in several ways, including stratified splits and class-specific loss weights. More information can be found in the accompanying Google Colab notebook.

Preprocessing

From the PSCDB dataset, we use the PDB IDs of each protein to construct residue-level graphs with an open-source software package called Graphein [6]. This software streamlines the conversion of protein sequences and structures into graphs (as NetworkX objects) and then into PyTorch-Geometric-ready data objects. There are two important aspects of residue-level graph construction with Graphein: (1) determining the connectivity of the graph and (2) choosing node (and potentially edge) features.

The following code snippet shows how an example configuration can be defined using Graphein and then be used to generate a graph, similar to the one seen in the figures below:

from functools import partial

import graphein.protein as gp
from graphein.protein.graphs import construct_graph
from graphein.protein.visualisation import plotly_protein_structure_graph

# Sub-config 1: distance-based edges (5 Å threshold)
dist_edge_func = {
    "edge_construction_functions": [
        partial(gp.add_distance_threshold, threshold=5, long_interaction_threshold=0)
    ]
}

# Sub-config A: one-hot encoded amino acid node features
one_hot = {"node_metadata_functions": [gp.amino_acid_one_hot]}

# Combine the two sub-configs into a full graph-construction config
config_1A = gp.ProteinGraphConfig(**{**dist_edge_func, **one_hot})

# Build the residue-level graph for a protein directly from its PDB code
g1 = construct_graph(config=config_1A, pdb_code="1nxu")

# Interactive 3D visualization of the resulting graph
p1 = plotly_protein_structure_graph(
    g1,
    colour_edges_by="kind",
    colour_nodes_by="degree",
    label_node_ids=False,
    plot_title="my first 3D protein graph",
    node_size_multiplier=1,
)

For connectivity, we define three edge construction methods. The first creates edges between nodes (amino acids) that are within five angstroms of each other, a distance threshold found to be optimal for contact-based protein structure networks by Salamanca Viloria et al. [7]; we will refer to this as sub-config 1.

Figure 4: Interactive protein graph with edges constructed using 5Å distance threshold, yielding an average node degree of 4.77 (sub-config 1).

The second edge construction method connects nodes (amino acids) that share certain select chemical bonds: peptide bonds, hydrogen bonds, disulfide interactions, ionic interactions, and salt bridges; we will refer to this as sub-config 2.

Figure 5: Interactive protein graph with edges representing certain select chemical bonds, yielding an average node degree of 2.33 (sub-config 2).

For the final edge construction method, nodes (amino acids) are connected if they share virtually any type of direct chemical interaction, defined by a list of seven interaction types (see the Colab for details); we will refer to this as sub-config 3.

Figure 6: Interactive protein graph with edges representing a large variety of chemical interactions, yielding an average node degree of 7.28 (sub-config 3).

These three edge construction methods form graphs with significantly different types and degrees of connectivity, as evident in the interactive example visualizations and by their average node degrees.
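As a hedged sketch, sub-configs 2 and 3 can be expressed analogously to sub-config 1 above using Graphein's intramolecular edge functions. The function names below come from graphein.protein, but only a subset of the interactions is listed here (for instance, the salt-bridge function used for sub-config 2 is omitted); the exact lists are in the Colab.

import graphein.protein as gp

# Hedged sketch of sub-config 2: edges for a select set of chemical bonds
# (the salt-bridge function used in our runs is omitted here; see the Colab).
bond_edge_funcs = {"edge_construction_functions": [
    gp.add_peptide_bonds,
    gp.add_hydrogen_bond_interactions,
    gp.add_disulfide_interactions,
    gp.add_ionic_interactions,
]}

# Hedged sketch of sub-config 3: a broad set of direct chemical interactions
# (seven interaction types in total in our runs; exact list in the Colab).
interaction_edge_funcs = {"edge_construction_functions": [
    gp.add_peptide_bonds,
    gp.add_hydrogen_bond_interactions,
    gp.add_disulfide_interactions,
    gp.add_ionic_interactions,
    gp.add_aromatic_interactions,
    gp.add_cation_pi_interactions,
    gp.add_hydrophobic_interactions,
]}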

In addition, we augmented the nodes of the graph with residue-specific features corresponding to two distinct levels of biochemical detail.

Our first set of node features, termed sub-config A, uses a minimum amount of biochemical detail, simply encoding the amino acid types via one-hot vectors. Theoretically, if the model is expressive enough, it might be able to learn the effect of biochemical interactions between nodes based only on their amino acid types (similar to what different letters represent in a language model).

Our second set of node features, termed sub-config B, was designed to explicitly capture a large amount of biologically relevant information using a combination of one-hot encodings and a selected set of important biochemical and structural features, such as secondary structure annotations, residue bulkiness, relative solvent accessibility, pKa values, hydrophobicity, and more (see our Colab for more details).
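A rough sketch of how such a feature set can be assembled with Graphein is shown below. The module paths are Graphein's node-feature and DSSP modules, the exact feature set we used is listed in the Colab, and the DSSP-derived features (relative solvent accessibility, secondary structure) require the DSSP executable to be installed.

from graphein.protein.config import DSSPConfig
from graphein.protein.features.nodes.amino_acid import (
    amino_acid_one_hot,
    expasy_protein_scale,  # bulkiness, pKa, hydrophobicity, polarity, ...
    meiler_embedding,      # 7-dimensional physicochemical descriptors
)
from graphein.protein.features.nodes.dssp import rsa, secondary_structure

# Hedged sketch of the feature-rich node sub-config B (exact set in the Colab).
# ESM residue embeddings were also included in some of our configurations.
rich_node_feats = {
    "node_metadata_functions": [amino_acid_one_hot, expasy_protein_scale, meiler_embedding],
    "graph_metadata_functions": [rsa, secondary_structure],  # DSSP-derived features
    "dssp_config": DSSPConfig(),
}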

Combined, these three edge-construction methods and two node-feature sets yield six distinct graph-generation configurations for each protein (termed configs 1A, 1B, 2A, 2B, 3A, 3B), each encapsulating different protein characteristics and levels of biochemical detail.
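The six configurations can then be assembled by crossing the edge and node sub-configs. A hedged sketch of the configs_dict referenced in the next code block, reusing the dictionaries sketched above, might look like this:

import graphein.protein as gp

# Hedged sketch: crossing edge sub-configs (1-3) with node-feature
# sub-configs (A, B) to build the configs_dict used below.
edge_subconfigs = {"1": dist_edge_func, "2": bond_edge_funcs, "3": interaction_edge_funcs}
node_subconfigs = {"A": one_hot, "B": rich_node_feats}

configs_dict = {
    f"{e}{n}": gp.ProteinGraphConfig(**{**edge_dict, **node_dict})
    for e, edge_dict in edge_subconfigs.items()
    for n, node_dict in node_subconfigs.items()
}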

All that is left now is to convert the NetworkX graphs into PyTorch-Geometric (PyG) data objects (torch_geometric.data) before passing them to the data loaders (torch_geometric.loader) that feed the GNN. This conversion can easily be done with Graphein’s wrapper classes:

# Imports and some variable declarations not shown for brevity.
# See GitHub for the full code.

curr_config = configs_dict[CONFIG]

# Convert NetworkX graphs to PyG data objects, keeping the listed attributes
convertor = GraphFormatConvertor(
    src_format="nx",
    dst_format="pyg",
    columns=["coords", "edge_index",
             "amino_acid_one_hot", "bulkiness",
             "meiler", "rsa",
             "esm_embedding_A", "esm_embedding",
             "pka_rgroup", "isoelectric_points",
             "polaritygrantham", "hphob_black",
             "transmembranetendency"])

train_ds = InMemoryProteinGraphDataset(
    root=DATA,
    name=f"{CONFIG}_train",
    pdb_codes=train,
    graph_label_map=train_label_map,
    graphein_config=curr_config,
    graph_format_convertor=convertor,
    graph_transformation_funcs=[],
    num_cores=4)

valid_ds = InMemoryProteinGraphDataset(
    root=DATA,
    name=f"{CONFIG}_valid",
    pdb_codes=valid,
    graph_label_map=valid_label_map,
    graphein_config=curr_config,
    graph_format_convertor=convertor,
    graph_transformation_funcs=[],
    num_cores=4)

test_ds = InMemoryProteinGraphDataset(
    root=DATA,
    name=f"{CONFIG}_test",
    pdb_codes=test,
    graph_label_map=test_label_map,
    graphein_config=curr_config,
    graph_format_convertor=convertor,
    graph_transformation_funcs=[],
    num_cores=4)

# Create dataloaders
train_loader = DataLoader(train_ds, batch_size=64, shuffle=True, drop_last=True, num_workers=16)
valid_loader = DataLoader(valid_ds, batch_size=64, shuffle=False, drop_last=False, num_workers=16)
test_loader = DataLoader(test_ds, batch_size=64, shuffle=False, drop_last=False, num_workers=16)

By passing the resulting batched graphs (in the corresponding data loaders) as inputs into an EGNN, we can now investigate which types of configurations might best inform our model with respect to possible protein molecular movements. But first, let’s briefly introduce our model, namely EGNNs.

Model

Graph neural networks (GNNs) are a natural choice for modeling proteins since they can process the 3D structure of amino acids as well as their physical properties and the bonds between them. To take advantage of the spatial information in protein structures, however, any architecture we choose must be equivariant not only to permutations, but also to rotations, translations, and reflections (Figure 7). Without this property, a GNN could yield different outputs for two identical protein graphs that differ only in their spatial orientation (for example, rotated or reflected).

Figure 7: Illustration of equivariance to transformations and permutations. Source: Satorras et al. [4].

Because of this, we will be applying a modified GNN architecture called equivariant graph neural networks (EGNNs), as described by Satorras et al. [4]. EGNNs were specifically designed to be equivariant to relevant spatial transformations.

EGNNs are made up of equivariant graph convolutional layers, which take as input the node embeddings h^l and coordinate embeddings x^l at layer l, along with edge attributes a_ij. The embeddings for the next layer are then computed through the following process:
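For each layer l and edge (i, j), the equivariant graph convolutional layer of Satorras et al. [4] computes:

m_{ij} = \phi_e\left(h_i^l, h_j^l, \left\| x_i^l - x_j^l \right\|^2, a_{ij}\right)

x_i^{l+1} = x_i^l + C \sum_{j \neq i} \left(x_i^l - x_j^l\right) \phi_x\left(m_{ij}\right)

m_i = \sum_{j \neq i} m_{ij}

h_i^{l+1} = \phi_h\left(h_i^l, m_i\right)

where \phi_e, \phi_x, and \phi_h are learned MLPs and C is a normalization constant (1/(M-1) for M nodes in [4]).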

The first two equations are what differ from a classical graph convolutional layer: the coordinates x enter only through inter-node distances and radially oriented relative-position vectors, which is why EGNNs preserve translation, rotation, and reflection equivariance while classical GNNs fed raw coordinates as node features do not.
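To make these equations concrete, here is a minimal PyTorch sketch of a single equivariant layer. It is a simplified illustration (no edge attributes, no normalization of the coordinate update, plain sum aggregation over incoming edges) and not the exact layer used in our experiments; see the Colab for the full model.

import torch
import torch.nn as nn


class EGCL(nn.Module):
    """Minimal sketch of one equivariant graph convolutional layer,
    following Satorras et al. [4] (simplified; no edge attributes)."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        # phi_e: message MLP over (h_i, h_j, ||x_i - x_j||^2)
        self.phi_e = nn.Sequential(
            nn.Linear(2 * hidden_dim + 1, hidden_dim), nn.SiLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.SiLU())
        # phi_x: scalar weight for the radial coordinate update
        self.phi_x = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.SiLU(),
            nn.Linear(hidden_dim, 1))
        # phi_h: node-update MLP over (h_i, aggregated messages)
        self.phi_h = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim), nn.SiLU(),
            nn.Linear(hidden_dim, hidden_dim))

    def forward(self, h, x, edge_index):
        src, dst = edge_index                         # edges j -> i (dst = i)
        rel = x[dst] - x[src]                         # x_i - x_j, shape [E, 3]
        dist2 = (rel ** 2).sum(dim=-1, keepdim=True)  # squared distances, [E, 1]

        # Messages depend only on E(3)-invariant quantities
        m_ij = self.phi_e(torch.cat([h[dst], h[src], dist2], dim=-1))

        # Coordinate update: weighted sum of radial (relative-position) vectors
        x = x.index_add(0, dst, rel * self.phi_x(m_ij))

        # Aggregate messages per node and update the node embeddings (residual)
        m_i = torch.zeros_like(h).index_add(0, dst, m_ij)
        h = h + self.phi_h(torch.cat([h, m_i], dim=-1))
        return h, x

Because the coordinates enter only through squared distances and relative-position vectors, rotating, translating, or reflecting the input coordinates transforms the output coordinates in the same way and leaves the node embeddings unchanged.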

Now that we have explained how the protein dataset was processed, how it was transformed into graphs according to six configurations, and what the EGNN model is all about, let’s proceed to the results!

Results

We split the 517 proteins in our cleaned dataset into three partitions (train/validation/test) in a stratified fashion according to the class labels, to account for the class imbalance.
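Such a split can be obtained, for example, with scikit-learn's train_test_split and its stratify argument; the split fractions below are illustrative, and pdb_codes and labels stand for the PDB codes and movement labels of the cleaned dataset.

from sklearn.model_selection import train_test_split

# Hedged sketch: stratified train/validation/test split (illustrative fractions).
train, rest, y_train, y_rest = train_test_split(
    pdb_codes, labels, test_size=0.3, stratify=labels, random_state=42)
valid, test, y_valid, y_test = train_test_split(
    rest, y_rest, test_size=0.5, stratify=y_rest, random_state=42)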

Furthermore, to prevent the model from overfitting the well-supported classes, we included class-specific weights in our cross-entropy loss function, set to the reciprocals of the class sizes. In addition, we tuned the learning rate of our Adam optimizer and the batch size to improve training.
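In code, the class weighting and optimizer setup can be sketched as follows; y_train is the list of training labels from the split above (assumed to be integer class indices), and model and the learning rate are placeholders.

import numpy as np
import torch
import torch.nn as nn

# Hedged sketch: class weights as reciprocal class sizes, a weighted
# cross-entropy loss, and an Adam optimizer (learning rate is a placeholder).
class_counts = np.bincount(np.asarray(y_train))
class_weights = torch.tensor(1.0 / class_counts, dtype=torch.float32)

criterion = nn.CrossEntropyLoss(weight=class_weights)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)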

The resulting best-performing classifiers (best macro F1 out of two runs) are shown in Table 1, alongside the support values for each class in the entire dataset.

Table 1: Results of the best model for each graph configuration. Configuration 2A performs best based on the macro F1 score, which at 0.27 is only modest. Some of the models (e.g. 1B, 2B) strongly overfit a single class. Class numbering corresponds to the following movement labels: burying ligand, coupled domain, coupled local, independent domain, independent local, and no significant motion.

While these results indicate that we were not able to train highly successful classifiers for this particular task, further analysis of our models helps explain the relatively weak classification performance and offers insight into which graph configurations might still be best suited to it.

The resulting training and validation loss curves can be seen in Figure 8. Looking at these curves more closely gives some insight into which node and edge construction functions are best suited to our task. For example, the plots suggest that the model easily overfits when given a smaller set of edges: the validation loss for configs 1A and 2A starts increasing soon after training begins, even though the training loss continues to decrease. The training and validation losses for config 3A, on the other hand, track each other much more closely. This indicates that the sparser edge constructions may be too simple to represent our problem, especially when combined with the less expressive one-hot encoded node features.

But even with the most complex configurations, the model still performs better on the training data than on the validation or test sets. This is most likely a symptom not only of our chosen model parameters but also of the size of our dataset. With relatively few training examples, the model can easily learn dataset-specific quirks rather than the truly informative features we are looking for.

Figure 8: Train and validation loss for config 2A (above) and 3A (below). The validation loss for config 2A begins to increase, even though the training loss continues to decrease.

Another interesting observation was that many of the models trained on the more complex sub-config B became trapped in a local minimum and predicted only a single class, as seen in Figure 9. This is likely caused by the greater complexity due to the increased feature dimensionality: since optimization is more difficult, such local minima become more attractive solutions for the model.

Figure 9: Selected class accuracies for our config 2B model. Although the model initially learns effectively, it eventually becomes trapped in a local minimum where it predicts only a single class.

While many of our models trained with the more detailed node features (B sub-configs) eventually got stuck in the local minimum of predicting only one class, we still believe that the more explicit biochemical properties have the potential to produce much better results with further fine-tuning. One of our training runs using config 3B showed promise, with no evidence of the overfitting seen in the simpler sub-config A models (Figure 10). Moreover, its results were still improving at the end of the full 500 epochs. We hypothesize that further adjusting the learning rate, training time, class weights, and regularization could improve these results further.

Figure 10: Training and validation loss and macro F1 scores for our model trained on config 3B. While this model did not achieve our highest accuracy, it shows much less evidence of overfitting than the simpler models, and its metrics were still improving at the end of training.

Conclusion

Graph neural networks are a natural and expressive way to represent the structure, connectivity, and biochemical properties of proteins. In this project, we have shown how to use Graphein to quickly and easily generate graphs for any protein in the Protein Data Bank, as well as how to create an equivariant graph neural network to perform classification tasks on these proteins.

In the process of creating the model and processing our data, we experienced firsthand the difficulty of working with real-world data. Inconsistencies in experimentally determined data caused major downstream problems that required us to perform extensive dataset cleaning. However, the ability to generate graphs for any protein has the potential to fuel research and deepen understanding in countless areas of biology.

While our best-performing model was not able to achieve the state-of-the-art performance we were aiming for, experimenting with different graph construction methods gave us important insights into which features to include when constructing a protein graph. Using features that are too simple, such as one-hot encodings and distance-based edge construction, led to overfitting of the training data. On the other hand, using hand-picked features that reflect the proteins’ biochemical properties led to a more difficult optimization problem and a tendency to get caught in local minima. However, given more time, similar modeling choices combined with a smaller, more selectively chosen feature set and a more exhaustive hyperparameter search might still achieve better results than those reported here.

However, while this report has gone into great detail analyzing possible modeling choices and their results, these models were deployed on real-world, human-annotated data. The data were annotated using the bound forms of the proteins, whereas our models only saw the unbound forms. In the absence of a well-performing baseline model, these models cannot be objectively discarded or endorsed with a great degree of certainty. Thus, the poor performance might also simply reflect that this annotated protein dataset, as we have used it, is insufficiently informative for the task. One way to overcome this hurdle might be to feed the models further information about the proteins (e.g. evolutionary information) or about the possible movement modes of receptors (e.g. molecular dynamics data).

Accompanying Colab: https://colab.research.google.com/drive/1teEQHQte3ExKteIAmC6ciplCwuj_hAyF?usp=sharing

Accompanying GitHub repository: https://github.com/avivko/cs224w

[1] Alex Fout, Jonathon Byrd, Basir Shariat, and Asa Ben-Hur. 2017. Protein interface prediction using graph convolutional networks. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS’17). Curran Associates Inc., Red Hook, NY, USA, 6533–6542.

[2] Radivojac, P., Clark, W., Oron, T. et al. A large-scale evaluation of computational protein function prediction. Nat Methods 10, 221–227 (2013). https://doi.org/10.1038/nmeth.2340

[3] Gligorijević, V., Renfrew, P.D., Kosciolek, T. et al. Structure-based protein function prediction using graph convolutional networks. Nat Commun 12, 3168 (2021). https://doi.org/10.1038/s41467-021-23303-9

[4] Victor Garcia Satorras, Emiel Hoogeboom, and Max Welling. E(n) equivariant graph neural networks. In International conference on machine learning, pages 9323–9332. PMLR, 2021.

[5] Takayuki Amemiya, Ryotaro Koike, Akinori Kidera, and Motonori Ota. PSCDB: a database for protein structural change upon ligand binding. Nucleic Acids Research, 40(D1):D554–D558, November 2011.

[6] Arian R. Jamasb, Ramon Viñas, Eric J. Ma, Charlie Harris, Kexin Huang, Dominic Hall, Pietro Lió, and Tom L. Blundell. Graphein - a python library for geometric deep learning and network analysis on protein structures and interaction networks. bioRxiv, 2021.

[7] Salamanca Viloria J, Allega MF, Lambrughi M, Papaleo E. An optimal distance cutoff for contact-based Protein Structure Networks using side-chain centers of mass. Sci Rep. 2017 Jun 6;7(1):2838. doi: 10.1038/s41598-017-01498-6. PMID: 28588190; PMCID: PMC5460117.
