GAT for PPI + Protein Modulator Prediction — BrainCodeCamp 2023

spycoder
3 min read · Nov 6, 2023


Predicting protein-protein interactions (PPI) is one of the most important problems in modern-day computational biology. PPIs are the basis of almost all functions in the human body, and the understanding and regulation of PPIs are critical to finding new medical approaches to old, longstanding problems.

In this project, I reproduced the results of a paper called Struct2Graph, which uses a GAT (graph attention network) to create 3D structural embeddings of proteins. These embeddings are then fed, together with the embedding of a second protein, into a feed-forward neural network that predicts the interaction probability.

I then developed my own extension to this project, in which I attempt to do the same for protein modulators.

Problem Statement

Can we use protein structural information, and only structural information (no other chemical information), to predict protein interaction probability?

  • Secondary question: Can we use protein and PPI modulator structural information to predict modulator interaction probability?

Datasets:

For the Struct2Graph reproduction:

  • STRING, BioGRID, IntAct, MINT, BIND, DIP, HPRD, APID, OpenWetWare (a compiled database was downloaded from the paper’s git repo)
  • PDB (Protein Data Bank) for 3D protein structures

For modulator prediction:

  • PubChem for the 3D conformers of small-molecule modulators (which don’t have PDB entries)

Model and workflow

Struct2Graph model workflow

Note: I’ll describe the model and workflow for both the Struct2Graph and protein modulator experiments together.

Step 1: Download and preprocess data

This step is straightforward for the paper reproduction, because the interaction data comes pre-compiled from the paper’s repository and the 3D protein structures are readily available from the Protein Data Bank. However, for the protein modulator experiment, modulators are small molecules rather than proteins, so they don’t have PDB IDs, and their 3D conformers (predicted 3D structures of the molecules) had to be sourced from PubChem instead.
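To make this concrete, here is a minimal sketch in Python of how one might pull the raw files. The PDB ID and PubChem CID below are placeholders for illustration, and the endpoints shown (RCSB file download and the PubChem PUG REST service) are my assumptions about a reasonable download path, not the project’s actual scripts.

```python
# Sketch: download a protein structure (RCSB PDB) and a small-molecule 3D
# conformer (PubChem PUG REST). IDs below are placeholders, not dataset entries.
import requests

def fetch_pdb_structure(pdb_id: str, out_path: str) -> None:
    """Download a protein structure file from the RCSB PDB."""
    url = f"https://files.rcsb.org/download/{pdb_id}.pdb"
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    with open(out_path, "w") as f:
        f.write(resp.text)

def fetch_pubchem_3d_conformer(cid: int, out_path: str) -> None:
    """Download a computed 3D conformer (SDF) for a small molecule from PubChem."""
    url = (
        "https://pubchem.ncbi.nlm.nih.gov/rest/pug/"
        f"compound/cid/{cid}/record/SDF?record_type=3d"
    )
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    with open(out_path, "w") as f:
        f.write(resp.text)

if __name__ == "__main__":
    fetch_pdb_structure("1AKE", "1AKE.pdb")            # placeholder protein
    fetch_pubchem_3d_conformer(2244, "cid2244.sdf")    # placeholder modulator
```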

Step 2: Creating embeddings

For Struct2Graph, atoms are first grouped into vertices that structurally represent amino acid residues, while in my experiment I kept individual atoms as vertices, since the modulator molecules are much smaller.

Then, the vertices are run through a Weisfeiler-Lehman-like algorithm that encodes the 1-hop neighborhood of each vertex to produce the graph’s initial node embeddings.

Example 3D conformer visualization

Note that there is a maximum number of residues (amino acids) that are included in the graph embedding.
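As a rough illustration of the idea (not Struct2Graph’s actual code), a single Weisfeiler-Lehman-style pass over a toy graph could look like the sketch below. The graph representation, labels, and the MAX_VERTICES cap are assumptions made for the example.

```python
# Sketch: one WL-style relabeling pass (one iteration = 1-hop neighborhood).
from collections import defaultdict

MAX_VERTICES = 500  # illustrative cap on residues/atoms per graph

def wl_one_hop_labels(adjacency: dict[int, list[int]],
                      labels: dict[int, str]) -> dict[int, int]:
    """Relabel each vertex by combining its own label with its sorted neighbor
    labels, then compress the resulting strings into integer IDs that can be
    used as embedding indices."""
    new_labels = {}
    for v in sorted(adjacency)[:MAX_VERTICES]:
        neighborhood = sorted(labels[u] for u in adjacency[v])
        new_labels[v] = labels[v] + "|" + ",".join(neighborhood)

    # Map each distinct string label to a consecutive integer ID.
    id_map = defaultdict(lambda: len(id_map))
    return {v: id_map[lab] for v, lab in new_labels.items()}

if __name__ == "__main__":
    # Tiny toy graph: three residues in a path.
    adjacency = {0: [1], 1: [0, 2], 2: [1]}
    labels = {0: "ALA", 1: "GLY", 2: "ALA"}
    print(wl_one_hop_labels(adjacency, labels))  # {0: 0, 1: 1, 2: 0}
```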

Step 3: GCN + Attention = GAT

The node embeddings of the graph are run through a standard GCN forward pass, and the weight matrices used in this step are learned during training.

There are also attention matrices that are trained to attend across the two different protein embeddings in a pair.
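The sketch below shows what a single graph attention layer (GCN-style aggregation reweighted by learned attention) could look like in PyTorch. It is an illustrative dense-adjacency formulation with assumed layer sizes, not Struct2Graph’s implementation, and the mutual attention between the two proteins is handled separately before the prediction step.

```python
# Sketch: one graph attention layer (learned weight matrix + learned attention).
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphAttentionLayer(nn.Module):
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)     # trained weight matrix
        self.attn = nn.Linear(2 * out_dim, 1, bias=False)   # trained attention matrix

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x: (N, in_dim) node embeddings; adj: (N, N) 0/1 adjacency with self-loops
        h = self.W(x)                                        # (N, out_dim)
        n = h.size(0)
        # Pairwise attention logits e_ij = a([h_i || h_j])
        h_i = h.unsqueeze(1).expand(n, n, -1)
        h_j = h.unsqueeze(0).expand(n, n, -1)
        e = F.leaky_relu(self.attn(torch.cat([h_i, h_j], dim=-1)).squeeze(-1))
        # Mask non-edges, then normalize attention over each node's neighbors
        e = e.masked_fill(adj == 0, float("-inf"))
        alpha = torch.softmax(e, dim=-1)                     # (N, N)
        return F.elu(alpha @ h)                              # attention-weighted aggregation

if __name__ == "__main__":
    x = torch.randn(4, 16)
    adj = torch.eye(4) + torch.tensor(
        [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], dtype=torch.float)
    print(GraphAttentionLayer(16, 32)(x, adj).shape)  # torch.Size([4, 32])
```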

Step 4: Prediction

Lastly, the concatenated embeddings of the two proteins, after passing through the attention matrices, are fed into a simple 2-layer feed-forward neural network. A softmax then converts the output into an interaction probability between 0 and 1.
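A minimal PyTorch sketch of such a prediction head is below; the embedding and hidden dimensions are assumptions for illustration, not the paper’s actual hyperparameters.

```python
# Sketch: concatenate two attended protein embeddings, run a 2-layer MLP,
# and apply softmax over two classes (no interaction / interaction).
import torch
import torch.nn as nn

class InteractionHead(nn.Module):
    def __init__(self, embed_dim: int = 64, hidden_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 2),   # two logits: no-interaction / interaction
        )

    def forward(self, protein_a: torch.Tensor, protein_b: torch.Tensor) -> torch.Tensor:
        logits = self.net(torch.cat([protein_a, protein_b], dim=-1))
        return torch.softmax(logits, dim=-1)[..., 1]  # interaction probability in [0, 1]

if __name__ == "__main__":
    a, b = torch.randn(8, 64), torch.randn(8, 64)
    print(InteractionHead()(a, b))  # batch of interaction probabilities
```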

Results

I have the results of the paper reproduction, but my extension of this work is still in progress.

My reproduction results match those reported in the paper:

My reproduced results

Conclusion and Future Work

The structural embeddings of proteins alone are sufficient to predict protein-protein interactions at state-of-the-art (SOTA) accuracy.

However, this accuracy is limited to protein-protein interactions that are well known and similar to the training data. The model may also generalize less well across the PPIs of different species, since it has less relevant context to draw on.

This model is useful as a proof of concept of the large influence that structure has on protein function, and it can be used as a screener for drug discovery.

In addition, since PPIs can be predicted from structure alone, structure can be treated as one of the most important factors in generative/RL approaches to the de novo (from-scratch) design of drugs.
