Protein-Ligand Binding Site Prediction using 3D Image Segmentation

Roshan1999
Published in Bayes Labs
Nov 16, 2020

INTRODUCTION

Inferring knowledge from highly complex, high-dimensional data has always been a challenge in biology. Recently, deep learning algorithms have achieved state-of-the-art results in a variety of tasks such as image classification, speech recognition, language translation and object detection. These algorithms take raw inputs described by a set of features and produce predictions for a given task based on patterns in the data, and they perform especially well when large amounts of data are available. Since biology is a data-rich field with complex and unstructured data, deep learning can be applied to many biological tasks and has the potential to transform the field. Deep learning approaches have already improved on the scores achieved by traditional methods in specific tasks, although the gains in some studies are modest. By harnessing heterogeneous data across several dimensions of natural variation, such methods can help answer a biological or medical question by identifying essential features and predicting outcomes.

Computer-aided drug design aims to make the drug discovery process faster and cheaper. Current research focuses mostly on the docking and scoring part of the drug discovery pipeline, but these methodologies assume that the binding site of the protein has already been determined with high confidence. In practice this is often not the case, and current methods struggle to locate druggable binding sites with high accuracy.

TRADITIONAL APPROACHES

Traditional approaches for binding cavity detection are typically geometry-based, but there are also tools that use binding energy to different chemical probes, sequence conservation (template or evolutionary methods), or a combination of these. For example, ProBiS, a similarity-based tool, uses local surface alignment with sub-residue precision to find sites with physicochemical properties similar to the templates stored in its database. Such methods simultaneously detect binding sites and provide some insight into their expected properties, which are most probably similar to those of the templates they were matched to.

Other approaches rely on a two-step algorithm, in which potential pockets are first identified and then scored to select the most probable binding sites. For example, Fpocket is a geometry-based method that first finds cavities in a protein’s structure and then scores them. The reverse approach is used in P2Rank, which uses a random forest (RF) model to predict a “ligandability” score for each point on a protein’s surface and then clusters the points with high scores.

The method described in this article instead uses 3D convolutional layers to classify each point in the 3D space occupied by the protein as belonging to a binding site or not, similar to a 3D image segmentation task. Predictions can be saved as .cmap or .cube files and later analyzed in molecular modelling software. The model can also output the parts of the protein that form pockets and save them as .mol2 or .pdb files.

Difference between classification and segmentation task in deep learning

METHOD FOR 3D SEGMENTATION

The scPDB database, which contains 16034 annotated druggable binding sites from 4782 proteins and 6326 ligands, was used to train the deep learning model. The dataset contains protein structures originating from 952 different organisms, of which the most abundant are human (34.4%), E. coli (5.6%), human immunodeficiency virus (4.2%), rat (2.9%), and mouse (2.4%). Both the input and the output of the model are represented as 3D grids. In the input grid, each voxel holds 18 features describing the atoms located in it, extracted using the Open Babel Python package.
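The mapping from atom coordinates to grid voxels can be sketched as follows. This is a minimal illustration, not the original implementation: the 2 Å spacing, the ±35 Å extent, the sparse dictionary grid, and the choice to sum features of atoms that share a voxel are all assumptions that the article does not state, and `voxel_index` and `voxelize` are hypothetical helper names.

```python
import math

# Assumed values, not stated in the article: 2 Å grid spacing
# (0.5 voxels per Å) covering ±35 Å around the grid centre.
MAX_DIST = 35.0  # Å, assumption
SCALE = 0.5      # voxels per Å, assumption
BOX_SIZE = math.ceil(2 * MAX_DIST * SCALE + 1)  # 36 voxels per axis


def voxel_index(coord, centre):
    """Map an (x, y, z) position in Å to grid indices, or None if outside the box."""
    index = []
    for c, c0 in zip(coord, centre):
        # Shift so the box starts at 0, scale to voxels, round half up.
        i = math.floor((c - c0 + MAX_DIST) * SCALE + 0.5)
        if not 0 <= i < BOX_SIZE:
            return None
        index.append(i)
    return tuple(index)


def voxelize(atoms, centre, n_features=18):
    """Build a sparse grid {(i, j, k): feature vector} from (coord, features) pairs.

    Features of atoms that fall into the same voxel are summed (assumption).
    """
    grid = {}
    for coord, features in atoms:
        idx = voxel_index(coord, centre)
        if idx is None:
            continue  # atom lies outside the grid
        cell = grid.setdefault(idx, [0.0] * n_features)
        for f, value in enumerate(features):
            cell[f] += value
    return grid
```

A dense tensor of shape (18, 36, 36, 36) can be filled from this sparse grid by writing each feature vector at its voxel index and leaving all other voxels at zero.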

The 18 features used to describe an atom are:

  • 9 bits (one-hot or all null) encoding atom types: B, C, N, O, P, S, Se, halogen and metal
  • 1 integer (1, 2, or 3) with atom hybridization: hyb
  • 1 integer counting the number of bonds with other heavy atoms: heavy_valence
  • 1 integer counting the number of bonds with other heteroatoms: hetero_valence
  • 5 bits (1 if present) encoding properties defined with SMARTS patterns: hydrophobic, aromatic, acceptor, donor and ring
  • 1 float with partial charge: partialcharge
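To make the layout of the 18-element vector concrete, here is a small sketch that assembles it from a plain dictionary describing one atom. The dictionary keys, the `is_metal` flag, the helper names and the ordering of the features within the vector are assumptions made for illustration; in practice these values would come from Open Babel rather than being typed in by hand.

```python
ATOM_CLASSES = ["B", "C", "N", "O", "P", "S", "Se", "halogen", "metal"]
HALOGENS = {"F", "Cl", "Br", "I", "At"}
SMARTS_FLAGS = ["hydrophobic", "aromatic", "acceptor", "donor", "ring"]


def atom_class(element, is_metal=False):
    """Return the atom-type class for an element symbol, or None if it has no class."""
    if is_metal:
        return "metal"
    if element in HALOGENS:
        return "halogen"
    return element if element in ATOM_CLASSES else None


def encode_atom(atom):
    """Encode one atom (a dict with hypothetical keys) as an 18-element list."""
    # 9 bits: one-hot atom type, left all null for unclassified elements.
    one_hot = [0.0] * len(ATOM_CLASSES)
    cls = atom_class(atom["element"], atom.get("is_metal", False))
    if cls is not None:
        one_hot[ATOM_CLASSES.index(cls)] = 1.0
    # 3 integers: hybridization and the two bond counts.
    counts = [float(atom["hyb"]),
              float(atom["heavy_valence"]),
              float(atom["hetero_valence"])]
    # 5 bits: properties that would be matched with SMARTS patterns.
    flags = [1.0 if name in atom["flags"] else 0.0 for name in SMARTS_FLAGS]
    # 1 float: partial charge.
    return one_hot + counts + flags + [float(atom["partialcharge"])]


nitrogen = {"element": "N", "hyb": 2, "heavy_valence": 2, "hetero_valence": 0,
            "flags": {"donor", "ring"}, "partialcharge": -0.25}
vector = encode_atom(nitrogen)
```

Here `vector` has a 1 in the third position (the N class), followed by the three integers, the five property bits and the partial charge.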

The output grid has the same size, centre and resolution as the input grid, but it holds a binary mask marking the presence of binding site atoms instead of atomic features. The output grid was converted to 3D probability densities for loss calculation. The input grid has the shape (18, 36, 36, 36), while the output grid has the shape (1, 36, 36, 36).

Example: representation of Protein 3D grid (INPUT)
Example: representation of Binding site 3D grid (OUTPUT)

The deep learning model is similar to the U-Net architecture, modified for the binding site prediction task. It consists of an encoder and a decoder: the encoder compresses the input representation into a latent space, and the decoder makes predictions from that latent space, localizing features for accurate voxel-level predictions. The model was developed in PyTorch and contains 4 encoder blocks and 4 decoder blocks, with one convolutional block in the bottleneck between them. All the 2D blocks of the original U-Net architecture were replaced with 3D blocks, since the input is a 3D grid. Each block consists of two convolutional layers with the same number of filters (32, 64, 128, 256, or 512), a kernel size of 3×3×3 voxels and a ReLU activation function, combined with either a max-pooling layer or an up-sampling layer. The first two max-pooling layers and the last two up-sampling layers have 2×2×2 patch sizes, while the layers in the middle have 3×3×3 patch sizes. As a result, the feature maps in the middle of the network have a spatial size of 1×1×1 and can be used as feature vectors.

The model was trained with a batch size of 32 for 100 epochs, after which the Dice loss showed no further improvement. Dice loss was used as the loss function for backpropagation. Discretized volume overlap (DVO), which measures how closely the predicted site overlaps with the actual binding site, was used as the evaluation metric. The dataset was split into a training set and a test set, and the model achieved a DVO score of 0.623 on the test set.
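The pooling and up-sampling sizes quoted above determine how the 36×36×36 grid shrinks to 1×1×1 and grows back. The sketch below only traces channel counts and spatial sizes through the network; it is not the model itself. It assumes that the convolutions preserve the spatial size (padded convolutions) and that the decoder mirrors the encoder's filter counts, neither of which is stated explicitly in the article. `trace_shapes` is a hypothetical helper.

```python
FILTERS = [32, 64, 128, 256, 512]  # filters per block, from the article
POOL_SIZES = [2, 2, 3, 3]          # first two pools 2x2x2, last two 3x3x3


def trace_shapes(in_channels=18, size=36):
    """Return (stage, channels, edge length) for each stage of the network.

    Assumes size-preserving convolutions and a decoder that mirrors the encoder.
    """
    shapes = [("input", in_channels, size)]
    # Encoder: two convolutions, then max-pooling.
    for i, pool in enumerate(POOL_SIZES):
        shapes.append((f"encoder {i + 1} conv", FILTERS[i], size))
        if size % pool:
            raise ValueError(f"size {size} is not divisible by pool {pool}")
        size //= pool
        shapes.append((f"encoder {i + 1} pool", FILTERS[i], size))
    # Bottleneck: one convolutional block at the smallest resolution.
    shapes.append(("bottleneck", FILTERS[-1], size))
    # Decoder: up-sampling (3, 3, 2, 2), then two convolutions.
    for i, up in enumerate(reversed(POOL_SIZES)):
        size *= up
        shapes.append((f"decoder {i + 1}", FILTERS[-2 - i], size))
    # Final layer: one channel with a per-voxel binding site probability.
    shapes.append(("output", 1, size))
    return shapes


for stage, channels, edge in trace_shapes():
    print(f"{stage:16s} channels={channels:3d} grid={edge}x{edge}x{edge}")
```

The edge length follows 36 → 18 → 9 → 3 → 1 in the encoder and 1 → 3 → 9 → 18 → 36 in the decoder, which is why the pooling sizes have to be 2, 2, 3, 3 rather than 2 throughout: 36 is not divisible by 16.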

Deep learning model used for 3D segmentation
Example input protein to the neural network
Actual binding site
Predicted binding site by the trained model (Note the similarities)
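The similarity between a predicted and an actual site can be put into numbers with the two quantities used in this work. The sketch below shows a common form of the Dice loss on flattened voxel values and an intersection-over-union overlap on sets of occupied voxels. The smoothing term in the loss and the reading of DVO as intersection over union of voxelized volumes are assumptions; the original implementation may discretize the volumes differently. Both function names are hypothetical.

```python
def dice_loss(pred, target, smooth=1.0):
    """Return 1 - Dice coefficient for two flat sequences of voxel values.

    pred holds probabilities in [0, 1], target holds the binary mask.
    The smooth term (assumption) avoids division by zero on empty grids.
    """
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return 1.0 - (2.0 * intersection + smooth) / (total + smooth)


def volume_overlap(pred_voxels, true_voxels):
    """Return intersection over union of two collections of voxel indices."""
    pred_voxels, true_voxels = set(pred_voxels), set(true_voxels)
    union = pred_voxels | true_voxels
    if not union:
        return 0.0
    return len(pred_voxels & true_voxels) / len(union)


predicted = {(0, 0, 0), (0, 0, 1)}
actual = {(0, 0, 1), (0, 0, 2)}
overlap = volume_overlap(predicted, actual)  # one shared voxel out of three
```

An overlap of 1.0 means the predicted and actual volumes coincide exactly, and 0.0 means they share no voxels.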

CONCLUSION

Inference can be run on a CPU and detects single or multiple binding sites in just under 10 seconds. Deep learning methods have gained popularity in recent years because of their flexibility and their potential for capturing complex relationships hidden in the data. This work can therefore also be seen as an example of adapting deep learning methods developed in other fields to structural bioinformatics.

REFERENCES

  • Stepniewska-Dziubinska, M. M., Zielenkiewicz, P., & Siedlecki, P. (2020). Improving detection of protein-ligand binding sites with 3D segmentation. Scientific Reports, 10(1). https://doi.org/10.1038/s41598-020-61860-z
  • Stepniewska-Dziubinska, M. M., Zielenkiewicz, P., & Siedlecki, P. (2018). Development and evaluation of a deep learning model for protein-ligand binding affinity prediction. Bioinformatics, 34(21), 3666–3674. https://doi.org/10.1093/bioinformatics/bty374
