Speeding up drug discovery with machine learning for protein-ligand interaction prediction

Sofiya Garkot
What’s Next In…?
11 min read · Oct 27, 2020


Co-authors: Maksym Druchok, Dzvenymyra Yarish, Oleksandr Gurbych

Introduction

Drug development is a complex and time-consuming process: it costs billions and takes years of research from the initial idea to pills on the market. Many drugs are protein-oriented, since proteins form the basis of living cells and play a central role in biological processes. Proteins catalyze reactions in our bodies, transport molecules such as oxygen, keep the immune system healthy, and transmit messages from cell to cell. Malfunctioning proteins can disrupt biological processes, and viruses can mimic normal proteins to exploit otherwise properly working processes. The early stages of drug discovery start with localizing the cause of a disease or malfunction and understanding its mechanism at the molecular level.

Once the target protein causing the disease is localized, one can proceed with designing a molecule that will inhibit it. In practice, it is not a single inhibitor molecule but a list of potential inhibitors. Such inhibitor molecules are designed to occupy one or more active sites on the target protein and block the protein's function, thus influencing the disease. The strength of interaction between an inhibitor molecule and a protein is quantified as binding affinity, which reflects the strength of interaction for a given receptor-ligand pair (the receptor is the target protein and the ligand is a potential inhibitor molecule). Affinity can be expressed in various ways: as a binding free energy, an inhibition constant, or a half-maximal inhibitory concentration, among others.

Typically, a list of drug candidates includes millions of compounds, which should be screened and rated with respect to their binding affinity. Running in silico high-throughput screening helps shrink the initial pool of in vitro tests by eliminating weakly scored candidates. On the one hand, physico- and biochemical data are becoming available at an ever-increasing rate. On the other hand, massive data arrays are difficult to generalize with classical analytic approaches, which are often computationally intense. Thus, building machine learning (ML) models that can digest big data from available biological assays and previous calculations is of great importance.

The importance of binding affinity in the drug discovery process can be illustrated by one of the hottest recent topics in healthcare: the search for a protease inhibitor that will block the SARS-CoV-2 infection mechanism.

Left: the SARS-CoV-2 capsid approaches a host cell; right: the spatial structure of the protease with its active site (blue) and a candidate inhibitor (green).

The coronavirus capsid invades cells via a multi-step mechanism [left]: first, the SARS-CoV-2 spike is "activated" by a protease (a protein), then the activated spike interacts with ACE2 (a protein on the cell's membrane). If the interaction succeeds, the virus capsid penetrates the cell and releases its encapsulated RNA strands, thus infecting the cell. A potential strategy to inhibit SARS-CoV-2 activity is to design a molecule with a strong affinity to the protease's active site [right], blocking its function and thereby cell infection. In the case of the SARS-CoV-2 pandemic, rapidly screening large data arrays and suggesting effective leads can help shorten drug development time.

Methods and materials

The problem of binding affinity assessment, which often forms the first line of high-throughput screening, motivated us to undertake this study. Looking for the most effective algorithms for protein-ligand interaction prediction, we reviewed various approaches [1, 2, 3, 4, 5, 6, 7]. In particular, several studies apply ML techniques to the affinity prediction problem, with methods ranging from support vector machines, decision trees, and random forests to deep neural networks. The list is by no means complete; it rather references a few representative papers in the field. Direct comparison is difficult because they use different metrics, affinity characteristics, and datasets. Early studies started with only hundreds of affinity samples, while more mature ones advanced with the growing availability of data. Notably, all of them discuss the problem of efficiently representing the interacting molecules, which we also address in our study. Here we consider the regression problem of predicting the affinity expressed as the inhibition constant Kᵢ. The machine learning approaches of choice are: CatBoost Regressor [8], a Graph Neural Network (GNN), and Bidirectional Encoder Representations from Transformers (BERT) [9]. Below we compare the implementation details and discuss their effectiveness for high-throughput screening.

Dataset and feature engineering

We compiled our master dataset from four publicly available databases providing inhibition constants Kᵢ: BindingDB [10], PDBbind [11], Binding MOAD [12], and BioLiP [13]. After joining the datasets, the samples were inspected for unphysically extreme concentrations and duplicates. The resulting master dataset contains ≈350,000 samples. An inhibition constant is the ligand concentration required to produce half-maximum inhibition; it is expressed in nanomolar (nM) units. Since concentration measurement (and/or prediction) errors are roughly proportional to the concentration itself, it is convenient to operate with the decimal logarithm of the inhibition constant, log₁₀(Kᵢ). This conversion helps balance the error contributions across different concentration ranges.
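
For illustration, here is a minimal preprocessing sketch with pandas/NumPy; the column names (fasta, smiles, Ki_nM) and the concentration bounds are our placeholders, not the exact schemas of the four databases:

```python
import numpy as np
import pandas as pd

def build_master_dataset(frames, ki_col="Ki_nM"):
    """Merge source tables, drop duplicate receptor-ligand pairs and
    unphysical Ki values, and add the log10(Ki) regression target."""
    df = pd.concat(frames, ignore_index=True)
    df = df.drop_duplicates(subset=["fasta", "smiles"])
    df = df[(df[ki_col] > 1e-3) & (df[ki_col] < 1e9)]  # illustrative bounds
    df["log10_ki"] = np.log10(df[ki_col])              # regression target
    return df
```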

The dataset was split into train, validation, and test subsets in the ratio 80%/10%/10%. The train and validation parts were used to train the ML models and monitor their accuracy during training, whereas the final models were evaluated on the test subset. This unified data split puts all three approaches on an equal footing.
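
A sketch of such a split with scikit-learn (the fixed random seed is our assumption, added for reproducibility):

```python
from sklearn.model_selection import train_test_split

# 80% train, then split the remaining 20% evenly into validation and test.
train_df, rest_df = train_test_split(df, test_size=0.20, random_state=42)
val_df, test_df = train_test_split(rest_df, test_size=0.50, random_state=42)
```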

The methods we used describe receptors and ligands with various descriptors and fingerprints. These features mostly capture structural characteristics: hydrogen-bond donors/acceptors, aromaticity/aliphaticity, polarity, flexibility, etc. The most stable ligand-receptor pairs are likely to form via matching binding mechanisms: hydrophobic groups to hydrophobic groups, negative charges to positive charges, H-bond donors to H-bond acceptors, and so on. Binding is likely to succeed if:

a) steric dimensions of a ligand comply with steric dimensions of an active site on the receptor;

b) the resulting conformations of receptor and ligand expose interacting fragments toward the contact interface;

c) the ligand-binding mechanism complies with the mechanism of the active site on the receptor.

According to Fischer's postulate, dating back to 1894, binding can be explained with the lock-and-key analogy: only certain ligands fit certain receptors due to their compositional constraints [14]. Thus, we train the ML models to recognize matching mechanisms and high-affinity pairs of receptors and ligands.

CatBoost

The first technique of choice is CatBoost, a gradient-boosting algorithm over decision trees. Within this approach, a receptor-ligand pair is represented as a single input. For each receptor we generate a set of descriptors that includes amino acid charges, volume, solvent-accessible surface area, average buried area, hydrophobicity, isoelectric point, polarity, refractivity, aliphatic index, numbers of H-bond donors and acceptors, number of disulfides, chain flexibility, folding index, etc. The ligands are represented by standard ECFP4 fingerprints, which are widely used for molecular characterization, similarity searching, and structure-activity modeling. All descriptors for a receptor-ligand pair are then combined into one vector and regressed towards the corresponding log₁₀(Kᵢ) value.
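
As a sketch, the featurization and regression could look as follows with RDKit and CatBoost; receptor_descriptors stands in for the custom protein featurization described above, and the hyperparameters are illustrative:

```python
import numpy as np
from catboost import CatBoostRegressor
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def ecfp4(smiles, n_bits=2048):
    """ECFP4 corresponds to a Morgan fingerprint with radius 2."""
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=n_bits)
    arr = np.zeros((n_bits,), dtype=np.float32)
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr

# One concatenated feature vector per receptor-ligand pair.
X_train = np.stack([np.concatenate([receptor_descriptors(f), ecfp4(s)])
                    for f, s in zip(train_fastas, train_smiles)])
model = CatBoostRegressor(loss_function="MAE", iterations=5000, verbose=500)
model.fit(X_train, y_train)  # y_train holds log10(Ki) values
```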

GNN

Graph Neural Networks have recently become one of the "hot" topics in machine learning and have shown promising results on graph-like systems [15, 16, 17, 18]. Given the graph-like structure of molecules, we decided to apply this approach to binding affinity prediction. Each receptor-ligand pair is represented as a pair of graphs. The ligand graph has atoms as nodes and intramolecular bonds as edges. The receptor graph contains nodes only (amino acids along the protein chain): since all peptide bonds uniting the amino acids are identical, edge featurization adds no information. The ligand graphs were crafted with the Weave featurizer that comes with the DGL-LifeSci package (https://lifesci.dgl.ai/), whereas for the receptor graphs we developed a custom featurizer. These featurizers produce descriptors including charge, flexibility, number of H-bond donors and acceptors, hydrophobicity, solvent-accessible area, molecular weight, etc.
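
For instance, a ligand graph can be built with DGL-LifeSci's Weave featurizers (the aspirin SMILES is just an example; the receptor graph would instead use our custom amino-acid featurizer):

```python
from dgllife.utils import (smiles_to_bigraph, WeaveAtomFeaturizer,
                           WeaveEdgeFeaturizer)

# Atoms become nodes, bonds become edges, each with Weave-style features.
ligand_graph = smiles_to_bigraph(
    "CC(=O)Oc1ccccc1C(=O)O",           # aspirin, as an example
    node_featurizer=WeaveAtomFeaturizer(),
    edge_featurizer=WeaveEdgeFeaturizer(),
)
```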

The two graphs are fed into separate "arms". The receptor arm is based on the Graph Attention network [19] from the DGL-LifeSci package. In the ligand arm we utilized the AttentiveFPGNN implementation of the Attentive FP network [20]. The outputs of the arms are concatenated and regressed with a multilayer perceptron. The whole network is trained end-to-end to predict binding affinity values.
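
A simplified sketch of this two-arm architecture in PyTorch/DGL-LifeSci; the feature sizes, hidden dimension, and mean readout for the receptor arm are placeholders, not our exact settings:

```python
import dgl
import torch
import torch.nn as nn
from dgllife.model.gnn import GAT, AttentiveFPGNN
from dgllife.model.readout import AttentiveFPReadout

class TwoArmAffinityNet(nn.Module):
    def __init__(self, rec_in=32, lig_node=75, lig_edge=12, hidden=128):
        super().__init__()
        # Receptor arm: graph attention over amino-acid nodes.
        self.receptor_gnn = GAT(in_feats=rec_in, hidden_feats=[hidden, hidden])
        # Ligand arm: Attentive FP over atom/bond features.
        self.ligand_gnn = AttentiveFPGNN(node_feat_size=lig_node,
                                         edge_feat_size=lig_edge,
                                         graph_feat_size=hidden)
        self.ligand_readout = AttentiveFPReadout(feat_size=hidden)
        # MLP regressor on the concatenated graph-level vectors.
        self.head = nn.Sequential(nn.Linear(2 * hidden, hidden),
                                  nn.ReLU(),
                                  nn.Linear(hidden, 1))

    def forward(self, g_rec, rec_feats, g_lig, lig_feats, lig_efeats):
        g_rec.ndata["h"] = self.receptor_gnn(g_rec, rec_feats)
        rec_vec = dgl.mean_nodes(g_rec, "h")          # receptor graph vector
        node_h = self.ligand_gnn(g_lig, lig_feats, lig_efeats)
        lig_vec = self.ligand_readout(g_lig, node_h)  # ligand graph vector
        return self.head(torch.cat([rec_vec, lig_vec], dim=-1)).squeeze(-1)
```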

BERT

The third approach is BERT: Bidirectional Encoder Representations from Transformers. We chose a Transformer model because it incorporates the self-attention mechanism, which might prove useful by focusing on the molecular fragments involved in binding.

It is important to mention that BERT is a natural language processing model, mostly used to capture dependencies in sentences. To the best of our knowledge, our application of BERT is one of the first attempts to utilize natural-language-processing Transformer-like models as sole predictors of binding affinity. The distinctive feature of this implementation is that it works with string representations (in our case, SMILES and FASTA sequences) and thus does not explicitly take the chemical properties of molecules into account.

Similar to the GNN scheme, the BERT model consists of two separate arms (for ligands and receptors), whose outputs are then united and regressed with a multilayer perceptron. The Transformer in the ligand arm takes a SMILES string and embeds it by summing the token embeddings and positional embeddings into a vector of length 128. The outputs are then fed to the next layers. The same algorithm, using attention masks and feed-forward neural networks, is applied within the FASTA arm: a receptor is embedded by summing the token embeddings and positional embeddings into a vector of length 512. The dimension is larger than in the ligand arm because proteins are larger and much more complex structures than small organic ligands.

The Transformers in both arms consist of 6 layers with 12 heads each, constituting the attention matrices. The purpose of such multi-heading is to let the model concentrate on different parts of a molecule while predicting the binding affinity value.
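
A minimal sketch of one such arm in PyTorch, assuming mean pooling over tokens; note that PyTorch's attention requires the embedding dimension to be divisible by the head count, so the sketch uses 8 heads rather than 12:

```python
import torch
import torch.nn as nn

class SequenceArm(nn.Module):
    """BERT-style arm: summed token + positional embeddings, then a
    6-layer Transformer encoder (d_model=128 for SMILES, 512 for FASTA)."""
    def __init__(self, vocab_size, d_model=128, max_len=512, n_layers=6):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, token_ids, pad_mask=None):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        x = self.tok(token_ids) + self.pos(positions)       # summed embeddings
        h = self.encoder(x, src_key_padding_mask=pad_mask)  # self-attention stack
        return h.mean(dim=1)                                # pooled representation
```

The two pooled vectors (128-d for the ligand, 512-d for the receptor) are then concatenated and passed to the regression multilayer perceptron described above.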

Attention visualization

We visualized the attention layers of the BERT model for the ligand arm, expecting them to highlight the fragments responsible for binding in both ligands and receptors. As mentioned above, the attention within the BERT architecture is represented by 6 layers with 12 weight matrices each, responsible for highlighting various regions of a molecule. We tried two ways of transforming a matrix to a vector in order to map the weights onto the string representation: the first takes only the first row of the attention matrix, the other averages over its columns.
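
In code, the two reductions amount to the following (a hypothetical helper over a single seq_len × seq_len attention matrix):

```python
import torch

def attention_to_vector(attn: torch.Tensor, mode: str = "first_row"):
    """Collapse a (seq_len, seq_len) attention matrix to per-token weights."""
    if mode == "first_row":
        return attn[0]         # attention paid by the first token to every token
    return attn.mean(dim=0)    # column-wise average: attention each token receives
```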

However, with both methods the attention was spread across the whole molecule instead of focusing on particular sites. We illustrate this with one of the examples in the figure below.

Left: a snapshot of human thrombin (orange surface) and a ligand (blue sticks) with experimentally determined binding atoms (green ovals). Right: a 2D projection of the ligand with binding atoms highlighted by attention. Green and red spots mark atoms and bonds with non-zero attention weights <5% and ≥5%, respectively.

The left plot shows an active site of human thrombin (the receptor, shown as an orange surface) and a bound ligand (blue sticks), as taken from the Protein Data Bank (http://www.rcsb.org/); the green ovals tag the actually interacting groups. The right plot shows a planar projection of the ligand: again, green ovals mark the interacting groups, while the red and green spots indicate molecular fragments that the attention deems responsible for binding. One can see that the attention spreads beyond the interacting groups, which might indicate that binding actually depends on the full topology of the structure rather than on particular sites.

Results and discussion

Here we discuss the results achieved on the test part of the dataset. The best Mean Absolute Error (MAE) scores (in log₁₀(Kᵢ) units) are: CatBoost, 0.60; GNN, 0.64; BERT, 0.51. For easier interpretation, an MAE of ±0.6 in log₁₀(Kᵢ) corresponds to a factor of about 4 (as a multiplier or divider, since 10^0.6 ≈ 4) on the bare Kᵢ value. It is worth noting that our MAE values are lower than the 0.71–1.03 reported in study [6].

Below we present the bin-averaged MAE as a function of the actual log₁₀(Kᵢ) for the three methods. One can see high MAE peaks at low and high log₁₀(Kᵢ), with a broad well in between. We attribute these peaks to the low number of data points there: the ML models lack the examples needed to properly learn the extreme cases.

Mean absolute errors as a function of actual inhibition constant log₁₀(Kᵢ)
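
The curve itself can be reproduced with a short NumPy routine (the bin count is arbitrary):

```python
import numpy as np

def bin_averaged_mae(y_true, y_pred, n_bins=30):
    """Average the absolute error within equal-width bins of actual log10(Ki)."""
    edges = np.linspace(y_true.min(), y_true.max(), n_bins + 1)
    idx = np.digitize(y_true, edges[1:-1])   # bin index of every sample
    abs_err = np.abs(y_true - y_pred)
    return np.array([abs_err[idx == b].mean() if np.any(idx == b) else np.nan
                     for b in range(n_bins)])
```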

In the case of a binary "inhibitor vs. non-inhibitor" classification (log₁₀(Kᵢ) below or above 4), even high MAE values at low log₁₀(Kᵢ) (the left band of peaks in the figure above) still leave a fair chance that a compound will be correctly classified as an inhibitor. In particular, the inhibitor vs. non-inhibitor classification yields an accuracy of 94% for CatBoost and BERT, while the GNN provides 93%. These values are comparable with the 95% accuracy reported by Li et al. [1] using Bayesian additive regression trees.
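
Deriving the classification from the regression output is a one-liner (the threshold log₁₀(Kᵢ) = 4 comes from the definition above):

```python
import numpy as np

def inhibitor_accuracy(y_true, y_pred, thr=4.0):
    """Accuracy of the inhibitor vs. non-inhibitor split at log10(Ki) = thr."""
    return np.mean((y_true < thr) == (y_pred < thr))
```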

To sum up the outcome of the study, one should first note how lightweight and fast the proposed ML approaches are. Despite the apparently high accuracy of the inhibitor vs. non-inhibitor classification, the regression towards the actual inhibition constants is of greater importance. We therefore believe that the evaluated methods, if accompanied by reliable data, can be used in high-throughput screening.

References

[1] L. Li, C.C. Koh, D. Reker, J.B. Brown, H. Wang, N.K. Lee, H. Liow, H. Dai, H.-M. Fan, L. Chen, D.-Q. Wei. Predicting protein-ligand interactions based on bow-pharmacological space and bayesian additive regression trees. Scientific Reports, 9:7703, 2019.

[2] R.D. King, J.D. Hirst, M.J.E. Sternberg. Comparison of artificial intelligence methods for modeling pharmaceutical qsars. Applied Artificial Intelligence, 9(2):213–233, 1995.

[3] R.N. Jorissen, M.K. Gilson. Virtual screening of molecular databases using a support vector machine. Journal of Chemical Information and Modeling, 45(3):549–561, 2005.

[4] K. Yugandhar, M.M. Gromiha. Feature selection and classification of protein–protein complexes based on their binding affinities using machine learning approaches. Proteins: Structure, Function, and Bioinformatics, 82(9):2088–2096, 2014.

[5] H. Öztürk, A. Özgür, E. Ozkirimli. DeepDTA: deep drug–target binding affinity prediction. Bioinformatics, 34(17):i821–i829, 2018.

[6] Y. Li, M.A. Rezaei, C. Li, X. Li. Deepatom: A framework for protein-ligand binding affinity prediction. In 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 303–310, 2019.

[7] Y. Kwon, W.-H. Shin, J. Ko, J. Lee. AK-Score: Accurate protein-ligand binding affinity prediction using the ensemble of 3D-convolutional neural network. ChemRxiv:12015045, 2020.

[8] A.V. Dorogush, V. Ershov, A. Gulin. CatBoost: gradient boosting with categorical features support. arXiv:1810.11363, 2018.

[9] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805, 2018.

[10] X. Chen, M. Liu, M.K. Gilson. BindingDB: A web-accessible molecular recognition database. Combinatorial Chemistry and High Throughput Screening, 4(8):719–725, 2001.

[11] R. Wang, X. Fang, Y. Lu, S. Wang. The PDBbind Database: Collection of binding affinities for protein−ligand complexes with known three-dimensional structures. Journal of Medicinal Chemistry, 47(12):2977–2980, 2004.

[12] L. Hu, M.L. Benson, R.D. Smith, M.G. Lerner, H.A. Carlson. Binding MOAD (Mother Of All Databases). Proteins: Structure, Function, and Bioinformatics, 60(3):333–340, 2005.

[13] J. Yang, A. Roy, Y. Zhang. BioLiP: a semi-manually curated database for biologically relevant ligand–protein interactions. Nucleic Acids Research, 41:D1096–D1103, 2012.

[14] E. Fischer. Einfluss der Configuration auf die Wirkung der Enzyme. Berichte der deutschen chemischen Gesellschaft, 27:2985–2993, 1894.

[15] M. Zitnik, M. Agrawal, J. Leskovec. Modeling polypharmacy side effects with graph convolutional networks. arXiv:1802.00543, 2018.

[16] N. Park, A. Kan, X.L. Dong, T. Zhao, C. Faloutsos. Estimating node importance in knowledge graphs using graph neural networks. arXiv:1905.08865, 2019.

[17] R. Kojima, S. Ishida, M. Ohta, H. Iwata, T. Honma, Y. Okuno. kGCN: a graph-based deep learning framework for chemical structures. Journal of Cheminformatics, 12, 32, 2020.

[18] C.W. Coley, W. Jin, L. Rogers, T.F. Jamison, T.S. Jaakkola, W.H. Green, R. Barzilay, K.F. Jensen. A graph-convolutional neural network model for the prediction of chemical reactivity. Chemical Science, 10:370–377, 2019.

[19] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, Y. Bengio. Graph attention networks. arXiv:1710.10903, 2017.

[20] Z. Xiong, D. Wang, X. Liu, F. Zhong, X. Wan, X. Li, Z. Li, X. Luo, K. Chen, H. Jiang, M. Zheng. Pushing the boundaries of molecular representation for drug discovery with the graph attention mechanism. Journal of Medicinal Chemistry, 63(16):8749–8760, 2020.
