Use machine learning to solve problems in chemistry

5 min read · Nov 29, 2019

Drug discovery pipelines, from lead discovery to clinical development, are long and complex. Machine learning (ML) approaches provide a set of tools that can improve discovery and decision making for well-specified questions with abundant, high-quality data [1]. One application area is the exploration of chemical space, where scientists apply ML models to predict molecular properties such as activity and target binding [2]. Applying such computational methods to chemical data for drug discovery is known as chemoinformatics.

I did my Ph.D. in quantum crystallography and am now learning the tools of chemoinformatics, so I decided to show some useful libraries and code snippets for a quantitative structure-activity relationship (QSAR) style prediction task on the dataset published by Martel et al. [3]. The dataset contains 707 observations and 5 attributes, including the SMILES string, ID, experimental log P, and pH value. As this is a preliminary study, we are interested in finding which featurization methods work best for predicting log P. The octanol-water partition coefficient (log P) is one of the most important properties for determining a compound's suitability as a drug. This coefficient strongly affects how easily the drug can reach its intended target in the body, how strong an effect it will have once it reaches its target, and how long it will remain in the body in an active form, which is why medicinal chemists use it for decision making in pre-clinical drug discovery. The raw data can be downloaded from https://ochem.eu/article/27772.

I will focus on the three questions below:

Q1: What are the options for log P prediction: models based on physical descriptors, such as atom type counts or polar surface area, or models based on topological descriptors?

Q2: What does the log P distribution look like? How is it related to drug-likeness?

Q3: How do molecular descriptors affect log P? Can we predict log P from a set of molecular descriptors?

A1. Explore the chemical space and see how the molecules differ from each other

For the above plot, we generated a connectivity fingerprint for each molecule using the Morgan fingerprint and calculated a similarity measure from these fingerprints. The similarity and pairwise-similarity plots suggest that most of the molecules are quite different from the reference molecule (molecule 1, an arbitrary choice). This is a good indication that the dataset is a diverse set of molecules covering a large region of chemical space.
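The similarity measure is not named above. A common choice for bit-vector fingerprints is the Tanimoto coefficient, and the minimal sketch below assumes it. To keep the example free of any chemistry library, each fingerprint is represented as a plain Python set of on-bit indices. The values in `toy_fps` are made up for illustration and are not real Morgan fingerprints, and the function names are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient |A & B| / |A | B| for two sets of on-bit indices."""
    if not fp_a and not fp_b:
        return 1.0  # convention chosen here: two empty fingerprints count as identical
    common = len(fp_a & fp_b)
    return common / (len(fp_a) + len(fp_b) - common)


def similarity_to_reference(fingerprints, ref_index=0):
    """Similarity of every fingerprint to one reference fingerprint."""
    ref = fingerprints[ref_index]
    return [tanimoto(ref, fp) for fp in fingerprints]


# Toy fingerprints (assumed values, not derived from real molecules).
toy_fps = [
    {1, 5, 9, 12},             # reference "molecule 1"
    {1, 5, 9, 40},             # shares three bits with the reference
    {2, 7, 33},                # shares no bits with the reference
    {1, 12, 50, 51, 52, 53},   # shares two bits with the reference
]

sims = similarity_to_reference(toy_fps)
for i, s in enumerate(sims):
    print(f"molecule {i + 1}: similarity to reference = {s:.2f}")
```

Values close to 0 for most molecules would correspond to the diverse dataset described above, while values close to 1 would indicate near-duplicates of the reference.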

Visualize some of the molecules in the dataset

Visualizing 20 random molecules in a 5 × 4 grid:

A2. log P distribution

The log P values in our dataset are approximately normally distributed, with some right skew.
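One way to put a number on "right skewed" is the sample skewness, which is positive when the long tail lies to the right. The sketch below uses only the standard library and a small made-up list of values (`toy_logp` is not the Martel data). It computes the moment-based skewness g1 = m3 / m2^1.5, and the function name is hypothetical.

```python
import statistics


def sample_skewness(values):
    """Moment-based skewness g1 = m3 / m2**1.5 (biased, population-style moments)."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    if m2 == 0:
        return 0.0  # all values identical: no spread, so no skew
    return m3 / m2 ** 1.5


# Made-up log P values with a few large values forming a right tail.
toy_logp = [0.5, 1.0, 1.2, 1.5, 1.8, 2.0, 2.3, 2.6, 4.5, 6.0]

skew = sample_skewness(toy_logp)
print(f"mean   = {statistics.fmean(toy_logp):.2f}")
print(f"median = {statistics.median(toy_logp):.2f}")
print(f"skew   = {skew:.2f}")
```

For a right-skewed sample the mean typically sits above the median, as it does for these toy values. A perfectly symmetric sample gives a skewness of zero.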

Summary statistics describing the location of each quantity, where N is the ID number, logPexp is the experimental log P value, and pH is the scale used to specify how acidic or basic a water-based solution is.

A3. Prepare Data (feature engineering)

Next, we use RDKit to calculate some molecular descriptors. As a baseline, we compare RDKit's calculated MolLogP against the experimental log P.

r2_train       0.919272
r2_test        0.489389
mse_test       0.701924

As the scatter plot above shows, RDKit's log P predictions have a relatively high mean squared error and a weak coefficient of determination (R²) for this dataset. RDKit's MolLogP implementation is based on atomic contributions. Hence, we will first try to train our own simple log P model using the RDKit physical descriptors that we generated above.
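For reference, the two metrics used here can be written out in a few lines. The sketch below follows the usual definitions of mean squared error and R² (the same definitions scikit-learn's `mean_squared_error` and `r2_score` use). The experimental and predicted values are made up so that the arithmetic is easy to check by hand.

```python
def mean_squared_error(y_true, y_pred):
    """Average squared difference between experimental and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    return 1.0 - ss_res / ss_tot


# Made-up values for illustration only.
exp_logp = [1.0, 2.0, 3.0, 4.0]   # "experimental" log P
pred_logp = [1.5, 1.5, 3.5, 3.5]  # "predicted" log P

mse = mean_squared_error(exp_logp, pred_logp)
r2 = r2_score(exp_logp, pred_logp)
print(f"MSE = {mse:.3f}, R^2 = {r2:.3f}")
```

Here every prediction is off by 0.5, so the MSE is 0.25. The residual sum of squares is 1.0 against a total of 5.0, which gives R² = 0.8. An R² near 1 means the predictions explain most of the variance in the experimental values, and it can drop below 0 when a model does worse than predicting the mean.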

Now, using the Random Forest Regressor from scikit-learn, we can apply ML to predict log P values from the calculated molecular fingerprints. We also apply hyperparameter tuning.

Hyperparameter tuning relies more on experimental results than on theory, so the best way to find the optimal settings is to try many different combinations and evaluate the performance of each model. In our case, tuning improves the estimates slightly, but not significantly.
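The "try many combinations" idea can be sketched as a plain grid search. To keep the example dependency-free, a tiny k-nearest-neighbour regressor stands in for the random forest; it is not the model used in this post. The data, parameter names (`k`, `weighting`), and function names are all assumptions made for illustration. Each combination is scored by its mean squared error on a held-out validation set.

```python
from itertools import product


def knn_predict(train_x, train_y, x, k, weighting):
    """Predict y at x from the k nearest training points (single 1-D feature)."""
    neighbours = sorted(zip(train_x, train_y), key=lambda xy: abs(xy[0] - x))[:k]
    if weighting == "uniform":
        return sum(y for _, y in neighbours) / len(neighbours)
    # Inverse-distance weights; the small epsilon avoids division by zero.
    weights = [1.0 / (abs(nx - x) + 1e-9) for nx, _ in neighbours]
    return sum(w * y for w, (_, y) in zip(weights, neighbours)) / sum(weights)


def validation_mse(params, train, valid):
    """Mean squared error on the validation set for one parameter combination."""
    train_x, train_y = train
    valid_x, valid_y = valid
    errors = [(knn_predict(train_x, train_y, x, **params) - y) ** 2
              for x, y in zip(valid_x, valid_y)]
    return sum(errors) / len(errors)


def grid_search(param_grid, train, valid):
    """Score every combination in the grid and return the best one."""
    names = list(param_grid)
    results = []
    for combo in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, combo))
        results.append((validation_mse(params, train, valid), params))
    best_score, best_params = min(results, key=lambda r: r[0])
    return best_params, best_score, results


# Toy data: roughly y = 2x with small fixed perturbations (made up).
train = ([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
         [0.1, 1.9, 4.2, 5.8, 8.1, 9.9, 12.2, 13.8])
valid = ([0.5, 2.5, 4.5, 6.5], [1.0, 5.0, 9.0, 13.0])

param_grid = {"k": [1, 2, 4], "weighting": ["uniform", "distance"]}
best_params, best_score, results = grid_search(param_grid, train, valid)

for score, params in sorted(results, key=lambda r: r[0]):
    print(f"{params} -> validation MSE = {score:.4f}")
print("best:", best_params)
```

With a real model the same loop applies, usually with cross-validation instead of a single validation split. The number of combinations grows multiplicatively with each added hyperparameter, which is why modest improvements like the one reported above can be expensive to find.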

A3. Calculating some more advanced molecular fingerprints

Here, we calculate different physical descriptors as well as structural fingerprints for the molecules using MayaChemTools, and benchmark their performance with three different regression models: neural network, random forest, and support vector machine.

The Topological Pharmacophore Atomic Triplets Fingerprint (TPATF) performed best, even outperforming the simple descriptor model. The default random forest had the best performance of all the regression methods, although this may change after hyperparameter tuning.

The above exercise shows some aspects of machine learning in chemistry. This kind of modeling is classically called quantitative structure-activity relationship (QSAR) modeling and has been in practice for several decades. There are many other use cases of machine learning in physical chemistry, such as designing new materials and new energy sources.

[1] Vamathevan, J., Clark, D., Czodrowski, P. et al. Applications of machine learning in drug discovery and development. Nat Rev Drug Discov 18, 463–477 (2019). doi:10.1038/s41573-019-0024-5
[2] Pirashvili, M., Steinberg, L., Belchi Guillamon, F., Niranjan, M., Frey, J. and Brodzki, J. Improved understanding of aqueous solubility modeling through topological data analysis (2019).
[3] Martel, S. et al. Large chemically diverse dataset of log P measurements for benchmarking studies. European Journal of Pharmaceutical Sciences 48, 21–29 (2013).

Source code that created this post can be found here. I would be pleased to receive feedback or questions on any of the above.

Author: Prashant Kumar, Data Engineer @ Royal Bank of Scotland
