Predicting water solubility from SMILES

Davide Luise
5 min read · Oct 7, 2022


In this article, we will use Machine Learning to predict the water solubility of molecules, an important property in the field of drug discovery. This article is inspired by the work of John S. Delaney¹. The main idea is to fit a supervised model to a molecular structure dataset in order to predict water solubility purely from the SMILES representation of molecules. To do that, the rdkit cheminformatics package is used to create molecular descriptors from SMILES, and the mol2vec² approach is exploited to generate vector representations of molecules. To deal with the resulting high dimensionality, the PCA technique is applied.

The field of artificial intelligence is nowadays continuously expanding into other disciplines. Having a Ph.D. in Computational Chemistry and a strong interest in Data Science, I have constantly been looking for a nice project where both my competencies could be harnessed. Wandering the internet, I came across a small dataset of 1144 molecules represented as SMILES together with their aqueous solubility. Hence, I decided to fit a Linear Regression model to a dataset composed of molecular features generated from SMILES in order to predict water solubility. The features have been created in two different ways:

  1. Deducing molecular properties from SMILES by means of the rdkit package
  2. Applying the mol2vec method, which is an unsupervised machine learning approach used to obtain high dimensional embeddings of molecular substructures

The features created with these two approaches are ultimately joined in a single dataset to maximize the r² score of the regression.

SMILES stands for “simplified molecular-input line-entry system”; it is a string representation used for encoding molecular structures. SMILES strings can be easily converted into two- and three-dimensional models of the molecules by most molecule editors.

Before going into the details, we need to install the necessary packages and download the dataset. Here is how it's done on Google Colab:
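A minimal setup sketch (package names are the PyPI ones; the dataset URL is not given in the article, so the placeholder below must be replaced with the actual location of the 1144-molecule CSV):

```shell
# Install the cheminformatics, embedding, and modeling packages
pip install rdkit mol2vec gensim scikit-learn pandas

# Download the solubility dataset (replace <dataset-url> with the real location)
wget -O delaney.csv "<dataset-url>"
```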

Let’s take a look at the dataset:
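A quick look can be taken with pandas. The column names and the sample values below are my own stand-ins (the original file stores a SMILES string plus the measured log-solubility); in the article the full CSV would be read instead:

```python
import io
import pandas as pd

# In the article: df = pd.read_csv('delaney.csv')
# Here, a tiny inline sample with assumed column names and illustrative values:
csv = io.StringIO(
    "SMILES,Target\n"
    "CCO,-0.77\n"
    "c1ccccc1,-1.64\n"
    "CC(=O)O,0.56\n"
)
df = pd.read_csv(csv)

print(df.shape)   # rows x columns
print(df.head())  # first molecules of the dataset
```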

Molecular Descriptors

We will use the rdkit package to get structural information about the molecules and molecular properties such as molecular weight, logP, topological polar surface area, and the number of hydrogen bond donors and acceptors, purely on the basis of the SMILES string of a molecule. To do that, we first have to turn the SMILES string into a MOL object.

df['MOL'] = df.SMILES.apply(lambda x: Chem.MolFromSmiles(x))

Let’s visualize some molecules:

molecules = df['MOL'][:10]
Draw.MolsToGridImage(molecules, molsPerRow=5, legends=list(df['SMILES'][:10].values))

Normally molecules are stored in RDKit with the hydrogen atoms implicit. Hence we need to add explicit hydrogen atoms to each molecule. After that, by exploiting the rdkit library, we can add new features describing the structure of the molecules present in the dataset.
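A sketch of this step on a small sample frame (the feature column names are my own choices, not necessarily the ones used in the article):

```python
import pandas as pd
from rdkit import Chem

df = pd.DataFrame({'SMILES': ['CCO', 'c1ccccc1']})
df['MOL'] = df['SMILES'].apply(Chem.MolFromSmiles)

# Hydrogens are implicit by default; make them explicit
df['MOL'] = df['MOL'].apply(Chem.AddHs)

# Structural features derived from the molecular graph
df['num_of_atoms'] = df['MOL'].apply(lambda m: m.GetNumAtoms())
df['num_of_heavy_atoms'] = df['MOL'].apply(lambda m: m.GetNumHeavyAtoms())
```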

Let’s just check if the number of assigned aromatic rings is correct for the last 10 molecules of the dataset:
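One way to sanity-check the aromatic ring counts is to compute them for molecules whose ring count is known (a sketch; in the article this check is done on the last 10 molecules of the dataset):

```python
from rdkit import Chem
from rdkit.Chem import rdMolDescriptors

# Benzene has one aromatic ring, naphthalene two
smiles_list = ['c1ccccc1', 'c1ccc2ccccc2c1']
counts = [rdMolDescriptors.CalcNumAromaticRings(Chem.MolFromSmiles(s))
          for s in smiles_list]
print(counts)
```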

Let’s define some other descriptors which will be used as features to train the model:
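The standard rdkit descriptors mentioned earlier (molecular weight, logP, topological polar surface area, hydrogen bond donors/acceptors) can be computed like this — a sketch on a small sample frame, with my own column names:

```python
import pandas as pd
from rdkit import Chem
from rdkit.Chem import Descriptors

df = pd.DataFrame({'SMILES': ['CCO', 'c1ccccc1']})
df['MOL'] = df['SMILES'].apply(Chem.MolFromSmiles)

df['MolWt'] = df['MOL'].apply(Descriptors.MolWt)         # molecular weight
df['LogP'] = df['MOL'].apply(Descriptors.MolLogP)        # octanol-water partition coeff.
df['TPSA'] = df['MOL'].apply(Descriptors.TPSA)           # topological polar surface area
df['HBD'] = df['MOL'].apply(Descriptors.NumHDonors)      # H-bond donors
df['HBA'] = df['MOL'].apply(Descriptors.NumHAcceptors)   # H-bond acceptors
```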

Here is what our dataframe looks like:

Modeling

Baseline Model

Let’s apply a Linear Regression model to set a baseline score:
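A minimal sketch of the baseline fit, wrapped in a helper so the same evaluation can be reused for the later feature sets (the split ratio, seed, and synthetic demo data are my own choices, not the article's):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

def fit_baseline(X, y, test_size=0.2, seed=42):
    """Fit a plain Linear Regression and return (train_r2, test_r2)."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed)
    model = LinearRegression().fit(X_train, y_train)
    return (r2_score(y_train, model.predict(X_train)),
            r2_score(y_test, model.predict(X_test)))

# Demo on synthetic linear data (stand-in for the descriptor matrix)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=200)
train_r2, test_r2 = fit_baseline(X, y)
```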

We can see a fairly random, uniform distribution of the residuals around 0 against the target, indicating that the linear regression model is appropriate for the data.

Vector encoding of molecules: Mol2Vec

Another way to fit a regression model on a molecular dataset made just of SMILES is to convert the SMILES into vector representations. Similarly to Word2vec models, where vectors of closely related words lie in close proximity in the vector space, Mol2vec learns vector representations of molecular substructures that point in similar directions for chemically related substructures.

Again we will use the MOL representation obtained from the SMILES of each molecule as a starting point:

mol_vec_df = df[['MOL','Target']]

Let’s load a mol2vec model which has been pre-trained on 20 million compounds.

Let’s create molecular embeddings by generating molecular sentences with mol2alt_sentence() and then vectorizing them with sentences2vec():
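In code this relies on `mol2alt_sentence` and `sentences2vec` from `mol2vec.features` together with the pre-trained gensim model. Since the model file isn't reproducible here, the sketch below illustrates the core idea with a toy embedding table: a molecular "sentence" is a list of substructure identifiers, and its vector is the sum of the substructure vectors (the real model maps Morgan substructure identifiers to 300-dimensional vectors; unknown substructures are skipped):

```python
import numpy as np

# Toy embedding table: substructure id -> 4-dim vector (300-dim in the real model)
emb = {
    'a': np.array([1.0, 0.0, 0.0, 0.0]),
    'b': np.array([0.0, 1.0, 0.0, 0.0]),
    'c': np.array([0.0, 0.0, 1.0, 1.0]),
}

def sentence_to_vec(sentence, emb):
    """Sum the embedding vectors of the known substructures in a molecular sentence."""
    vecs = [emb[s] for s in sentence if s in emb]
    dim = len(next(iter(emb.values())))
    return np.sum(vecs, axis=0) if vecs else np.zeros(dim)

# Two toy "molecular sentences"
X_vec = np.vstack([sentence_to_vec(['a', 'b'], emb),
                   sentence_to_vec(['b', 'c', 'unknown'], emb)])
```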

The resulting X_vec has a shape of (1144, 300), meaning that each molecule is represented by a 300-dimensional vector. This high dimensionality could cause some problems for such a small dataset. We will deal with it afterwards.

Let’s fit the Linear Regression model to the dataset obtained with the mol2vec approach and see the results:

In this case our model is clearly overfitting the dataset, providing an r² score on the test set of about 0.8 and an r² score on the training set of about 0.9. This is probably due to the high feature-to-sample ratio. Let’s try to reduce it by applying dimensionality reduction techniques.
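A sketch of the reduction with scikit-learn's PCA, keeping 50 components as in the article (the random matrix below is a stand-in for the real mol2vec matrix, so its explained-variance figure will differ from the article's):

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for the (1144, 300) mol2vec feature matrix
rng = np.random.default_rng(0)
X_vec = rng.normal(size=(1144, 300))

pca = PCA(n_components=50)
X_pca = pca.fit_transform(X_vec)

print(X_pca.shape)                            # (1144, 50)
print(np.sum(pca.explained_variance_ratio_))  # fraction of variance retained
```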

np.sum(pca.explained_variance_ratio_)

The total explained variance obtained by using 50 features instead of 300 is still very high (0.9958), meaning that we are keeping almost all the information encoded in the dataset.

Let’s again fit the model

By using PCA we solved the overfitting problem and also improved the r² score on the test set by about 8% (which is huge).

Mol2Vec + Molecular descriptors

The last step is to concatenate the descriptor features created in the first part using the rdkit package with the PCA components we just obtained, and fit the Linear Regression model.
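A sketch of the concatenation step (the matrices below are random stand-ins for the real rdkit descriptor columns and the 50 PCA components, and `X_full` is my own variable name):

```python
import numpy as np

# Stand-ins: 1144 molecules, a handful of rdkit descriptors, 50 PCA components
rng = np.random.default_rng(0)
desc = rng.normal(size=(1144, 6))     # rdkit descriptor features
X_pca = rng.normal(size=(1144, 50))   # PCA-reduced mol2vec features

# Join the two feature sets column-wise before fitting the regression
X_full = np.hstack([desc, X_pca])
```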

By joining the two methodologies we finally obtained a score of 89%, an improvement of 3% compared to the baseline model.

References

  1. John S. Delaney. ESOL: Estimating Aqueous Solubility Directly from Molecular Structure. J. Chem. Inf. Comput. Sci. 2004, 44, 3, 1000–1005
  2. S. Jaeger, S. Fulle, S. Turk. Mol2vec: Unsupervised Machine Learning Approach with Chemical Intuition. J. Chem. Inf. Model. 2018, 58, 1, 27–35


Davide Luise

Data Scientist with a Ph.D. in Computational Chemistry. Reach out on linkedin: https://www.linkedin.com/in/davideluise/