Analysis of ChEMBL COVID-19 Dataset using basic ML Techniques

Nerd For Tech
Published in
10 min readDec 30, 2020


This article aims to build conventional machine learning models in python languages, such as Random Forest and Linear Regression, to predict Bioactivity values of a molecule from the ChEMBL dataset.

What you will learn

  • How to Use ChEMBL Python API for Data collection
  • Practical Application of Lipinski's descriptors, PaDEL.
  • Using rdkit python library

Installing the required libraries

pip install pandas
pip install np
pip install matplotlib
pip install sklearn
pip install chembl_webresource_client
pip install seaborn
##Now in order to install rkdit use
conda install -c conda-forge rdkit

In case you face any error, please visit this link.

Also, note that you must have python version 2.7 - 3.6; otherwise, rkdit won’t work.

Importing libraries

a and for One-Hot Encoding
import numpy as np # numpy is used to calculate the mean and standard deviation
import matplotlib.pyplot as plt # matplotlib is for drawing graphs
import matplotlib.colors as colors
from sklearn.model_selection import train_test_split # split data into training and testing sets
from sklearn.preprocessing import scale # scale and center data
from sklearn.svm import SVC # this will make a support vector machine for classificaiton
from sklearn.model_selection import GridSearchCV # this will do cross validation
from sklearn.metrics import confusion_matrix # this creates a confusion matrix
from sklearn.metrics import plot_confusion_matrix # draws a confusion matrix
from sklearn.decomposition import PCA # to perform PCA to plot the data
from rdkit import Chem
from chembl_webresource_client.new_client import new_client
from chembl_webresource_client import *
from collections import Counter
from operator import itemgetter
from chembl_webresource_client.new_client import new_client
from rdkit import Chem
from rdkit.Chem import Descriptors, Lipinski
import seaborn as sns
from scipy.stats import skew
import pandas_profiling
%matplotlib inline
from sklearn.ensemble import RandomForestRegressor
# data visualization
import seaborn as sns
%matplotlib inline
from matplotlib import pyplot as plt
from matplotlib import style

# Algorithms
from sklearn import linear_model
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.naive_bayes import GaussianNB

Data Collection

In this section, I have explained how to collect the relevant data for your model from the ChEMBL database using the ChEMBL API, i.e., chembl_webresource_client

First, you have to download a sample dataset. Let me talk a bit about this dataset. It is downloaded from the ChEMBL official website. It is a dataset of 6900 molecules related to the COVID-19 with their intrinsic properties like ‘Name,’’ Synonyms,’’ Type,’’ Max Phase,’’#RO5 Violations’,’#Rotatable Bonds,’’ CX ApKa,’’ CX BpKa,’’ Structure Type,’’ Inorganic Flag,’’#RO5 Violations (Lipinski)’,’ Molecular Weight (Monoisotopic),’’ Molecular Species,’’ Molecular Formula,’’ Passes Ro3',’ Molecular Weight,’’ Targets,’’ Bioactivities,’’ QED Weighted,’’ CX LogP,’’ CX LogD,’’ Aromatic Rings,’’ Heavy Atoms,’’ HBA Lipinski,’’ HBD Lipinski.’

DON’T PANIC only a few columns are of interest, i.e., “Smiles,” “ChEMBL ID,” ‘AlogP,’’ HBD,’ ‘HBA,’ ‘PSA,’ ‘MW’…

You can play with the dataset using the following code; it converts the smiles to their structural representation. This part is optional.

df = pd.read_csv('covid\covid.csv',sep=';')

m =Chem.MolFromSmiles(df['Smiles'][1])
Fig 1
m =Chem.MolFromSmiles(df['Smiles'][2098])
Fig 2
m =Chem.MolFromSmiles(df['Smiles'][34])
Fig 3
m =Chem.MolFromSmiles(df['Smiles'][61]) 
Fig 4
m =Chem.MolFromSmiles(df['Smiles'][60])
Fig 5

Our target variable is ‘bioactivity,’ and there is a column of this in the dataset, but it actually is not the actual bioactivity value. Rather it is just the frequency of the number of bioactivities a molecule has. SO the tricky part here is HOW to get the correct bioactivity values?

There are a few types of bioactivity, but we are interested in the standard type “ic50” with standard unit “nM.

The following function uses the API of the chembl to retrieve the bioactivity of a molecule with the desired standard_type and standard_units. Here for simplicity, I have hard-coded the value of standard_units to 'nM.'
This function takes a considerable amount of time for fetching the value from the internet.

chembl_id : chemble id of the molecule
standard_type : IC50, ec50 etc

A (chembl_id, Bioactivity of the molecule)-tuple
def get_standard_value(chembl_id, standard_type):
records = new_client.activity.filter(molecule_chembl_id=chembl_id).filter(standard_units='nM').filter(standard_type=standard_type)
df = pd.DataFrame.from_dict(records)
if len(df)>0:
return (chembl_id,df['standard_value'][0])

The function declared above is called from the following code. Here we have a loop on the dataframe’s column containing the chembl ids. The above function is called for each id, and the above function checks if the asked value is present for that chembl_id. If, yes it returns the tuple of chembl_id and bioactivity value.

x_list = list()
for chem_id in df['ChEMBL ID']:
if f!= None:
print(f"{i}/{len(df['ChEMBL ID'])}")

In the above loop, we take some time, so I have already saved my file here.

Now we have to check wheater the compound fetched just now, Are they and ACTIVE or not. The easiest way to find out is to check if the standard value that we fetched earlier is less than 1000 or not. If it is more than 1000, then the molecule is of no use to us.

for num in sd['standard_value']:
if num<=1000:
print(f"Number of the active class is {i}")
>>>Number of the active class is 502

Feature Selection

The column containing CHembl ID, Alogp, HBD, HBA, Smiles are the columns are of our interest. Therefore we will drop all other columns.

new_df = df.drop(['Name','Synonyms','Type','Max Phase','#RO5 Violations','#Rotatable Bonds','CX ApKa','CX BpKa','Structure Type','Inorganic Flag','#RO5 Violations (Lipinski)','Molecular Weight (Monoisotopic)','Molecular Species','Molecular Formula','Passes Ro3','Molecular Weight','Targets','Bioactivities','QED Weighted','CX LogP','CX LogD','Aromatic Rings','Heavy Atoms','HBA Lipinski','HBD Lipinski'],axis=1).copy()

Following is the resulted dataframe:

Merging the following two CSV file on the ChEMBL ID column:

  • covid.csv
  • bioactivity.csv

Data Preprocessing

The idea for this is to remove that row that has any column containing “None,” “Nan,” or “none.”

Checking for None values in the final_df

Removing None, Nan, none values from HBD columns

print('Unique value in HBD column',final_df['HBD'].unique())>>>Unique value in HBD column ['2' '0' '1' '5' '3' '4' '12' 'None' '9' '6' nan '7' '8' '10' '11']final_df = final_df.replace(['None','NONE','none'],np.nan)
final_df = final_df.dropna()
print('Unique value in HBD column',final_df['HBD'].unique())
>>>Unique value in HBD column ['2' '0' '1' '5' '3' '4' '12' '9' '6' '7' '8' '10' '11']

Now let's check if the Alog column does not have None, Nan, or none values.

print('Unique value in AlogP column',final_df['AlogP'].unique())

Fortunately, there is no empty value for AlogP, so no need to remove the row.

Removing row containing None, Nan, none values in PSA columns

print('Unique value in PSA column',final_df['PSA'].unique())

There is no empty value for PSA, so no need to remove the row.

Calculate Lipinski descriptors

Christopher Lipinski, a scientist at Pfizer, came up with a rule-of-thumb set for evaluating the drug-likeness of compounds. Such drug-likeness is based on the Absorption, Distribution, Metabolism, and Excretion (ADME), also known as the pharmacokinetic profile. Lipinski analyzed all orally active FDA-approved drugs to formulate what is to be known as the Rule-of-Five or Lipinski’s Rule.

Lipinski’s Rule stated the following:

  • Molecular weight < 500 Dalton
  • Octanol-water partition coefficient (LogP) < 5
  • Hydrogen bond donors < 5
  • Hydrogen bond acceptors < 10

Using Lipinski’s Rule to calculate Molecular Weight

We will use Lipinski Rule to get the Molecular weight of only those active compounds that means they have Mol. Wight < 500 dalton

# Inspired by: lipinski(smiles, verbose=False):moldata= []
for elem in smiles:

baseData= np.arange(1,1)
for mol in moldata:

desc_MolWt = Descriptors.MolWt(mol)
#desc_MolLogP = Descriptors.MolLogP(mol)
#desc_NumHDonors = Lipinski.NumHDonors(mol)
#desc_NumHAcceptors = Lipinski.NumHAcceptors(mol)

row = np.array([desc_MolWt,

baseData=np.vstack([baseData, row])

descriptors = pd.DataFrame(data=baseData,columns=columnNames)

return descriptors
df_lipinski = lipinski(final_df.Smiles)

Merging the mol. Weight with our previous data

final_df_combined = pd.concat([final_df, df_lipinski],axis=1)

Removing Row Containing MW as None, Nan ….

final_df_combined = final_df_combined.replace(['None','NONE','none'],np.nan)
final_df_combined = final_df_combined.dropna()
#print('Unique value in MW column',final_df_combined['MW'].unique())

Convert IC50 to pIC50

To allow IC50 data to be more uniformly distributed, we will convert IC50 to the negative logarithmic scale, which is essential -log10(IC50).

This custom function pIC50() will accept a DataFrame as input and will:

  • Take the IC50 values from the standard_value column and converts it from nM to M by multiplying the value by 10−9−9
  • Take the molar value and apply -log10
  • Delete the standard_value column and create a new pIC50 column
import numpy as npdef pIC50(input):
pIC50 = []
for i in input['standard_value_norm']:
molar = i*(10**-9) # Converts nM to M
input['pIC50'] = pIC50
x = input.drop('standard_value_norm', 1)

return x
def norm_value(input):
norm = []
for i in input['standard_value']:
if i > 100000000:
i = 100000000
input['standard_value_norm'] = norm
x = input.drop('standard_value', 1)

return x

We will first apply the norm_value() function to normalize the values in the standard_value column.

df_norm = norm_value(final_df_combined)

Finalizing the dataframe

df_final = pIC50(df_norm)

Notice that the pIC50 column replaces the IC50 column.

Data Visualisation

Relationship between Features and Bioactivity

The plot below shows The relation of ALogP, PSA with pIC50 value.

Note: The first plot is blank due to the error in the seaborn Library. You can check it here.

sns.pairplot(df_final, x_vars=['AlogP','AlogP','PSA'], y_vars='pIC50', height=7, aspect=0.7);

The plot below shows The relation of HBD, HBA, MW with pIC50 value.

ns.pairplot(df_final, x_vars=['AlogP','HBD','HBA','MW'], y_vars='pIC50', height=7, aspect=0.7);

Generating a Detailed Report of the data

The following cell contains the detailed report for the data like the relation between each feature, count, Empty cell, ……

df_final.profile_report(title="Data Report")
profile = pandas_profiling.ProfileReport(df_final)
profile.to_file("Data Report.html")

The above command will create a file HTML file that contains a detailed analysis of the data; You can download that file from here. Please check it once before moving further. Also following is the Phik Correlation


Form all the above graphs; It is too easy to see that there is no clear linear, polynomial relation between pIC50 and other features.

Applying Different Models

Multiple Linear Regression

from sklearn.linear_model import LinearRegression
feature_cols = ['AlogP','HBA','HBD','MW','PSA']
X = df_final[feature_cols]
y = df_final.pIC50
# instantiate and fit
lm1 = LinearRegression(), y)
# print the coefficients
# pair the feature names with the coefficients
list(zip(feature_cols, lm1.coef_))

Following is the output:

list(zip(feature_cols, lm1.coef_))

The meaning of the above results is

bioactivity = (0.25 * AlogP) + (0.106 * HBA) — (0.105 * HBD) + (0.00052 * MW) + (0.0039 * PSA)

But This approach is not much efficient on the train Data set itself, which is somewhat same for our intuition as we saw in the above graphs.

Linear Regression

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 1)lm4 = LinearRegression(), y_train)
lm4_preds = lm4.predict(X_test)
print("RMSE :", np.sqrt(mean_squared_error(y_test, lm4_preds)))
print("R^2: ", r2_score(y_test, lm4_preds))
from yellowbrick.regressor import PredictionError, ResidualsPlot
visualizer = PredictionError(lm4), y_train) # Fit the training data to the visualizer
visualizer.score(X_test, y_test) # Evaluate the model on the test data
visualizer = ResidualsPlot(lm4), y_train)
visualizer.score(X_test, y_test)

This time I tried to split the dataset into train and test datasets and Tried to fit linearly. Even in this approach, the r2 score on the test data set was 0.12, which confirms that a linear model can not fit this data set. Following are some graph related to this approach

Now here comes the final part

PaDEL Descriptors

I have calculated the PaDEL descriptor using and a zip folder. To get the PaDEL Descriptors first, you have to create .smi file containing Smile and CHEMBL_id In this order only. Once you have this file, you need to have and the unzipped padel folder in your working directory. Then Run the terminal. Once completed, this will take some time; you will find a ‘descriptors_output.csv’ in your working directory; it will look something like this.

df_final.to_csv(r'extracted_data.csv', index = False)
df3 = pd.read_csv('extracted_data.csv')

Making a molecule.smi file which has Smiles and Chenmbl_id as two columns

selection = ['Smiles','ChEMBL ID']
df3_selection = df3[selection]
df3_selection.to_csv('molecule.smi', sep='\t', index=False, header=False)
df3_X = pd.read_csv('descriptors_output.csv')
df_final2=pd.merge(df_final, df3_X, on='ChEMBL ID')
df3_X = df_final2.drop(columns=['ChEMBL ID','AlogP','PSA','HBD','HBA','Smiles','MW','pIC50'])
df3_Y = df_final2['pIC50']
dataset3 = pd.concat([df3_X,df3_Y], axis=1)
dataset3 = dataset3.dropna()dataset3.head()

Now I have applied RandomForestRegressor using dataset3, which results in a score of 0.62, which is fairly good.


X_train, X_test, Y_train, Y_test = train_test_split(df3_X, df3_Y, test_size=0i=0
for i in range(10):
model = RandomForestRegressor(n_estimators=100), Y_train)
r2 = model.score(X_test, Y_test)
Y_pred = model.predict(X_test)
print('R2 score of this model is',r2_score(Y_test, Y_pred))
import seaborn as sns
import matplotlib.pyplot as plt
ax = sns.regplot(Y_test, Y_pred, scatter_kws={'alpha':0.4})
ax.set_xlabel('Experimental pIC50', fontsize='large', fontweight='bold')
ax.set_ylabel('Predicted pIC50', fontsize='large', fontweight='bold')
ax.set_xlim(0, 12)
ax.set_ylim(0, 12)
ax.figure.set_size_inches(5, 5)


Random Forest Regressor works fine in comparison to the other methods. But as there is no clear relationship between descriptors and bioactivity, The conventional Methods are not enough to predict bioactivity with high accuracy, we need something more powerful than conventional machine learning method like we need CNN, neural network or GANs to predicts these value more accurately.

You can get the whole code from Github



Nerd For Tech

Backbencher and Not Successful, but reluctant to give up.