Unlocking the Power of Cheminformatics with RDKit and DeepChem

Code Halwell
3 min read · Feb 10, 2024

Cheminformatics applies computational techniques to chemical data and plays a central role in drug discovery, chemical analysis, and materials science. Two open-source Python libraries, RDKit and DeepChem, cover much of what chemists need: RDKit handles molecule manipulation and property calculation, while DeepChem builds on it with machine learning tooling. This article walks through basic cheminformatics tasks in RDKit and then uses DeepChem to generate molecular features for deep learning.

Getting Started with RDKit

RDKit is an open-source cheminformatics toolkit that provides a wide range of functionality for molecule manipulation, substructure searching, and property calculation.

Installation

RDKit can be easily installed via Conda, ensuring all dependencies are managed:

conda install -c conda-forge rdkit

Basic Operations with RDKit

Creating Molecules

RDKit allows you to create molecules from SMILES strings, a compact way to describe a molecule’s structure:

from rdkit import Chem
molecule = Chem.MolFromSmiles('CC(=O)NC1=CC=C(C=C1)O')

Once the SMILES string has been parsed into an RDKit molecule object, you can calculate properties, run substructure searches, and more.
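
One detail worth knowing: Chem.MolFromSmiles returns None rather than raising an exception when a SMILES string cannot be parsed. The sketch below shows one way to guard against that. The helper name parse_smiles is made up for this illustration and is not part of RDKit.

```python
from rdkit import Chem


def parse_smiles(smiles):
    """Return an RDKit molecule, or None if the SMILES string is invalid.

    `parse_smiles` is a hypothetical helper for this article, not an RDKit function.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        print(f"Could not parse SMILES: {smiles!r}")
    return mol


paracetamol = parse_smiles('CC(=O)NC1=CC=C(C=C1)O')
print(paracetamol.GetNumAtoms())  # counts heavy atoms only; hydrogens are implicit

broken = parse_smiles('C1CC')  # ring bond opened but never closed, so parsing fails
```

Checking for None straight after parsing keeps a bad input from surfacing later as a confusing error inside a descriptor or drawing call.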

Calculating Molecular Properties

Quickly calculate essential properties like molecular weight and LogP:

from rdkit.Chem import Descriptors

mol_weight = Descriptors.MolWt(molecule)
log_p = Descriptors.MolLogP(molecule)
print(f"The molecular weight is {mol_weight:.3f} and the LogP is {log_p:.2f}")

# The molecular weight is 151.165 and the LogP is 1.35

This makes chemical information easy to retrieve. Otherwise you might have to look it up on a web page or enter the structure by hand into a (usually paid) software package.
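
To see what Descriptors.MolWt is adding up, the molecular weight can be reproduced by hand from paracetamol's formula, C8H9NO2. The sketch below uses rounded standard atomic weights typed in for this example, so it approximates what RDKit does internally rather than showing its actual implementation. The names ATOMIC_WEIGHTS and formula_weight are invented for the illustration.

```python
# Rounded standard atomic weights in g/mol (entered by hand for this example)
ATOMIC_WEIGHTS = {'C': 12.011, 'H': 1.008, 'N': 14.007, 'O': 15.999}

# Paracetamol, C8H9NO2, as element -> atom count
paracetamol_formula = {'C': 8, 'H': 9, 'N': 1, 'O': 2}


def formula_weight(formula):
    """Sum atomic weight x atom count over every element in the formula."""
    return sum(ATOMIC_WEIGHTS[element] * count for element, count in formula.items())


weight = formula_weight(paracetamol_formula)
print(f"{weight:.3f}")  # 151.165
```

The result agrees with the value RDKit reported above, which is a quick sanity check that the SMILES string encodes the molecule you intended.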

Visualization

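Render molecules as images directly from your code, which is useful for presentations and data analysis. Draw.MolToImage returns a PIL image that displays inline in a Jupyter notebook; in a script, call its save() method to write it to a file: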

from rdkit.Chem import Draw

Draw.MolToImage(molecule)

Figure: Structure of paracetamol

You can also use additional packages such as py3Dmol to view the compound in 3D and to rotate and inspect it interactively. This works well in Jupyter notebooks.

from rdkit import Chem
from rdkit.Chem import AllChem
import py3Dmol

molecule = Chem.MolFromSmiles('CC(=O)NC1=CC=C(C=C1)O')
molecule = Chem.AddHs(molecule)
AllChem.EmbedMolecule(molecule, AllChem.ETKDG())

mb = Chem.MolToMolBlock(molecule)
view = py3Dmol.view(width=400, height=400)
view.addModel(mb, 'mol')
view.setStyle({'stick': {}})
view.zoomTo()
view.show()

Enhancing Feature Generation with DeepChem

DeepChem extends the capabilities of RDKit by providing advanced feature generation techniques and integration with machine learning for predictive modeling.

Installation of DeepChem

Install DeepChem within a Conda or virtual environment:

pip install deepchem

Generating Molecular Features

DeepChem offers a variety of featurizers, such as graph features and fingerprints, for complex molecular representations:

import deepchem as dc

# Using ConvMolFeaturizer for graph-based features
featurizer = dc.feat.ConvMolFeaturizer()
graph_features = featurizer.featurize([molecule])

The features DeepChem generates can then feed a machine learning application of your choosing. For example, you could use them as input for predicting chemical solubility or chromatographic behaviour.
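
Conceptually, a graph featurizer turns each molecule into per-atom feature vectors plus a record of which atoms are bonded. The toy example below builds that by hand for ethanol (SMILES CCO, heavy atoms only) using a one-hot encoding of the element. It is a deliberately simplified illustration: the names ELEMENTS, one_hot, and neighbours are invented here, and DeepChem's ConvMolFeaturizer computes much richer atom features than this.

```python
# Toy graph representation of ethanol (CCO), heavy atoms only.
ELEMENTS = ['C', 'N', 'O']  # tiny element vocabulary for the one-hot encoding
atoms = ['C', 'C', 'O']     # atom index -> element symbol
bonds = [(0, 1), (1, 2)]    # pairs of bonded atom indices


def one_hot(symbol):
    """Encode an element symbol as a one-hot vector over ELEMENTS."""
    return [1 if symbol == element else 0 for element in ELEMENTS]


# Node features: one vector per atom
node_features = [one_hot(symbol) for symbol in atoms]

# Adjacency list: for each atom, the indices of its bonded neighbours
neighbours = {index: [] for index in range(len(atoms))}
for a, b in bonds:
    neighbours[a].append(b)
    neighbours[b].append(a)

print(node_features)  # [[1, 0, 0], [1, 0, 0], [0, 0, 1]]
print(neighbours)     # {0: [1], 1: [0, 2], 2: [1]}
```

A graph convolution layer updates each atom's vector using the vectors of its neighbours, which is why both pieces of information are needed.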

Machine Learning Integration

DeepChem provides a seamless workflow for using molecular features in machine learning models:

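# `smiles_list` and `labels` are placeholders for your own molecules and measured values
features = featurizer.featurize(smiles_list)
dataset = dc.data.NumpyDataset(X=features, y=labels)

# Splitting the dataset and training a model (GraphConvModel requires TensorFlow)
splitter = dc.splits.RandomSplitter()
train_dataset, valid_dataset, test_dataset = splitter.train_valid_test_split(dataset)
model = dc.models.GraphConvModel(n_tasks=1, mode='regression')
model.fit(train_dataset)

# Predicting molecular properties
predictions = model.predict(test_dataset)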

This code snippet splits the data and trains a graph convolutional model, a type of neural network designed to learn from graph-structured data such as molecules.
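
For intuition, a random split just shuffles the sample indices and cuts them into three groups. The sketch below does this in plain Python with an 80/10/10 split. Note that random_split is a hypothetical helper, not DeepChem's implementation, and the fractions and seed are assumptions chosen for the example.

```python
import random


def random_split(n_samples, frac_train=0.8, frac_valid=0.1, seed=0):
    """Shuffle indices 0..n_samples-1 and cut them into train/valid/test lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    train_end = int(frac_train * n_samples)
    valid_end = train_end + int(frac_valid * n_samples)
    return indices[:train_end], indices[train_end:valid_end], indices[valid_end:]


train_idx, valid_idx, test_idx = random_split(100)
print(len(train_idx), len(valid_idx), len(test_idx))  # 80 10 10
```

Fixing the seed means the same molecules land in the test set on every run, which keeps model comparisons fair.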

Conclusion

The combination of RDKit and DeepChem provides a comprehensive toolkit for cheminformatics, enabling researchers and professionals to manipulate molecules, calculate properties, visualize structures, and generate complex features for machine learning models. Whether you’re involved in drug discovery, materials science, or chemical analysis, mastering these tools will enhance your research capabilities, streamline workflows, and contribute to innovative discoveries in your field.

Remember, the key to effective cheminformatics lies in a deep understanding of both the chemical concepts and the computational tools at your disposal. Happy researching!
