TorchDrug: A Drug Discovery Platform in PyTorch

Zhaocheng Zhu
Published in PyTorch
Oct 25, 2021


Photo by Volodymyr Hryshchenko on Unsplash, adapted by the author

Drug discovery is a long and costly process, taking on average 10 years and $2.5 billion to develop a single drug. Machine learning can reduce the time and cost of developing drugs by making predictions about biomedical entities from large amounts of data. Such a methodology is needed more than ever during the COVID-19 pandemic, as our vaccines struggle to keep up with mutations of the coronavirus. As part of our effort to fight the pandemic, we built a machine learning platform, TorchDrug, to accelerate drug discovery research with PyTorch.

The library is open source. If you already have PyTorch and torch-scatter installed, you can install it through pip with pip install torchdrug, or through conda with conda install -c milagraph -c conda-forge torchdrug.

Why do we need a platform for drug discovery?

Drug discovery has a very steep learning curve. Take tasks on molecules as an example: we have to learn a lot of chemistry just to process molecules into PyTorch tensors. We have to be familiar with scalable implementations of graph operations if we want to deploy algorithms on large-scale datasets. What's even worse, there isn't a uniform code paradigm across drug discovery projects, so we are likely to suffer when integrating multiple models.


Yes, we have experienced all these situations ourselves. That is why we decided to help other practitioners solve these problems. TorchDrug is our answer: a platform that minimizes the cognitive load of drug discovery and maximizes the flexibility of algorithm development. We hope TorchDrug can serve as a cornerstone for the machine learning community to conduct research in drug discovery.

What does TorchDrug provide?

In TorchDrug, we cover many recent techniques, ranging from graph machine learning and deep generative models to reinforcement learning. We also provide reusable training and evaluation routines for popular drug discovery tasks, including property prediction, pretrained molecular representations, de novo molecule design, retrosynthesis and biomedical knowledge graph reasoning. Based on these techniques and modules, it is easy to build a prototype for your own dataset and application.

For advanced users, we provide multiple levels of building blocks for different customization demands: low-level data structures and operations (e.g. molecules and graph masking), mid-level layers and models (e.g. graph convolutions and GNNs), and high-level task routines (e.g. property prediction). In other words, you should find TorchDrug flexible enough for all kinds of customization.

Getting Started

TorchDrug provides graph data structures and operations for manipulating biomedical objects, as well as reusable layers, models and tasks for building machine learning models. Let’s get started with the graph data structures first.

Playing with Graph Data Structures

Having trouble dealing with non-grid structures and operations? Struggling with hand-crafted features and complicated interfaces of molecule libraries? With TorchDrug, you can handle these problems in only a few lines using its well-designed data structures.

The core data structures of TorchDrug are graphs, which can be used to represent a wide range of biological objects, including molecules, proteins and biomedical knowledge graphs. For example, the following code creates a graph data structure from an edge list. We can use the visualization API in the library to check our graph object.
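Here is a minimal sketch of what such code might look like, based on the data.Graph and data.Molecule interfaces; the ring graph and the benzene SMILES string are illustrative choices:

```python
from torchdrug import data

# Build a ring graph with 6 nodes from an explicit edge list.
edge_list = [[0, 1], [1, 2], [2, 3], [3, 4], [4, 5], [5, 0]]
graph = data.Graph(edge_list, num_node=6)
graph.visualize()  # draw the graph with the built-in visualization API

# Molecules are graphs too; benzene is constructed here from a SMILES string.
mol = data.Molecule.from_smiles("C1=CC=CC=C1")
mol.visualize()
```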

A ring graph
Benzene

One practical issue in graph machine learning is how to create a batch of variable-size graphs. In TorchDrug, we solve this problem with a PackedGraph data structure, which builds one unified large graph and re-indexes each small graph in the batch. The following example shows how to build a PackedMolecule object from SMILES strings. The data structure can easily be transferred between CPUs and GPUs using the .cpu() and .cuda() methods, in the same way as PyTorch tensors.
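A sketch of batching with PackedMolecule; the four SMILES strings are arbitrary examples:

```python
from torchdrug import data

# Pack four molecules of different sizes into a single batch.
smiles_list = ["CCSCCSP(=S)(OC)OC",
               "CCOC(=O)N",
               "N(Nc1ccccc1)c2ccccc2",
               "NC(=O)c1cccnc1"]
mols = data.PackedMolecule.from_smiles(smiles_list)
mols.visualize()

# Move the whole batch between devices, just like a PyTorch tensor.
mols = mols.cuda()
mols = mols.cpu()
```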

A batch of 4 molecules

Just like PyTorch tensors, graphs support a wide range of indexing operations. Typical usages include node masking, edge masking and graph masking. We can combine these operations with tensor arithmetic to implement many transformations on graphs. The following example shows how to select the edges that contain at least one carbon atom. All these operations can be accelerated by GPUs.
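A sketch of such an edge mask; matching atoms by atomic number is one way to express the condition:

```python
from torchdrug import data

# A small batch of molecules to operate on.
smiles_list = ["CCSCCSP(=S)(OC)OC", "CCOC(=O)N",
               "N(Nc1ccccc1)c2ccccc2", "NC(=O)c1cccnc1"]
mols = data.PackedMolecule.from_smiles(smiles_list)

# Keep only the edges where at least one endpoint is a carbon atom.
carbon = 6  # atomic number of carbon; atom_type stores atomic numbers
node_in, node_out, bond_type = mols.edge_list.t()
edge_mask = (mols.atom_type[node_in] == carbon) | \
            (mols.atom_type[node_out] == carbon)
mols = mols.edge_mask(edge_mask)
mols.visualize()
```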

Result of edge masking. Only edges that contain at least one carbon are kept.

Building Property Prediction Pipeline

Besides the basic data structures, TorchDrug also provides us with a large collection of common layers, models and tasks. Here we demonstrate how to leverage these classes to quickly build a prototype for property prediction.

First of all, we load the dataset with TorchDrug. TorchDrug automatically downloads the dataset to the path you specify. We choose the ClinTox dataset here, which aims to predict whether a molecule is toxic in clinical trials and whether it is approved by the FDA.
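A sketch of this step, assuming the datasets.ClinTox class and a standard 80/10/10 random split; the download path is an arbitrary choice:

```python
import torch
from torchdrug import datasets

# Download ClinTox (if needed) to the given path and load it.
dataset = datasets.ClinTox("~/molecule-datasets/")

# Split into training, validation and test sets (80/10/10).
lengths = [int(0.8 * len(dataset)), int(0.1 * len(dataset))]
lengths += [len(dataset) - sum(lengths)]
train_set, valid_set, test_set = torch.utils.data.random_split(dataset, lengths)
```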

Examples of molecules in ClinTox dataset. Each molecule in ClinTox has two binary labels: toxicity and FDA approval.

We then define a graph neural network to encode the molecular graphs. Specifically, we use a Graph Isomorphism Network (GIN) with 4 hidden layers. Note that the model is just a neural network without any training target. To adapt it for classification, we wrap it in a property prediction module and define the classification task with the binary cross entropy (BCE) criterion.
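A sketch of the model and task definition, continuing from the dataset snippet above; the hidden dimensions and evaluation metrics are illustrative choices:

```python
from torchdrug import models, tasks

# GIN encoder with 4 hidden layers over the molecular graphs.
model = models.GIN(input_dim=dataset.node_feature_dim,
                   hidden_dims=[256, 256, 256, 256],
                   short_cut=True, batch_norm=True, concat_hidden=True)

# Wrap the encoder into a property prediction task trained with BCE.
task = tasks.PropertyPrediction(model, task=dataset.tasks,
                                criterion="bce", metric=("auprc", "auroc"))
```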

Finally, we create an optimizer for the parameters of the task and combine everything into a core.Engine. The engine provides convenient routines for training and evaluation. Testing the model on the validation set takes only one line.
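A sketch of the training and evaluation step, continuing from the snippets above; the learning rate, batch size and number of epochs are illustrative, and gpus=[0] can be dropped to run on CPU:

```python
import torch
from torchdrug import core

optimizer = torch.optim.Adam(task.parameters(), lr=1e-3)
solver = core.Engine(task, train_set, valid_set, test_set, optimizer,
                     batch_size=1024, gpus=[0])

solver.train(num_epoch=100)   # training loop
solver.evaluate("valid")      # one-line evaluation on the validation set
```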

Hierarchical Interface

To use TorchDrug flexibly, it helps to understand its hierarchical interface. TorchDrug is designed to cater to all kinds of development, ranging from low-level data structures and operations, through mid-level layers and models, to high-level tasks. We can easily customize modules at any level with minimal effort by utilizing building blocks from a lower level, as the sketch after the list below illustrates.

Hierarchical interface in TorchDrug. Depending on the need, we may utilize building blocks from different levels.

The correspondence between modules and the hierarchical interface is as follows:

  • torchdrug.data: Graph data structures and graph operations. e.g. molecules.
  • torchdrug.datasets: Datasets. e.g. QM9.
  • torchdrug.layers: Neural network layers and loss layers. e.g. message passing layer.
  • torchdrug.models: Representation learning models. e.g. message passing neural network.
  • torchdrug.tasks: Task-specific routines. e.g. molecule property prediction.
  • torchdrug.core: Engine for training and evaluation.
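As an example of customizing at the middle level, here is a sketch of a small custom GNN assembled from mid-level layers, following the convention that representation models expose an output_dim attribute and return node-level and graph-level features; the class name and dimensions are hypothetical:

```python
import torch
from torch import nn
from torchdrug import core, layers

class TinyGNN(nn.Module, core.Configurable):
    """A hypothetical two-layer GNN built from TorchDrug's mid-level layers."""

    def __init__(self, input_dim, hidden_dim):
        super(TinyGNN, self).__init__()
        self.output_dim = hidden_dim
        self.conv1 = layers.GraphConv(input_dim, hidden_dim)
        self.conv2 = layers.GraphConv(hidden_dim, hidden_dim)
        self.readout = layers.SumReadout()

    def forward(self, graph, input, all_loss=None, metric=None):
        # Two rounds of message passing followed by a nonlinearity.
        hidden = torch.relu(self.conv1(graph, input))
        hidden = torch.relu(self.conv2(graph, hidden))
        # Return both per-node features and a pooled graph-level feature.
        return {"node_feature": hidden,
                "graph_feature": self.readout(graph, hidden)}
```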

For more details about the interface, please check out our documentation and get involved in the development of TorchDrug.

Conclusion

Machine learning for drug discovery is a fast-growing area, and we hope that TorchDrug can help more and more people get involved in this interdisciplinary field. To learn more about TorchDrug, check out our Colab tutorials on basic usage and several drug discovery tasks. We are also building a community to help developers contribute, learn and discuss questions about TorchDrug. Please refer to the contributing guidelines for more details.

Finally, we would like to thank our wonderful colleagues at Mila for their support and feedback during the development of TorchDrug. We would also like to thank the authors of PyTorch and PyTorch Geometric, whose awesome design and implementation inspired us a lot. Last but not least, thank you for your interest in this article and our platform.

TorchDrug Team
