From ChatGPT to cMolGPT: Harnessing GPT for Drug Discovery

Wenlu Wang
3 min read · Apr 19, 2024

--

Figure 1: The workflow of our cMolGPT design. A: Pre-training cMolGPT architecture. B: Fine-tuning cMolGPT architecture. C: Target-specific conditional molecular generation.
1. Introduction

Small-molecule drug design aims to identify novel compounds with desired chemical properties. From a computational perspective, we treat this task as an optimization problem: we search chemical space for the compounds that maximize our quantitative goals. This optimization is computationally intractable, however, because the search space is effectively unbounded: the number of potential drug-like molecules has been estimated at 10⁶⁰ to 10¹⁰⁰, while only about 10⁸ molecules have ever been synthesized. Numerous computational methods, such as virtual screening, combinatorial libraries, and evolutionary algorithms, have been developed to search this vast chemical space in silico and in vitro. Computational chemistry has reduced the experimental effort of molecular design and helped overcome experimental limitations. Recent work has demonstrated that deep learning methods can produce new small molecules with desired biological activity. Here we provide insights into applying a generative pre-trained transformer (GPT), incorporating as much chemical domain knowledge as possible, for directed navigation toward desired regions of chemical search space.

2. Conditional Generative Pre-trained Transformer

As shown in Figure 1, the training process of our task can be summarized as follows:

(1) We first pre-train the base model of cMolGPT on the MOSES dataset, setting the target-specific embeddings to zero (i.e., feeding no target-specific information), as shown in Figure 1-A. At this stage there is no target constraint on the sequential generation; the model solely focuses on learning drug-like structures from the data.
(2) To feed in target-specific information, we fine-tune cMolGPT on <compound, target> pairs and enforce the corresponding target by feeding target-specific embeddings to the attention layers as “memories”, as shown in Figure 1-B (a minimal sketch of this conditioning follows the list). Here we use data in which each SMILES sequence is manually tagged with a target (e.g., a target protein), indicating the specific physicochemical property of the small molecule.
(3) We can then generate a drug-like structure by auto-regressively sampling tokens from the trained decoder, as shown in Figure 1-C; a sampling sketch appears below. Optionally, we can enforce a desired target by feeding its target-specific embedding. The generated molecules are then conditioned on the target-specific information and are likely to have the desired property.
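
To make the conditioning idea concrete, here is a minimal sketch (not the authors' released code) of a GPT-style SMILES decoder whose cross-attention “memory” is a learned per-target embedding. Reserving index 0 as a frozen all-zero embedding reproduces the unconditional pre-training of step (1); the vocabulary size, model dimensions, and target count below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ConditionalMolGPT(nn.Module):
    """GPT-style SMILES decoder conditioned on a target embedding (sketch)."""
    def __init__(self, vocab_size, n_targets, d_model=256, n_heads=8,
                 n_layers=4, max_len=128):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        # Index 0 is reserved as an all-zero, never-updated embedding:
        # the "no target" condition used during pre-training.
        self.tgt_emb = nn.Embedding(n_targets + 1, d_model, padding_idx=0)
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens, target_ids):
        # tokens: (batch, seq) SMILES token ids; target_ids: (batch,)
        b, t = tokens.shape
        pos = torch.arange(t, device=tokens.device)
        x = self.tok_emb(tokens) + self.pos_emb(pos)
        # The target embedding acts as a single-token cross-attention memory.
        memory = self.tgt_emb(target_ids).unsqueeze(1)       # (batch, 1, d)
        # Additive causal mask: -inf above the diagonal blocks future tokens.
        causal = torch.triu(torch.full((t, t), float("-inf"),
                                       device=tokens.device), diagonal=1)
        h = self.decoder(x, memory, tgt_mask=causal)
        return self.lm_head(h)                               # next-token logits
```

Pre-training (step 1) simply passes target_ids of all zeros; fine-tuning (step 2) passes the true target ids, so the decoder learns to attend to the target memory.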

The cMolGPT code is available on GitHub.
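
To illustrate step (3), here is a minimal auto-regressive sampling loop under the same assumptions (again a sketch, not the released code); the BOS/EOS token ids and temperature are placeholders, not the paper's settings.

```python
import torch

@torch.no_grad()
def sample_tokens(model, target_id=0, bos=1, eos=2, max_len=128,
                  temperature=1.0, device="cpu"):
    """Sample one token sequence, optionally conditioned on a target."""
    model.eval()
    tokens = torch.tensor([[bos]], device=device)
    tgt = torch.tensor([target_id], device=device)
    for _ in range(max_len - 1):
        # Take the logits for the last position and sample the next token.
        logits = model(tokens, tgt)[:, -1, :] / temperature
        nxt = torch.multinomial(torch.softmax(logits, dim=-1), 1)
        tokens = torch.cat([tokens, nxt], dim=1)
        if nxt.item() == eos:            # stop at end-of-sequence
            break
    return tokens.squeeze(0).tolist()    # map ids back to SMILES downstream
```

Passing target_id=0 gives unconditional generation; a nonzero id conditions generation on that target's embedding.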

3. Evaluation

To evaluate the model's ability to generate active target-specific compounds, we build a regression-based QSAR (quantitative structure–activity relationship) model for each target.
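
As a rough sketch of what such a scorer could look like (the text here does not pin down the descriptors or the regressor, so the Morgan-fingerprint / random-forest choice below is an assumption):

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from scipy.stats import pearsonr
from sklearn.ensemble import RandomForestRegressor

def featurize(smiles_list, radius=2, n_bits=2048):
    """Morgan fingerprints as fixed-length bit vectors."""
    X = np.zeros((len(smiles_list), n_bits), dtype=np.int8)
    for i, smi in enumerate(smiles_list):
        mol = Chem.MolFromSmiles(smi)
        if mol is None:                  # skip invalid SMILES
            continue
        fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
        arr = np.zeros((n_bits,), dtype=np.int8)
        DataStructs.ConvertToNumpyArray(fp, arr)
        X[i] = arr
    return X

def fit_qsar(smiles, activities):
    """Fit a regression QSAR model on active and inactive compounds."""
    model = RandomForestRegressor(n_estimators=500, n_jobs=-1)
    model.fit(featurize(smiles), np.asarray(activities))
    return model

def pearson_r(model, smiles, activities):
    """Pearson correlation between predicted and measured activity."""
    return pearsonr(model.predict(featurize(smiles)), activities)[0]
```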

Figure 2: Target data sets and the performance of the QSAR models. The active compounds are used for training the target-specific generative models; the QSAR models are trained on both active and inactive compounds for each target. R stands for the Pearson correlation coefficient.
Figure 3: TMAP of the top 5,000 generated compounds for (A) EGFR, (B) HTR1A, and (C) S1PR1.

4. Conclusion

We demonstrate that Transformer-based molecular generation achieves state-of-the-art performance in generating drug-like structures. To incorporate protein-target information, we present a target-specific molecular generator that feeds target-specific embeddings into a Transformer decoder. We apply the method to three target-biased datasets (EGFR, HTR1A, and S1PR1). Additionally, we visualize the chemical space and find that the generated novel target-specific compounds largely populate the sub-chemical space of the original active compounds.

Paper link:

“cMolGPT: A Conditional Generative Pre-Trained Transformer for Target-Specific De Novo Molecular Generation”

--

Wenlu Wang

Texas A&M University, Interdisciplinary AI Research