Can AI Create Molecules?

Nov 27, 2018

Drug discovery is a long and tedious process; it may take years of development and exploration by expert chemists and pharmacologists. In drug discovery, de novo design (i.e., designing an entirely new molecule from scratch) plays a crucial role. Aiming to bring the power of AI to chemical research, we introduce a new machine learning approach to graph generation and successfully apply it to automate the molecule design process.

Recent years have witnessed rapid progress in machine learning techniques for generating a wide variety of data, including images and sequences. Generating graphs, however, remains substantially harder. One key challenge is ensuring semantic validity in context. For example, in molecular graphs, the number of bonding-electron pairs must not exceed the valence of an atom, whereas in protein interaction networks, two proteins may be connected only when they belong to the same or correlated gene ontology terms. These constraints are not easy to incorporate in a generative machine learning model.
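To make the valence constraint concrete, here is a minimal sketch of a validity check on a molecular graph. The `MAX_VALENCE` table, the function name, and the bond-order matrix layout are illustrative assumptions for this post, not code from the paper, and hydrogens are treated as implicit.

```python
# Illustrative assumption: typical maximum valences for a few common elements.
MAX_VALENCE = {"H": 1, "C": 4, "N": 3, "O": 2, "F": 1}


def valence_violations(atoms, bond_orders):
    """Return the indices of atoms whose total bond order exceeds their valence.

    atoms: list of element symbols, one per node.
    bond_orders: symmetric matrix; entry [i][j] is the bond order between
    atoms i and j (0 means no bond).
    """
    violations = []
    for i, symbol in enumerate(atoms):
        used = sum(bond_orders[i][j] for j in range(len(atoms)) if j != i)
        if used > MAX_VALENCE[symbol]:
            violations.append(i)
    return violations


atoms = ["C", "O"]
double_bond = [[0, 2], [2, 0]]  # C=O: oxygen uses 2 of its 2 valence slots
triple_bond = [[0, 3], [3, 0]]  # oxygen would need 3 slots, which is too many

print(valence_violations(atoms, double_bond))
print(valence_violations(atoms, triple_bond))
```

A check like this is easy to run on a finished graph; the difficulty described above is getting a generative model to satisfy it for the graphs it produces.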

Figure 1: A molecule graph example. Numbers indicate valence.

To address this challenge, we propose a regularization framework for variational autoencoders (VAEs) as a step toward semantic validity. We focus on the matrix representation of graphs and formulate penalty terms that regularize the output distribution of the decoder to encourage the satisfaction of validity constraints.

Figure 2: Overview of the generation framework.

Generative models aim to learn the distribution of training examples. The emergence of deep architectures for generative modeling has enabled the generation of new samples that resemble those in the training data. One of the most popular frameworks is the VAE, a neural network that extends traditional autoencoders by modeling the observation probabilistically. As with regular autoencoders, the goal of a VAE is to reconstruct its input. It consists of an encoder and a decoder. The encoder maps the input data into a latent space. Unlike a usual encoder, however, the encoder of a VAE outputs the parameters of a distribution over the latent space. A latent vector is then sampled from this distribution, and the decoder generates a reconstruction of the original input from it. Trained with stochastic gradient descent, the VAE learns a distribution of the latent representations. Thus, by sampling from the latent distribution and decoding, the VAE generates a new data sample: in our case, a novel molecular graph.
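The sampling step can be sketched in a few lines. The example below assumes the common choice of a diagonal Gaussian latent distribution with a standard normal prior, as in Kingma and Welling; this post does not state which distribution our model uses, and the function names are hypothetical.

```python
import math
import random


def sample_latent(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1), one value per dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL(q || p) for q = N(mu, diag(exp(log_var))) and p = N(0, I)."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, log_var))


rng = random.Random(0)
mu, log_var = [0.5, -1.0], [0.0, -2.0]  # stand-ins for encoder outputs
z = sample_latent(mu, log_var, rng)     # latent vector passed to the decoder
kl = kl_to_standard_normal(mu, log_var)
print(z, kl)
```

Under these assumptions, the training loss would combine a reconstruction term with the KL term, and generating a new sample amounts to drawing `z` from the prior and decoding it.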

We represent a graph as a matrix concatenating its node features and edges and treat it as a pseudo image. Then, a VAE used for image generation produces new graphs. However, graphs generated in this way are not necessarily valid (for example, the bonding constraints may be violated). To resolve the validity challenge, we enforce constraints during the training of the VAE. First, in the graph representation of a molecule, the configuration of bonds (edges) must meet the valence criteria of the atoms (nodes). Second, the generated graph should be connected, i.e., there is a path between any pair of nodes. Third, in some cases we have extra compatibility constraints for node types. We want the samples produced by the decoder to meet these constraints regardless of which latent vector it starts with. Thus, in addition to the usual reconstruction loss, we add a regularization component to the objective function. We use a Monte Carlo approach to sample additional latent vectors unseen in the training data and penalize the generated graphs if they violate the validity constraints. In this way, we substantially improve the validity ratio of the generated graphs.
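As a rough illustration of the Monte Carlo regularizer, the sketch below averages a valence penalty over latent vectors drawn from a standard normal prior. It is a simplified stand-in under stated assumptions, not the formulation in the paper: `decode` is a hypothetical callable that returns an expected bond-order matrix and a per-node valence capacity, and the hinge form of the penalty is our choice for this example.

```python
import random


def valence_penalty(expected_bond_order, capacity):
    """Sum, over nodes, of how far the expected bond order exceeds the valence."""
    n = len(capacity)
    total = 0.0
    for i in range(n):
        used = sum(expected_bond_order[i][j] for j in range(n) if j != i)
        total += max(0.0, used - capacity[i])
    return total


def monte_carlo_regularizer(decode, latent_dim, num_samples, rng):
    """Average the penalty over latent vectors sampled from N(0, I)."""
    total = 0.0
    for _ in range(num_samples):
        z = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
        expected_bond_order, capacity = decode(z)
        total += valence_penalty(expected_bond_order, capacity)
    return total / num_samples


def toy_decode(z):
    """Hypothetical decoder: a two-node graph whose bond order grows with |z[0]|."""
    b = 1.0 + abs(z[0])
    return [[0.0, b], [b, 0.0]], [4, 2]  # capacities of a carbon and an oxygen


reg = monte_carlo_regularizer(toy_decode, latent_dim=3, num_samples=100,
                              rng=random.Random(0))
print(reg)
```

In training, a term like `reg` would be added to the usual VAE loss with some weight, so that the decoder is discouraged from producing invalid graphs even for latent vectors that no training example maps to.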

We demonstrate the effectiveness of the proposed model on two datasets for molecule generation: QM9 and ZINC. On the QM9 dataset, which contains only small molecules, our model achieves over 95% validity and 95% novelty. On the larger ZINC dataset, we achieve approximately 35% validity, which significantly outperforms previous models. We also experiment with denoising node-incompatible graphs into compatible ones and obtain nearly 95% valid graphs. This result indicates that our model can be used as a general framework for generating graphs beyond molecules.

We will present this work in a paper titled “Constrained Generation of Semantically Valid Graphs via Regularizing Variational Autoencoders” (authors: Tengfei Ma, Jie Chen, Cao Xiao) at the 2018 Conference on Neural Information Processing Systems on Tuesday, December 4, during the morning poster session from 10:45 am to 12:45 pm in Room 210 & 230 AB #9.

References:

1. Kingma, D.P. and Welling, M. Auto-Encoding Variational Bayes. In ICLR 2014.

2. Kusner, M.J., Paige, B. and Hernández-Lobato, J.M. Grammar Variational Autoencoder. In ICML 2017.

3. Ma, T., Chen, J. and Xiao, C. Constrained Generation of Semantically Valid Graphs via Regularizing Variational Autoencoders. In NIPS 2018.

Written by MIT-IBM Watson AI Lab