Creating Molecules from Scratch II: AAE, VAE, and the Wave Transform

Published in Neuromation · Dec 26, 2018

It’s been quite a while, but the time has finally come to return to the story of deep learning for drug discovery, a story we began in April. Back then, I presented to you the first paper with an official Neuromation affiliation, “3D Molecular Representations Based on the Wave Transform for Convolutional Neural Networks”, published in Molecular Pharmaceutics, a top biomedical journal. By now, researchers from Neuromation have published more than ten papers, we have told you about some of them in our Neuromation Research blog posts, and, most importantly, we have already released our next big project in collaboration with Insilico Medicine, the MOSES dataset and benchmarking suite. But today, we finally come back to that first paper. Once again, many thanks to Insilico Medicine CEO Alex Zhavoronkov and Insilico Taiwan CEO Artur Kadurin, who have been the main researchers on this topic. I am very grateful for the opportunity to work alongside them on this project.

A Quick Recap: GANs for Drug Discovery

For a more detailed introduction to the topic, you are welcome to go back and re-read the first part; but to keep this one self-contained, let me begin with a quick reminder.

Drug discovery is organized like a highly selective funnel. At the first stage, doctors come up with the properties that a molecule should have to be a good drug (binding with a given protein, dissolving in water, and so on) and then with plausible candidates for molecules that might have these properties. These lead molecules are then sent to the lab, and if they survive preclinical studies, they go through the official process of clinical trials and, finally, approval by the FDA or similar bodies in other countries.

Only a tiny fraction of the lead molecules will ever get FDA approval, and the whole process is extremely expensive (developing a new drug takes about 10 years and costs $2.6 billion on average), so one of the main problems of modern medicine is to make the funnel as efficient as possible at every stage. Deep learning for drug discovery aims to improve the very first part, generating lead molecules: we try to develop generative models that will produce plausible candidates with useful properties.

We have already talked about GANs many times. Some of our latest posts have been almost exclusively devoted to GANs (e.g., this CVPR in Review post), so I will not repeat the basic structure. But let me repeat the main figure of the paper titled “The Cornucopia of Meaningful Leads: Applying Deep Adversarial Autoencoders for New Molecule Development in Oncology”, whose lead author Artur Kadurin is the current CEO of Insilico Taiwan, my Ph.D. student, and a co-author of the Deep Learning book we released about a year ago. Here is the architecture:

This is, in essence, a so-called conditional adversarial autoencoder; a minimal code sketch follows the list:

  • an autoencoder receives as input a SMILES fingerprint (basically a bit string that represents a molecule and makes a lot of chemical and biological sense) and the drug concentration; it learns to produce a latent representation (embedding) on the middle layer and then decode it back to obtain the original fingerprint;
  • the condition (GI at the bottom) encodes the properties of the molecule; the conditional autoencoder trains on molecules with known properties and can then generate molecules with desired combinations of properties by supplying those properties to the middle layer;
  • and, finally, the discriminator (on top) tries to tell the distribution of latent representations (embeddings) apart from some known distribution, e.g., a standard Gaussian; this is the main idea of AAE, the one that is supposed to turn an autoencoder into a generative model: if we can make the distribution of embeddings indistinguishable from a known distribution, we can sample from the known distribution and decode the samples to get reasonable objects.
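To make this more concrete, here is a minimal sketch of a conditional AAE in PyTorch. It is not the paper’s actual implementation: the layer sizes, the 166-bit fingerprint length, and the one-dimensional condition are all illustrative assumptions.

```python
# A minimal conditional AAE sketch (illustrative sizes, not the paper's code).
import torch
import torch.nn as nn

FP_DIM, COND_DIM, LATENT_DIM = 166, 1, 32  # fingerprint bits, condition size, latent size

encoder = nn.Sequential(nn.Linear(FP_DIM + 1, 128), nn.ReLU(),  # +1 for concentration
                        nn.Linear(128, LATENT_DIM))
decoder = nn.Sequential(nn.Linear(LATENT_DIM + COND_DIM, 128), nn.ReLU(),
                        nn.Linear(128, FP_DIM), nn.Sigmoid())
discriminator = nn.Sequential(nn.Linear(LATENT_DIM, 64), nn.ReLU(),
                              nn.Linear(64, 1), nn.Sigmoid())

def aae_losses(fp, conc, cond):
    """One step of the three AAE objectives; fp, conc, cond are float tensors."""
    z = encoder(torch.cat([fp, conc], dim=1))       # embed fingerprint + concentration
    recon = decoder(torch.cat([z, cond], dim=1))    # decode, conditioned on properties
    rec_loss = nn.functional.binary_cross_entropy(recon, fp)
    prior = torch.randn_like(z)                     # samples from the known prior N(0, I)
    d_loss = -(torch.log(discriminator(prior)) +
               torch.log(1 - discriminator(z))).mean()
    g_loss = -torch.log(discriminator(z)).mean()    # encoder tries to fool the discriminator
    return rec_loss, d_loss, g_loss
```

In training, the discriminator loss and the encoder/decoder losses would be minimized by separate optimizers in alternating steps; after training, you sample z from N(0, I), concatenate the desired condition, and decode.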

Again, we have been through all this in the first part, so I refer you there for more details. But today, we go further.

druGAN: AAE or VAE?

Our next paper on generative models for drug discovery had the laconic title “druGAN: An Advanced Generative Adversarial Autoencoder Model for de Novo Generation of New Molecules with Desired Molecular Properties in Silico”, and it appeared in Molecular Pharmaceutics in 2017. The Cornucopia paper that we reviewed above actually solved a relatively small and limited problem: the conditional AAE was trained on a dataset of only 6252 available compounds profiled on a single cell line (MCF-7). This limited scope, naturally, could not satisfy the ambitious team of Insilico Medicine. And it only considered one type of generative model, GANs… wait, what? There’s more?

Well yes, there is! There exists a wide variety of generative models even if you concentrate only on deep learning, i.e., on models that have neural networks somewhere. I recommend the well-known tutorial by Ian Goodfellow: a lot has happened in GANs since that tutorial, but its taxonomy of generative models is still very relevant.

One of the main classes of generative models in deep learning today is the variational autoencoder (VAE). The idea of VAE is exactly the same as in AAE: we want to make the distribution of latent embeddings z similar to some known distribution (say, a Gaussian) so that we can sample embeddings directly and then decode them to get sample objects. But VAE implements this idea in a completely different way.

VAE makes the assumption that the embeddings are indeed normally distributed, z ~ N(μ, Σ), where μ is the mean and Σ is the covariance matrix. The job of the encoder now is to produce the parameters of this normal distribution given an object, that is, the encoder outputs μ(x) and Σ(x) for the input object x; Σ(x) is usually assumed to be diagonal, so the encoder output is basically a vector of dimension 2d, where d is the dimension of z. VAE also places a standard normal prior on z, penalizing the distribution N(μ(x), Σ(x)) for deviating from it. Then VAE samples a vector z from the distribution N(μ(x), Σ(x)), decodes it back to the original space of objects and, as a good autoencoder should, tries to make the reconstruction accurate. Here is how it all comes together in the druGAN paper:

Notice how z is not sampled directly from N(μ(x), Σ(x)) but rather comes from a standard normal distribution, which is then scaled by Σ(x) and shifted by μ(x). This is known as the reparametrization trick, and it was one of the key ideas that made VAEs possible: it lets gradients flow through the sampling step back into the encoder.
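To see how simple the trick is in practice, here is a minimal sketch in PyTorch; the convention of predicting log-variances is a common implementation choice, not something specific to the paper.

```python
# The reparametrization trick: sample z ~ N(mu, diag(exp(logvar))) so that
# gradients can flow back into the encoder outputs mu and logvar.
import torch

def reparametrize(mu, logvar):
    eps = torch.randn_like(mu)                 # sample from the standard normal N(0, I)
    return mu + torch.exp(0.5 * logvar) * eps  # scale and shift: z = mu + sigma * eps

def kl_divergence(mu, logvar):
    """KL term that pushes N(mu, Sigma) toward the standard normal prior on z."""
    return -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1).mean()
```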

I’m not being entirely honest here: there is some beautiful mathematics behind all this, and it is needed to make everything work, but, unfortunately, it goes way outside the format of a popular article. Still, I recommend explanations such as this one, and maybe one day we will have a detailed NeuroNugget about it.

In the druGAN paper, Kadurin et al. compared this VAE with an AAE-based architecture, an improved modification of the one proposed in the Cornucopia paper. Here is the architecture; comparing it with the picture above, you can see the difference between AAE and VAE:

We trained several versions of both VAE and AAE on a set of MACCS fingerprints produced from the PubChem database of substances, which contains more than 72 million different molecules, quite a step up from the six thousand used in Cornucopia. The results were promising: we were able to sample quite varied molecules and also trained a simple linear regression that predicted solubility from the features extracted by the autoencoders. Generally, the best AAE models outperformed the best VAE models, although the latter had some advantages in certain settings.
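As an aside, if you want to see what such a fingerprint looks like, here is a sketch of computing one with the open-source RDKit library; RDKit is a common choice for this, though I am not claiming it is what we used in the paper.

```python
# Computing a MACCS fingerprint for a molecule given as a SMILES string.
from rdkit import Chem
from rdkit.Chem import MACCSkeys

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin, just as an example
fp = MACCSkeys.GenMACCSKeys(mol)                    # a 167-bit ExplicitBitVect
bits = list(fp)                                     # a 0/1 list an autoencoder can consume
```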

The most meaningful conclusion, however, was that we always had a tradeoff between the two most important metrics: quality of reconstruction (measured by the reconstruction error) and variability of the molecules sampled from the trained model (measured by various diversity metrics). Without the former, you don’t get good molecules; without the latter, you don’t get new molecules. This tradeoff lies at the heart of modern research on generative models, and it is still very hard to keep the results both reasonable and diverse.
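To make the two sides of the tradeoff concrete, here is one illustrative way to measure them on binary fingerprints; the paper used several diversity metrics, and these simple formulas are stand-ins of my own, not the paper’s definitions.

```python
# Illustrative metrics for the reconstruction/diversity tradeoff.
import numpy as np

def reconstruction_error(fp_true, fp_pred):
    """Mean per-bit disagreement between inputs and their reconstructions."""
    return float(np.mean(fp_true != fp_pred))

def sample_diversity(samples):
    """Mean pairwise Hamming distance (in [0, 1]) among sampled fingerprints;
    higher means more varied molecules."""
    n = len(samples)
    dists = [np.mean(samples[i] != samples[j])
             for i in range(n) for j in range(i + 1, n)]
    return float(np.mean(dists))
```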

Molecules in 3D: the Wave Transform Representation

And with that, we finally come to the paper that made us write these posts: the joint work between Insilico Medicine and Neuromation (okay, mostly Insilico) titled “3D Molecular Representations Based on the Wave Transform for Convolutional Neural Networks”; it also appeared in Molecular Pharmaceutics.

This work had a slightly different emphasis: instead of devising and testing new architectures, we took a look at the descriptions of molecules that are fed as input to these architectures. One motivation for this was that the entire framework of deep learning for drug discovery that we had seen in both Cornucopia and druGAN presupposes that we screen predicted fingerprints against a database of existing molecules. Not every bit string is a viable fingerprint, so you cannot take an arbitrary MACCS fingerprint and reconstruct a real molecule from it: you have to screen it against the fingerprints of actually existing molecules and find the best matches among them. If we could use a more informative molecular representation, we might not have to choose the proposed molecules from a known database, which leads to the holy grail of drug discovery: de novo generation of molecular structures.
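Schematically, the screening step looks like the sketch below. Tanimoto similarity is the standard way to compare binary fingerprints; the database and the top-k cutoff here are illustrative assumptions.

```python
# Screening a generated fingerprint against a database of known molecules.
import numpy as np

def tanimoto(a, b):
    """Tanimoto similarity between two binary fingerprints (0/1 integer arrays)."""
    union = np.sum(a | b)
    return np.sum(a & b) / union if union else 1.0

def screen(generated_fp, database_fps, top_k=5):
    """Indices of the top_k known fingerprints closest to the generated one."""
    sims = np.array([tanimoto(generated_fp, fp) for fp in database_fps])
    return np.argsort(sims)[::-1][:top_k]
```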

So how can we encode molecular structure? People have tried a lot of things: a comprehensive reference by Todeschini et al. (2009) lists several thousand molecular descriptors, and this list has grown even further over the last decade. They can be broken down into string encodings, such as MACCS itself, graph encodings that capture the molecular graph (there are some very interesting works on how to do convolutions on graphs, e.g., (Kearnes et al., 2016; Liu et al., 2018)), and 3D representations that also capture the bond lengths and the mutual orientation of atoms in space.

In molecular biology and chemistry, the 3D structure of a molecule is called a conformation; a given molecule can have many different conformations, and it may turn out that it’s important to choose the right one. For example, if the part of the molecule that is supposed to bind with a protein gets hidden inside the rest of the molecule, the drug will simply not work. So it sounds like a good idea to feed our models with 3D structures of the molecules in question: after all, a molecule is basically a picture in 3D, and there are plenty of successful CNNs with 3D input.

But it proves to be not that easy. Let’s look at the main picture from the paper:

Part (a) shows how the original molecule looks in 3D space: it’s a 3D structure composed of different atoms, shown in different colors in the picture. How do we represent this structure to feed it to convolutional networks? The most straightforward answer would be to discretize the space into voxels (fun fact: the linear size of a voxel here is 0.5 Å; that’s the angstrom, 0.1 nanometers!) and encode each atom as a one-hot vector of its type in the corresponding voxel; the result is shown in part (b).
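In code, this discretization might look like the following sketch; the set of atom types, the grid size, and the assumption that coordinates are already centered in the box are all illustrative.

```python
# One-hot voxelization of a molecule at 0.5 angstrom resolution (a sketch).
import numpy as np

ATOM_TYPES = {"C": 0, "N": 1, "O": 2, "S": 3}  # one channel per atom type
RESOLUTION = 0.5                                # voxel edge in angstroms
GRID = 64                                       # 64^3 voxels, a 32-angstrom box

def voxelize(atoms):
    """atoms: list of (element, x, y, z), coordinates in angstroms, centered."""
    grid = np.zeros((len(ATOM_TYPES), GRID, GRID, GRID), dtype=np.float32)
    for element, x, y, z in atoms:
        i, j, k = (int(round(c / RESOLUTION)) + GRID // 2 for c in (x, y, z))
        if 0 <= i < GRID and 0 <= j < GRID and 0 <= k < GRID:
            grid[ATOM_TYPES[element], i, j, k] = 1.0  # one-hot atom type
    return grid
```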

But this representation is far from perfect. First, it’s very sparse: less than 0.1% of the voxels contain atoms. Second, due to this sparsity, interactions between atoms are also hard to capture: yes, some atoms are near each other and some are farther away, but there is a lot of empty space around the atoms, the data does not have enough redundancy, and CNNs just don’t work too well with this kind of data. Sparse voxels lead to sparse gradients, and the whole thing underfits.

Therefore, it is better to smooth out the 3D representation in some way. Parts (c) and (d) of the picture above show Gaussian smoothing: we take each atom and “blur” it out with a Gaussian kernel, getting an exponentially decaying “ball” around each atom. The kernel in (d) has a much higher variance than the one in (c), so the result is more “blurry”. This introduces the necessary redundancy and also makes the resulting representation more robust to errors:

In the paper, we proposed a different kind of “blurring” based on the wave transform: its kernel is a Gaussian multiplied by a cosine of the distance to the center, so the “ball” still decays exponentially but now spreads out in waves. The result is shown in part (e) above. In the paper, we show that this transform has better theoretical properties, deriving an analytical inverse operation (deconvolution) for the wave transform.
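Here are the two kinds of kernels side by side, following the description above; the σ and ω values and the kernel size are illustrative, not taken from the paper’s reference implementation.

```python
# Gaussian smoothing kernel vs. the wave transform kernel (a Gaussian
# multiplied by a cosine of the distance to the center).
import numpy as np

def gaussian_kernel(r, sigma=1.0):
    return np.exp(-r**2 / (2 * sigma**2))

def wave_kernel(r, sigma=1.0, omega=3.0):
    # still decays exponentially, but spreads out in waves
    return np.exp(-r**2 / (2 * sigma**2)) * np.cos(omega * r)

# Smoothing a voxel grid = 3D convolution with such a kernel, e.g.:
size = 7
coords = np.indices((size, size, size)) - size // 2
r = np.sqrt((coords**2).sum(axis=0)) * 0.5  # distances in angstroms, 0.5 A/voxel
kernel3d = wave_kernel(r)                   # convolve the voxel grid with this
```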

This translates into practical advantages, too. In the paper, we trained a simple autoencoder based on the Xception network, and even in this simple experiment you can see that the wave transform representation performs better. The picture below shows reconstruction results from the autoencoder at different stages of training:

We can see that the voxel-based representation never allowed the network to reconstruct anything except carbon (and even that quite poorly), and Gaussian blur added nitrogen; the wave transform, however, was also able to reconstruct oxygen atoms, and the general structure looks much better as well. Our experiments also showed that the wave transform representation outperforms the others in classification problems, e.g., in reconstructing the bits of MACCS fingerprints.

Conclusion

In this post, we have seen how different generative models compare for generating molecules that might become plausible candidates for new drugs. Insilico Medicine is already testing some of these molecules in the lab. Unfortunately, it’s a lengthy process, and nothing is guaranteed; but I hope we will soon see some of the automatically generated lead molecules confirmed by real experiments, and this may completely change medicine as we know it. Best of luck to our friends and collaborators from Insilico Medicine, and I’m sure we will meet them again in future NeuroNuggets. Stay tuned!

Sergey Nikolenko
Chief Research Officer, Neuromation
