On the shortcomings of continuous representations of chemical space
Introduction
The paper (Gomez-Bombarelli et al., 2018), which introduced Chemical VAE and provided the core exploratory technique for the GENTRL model described in (Zhavoronkov et al., 2019), contains a few stark claims in its introduction. We quote an excerpt here:
“…computational molecular design is limited by the search strategy used to explore chemical space. Current methods either exhaustively search through a fixed library, or use discrete local search methods such as genetic algorithms or similar discrete interpolation techniques…The genetic generation of compounds requires the manual specification of heuristics for mutation and crossover rules. Discrete optimization methods have difficulty effectively searching large areas of chemical space because it is not possible to guide the search with gradients.”
The emphasis in the quotation is ours. In this article, we at Norachem show that these overreaching assertions do not stand up to scrutiny. In particular, we demonstrate the following:
Claim 1. Discrete genetic operations can access regions of chemical space that lie beyond the horizon of the latent space of Chemical VAE.
Claim 2. The projection of the large, unstructured, and high-dimensional chemical space down to a lower-dimensional latent space results in an unacceptable loss of information and structural diversity.
But first, some background for the ensuing discussion.
Gradient-based search with Chemical VAE
We begin with a brief description of the gradient-based approach of Chemical VAE, the details of which can be found in the original paper. Figure 1 in the paper, reproduced below, provides an excellent overview:
Figure 1(a) shows that the canonical SMILES representation of a molecule is converted to a high-dimensional one-hot vector, which is then mapped down to a lower-dimensional latent space by the encoder. The latent space is trained jointly with a predictor of three molecular properties: cLogP, synthetic accessibility score (SAS), and quantitative estimate of drug-likeness (QED). This predictor, the function f, maps each point in the latent space to an ordered triple of values of these properties.
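The one-hot encoding step can be illustrated with a minimal sketch. The character set and the fixed length below are hypothetical toy values chosen for illustration; they are not the alphabet or padding length used by Chemical VAE.

```python
from typing import List

# Hypothetical toy alphabet and length; the real model's values differ.
CHARSET = [" ", "C", "N", "O", "c", "n", "l", "1", "2", "(", ")", "=", "#"]
MAX_LEN = 12  # assumed fixed string length, right-padded with spaces


def one_hot_smiles(smiles: str) -> List[List[int]]:
    """Return a MAX_LEN x len(CHARSET) one-hot matrix for a SMILES string."""
    if len(smiles) > MAX_LEN:
        raise ValueError("SMILES string is longer than MAX_LEN")
    index = {ch: i for i, ch in enumerate(CHARSET)}
    matrix = []
    for ch in smiles.ljust(MAX_LEN):
        if ch not in index:
            raise ValueError(f"character {ch!r} is not in CHARSET")
        row = [0] * len(CHARSET)
        row[index[ch]] = 1  # exactly one position is set per character
        matrix.append(row)
    return matrix


def decode_one_hot(matrix: List[List[int]]) -> str:
    """Invert one_hot_smiles: take the argmax of each row, then strip the padding."""
    return "".join(CHARSET[row.index(max(row))] for row in matrix).rstrip()
```

Every character of the padded string becomes one row, so even a short molecule such as ethanol ("CCO") yields a matrix with MAX_LEN × len(CHARSET) entries, nearly all of them zero. That sparse matrix is the high-dimensional input the encoder compresses.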
Figure 1(b) illustrates how gradient descent may be employed to search the latent space for points with optimal combinations of molecular properties. When such a point is found, the decoder transforms it into a high-dimensional one-hot vector, which is then converted to a canonical SMILES string.
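To make the search procedure concrete, here is a toy sketch of gradient descent over a three-dimensional latent vector. The objective `toy_objective` is a hypothetical stand-in for a learned property predictor, and the finite-difference gradient replaces the automatic differentiation that a real model would use; neither is taken from the original paper.

```python
from typing import Callable, List

Vector = List[float]


def toy_objective(z: Vector) -> float:
    """Hypothetical stand-in for a learned predictor over latent space; lower is better."""
    target = [1.0, -2.0, 0.5]
    return sum((zi - ti) ** 2 for zi, ti in zip(z, target))


def numerical_gradient(f: Callable[[Vector], float], z: Vector, eps: float = 1e-5) -> Vector:
    """Central finite-difference estimate of the gradient of f at z."""
    grad = []
    for i in range(len(z)):
        up = z[:i] + [z[i] + eps] + z[i + 1:]
        down = z[:i] + [z[i] - eps] + z[i + 1:]
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


def gradient_descent(f: Callable[[Vector], float], z0: Vector,
                     lr: float = 0.1, steps: int = 200) -> Vector:
    """Follow the negative gradient from z0 for a fixed number of steps."""
    z = list(z0)
    for _ in range(steps):
        g = numerical_gradient(f, z)
        z = [zi - lr * gi for zi, gi in zip(z, g)]
    return z
```

The search returns only a latent vector. Whether that vector corresponds to a real molecule depends entirely on what the decoder produces from it.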
Figure 1(b) also prefigures the limitation inherent in this particular computational approach to molecular design. A molecule can be said to reside in the latent space only if its decoding is recognisable as an equivalent chemical structure. Any molecule that cannot be decoded correctly must therefore remain inaccessible to this gradient-based method of searching. Consequently, unless the decoder is a surjection from the latent space onto all of drug-like chemical space, there will be a large (and usually very large) class of molecules that remains invisible to this method of searching.
But the decoder of Chemical VAE is manifestly not a surjection. And therein lies the problem. The best molecules that meet the design goals of any exercise are almost guaranteed to reside in the invisible portion because the latent space represents a very thin slice of the drug-like chemical space. We shall have more to say on this point later in the article.
Discrete genetic search
The discrete genetic approach manipulates molecules in their original graph-theoretic complexity without projecting them down to a simpler linear form. This faithful representation of molecules admits of the reconstruction of any chemical structure, thereby giving the approach access to potentially all of drug-like chemical space. The authors of (Gomez-Bombarelli et al., 2018) criticise the need for “the manual specification of heuristics for mutation and crossover rules.” We shall show that a straightforward randomised approach, without any elaborate heuristics, produces excellent molecules that cannot be decoded by Chemical VAE.
A simple randomised approach to the genetic creation of molecules might feature the following two operators:
Operator 1 (randomised crossover operator). Given two parent molecules, the randomised crossover operator produces an offspring by combining the parents at randomly selected locations, as shown in Figure 2:
In the parent molecules on the left, the bonds highlighted in red are one possible random selection. The bond in green on the right indicates one possible way in which the resultant fragments from the parents can be coupled to form the offspring.
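A randomised crossover of this kind can be sketched on a bare molecular graph. The `Mol` class, the `fragment` helper, and the coupling rule below are illustrative assumptions, not Norachem's implementation. Bond orders, valence checks, and stereochemistry are deliberately ignored, and each parent is assumed to have at least one bond.

```python
import random
from typing import List, Set, Tuple

Bond = Tuple[int, int]


class Mol:
    """Minimal molecular graph: element symbols plus undirected bonds (orders ignored)."""

    def __init__(self, atoms: List[str], bonds: Set[Bond]):
        self.atoms = list(atoms)
        self.bonds = {(min(i, j), max(i, j)) for i, j in bonds}


def fragment(mol: Mol, cut: Bond, keep: int) -> Tuple[List[int], Set[Bond]]:
    """Delete `cut`, then return the atoms and bonds of the component containing `keep`."""
    remaining = mol.bonds - {cut}
    seen, stack = {keep}, [keep]
    while stack:
        current = stack.pop()
        for i, j in remaining:
            if current == i and j not in seen:
                seen.add(j)
                stack.append(j)
            elif current == j and i not in seen:
                seen.add(i)
                stack.append(i)
    kept = {(i, j) for i, j in remaining if i in seen and j in seen}
    return sorted(seen), kept


def crossover(p1: Mol, p2: Mol, rng: random.Random) -> Mol:
    """Cut one random bond in each parent and couple one fragment from each."""
    cut1 = rng.choice(sorted(p1.bonds))
    cut2 = rng.choice(sorted(p2.bonds))
    keep1, keep2 = rng.choice(cut1), rng.choice(cut2)
    idx1, bonds1 = fragment(p1, cut1, keep1)
    idx2, bonds2 = fragment(p2, cut2, keep2)
    # Renumber atoms so the two fragments occupy disjoint index ranges.
    map1 = {old: new for new, old in enumerate(idx1)}
    map2 = {old: new + len(idx1) for new, old in enumerate(idx2)}
    atoms = [p1.atoms[i] for i in idx1] + [p2.atoms[i] for i in idx2]
    bonds = {(map1[i], map1[j]) for i, j in bonds1}
    bonds |= {(map2[i], map2[j]) for i, j in bonds2}
    bonds.add((map1[keep1], map2[keep2]))  # the new coupling bond
    return Mol(atoms, bonds)
```

If the selected bond lies in a ring, cutting it does not disconnect the parent, so the "fragment" is the whole ring-opened molecule. A production operator would also have to check valences after coupling.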
Operator 2 (randomised mutation operator). Given a parent molecule, the randomised mutation operator produces an offspring by substituting a randomly selected atom of the parent with another atom, as shown in Figure 3:
In the parent molecule on the left, the red hydrogen atom was selected at random and replaced with the green chlorine atom — itself a random selection — to form the offspring on the right.
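The mutation operator admits an equally short sketch. The `VALENCE` table below is a simplified assumption (one typical valence per element), and the function picks uniformly among all permissible (atom, replacement) pairs, which is one of several reasonable ways to randomise the choice. None of this is taken from Norachem's code.

```python
import random
from typing import List, Set, Tuple

# Simplified typical valences: an assumption for this sketch, not a chemistry model.
VALENCE = {"H": 1, "F": 1, "Cl": 1, "Br": 1, "O": 2, "N": 3, "C": 4}


def degree(index: int, bonds: Set[Tuple[int, int]]) -> int:
    """Number of bonds touching the atom at `index` (all bonds treated as single)."""
    return sum(1 for bond in bonds if index in bond)


def mutate(atoms: List[str], bonds: Set[Tuple[int, int]], rng: random.Random) -> List[str]:
    """Return a copy of `atoms` with one randomly chosen atom replaced by another element."""
    candidates = [
        (i, element)
        for i, current in enumerate(atoms)
        for element in sorted(VALENCE)
        # The replacement must differ and must support the atom's existing bonds.
        if element != current and VALENCE[element] >= degree(i, bonds)
    ]
    if not candidates:
        return list(atoms)  # nothing can be substituted; return an unchanged copy
    i, element = rng.choice(candidates)
    child = list(atoms)
    child[i] = element
    return child
```

For methane written with explicit hydrogens, the carbon already carries four bonds and cannot be replaced under this table, so every mutation swaps a hydrogen for another element, as in the hydrogen-to-chlorine example above.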
Notice that the two operators pictured above are straightforward to define and implement. They do not need cumbersome or elaborate hand-crafted rules.
Armed with these discrete operators, we can search the drug-like chemical space for molecules that have very good values of the three properties mentioned previously: cLogP, SAS, and QED. And as we conduct our search, we can record the fraction of the encountered molecules that are decoded correctly by Chemical VAE. The resulting sequence will provide us with a good comparison of the relative sizes of the chemical space scanned by the two search methods.
A comparison of the two search methods
Our computational experiment began with a set of 150 molecules sampled uniformly at random from the same collection of 250,000 molecules upon which Chemical VAE was trained — which ensured that Norachem did not have a larger view of the drug-like chemical space before the start of the simulation. Then we ran 128 iterations of Norachem’s generative design, with the goal of producing molecules with the best combinations of the three properties of interest.
In the course of the simulation, the randomised operators constructed and evaluated many different molecular structures. For each iteration, we calculated the number of structures that were decoded correctly from the latent space of Chemical VAE as described below:
For each new molecule constructed during our discrete search, the decoder of Chemical VAE returned a list of canonical SMILES strings of probable decodings. A molecule m was deemed to have been decoded correctly if one of the following was true:
- There was an exact match between the canonical SMILES representation of m and one of the canonical SMILES strings returned by Chemical VAE, or
- One of the canonical SMILES strings returned by Chemical VAE represented a molecule with a graph structure that was isomorphic to that of m.
The latter condition ensured that trivial differences in the SMILES representations did not obscure the underlying structural equivalence of the molecules. The former check was included because it was cheaper to conduct.
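The two-stage check can be sketched as follows. A real pipeline would parse SMILES strings with a cheminformatics toolkit. To keep this example self-contained, each molecule is supplied as a hypothetical triple of (SMILES string, atom list, bond list), and the isomorphism test is a plain backtracking search that ignores bond orders.

```python
from typing import Dict, List, Sequence, Set, Tuple

Bond = Tuple[int, int]
Molecule = Tuple[str, List[str], List[Bond]]  # (SMILES, atoms, bonds); assumed layout


def isomorphic(atoms_a: Sequence[str], bonds_a: Sequence[Bond],
               atoms_b: Sequence[str], bonds_b: Sequence[Bond]) -> bool:
    """Backtracking test for a label-preserving graph isomorphism (small graphs only)."""
    if sorted(atoms_a) != sorted(atoms_b) or len(bonds_a) != len(bonds_b):
        return False
    edges_a = {frozenset(b) for b in bonds_a}
    edges_b = {frozenset(b) for b in bonds_b}
    n = len(atoms_a)
    mapping: Dict[int, int] = {}  # atom index in A -> atom index in B
    used: Set[int] = set()

    def extend(i: int) -> bool:
        if i == n:
            return True
        for j in range(n):
            if j in used or atoms_a[i] != atoms_b[j]:
                continue
            # Adjacency to every already-mapped atom must agree in both graphs.
            if all((frozenset((i, k)) in edges_a) == (frozenset((j, mapping[k])) in edges_b)
                   for k in mapping):
                mapping[i] = j
                used.add(j)
                if extend(i + 1):
                    return True
                del mapping[i]
                used.discard(j)
        return False

    return extend(0)


def decoded_correctly(target: Molecule, candidates: List[Molecule]) -> bool:
    """Cheap string comparison first; fall back to graph isomorphism."""
    smiles, atoms, bonds = target
    if any(c[0] == smiles for c in candidates):
        return True
    return any(isomorphic(atoms, bonds, c[1], c[2]) for c in candidates)
```

With ethanol as the target, a candidate written as "OCC" fails the string comparison but passes the isomorphism test, whereas dimethyl ether ("COC") fails both despite having the same atoms.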
This calculation yielded a sequence that established an upper bound for the number of discretely generated structures that could have been found through Chemical VAE. The graph of the cumulative percentage of correctly decoded structures against the iteration number is shown in Figure 4:
Approximately 8% of the molecules of the first iteration were decoded correctly. This number fell rapidly as the iterations progressed, before stabilising around 3.5%, which means that 96.5% of the molecules constructed and evaluated by the discrete search method were inaccessible through Chemical VAE! This provides a stark demonstration of Claim 1.
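The cumulative percentage can be computed from per-iteration counts as in the sketch below. The function name and the input layout are assumptions made for illustration, and the counts in the usage note are invented, not the data behind Figure 4.

```python
from typing import List, Tuple


def cumulative_percentages(per_iteration: List[Tuple[int, int]]) -> List[float]:
    """Given (decoded_count, total_count) for each iteration, return the running
    percentage of all molecules seen so far that were decoded correctly."""
    decoded_so_far = 0
    total_so_far = 0
    result = []
    for decoded, total in per_iteration:
        decoded_so_far += decoded
        total_so_far += total
        # Guard against a leading iteration that produced no molecules.
        result.append(100.0 * decoded_so_far / total_so_far if total_so_far else 0.0)
    return result
```

For example, made-up counts of (8, 100), (2, 100), and (1, 100) give running percentages of 8.0, 5.0, and roughly 3.67. A high first-iteration value is diluted in this way as later iterations contribute few correctly decoded molecules.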
Even with a “dumb” randomised strategy, the discrete operators can scan regions of the drug-like chemical space that are inaccessible to Chemical VAE. The difference becomes even greater when we employ Norachem’s artificial intelligence to construct operators with near-optimal heuristics for a given exercise.
One observation to bear in mind when comparing the two approaches is that the fraction of molecules decoded correctly by Chemical VAE remains persistently low as the number of iterations increases. Norachem’s generative design creates progressively better molecules with each iteration, so Figure 4 above is a vivid indication of the troubling fact that the gradient-based search of Chemical VAE fails to see most of the good molecules in the later stages of the design.
Figure 5 shows one such good molecule uncovered by the discrete operators that Chemical VAE fails to decode:
This molecule has excellent values for all the properties of interest. Yet, it remained invisible to Chemical VAE, as evidenced by the fact that the canonical SMILES that was returned was invalid.
We have shown that the discrete operators can find molecules that Chemical VAE cannot. But is the converse possible? Can Chemical VAE find molecules that the discrete operators cannot represent?
The answer is a categorical no! This point has already been made in the section on discrete genetic search. With very simple provisions, the discrete operators can reconstruct any drug-like molecule — and especially those that are retrosynthesisable. In other words, these discrete operators scan a chemical region that is a proper superset of the region represented by the latent space of Chemical VAE.
Some consequences of relying on VAEs for drug design
In the introduction we stated that the variational autoencoder approach forms the central exploratory technique of GENTRL (Generative Tensorial Reinforcement Learning), the main deep learning model of Insilico Medicine.
An important consequence of the continuity of the latent space is that there is a very low probability that the gradient descent operator will encounter a diverse range of molecular structures before reaching a minimum. This, in turn, makes scaffold-hopping unlikely, and results in molecular recommendations that are not far from the known literature — an undesirable circumstance. This is the basis of Claim 2.
The substance of Claim 2 is corroborated by the fact that the original publication of (Zhavoronkov et al., 2019) was swiftly followed by criticisms such as (Walters and Murcko, 2020), in which the critics point out that the behaviour of GENTRL’s best compound was a plain consequence of its structural similarity to known DDR1 kinase inhibitors. In (Zhavoronkov and Aspuru-Guzik, 2020) the authors attempt to reply to the criticisms — inadequately, we believe — while acknowledging that GENTRL’s compounds lacked optimisation.
Norachem’s generative design is uniquely suited to answer the challenges raised by (Zhavoronkov et al., 2019) and the subsequent commentary. We postpone a detailed comparative analysis between Norachem and GENTRL to a future article.
Conclusions
The main conclusion is that gradient-based methods are a sub-optimal way of searching the immense and unstructured drug-like chemical space. Any attempt to project the nearly infinite chemical space down to a continuous representation of finite dimension will result in a loss of information and lead to inferior drug candidates.
Norachem’s generative design overcomes the limitations of gradient-based approaches by not restricting itself to continuous representations, and by using its artificial intelligence to optimise heuristics for searching different areas of chemical space.
References
Gomez-Bombarelli, R., Wei, J. N., Duvenaud, D., Hernandez-Lobato, J. M., Sanchez-Lengeling, B., Sheberla, D., Aguilera-Iparraguirre, J., Hirzel, T. D., Adams, R. P., Aspuru-Guzik, A. (2018). Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules. ACS Cent. Sci. 4 (2), 268–276
Zhavoronkov, A., Ivanenkov, Y. A., Aliper, A., Veselov, M. S., Aladinskiy, V. A., Aladinskaya, A. V., Terentiev, V. A., Polykovskiy, D. A., Kuznetsov, M. D., Asadulaev, A., Volkov, Y., Zholus, A., Shayakhmetov, R. R., Zhebrak, A., Minaeva, L. I., Zagribelnyy, B. A., Lee, L. H., Soll, R., Madge, D., … Aspuru-Guzik, A. (2019). Deep learning enables rapid identification of potent DDR1 kinase inhibitors. Nature Biotechnology 37, 1038–1040
Walters, W. P., Murcko, M. (2020). Assessing the impact of generative AI on medicinal chemistry. Nature Biotechnology 38, 143–145. DOI: 10.1038/s41587-020-0418-2
Zhavoronkov, A., Aspuru-Guzik, A. (2020). Reply to ‘Assessing the Impact of Generative AI on Medicinal Chemistry’. Nature News, Nature Publishing Group. www.nature.com/articles/s41587-020-0417-3