A new metric for generative models for molecules, by JKU Linz

Mostapha Benhenda
Published in The AI Lab
Apr 2, 2018

There’s a new paper about a metric for generative models of molecules. If you have followed my blog (like here, here and here), you already know that deep learning generative models can produce new molecules for chemistry and drug discovery, but that their evaluation is difficult, especially when it comes to diversity.

The paper is by Kristina Preuer, Philipp Renz, Thomas Unterthiner, Sepp Hochreiter (famous for his co-invention of LSTM) and Günter Klambauer, from JKU Linz, Austria:

The Fréchet ChemblNet Distance
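For reference, the formula (writing μ and Σ for the mean and covariance matrix of the activations of the real data r and the generated data g) is:

$$
\mathrm{FCD}(r, g) = \lVert \mu_r - \mu_g \rVert^2 \;+\; \mathrm{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
$$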

This formula seems a little complicated, but after some thinking, it becomes quite simple: the FCD is a distance computed from the means and the covariance matrices of two distributions, the real one r and the generated one g.

The second term has this shape because a²+b²-2ab = (a-b)², and because the trace (the sum of the eigenvalues) is the usual way to measure the size of this kind of matrix.
So just think of FCD as a statistical distance between two distributions.

These two distributions are taken from representations in the penultimate layer of a “ChemblNet”, a neural network trained to predict various biological activities from the ChEMBL chemical database.
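Here is a minimal sketch of how this kind of distance can be computed, assuming you already have two activation matrices (one row per molecule) taken from that penultimate layer. The function below is my own illustration in numpy/scipy, not the authors’ reference implementation.

```python
# Minimal sketch: Fréchet distance between Gaussians fitted to two sets of
# activations (rows = molecules, columns = penultimate-layer units).
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(act_real, act_gen):
    mu_r, mu_g = act_real.mean(axis=0), act_gen.mean(axis=0)
    sigma_r = np.cov(act_real, rowvar=False)
    sigma_g = np.cov(act_gen, rowvar=False)

    # Matrix square root of the product of the covariances; tiny imaginary
    # parts coming from numerical error are discarded.
    covmean = sqrtm(sigma_r @ sigma_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real

    diff = mu_r - mu_g
    return diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean)
```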

ChemblNet is a kind of Inception neural network for chemistry, and actually, the same lab previously introduced the Fréchet Inception Distance to evaluate generators in computer vision. So the concept is not new, but the application to chemistry is.

What are the use cases for this metric? ‘Me-too’ or ‘first-in-class’ molecules?

In the paper, my favorite quote is:

A generative model should produce a) diverse molecules (SMILES) which possess similar b) chemical and c) biological properties as already known molecules.

There are indeed use cases where data scientists want to generate new molecules with properties similar to known molecules: for example, when a pharma company is looking for ‘me-too’ molecules in order to bypass competitors’ patents.

However, there is a more challenging case: suppose we want ‘first-in-class’ molecules, with an unseen combination of properties. For example, we have one dataset of molecules with property A, another dataset of molecules with property B, and we want to generate molecules combining both properties A and B (say, A = active against the disease, and B = soluble in water).

In this case, the output of a good generative model should deviate from both A and B, as measured by the FCD. So how should the FCD be generalized?

There’s a similar question about image generation in computer vision: for example, how does the Fréchet Inception Distance deal with a model that generates images of women with glasses, when there is no such image in the training set? (By the way, has this question already been raised in the computer vision community?)

DCGAN: generates women with glasses, with none in the training set

I have no idea about the answer: if you have anything, please share it in the comments, or on Telegram, or on the DiversityNet draft.

That’s why I think it still helps to use measures of internal diversity like variance or entropy. Their definitions are ‘intrinsic’ and do not require any comparison with a real-world reference distribution. As a result, they are more general (details here).
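As an illustration, here is one possible ‘intrinsic’ measure: the average pairwise Tanimoto distance over Morgan fingerprints. The function, its parameters, and the use of RDKit are my own choices for this sketch, not a standard definition.

```python
# Sketch of an 'intrinsic' internal diversity: average pairwise Tanimoto
# distance over Morgan fingerprints, computed with RDKit (assumed installed).
from itertools import combinations
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def internal_diversity(smiles_list, radius=2, n_bits=2048):
    mols = (Chem.MolFromSmiles(s) for s in smiles_list)
    fps = [AllChem.GetMorganFingerprintAsBitVect(m, radius, nBits=n_bits)
           for m in mols if m is not None]
    dists = [1.0 - DataStructs.TanimotoSimilarity(a, b)
             for a, b in combinations(fps, 2)]
    return sum(dists) / len(dists)
```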

What are the advantages over previous metrics?

Quite surprisingly, the paper claims:

The FCD’s advantage over previous metrics is that it can detect if generated molecules are a) diverse and have similar b) chemical and c) biological properties as real molecules.

I don’t think this advantage is significant. It’s easy to build a single score combining all old-fashioned metrics. For example, take the product:

Internal diversity x Nearest Neighbor diversity x Solubility x Activity x …..
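As a sketch, with purely hypothetical metric names (not a real API), such a composite score is a one-liner:

```python
# Toy composite score: multiply whatever individual metrics you already
# compute (the argument names below are placeholders).
def composite_score(internal_div, nn_div, solubility, activity):
    return internal_div * nn_div * solubility * activity
```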

I still imagine an advantage for metrics like the FCD (or the Earth Mover Distance, see here) over variance/entropy: they seem harder to ‘game’ by the AI. With reinforcement learning, people just stuff the reward function with anything that improves their metrics, as in this recent paper by Insilico Medicine. This hack is not very elegant, and appropriate metrics should be designed to counter this ‘cheating’ strategy.
