AI in drug discovery is overhyped: examples from AstraZeneca, Harvard, Stanford and Insilico Medicine
Investments in AI for drug discovery are surging, and Big Pharma is throwing big bucks at it. Sanofi signed a $300 million deal with the startup Exscientia, and GSK signed a similar $42 million deal. The Silicon Valley VC firm Andreessen Horowitz launched a new $450 million bio investment fund, with applications of AI to drug discovery as one focus area.
In this craze, lots of pharma/biotech companies and investors wonder whether they should jump on the bandwagon in 2018, or wait and see.
In this post, I argue that they must be careful, because AI researchers quite often overhype their achievements, to say the least. This practice is widespread. For illustration, I looked at recent research from one big Pharma, AstraZeneca, two universities, Harvard and Stanford, and one startup, Insilico Medicine. These labs are quite reputable, and they produce otherwise interesting research.
Update: all of these labs answered. See the end of the post.
I am only talking about the tip of the iceberg here; the situation is often worse elsewhere. For example, I won’t even talk about companies like IBM Watson, which overhype their proprietary solutions. This secrecy helps them escape public criticism, until reality catches up with them.
However, I am not saying that AI should be discarded completely. In an innovative field like drug discovery, staying behind is not an option, and first-movers enjoy a huge competitive advantage. A good compromise is to move quickly but carefully, and to solicit counter-expertise along the way.
For such counter-expertise services, pharma players can ask Startcrowd, by clicking here. Startcrowd is an online network of AI experts and enthusiasts, well-positioned to deliver independent counter-expertise. We tap into the talent pool emerging from online education, which keeps Startcrowd away from the conflicts of interest plaguing the Pharma industry.
This counter-expertise is a way to avoid new disappointments with computer-aided methods. Pharma veterans remember the epic failure of rational drug design in the 1980s. At that time, big Pharmas were promising the next industrial revolution. It didn’t happen.
I am optimistic that things can be different in 2018. It’s not really because of breakthroughs in artificial intelligence, but rather because R&D organization can be improved: stronger checks and balances are possible now, with the rise of online education (Coursera, EdX…) and social media. These present new opportunities for open peer-review, which can deflate the bubble. The mission of Startcrowd is to accelerate this trend.
Now, let’s get into the technical part, with some recent examples of overhyped AI research.
In this paper, AstraZeneca researchers (jointly with others) aim to generate novel molecules using recurrent neural networks and reinforcement learning. This question matters because a creative AI should bring more diversity to the lead generation pipeline.
This paper caught my attention because of the large part devoted to the evaluation of the model. It has the appearance of depth: they introduce various metrics based on Tanimoto similarity and on Levenshtein distance, and they provide an impressive number of visualizations, using histograms, violin plots and t-SNE.
However, all their measures compare AI-generated molecules on the one hand with natural molecules on the other. They always ‘omitted’ to measure the distances of AI-generated molecules from each other. This omission builds an illusion of diversity: a large distance between AI-generated and natural molecules suggests that the AI got creative, and that it explored new directions in chemical space. It would mean that we got something like this graphic:
However, if the distances among AI-generated molecules remain small, we have fallen into the trivial situation where the model generates a stream of molecules all located in the same place. No diversity is generated, and we are actually left in a situation like this:
For a more technical discussion, see pages 6–7 of my paper here.
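To make the missing measurement concrete, here is a minimal sketch (my own illustration, not the paper’s code) of internal diversity: the mean pairwise Tanimoto similarity *within* the set of generated molecules. Fingerprints are represented here as plain Python sets of “on” bit indices; in practice they would come from a cheminformatics toolkit such as RDKit.

```python
from itertools import combinations

def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprints (sets of on-bits)."""
    if not fp_a and not fp_b:
        return 1.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)

def internal_similarity(fingerprints):
    """Mean pairwise Tanimoto similarity within one set of molecules.
    Values close to 1 mean the generator keeps producing near-identical
    molecules, i.e. low diversity -- exactly the case these papers never measure."""
    pairs = list(combinations(fingerprints, 2))
    return sum(tanimoto(a, b) for a, b in pairs) / len(pairs)

# Toy fingerprints (hypothetical): a "creative" generator vs a degenerate one.
diverse = [{1, 2, 3}, {4, 5, 6}, {7, 8, 9}]
collapsed = [{1, 2, 3}, {1, 2, 3}, {1, 2, 4}]

print(internal_similarity(diverse))    # low value: genuine diversity
print(internal_similarity(collapsed))  # high value: a stream of near-clones
```

Comparing generated molecules only against natural ones cannot distinguish these two cases; the two-line internal check above can.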
At Harvard, a team noticed this diversity issue. By looking at the samples generated by the AI, they sensed something was going wrong. They tried to do something about it, and proposed the ORGAN model, here and here.
Their idea is to bring more chemical diversity and chemical realism by correcting the generator with a second neural network, called a discriminator, which penalizes the generator when molecules look too unnatural. This idea is drawn from the literature on Generative Adversarial Networks (GANs), a hyped topic within the AI community.
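The core mechanism can be sketched in a few lines. In ORGAN-style training, the generator’s reinforcement-learning reward blends a chemical objective (e.g. drug-likeness) with the discriminator’s realism score; the exact weighting scheme and the names below are my assumptions for illustration, not the authors’ code.

```python
def organ_reward(objective_score, discriminator_score, lam=0.5):
    """Blend a domain objective with a GAN discriminator's realism score.
    Both scores are assumed to lie in [0, 1]; lam trades one off against
    the other (lam=1 ignores the discriminator entirely)."""
    return lam * objective_score + (1.0 - lam) * discriminator_score

# Two hypothetical molecules with the same objective score: the one the
# discriminator finds realistic earns a higher reward.
print(organ_reward(0.8, 0.9))  # realistic-looking molecule
print(organ_reward(0.8, 0.1))  # unnatural-looking molecule
```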
This idea is interesting, but the execution is terrible. They conclude that their ORGAN is better, but this claim rests only on their personal visual observation, without any quantitative support (see page 3 of their paper). Their quantitative experiments don’t support their conclusion.
This was to be expected, because like AstraZeneca, they only compare AI-generated molecules with natural ones, and never compare AI-generated molecules with each other.
Moreover, the way they train their model is problematic. This can be seen in the log file of their training (which they had the good idea to make public too). Their discriminator penalizes the generator almost maximally throughout training. They have a perfect discriminator problem, which essentially nullifies any practical benefit of using a GAN.
In their defense, they might have inherited this perfect discriminator problem from the SeqGAN paper, on which ORGAN is built. We can’t know for sure, because contrary to the ORGAN team, the SeqGAN team didn’t make their training log public, and nobody bothered to reproduce their experiments.
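Why does a perfect discriminator nullify the benefit of the GAN? A minimal sketch, under the assumption of a REINFORCE-style update with a batch-mean baseline: each sample’s contribution to the gradient scales with its reward minus the baseline. If the discriminator confidently scores every generated molecule as fake (reward ≈ 0), the centered rewards all vanish, and the generator receives no learning signal.

```python
def centered_rewards(rewards):
    """Subtract the batch-mean baseline, as in a simple REINFORCE update.
    The resulting values are (proportional to) each sample's weight in the
    policy gradient."""
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]

# Healthy training: the discriminator still makes mistakes, so rewards vary
# and some samples are pushed up, others down.
print(centered_rewards([0.2, 0.7, 0.5, 0.6]))
# Perfect discriminator: every generated sample gets reward 0, so the
# centered rewards are all zero and nothing is learned from this term.
print(centered_rewards([0.0, 0.0, 0.0, 0.0]))
```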
A more technical discussion is available in my paper, pages 5–6. I tweeted my paper to Alan Aspuru-Guzik, the ORGAN team leader. He answered:
I am still waiting for an adequate response.
Stanford has a big team dedicated to AI and deep learning for chemistry. The team leader is Vijay Pande, who is also a startup investor at Andreessen Horowitz, co-managing their $450 million bio fund. Their flagship project is MoleculeNet, a ‘benchmark specially designed for testing machine learning methods of molecular properties’. It has the appearance of seriousness, with lots of chemical compounds, lots of graphics, and lots of deep learning models. In particular, much space is devoted to graph-CNNs and other chemistry-specific neural networks developed by this Stanford team.
However, there’s an elephant in this room too: the Pande team did not bother to plug their data into a character-level convolutional neural network. Char-CNNs have been routinely used for text processing since 2015, and they are much simpler than graph-CNNs. To use a char-CNN, it is sufficient to feed in the SMILES strings directly.
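To show how little featurization this requires, here is an illustrative sketch of the input encoding: a char-CNN consumes each SMILES string as a padded matrix of one-hot character vectors, with no chemistry-specific machinery. The character alphabet below is a tiny hypothetical subset chosen for the example.

```python
ALPHABET = list("CNO()=c1#")  # hypothetical subset of SMILES characters
CHAR_TO_IDX = {ch: i for i, ch in enumerate(ALPHABET)}

def one_hot_smiles(smiles, max_len=12):
    """Encode a SMILES string as a max_len x len(ALPHABET) one-hot matrix,
    zero-padded on the right. This matrix is what a char-CNN convolves over,
    exactly as it would over characters of ordinary text."""
    matrix = [[0] * len(ALPHABET) for _ in range(max_len)]
    for pos, ch in enumerate(smiles[:max_len]):
        matrix[pos][CHAR_TO_IDX[ch]] = 1
    return matrix

# Benzene in SMILES notation, ready to be fed to any 1-D convolutional net.
encoded = one_hot_smiles("C1=CC=CC=C1")
```

That is the entire pipeline before the network itself; nothing here needs a dedicated chemistry library, which is the point of the comparison.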
Why did they avoid such an easy task? On page 17 of their paper, we can read:
“Recent work has demonstrated the ability to learn useful representations from SMILES strings using more sophisticated methods, so it may be feasible to use SMILES strings for further learning tasks in the near future.”
I honestly doubt that a char-CNN is too sophisticated for this Stanford team. They even use a char-CNN in another paper.
A more plausible, and embarrassing, explanation is that they were scared that a char-CNN would be better. Their beloved model, graph-CNN, would then be beaten at their own MoleculeNet benchmark. This wouldn’t fit their agenda.
Which agenda? MoleculeNet is closely related to the DeepChem library, which implements MoleculeNet models. DeepChem is open-source and Stanford-led. If char-CNN is better than graph-CNN, then practitioners don’t really need DeepChem, because for a state-of-the-art model they can simply use plain TensorFlow or PyTorch. It is 2018, and adoption of open-source frameworks is a strategic asset. For example, by open-sourcing Android, Google came to dominate the mobile OS market; in AI software, Google is leveraging its open-source library TensorFlow to expand its Google Cloud Platform. Likewise, DeepChem might be aiming for dominance in the AI-for-drug-discovery niche, and that may be why MoleculeNet ‘omitted’ char-CNN. One concrete possibility is that Andreessen Horowitz could quietly invest in a cloud drug discovery platform powered by DeepChem.
This conjecture is reinforced by my personal experience with DeepChem. I naively tried to use DeepChem in my project, until I realized that I couldn’t mix DeepChem models and non-DeepChem models. That would be useful for adversarial training, with a DeepChem discriminator and a non-DeepChem generator. Instead, I got completely locked into DeepChem code. I didn’t expect something so vicious. To escape from this trap, and to make DeepChem truly open, I had to dig into complex code (my unlocked fork of DeepChem is here). Doing the same for a more mature project would be much more difficult. So my impression is that with this strategy of technology lock-in, DeepChem wants to eat the world of AI for chemistry. This would not surprise me from an investor partnering with Marc Andreessen.
While MoleculeNet team members avoided benchmarking char-CNN, they still found time to design fancy landing pages for MoleculeNet and DeepChem, which suggests that their priority is PR fluff, not solid science. It’s a strategy typical of Silicon Valley, where startups design mock products to attract traffic, and then rely on their community to build the real thing.
Insilico Medicine is a pioneer in generative models among AI startups. In this paper (the paywall can be bypassed using Sci-Hub), Alex Zhavoronkov and his team proposed ‘DruGAN, an advanced generative adversarial autoencoder model’. I keep wondering what is so advanced about this model.
It is definitely not advanced with respect to the demands of drug discovery: it suffers from the same flaws as the other generative models, which can lead to disappointments down the road.
It’s not advanced either with respect to previous papers in the literature, which use more sophisticated tools. They allude to them on pages 9–10 (without citing them):
This study uses the MACCS molecular fingerprints that are not ideal representation of molecular structure. Direct SMILES [here], InChI, molecular graphs [here] and other more chemically- and biologically-relevant representations of the molecular structures may serve as better types of training.
It is not even advanced with respect to their own Variational Auto-Encoder (VAE) used for benchmarking. In the paper, they claim that DruGAN is better than VAE, but on Github, one DruGAN author acknowledges the contrary:
Actually, we didn’t tune VAE network as much as AAE [DruGAN], so it isn’t very fair comparison. I mean that one can introduce upgrades for VAE and outperform our AAE.
So I am left thinking that DruGAN is only advanced with respect to their own previous paper, published 8 months earlier. Throughout the paper, they keep mentioning the improvements they made over their previous work. So maybe ‘advanced’ was just a term of self-congratulation.
In conclusion, many researchers in AI for drug discovery are overhyping their results. To navigate this AI bubble, it is important to hire strong counter-expertise services. Startcrowd provides such services, which can be ordered here.
Edit: Lots of discussion on Hacker News.
Edit: Interesting feedback on Reddit.
Edit: The DeepChem team (Stanford) disagrees on Facebook.
Edit: Commentary by Derek Lowe on his blog (blogs.sciencemag.org): “Here’s a piece to start some arguing: ‘AI in Drug Discovery is Overhyped’, by Mostapha Benhenda.”
An edited version was originally published at Biopharmatrend.