You didn’t need deep learning to generate new molecules

Mostapha Benhenda
Published in The AI Lab
Sep 21, 2018


Molecule generation is a hot topic in AI for drug discovery. In a previous blog post, I showed that some methods struggle to generate diverse molecules. Since then, new papers have appeared to address this problem (sometimes without citing my paper that raised it). Most of the proposed solutions are quite complicated, like those from Insilico Medicine, Harvard-Toronto-Insilico Medicine, Israel Institute of Technology, or Stanford.

However, these complicated solutions are probably unnecessary, as suggested by another, much simpler paper from a team led by Koji Tsuda at the University of Tokyo. They propose a genetic algorithm called ChemGE, for Chemistry Grammatical Evolution.

They consider a fitness score that evaluates the desired output: druglikeness, expected activity, and so on. Starting from a random collection of molecules, they keep the fittest half of the population and eliminate the rest (selection). Next, they double the surviving population by random sampling (reproduction), and randomly tweak the chemical formula of the newborn molecules (mutation). They iterate until the population of molecules is acceptable.
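To make the selection, reproduction, and mutation loop concrete, here is a minimal sketch in Python. It is not the ChemGE implementation: it evolves toy strings over a made-up four-letter alphabet rather than SMILES produced by a chemical grammar, the fitness function `toy_fitness` is a hypothetical placeholder, and it stops after a fixed number of generations instead of checking whether the population is "acceptable". All names and parameters below are assumptions chosen for illustration.

```python
import random

# Toy alphabet standing in for a real chemical grammar (assumption).
ALPHABET = "CNOF"


def toy_fitness(genome: str) -> float:
    """Hypothetical score: the fraction of 'C' symbols in the genome.

    A real run would plug in druglikeness, predicted activity, etc.
    """
    return genome.count("C") / len(genome)


def mutate(genome: str, rng: random.Random) -> str:
    """Overwrite one randomly chosen position with a random symbol."""
    i = rng.randrange(len(genome))
    return genome[:i] + rng.choice(ALPHABET) + genome[i + 1:]


def evolve(pop_size=20, genome_len=12, generations=50, seed=0,
           fitness=toy_fitness):
    """Run the loop and return the final population, fittest first."""
    rng = random.Random(seed)
    # Start from a random collection of "molecules".
    population = [
        "".join(rng.choice(ALPHABET) for _ in range(genome_len))
        for _ in range(pop_size)
    ]
    for _ in range(generations):
        # Selection: keep the fittest half, eliminate the rest.
        survivors = sorted(population, key=fitness, reverse=True)
        survivors = survivors[: pop_size // 2]
        # Reproduction: refill the population by sampling survivors.
        newborns = [rng.choice(survivors)
                    for _ in range(pop_size - len(survivors))]
        # Mutation: randomly tweak each newborn (survivors stay as-is).
        newborns = [mutate(child, rng) for child in newborns]
        population = survivors + newborns
    return sorted(population, key=fitness, reverse=True)


if __name__ == "__main__":
    best = evolve()[0]
    print(best, toy_fitness(best))
```

Because the survivors are carried over unchanged, the best score in this sketch can never decrease from one generation to the next. Swapping `toy_fitness` for a real scoring function, and the string mutation for a grammar-aware one, is where the actual chemistry would come in.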

The Tokyo team compared this genetic algorithm with their own deep reinforcement learning algorithm, ChemTS, inspired by AlphaGo. They found that ChemGE performed at least comparably to ChemTS: the generated molecules achieved good fitness and were sufficiently different from each other. Moreover, ChemGE was much faster.

ChemTS, the deep reinforcement learning baseline

Genetic algorithms have a long history in molecule generation. They date back to 1995 at least, with a paper by Glen and Payne, from Wellcome labs (an ancestor of the big pharma GSK).

Before genetic algorithms, there were other methods still, like this 1989 paper from Abbott Labs (an ancestor of the big pharma AbbVie). You can check this 1994 survey for more details (consider Sci-Hub if you don’t have a subscription). This historical background can explain why so many people in the pharma industry are skeptical about deep learning: they keep wondering whether it brings anything new to the table.

So the main contribution of this Tokyo paper is its benchmark of genetic algorithms (GA) against deep reinforcement learning (DRL), which is reasonably well executed (BenevolentAI attempted an earlier benchmark, but it was poorly executed; see an older blog post). The result should not be too surprising: similar observations were made about video game tasks. In April 2017, an OpenAI team led by Ilya Sutskever compared deep reinforcement learning with evolution strategies, and often found that the two have comparable performance.

Another interesting fact is that this Tokyo paper remains largely overlooked. It appeared in April 2018, more than five months ago, but apparently its conclusion didn’t fit the narrative or agenda of mainstream tech journalism, industry, and academia. This paper makes the field look embarrassingly cheap, which is bad news for business. The community is often reluctant to perform careful benchmarks. For example, there is this tweet by Olexandr Isayev, an academic from the University of North Carolina and author of a recent paper about deep reinforcement learning:

Olexandr Isayev should know that the adoption of deep learning followed a benchmark, the famous 2012 ImageNet challenge, in which a deep learning classifier, AlexNet, outperformed the alternatives by a wide margin. No benchmark, no deep learning revolution.

Shallow vs. Deep learning: it’s all about benchmarks

Benchmarks are the way to track progress, or lack thereof. I outlined a benchmark proposal in February 2018, and I am still looking for sponsors.

In the meantime, if you meet marketing people, from industry or academia, who want you to adopt their deep learning solution, ask them: how is it better than older, simpler, and faster methods?


Koji Tsuda reacts:

Anvita Gupta, from Stanford (paper here), reacts (multi-tweet thread):