TRANSFORMing natural product drug discovery: machine learning for high-fidelity chemical property prediction from metabolomics data

August Allen
Oct 11, 2022


Teaching computers the language of chemistry using mass spectrometry: Part 1

Tom Butler, David Healey, August Allen & Viswa Colluru, Enveda Biosciences

The core problem in natural product drug discovery: finding needles in nature’s haystack

Natural products are unique in being simultaneously the most validated and the most untapped source of new drugs. Depending on the exact definition, somewhere between ~30% (internal analyses, most conservative) and ~60% of FDA-approved small molecules through the year 2020 owe their origin to molecules from nature. Yet, when complex chemical samples collected from natural sources are subjected to the best analytical techniques for recall of known structures contained in the sample, less than 10% return a match. This implies the existence of a large, untapped pool of chemistry derived from billions of years of evolution: both constrained by biology (the kinds of molecules that can fit into biosynthetic protein pockets) and constrained for biology (producing these specific compounds increased the fitness of the species that makes them, presumably by interacting with protein pockets of other forms of life in its environment).

Moreover, the latest estimates of the total number of natural products known to date are in the mere hundreds of thousands, compared to the millions of compounds that have served as the backbone of high-throughput screening (HTS) campaigns in traditional pharmaceutical discovery over the last couple of decades. Set against the share of approved drugs that trace back to nature, these relative numbers suggest greater translatability of natural products. In fact, several independent analyses support this greater translatability (recapped elegantly by our SAB member, Ryan Shenvi). For example, chemical space defined by high Fsp3 (fraction of sp3 carbons), high stereochemical content, high oxygen content, high ring content, and low aromaticity (properties enriched in natural products) correlated with increased progression through clinical trials.

So, why then, did natural product libraries fall out of favor for drug discovery? The answer is that these libraries led to a lot of non-deterministic failures when applied to modern drug discovery. Apart from assay incompatibility with emergent molecular screening techniques like SPR, these difficulties fell into three main categories:

  1. Inability to quickly prioritize lead-like structures [“Chemical Annotation”]
  2. Inability to confidently identify the bioactive molecule in a mixture [“Biological Annotation”]
  3. Inability to access enough material to enable preclinical, clinical, and commercial development [“Material Access”]

At Enveda, we have built a technology platform that solves each of these three core problems to achieve incredible hit rates across difficult biology (historically undruggable targets or emerging modalities such as molecular glues). Today, we are excited to peel back the curtain on some of our work solving Problem #1: Chemical Annotation. Nearly two years ago, we asked ourselves the question: how can we prioritize the most interesting, attractive, and tractable novel chemistry without trial-and-error isolation using expensive NMR spectroscopy?

Unlocking metabolomics using mass spectrometry could scale natural product drug discovery

The ideal answer (for startup timescales), we figured, lay in new ways of looking at data produced by analytical instruments rather than inventing a new analytical technique. We turned to tandem mass spectrometry-based (LC-MS/MS) metabolomics, which can (i) take a mixture of compounds extracted from a natural source, up to thousands of molecules at a time, (ii) separate them using chromatography, and (iii) pass them through a tandem, or two-stage, mass spectrometer. The first stage (MS1) measures the mass of the individual compounds and their abundances. The second stage (MS2) fragments the compounds into pieces and, for each piece, measures its mass and abundance. Mass spectrometry has tremendous advantages for data collection over NMR:

  1. Individual compounds need not be isolated for analysis (parallelization)
  2. Thousands of compounds can be analyzed in minutes (throughput)
  3. Low variable cost per sample (cost)

Moreover, it turned out that mass spectrometry hardware was well ahead of any companion software, generating millions of data points per experiment that are largely analyzed by bespoke software packages tailored to one-off analyses on data stored locally. In fact, Pieter Dorrestein, Enveda’s scientific co-founder, helped buck this trend by introducing some of the first digital infrastructure for storing and searching mass spectrometry raw data (see here, here, and here for some examples). We were primed to make a breakthrough, and were helped by one more critical factor: metabolomics data was ideally suited for machine learning. This allowed us to explore whether we could do more than library matching with mass spectrometry data.

Transformers are an ideal match for metabolomics

As David so elegantly explained in his blog post last year, machine learning and metabolomics are an ideal match. This is because (to quote directly from David’s blog) mass spectrometry was fundamentally a problem of data representation: can you represent bags of masses and abundances (i.e., MS2 spectra) in a way that preserves structural similarity or identifies structural fragments or denotes structural class?
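To make the "bag of masses and abundances" idea concrete, here is a minimal Python sketch of one way to hold an MS2 spectrum in memory. The class name `MS2Spectrum`, its fields, and the peak values are illustrative assumptions, not Enveda's data model or measured data.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple

Peak = Tuple[float, float]  # (fragment m/z, abundance)


@dataclass(frozen=True)
class MS2Spectrum:
    """Hypothetical container for one MS2 spectrum."""

    precursor_mz: float      # MS1 m/z of the intact compound
    peaks: Tuple[Peak, ...]  # fragment peaks, in whatever order they were recorded

    def normalized(self) -> "MS2Spectrum":
        """Rescale abundances so the most intense peak equals 1.0."""
        top = max(abundance for _, abundance in self.peaks)
        scaled = tuple((mz, abundance / top) for mz, abundance in self.peaks)
        return MS2Spectrum(self.precursor_mz, scaled)

    def as_bag(self) -> FrozenSet[Peak]:
        """View the peaks as an unordered collection."""
        return frozenset(self.peaks)


# Made-up numbers, for illustration only.
spectrum = MS2Spectrum(300.0, ((120.5, 800.0), (85.25, 200.0), (57.0, 400.0)))
reordered = MS2Spectrum(300.0, tuple(reversed(spectrum.peaks)))

# The order in which peaks are listed carries no meaning.
print(spectrum.as_bag() == reordered.as_bag())
print(spectrum.normalized().peaks)
```

The point of the sketch is that nothing about a spectrum depends on the order of its peaks, which is the property the next section builds on.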

MS2 spectra lack a straightforward sequential or spatial dependency between the peaks. We realized that this makes them poor fits for traditional deep learning via convolutional or recurrent neural networks (CNNs or RNNs). By contrast, transformers, a neural network architecture originally introduced to capture linguistic structure over entire passages of text, could be ideal for MS/MS spectra. Their self-attention layers would allow learning of complex dependencies based on the identity of the fragments alone, without the locality or ordering assumptions that are inappropriate for MS/MS data. Transformers have only been in use for a few years, but their recent commoditization has made these powerful models much more accessible.
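The claim that self-attention makes no ordering assumption can be checked with a toy example. The sketch below is not the MS2Prop architecture: it is a single attention head with identity query/key/value projections and no positional encoding, written in plain Python, and the peak vectors are made up. It only illustrates that shuffling the peaks leaves the pooled output unchanged.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens: List[Vector]) -> List[Vector]:
    """Single-head scaled dot-product self-attention.

    Simplifying assumptions: queries, keys, and values are the tokens
    themselves (identity projections), and no positional encoding is added.
    """
    dim = len(tokens[0])
    scale = math.sqrt(dim)
    outputs = []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in tokens]
        weights = softmax(scores)
        outputs.append(
            [sum(w * value[d] for w, value in zip(weights, tokens)) for d in range(dim)]
        )
    return outputs


def mean_pool(vectors: List[Vector]) -> Vector:
    """Average the per-peak outputs into one spectrum-level vector."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]


# Each toy peak is (scaled m/z, relative abundance); values are illustrative.
peaks = [[0.42, 0.15], [1.10, 0.30], [1.38, 1.00]]

pooled = mean_pool(self_attention(peaks))
pooled_reversed = mean_pool(self_attention(peaks[::-1]))

print(all(math.isclose(a, b, rel_tol=1e-9) for a, b in zip(pooled, pooled_reversed)))
```

A CNN or RNN fed the same two orderings would generally give different answers, because its computation depends on which peak comes first or which peaks are neighbors.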

Applying transformers to the prediction of properties in novel chemical space works exceptionally well

Using transformers, we built MS2Prop, a machine learning model that predicts chemical properties directly from mass spectrometry data for novel compounds. In other words, the model predicts properties relevant to drug discovery directly from mass spectra, without relying on an actual or predicted structure. As such, we can generate predictions whether or not the compound is in an existing database. On novel compounds, MS2Prop achieves an average R² of 70%, meaning its predictions explain about 70% of the variation in properties from structure to structure (see the preprint for the list of all 10 properties). This is in contrast to an R² of 22% for the default method of looking up the closest spectral match in a database and calculating the properties from that molecule, or an R² of 9% using CSI:FingerID, a publicly accessible tool that combines fragmentation tree computation and machine learning. We show performance gains across key properties such as synthetic accessibility (addressing Problem #3 above), fraction of sp3 carbons, and quantitative estimate of drug-likeness (QED).
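For readers less familiar with the metric, the sketch below shows how a coefficient of determination (R²) can be computed for one property. It uses the standard textbook definition, which may differ in detail from the evaluation in the preprint, and the true and predicted values are made up for illustration.

```python
from typing import Sequence


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - (residual SS / total SS)."""
    mean_true = sum(y_true) / len(y_true)
    ss_residual = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_total = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_residual / ss_total


# Hypothetical property values (for example, QED) for four compounds.
true_values = [0.2, 0.4, 0.6, 0.8]
model_predictions = [0.3, 0.3, 0.7, 0.7]
always_the_mean = [0.5, 0.5, 0.5, 0.5]

print(round(r_squared(true_values, model_predictions), 3))
print(round(r_squared(true_values, always_the_mean), 3))
```

In this toy case the predictions explain 80% of the variance, while always guessing the average explains none of it. An average R² across several properties would simply be the mean of one such value per property.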

For the first time, MS2Prop enables confident decision-making about novel chemical space directly from mass spectrometry data.

We are industrializing natural product drug discovery using MS2Prop

MS2Prop is not only significantly more accurate; it is also orders of magnitude faster (12,000 times, on average) than the state of the art. It takes just ~2 milliseconds for MS2Prop to generate a prediction from an MS2 spectrum. This efficiency allows us to:

  1. Generate predictions in step with the throughput of our platform (analyze MS2 spectra related to tens of thousands of compounds daily), and
  2. Study unannotated natural chemical space across hundreds of millions of spectra for drug-likeness
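A quick back-of-the-envelope calculation shows why the per-spectrum latency matters at these scales. The sketch assumes a single worker processing spectra one at a time, takes 50,000 as a stand-in for "tens of thousands" of spectra per day, and uses 500 million spectra, the size of the public corpus mentioned later in this post. The slower per-spectrum time is derived from the quoted 12,000-fold average, not measured.

```python
MS_PER_SPECTRUM = 2          # from the post: ~2 ms per MS2Prop prediction
SPEEDUP = 12_000             # from the post: average speed-up over prior tools
SECONDS_PER_DAY = 86_400


def serial_seconds(n_spectra: int, ms_per_spectrum: int) -> float:
    """Wall-clock seconds to process n_spectra one after another."""
    return n_spectra * ms_per_spectrum / 1000


daily_batch = serial_seconds(50_000, MS_PER_SPECTRUM)             # assumed daily load
corpus_fast = serial_seconds(500_000_000, MS_PER_SPECTRUM)
corpus_slow = serial_seconds(500_000_000, MS_PER_SPECTRUM * SPEEDUP)

print(f"daily batch: {daily_batch:.0f} s")
print(f"500M spectra at 2 ms each: {corpus_fast / SECONDS_PER_DAY:.1f} days")
print(f"500M spectra, 12,000x slower: {corpus_slow / SECONDS_PER_DAY / 365:.0f} years")
```

Under these assumptions, a day's spectra take under two minutes, the full corpus takes about 11.6 days on one worker, and a tool 12,000 times slower would need roughly 380 years for the same corpus without parallelization.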

We are already using MS2Prop to guide the prioritization of interesting molecules prior to investing in the isolation and NMR analysis of any individual compound. As we build a growing collection of natural product extracts in our labs and annotate their function across a range of interesting biological assays, this capability is key to ensuring that our platform delivers molecules that will one day become medicines. Without MS2Prop, we (or more correctly, our medicinal chemists) would be disappointed the overwhelming majority of the times we isolated a molecule from an extract. While we will reserve our findings across libraries of plants, or even one plant, for another blog post (and paper), you can get some idea from the fact that only 0.637% of 500M publicly available spectra met the criterion of QED > 0.8. Millions of needles, but in an enormous haystack.
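The prioritization step described here amounts to filtering and ranking spectra by a predicted property. Below is a minimal sketch, assuming predictions are available as a mapping from spectrum ID to predicted QED. The function name `shortlist`, the IDs, and the scores are hypothetical; only the 0.8 cutoff, the 0.637% figure, and the 500M corpus size come from the text.

```python
from typing import Dict, List

QED_THRESHOLD = 0.8  # cutoff quoted in the post


def shortlist(predicted_qed: Dict[str, float], threshold: float = QED_THRESHOLD) -> List[str]:
    """Return spectrum IDs whose predicted QED exceeds the threshold, best first."""
    hits = [sid for sid, qed in predicted_qed.items() if qed > threshold]
    return sorted(hits, key=lambda sid: predicted_qed[sid], reverse=True)


# Hypothetical predictions for four spectra.
predictions = {"spec_a": 0.35, "spec_b": 0.91, "spec_c": 0.80, "spec_d": 0.84}
picked = shortlist(predictions)
print(picked)

# 0.637% of 500 million spectra, in exact integer arithmetic.
public_hits = 500_000_000 * 637 // 100_000
print(f"{public_hits:,} spectra above the cutoff")
```

The last line works out to about 3.2 million spectra, which is where "millions of needles" comes from.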

Let’s zoom out for a bit. We embedded approximately 210k unannotated spectra using Uniform Manifold Approximation and Projection (UMAP) and colored them by their quantitative estimate of drug-likeness (the QED property predicted by MS2Prop), highlighting drugs with FDA approval. We identified regions with many FDA-approved compounds (orange/red, Regions A and B), denoting highly successful chemical space covered by unmined natural products. We also observed several regions (Regions C-E) that were sparsely occupied by FDA-approved drugs but score equally well for drug-likeness, showing additional high-potential chemical space covered by natural products. When overlaid with sample source annotations and other data, these sorts of analyses tell us whether we are on the right track in our hunt for new medicines.

We are building the world’s largest metabolomics datasets for training our next-generation algorithms

We are incredibly proud of our work behind MS2Prop, but it is just the beginning. We know that a model is only as good as its data. To retain our advantage at the forefront of metabolomics, we are building the largest metabolomics dataset purpose-built for machine learning. We have started with phytochemicals, which are historically among the richest sources of therapeutic drugs. Over time, our search for new drugs to bring to the clinic will feed massive data into our ML algorithms, which will in turn provide better guidance to our drug discovery programs. Active learning strategies will help us identify and characterize the mass spectra whose identity is most likely to improve our models, which we can then actively characterize, until our models perform well across all phytochemical space. And all of natural chemical space. Alea iacta est!
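The post does not say which active learning strategy is used. As one common possibility, the sketch below ranks spectra by how much an ensemble of models disagrees about them (uncertainty sampling) and proposes the most contested ones for characterization. The function name, the IDs, and the prediction values are all hypothetical.

```python
from statistics import pvariance
from typing import Dict, List, Sequence


def most_uncertain(ensemble_predictions: Dict[str, Sequence[float]], k: int) -> List[str]:
    """Return the k spectrum IDs with the highest variance across ensemble members."""
    ranked = sorted(
        ensemble_predictions,
        key=lambda sid: pvariance(ensemble_predictions[sid]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical property predictions from three models for each spectrum.
ensemble = {
    "spec_a": [0.70, 0.71, 0.69],  # models agree
    "spec_b": [0.20, 0.85, 0.50],  # models disagree strongly
    "spec_c": [0.40, 0.55, 0.45],  # mild disagreement
}

to_characterize = most_uncertain(ensemble, k=2)
print(to_characterize)
```

Spectra the models already agree on are left alone, so characterization effort goes where new labels are most likely to change the model.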

PS: If you want to build cutting-edge ML tools for metabolomics and chemistry, discover new leads for targets that others have thought were too difficult, or turn a pipeline of unique molecules into medicines, get in touch!