Not One Size Fits All

ML Models Must Match Their Use Cases in Drug Discovery

Winston Haynes
12 min read · Jun 20, 2023

Co-authored by LabGenius’ CTO, Leo Wossnig.

Drug Hunters Continue to Pursue the Ultimate Breakthrough

Drug discovery is historically slow, expensive, and riddled with failures — AI/ML is changing this paradigm.

The drug development process remains expensive (estimated at $1–2B all-in cost) with high failure rates (estimated at >90% once reaching clinical trials). Most of us working in TechBio, specifically in drug discovery, are pursuing a common goal: to get maximally efficacious treatments to patients as quickly and cheaply as possible.

Figure 1: Potential time saving with AI-enabled drug discovery. (Source: Subbiah.)

Quicker and more efficacious drug development has been enabled by advances in lab automation, genomics, precision trials, wearables, sensor technology, and clinical trial recruitment and management (nice summary in Shah, et al.). While significant strides in these areas have been made, there is still a long way to go to unlock their full potential.

Here, we focus on the ability of machine learning (ML or artificial intelligence/AI) to revolutionise the drug discovery process. In the media, we have seen remarkable advances in ML, ranging from acing standardised tests to generating art. Application of these same algorithms in drug discovery remains in its nascent phase, largely due to challenges with acquiring the appropriate training datasets and building biologically accurate models. Consequently, the vast potential of AI/ML to revolutionise this industry looms large, holding promise for a transformative breakthrough.

Are We Aiming for Faster? Cheaper? Better?

AI/ML in drug discovery has historically prioritised ‘faster’ and ‘cheaper’, but future breakthroughs will likely focus on finding superior molecules.

The acceleration and cost reduction of preclinical stages can have profound impact on the overall drug discovery process. Biotech startups have already shown the ability to drive down both of these variables to develop drugs faster and cheaper.

Figure 2: Biotech startups deliver new molecular entities (NME) more cost efficiently than big pharma. (Source: Bay Bridge Bio analysis.)

Regardless, there remains opportunity to gain additional efficiencies: even a marginal reduction in time and cost per program can result in significant overall savings when multiplied by the sheer volume of early stage initiatives. For instance, if there are a thousand preclinical programs, and each can save a week of time and a thousand dollars in cost, the cumulative effect is a saving of a thousand weeks and a million dollars.

The next breakthroughs are likely to focus on higher quality drugs

Figure 3: Financial impact of savings in speed, quality, and cost at each stage of the drug discovery pipeline. (Source: Bender and Cortes-Ciriano.)

Given the breakthroughs already achieved in speed and cost, ML’s biggest contribution to drug discovery is most likely to come from helping us find and develop better quality compounds. ‘Better’ can be broadly defined as: superior drug targets that improve clinical outcomes; increased functional activity in biophysical assays; lower rates of adverse events in preclinical models and human subjects; and, ideally, higher efficacy in human patients. AI/ML has the potential to discover new molecules with superior properties along every one of these axes, generating molecules that address the shortcomings of existing therapeutics. With better quality comes reduced failure rates throughout the discovery pipeline, meaning more effective therapies for more patients.

Exploring ML Approaches in Preclinical Drug Discovery

Here we provide an overview of the preclinical stages at which ML can be used to improve the discovery of therapeutics, including the data available to address the problems and a summary of current efforts at each stage.

1. Target Identification

This stage involves the identification of protein target(s) and clinical indication(s) for a new therapeutic. For example, immuno-oncology often focuses on identifying receptors that are uniquely co-expressed in cancers, but not (or to a lesser extent) on healthy tissues.

Available data: bulk gene expression (GEO, ArrayExpress), single cell gene expression (Human Cell Atlas, Single Cell Portal), proteomics (ProteomicsDB, PRIDE), histology (Human Protein Atlas), summary databases (Therapeutic Target Database), and scientific literature (PubMed).

In terms of publicly available data, target identification is the most data-rich step of the drug discovery process. Available datasets range from raw molecular measurements to curated records of human knowledge, and the approaches that integrate these data to prioritise targets are just as diverse.

ML approaches: Network analysis, NLP, and LLMs
Data driven approaches to target identification have focused on integrating different sources of omics data. Network biology, machine learning, and Bayesian approaches have all emerged to combine these data and propose therapeutic targets.

Figure 4: Examples of data sources and approaches for their integration to identify promising drug targets. (Source: You, et al.)
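
As a purely illustrative sketch of this kind of data integration (not any specific published pipeline), the snippet below ranks candidate targets by combining tumour-versus-healthy differential expression with a penalty for expression in vital healthy tissues. The genes, expression values, and weighting are all invented.

```python
import pandas as pd

# Hypothetical per-gene summary table; in practice these columns would be
# derived from resources such as GEO, the Human Cell Atlas, or ProteomicsDB.
targets = pd.DataFrame({
    "gene": ["HER2", "EGFR", "MSLN", "GAPDH"],
    "tumour_expr": [8.2, 7.5, 6.9, 9.1],        # mean log2 expression in tumour samples
    "healthy_expr": [2.1, 4.0, 1.5, 9.0],       # mean log2 expression in healthy tissue
    "vital_tissue_expr": [1.0, 3.5, 0.8, 9.2],  # expression where on-target killing is unacceptable
})

# Simple composite score: reward tumour-over-healthy contrast,
# penalise expression in vital healthy tissues (weight chosen arbitrarily).
targets["score"] = (
    (targets["tumour_expr"] - targets["healthy_expr"])
    - 0.5 * targets["vital_tissue_expr"]
)

print(targets.sort_values("score", ascending=False)[["gene", "score"]])
```

Real prioritisation pipelines weigh many more evidence types (genetic association, essentiality, tractability), but the principle of scoring and ranking targets across integrated data sources is the same.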

With the emergence of CAR-T and bispecific antibody therapies, the question has changed from “which single target maximally distinguishes cancer from normal tissues?” to “which combination of targets has this ability?” Multi-specifics (therapies which target multiple antigens) can decrease the on-target, off-tumour effects of cancer therapies by limiting their killing of healthy cells. A similar effect can be achieved by engineering antibodies to have avidity-driven activity (more active when more antigen is present) or activity that is selective for physiological conditions unique to the disease environment (e.g. low pH and high ATP levels in tumours).

Figure 5: Example integration of data to build logic gated CAR-T therapies. (Source: Dannenfelser, et al.)

Natural language processing (NLP) and biomedical question answering have been useful for querying knowledge that is otherwise locked in the scientific literature. An emerging solution is to query large language models (LLMs), including both generic models, like ChatGPT, and domain-specific models, like BioMedLM. For example, querying ChatGPT with “What proteins could be targeted for the treatment of triple negative breast cancer by antibody therapy?” yields the suggestions of EGFR, VEGF, PD-L1, PARP, and IGF-1R. While none of these are revolutionary proposals, more domain-trained LLMs are likely to aid in the acceleration of target identification in the near future.
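
To make the querying workflow concrete, here is a minimal sketch using the openai Python client (v1-style interface); the model name is a placeholder, and a domain-specific model like BioMedLM would instead be run locally or through its own interface.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

question = (
    "What proteins could be targeted for the treatment of "
    "triple negative breast cancer by antibody therapy?"
)

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name; swap for whichever model is available
    messages=[
        {"role": "system", "content": "You are a drug discovery assistant."},
        {"role": "user", "content": question},
    ],
)

print(response.choices[0].message.content)
```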

2. Lead Identification

This step aims to find binders for the predefined target.

Available data: protein structures (PDB, SAbDab), patents (Google Patents, Lens.Org).

ML approaches: Generative ML
Generative ML is a class of ML models (including GNNs, LLMs, GANs, VAEs, and diffusion networks) that generate novel data/responses. An emerging capability of generative ML in drug discovery is the de novo design of molecules. In this instance, an ML model generates entire molecules (SMILES, structures, sequences, etc.) when given an input drug target, often in the form of a sequence or structure. Most of the best performing models for protein design are diffusion based, with some even co-generating sequence and structure simultaneously.
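
To make the idea tangible, here is a deliberately tiny PyTorch sketch of a generative sequence model: a variational autoencoder over fixed-length, one-hot encoded peptides. Real protein generative models (diffusion models, protein language models) are far larger, are trained on millions of sequences, and condition on the target, none of which this toy does.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

AA = "ACDEFGHIKLMNPQRSTVWY"   # the 20 standard amino acids
SEQ_LEN = 16                  # toy peptide length
LATENT = 8

class SeqVAE(nn.Module):
    """Toy variational autoencoder over one-hot peptide sequences."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Linear(SEQ_LEN * len(AA), 64)
        self.mu = nn.Linear(64, LATENT)
        self.logvar = nn.Linear(64, LATENT)
        self.dec = nn.Sequential(
            nn.Linear(LATENT, 64), nn.ReLU(),
            nn.Linear(64, SEQ_LEN * len(AA)),
        )

    def forward(self, x):
        h = F.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterisation trick
        logits = self.dec(z).view(-1, SEQ_LEN, len(AA))
        return logits, mu, logvar

def sample(model, n=5):
    """Decode random latent vectors into new peptide sequences."""
    with torch.no_grad():
        logits = model.dec(torch.randn(n, LATENT)).view(n, SEQ_LEN, len(AA))
        idx = logits.argmax(-1)
    return ["".join(AA[int(i)] for i in row) for row in idx]

model = SeqVAE()          # untrained here; training on real binder sequences is omitted
print(sample(model))      # random-looking sequences until the model is trained
```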

New papers come out on a weekly basis that continue to push the boundaries of what is possible with these generative algorithms for protein design. For example: Yeh et al. generate novel luciferases by ‘hallucinating’ structures and then identifying the sequences that fulfil them; Wu et al. focus on binding repeated peptide sequences through docking and hashing peptides; and Luo, et al. use diffusion models to generate antigen-specific binders.

Figure 6. Example workflow for generating CDR designs for antigen-antibody interactions. (Source: Luo, et al.)

Due to the increasing availability of open-source or commercially available generative models, most companies can now make use of them. However, they are left to grapple with how to evaluate these new technological capabilities alongside their current toolkit, answering questions such as “Do you have a higher likelihood of success with de novo hits or with a naive/synthetic phage display screen or animal immunisation?” and “How many de novo hits need to be tested before achieving biological validation of activity?”

For venture capital firms (VCs), these questions are compounded. How do they identify whether portfolio companies are capturing the maximum value of generative technologies? (see questions from a16z). As these are early stage methods, major limitations remain:

  • In drug discovery, coming up with complementary structures that bind is not sufficient for a drug. Instead, we need to engineer multiple properties in parallel, such as function (which commonly goes beyond the affinity of the protein for the target, e.g. cytotoxicity, in vivo efficacy), developability (e.g. thermostability, yield, purity), and safety (e.g. specificity, immunogenicity). This is what we call goal-directed de novo design, goal-directed generative design, or multi-objective generative design.
  • Whilst they work relatively well when used off-the-shelf for very common proteins and targets (e.g. kinases), these methods face challenges in niche areas such as VHHs, multi-specifics (e.g. BiTEs), conditional antibodies, or other applications where little data is available (crystal structures for proteins are often the limiting factor).
  • Going from mono-VHHs to multi-valent and multispecific antibodies goes beyond the training data and capabilities of existing approaches. This is driven by the complexity and novelty of the molecules, as well as the challenges of predicting the impacts of long/flexible linker regions. Much of biotechs’ attention is focused on these more complex formats as they have the potential to overcome drawbacks in existing treatments.

On the positive side, because plenty of data is available for common targets, generative methods might already be able to help us design better binders. While binders are just the first step in the lengthy drug discovery process, this is certainly useful for many biotech companies as it may shorten timelines before lead optimisation begins.

3. Lead Optimisation

This step requires finding the best molecule over a defined space while simultaneously optimising for multiple properties, known as ‘co-optimisation’.

Available data: most data at this stage are private assets.

ML approaches: multi-objective optimisation, active learning
Once a lead molecule has been identified for therapeutic development, the next task is to optimise the therapeutic properties of the molecule so it can progress towards the clinic. Unfortunately, this is not as simple as maximising a single property of the molecule (e.g. binding selectivity to cancerous cells). Instead, drug developers must explore a high dimensional space to ensure that the molecule is efficacious, safe, manufacturable, and stable.

Figure 7. Visualisation of the drug development space being considered when developing new therapeutics. (Source: LabGenius.)

Unsurprisingly, improving one desired property often inadvertently worsens others (for example in medicinal chemistry). Very quickly, these multi-dimensional design and measurement spaces become massive. As one example, let’s say we have a lead mono-VHH molecule with a sequence of 130 amino acids for optimisation along the axes of potency, specificity, toxicity, immunogenicity, yield, and thermostability. There are 20^130 possible variants of this (relatively short) molecule, each of which we could measure along 6 axes. How do we efficiently explore and optimise this sequence space?
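
A quick back-of-the-envelope calculation makes the scale concrete (assuming every position can take any of the 20 amino acids):

```python
from math import log10

n_positions = 130     # length of the lead VHH sequence
n_amino_acids = 20
n_properties = 6      # potency, specificity, toxicity, immunogenicity, yield, thermostability

log_variants = n_positions * log10(n_amino_acids)   # log10 of 20**130
print(f"Sequence space: 20^130 ≈ 10^{log_variants:.0f} variants")
print(f"Property measurements for exhaustive coverage: ≈ {n_properties} × 10^{log_variants:.0f}")
```

At roughly 10^169 variants, exhaustive measurement is impossible; only a vanishingly small fraction of the space can ever be tested, which is precisely why sample-efficient methods matter.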

Researchers are commonly turning to active learning methods to solve this question. Active learning is a machine learning technique where the model actively and prospectively selects the most informative data points to learn from, rather than using a fixed set of training data. This approach helps improve the model’s performance with less (labelled) data, making the learning process more efficient and accurate. For background on active learning, we recommend this very useful explanation of active learning in Bayesian optimisation and this dive into more technical details.
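
The sketch below shows the bare bones of such a loop on a synthetic dataset, using the variance across a random forest’s trees as the uncertainty signal; the ‘assay’ is an invented function standing in for a real experiment.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

def run_experiment(x):
    """Stand-in for a wet-lab assay: a hidden function we can only query by 'running' it."""
    return np.sin(3 * x[:, 0]) + 0.5 * x[:, 1] + rng.normal(0, 0.05, len(x))

pool = rng.uniform(-1, 1, size=(500, 2))                       # candidate molecules as feature vectors
labelled_idx = list(rng.choice(len(pool), 10, replace=False))  # small random starting set

for cycle in range(5):
    X, y = pool[labelled_idx], run_experiment(pool[labelled_idx])
    model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

    # Uncertainty = variance of per-tree predictions; pick the most uncertain unlabelled points.
    preds = np.stack([t.predict(pool) for t in model.estimators_])
    uncertainty = preds.var(axis=0)
    uncertainty[labelled_idx] = -np.inf          # never re-select already-labelled points
    new_idx = np.argsort(uncertainty)[-10:]      # 10 most informative candidates for the next round
    labelled_idx.extend(new_idx.tolist())
    print(f"Cycle {cycle + 1}: {len(labelled_idx)} labelled points")
```

In a real campaign, the pool would contain featurised candidate sequences and run_experiment would be a wet-lab assay, but the select-measure-retrain structure is the same.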

Although the potential applications of active learning are far reaching, in drug discovery it’s usually used in tandem with supervised ML to improve a (multi-objective) optimisation process. This is achieved by selecting molecules whose measurements make the resulting models most useful for the defined criteria, pushing designs along the Pareto frontier. Example published active learning work includes the field of antibody design (AntBO and Seo, et al.) and (more abundantly) chemical exploration (Yang, et al., Berenger, et al., Khalak, et al., Gusev, et al., Thompson, et al., and Graff, et al.).
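
The Pareto frontier itself is simple to compute once properties have been measured; a minimal sketch (assuming higher is better for every objective, with invented measurements):

```python
import numpy as np

def pareto_mask(scores: np.ndarray) -> np.ndarray:
    """Boolean mask of non-dominated rows, assuming higher is better for every column."""
    n = len(scores)
    mask = np.ones(n, dtype=bool)
    for i in range(n):
        others = np.delete(scores, i, axis=0)
        # Row i is dominated if some other row is >= on all objectives and > on at least one.
        dominated = np.any(
            np.all(others >= scores[i], axis=1) & np.any(others > scores[i], axis=1)
        )
        mask[i] = not dominated
    return mask

# Toy example: columns are (T-cell activation, tumour selectivity), higher is better.
measurements = np.array([
    [0.9, 0.2],
    [0.5, 0.8],
    [0.4, 0.4],   # dominated by the rows below and above it
    [0.7, 0.6],
])
print(pareto_mask(measurements))   # -> [ True  True False  True]
```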

At LabGenius, we have deployed active learning through Multi-Objective Bayesian Optimisation to optimise a HER2 T-cell engager (full poster). In the optimisation, we simultaneously improve T-cell activation and tumour selectivity (increased activity in cells with high HER2 density relative to cells with low HER2 density) over a 5-cycle design campaign. The average performance of the top 25 molecules improves with every cycle and, after 5 cycles, surpasses the benchmark molecule, Runimotamab, with respect to compound score (which combines normalised activation and selectivity).

Figure 8. Left: top performers in perspective. Right: top 25 improvement per campaign. The compound metric (y) is a combination of the normalised activation and selectivity of the compound, scaled to be in the range [0,1]. (Source: LabGenius.)

4. Molecular Property Predictions

This step involves predicting molecular properties of proposed therapeutics, usually from protein sequence.

Available data: protein structures (PDB), thermostability data (NbThermo), publication datasets (PPCPred), protein sequences (UniProt), internal data.

ML approach: supervised ML
As illustrated in Figure 7, there are a large number of properties we need to predict for any given sequence: efficacy, safety, manufacturability, and stability. In an ideal world, all of these properties could be predicted from the sequence alone, without requiring wet lab validation. While this is not yet the case, meaningful progress is being made along many of these axes.

When given sequence alone, models are available to predict a subset of molecular properties, though they have often been trained on suboptimal datasets. Properties with available prediction tools include: thermostability, based on annotated thermostability of bacteria in UniProt (HotProtein, TemStaPro); protein purity, based on in-lab training datasets (fDetect, PPCPred); and toxicity, based on known toxic organisms (ToxinPred2, CSM-Toxin). These predictions are likely to transfer poorly to therapeutic proteins like antibodies. In most cases, companies are best served by training similar classes of models on data they generate in-house from similar protein sequences.
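
A minimal sketch of the kind of in-house model meant here: featurise sequences (with simple amino-acid composition, a deliberately crude choice) and fit a regressor to internally measured thermostability values. The sequences and melting temperatures below are invented.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

AA = "ACDEFGHIKLMNPQRSTVWY"

def composition(seq: str) -> np.ndarray:
    """Fraction of each amino acid in the sequence (a crude but simple featurisation)."""
    return np.array([seq.count(a) / len(seq) for a in AA])

# Invented in-house dataset: sequences and their measured melting temperatures (°C).
train_seqs = ["EVQLVESGGGLVQAG", "QVQLQESGGGSVQAG", "EVQLVESGPGLVKPS", "DVQLVESGGGLVQPG"]
train_tm   = [68.2, 71.5, 64.9, 70.1]

X = np.stack([composition(s) for s in train_seqs])
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, train_tm)

candidate = "EVQLVESGGGLVQPG"
print(f"Predicted Tm for candidate: {model.predict([composition(candidate)])[0]:.1f} °C")
```

In practice the featurisation would be richer (learned embeddings, structural descriptors) and the dataset far larger, but training on in-house measurements of closely related sequences is the key point.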

A significant breakthrough has come in our ability to predict protein structure from sequence using algorithms like AlphaFold and its antibody-specific variants like IgFold, NanoBodyBuilder, ABodyBuilder, or ABlooper. Accurate structural predictions are particularly meaningful because they increase our accuracy when predicting other important properties, like thermostability (Thermometer), aggregation propensity (AggScore), and, in some early attempts, binding affinity (CSM-AB).

Figure 9. AlphaFold structural prediction schematic. (Source: Jumper, et al.)

An important caveat to the current state of these structural prediction algorithms is that modelling of more complex and/or disordered structures (e.g. CDR regions of antibodies, which are crucial for binding, and multi-valent molecules) remains uncertain. Different models may disagree with one another without it being clear which is correct, and a model may predict a completely inaccurate structure. How do we know when these predictions are good enough to enable the advancement of therapeutics? How close (in ångströms) to the known structures must our CDR predictions be for them to be useful in downstream prediction tasks? These questions are best answered through experimental validation.
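
One way such questions are quantified is by computing the RMSD between predicted and experimentally determined coordinates of the CDR backbone; a minimal sketch with invented, already-superimposed Cα coordinates:

```python
import numpy as np

def rmsd(pred: np.ndarray, ref: np.ndarray) -> float:
    """Root-mean-square deviation (Å) between matched atom coordinates (N x 3 arrays)."""
    return float(np.sqrt(np.mean(np.sum((pred - ref) ** 2, axis=1))))

# Invented Cα coordinates (Å) for a short CDR loop, assumed already superimposed.
reference = np.array([[0.0, 0.0, 0.0], [3.8, 0.1, 0.2], [7.5, 0.3, 0.1], [11.2, 0.2, 0.4]])
predicted = reference + np.random.default_rng(0).normal(0, 1.0, reference.shape)

print(f"CDR backbone RMSD: {rmsd(predicted, reference):.2f} Å")
```

The harder question, which only experiments can settle, is what RMSD threshold is actually good enough for the downstream property predictions that depend on the structure.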

Seamless Integration: Piecing Together Computational and Experimental Methods for Pipeline Success

To maintain high-quality data and minimise human error, ML methods should be fully integrated with automated experiments — this enables fast cycle times and direct feedback.

Given the capabilities of ML in drug discovery described in this post, how does a company go about effectively piecing these tools together to drive its therapeutic pipeline?

In the absence of experimental systems, integration of machine learning models with physics-based simulations has been shown to improve overall predictive performance (e.g. integrating structure, docking, and more structure). However, performance markedly increases when the molecules proposed by generative algorithms are validated with in-the-loop experimentation and reinforcement learning. Similarly, active learning approaches inherently require iterative rounds of high quality data acquisition/experimentation.

For example, goal-directed generative/de novo design can work well when protein structures (or sufficiently accurate predicted structures) are available. However, to design new molecules with properties that are not easy to compute (e.g. via complex docking simulations), we need labelled data of known binders/non-binders. Access to labelled data is usually insufficient (except for some classes of small molecules) for the most interesting use cases (e.g. identifying T-cell activating compounds), which significantly dampens expectations around the potential of ML in drug discovery. And here lies the problem.

In drug discovery, the most challenging and currently unresolved problems are the ones where there is little or no data at all: new approaches, new modalities, new targets. For example, how do we model a new binder if the target sequence is very dissimilar (beyond even the generalisability of the best ML models) from targets with known binders? In these cases, even general models often fail to provide value as they don’t generalise (ironically).

In practice, deployment of ML in drug discovery requires a tight integration of the wet lab (experimental) and dry lab (computational) processes. To develop novel drugs, computationally splitting datasets into train and test cohorts is insufficient. Instead, proposed molecules must be synthesised and experimentally tested for functional activity. High quality experimental data must be gathered that is more directly relevant to the therapeutic question than the data that’s available in the public domain. By making the design-build-test-learn process iterative, we can allow ML models to train on the most biologically relevant data and improve performance over time.

Figure 10. Design-Build-Test-Learn workflow at LabGenius. (Source: LabGenius.)
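
In code, the design-build-test-learn loop reduces to a simple iteration in which every step stands in for a substantial wet-lab or computational process; the sketch below is purely structural, and all of the function names are hypothetical.

```python
import random

def design_build_test_learn(propose, synthesise, assay, update_model, n_cycles=5):
    """Skeleton of an iterative DBTL campaign; every argument is a placeholder for a real
    generative/optimisation model, lab automation workflow, assay, or training step."""
    model, dataset = None, []
    for cycle in range(n_cycles):
        candidates = propose(model)             # Design: ML proposes molecules
        constructs = synthesise(candidates)     # Build: molecules are physically made
        results = assay(constructs)             # Test: functional data is measured
        dataset.extend(results)
        model = update_model(dataset)           # Learn: retrain on all data gathered so far
        print(f"Cycle {cycle + 1}: {len(dataset)} labelled molecules")
    return model, dataset

# Trivial stand-ins so the skeleton runs end to end.
design_build_test_learn(
    propose=lambda model: [f"molecule_{random.random():.3f}" for _ in range(8)],
    synthesise=lambda candidates: candidates,
    assay=lambda constructs: [(c, random.random()) for c in constructs],
    update_model=lambda data: {"n_examples": len(data)},
)
```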

Discussion

ML holds huge potential to transform the drug development pipeline through the creation of better, faster, and cheaper therapeutics. Current capabilities are constrained primarily by the quality of the data that powers the models available to researchers. While recent growth in ML models in the drug development space has been staggering (network analysis for target identification, generative ML for lead discovery, active learning for lead optimisation, and supervised ML for molecular property prediction), sufficiently useful data and experimental systems to power and train these models are vital. As a result, the next breakthroughs in drug discovery ML are likely to be powered by the generation of high quality, domain specific data. In part 2, we will take a deep dive into the qualities of data necessary to power an ML-driven drug discovery company.
