From Biotech to TechBio: ML-powered Drug Discovery (Part II)

Shubham Chatterjee
10 min read · May 4, 2022


A tidal wave of TechBio start-ups has emerged in the last few years, focused on transforming early-stage drug discovery. But what does the landscape look like, and what are the key success factors? Read on!

In Part I, I provided a high-level introduction to the rapidly burgeoning space of AI/ML in drug discovery. In this article, I will focus on:

  1. How ML creates value in early-stage drug discovery
  2. What the biotech landscape looks like in this space
  3. Key success factors for start-ups to win in this nascent area

Note that I will remain at a more strategic level in my analysis. For a more technical review of how different branches of ML specifically apply to drug discovery techniques, please consult these excellent reviews by Greener, Vamathevan, and Angermueller.

ML expands horizons and deepens efficiency in early-stage drug discovery

The promise of applying machine learning in R&D lies in not only enhancing R&D productivity (thereby breaking Eroom’s Law), but also pushing our understanding of biology and disease states in ways previously unimaginable.

Predicting biology in silico: Given the high cost of many wet-lab assays, ML models offer the opportunity to predict biological interactions computationally without running such experiments. Such predictions are especially useful in scenarios where experimental data is lacking (e.g., a specific target-ligand structure is tricky to crystallize). By computationally generating potential therapeutic candidates, ML models also provide a route of experimental discovery that circumvents the serendipity characterizing typical HTS approaches. Atomwise’s AtomNet remains an exemplar in virtually screening billions of molecules against a target by predicting spatial conformations and affinities.
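To make the idea of virtual screening concrete, here is a minimal sketch in Python. It is not how AtomNet or any real platform works: `toy_affinity_score` is a hypothetical stand-in for a trained affinity model, and the five-molecule library is made up for illustration. The only point is the shape of the workflow: score every candidate computationally, then send only the top-ranked few to the wet lab.

```python
import heapq
from typing import Callable, Iterable, List, Tuple


def toy_affinity_score(smiles: str) -> float:
    """Hypothetical stand-in for a trained binding-affinity model.

    NOT real chemistry: it rewards aromatic carbons ('c') and nitrogen
    atoms ('N') in a SMILES string and penalizes string length, purely
    so the example is deterministic and reproducible.
    """
    return 0.5 * smiles.count("c") + 1.0 * smiles.count("N") - 0.1 * len(smiles)


def virtual_screen(
    library: Iterable[str],
    score: Callable[[str], float],
    top_k: int,
) -> List[Tuple[float, str]]:
    """Score every molecule in the library and keep the top_k highest."""
    scored = ((score(smiles), smiles) for smiles in library)
    return heapq.nlargest(top_k, scored)


# Illustrative library of SMILES strings (assumed, not from any dataset).
library = ["CCO", "c1ccccc1", "CC(=O)N", "c1ccncc1N", "CCCCCCCC"]
hits = virtual_screen(library, toy_affinity_score, top_k=2)

for predicted_score, smiles in hits:
    print(f"{smiles}: predicted score {predicted_score:.2f}")
```

Because `heapq.nlargest` consumes a generator, the same pattern extends to libraries far too large to hold in memory at once, which is where the scale advantage of in silico screening comes from.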

Revealing novel biology: ML models can also push our understanding of biology. In silico predictions of spatial structure enable discovery of novel, allosteric epitopes against difficult-to-drug targets. Frontier Medicines, for example, seeks to reveal transient binding sites on targets using chemoproteomic techniques to screen targets in cellular environments and identify temporary structural pockets; its lead program directly inhibits both active and inactive forms of KRAS G12C by capturing a transient, shared binding pocket between the two. Alternatively, AI/ML models can interrogate biological systems to reveal novel targets and multifactorial biological relationships driving disease pathophysiology (e.g., Recursion’s PhenoMap highlights multi-gene interactions and clustering in different disease states).

Improving R&D productivity: Simply put, the ultimate promise of AI/ML in drug discovery is to make R&D better, faster, and cheaper.

  • Better: By predicting compound properties in silico prior to wet lab synthesis, ML models seek to generate candidates with a higher probability of success. The key here is scale. Massively multiplexed simulations of compounds and searches over an enormous solution space of molecularly diverse candidates enable a systematic, deterministic approach to drug discovery.
  • Faster: Candidates with a higher probability of success should lead to accelerated commercialization, as ML applications in R&D seek to fail earlier and faster. Whether it be sorting more quickly through high vs. low potential candidates, generating thousands of candidate variants instantaneously, or iteratively improving lead optimization via experimental validation of predictions, ML models aim to shorten time to market.
  • Cheaper: As mentioned before, in silico simulations reduce reliance on costly experimental assays. More critically, however, identifying failures earlier in the development cycle (when costs are lower) and eliminating experimental variability have the potential to drastically cut R&D expenses. Moreover, generalizable learnings across pipeline programs continuously improve the broader platform’s accuracy, speed, and sophistication.
Source: Recursion Pharmaceuticals

Nota Bene: What ML can vs. cannot do (as of yet)

Before becoming enamored with the power of AI/ML in biology, it’s worth noting that many in the biotech industry understandably feel that computational models thus far are not yet sophisticated enough to truly disrupt drug discovery. In analyzing such views, it is worth remembering that AI/ML are simply tools, not silver bullets, with both strengths and limitations.

Strengths: Machine learning, at its core, is highly skilled at pattern-recognition-based prediction and at recognizing emergent phenomena without understanding all of the individual components or rules. This becomes particularly powerful in complex biological systems, in which each element can be difficult to define and track. Unsupervised deep learning approaches are also adept at rapidly processing vast quantities of complex data, often identifying important yet overlooked data elements (i.e., removing bias). As discussed, the underlying computational power also expands the breadth of biological analysis (e.g., searching over a vast solution space) to identify opportunities to go beyond canonical biology (e.g., de novo protein structures). Finally, as mentioned, ML models can iteratively improve over time with more data, creating scale efficiencies.

Limitations: Thus far (and likely for the next decade or more), AI/ML cannot capture, represent, and perturb a biological network in silico; models are simply not sophisticated enough to reconstitute a multiplexed biological system. This underlines the need for continued in vitro and in vivo validation of ML predictions, and for the integration of wet lab and computational capabilities. Furthermore, it is challenging for ML models to characterize phenomena from a first-principles approach (i.e., ‘what makes a drug a drug’), and their underlying calculations/methodology can remain somewhat of a black box: this becomes particularly tricky when evaluating potential therapeutic candidates and how the model arrived at them. The strength of these models is also strongly determined by the quantity and quality of the biological data used to train the initial model, a key barrier to scaling for TechBio start-ups that may struggle to acquire or generate such training data (e.g., access to patient tissue samples, engineered cell lines, etc.).

Ultimately, ML models are tools that can create tremendous value, if deployed against a clear value proposition, but they do not guarantee drug-like assets unless there is an intentional approach to therapeutic development.

Source: xkcd

Market landscape of leading AI/ML biotechs in early-stage drug discovery

By all accounts, both the capital inflows and number of AI/ML biotechs have ballooned in the last decade, particularly the last few frothy VC years. Pitchbook estimates 200+ companies in this space, while some reports estimate nearly $14B invested in AI-based drug development in 2020 alone.

Yet within the frame of early-stage drug discovery, such TechBio upstarts can be dimensionalized across two main approaches.

Biosystems-focused target discovery: These biotechs are typically focused on developing synthetic biological environments and perturbing them (e.g., CRISPR screen, candidate testing) at scale to uncover novel targets driving disease biology. Here, computational models are applied to massive data sets generated from multi-parametric and multiplex experimentation. As AI/ML models are not yet sophisticated enough to recapitulate biological systems, such start-ups rely on cellular and tissue-based experiments to test disease/drug perturbations (e.g., Phenomic AI, Valo Health). Deconvoluting biology to elucidate target space, however, does not always translate readily to therapeutic assets.

Computationally driven binder design: In this space, the focus is on designing synthetic binders and predicting binder-target molecular docking in silico at unprecedented levels of accuracy. Most computational models still require a starting ligand to work from (e.g., Nimbus Therapeutics), though compute power needs can rapidly climb when searching vast chemical spaces or combinatorially optimizing multiple properties beyond affinity (e.g., Exscientia). As this approach does not capture the broader impact of the binding event on a biological system, biotechs typically pursue well-characterized targets with low translational risk to de-risk biology and prove the viability of their computational models (e.g., Relay Therapeutics’ FGFR2 inhibitor). Despite this computational sophistication, however, wet lab validation of in silico binding predictions will likely always remain necessary.

Like any market map, the above does not represent an exhaustive list, and several biotechs could rightfully argue that their logo cannot be constrained to a single spot — the placement of which I tried to determine based on where AI/ML was generating the most value in that biotech’s respective drug development process.

However, in scanning the landscape, I found a few high level categories in which players seemed to group:

Target identification

  • Multi-omic data crunchers: Gather and analyze multimodal data to elucidate novel drivers of disease & key mechanisms of action, often focusing on network relationships (e.g., gene interactions, multi-target connections).
  • Cell phenome analyzers: Apply ML models to high-resolution microscopy to investigate cellular phenotypes in healthy vs. diseased states, allowing evaluation of cellular perturbation with potential candidates.

Drug discovery

  • Generative chemical designers: Simulate molecular docking of tons of candidate ‘variants’ to identify new binders to targets, while simultaneously optimizing for parallel therapeutic properties (e.g., ADMET, solubility, etc.).
  • Allosteric hotspot finders: Uncover novel epitopes on previously ‘undruggable’ targets to non-competitively alter target behavior, often analyzing/simulating molecular dynamics to reveal new druggable pockets.
  • Smart screeners: Advance typical high-throughput screening methods by applying ML models to generate ‘smarter’ starting libraries via improved integration between in silico & wet lab capabilities.
  • De novo biology generators: Generate protein molecules with desired therapeutic properties via a first-principles approach to structure-based design, without a core focus on screening campaigns.
  • Quantum-enabled drug discoverers: Leverage quantum mechanics to enhance molecular docking accuracy and accelerate the search over enormous solution spaces, by reframing binding events as energy minimization problems.
  • Novel modality applications: Apply ML models to new biotech modalities (e.g., cell therapy, gene editing) to design more optimized therapeutics and manufacturing processes.
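
As a rough illustration of what "simultaneously optimizing for parallel therapeutic properties" can mean in practice for the generative chemical designers above, the sketch below filters candidates down to a Pareto front: the set of molecules that no other candidate beats on every property at once. This is one common multi-objective technique, not a description of any particular company's method. The `Candidate` class, the property names, and all scores are hypothetical, and every property is assumed to be scaled so that higher is better.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Candidate:
    """Hypothetical candidate with model-predicted properties (higher is better)."""
    name: str
    affinity: float
    solubility: float
    metabolic_stability: float


PROPERTIES = ("affinity", "solubility", "metabolic_stability")


def dominates(a: Candidate, b: Candidate, props: Sequence[str] = PROPERTIES) -> bool:
    """True if a is at least as good as b on every property and strictly better on one."""
    at_least_as_good = all(getattr(a, p) >= getattr(b, p) for p in props)
    strictly_better = any(getattr(a, p) > getattr(b, p) for p in props)
    return at_least_as_good and strictly_better


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep only the candidates that no other candidate dominates."""
    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates)
    ]


# Made-up predicted scores, for illustration only.
candidates = [
    Candidate("A", affinity=0.9, solubility=0.2, metabolic_stability=0.5),
    Candidate("B", affinity=0.7, solubility=0.6, metabolic_stability=0.6),
    Candidate("C", affinity=0.6, solubility=0.5, metabolic_stability=0.5),
    Candidate("D", affinity=0.3, solubility=0.9, metabolic_stability=0.4),
    Candidate("E", affinity=0.9, solubility=0.2, metabolic_stability=0.4),
]

front = pareto_front(candidates)
print([c.name for c in front])
```

Here C drops out because B is better on all three properties, and E drops out because A matches it on affinity and solubility while beating it on stability. The remaining candidates represent genuine trade-offs that a medicinal chemistry team would still have to weigh.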

What will it take to win?

With such a broad and burgeoning landscape at the intersection of AI/ML and biotech, who will win? And what key success factors can we glean?

Based on my evaluation, I have identified four key criteria that characterize the leading biotechs in this space:

Clear AI/ML Value Proposition: Are its AI/ML capabilities directed towards solving the right drug-discovery problem?

  • What to look for: Clear technology-market fit should demonstrate how the bioplatform uniquely solves an R&D challenge, and how that challenge is worth solving (e.g., lead optimization in silico vs. incrementally better ligand-target simulation). Furthermore, the outcome of such ML applications should have clear drug development benefits and a clear path to creating meaningful patient impact (e.g., not just generating massive data, but translating it into developable assets).
  • Why it matters: AI/ML, while powerful, are still tools that must be applied to the right problem, to differentiate in an increasingly crowded space. Many start-ups, particularly those from an academic lab, seek to commercialize a discovery rather than start by addressing a high-impact R&D challenge.

‘Design-Build-Test-Learn’ flywheel: How well-integrated are in-silico & wet lab capabilities? How does the platform continuously improve?

  • What to look for: DBTL is considered table stakes within the space: experimental outputs must feed into an ML architecture, which in turn makes predictions that require validation by assay, driving end-to-end iterative discovery. Furthermore, the underlying data infrastructure should focus on relatability of inputs/outputs. [Here, relatability refers to connecting each experimental data point (e.g., controls, parameters, multi-factorial analyte relationships) in a way that is consumable for ML models to analyze, like the Recursion OS.]
  • Why it matters: The flywheel enables continuous platform development (e.g., in silico predictions systematically improve with in vitro / in vivo validation) and de-risks biology, which becomes particularly critical as AI/ML is not yet sophisticated enough to recapitulate biological complexity.
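
The Design-Build-Test-Learn loop described above can be sketched as a toy program. Everything here is an assumption made for illustration: `wet_lab_assay` is a hypothetical stand-in for a real experiment (with a hidden optimum at design 7), the "model" is a deliberately crude nearest-neighbor lookup, and designs are just integers from 0 to 10. The sketch is greedy (it always tests what the model currently likes best), whereas a real platform would also balance exploration.

```python
from typing import Dict, List


def wet_lab_assay(design: int) -> int:
    """Hypothetical experiment: measured activity peaks at design 7."""
    return 100 - (design - 7) ** 2


def predict(design: int, measured: Dict[int, int]) -> int:
    """Toy model: predict the activity of the nearest already-tested design."""
    nearest = min(measured, key=lambda tested: abs(tested - design))
    return measured[nearest]


def run_dbtl(
    candidates: List[int],
    seed_designs: List[int],
    cycles: int,
    batch_size: int,
) -> Dict[int, int]:
    """Run a few DBTL cycles and return every (design -> measured activity)."""
    # Initial training data: the experiments run before any model exists.
    measured = {d: wet_lab_assay(d) for d in seed_designs}
    for _ in range(cycles):
        untested = [d for d in candidates if d not in measured]
        if not untested:
            break
        # Design: rank untested designs by the model's current prediction.
        ranked = sorted(untested, key=lambda d: predict(d, measured), reverse=True)
        batch = ranked[:batch_size]
        # Build + Test: run the (simulated) assay on the selected batch.
        for design in batch:
            measured[design] = wet_lab_assay(design)
        # Learn: the new measurements are now part of the model's data,
        # so the next cycle's predictions are better informed.
    return measured


results = run_dbtl(candidates=list(range(11)), seed_designs=[0, 10], cycles=2, batch_size=2)
best_design = max(results, key=results.get)
print(f"Tested {len(results)} of 11 designs; best so far: {best_design}")
```

In this toy run the loop finds the optimum after testing roughly half of the design space, because each round of assay results sharpens the next round of predictions. That feedback, rather than the model alone, is what the flywheel refers to.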

Access to sufficient and relevant data: Are there enough relevant data available to train computational models effectively?

  • What to look for: Examine the biotech’s contractual partnerships with academia and industry and/or internal generation of proprietary experimental data. They must have access to the right (e.g., physiologically relevant) data, access to it at scale, and access to it early enough to train their ML models initially.
  • Why it matters: Initial data is critical to develop the right ‘starting point’ for target & drug discovery, and continued access will be key to accelerate the speed of computational platform development. Internally generating the right kind of data at scale also enhances the biotech’s defensibility over their platform.

Promising, multi-disciplinary leadership team: Does the team hold expertise in both computational biology and traditional drug development?

  • What to look for: Given the nascency yet competitive density of the space, ideal teams hold both significant drug and clinical development experience as well as unique expertise over their branch of computational biology. Leading players in this space specialize in combining a vast array of diverse disciplines — medicinal chemistry, gene editing, structural biology, molecular dynamics simulations — as they seek to unlock disease biology through novel therapeutics.
  • Why it matters: The taxing requirements for multidisciplinary expertise and capabilities can lead to potential recruiting challenges and significant cash needs in such a crowded market. The ‘secret sauce’ is knowing how to exploit an ML-powered platform to overcome traditional drug development barriers, and then translate that competitive advantage into commercializable assets.

What’s next?

It’s clear we are in the early stages of computational biology, and earlier still in AI/ML applications to drug discovery. As we embark on the next Century of Biology, it will be incredibly exciting to see how AI/ML revolutionizes R&D, drives the faster creation of better medicines, and transforms human health.

[Disclaimer: The views above represent my own, and not those of my current or previous employers. They reflect my understanding of the space, but may not be the latest, most comprehensive coverage of all companies, scientific advances, or clinical results.]


Shubham Chatterjee

Wharton MS/MBA Candidate. Biotech stories @ LifeSci Beat Podcast. Passionate about next-gen biotech commercialization