From Biotech to TechBio: ML-powered Drug Discovery (Part I)

Shubham Chatterjee
9 min read · May 4, 2022


A Cambrian explosion of TechBio start-ups has swept the market, with massive valuations emerging from its ML applications to drug discovery. What’s going on, and why now? Read on!

Source: F-Prime 1000

The last decade has witnessed extraordinary advances in the biotechnology sector, from plummeting genetic sequencing costs to strides in synthetic biology and gene editing, to, most recently, mRNA vaccines reaching billions. Yet such innovation has come at a cost. Despite today’s biotech bear market, recent years saw remarkable biotech investment, including nearly $24B in 2020 Nasdaq biotech IPOs and $28B in 2021 global biotech VC investments. This funding, however, cloaks a grim reality of declining biopharma productivity despite increases in R&D investment — a trend that could be broken, and even reversed, by novel AI/ML applications to drug development.

In this article, I will cover:

  1. Biopharma’s R&D challenges, and how AI/ML could reverse the trend
  2. Why we are now at a point to transform drug discovery using ML
  3. What ML applications look like across the drug development value chain, and the value potential if we get it right

For readers already acquainted with this space, Part II focuses on ML applications in early-stage drug discovery, the landscape today, and key success factors.

Eroom’s Law: Declining R&D productivity plaguing biopharma

The average R&D cost to bring a drug to market has risen to $2.6B, while fewer drugs win commercial approval and returns on each R&D dollar spent have fallen to roughly 2%. The decline has been severe enough that analysts have labeled it “Eroom’s Law” — Moore’s Law in reverse — the steady fall in the number of new drugs approved per dollar of R&D invested.
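
For readers who like a formula, the empirical framing popularized by Scannell and colleagues (approximated here, not a quote from any source in this article) is that the number of new drugs approved per billion dollars of inflation-adjusted R&D spending has halved roughly every nine years:

```latex
% Approximate, empirical form of Eroom's Law (a rough fit, not an exact law):
% N(t): new drugs approved per $1B of inflation-adjusted R&D spend, t years on.
N(t) \approx N_0 \cdot 2^{-t/9}
```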

Source: Nature

What’s driving such a steady decline? Forces inherent both to the nature of biology itself and to the conventional drug development process. Biology is a complex, networked phenomenon with millions of interconnected relationships driving any single phenotype. Developing models to reconstitute and study biology, synthetically or otherwise, has been incredibly challenging, which in turn has made the biological effects of newly discovered compounds extremely difficult to predict. Such complexity has been compounded by the rising bar for discovery and approval. With most ‘low-hanging fruit’ therapies already discovered, scientists have been pushed to develop therapies that go above and beyond the efficacy and safety profiles of existing standards of care.

Yet the conventional drug discovery process has also exacerbated the R&D productivity decline. Traditional ‘trial and error’ approaches rely on high-throughput screening (HTS) — experimental screens of a static library of potential compounds against a drug target to identify ‘hits’ — which depend on serendipity to discover a new therapeutic candidate rather than on a systematic, deterministic path to discovery. Such brute-force processes do not necessarily improve with more data or larger compound libraries, and are often limited by biophysical constraints on library size. Perhaps most importantly, as this excellent BVP review explains, HTS methods simply identify drug-target binding affinity, and cannot capture the multiplexed, unpredictable repercussions of such binding within a disease biology network; for example, whether a drug-target interaction is influenced by endogenous regulatory factors or knock-on pathways that only exist within cell-based systems. This is a key reason why promising preclinical candidates so often fail in clinical trials once tested in human subjects.

The concomitant effect of such approaches is costly, slow, and failure-prone drug discovery.

Reversing Eroom’s Law — How computational biology can upend R&D

I believe we are now at an inflection point, entering a new era where we can engineer biology and move from serendipitous discovery to deterministic design. As a16z’s Vijay Pande argues, we have developed sufficient sophistication of biological understanding to apply engineering principles to it: in the way we construct drugs, design experiments, and measure outcomes.

A key factor enabling this inflection point is our own understanding of biology and its inherent complexity, which has evolved not unlike learning a new language: first we discovered key biological principles like the Central Dogma (learning grammar), then we learned to interrogate biology to uncover new pathways (forming sentences), and now we can engineer biology beyond its canonical purpose towards novel outcomes like CAR-T (crafting ideas). Specifically, computation should give R&D a new set of characteristics:

  • Repeatable and reproducible: Overcoming typical variation in experimental results, computation will allow us to parametrize, automate, and standardize experimental designs to achieve greater consistency in outputs
  • Greater control over outcomes: Examples such as CAR-T or base editing have demonstrated our ability to develop therapeutics that improve upon nature and drive towards intentional outcomes. Computation allows us to not just discover a therapeutic by chance (e.g., HTS), but to systematically determine an optimal candidate to perturb a biological process through an iterative design-build-test-learn cycle (a minimal version is sketched in code after this list)
  • Continuous improvement: The rise of bioplatforms (most famously, Moderna) has driven an iterative R&D process in which each experiment, data input, and new candidate will ‘teach’ the platform to get ‘smarter’ over time, improving the probability of technical success
  • Efficiencies with scale: The confluence of bioplatforms and reproducible, deterministic discovery is shifting drug development towards an industrial revolution in R&D. This means each experimental outcome, biomarker identification, data input, and candidate design makes the platform not only better at the next iteration, but faster and cheaper as well, eliminating potential failures early.
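
To make the design-build-test-learn idea concrete, here is a minimal, purely illustrative sketch of such a loop in Python. The `run_assay` function and `SurrogateModel` class are invented stand-ins for a wet-lab experiment and a predictive model; real platforms use far more sophisticated models and actual experiments.

```python
# A minimal, purely illustrative design-build-test-learn (DBTL) loop.
import random

random.seed(0)

def run_assay(candidate):
    # Stand-in for a wet-lab measurement: a noisy function of the candidate's features.
    return sum(candidate) + random.gauss(0, 0.1)

class SurrogateModel:
    """Toy surrogate: predicts activity as a weighted sum of candidate features."""
    def __init__(self, n_features):
        self.weights = [0.0] * n_features

    def fit(self, data):
        # Crude stand-in for learning: weight each feature by its average
        # product with the measured activity (not a real regression).
        for i in range(len(self.weights)):
            self.weights[i] = sum(y * x[i] for x, y in data) / len(data)

    def predict(self, candidate):
        return sum(w * x for w, x in zip(self.weights, candidate))

def dbtl_loop(library, rounds=3, batch_size=5):
    model = SurrogateModel(n_features=len(library[0]))
    data = []
    for _ in range(rounds):
        # Design: rank untested candidates by predicted activity.
        ranked = sorted(library, key=model.predict, reverse=True)
        batch, library = ranked[:batch_size], ranked[batch_size:]
        # Build + Test: measure the chosen batch experimentally (stubbed here).
        data.extend((c, run_assay(c)) for c in batch)
        # Learn: retrain the surrogate on all data accumulated so far.
        model.fit(data)
    return max(data, key=lambda pair: pair[1])  # best candidate found so far

library = [[random.random() for _ in range(4)] for _ in range(50)]
best_candidate, best_activity = dbtl_loop(library)
print(best_candidate, round(best_activity, 2))
```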

“We can now apply the language of AI to biology to truly engineer new biological systems. It’s paradigm-shifting — moving from empirical discovery process to one where we can control the outcome we want.” — Molly Gibson, Co-founder of Generate Biomedicines

Why now

A convergence of scientific advancements is unlocking computationally driven drug discovery.

Improved data processing amid a data explosion: The explosion of new types of biological data (e.g., genomics, transcriptomics, proteomics, high-content imaging) has created novel ML training sets based on more physiologically relevant data, data sets that are now also being structured specifically for ML consumption.

  • ML models can now be trained on more human-centric data: new data types like cell morphology, gene heat maps, and temporal data (e.g., patient biomarkers over time) mean ML models can better predict human biology in silico. Valo Health’s Opal Computational Platform, for example, draws on massive human-centric data banks to identify new targets by linking genotype-phenotype-biomarker relationships at the patient-cohort level.
  • The challenge with applying biological data to ML, however, is its variety and multidimensionality: flow cytometry or LC-MS readouts can’t readily be analyzed by neural networks in their raw form. As such, advances in data cleaning and processing (e.g., layered information, more labels, more descriptive tags) have improved the readiness of data to be consumed by ML models (a toy featurization example follows this list). Start-ups like DeepCell apply neural networks to novel data such as cell morphology to sort cells and identify biomarkers for diagnosis.
  • Finally, the enhanced IT infrastructure of TechBio start-ups cyclically feeds experimental outputs (across data types) back in as training inputs for predictive ML models. Recursion OS is a leading example of an infrastructure layer integrating outputs from experimental hardware, massive biological and chemical data sets, and the underlying ML software.
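
As a toy illustration of what “structuring data for ML consumption” can mean, the sketch below collapses a variable-length per-cell readout (think flow cytometry events) into a fixed-length feature vector. The channel count, summary statistics, and synthetic data are all assumptions chosen for illustration, not any company’s actual pipeline.

```python
# Toy sketch of structuring a raw, variable-length biological readout for ML.
import numpy as np

def featurize_sample(events: np.ndarray) -> np.ndarray:
    """events: (n_cells, n_channels) matrix of per-cell marker intensities."""
    # Log-transform to tame heavy-tailed intensity distributions.
    logged = np.log1p(np.clip(events, 0, None))
    # Summarize each channel with a few robust statistics so that samples
    # with different cell counts map to the same feature length.
    stats = [
        logged.mean(axis=0),
        logged.std(axis=0),
        np.percentile(logged, 10, axis=0),
        np.percentile(logged, 90, axis=0),
    ]
    return np.concatenate(stats)

# Two samples with different cell counts yield same-length feature vectors.
sample_a = np.random.default_rng(0).lognormal(size=(1200, 8))
sample_b = np.random.default_rng(1).lognormal(size=(800, 8))
X = np.stack([featurize_sample(sample_a), featurize_sample(sample_b)])
print(X.shape)  # (2, 32): ready for a standard classifier or neural network
```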

More sophisticated biological understanding: The rise of new ‘-omics’ data, combined with better experimental tools, has enabled an enriched view of biological networks and the multifactorial drivers of disease (e.g., relationships between multiple genetic variants).

  • Growing collections of ‘-omics’ data (genomics, transcriptomics, proteomics) have been paired with novel tools like CRISPR to elucidate biological functions, improve target identification, and engineer novel activity. For example, Insitro’s discovery process relies on engineered iPSC lines collected at a patient-cohort level — i.e., taking patient samples and de-differentiating them into iPSCs while retaining the original patient genetics — which allows them to develop improved disease models on which to train their ML algorithms. As such, their in silico target identifications are specific to defined genetic cohorts, enabling a precision medicine approach (a toy version of cohort-level target nomination is sketched after this list).
  • The explosion of data has been combined with improved experimental tools and multiplexed experimental design, creating novel analytes (e.g., high content imaging, phenotypic screens of genetic perturbations), with tandem biological measurements at scale (e.g., sequencing DNA in parallel, detecting multiple metabolites simultaneously). Octant Bio assesses the impact of small molecules on multiple receptors (GPCRs) and disease pathways at once to treat multifactorial conditions like neurodegeneration.
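
As a purely illustrative toy (not any company’s actual method), the sketch below shows the flavor of cohort-level target nomination: rank genes by how strongly their expression separates disease-model samples from controls. The gene labels and data are synthetic, and real pipelines use proper differential-expression statistics and far more biology.

```python
# Toy cohort-level target nomination: rank genes by disease-vs-control effect size.
import numpy as np

rng = np.random.default_rng(42)
genes = [f"GENE_{i}" for i in range(100)]                  # hypothetical labels
control = rng.normal(loc=5.0, scale=1.0, size=(20, 100))   # 20 control samples
disease = rng.normal(loc=5.0, scale=1.0, size=(20, 100))   # 20 disease-model samples
disease[:, :5] += 2.0                                      # plant a signal in 5 genes

# Effect size per gene: difference in means scaled by the pooled spread.
effect = (disease.mean(axis=0) - control.mean(axis=0)) / (
    np.sqrt(disease.var(axis=0) + control.var(axis=0)) + 1e-9
)
top = np.argsort(-np.abs(effect))[:5]
print([genes[i] for i in top])  # candidate targets to validate experimentally
```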

Enhanced algorithms, models, and compute power: Computational approaches allow for the exploration of vast chemical and biological spaces for potential therapeutic compounds, and can better characterize and predict drug candidate structure and binding affinity.

  • Recent advances in predicting molecular interactions and spatial conformations in silico (e.g., AlphaFold 2, RoseTTAFold) have transformed ML simulations of drug-target affinity. A leader in this space, Relay Therapeutics, uses a supercomputer to drive its protein-ligand molecular dynamics simulations, ultimately identifying novel allosteric binding pockets on targets for its small-molecule candidates.
  • Greater compute power allows start-ups to navigate massive chemical search spaces. In this framing, drug discovery can be viewed as a multi-dimensional landscape of peaks and valleys, with each point a potential drug candidate scored on its therapeutic properties. XtalPi uses quantum-mechanical simulations to explore this vast search space for optimal candidates, generating tens of millions of drug-like molecules for subsequent screening.
  • Beyond target affinity, ML models are also beginning to simulate drug-like properties (ADMET, PK/PD) to accelerate lead optimization after hit identification (a simple drug-likeness filter is sketched just below this list).
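
As a small, hedged example of what “simulating drug-like properties” looks like at its simplest, the sketch below applies Lipinski-style rules and RDKit’s QED score to a few known molecules, assuming RDKit is installed. Real ADMET/PK-PD prediction relies on far richer learned models; this is only a coarse filter to show the idea.

```python
# A coarse, illustrative drug-likeness filter using RDKit.
from rdkit import Chem
from rdkit.Chem import Descriptors, QED

candidate_smiles = [
    "CC(=O)Oc1ccccc1C(=O)O",      # aspirin
    "CN1CCC[C@H]1c1cccnc1",       # nicotine
    "CCCCCCCCCCCCCCCCCC(=O)O",    # stearic acid (likely fails the logP rule)
]

def passes_lipinski(mol) -> bool:
    # Lipinski's rule of five: a rough proxy for oral drug-likeness.
    return (
        Descriptors.MolWt(mol) <= 500
        and Descriptors.MolLogP(mol) <= 5
        and Descriptors.NumHDonors(mol) <= 5
        and Descriptors.NumHAcceptors(mol) <= 10
    )

for smi in candidate_smiles:
    mol = Chem.MolFromSmiles(smi)
    if mol is None:
        continue  # skip unparsable structures
    print(smi, passes_lipinski(mol), round(QED.qed(mol), 2))
```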

AI/ML across the drug development value chain

From target discovery to final implementation, the drug development value chain can be split into several key steps: target identification, drug discovery, preclinical development, clinical trials, and finally diagnostics and administration. Along each of these steps, AI/ML has begun to create meaningful value:

Summary table of ML value creation across drug development (non-exhaustive; certain full-stack biotechs may play in multiple steps along the value chain)

Size of the prize: Tremendous potential value from TechBio in the future

What’s the opportunity if we get this right? Unlike other burgeoning spaces in biotech, the economic value potential of applying ML to drug discovery is nearly impossible to quantify precisely, yet clearly enormous. The question is not whether this technology revolution will deliver value, nor even how much; the question is when, and that depends on how quickly the technology matures. Skeptics are waiting for the proof in the pudding — that is, the actual clinical success of FDA-approved, ML-designed therapeutics.

The good news is that clinical validation of ML-powered discovery is growing. Over the last decade, 15+ assets discovered via AI-driven processes have entered clinical development, with 150+ more in preclinical programs. With these TechBio start-ups’ pipelines growing at roughly 36% annually, some analysts estimate that drug development timelines (from discovery to IND submission) could shrink from the industry average of about 7 years to closer to 2–3 years. The most memorable recent demonstration of such rapid R&D remains Insilico Medicine’s ISM001 program for idiopathic pulmonary fibrosis, claimed to have been discovered and developed in 18 months and now in clinical trials. Big Pharma also seems to have recognized this potential, evidenced by the numerous ongoing collaborations between established pharma and nascent TechBio players.

How do such technology applications translate to economic value?

  • Streamlined time to market: E.g., FDA approval two years earlier extends the peak-sales window before loss of exclusivity (a back-of-the-envelope sketch follows this list)
  • Increased R&D productivity: More shots on goal improve the probability of technical success
  • Market expansion: Computational applications expand the overall market by pursuing novel targets and drugging the previously undruggable
  • Reduced R&D costs: Fewer clinical trial failures and streamlined development timelines should rein in ballooning pharma R&D budgets.
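
To illustrate the first point with a back-of-the-envelope calculation, the sketch below compares the discounted revenue of a drug that reaches market two years earlier against a fixed loss-of-exclusivity date. All numbers (peak sales, discount rate, years on market) are hypothetical and chosen purely for illustration.

```python
# Back-of-the-envelope sketch: earlier approval against a fixed loss-of-exclusivity
# date extends the on-patent revenue window from 10 to 12 years (hypothetical numbers).
def discounted_revenue(annual_sales_m, years_on_market, discount_rate=0.10):
    # Sum of discounted annual sales over the on-patent window.
    return sum(
        annual_sales_m / (1 + discount_rate) ** t
        for t in range(1, years_on_market + 1)
    )

annual_peak_sales_m = 1_000  # hypothetical: $1B in peak annual sales ($M)
baseline = discounted_revenue(annual_peak_sales_m, years_on_market=10)
accelerated = discounted_revenue(annual_peak_sales_m, years_on_market=12)
print(f"Incremental discounted revenue: ~${accelerated - baseline:,.0f}M")
```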

But how does AI/ML actually create value in early-stage drug discovery? What does that landscape look like? And what should you look for in evaluating companies in this space? Read on to part II!

[Disclaimer: The views above represent my own, and not my current or previous employers. They reflect my understanding of the space, but may not be the latest, most comprehensive coverage of all companies, scientific advances, or clinical results.]

