An Introduction to AI in Drug Discovery

Amee Kapadia
Cantos Ventures
11 min read · Mar 8, 2022

Perhaps the most obvious example of platform TechBio is AI-enabled drug discovery (AIDD), a sector that experienced a breakthrough year in funding, with some $4.1B of capital committed in 2021 alone. As the name suggests, AIDD is the application of computational tools to the drug discovery and development process, with the hope of reducing the time and capital it takes to see a drug from research to clinic. With AIDD, we hope to move from drug discovery to deliberate drug design. To fully understand its potential, it’s worth walking through the process of drug discovery.

Drug Development Process. Source
Drug Discovery Process — step one of the drug development process. Source

Drug discovery is all about finding 1) a biological target and 2) a drug that acts on the target. Taking a step back, a target is any cellular structure involved in controlling disease biology. It could be a protein, enzyme, hormone, string of nucleic acids, or other biological structure. With that in mind, it becomes apparent that there are millions of possible targets for any given disease, and another million possible drugs that can bind or otherwise interact with each target to cause a desired therapeutic effect. That’s a lot of combinations to run through using brute-force wet-lab experimentation.

Four major drug targets. Source

Early-stage drug discovery is a lot like fitting a key to a lock. We first need to find the right lock (target), and then we need to find the right key (drug) that will open that door. However, before doing any of this we need to choose what room (disease/indication) we want to open (or lock shut) in the first place. Since nothing in biology is without layers of complexity, some indications are much harder to target while others don’t have obviously druggable targets. At first glance, many doors seem locked from the inside.

There are two approaches to pharmacology: classical and reverse. With classical pharmacology, you know what the diseased and healthy states look like, so you screen for compounds that cause the desired phenotypic change and then identify the target responsible, in what is traditionally thought of as target identification. In reverse pharmacology, you have an idea of what molecules/targets are involved in the disease and want to inhibit or activate that target in some way, so you screen for compounds that bind the target. Here, we typically skip over target identification and go straight to experimental validation and hit generation. Regardless, the process is similar, given that the end goal is to find a target to modulate and a relevant drug candidate that does so.

Simplified Drug Discovery Process

1. Target Identification and Validation

The goal of target identification is to find a few proposed biological sites or molecules with which potential drugs may interact to alter disease activity.

Traditionally, targets were discovered through experiments that characterize specific disease pathways. These studies are heavily biochemical in nature and involve characterizing the relevance of the target in the big picture of the disease. A relevant question to ask is: does inhibiting or activating this target cause a favorable change in disease activity?

However, even if you find a promising target that affects disease activity, it might not be the best one for the job. The problem is, you wouldn’t know there is a better target out there because you only get what you screen for. This is where machine learning can help, since computational screens wouldn’t be limited by experimental capacity.

Methods for finding potential drug targets. Source

After a target(s) is identified, the focus turns to validation. Whereas target identification is concerned with finding a molecular element that is involved with a disease of interest, validation has to do with making sure that modulating the target can have a desired effect on the disease. Targets can be validated through various biochemical experiments including creating gene knockouts, measuring protein interactions, and evaluating binding and/or kinetics.

2. Hit Identification and Lead Generation

Once a target has been validated, the next step is to find “hits”, or molecules that favorably interact with the target(s): drug candidates! Depending on the type of drug desired, this can range from massive small-molecule library screens against the target to deliberately engineering molecules (often proteins) that bind to the target. AIDD companies especially are starting to develop a preference for deliberate drug design by anticipating molecular properties and structural constraints. Some companies are also screening against panels of already FDA-approved drugs to uncover new uses for them (see drug repurposing).

It’s worth noting that “target”, “hit”, and “lead” have different meanings to different researchers. While the names may change, the process of drug discovery remains consistent. For this article, we’ll define targets as biological sites, hits as potential drug candidates that interact with targets, and leads as chosen drug candidates for further advancement.

Successful hit identification results in promising drug candidates to be further evaluated in later research and eventually trials. At this stage, not only are target/ligand binding and disease effect being evaluated, but so too are the pharmacokinetic (PK) and pharmacodynamic (PD) properties of the hits. These properties are commonly grouped under ADME/Tox (absorption, distribution, metabolism, excretion, and toxicity) and can be tested through both in silico (computer-run) and wet-lab studies to weed out poor drug candidates as initial hits are refined in a process called “hit to lead” or lead generation.

The in silico tests are a quick and inexpensive way to check ADME/Tox properties of hits before choosing to advance them. Schrödinger’s QikProp program is one example of a tool that runs such tests. Lipinski’s Rule of 5 is also still used to help with hit screens. Suggested by Christopher Lipinski in the late 1990s, Lipinski’s rules were formulated to guide drug design based on clinical drug candidate data from Merck and Pfizer. These rules are highly specific and formulation-dependent; for example, “an orally active agent must possess no more than 5 hydrogen bond donors and 10 hydrogen bond acceptors.” Despite the granularity, they are strikingly accurate and underscore how much value there is in analyzing a large volume of data to guide drug discovery and optimization.

While not always followed exactly, such generalizations are useful guides for streamlining the search for hits. If no promising hits are found for a given target, it’s back to the drawing board.
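
To make the idea of a rule-based in silico filter concrete, here is a minimal sketch of a Rule of 5 check. It assumes the descriptors have already been computed by some other tool. The `Descriptors` container, the one-violation tolerance, and the example values are our own illustrative choices (the aspirin numbers are approximate and the second molecule is made up); this is not how QikProp or any particular vendor tool works.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Descriptors:
    """Precomputed molecular descriptors (hypothetical container for this sketch)."""
    name: str
    mol_weight: float       # molecular weight in daltons
    log_p: float            # octanol-water partition coefficient
    h_bond_donors: int
    h_bond_acceptors: int


def rule_of_five_violations(d: Descriptors) -> List[str]:
    """Return the list of Rule of 5 criteria that the molecule violates."""
    violations = []
    if d.mol_weight > 500:
        violations.append("molecular weight > 500 Da")
    if d.log_p > 5:
        violations.append("logP > 5")
    if d.h_bond_donors > 5:
        violations.append("more than 5 H-bond donors")
    if d.h_bond_acceptors > 10:
        violations.append("more than 10 H-bond acceptors")
    return violations


def passes_rule_of_five(d: Descriptors, max_violations: int = 1) -> bool:
    """Assumption: tolerate up to one violation, a common convention."""
    return len(rule_of_five_violations(d)) <= max_violations


# Approximate values for aspirin; the second entry is entirely made up.
aspirin = Descriptors("aspirin", 180.2, 1.2, 1, 4)
big_hit = Descriptors("hypothetical-macrocycle", 1200.0, 3.0, 6, 12)

for mol in (aspirin, big_hit):
    print(mol.name, passes_rule_of_five(mol), rule_of_five_violations(mol))
```

A filter like this is cheap enough to run over an entire hit list before any wet-lab work, which is the appeal of in silico checks in general.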

3. Lead Optimization

After hits are identified, the focus turns to lead optimization, or advancing the lead candidate into a preclinical candidate. This conversion relies on enhancing the properties (ADME/Tox, structural, etc.) of the lead for increased efficacy and decreased toxicity, both of which are in line with the safety and efficacy profiles that the FDA will look for before Investigational New Drug approval and trials. Animal model studies, dosing studies, and both in vitro and in vivo assays are performed to quantify target/lead interactions and general toxicities.

The focus is on improving lacking aspects of the drug candidate while maintaining the favorable properties from before, a strikingly difficult task given the ambiguity and interdependence of biological structure and phenotype. For example, when improving the absorption profile of the lead, developers must be careful not to also increase aggregation. You can think of this as trying to solve a multi-dimensional Rubik’s Cube one side at a time.

This is where computational modeling becomes incredibly useful. With lead optimization, the developed candidate needs to be close to perfect. Multiple characteristics are getting fine-tuned in parallel, which means it is easy to generate endless variations of the lead over a time period of several years. Algorithms make predicting and testing modifications more data-driven and efficient.
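
One simple way to picture this trade-off is a Pareto filter: keep only the variants that no other variant beats on every property at once. The sketch below is purely illustrative. The variant names and scores are made up, and every property is scaled so that higher is better (aggregation is expressed as "aggregation resistance"). Real lead optimization platforms use far richer models than this.

```python
from typing import Dict, List

# Hypothetical lead variants with made-up scores in [0, 1]; higher is better.
variants: Dict[str, Dict[str, float]] = {
    "lead-A": {"potency": 0.80, "absorption": 0.40, "aggregation_resistance": 0.90},
    "lead-B": {"potency": 0.75, "absorption": 0.70, "aggregation_resistance": 0.50},
    "lead-C": {"potency": 0.70, "absorption": 0.35, "aggregation_resistance": 0.85},
    "lead-D": {"potency": 0.85, "absorption": 0.65, "aggregation_resistance": 0.30},
}


def dominates(a: Dict[str, float], b: Dict[str, float]) -> bool:
    """True if a is at least as good as b on every property and better on one."""
    return all(a[k] >= b[k] for k in a) and any(a[k] > b[k] for k in a)


def pareto_front(candidates: Dict[str, Dict[str, float]]) -> List[str]:
    """Names of the candidates that are not dominated by any other candidate."""
    front = []
    for name, props in candidates.items():
        beaten = any(
            dominates(other, props)
            for other_name, other in candidates.items()
            if other_name != name
        )
        if not beaten:
            front.append(name)
    return sorted(front)


print(pareto_front(variants))
```

In this toy set, lead-C drops out because lead-A is at least as good on every axis, while the remaining three each win on something. Choosing among them is the judgment call that makes optimization hard.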

The value and complexity locked in lead optimization is high enough to make it the sole focus of several companies. Once a target and lead combination emerges, the optimized lead candidates leave drug discovery and begin the longest stage of the drug development process: clinical research.

Simplified visual of the drug discovery process. Source

Artificial Intelligence in Drug Discovery

With that context, it’s easy to see how there’s room for AI to disrupt the discovery process, not only in accelerating discovery but also in helping discover higher-fidelity candidates. Computational approaches may even move the field away from discovery and closer to outright design.

Types of ML models in the drug discovery process. Source

Covering the entire array of computational methods that are applicable to drug discovery is akin to boiling the ocean. Instead, we have highlighted a few common approaches below, a breakdown borrowed from Manolis Kellis’ computational biology lecture series.

  1. Simulation: a structure-based method where you use virtual simulations to glean information about molecular structure and dynamics, such as ligand docking and binding profiles. This strategy gives you information about the target and potential ligands but still requires sorting through a vast number of compounds and verifying the dynamics in the wet lab. Virtual simulation may help elucidate how a target and ligand interact to guide further experiments, but it does not do much to speed up the drug discovery process since confirming experiments are still required at each step.
  2. In silico screening: sometimes referred to as “virtual screening”, this is the computer-based version of high-throughput screening, where algorithms sift through online libraries of molecular compounds. This requires access to relevant and organized datasets, whether public or proprietary. However, the scale of this testing is still not sufficient: there are over 10⁶⁰ drug-like molecules and we can only really screen ~10⁸ compounds a day. Screening through all possible compounds would take 10⁵² days. In comparison, the Earth is only about 10¹² days old. Virtual screening can be useful for targeted applications, but such brute force still has the issue of only getting what you screen for. For example, you can predict certain properties like solubility and biodistribution and screen for the best candidates, but even if a virtual screen returns a few potential hits, you have no way of knowing if those are indeed the best hits; in other words, you may be left in a position where the local maxima look deceptively like the global maxima.
  3. Novel drug design: this is the most interesting group of methods as we move from drug-hunting to deliberately engineering the optimal drug candidate. Whereas simulation and virtual screening involve mapping from the chemical space (known compounds) to the functional space (desired properties), novel drug design goes the other way—we know the properties we want and use computational techniques to find or design the right chemistry.
Visualization of ligand binding a docking site through PyMOL, a common tool used in Simulation. Source
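
The back-of-the-envelope arithmetic behind the screening numbers above can be checked in a few lines. The chemical space size and daily throughput are the figures quoted in the list; the age of the Earth (roughly 4.5 billion years) is our own assumed input.

```python
import math

CHEMICAL_SPACE = 10**60     # estimated drug-like molecules (figure from the text)
SCREENS_PER_DAY = 10**8     # compounds screened per day (figure from the text)
EARTH_AGE_YEARS = 4.5e9     # assumption: approximate age of the Earth
DAYS_PER_YEAR = 365.25

# Days needed to screen every molecule once at the stated throughput.
days_needed = CHEMICAL_SPACE // SCREENS_PER_DAY

# Age of the Earth expressed in days, for comparison.
earth_age_days = EARTH_AGE_YEARS * DAYS_PER_YEAR

print(f"days to screen everything: about 10^{math.log10(days_needed):.0f}")
print(f"age of the Earth in days:  about 10^{math.log10(earth_age_days):.1f}")
```

The gap between the two exponents is the whole argument: no plausible increase in throughput closes a difference of roughly forty orders of magnitude, which is why exhaustive screening is not a viable strategy.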

Datasets in AI Drug Discovery

Given the overlap in the types of algorithms used (deep learning, computer vision, natural language processing, etc.), it is often the type of data used that distinguishes companies in AIDD and adds to competitive advantage. The goal is almost always to find a target and lead that interact to modulate disease in a meaningful way. There are two types of datasets that are compelling for figuring this out: (1) sequence data and compound libraries, and (2) imaging data. Each company uses a combination of both, drawing on proprietary and publicly available datasets.

  1. Sequence data and compound libraries: data libraries that come from known molecules and/or molecular sequences at the DNA, RNA, and protein levels. Much of this data is readily available in massive public databases such as UniProt and PDB. As drug discovery becomes more design-forward and data-enabled, it is likely that epigenetic sequencing data will become a major consideration. Omics data is especially helpful in building these datasets and includes insights from spatial transcriptomics, genomics, proteomics, and multiomics. Companies relying heavily on sequencing data and compound libraries include Phenomic, DeepMind’s AlphaFold, Exscientia, Enveda Bioscience, Unnatural Products, and Valence Discovery.
  2. Imaging data: cellular/molecular imaging is one of the highest-leverage ways to understand how molecules interact and glean insights on reaction formation. Tools such as x-ray crystallography, cryoEM, and various microscopy platforms can generate detailed data on what is happening between molecules and within cells. Companies building platforms around imaging data typically have a unique advantage in building proprietary datasets of cellular/molecular images, as this type of data needs to be granular and it’s often hard to find task-specific public data. Examples of those focusing on imaging data include Recursion Pharmaceuticals, Eikon Therapeutics, Gandeeva Therapeutics, and AbCellera.
CryoEM data (left) from SARS-CoV-2 Omicron variant. Source

Algorithms in AIDD

A variety of algorithms are then used to make sense of the data and create learned predictions for target identification or hit generation, depending on what stage of the drug discovery process you’re going after. Three computational methods are described below:

  1. Deep Learning: a subset of machine learning that can derive insights from unstructured data. Deep learning typically requires a higher volume of data, but the data does not always need to be labeled. It can be used for a variety of drug discovery tasks including advanced image analysis, structure prediction, and novel chemical design. Various algorithms are often implemented in the same platform; for example, one set of algorithms can help identify target/ligand binding while another set optimizes potential drug candidates for ADME/Tox properties and still another makes sure that the molecules suggested are not impractical to synthesize at scale. Gradient descent, reinforcement learning, and neural networks (recurrent, convolutional, graph, etc.) are examples of deep learning techniques that can be applied to AIDD.
  2. Computer Vision: sorting through molecular and cellular images by turning image data into numbers that can be processed with higher throughput and used to make informed predictions. The classic example of computer vision in drug discovery is Recursion’s machinery that processes images from diseased cells vs healthy cells to better understand disease biology and uses those insights to identify new targets and drug candidates.
  3. Natural Language Processing (NLP): reading biological code to understand how nature designs molecules/biologics (or “spells” words) and predicting structure accordingly. An example of NLP applied to drug design is Nabla Bio’s* approach of reading amino acid sequences to predict protein structure and function.
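
To illustrate the "biological code as language" idea, here is a minimal sketch of the very first step an NLP-style model might take with a protein: turning a sequence of amino acids into tokens. The peptide, the vocabulary ordering, and the helper names are our own assumptions for illustration; this is not a description of Nabla Bio's or any other company's pipeline.

```python
from typing import List

# The 20 standard amino acids in one-letter code; the ordering is arbitrary.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
TOKEN_IDS = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def encode(sequence: str) -> List[int]:
    """Map each residue to an integer id, like characters in a vocabulary."""
    return [TOKEN_IDS[aa] for aa in sequence.upper()]


def kmers(sequence: str, k: int = 3) -> List[str]:
    """Split a sequence into overlapping k-residue 'words'."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


peptide = "MKTAYIAK"  # hypothetical peptide, for illustration only

print(encode(peptide))
print(kmers(peptide))
```

Once sequences are represented this way, the same families of models used for human language can, in principle, be trained to predict properties of the "sentence", here a protein's structure or function.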

There is no one way to apply machine learning to drug discovery—it’s more common than not to see companies spanning the full stack of drug discovery and design, from mapping cellular interactions with computer vision for improved target discovery to using NLP to predict entirely novel proteins to using patient data to discover genetic variations. ML can be implemented in any or all of the stages of drug discovery and, ideally, as we gain a better grasp of biological interactions and the limits of AI, algorithms will increasingly be used to design the drugs we want outright.

We fully expect AI-powered drug design companies to outcompete top-ten pharma with respect to capital efficiency to the clinic. While we are optimistic, an important caveat is that the first AI-discovered drug has yet to receive the utmost clinical validation — FDA approval and post-market surveillance. So while the jury is officially still out, early data points to applying AI to drug discovery as our best way forward in raising the hit rate, increasing efficiency, decreasing the time to clinic for new drugs, and ultimately improving outcomes for patients.

Snapshot of AI Drug Discovery market based on modality

Below is a snapshot of some of the players in AI Drug Discovery, segmented by main therapeutic modality. Stay tuned for a longer post characterizing TechBio and AIDD startups…

AI Drug Discovery market map segmented by pipeline modality

Huge thanks to Eric Dai, Jonny Hsu, Nima Ronaghi, Benji Leibowitz, Dana Watt, and Ian for the conversations around and revisions to this post

Amee Kapadia
Cantos Ventures

exploring bio and the near frontier at Cantos Ventures 🧬🌎