The State of the Art: In Silico Drug Development

This article was originally posted by IndieBio, SOSV’s life sciences accelerator in San Francisco.
In every startup’s pitch deck, there’s some form of a “State of the Art” slide. Our series of the same name aims to help investors, entrepreneurs, and journalists get up to speed on what the true State of the Art is. In this paper, we explore the in silico end of the drug development pipeline. Just how well can computers model the complexity of living organisms?

The promise of in silico drug design and testing has been tantalizing. But to get it right, and to revolutionize drug development, the physics has to be accurate. Physics is at the base of the technological stack that governs how drugs work in the body.

Biochemistry Tetris

Designing a drug for a target starts as a sort of game of biochemistry Tetris in three dimensions. A good drug is fundamentally a better puzzle piece than the natural bioactive compound; it fits the target pocket even better than what the human body naturally puts there. It fits so well that it even knocks out the natural substrate and blocks it from getting back in. The drug, now in the pocket, acts by performing differently than the natural substrate: turning off the target protein, or turning its activity up.

A classic Richardson ribbon diagram

A decade ago, the science of designing protein folds — to make these intricate structures — was exploding. Part of the challenge was that parts of the drug accomplish different functions: most commonly, some parts do the anchoring — the binding — while other parts enzymatically act on the protein, and so they need to perfectly match up with the active area on the target protein. Sort of how a rock climber might use one leg and one arm to stabilize himself in a crack, using his free hand to drive in a bolt anchor. As machine learning emerged, this space was, for a while, a kind of man vs. machine drama. John Henry vs. the steam drill all over again. As the molecular and genomic data of proteins was accumulated into libraries, machine learning had the data to train on.

You might think the race was over. But a huge portion of drug discovery and drug development today still does not use computers to optimize drug hits into drug leads. Part of this is habit; chemists know chemistry, biologists know wet lab work. They don’t stop doing what they do. But part of it is that computational design hasn’t consistently created wonder drugs. Think of it this way: if Google Maps sent you to the wrong location a fair amount of the time, you wouldn’t use it.

There are a lot of reasons for this, which we’ll sketch out here. At IndieBio, we have the privilege of visiting many great labs, and seeing computational biology applicants from all over the world.

Visualizing the Physics Aspect

Fundamentally, in silico is a model of reality. If the model isn’t insanely accurate, then small errors will accumulate in the action chain, leading to inaccurate predictions. Here are the many dimensions in which computational models are improving.

If you look really close at a protein target pocket in the body, you see something you don’t see in a conventional visualization:

The puzzle pieces are moving, almost like in a hallucinogenic dream. Side chains are flexible, and can circle like helicopter blades. So if the molecular chemists’ challenge is to create the perfect puzzle piece, how do you account for this constant movement? The very notion of “fitting” gets conceptually looser — and takes our journey into the domain of physics.

In organic chemistry, we model or draw molecular bonds. A covalent bond is when two atoms share electrons. Single bonds allow the atoms to rotate around them; double and triple bonds are more rigid but still have vibrational energy.

Creating drugs with covalent bonds in the target pocket is very desirable, but also hard to do, and rare. In most cases, drugs are more like a free climber, using no rope or anchors. What keeps them in place are forces in the domain of physics, both at a Newtonian level and a quantum level. These include electrostatic forces, where negative and positive charges attract and can swap an electron or form a salt bridge. Molecular structures also interact through van der Waals forces, which attract weakly at a moderate distance and repel strongly when atoms get too close.
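To make the distance dependence of these non-covalent forces concrete, here is a minimal sketch using two textbook approximations: a Lennard-Jones 12-6 term for van der Waals interactions and a Coulomb term for electrostatics. The function names and the default epsilon and sigma values are illustrative assumptions, not parameters taken from any particular force field or from the software discussed in this article.

```python
COULOMB_KCAL = 332.0637  # Coulomb constant in kcal*angstrom/(mol*e^2)


def lennard_jones(r, epsilon=0.2, sigma=3.4):
    """Lennard-Jones 12-6 energy in kcal/mol at separation r (angstroms).

    epsilon (well depth) and sigma (zero-crossing distance) are
    illustrative placeholder values, not fitted parameters.
    """
    ratio6 = (sigma / r) ** 6
    # The r^-12 term models short-range repulsion, the r^-6 term attraction.
    return 4.0 * epsilon * (ratio6 ** 2 - ratio6)


def coulomb(r, q1, q2, dielectric=1.0):
    """Coulomb energy in kcal/mol for point charges q1, q2 (units of e)."""
    return COULOMB_KCAL * q1 * q2 / (dielectric * r)


if __name__ == "__main__":
    for r in (3.0, 3.4, 3.8, 5.0, 8.0):
        print(f"r = {r:4.1f} A   LJ = {lennard_jones(r):8.4f} kcal/mol")
    # Opposite unit charges 3 angstroms apart, as in an idealized salt bridge.
    print(f"salt bridge estimate: {coulomb(3.0, 1.0, -1.0):.1f} kcal/mol")
```

In this model the energy is positive (repulsive) below sigma, reaches its minimum of -epsilon at about 1.12 times sigma, and fades toward zero at long range. Real force fields sum terms like these over every atom pair, which is part of why the calculations get expensive.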

Hydrogen bonding

Now, to use machine learning to teach a computer to play chess, you can put in some very basic rules, such as how each chess piece moves. Or you can just have the computer watch some games and learn the rules. But either way, the computer learns the rules. Then by playing a quadrillion times, it deduces strategy. The same process applies when you use artificial intelligence to design drugs. It’s just that the rules to learn are insanely complicated, and understood only by the world’s best molecular scientists and physicists. In the realm of in silico design, the rules are governed by genetics, molecular chemistry, and physics at the atomic level, both Newtonian and quantum.

Here’s just one part of the quantum level that has to be modeled accurately: electron indeterminacy.

Cartoon of electron indeterminacy

Electrons are like a little kid who never stops playing hide and seek. They’re never totally in one place. This is stuff that even machine learning can’t extrapolate accurately. So to do computational biology, and to design drugs in silico, the software models energy landscapes, probability fields and force fields. The binding attraction is not stable, it’s fluid. To go back to the rock climber metaphor, he’s not just a free climber, he’s constantly fidgeting, and as he fidgets, he nearly falls off the wall, then recovers.

Zoomed into the binding pocket of a protein, these energy landscapes behave like this:

Image courtesy of Gavilán Biodesign

Now imagine trying to design a drug that remains stable in that pocket.

While there’s a fierce philosophical divide in the field between rules-based models and pure machine learning, we look at it more practically: wherever machine learning can help, it should; wherever physics models are better, they should be used. Hybrid systems use both. What really matters isn’t how the software learns the rules of biophysics; what matters is that it’s accurate.

Any artificial intelligence computation engine that hasn’t accurately learned the rules of the game is going to lose, time and again, to the real complexity of living biological systems. The drug it designs might work, but it might not.

The Proof is In Vitro

In truth, it’s not enough just to “work.” You have to prove it works, at every step. It’s like a court case: it’s not enough to know he did it, you have to prove it beyond all doubt. And so you prove the drug in wet lab work like assays and tissue cells, then you prove it again in organisms where human genetics are preserved, and you prove you can manufacture the drug exactly the same way every time, and then you prove it’s safe in humans before anyone even asks if it “works” in humans.

So you might be wondering, how do these computational chemists even know if their model is accurate? How do they confirm their work?

Various ways. First, their energy field models don’t just come from anywhere. Usually, they come from the physics department at a university where there are world experts in physics who’ve been validating their models for years (though not inside living organisms). So their energy model has a known level of mathematical accuracy. Force field models have been measured, and the errors found are fed back to improve the model. Confirmation of the electrostatic forces that govern long-range interactions has been done on full amino acids. Protein movement is trained on the Richardson libraries. Jane Richardson, a biophysicist at Duke, invented the 3D schematic ribbon diagram used to draw and classify proteins, and the Richardson libraries reflect the culmination of decades of structural biophysicists’ expertise and insight. Then the computational protein designers can do retrospective studies, using approved drugs that have already been proven to work, to confirm that their model results in accurate predictions.

For a new drug they design, their predictions have to be confirmed by wet lab work, where tools use light or heat to probe the molecules, in order to measure how much energy it takes to break the molecules apart, or destabilize them. In this way, biochemists measure and score biochemical properties such as binding affinity, specificity, and stability.

It’s these properties that pharma companies want when they have a drug designed.

  • Binding affinity means the drug attaches to the target better than the natural substrate.
  • Specificity means the drug binds only to the target and not accidentally somewhere else.
  • Stability means the drug doesn’t fall apart or change shape in unintended ways.
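As a rough illustration of how binding affinity is put on an energy scale, the standard thermodynamic relation ΔG = RT ln(Kd) converts a measured dissociation constant into a binding free energy. The sketch below applies it to compare a drug with a natural substrate. The function names and the example Kd values are hypothetical, chosen only to show the arithmetic.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def binding_free_energy(kd_molar, temperature_k=298.15):
    """Binding free energy in kcal/mol from a dissociation constant in mol/L.

    Uses delta_G = R * T * ln(Kd) with a 1 M standard state; a more
    negative value means tighter binding.
    """
    if kd_molar <= 0:
        raise ValueError("Kd must be positive")
    return R_KCAL * temperature_k * math.log(kd_molar)


def affinity_advantage(kd_drug, kd_substrate, temperature_k=298.15):
    """Free energy difference (drug minus substrate) in kcal/mol.

    Negative means the drug binds the target more tightly than the substrate.
    """
    return (binding_free_energy(kd_drug, temperature_k)
            - binding_free_energy(kd_substrate, temperature_k))


if __name__ == "__main__":
    # Hypothetical values: a 1 nM drug versus a 1 uM natural substrate.
    print(f"drug:      {binding_free_energy(1e-9):.2f} kcal/mol")
    print(f"substrate: {binding_free_energy(1e-6):.2f} kcal/mol")
    print(f"advantage: {affinity_advantage(1e-9, 1e-6):.2f} kcal/mol")
```

The same function gives a crude handle on specificity: computing the difference between the drug's Kd for the intended target and its Kd for an off-target protein shows how strongly the target is preferred.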

But this is where it gets really interesting. Because in theory, if you have the binding affinity, specificity, and stability exactly right, you can make the ultimate prediction, a sort of Holy Grail proof — which is predicting how the drug will work when the target protein mutates over time.

This is the frontier of in silico work today.

Explaining Resistance Mutations

Genetics controls the shape of cell receptor target proteins, so when the genetics mutate, those proteins change shape. The target pocket morphs. For the rock climber, that outcrop they were counting on grabbing with their next move is suddenly gone.

Let’s focus on the interaction between cancer cells and drugs for a moment. Most cancer drugs only work against a specific pocket shape — against a certain genetic code. So over the last decade, pharma companies have used bioinformatics to match patients (with certain genetics in their tumors) to the drug. This has been transformative for the pharmaceutical industry, and as a result, the rate of FDA drug approval has gone up meaningfully in the last five years. To be clear, it’s not that the drug necessarily works better, rather it’s only being given to patients with the specific tumor genetics.

But in most cases, a tumor has some genetic diversity even before an expert oncologist tries to treat it.

As a result, there are usually cancer cells unaffected by the drug. Maybe only a few of them. But they keep dividing, growing fast, and the tumors come back. In this way, cancers escape precision drug targeting. This is one of the largest factors in what oncologists call resistance: the capacity of tumors to stop responding to a drug, even if that drug was very successful at first.

Over the last fifteen years, oncologists have mapped which genetic mutations emerge against drugs on the market today. For instance, it’s well documented that a lung cancer patient taking the tyrosine kinase inhibitor erlotinib (trade name Tarceva) will likely see a benefit for some time, then eventually the tumor genetics will undergo a T790M mutation. At codon 790 of exon 20, the amino acid threonine becomes amino acid methionine. The tumor comes back.

Now, to be even clearer: this isn’t the only mutation the cells in the tumor make. Cancer cells are fast growing, and almost every time a cell divides, a mutation happens. The human genome has about 3 billion base pairs, so there is an enormous number of mutations that can happen over time. Cancers mutate fast. But most mutations are silent; they don’t change anything. Some mutations cause the cancer cell to fail, and it dies. Some act on the protein, but not in the pocket where erlotinib is, so the drug still works on the tyrosine kinase enzyme. Each of these mutations has an energy cost to the cell, and so they all have different likelihoods.

So the T790M mutation isn’t the only mutation that causes the pocket shape to change and the drug to fail. There are hundreds that do it. Rather, it’s the one with the highest probability.
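One simple way to picture how energy costs translate into likelihoods is a Boltzmann weighting, where each candidate is weighted by exp(-cost / kT) and the weights are normalized into probabilities. The sketch below is only a toy illustration of that idea. The mutation labels and energy costs are hypothetical placeholders, and this is not a description of how any lab's resistance-prediction software actually works.

```python
import math

KT_KCAL = 0.593  # approximate kT near room temperature, in kcal/mol


def mutation_probabilities(energy_costs, kt=KT_KCAL):
    """Map {mutation: energy cost in kcal/mol} to normalized Boltzmann weights."""
    if not energy_costs:
        raise ValueError("energy_costs must not be empty")
    # Subtract the lowest cost first so the exponentials stay well behaved.
    lowest = min(energy_costs.values())
    weights = {
        name: math.exp(-(cost - lowest) / kt)
        for name, cost in energy_costs.items()
    }
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


def rank_mutations(energy_costs, kt=KT_KCAL):
    """Return (mutation, probability) pairs, most likely first."""
    probabilities = mutation_probabilities(energy_costs, kt)
    return sorted(probabilities.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical candidates that all disrupt drug binding in the pocket.
    candidates = {"mut_A": 1.0, "mut_B": 2.5, "mut_C": 4.0}
    for name, probability in rank_mutations(candidates):
        print(f"{name}: {probability:.3f}")
```

Because the weights fall off exponentially with cost, a difference of a few kcal/mol is enough for one candidate to dominate the ranking, which matches the picture of hundreds of possible escape mutations with one clear favorite.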

At the very leading edge of computational biology today, data scientists and biophysicists are using their in silico models to try to predict which resistance mutations will emerge against a drug, even before patients have ever tried it.

Predicting Resistance — and Designing Against It

Only a few labs in the world have succeeded to the point of publishing a scientific paper about their results. At MIT, Bruce Tidor’s lab has been publishing results, but the lab that leads this effort worldwide is Bruce Donald’s at Duke. They first predicted resistance mutations in the superbug MRSA, then they predicted resistance mutations in HIV, and then their software was used by The Institute of Cancer Research in London; it correctly predicted the likeliest resistance mutations to a wide range of precision cancer therapeutics.

Your average cancer drug company doesn’t even know this can be tried; it’s so advanced that it’s not even on their radar. They find out the hard way — at the clinical trial stage — how cancers escape their seemingly perfectly designed drug. Two hundred million dollars into the FDA process, and their drug ultimately fails.

But predicting resistance mutations would represent several major, historic advancements. First, if you could predict resistance mutations like T790M, it means your in silico model is insanely accurate. Second, you could then select between different drug designs at the start of the drug development process — anticipating which drug is going to give patients longer lifespan. Third, you could design drugs in silico that could be so perfectly designed, they would work despite multiple resistance mutations, even dozens, or hundreds.

The software from Bruce Donald’s lab was used to design a new antibody against HIV, in cooperation with the National Institutes of Health’s Vaccine Research Center. HIV is the fastest-mutating thing known to man; every time it copies itself, it mutates. They started with an existing HIV-1 antibody, VRC07. Then they optimized it, finding a drug design that would still work against the likeliest mutations. The drug that was designed, VRC07–523LS, was predicted to be eight times more potent than its precursor, and that prediction was then confirmed in lab tests. It’s now in nine clinical trials.

The Speed vs Accuracy Tradeoff

Speed of arriving at an optimized drug is also critical. Solving the Schrödinger equation is the only way to calculate energies purely from first principles, but it’s only used to model small molecules because it’s so computationally expensive. Most molecular dynamics software uses approximate models based on Newtonian physics. There’s a term of art used in the field, which is “accurate enough to be meaningful.” But as we’ve discussed, just how useful the predictions are varies greatly. Schrödinger (the company) can model the mechanics to evaluate the affinity, specificity, or stability of a drug in an hour, though it models only one of these properties at a time. For biotech companies that just want to know exactly how their drug is going to work in the protein pocket, it’s been a great solution.

But if the goal is to search through all possible biochemical structures to find the best one for a target pocket, speed is imperative. Neural network methods are 300,000 times faster — but they give up significant accuracy.

Recently, a team from Bruce Donald’s lab at Duke has started a new company, Gavilán Biodesign. Their newest software, which uses a hybrid model, is a trillion times faster — while still being accurate enough to perform the incredible feat the lab is known for, predicting resistance mutations. Gavilán’s software generates 100 trillion molecule designs a day, evaluating their physics and scoring them against all criteria simultaneously, including future mutation resistance.

Undruggable Targets are Next

At IndieBio, we find that other people’s biases are often our opportunity. In silico drug discovery shares a skeptical perception bias with other sectors like stem cells, gene therapy, and the microbiome: there was a lot of hype, then a quiet decade where the miracle that had been teased wasn’t forthcoming. Investors cooled off. By the time real solutions are on the horizon, investors are looking the other way.

We strongly believe that in silico could transform pharma economics, both speeding up discovery and reducing the number of drugs that fail where it’s so costly — in clinical trials.

But the biggest impact could be in designing drugs for targets that are currently considered “undruggable.” Up to 85% of the proteome is considered, currently, beyond our reach. This is because, in some cases, the active site on the protein lacks a pocket where a drug can bind and stay put. Often the site is designed (by the cell) for a protein-protein interaction. Proteins usually don’t work on their own; they operate in complexes, and these interactions are usually transient. But proteins are big, compared to drugs; even if a drug can anchor onto a shallow pocket of one protein, it couldn’t simultaneously secure the other protein.

Drugging these targets has been elusive, but the effort is well underway, because a few of these targets — such as tumor suppressor p53, oncogene kRAS and proto-oncogene c-Myc — are so impactful on how cancer cells multiply.

The first kRAS covalent inhibitor, designed at chemist Kevan Shokat’s lab at UCSF

So will machine learning replace theoretical models?

The new contest of John Henry is not man vs. machine. It’s biophysical approximation vs. machine learning. At IndieBio, across many companies we have seen the incredible value of machine learning in the loop. But we have not seen it arrive at algorithms without significant assistance from a theoretical model that is based on deep human insight into biological systems.

To classify blood cells by type in a microfluidic device, it took a theoretical understanding of cells scored in ten dimensions, five of which are imaginary. To find the blood markers of autism, it took decades of research into the complex interaction between 25 reactions in the metabolism pathways. Machine learning then improves the signal, but it could not find the signal on its own.

AI is not going to replace lab work, and it shouldn’t be thought of that way. The desire is to make an even tighter loop between wet lab testing and the data models — in an ideal world of automated discovery, the wet lab is roboticized, and it talks with the computational machine directly.

About IndieBio

At IndieBio, we are always looking for the next great company. We don’t just invest in startups — we help create them, often working with post-docs and principal investigators to build a team and transform them into scientist-entrepreneurs. To learn more, visit


Originally published on May 6, 2019.