Calculating the Incalculable

How computational biology is transforming drug discovery and delivering new hope

Christopher Gibson
Published in World Positive
Jul 18, 2017

For 60 years now, we’ve made fewer new drugs per dollar each year on average, even when adjusted for inflation. Regulatory shifts and big advances in technology led to improvements in the short term, but the longer-term trend remains unchanged. Last year was particularly rough, with more than $150 billion in R&D spending in the industry and only a handful of new drugs approved. Extrapolating trends is always dangerous work, but if this one continues, we as a society won’t be able to afford any new medicines.

Why is it so hard to find new treatments, despite the hundreds of thousands of brilliant and dedicated scientists alongside technological advances in so many fields?

No one factor explains the trend, but one overarching theme sticks out to me: In the face of a system as complex as our own biology, and because of human nature, we have sought to distill biology down to relatively straightforward, linear hypotheses that are easy to digest and communicate. You need only visit the homepage of any major medical or biological journal today to see the evidence.

This reductionist approach is tempting in the face of enormous complexity: distill a question and hypothesis down to a simple, testable set of variables. But this strategy doesn’t really contend with the fact that biology is complex. Given that, the frequency with which treatments fail at various stages, due to unexpected toxicity or underwhelming efficacy, isn’t that surprising. Our regulatory system is designed to account for this by placing drugs into increasingly complex systems and looking for what one hopes to see (a benefit) and what one hopes to avoid (almost anything else!).

The process itself is straightforward: Scientists generate a large amount of data about the ways drugs behave in cells, and then in animals. They explore the propensity of the drug to act on specific targets that we know are dangerous. For example, drugs that act on an ion channel referred to as hERG are known to cause a type of cardiac arrhythmia, so every new drug in development is checked to make sure it is unlikely to cause problems through an off-target hERG effect.

When everything checks out in cells and animals, after years of work and many millions of dollars, a drug moves into humans — typically in escalating doses in healthy humans where safety is assessed, and later in larger trials monitoring both for safety and efficacy. This path is designed to minimize risk precisely because (1) we know surprisingly little about biology, and (2) we are often surprised by how it all actually works in practice.

Tech Buzzwords to the Rescue?

The advance of big data, machine learning, and artificial intelligence is leading to both new hope and renewed skepticism about our ability to fundamentally change the pace and scale at which new medicines can be discovered.

The worry about inflated expectations among biotech industry veterans is understandable. After all, promises of computational biology have been made and broken before, both in the ‘80s (new drugs will be designed by computers!) and in the 2000s (gene sequencing will provide the blueprint for life and make sense of it all!). While these technologies have contributed to many improvements and myriad successes, they haven’t quite been the holy grails they’ve aspired to be.

This time might be different.

Here’s why.

  1. Data Explosion: First, new technologies allow us to collect significantly more data. Human Longevity Inc., for example, is on a mission to collect and sequence 1,000,000 human genomes. At hundreds of megabytes per genome, that’s a lot of data! Other organizations are seeking to build a cellular atlas, categorizing the level at which every gene is expressed in various cell types, and identifying which proteins are sticking together in each cell type. Each of these datasets, and thousands of others like them, gives us a narrow slice of how the whole system of biology works.
  2. Plummeting Costs: Second, it is much less expensive to analyze data than before. While the biopharma industry’s efficiency has been declining for 60 years, the cost to compute has been improving at an extraordinary pace. This means that for the cost of a few Starbucks coffees, you can activate hundreds of powerful computers in the cloud to conduct analyses that two decades ago would have cost millions — and a computer the size of your house.
  3. Deep Learning Inferences: Third, advances in analyses like deep learning are likely to change the game in their ability to draw more human-like inferences from massive datasets that are millions of times too big for a human to digest. While this burgeoning field is still young, there are concrete wins to point to. Deep learning has enabled algorithms to beat the very best humans in games like Go, in which there are an incalculable number of possible moves to map, and in games like Texas Hold ’em, where the algorithms make decisions with imperfect information and learn to bluff. When applied to modeling biology — a system of incalculable possible interactions and dependencies — these new analysis approaches can deliver previously unattainable insights that further our understanding of any number of complex problems, biology included.

The Power of High-Dimensional Data

For decades, scientists conducted time-consuming experiments and measured just one thing about the cell or protein they were studying. For example, they might test millions of compounds for their ability to modify the activity of a protein or change where it spends most of its time inside the cell. In each instance, arduous work ends in single-number outputs. This is the reductionist approach, and it assumes sufficient understanding of biology such that conclusions about how to drive a new drug program can hinge on a single number.

In contrast, it has recently become much easier to measure lots of things in each experiment due to massive reductions in cost. For example, when you add a drug to a set of cells, you can now measure how that drug affects thousands of genes in that cell. Other approaches include measuring hundreds of cellular metabolites, decoding chemical modifications to millions of bases of DNA sequence, profiling the status of thousands of proteins, or measuring changes in the appearance of cells in the form of thousands of features related to shapes, sizes and textures.
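
To make one of these readouts concrete, here is a toy sketch of asking which of thousands of genes shift when cells are treated with a drug. The matrix layout and the simple per-gene test are illustrative assumptions for the example, not a prescribed pipeline.

```python
# Toy differential expression: compare drug-treated vs. control samples
# across thousands of genes at once. Illustrative only.
import numpy as np
from scipy import stats

def differential_expression(treated: np.ndarray, control: np.ndarray):
    """treated, control: (n_samples, n_genes) expression matrices."""
    # Welch's t-test per gene: which of the thousands of genes shifted?
    t_stat, p_values = stats.ttest_ind(treated, control, axis=0, equal_var=False)
    # A simple effect size: log2 fold change of mean expression per gene.
    log_fold_change = np.log2(treated.mean(axis=0) + 1) - np.log2(control.mean(axis=0) + 1)
    return log_fold_change, p_values  # one value per gene

# Usage sketch: rank genes by fold change, filter by p-value, and you have a
# high-dimensional fingerprint of what the drug did to the cell.
```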

A decade or two ago, when the promise of computational biology fell flat, this type and quantity of data was too expensive to acquire and nearly impossible to analyze. (For example, the first human genome, Craig Venter’s, was sequenced in 2000 for nearly $100 million; today, similar sequences cost a couple thousand dollars and can be analyzed quickly.) It is this perfect storm of steep decreases in the cost of acquiring and computing high-dimensional datasets, along with massive advances in analysis algorithms, that is driving the most progressive biotech and pharma companies to focus on high-dimensional data over the reductionist approaches to drug discovery and development of the past.

More (Data) Bang For Your Buck

Knowing where to start and what high-dimensional dataset to focus on is one of the most pressing challenges in biotech and pharma today, with some of these data still requiring more than $1,000 per sample. Assuming that the conclusion-yielding value of each dataset is within an order of magnitude of the others, the best approach might be to focus first on those techniques that yield the most data per dollar.
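
As a back-of-envelope illustration of that heuristic, the sketch below ranks a few assay types by features per dollar. The feature counts and per-sample costs are placeholder numbers chosen only to show the comparison logic, not real prices.

```python
# Rank assay types by data yield per dollar. All numbers are made-up
# placeholders for illustration, not actual assay costs.
assays = {
    # name: (approx. features per sample, approx. cost per sample in USD)
    "whole-genome sequencing": (3_000_000, 1000.00),
    "RNA expression profiling": (20_000, 300.00),
    "cell imaging": (1_000, 0.05),
}

for name, (features, cost) in sorted(
    assays.items(), key=lambda kv: kv[1][0] / kv[1][1], reverse=True
):
    print(f"{name}: ~{features / cost:,.0f} features per dollar")
```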

Where to begin? At Recursion, we believe image-based assays are likely the answer because images are inexpensive to acquire and incredibly data-rich.

In our early automation pipelines, we can test the effects of various drugs on human cells and take images of the resulting biology for pennies per sample. From these images, we get thousand-dimensional data for each of millions of cells.

For example, we measure the shape, size, texture, and spatial relationships of various cellular components across tens of thousands of biological variables each week. Scale matters in biology, and by testing millions of biological variables in our own laboratory, where we control both the variables and the annotation, we are identifying potential new therapies at scale. Simultaneously, we are building a highly relatable dataset that can enable the beginnings of a model of how human cellular biology actually unfolds.
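
As a rough illustration of how an image becomes numbers, the sketch below segments cells in a single-channel microscopy image and computes a handful of per-cell shape and intensity features with scikit-image. It is a minimal stand-in under those assumptions, not our production pipeline, and the particular features are chosen only for the example.

```python
# Minimal morphological profiling sketch: one grayscale image in,
# one feature vector per detected cell out. Illustrative only.
import numpy as np
from skimage import filters, measure

def cell_features(image: np.ndarray) -> np.ndarray:
    """Segment cells and return one feature vector per cell."""
    # Separate foreground (cells) from background with a global threshold.
    mask = image > filters.threshold_otsu(image)
    labels = measure.label(mask)

    rows = []
    for region in measure.regionprops(labels, intensity_image=image):
        rows.append([
            region.area,            # size
            region.eccentricity,    # shape
            region.perimeter,
            region.solidity,
            region.mean_intensity,  # crude texture/intensity proxy
        ])
    return np.asarray(rows)  # shape: (n_cells, n_features)
```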

This is possible because in biology, structure suits function. For example, most of us can recognize people with Down syndrome because of specific archetypal changes in the way they look. In fact, this is widely true in biology and medicine. Well-trained physicians can diagnose hundreds of diseases based mainly on changes in appearance. Clubbing of the fingertips is one characteristic of certain lung and heart diseases, whereas the location and topography of a rash can send a physician down various differential diagnoses. Under the pathologist’s microscope, the number, type and look of blood cells distinguishes innocent inflammation from blood cancer and countless other diseases.

Scaling Drug Discovery, Without Bias

Our image-centric approach generates hundreds of millions of images of billions of human cells and embraces the complexity of biology to answer complex questions instead of relying on a reductionist hypothesis.

An illustrative example: it is not a matter of debate that mutations in the CFTR gene lead to cystic fibrosis, or that mutations in the NF2 gene lead to neurofibromatosis type 2. Using special biological tools, we can perturb each of these genes, take a picture of the result, and measure a thousand different features of each of millions of cells representing each condition. By leveraging computer vision and machine learning techniques, we can often identify robust changes in the way cells look that are specific to a disease model of interest. We then add thousands of drugs to these cells, take pictures, and ask whether any of the drugs make the cells look more ‘normal’.
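
One way to picture that last step, as a hedged sketch rather than a description of our actual models: train a generic classifier to separate healthy from diseased cell profiles, then score each drug by how many treated cells the classifier calls ‘healthy’. The classifier choice and the rescue score below are illustrative assumptions.

```python
# Sketch of "does this drug make diseased cells look normal again?"
# using per-cell feature vectors (e.g., from image profiling). Illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def fit_disease_model(healthy: np.ndarray, diseased: np.ndarray):
    """Learn to separate healthy (label 0) from diseased (label 1) profiles."""
    X = np.vstack([healthy, diseased])
    y = np.array([0] * len(healthy) + [1] * len(diseased))
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    return model.fit(X, y)

def rescue_score(model, treated: np.ndarray) -> float:
    """Fraction of drug-treated cells the model classifies as healthy."""
    return float(np.mean(model.predict(treated) == 0))

# Usage sketch: rank thousands of candidate drugs by how strongly they push
# the diseased phenotype back toward the healthy one.
# scores = {drug: rescue_score(model, profiles[drug]) for drug in profiles}
```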

We do not at this stage know why a given drug restores the cells back to health, nor do we care. We simply want to harness the massive complexity of biology to help guide us to the answer, and only then understand why it is the answer.

Our approach has thus led us to answers that carry less bias than reductionist approaches. How? We sometimes find that a drug is working in a way opposite to what others might expect based on the small amount of knowledge we’ve amassed about biology. Instead, we simply see that it works, and then turn over stones until we understand why. Other times our approach leads us to the biology we expect, and we ‘rediscover’ drugs that are known to work in humans for various diseases. These validating discoveries give us confidence that we can trust our findings even when they don’t fit our biased understanding of how biology works.

This entire process can be carried out by robots under the guidance of computers, allowing us to test tens of thousands of disease and drug combinations each week. And as our company grows, we expect to be able to scale to hundreds of thousands, and maybe one day millions of these tests at a time. It’s worth noting that many extraordinary, recent advances in machine learning and deep learning have been made in the analysis of massive image-based datasets.

In the last 18 months, we’ve identified nearly two dozen new treatments that we think hold promise for patients with a variety of rare genetic diseases.

This is just the beginning.

Mapping it All

Finding your way is always easier with a map, and despite the complexity, this shouldn’t be any different in the field of biology. By building a massive database of biological images, each of which is relatable over time with all the others we produce, our ambitions are much larger than identifying treatments for the hundreds of diseases we work on today. Instead, by noticing patterns in how cells look under varying conditions, we can start to understand and model the complex interactions that make it all work.

We are in the process now of breaking every known gene and measuring the resulting changes in cellular images across multiple human cell types. If breaking two genes results in a very similar high-dimensional phenotype, for example, we might be able to conclude that there is a higher likelihood that those two genes are operating in similar processes. Because our techniques are so inexpensive, we plan to ask hundreds of millions of questions like this.
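
A minimal sketch of that comparison, under the assumption that each knockout is summarized as the average of its cells’ feature vectors: compare knockouts by cosine similarity, and treat highly similar pairs as candidates for shared biology.

```python
# Compare gene-knockout phenotypes in feature space. The data layout
# (one feature matrix of cells per knockout) is an assumption for illustration.
import numpy as np
from scipy.spatial.distance import cosine

def gene_profile(cell_features: np.ndarray) -> np.ndarray:
    """Average the per-cell feature vectors for one gene knockout."""
    return cell_features.mean(axis=0)

def phenotype_similarity(profile_a: np.ndarray, profile_b: np.ndarray) -> float:
    """Cosine similarity: ~1.0 means very similar phenotypes, ~0.0 unrelated."""
    return 1.0 - cosine(profile_a, profile_b)

# Genes whose knockouts produce highly similar profiles become candidates for
# acting in the same pathway or process, one pairwise question at a time.
```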

And while we prefer image-based approaches for the reasons mentioned above to build our foundation — the low-resolution map — we expect that we will need to layer other high dimensional datasets like gene expression, proteomics, and metabolomics on top of image data to improve the reliability and resolution. By doing this in a targeted way, however, we expect to generate a map faster and more inexpensively than anyone else.

In the end, whoever can build such a map will be able to massively increase the speed and decrease the cost of finding treatments, potentially to the point where we can design the right drug for the right patient at the right time. The impact of such a success would be unfathomable: it would improve the lives of billions of people.

And so maybe this time will be different; maybe the simultaneous, exponential advancement of various technologies will let us shift the discovery curve in biology the way technology has shifted the rate of improvement in other industries. While many failures, much iteration, and millions of hours of work lie ahead for our industry, the potential improvement in outcomes for patients is nearly incalculable.

World Positive is powered by Obvious Ventures.
Creative Art Direction by Redindhi Studio. Illustration by Rune Fisker.


Christopher Gibson

Scientist, entrepreneur, father and husband. Co-Founder and CEO, Recursion Pharma.