Daphne Koller
May 1, 2018

insitro: Rethinking drug discovery using machine learning

Modern medicine has given us effective tools to treat some of the most significant and burdensome diseases. Widespread use of vaccines and antibiotics has considerably reduced the risk of death from most infectious diseases across large parts of the world. Antiviral therapies are allowing patients with HIV to live an almost normal life, and patients with hepatitis C can now be cured. Advanced therapies for cancer, including targeted therapies and immunotherapies, are improving long-term outcomes for certain groups of patients. Recent developments promise to transform the care of patients with certain genetic diseases (such as cystic fibrosis). However, many diseases still pose a significant unmet need, whether because current therapeutic options only serve a small subset of patients, because the options only ameliorate symptoms rather than address the true underlying cause, or because there are no meaningful treatments at all.

At the same time, it is becoming steadily more challenging to develop new therapeutics: clinical trial success rates hover around the mid-single-digit range; the pre-tax R&D cost to develop a new drug (once failures are incorporated) is estimated to be greater than $2.5B; and the rate of return on drug development investment has been decreasing linearly year by year, with some analyses estimating that it will hit 0% before 2020. One explanation for this phenomenon is that drug development is now intrinsically harder: many (perhaps most) of the “low-hanging fruit” (druggable targets that have a significant effect on a large population) have been discovered. If so, then the next phase of drug development will need to focus on drugs that are more specialized: drugs whose effects may be context-specific, and which apply only to a subset of patients. Identifying the appropriate patient population is often hard, which makes therapeutic development more challenging and leaves many diseases without effective treatment and many patients with an unmet need. Moreover, the reduced market size forces high development costs to be amortized over a much smaller base. The solution to this problem cannot be that we continue to pay enormous amounts to develop new drugs, most of which fail, and then pass those costs on to our patients. This is neither economically sustainable for society nor ethical, since it prices many new drugs out of reach for many people who need them. We must find a different approach to drug development.

What might another approach look like? I’m a longtime machine learning (ML) researcher and have worked in this field for 25 years. But even with that long experience (which modulates expectations), I find myself constantly surprised these days. ML is currently solving problem after problem that I did not expect would be solvable within my lifetime: translating sentences between languages at close-to-human performance, recognizing unconstrained speech in multiple languages, providing accurate natural language descriptions of the content of images, and learning to play complex games at beyond-human-level performance. ML is transforming sector after sector of the economy, and the rate of progress only seems to be accelerating.

These successes are based not only on better ML algorithms — resulting from the work of many smart people over the past few decades — but as much or more so on the availability of very large amounts of data. Such datasets were not available in most areas even a few years ago. And for small data sets, most reasonable ML methods perform similarly to each other. It’s the availability of increasingly large data sets that allows significant distinctions between the performance of different models, and has opened the door to the creativity of thousands of ML researchers whose work has enabled the dramatically better solutions that we find today.

New York Times columnist and writer Thomas Friedman wrote in 2012 (when covering the launch of my previous venture, Coursera) that “Big breakthroughs happen when what is suddenly possible meets what is desperately necessary.” Our hope at insitro is that big data and machine learning, applied to the critical need in drug discovery, can help make the process faster, cheaper, and (most importantly) more successful. To do so, we plan to leverage both cutting-edge ML techniques and the profound innovations that have occurred in life sciences, which enable the creation of the large, high-quality data sets that may transform the capabilities of ML in this space. Seventeen years ago, when I first started to work in the area of machine learning for biology and health, a “large” data set was a few dozen samples. Even five years ago, data sets with more than a few hundred samples were a rare exception. We now live in a different world. We have human cohort data sets (such as the UK Biobank), which contain enormous amounts of high-quality measurements, molecular as well as clinical, for hundreds of thousands of individuals. At the same time, a constellation of remarkable technologies allows us to construct, perturb, and observe biological model systems in the laboratory with unprecedented fidelity and throughput.

Using these transformative innovations, we plan to collect and use a range of very large data sets to train ML models that will help address key problems in the drug discovery and development process. To enable this, we will use high-quality data that has already been collected, but we will also invest heavily in creating our own data sets using high-throughput experimental approaches, data sets that are designed explicitly with machine learning in mind from the very start. The ML models that are developed will then help guide subsequent experiments, providing a tight, closed-loop integration of in silico and in vitro methods (an insitro paradigm).

To succeed, our effort must bridge the disciplines of life sciences, engineering, and data science, and will require significant innovation and collaboration from experts across these areas. We are fortunate to have strong support from top investors in both biotech and tech (ARCH Venture Partners, Foresite Capital, a16z, GV, and Third Rock Ventures), who will provide their experience, their networks, and their capital to this effort. At the same time, we also plan to build a remarkable team that embodies a new type of culture, one based on a true partnership between scientists, engineers, and data scientists, working closely together to define problems, design experiments, analyze data, and derive insights that will lead us to new therapeutics. We believe that building this team and this culture well is as important to the success of our mission as the quality of the science or the machine learning that these different groups will create. As we work towards the challenging and aspirational goal of insitro, we’re actively seeking talented and passionate individuals who share our mission and vision.

There is a lot of hype today around machine learning, with hyperbolic promises that it will magically solve all of humankind’s problems (and dire warnings that it will lead to the destruction of humankind). We at insitro don’t expect ML to be the solution to all of the problems in drug development, nor to be the magic bullet that helps find a treatment for every disease. However, we do believe that the time is right to rethink the drug design process using a different and more modern toolkit, in the hope that a new paradigm may help us cure more people, sooner, and at a much lower cost.