Computer Aided Drug Discovery

Andy D Lee
Oct 25, 2017 · 8 min read


Most deaths today come from causes that would be preventable if we understood biology and disease well enough to craft efficacious drugs. While technology has disrupted many fields, those advances have not yet improved the efficiency of drug discovery. Below I lay out some of the puzzle pieces that have now aligned to enable an effective full-stack computational discovery program, and how we’re using them at NeuroInitiative to develop new therapeutics for diseases like Parkinson’s and Alzheimer’s.

Source: Trends in Biotechnology 2017. The Business of Anti-Aging Science

While there has been great progress in many disease areas like cancer, heart disease and HIV, disorders of the nervous system have seen few new therapeutics, and their prevalence is rising as life expectancy increases.

Source: Alzheimer’s Association 2017. Facts and Figures

This slow (or in some areas non-existent) progress isn’t for lack of trying. Typical drug development costs now exceed $2.5 billion for each new drug, a number that has consistently climbed over the past 40 years.

Source: Scientific American 2014. Cost to Develop New Pharmaceutical Drug Now Exceeds $2.5B

This trajectory is striking when compared to technology, where costs keep dropping even as capabilities advance.

Source: Medium @BalintBotz 2016. Moore’s and Eroom’s Law in a Graph — Skyrocketing Pharma R&D Costs Despite Quantum Leaps in Technology

So now let’s dive a little deeper into the normal process for drug development and the common causes of failure. It is usually described as a funnel: thousands of molecules are screened to identify a few hundred to take through deeper pre-clinical development in disease models, which are narrowed to a handful that will be tried in humans, ultimately yielding a single approved new drug.

Source: Genome Research Limited 2016. How are drugs designed and developed

Two points from this process are striking: first, effectiveness isn’t determined until Phase II, almost 15 years into the process; and second, attrition rates in the early stages are very high. Screening fewer molecules and substantiating efficacy earlier are therefore major opportunities for improvement. Driving the point home, when you look at the stats on why drugs fail in clinical trials, over half fail for lack of efficacy.

Source: Statsols 2017. Why do phase III Clinical Trials Fail

Now we get into the fun part: how can we make this situation better, and why might it work now when it hasn’t in the past? I see two key enabling changes in the last few years that open up options. First, there is a ton of publicly available data, both in peer-reviewed journal articles and aggregated/curated into databases that can be built upon. Second, compute capacity has exploded, driven largely by Nvidia’s efforts to apply GPUs to non-graphical workloads. Coupled with access on cloud platforms, it’s now possible to spin up a 100,000-core cluster in minutes and stop paying for it when you’re done. This kind of compute power was limited to a few supercomputers around the world just a few years ago, and its new availability has really opened the way to new methods and broader adoption.

To dig into the types of data that are available, and how we think about computational biology at NeuroInitiative, let’s go back to basics (with a simplified view of the world, and acknowledging that this is my engineer’s view of biology). As the “Central Dogma” describes, DNA is transcribed to RNA, RNA is translated to protein, and protein serves as the functional building block of cellular machinery. A vast and heavily interconnected network of interactions yields dynamic living cells from these building blocks. Data describing the key layers of this system break down into genomic, transcriptomic, proteomic, and interactomic types.

There are some great resources for each of these. http://gnomad.broadinstitute.org/ is adding an exome every 12 minutes with metadata attached, yielding petabytes of joy for miners looking to dig up new associations between variants and disease.
https://www.snpedia.com/ has descriptions and disease details for known single-nucleotide polymorphisms that can be used to cross-reference or identify SNPs of interest for research.
If you want to research your own DNA, https://www.23andme.com/ has an at home test and will run a saliva sample through a gene chip to identify common trait, ancestry, and disease-linked variants. Beyond their stock reports, you have access to the raw data and can do some fun analysis against the other DNA DBs.
Moving into the transcriptomic data, the Allen Institute has some great resources, including http://www.brain-map.org/ which has mRNA data from human samples for over 1,000 brain regions. They are also breaking new ground with their cell atlas project to characterize a broad array of cell types.
For protein data http://www.uniprot.org/ is the go-to with structure, sequence, names, modifications, etc. all in one place.
The interactome is the least built-out of the data types, but it is improving with NLP and several intensive curation efforts. My favorite right now is https://thebiogrid.org/ which has about 1.5 million protein-protein interactions, a great API, the ability to grab the whole database, and a nice online UI to get started.
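To make that concrete, here’s a minimal sketch of pulling LRRK2’s interactors from the BioGRID REST web service. The endpoint, parameter names, and JSON field names below follow my reading of BioGRID’s public web-service documentation and may have shifted since, so verify against the current docs; you’ll also need to swap in your own (free) access key.

```python
# Minimal sketch: pull protein-protein interactions for one gene from the
# BioGRID REST service. Parameter and field names are based on BioGRID's
# public web-service docs and may have changed; ACCESS_KEY is a placeholder.
import requests

ACCESS_KEY = "YOUR_BIOGRID_ACCESS_KEY"   # free key from webservice.thebiogrid.org
BASE_URL = "https://webservice.thebiogrid.org/interactions/"

params = {
    "accesskey": ACCESS_KEY,
    "geneList": "LRRK2",       # gene symbol(s), pipe-separated
    "searchNames": "true",     # match against official symbols
    "taxId": 9606,             # restrict to human
    "format": "json",
}

resp = requests.get(BASE_URL, params=params, timeout=30)
resp.raise_for_status()
interactions = resp.json()     # expected: dict keyed by BioGRID interaction ID

partners = set()
for record in interactions.values():
    partners.update({record["OFFICIAL_SYMBOL_A"], record["OFFICIAL_SYMBOL_B"]})
partners.discard("LRRK2")
print(f"{len(partners)} unique direct interactors of LRRK2")
```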

Of course, the next challenge is to unify these disparate data sources into a cohesive picture of the system. Given differences in naming and the varying maturity of the sources, parts of this get interesting. The problem is technically straightforward, and common Extract/Transform/Load patterns get the job done, but it does require some understanding of the data. Further, most sources are dirty, with duplicate or incorrect entries, so some form of curation is necessary. Building a consolidated database has been a key effort for our team, and it sets the foundation for a variety of uses.
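To give a flavor of what that ETL work looks like (this is not our actual pipeline), here is a toy normalization and deduplication step. The symbol-to-UniProt map and the record fields are made up for illustration; in practice the map would be built from UniProt’s ID-mapping files.

```python
# Illustrative ETL step: normalize protein identifiers from two hypothetical
# sources to UniProt accessions and drop duplicate interaction records.
from collections import OrderedDict

# Hypothetical symbol -> UniProt accession map (hard-coded here only for the demo).
SYMBOL_TO_UNIPROT = {"LRRK2": "Q5S007", "SNCA": "P37840", "RAB10": "P61026"}

def normalize(record):
    """Map gene symbols in a raw interaction record to UniProt accessions."""
    a = SYMBOL_TO_UNIPROT.get(record["protein_a"].upper())
    b = SYMBOL_TO_UNIPROT.get(record["protein_b"].upper())
    if a is None or b is None:
        return None                   # unmapped entries go to a curation queue
    return tuple(sorted((a, b)))      # order-independent key for deduplication

raw_records = [
    {"protein_a": "LRRK2", "protein_b": "RAB10", "source": "db1"},
    {"protein_a": "Rab10", "protein_b": "Lrrk2", "source": "db2"},  # same pair, different naming
    {"protein_a": "LRRK2", "protein_b": "SNCA",  "source": "db1"},
]

deduped = OrderedDict()
for rec in raw_records:
    key = normalize(rec)
    if key is not None and key not in deduped:
        deduped[key] = rec

print(f"{len(raw_records)} raw records -> {len(deduped)} unique interactions")
```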

For a given biological challenge, and with this foundation of data to work with, we usually start with a glimpse at the interactome to get an idea of what a protein of interest does in the cell. Some great tools already exist for this, including Cytoscape for visualization and the NetworkX Python library for path-finding or more custom analysis. This still presents challenges though; for instance, LRRK2 has over 150 direct interactors (left) and explodes into an incomprehensible mess when blended into the context of other Parkinson’s disease related mutations (right).
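Once interactions are loaded into NetworkX, the kinds of queries we lean on look roughly like this. The handful of edges here is a tiny illustrative subset; in practice they would come from a consolidated interaction database such as BioGRID.

```python
# Sketch of simple interactome queries with NetworkX on a toy graph.
import networkx as nx

G = nx.Graph()
G.add_edges_from([
    ("LRRK2", "RAB10"), ("LRRK2", "14-3-3"), ("LRRK2", "SNCA"),
    ("SNCA", "PRKN"), ("PRKN", "PINK1"), ("RAB10", "RILPL1"),
])

# Direct interactors of a protein of interest
print("LRRK2 neighbors:", sorted(G.neighbors("LRRK2")))

# Shortest path between two Parkinson's-linked proteins
print("LRRK2 -> PINK1 path:", nx.shortest_path(G, "LRRK2", "PINK1"))

# Simple centrality ranking to prioritize hubs in the neighborhood
for node, score in sorted(nx.degree_centrality(G).items(),
                          key=lambda kv: -kv[1])[:3]:
    print(f"{node}: degree centrality {score:.2f}")
```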

This complexity, along with the lack of visibility available from current lab techniques, drove us to create the SEED simulation platform to provide data-driven in silico cells that we can use for experimentation. Blending the omics and physical properties of biochemical entities with a physics engine to integrate dynamic positions and interactions, we can create a homeostatic healthy cell. From there we introduce mutant variants of entities and drugs to model disease and intervention. Below is a screenshot showing the cellular structure for locational context, zooming into a synaptic bouton where you can see biochemical entities and organelles.
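SEED itself is proprietary, but to give a flavor of what “simulating biochemical entity counts over time” means, here is a toy Gillespie-style stochastic simulation of a single phosphorylation/dephosphorylation cycle. The rates and molecule counts are invented, and this is not how SEED is implemented.

```python
# Toy Gillespie (stochastic simulation algorithm) example: one
# phosphorylation/dephosphorylation cycle. Not the SEED platform, just a
# minimal illustration of simulating entity counts over time.
import random

k_phos, k_dephos = 0.5, 0.3        # invented rate constants (per second)
substrate, phospho = 1000, 0        # invented initial molecule counts
t, t_end = 0.0, 60.0

while t < t_end:
    a1 = k_phos * substrate         # propensity of phosphorylation
    a2 = k_dephos * phospho         # propensity of dephosphorylation
    a_total = a1 + a2
    if a_total == 0:
        break
    t += random.expovariate(a_total)            # exponential waiting time
    if random.random() * a_total < a1:
        substrate, phospho = substrate - 1, phospho + 1
    else:
        substrate, phospho = substrate + 1, phospho - 1

print(f"After {t_end:.0f}s: {phospho} phosphorylated of {substrate + phospho} total")
```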

As with any new method, our first question is whether, and to what extent, it works. Our benchmark for accuracy is replication of published biological studies, and so far things are looking pretty good. Comparing a variety of data points (e.g. phosphorylated protein, GTP-bound, etc.) by plotting the log2 fold change from published biological studies against our simulation results for the same manipulation (mutant knock-in, knock-out, overexpression, etc.), we see a Pearson’s correlation coefficient of 0.83. We’re just scratching the surface, but this gives us a pretty good indication that our simulation is predictive of biology. To validate new simulation findings which haven’t been previously reported, we have to get back into the lab. We are fortunate to be working with Dr. Heather Melrose at the Mayo Clinic, who is using transgenic LRRK2 knock-in models to measure the downstream changes that we see in simulation.
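The comparison itself is simple to express: in principle it is just Pearson’s r over matched log2 fold changes, along these lines. The numbers below are placeholders, not our validation data.

```python
# Pearson correlation between published and simulated log2 fold changes
# for matched readouts (e.g. phospho-protein, GTP-bound fraction).
import numpy as np
from scipy.stats import pearsonr

published_lfc = np.array([1.2, -0.8, 0.4, 2.1, -1.5, 0.0, 0.9])  # placeholder values
simulated_lfc = np.array([1.0, -0.6, 0.7, 1.8, -1.2, 0.2, 1.1])  # placeholder values

r, p_value = pearsonr(published_lfc, simulated_lfc)
print(f"Pearson r = {r:.2f} (p = {p_value:.3f})")
```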

So, now we have a new method for simulating biology that looks good. What do we do with it? Over the last year, in a project funded by the Michael J. Fox Foundation, we’ve built LRRK2 G2019S knock-in simulations to try to identify new targets. When running simulations we measure quantities of all biochemical entities across all time points, so post-simulation we can start with an unbiased screen (left): a clustered heat map where each row is an entity, each column is a time point, and color indicates log2 fold-change. Two key observations: first, most things don’t change, confirming it’s a big haystack in which to find needles; and second, some changes are transient, cycling higher and lower, so catching these in lab measurements might be hard or impossible. Next we can dive deeper into individual changes and see the impact of the mutation on a single entity (right).
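For anyone who wants the same kind of unbiased look at their own data, a clustered heat map like the one described takes only a few lines with pandas and seaborn. The matrix below is random noise with a couple of injected “hits” to mimic the sustained and transient changes we see; it is not our simulation output.

```python
# Clustered heat map of log2 fold change per entity per time point (toy data).
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)
entities = [f"entity_{i}" for i in range(50)]
timepoints = [f"t{j}" for j in range(12)]

lfc = rng.normal(0, 0.05, size=(50, 12))           # most entities barely change
lfc[3] += 1.5                                       # a sustained increase
lfc[17] += np.sin(np.linspace(0, 3 * np.pi, 12))    # a transient, cycling change

df = pd.DataFrame(lfc, index=entities, columns=timepoints)
g = sns.clustermap(df, cmap="vlag", center=0, col_cluster=False,
                   figsize=(6, 8), cbar_kws={"label": "log2 fold change"})
g.savefig("screen_heatmap.png", dpi=150)
```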

Once we’ve seen alterations downstream of a genetic manipulation, we can start testing hypothetical molecules in the simulation to reverse the abnormalities and bring the cell back to “healthy” normal levels. As seen below, one of our experimental molecules is able to partially rescue several of the changes caused by the LRRK2 mutation, to varying degrees. This allows us to validate and prioritize the biology around drug targets before diving into the chemistry, in a very short time and at low cost.
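One simple way to put a number on “partial rescue” is the fraction of the disease-induced shift that a candidate molecule reverses. The readouts and values below are invented for illustration only.

```python
# Quantify partial rescue: how far a candidate molecule moves each altered
# readout back toward healthy levels. All values are invented placeholders.
readouts = {
    # readout: (healthy, mutant, mutant + candidate molecule)
    "phospho-RAB10":   (1.0, 2.4, 1.5),
    "GTP-bound LRRK2": (1.0, 1.8, 1.2),
    "autophagy flux":  (1.0, 0.5, 0.8),
}

for name, (healthy, mutant, treated) in readouts.items():
    disease_shift = mutant - healthy
    recovered = mutant - treated
    pct_rescue = 100 * recovered / disease_shift if disease_shift else 0.0
    print(f"{name}: {pct_rescue:.0f}% rescue")
```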

Now we have a promising target with a mechanistic understanding of how it ties into disease, and we need to shift into chemistry to identify a chemical compound that can hit that target to activate or inhibit its interactions. Computational chemistry has had a lot of success and has become a core tool for most drug development operations, so there are a lot of good resources in this space. Databases like ChEBI and PubChem contain hundreds of thousands of known compounds with structures and known binding partners, and the PDB has structural information for most proteins. Using this data, the primary approaches are docking and molecular dynamics with packages like MGLTools from Scripps or Schrodinger’s suite, or using structure as features to drive AI algorithms that predict ligand binding.
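As one small example of the lighter-weight end of that toolbox (RDKit here, which is not one of the packages named above), fingerprint similarity against known binders is a cheap first-pass filter before committing to docking or MD runs. The SMILES strings are placeholders, not real project molecules.

```python
# Rank candidate molecules by Tanimoto similarity of Morgan fingerprints
# against a known binder, as a quick pre-docking filter. SMILES are examples.
from rdkit import Chem
from rdkit.Chem import AllChem, DataStructs

known_binder = Chem.MolFromSmiles("Cc1ccc(cc1)S(=O)(=O)N")   # placeholder ligand
candidates = {
    "cand_A": "Cc1ccc(cc1)S(=O)(=O)NC",
    "cand_B": "c1ccccc1O",
}

ref_fp = AllChem.GetMorganFingerprintAsBitVect(known_binder, radius=2, nBits=2048)
for name, smiles in candidates.items():
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048)
    sim = DataStructs.TanimotoSimilarity(ref_fp, fp)
    print(f"{name}: Tanimoto similarity {sim:.2f}")
```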

Once we have a target and a molecule that look good at the cellular level, we need to consider the impact on the whole body as a system. Pharmacokinetics has been a focus of mathematical modeling for some time, with considerable success that has helped improve safety results for new drugs. Again, the equations, methods, and tools are readily available, and they can be used to refine our molecule design before long and expensive laboratory efforts.
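For instance, the classic one-compartment oral-dosing model is a single closed-form equation, and sketching it takes only a few lines. The parameter values below are invented, not tuned to any real compound.

```python
# One-compartment oral-dosing pharmacokinetics:
#   C(t) = (F * D * ka) / (V * (ka - ke)) * (exp(-ke*t) - exp(-ka*t))
# Parameter values are invented for illustration.
import numpy as np

def concentration(t, dose_mg=100, F=0.8, V_L=40.0, ka=1.2, ke=0.2):
    """Plasma concentration (mg/L) after a single oral dose."""
    return (F * dose_mg * ka) / (V_L * (ka - ke)) * (np.exp(-ke * t) - np.exp(-ka * t))

t = np.linspace(0, 24, 49)                 # hours
c = concentration(t)
print(f"Cmax ~ {c.max():.2f} mg/L at ~{t[c.argmax()]:.1f} h")
```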

Ultimately, computation will only go so far, as we are building physical compounds that will be administered to living beings. All biotech and pharma companies are using some level of computation to improve results, and this is a field that still holds a ton of potential, with gaps to be filled and tools to be improved. Given the value of new therapeutics, both in money and in impact to human life, the incentives to disrupt are high. I encourage anyone who can to join this fight to get better at making medicines!


Andy D Lee

Co-founder of Vincere Bio, where we use tech from NeuroInitiative to find drugs to stop Parkinson’s disease