Drugs, Data, and Deep Learning

Why it’s taken so long to disrupt drug discovery. And how we’re finally doing it.

Harry Rickerby
LabGenius
6 min read · Mar 6, 2020


Technology has fundamentally disrupted nearly every industry on Earth, creating immense value, particularly for startups and their investors. Until now, one clear exception has been pharmaceuticals. There just isn’t an Amazon or an Uber of the pharma world. Outsiders are dissuaded from getting involved by the sheer scientific complexity of the field, intensive capital requirements, and complex regulations that require careful navigation. This has, in part, allowed large, 50–150-year-old multinationals to continue to dominate.

Times are changing, though: tech companies and investors are getting increasingly involved, channelling significant investment into computational approaches to target and drug discovery. Start-ups in this space raised over $1B in 2018 alone, while Facebook, Google, and Microsoft have all started projects in drug discovery. Google’s DeepMind in particular has demonstrated the potential of deep learning for drug design through its AlphaFold project. Pharma companies are excited too, with giants like GSK, Novartis, and AstraZeneca investing in their own internal AI/ML programmes.

Why now?

So why does everyone, from pharma execs to tech entrepreneurs, believe this is the time for a technological revolution? Over the past 30 years, the process of drug discovery has changed very little. A starting pool of candidate molecules is whittled down through a series of screens and tests, starting in the test tube, then moving into animals, and finally into people. If you’re really lucky, your molecule will make it to clinical trials; even then, it has only a 1 in 10 chance of proving safe, doing something useful, and being approved. The opportunity for technology is rooted in one simple fact: this process is failing us.

This is illustrated by Eroom’s law, the trend that keeps every pharma executive and drug hunter awake at night: since 1950, the number of approved drugs produced per billion dollars (adjusted for inflation) invested in drug discovery has fallen exponentially. Today, it costs the industry an average of $2bn to take a drug from discovery through to the market; in the 1960s, it cost just $100m. The trend is driven by a number of factors, but most prominently by failure rate. Drugs fail for a host of reasons: because they are toxic, because they aren’t efficacious, because they are immunogenic, because they aggregate, because they can’t be produced at scale, because their in vitro characteristics aren’t replicated in vivo, because they are unstable, and many more besides. Finding a drug that meets every single one of these criteria is a game of complex multi-parameter optimisation, and it is becoming increasingly difficult to find drugs that tick all of the boxes.
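
To put the figures quoted above on a common footing, here is a back-of-the-envelope sketch in Python. The $100m and $2bn endpoints come from the paragraph above; the 55-year span between them is an assumed round number, used purely for illustration.

```python
# Back-of-the-envelope arithmetic behind Eroom's law, using the figures
# quoted above: ~$100m per approved drug in the 1960s, ~$2bn today
# (both inflation-adjusted). The 55-year span is an assumed round number.
import math

cost_1960s = 100e6  # USD per approved drug, mid-1960s
cost_today = 2.0e9  # USD per approved drug, today
years = 55          # assumed span between the two estimates

# Compound annual growth rate implied by the two endpoints.
cagr = (cost_today / cost_1960s) ** (1 / years) - 1

# How long the cost per approved drug takes to double at that rate.
doubling_time = math.log(2) / math.log(1 + cagr)

print(f"Implied cost growth: {cagr:.1%} per year")                # ~5.6% per year
print(f"Cost per drug doubles every ~{doubling_time:.0f} years")  # ~13 years
```

Growth of this kind is exactly why the trend gets called a ‘law’: at a steady few percent a year, costs don’t creep, they compound.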

Figure: Eroom’s law in drug discovery R&D. Data source: https://www.ncbi.nlm.nih.gov/pubmed/26928437

The consequences of these growing development costs are grave. Today, the cost of drug discovery means that it is limited to diseases with the potential for large economic returns. The market systematically fails people with rare diseases and diseases of the developing world: without government incentives like the US Orphan Drug Act, it is not economically viable for companies to gamble so much capital on drugs that have no chance of becoming blockbusters. This is particularly concerning at a time when personalised healthcare appears to be one of the most promising avenues for treating disease; without a dramatic decline in drug discovery and development costs, we won’t be able to capitalise on it. And if drug discovery R&D costs continue to grow, drug discovery will cease to be a viable business model altogether.

How can technology reverse Eroom’s law?

We have been here before: in the early 2000s, there was great expectation that technology would transform the productivity of the pharma industry. Despite significant investment at the time, productivity has continued to fall. So what’s different this time around? Could machine learning and AI really be transformative here?

The short answer is yes. Modern machine learning methods are capable of capturing and uncovering incredibly complex, multi-faceted patterns, and are particularly good at establishing correlations that are completely non-intuitive to a human. These strengths make drug discovery a perfect application for machine learning: the basis for a successful drug is multifactorial, and our understanding of those factors is still very limited.

The power, and challenge, of good biological data

The real question is not whether machine learning could be a useful tool in drug discovery, but whether we can provide it with the requisite data. Machine learning is only as effective as the quality and quantity of the data used to train it. It may be able to point out complex correlations that a human would miss, but it cannot do so in a vacuum.

Could publicly available datasets provide the answer? The public database PubChem, for example, contains over 250 million data points from over 1 million biological activity experiments. What about proprietary datasets held within established companies? Many have decades of data from previous discovery campaigns. Surely these public and private datasets provide a wealth of information that could feed into machine learning models?
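
For a sense of how accessible the public data is, here is a minimal sketch that pulls a compound’s bioactivity summary from PubChem’s PUG REST API. The URL follows PubChem’s documented PUG REST conventions, CID 2244 is aspirin, and error handling is kept deliberately thin; treat it as an illustration rather than production code.

```python
# Minimal sketch: fetch the bioassay summary for one compound from
# PubChem's PUG REST API. CID 2244 is aspirin. Requires the `requests`
# package; the response is a CSV table of bioactivity records.
import requests

BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"
cid = 2244  # aspirin

# The assaysummary operation lists bioactivity results recorded against
# this compound across PubChem's BioAssay collection.
url = f"{BASE}/compound/cid/{cid}/assaysummary/CSV"
resp = requests.get(url, timeout=30)
resp.raise_for_status()

rows = resp.text.splitlines()
print(f"{len(rows) - 1} bioactivity records for CID {cid}")
print(rows[0])  # header row naming the assay and activity columns
```

Retrieving the data is the easy part. As the next paragraph argues, the hard part is that records like these were produced by thousands of labs under thousands of subtly different protocols.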

The problem is that these experimental datasets were not generated with machine learning in mind: the data is not structured in a way that makes it amenable to machine learning. Datasets produced in different labs, or by different people, are generated using subtly different protocols and different equipment, often without appropriate documentation. Data can be incomplete, lacking the negative controls needed to avoid modelling the experimental assay rather than the underlying biology. And biological data is noisy, meaning machine learning will find ‘trends’ where there are none.
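
That last point is easy to demonstrate. In the sketch below (scikit-learn; every number is synthetic), a flexible model is trained on pure noise: features and ‘activity’ labels that are random and entirely unrelated. It scores almost perfectly on the data it has seen, and at chance on data it hasn’t, which is exactly what a model that has ‘found’ non-existent trends looks like.

```python
# Demonstration: a flexible model will happily find "trends" in pure
# noise. The features and labels below are random and unrelated.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))    # 200 "compounds", 50 random features
y = rng.integers(0, 2, size=200)  # random binary "activity" labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

print(f"Training accuracy: {model.score(X_train, y_train):.2f}")  # ~1.00
print(f"Held-out accuracy: {model.score(X_test, y_test):.2f}")    # ~0.50, i.e. chance
```

Held-out validation catches the problem here, but only because we know the labels are noise; with real assay data, a model fitting artefacts of the assay itself can look genuinely predictive.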

This issue is highlighted by a concept known as the ‘Data Science Hierarchy of Needs’, which illustrates that AI and deep learning can only be usefully applied once the challenges of data collection, processing, storage, and analysis have been addressed. Even outside biology, starting at the top of this pyramid without building its foundations remains a common pitfall, particularly for start-ups trying to generate buzz through their use of AI, or corporates who feel the need for an “AI strategy” without truly understanding what it entails. In biology, these requirements are ignored even more frequently because of the massive investment required to establish a high-quality, high-throughput data flow from a biological laboratory.

Most of LabGenius’ formative years have been spent establishing these foundations: putting in place reliable, high-throughput, automated processes for producing and storing the right data before dipping our toes into machine learning. To ensure that the right data is produced, data scientists are instrumental in designing our experimental workflows; key experiments are simply not designed without these people in the room. We have learned the hard way that without them, data that would traditionally be seen as useful ends up being scrapped. This work is costly, and it isn’t especially sexy, but to really capitalise on the potential of ML in drug discovery, it cannot be overlooked.

The way to harness the power of computation and machine learning is to build from the ground up: a platform that not only analyses existing biological data, but also generates the right data, processes it, and stores it appropriately. Ignore these foundations and we risk machine learning becoming yet another footnote in our efforts to cure disease, another technology that promised to transform drug discovery and failed to deliver. Invest in them, and we will start to unravel the unintuitive and unintelligible rules that govern why some drugs work and others don’t.
