Beyond the hype of AI/ML in biomedicine
What are the real capabilities of AI/ML, and what is needed to incorporate it efficiently into day-to-day scientific research?
In the last few years, AI/ML has undoubtedly lived up to expectations. Since the early 2010s, AI/ML tools have multiplied, from a neural network capable of recognizing images of cats without previously labelled data to DeepMind's AlphaGo, the algorithm that beat the Go world champion. Every year, tools and methods evolve, with notable applications such as GPT-3, the language model behind the "a robot wrote this article" headline, or the recently released AlphaFold2, which predicts the 3D structure of proteins, one of the core challenges of biology, more accurately than any other algorithm (at Ersilia, we have already successfully tested it to generate the structure of the malaria parasite target PfATP4, for which we are trying to find new drug candidates). These are all giant steps on the AI/ML roadmap, but some fear the technology is over-hyped and will soon enter its third winter.
Our field, biology, and particularly drug discovery, is also riding the crest of this wave, with publications that use some kind of AI/ML doubling each year (State of AI Report, 2020, https://www.stateof.ai/). Conversely, though, in a survey of over 300 scientists, BenchSci reported that effective AI/ML implementation was still out of reach for many organizations, the main barriers being lack of expertise, benchmarks and trust.
Undoubtedly, the novelty of the technology and the lack of thorough validation have slowed the adoption of AI/ML methods in scientific projects. But between those who fear a computer rebellion for world control and those who think of AI/ML as a sort of magic wand that will solve all our problems out of the box, there is a middle ground where the technology can thrive and provide validated answers to scientific questions. This requires three basic pillars: education, data sharing and transparency.
Education
Experimentalists often have little to no computer science knowledge (not a criticism: I myself wasn't capable of writing more than two lines of code a year ago), and this creates a major barrier when it comes to AI/ML. We either do not use it because we don't understand what it can do, or we seek the support of data scientists with the idea that, by simply pressing a button, they will give us the definitive answer. The latter is particularly damaging to the technology, as overly trusting scientists disappointed by the apparent wizardry of data scientists can become hyper-sceptical and dismiss the many benefits that AI/ML methods (within certain limits) can deliver. We have encountered such characters more than once, so demystifying AI/ML models and clearly explaining the boundaries of what can be expected from them has become central to the Ersilia Open Source Initiative (EOSI). Better knowledge of how AI/ML works will facilitate collaboration between experimental and computer scientists, and interactive platforms such as the Ersilia Model Hub (for drug discovery models) or the similar Hugging Face (for natural language processing models) will become a central asset of these collaborations. Moreover, the development of state-of-the-art automated modelling (AutoML) enables non-experts to take a glimpse inside the black box and play with entry-level models, as the sketch below illustrates.
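To show how low the entry barrier has become, here is a minimal sketch of an "entry-level" model built with the general-purpose scikit-learn library. It is purely illustrative: the file name and column names are hypothetical placeholders, not an Ersilia dataset.

```python
# A minimal sketch of an "entry-level" model: train a baseline
# classifier on a tabular dataset and estimate its performance.
# "features.csv" and the "active" label column are hypothetical.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

data = pd.read_csv("features.csv")      # hypothetical dataset
X = data.drop(columns=["active"])       # numeric feature matrix
y = data["active"]                      # binary activity label

model = RandomForestClassifier(n_estimators=200, random_state=42)

# 5-fold cross-validation gives a first, honest performance estimate
scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
print(f"ROC-AUC: {scores.mean():.2f} +/- {scores.std():.2f}")
```

A dozen lines like these will not replace a data scientist, but they let experimentalists see with their own eyes what a model can and cannot do.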
Data
A model is only as good as the data it sees. More than 70% of the effort to develop a new tool is spent on assembling a training dataset of good enough quality. Too much low-quality data will create noise; too little data will lead to overfitting, rendering the model useless when presented with new samples; and, of course, datasets can contain all kinds of biases and imbalances that must be accounted for before training the algorithm (a few basic sanity checks are sketched below). To avoid these issues, proper labelling of datasets and publication in open-access repositories are key to the advancement of AI/ML applications in biology. Luckily, more and more top scientific journals require publication of the original data in dedicated repositories (GEO for genomic data, PubChem for bioassays, etc.), facilitating their reuse. It then becomes a matter of choosing the adequate data points to build models that perform at their best in your domain.
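To make this concrete, here is a minimal sketch of the kind of sanity checks worth running before any training. The file and column names are hypothetical placeholders.

```python
# A minimal sketch of pre-training dataset sanity checks:
# duplicates, missing values and class imbalance.
# "training_set.csv", "smiles" and "active" are hypothetical names.
import pandas as pd

data = pd.read_csv("training_set.csv")

# Duplicated compounds can leak between training and test splits,
# inflating the apparent performance of the model
n_dup = data.duplicated(subset="smiles").sum()
print(f"Duplicated entries: {n_dup}")

# Missing values silently shrink or bias the training set
print(data.isna().sum())

# A strongly imbalanced label (e.g. 2% actives) calls for
# resampling or class weights before the model is trained
print(data["active"].value_counts(normalize=True))
```

None of these checks is sophisticated, yet skipping them is one of the most common reasons a promising model fails on new samples.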
Explainability and transparency
Finally, there is an understandable lack of trust in AI/ML methods because, on many occasions, the rationale followed by the computer to produce the results cannot be explained. In fact, the most common concern from reviewers when we submit grants and project proposals is how we can assess the reliability of our predictions. Fully understanding and explaining what drives AI/ML decisions is not yet feasible, although some companies are working hard to provide solutions to the issue (https://analyticsindiamag.com/top-initiatives-on-explainable-ai/), and feature-attribution tools already offer a partial view, as sketched below. To mitigate this problem, we need to work on the transparency and reproducibility of our methods. The best way to achieve this is to open-source all the code (currently, less than 15% of scientific papers publish their code), enabling the community to test, refine and improve upon the work, making it more robust and reliable.
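As an illustration of such partial explanations, the sketch below uses the open-source SHAP library to rank which input features push a trained model's predictions up or down. It is an illustrative example with hypothetical file and column names, not a description of our pipeline.

```python
# A minimal sketch of post-hoc explainability with the SHAP library:
# ranking the input features that drive a trained model's predictions.
# "features.csv" and the "active" column are hypothetical names.
import pandas as pd
import shap
from sklearn.ensemble import RandomForestClassifier

data = pd.read_csv("features.csv")
X, y = data.drop(columns=["active"]), data["active"]

model = RandomForestClassifier(n_estimators=200, random_state=42).fit(X, y)

# TreeExplainer computes SHAP values efficiently for tree ensembles
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# The summary plot ranks features by their overall impact on the
# predictions, offering a partial glimpse inside the "black box"
shap.summary_plot(shap_values, X)
```

Attribution plots like this do not prove a model is right, but they let domain experts spot when a model is right for the wrong reasons.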
At Ersilia, we are working hard to achieve a sustainable implementation of AI/ML methods in day-to-day research, taking into account both their limitations and their potential to revolutionize the biomedical field, particularly to speed up research on pressing health issues that are currently neglected. Let me finish this blogpost with a brief applied example of our work:
Early in May, we started a short collaboration to identify new antimalarial hits in a particular chemical space, within the framework of the Open Source Malaria (OSM) project (if this piques your curiosity, you can read a detailed and beautiful explanation of how and why we did it in this other blogpost by Ersilia co-founder Miquel Duran-Frigola). We chose to apply already developed methods to the problem rather than developing our own algorithms from scratch, first to save time and effort, but also to use something that had already been validated by others and was, therefore, more trustworthy. Thanks to what is called "reinforcement learning", i.e. the machine learns and improves over iterative rounds, we generated over 300,000 new molecules with predicted activity against the Plasmodium falciparum parasite. Our starting point was an excellent dataset of 400 molecules tested in vitro by OSM and readily available via their GitHub repository. Hence, the accessibility of data and validated algorithms reduced the problem to "simply" adapting the tools to our question of interest.

After several rounds of generative modelling, the only remaining issue was how to screen all the computer-proposed solutions. Here, the expertise of the OSM scientists, who have been working on the project for over 8 years, was key. Based on their suggestions on which types of chemical substituents were more likely to be active, which molecules were impossible to synthesize in the lab, and which chemical properties were best to optimize to obtain an efficient drug, we narrowed the 300,000 predictions down to a manageable 90 selected compounds that can be manually screened by the OSM scientists. To close the cycle, we also published all our results and code in our GitHub repository, and, importantly, all the scientific discussions were also held in the open, for other scientists to learn and contribute. We hope more and more projects will adopt the excellent example of the Open Source Malaria project.
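To give a flavour of what this kind of triage can look like in code, here is a rough sketch using the open-source RDKit cheminformatics library. The property thresholds are hypothetical placeholders, not the OSM chemists' actual criteria, and the input SMILES are toy examples rather than our generated molecules.

```python
# A rough illustration of automatically triaging generated molecules
# before expert review, using RDKit. Thresholds are hypothetical;
# the real selection followed the OSM chemists' criteria.
from rdkit import Chem
from rdkit.Chem import Descriptors, QED

def passes_triage(smiles: str) -> bool:
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:                           # unparsable structure
        return False
    return (Descriptors.MolWt(mol) < 500      # drug-like size
            and Descriptors.MolLogP(mol) < 5  # reasonable lipophilicity
            and QED.qed(mol) > 0.5)           # overall drug-likeness

# Toy SMILES (aspirin, ethanol), not our generated OSM candidates
candidates = ["CC(=O)Oc1ccccc1C(=O)O", "CCO"]
shortlist = [s for s in candidates if passes_triage(s)]
print(shortlist)
```

Filters like these shrink a list of hundreds of thousands of computer-generated molecules to something a medicinal chemist can actually read, which is exactly where automated tools should hand the problem back to human experts.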