‘More data, please’ — reflections on a head-spinning week for protein AI at NeurIPS
“More data”: the plea of panelists at this week’s NeurIPS workshop on Machine Learning in Structural Biology (MLSB) when asked, “If you had a magic wand, what would you wish for: more data or more compute?” Their reasoning? The datasets currently available for training protein-design AI models have insufficient coverage of protein sequence and structure space. Many protein families, functional groups, and enzyme classes either have very few (often fewer than a handful of) known ground-truth sequences or are highly redundant, which means an algorithm trained on this data has never ‘seen’ a true representation of these classes. This is exactly the problem we’re tackling at Basecamp Research, a protein AI company that combines discovery and design.
NeurIPS 2022 fell in a week of head-spinning developments in the AI world: OpenAI released ChatGPT, and, more specific to the life sciences, David Baker’s group and Generate Biomedicines released protein design models based on diffusion. These models are built on the same principles as OpenAI’s DALL-E 2 and are capable of generating proteins with specified shapes, symmetries, and other properties such as improved binding affinities. Whilst previous design methods required testing or filtering tens of thousands of candidates, proteins designed by guided diffusion show a markedly higher success rate when tested experimentally.
At NeurIPS, it was exciting to see how the field has moved beyond, and built on top of, the successes of AlphaFold2, RoseTTAFold, and others from the past 2 years. The work presented at the MLSB workshop was incredibly impressive — for example, DiffDock, a diffusion generative model for docking (predicting how a ligand binds to a protein), which stands to significantly advance in silico drug development.
What’s been fascinating to see is that everyone in the field is excited about each other’s successes and the eye-watering rate of collective progress. A key element of our discussions was the question of ‘what’s next?’, which brings us back to the opening question: many of the current limitations center on the redundancy and scarcity of protein sequence training data. On top of this, ‘controllability’ of protein design, that is, the conditional and fully controllable design of a protein with a highly specific function or performance, remains an unattained holy grail for the field.
This is where Basecamp Research comes in. As a first-principles protein AI company, we combine the discovery of novel ground-truth protein sequences from Nature with computational design. As such, we are incredibly excited about how well the progress in protein design complements BaseGraph™, our knowledge graph of never-before-seen protein sequences with more than a billion relationships. BaseGraph™ links hundreds of millions of proteins to their complex chemical and biological contexts, each collected through sustainable partnerships around the globe, from the Antarctic to rainforests and hot springs. Drawing on locations spanning a 110 °C temperature range, BaseGraph™ is 4x more diverse than other protein AI resources such as UniProt, which directly addresses the issue of representative coverage for AI model training. On top of that, with BaseGraph™ being the only protein AI resource with comprehensive environmental metadata, we are excited to leverage these strengths in protein design tools, including those we develop in-house.
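To make the idea of a metadata-rich protein knowledge graph concrete, here is a minimal, purely illustrative sketch: a miniature in-memory graph linking protein records to the environmental context of their sampling sites. All accessions, sequences, field names, and values below are hypothetical examples, not BaseGraph™ data or its actual schema.

```python
# Toy sketch of a protein knowledge graph: nodes carry environmental
# metadata, edges carry typed relationships. Illustrative only.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ProteinNode:
    accession: str
    sequence: str
    habitat: str = "unknown"            # hypothetical metadata field
    sample_temp_c: Optional[float] = None  # temperature at sampling site

@dataclass
class KnowledgeGraph:
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)  # (src, relation, dst) triples

    def add_protein(self, node: ProteinNode) -> None:
        self.nodes[node.accession] = node

    def relate(self, src: str, relation: str, dst: str) -> None:
        self.edges.append((src, relation, dst))

    def from_environment(self, min_temp_c: float) -> list:
        """Proteins sampled at or above a given temperature."""
        return [n for n in self.nodes.values()
                if n.sample_temp_c is not None and n.sample_temp_c >= min_temp_c]

kg = KnowledgeGraph()
kg.add_protein(ProteinNode("BC_0001", "MKTAYIAKQR", habitat="hot spring", sample_temp_c=85.0))
kg.add_protein(ProteinNode("BC_0002", "MADEEKLPPG", habitat="Antarctic soil", sample_temp_c=-5.0))
kg.relate("BC_0001", "sampled_from", "hot spring")

# Environmental metadata lets us query, e.g., for thermophile-derived proteins:
print([p.accession for p in kg.from_environment(min_temp_c=60.0)])  # ['BC_0001']
```

The point of the sketch is the query at the end: because every node carries its environmental context, questions like “which proteins come from hot environments?” become simple graph queries, which is exactly the kind of conditioning signal that sequence-only databases cannot provide.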
One example of a new protein design tool we are currently developing to address controllability is ZymCTRL: a large conditional protein language model that generates user-specified enzymes from a simple prompt. We presented this work at NeurIPS as part of our collaboration with Noelia Ferruz (University of Girona), with first authors Geraldene Munsamy (Basecamp Research) and Sebastian Lindner (University of Erlangen & Heidelberg). The generated sequences have already been verified in silico through structure prediction and docking, and are currently being tested in the lab for an upcoming publication. We are also fine-tuning the model on novel ground-truth sequences from BaseGraph™ for under-represented enzyme classes, again demonstrating how the discovery of more natural sequences improves protein AI model performance.
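The conditional-prompting pattern behind a model like ZymCTRL can be sketched in a few lines: an enzyme class (an EC number) acts as the control tag, and the model returns candidate sequences for that class. The sketch below is a toy mock, not ZymCTRL’s actual interface — the prompt format, helper names, and the random stand-in “generator” are all our own illustrations so the example runs without downloading a real model.

```python
# Toy sketch of conditional enzyme generation: EC class in, sequence out.
# Illustrative only; not ZymCTRL code or its real prompt format.
import random
import re

EC_PATTERN = re.compile(r"^\d+\.\d+\.\d+\.\d+$")  # e.g. "1.1.1.1"
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"              # the 20 standard residues

def make_prompt(ec_number: str) -> str:
    """Wrap the EC class as a control tag that conditions generation."""
    if not EC_PATTERN.match(ec_number):
        raise ValueError(f"not a valid EC number: {ec_number}")
    return f"<{ec_number}>"

def mock_generate(prompt: str, length: int = 30, seed: int = 0) -> str:
    """Stand-in for a real model's generate() call: returns a random
    amino-acid string so this sketch runs anywhere."""
    rng = random.Random(seed)
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))

prompt = make_prompt("1.1.1.1")   # condition on alcohol dehydrogenases
sequence = mock_generate(prompt)
print(prompt, sequence)
```

In practice the `mock_generate` stub would be replaced by sampling from the trained conditional language model; the design point is that the user only supplies the enzyme class, and the conditioning tag steers generation toward that class.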
In summary, we’re incredibly excited about the breadth of new advances in protein ML design, and about how our work at Basecamp Research will continue to address the limitations that currently stand in the way of the impact this field can have on healthcare, environmental solutions, and fundamental science.