Why we’re so excited about the Arc Institute’s new Evo model
At Basecamp Research, we’re on a mission to build the perfect foundational datasets for AI in biology. Now, finally, it feels like the model architectures are beginning to get sophisticated enough to unlock the true potential of what we’ve built.
The pace of development in bioAI is electric, with new models being released each week. They all promise (and many realise) significant steps forward. Earlier this week, the Arc Institute published a new model called Evo, one that we’ve been discussing with them over the past few months. Here, we dive into Evo and why it’s so synergistic with our work at Basecamp Research.
In summary, Evo is the first publicly released biological foundational model that generalises prediction & generation tasks across the fundamental languages of biology: DNA, RNA and proteins. The secret to the model’s power is its ability to apply genomic context to prediction & generation in biological design. Considering this context allows it not only to match existing problem-specific state-of-the-art models, but also to generalise to much wider and more sophisticated problems.
AI and biology: a bit of background
For any deep learning approach to work well, you fundamentally require sufficient data with a sufficient signal-to-noise ratio. At its most basic level, AI is a pattern recognition tool that will replicate whatever patterns are in the training dataset (whether signal or noise). If it replicates signal, it works; if it replicates noise, it spits out sh**.
When modelling biology, there are many levels of complexity to consider (it’s a fractal problem), but let’s simplify this to three levels here for ease of discussion:
- Simplest level: protein (molecules)
- Much more complex: genome (organisms)
- Most complex: metagenome (communities)
Starting at the simplest level: proteins
Let’s start with what it takes to model proteins. So far, in bioAI, the logic has been that to predict and generate proteins, you just need protein sequences. The assumption is that there are enough known protein sequences (in well-studied protein classes) of sufficient quality for these simple protein AI models to work reasonably well.
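To make that concrete, here is a minimal sketch of what ‘just protein sequences’ means as a training signal: a masked-residue guessing game over strings of amino acids, with no other context. The sequence, mask token and function below are invented for illustration; real protein language models do this with neural networks over hundreds of millions of sequences, but the information they see is the same.

```python
import random

# Hypothetical amino-acid sequence, invented for this illustration.
protein = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"

def masked_example(seq: str, mask_token: str = "<mask>"):
    """Hide one residue; the model's only job is to guess it back from the
    surrounding residues -- no genomic or environmental context at all."""
    i = random.randrange(len(seq))
    corrupted = seq[:i] + mask_token + seq[i + 1:]
    return corrupted, seq[i]

x, y = masked_example(protein)
print(x)  # e.g. MKTAYIAKQRQISFV<mask>SHFSRQLEERLGLIEVQ
print(y)  # e.g. K
```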
The hidden secret of all these foundational protein models is that they are all trained on the same public datasets. These datasets were built before the AI era and present AI with many fundamental problems, not least their lack of size, diversity, quality, context, traceability & sustainability.
The dataset that a model is trained on defines the AI’s ‘imagination’: its ability to think creatively to solve problems. In simple terms, the AI will, by definition, never be able to ‘think outside this box’; it can only ever recapitulate and reorder patterns that it has already been exposed to. Therefore, by expanding the training dataset, you quite literally allow it to think outside the box.
This is well understood in other fields of AI, but in bio there are only a handful of foundational datasets available. This is where Basecamp Research comes in.
Of course, if you expand the training dataset for these simple protein models (more protein sequences), you get better results. At Basecamp Research, we have indeed used our vast proprietary data and extra diversity to improve on the state-of-the-art models across protein structure prediction (pre-print coming soon), protein generative AI and functional annotation. This has been good to show, but it still uses only a fraction of the power of the datasets we’re building at Basecamp Research (i.e. just the protein sequences).
So why is context so powerful?
At Basecamp Research, we’ve spent millions of dollars on our mission to build the first foundational datasets purpose-built for bioAI. As biology is a complex, multidimensional system, we have always strived to capture and computationally recreate all these biological complexities as completely as possible.
This has involved collecting vast amounts of labelled, curated metagenomic data from around the world that is as diverse and high quality as technically possible; we’ve pioneered data collection and curation processes that bring our data to the same quality as human clinical data. We’ve put in an incredible amount of effort to add this extra context at a high signal-to-noise ratio.
So why go to all this effort? Especially if we’re just expanding the pool of known proteins for the relatively ‘simple’ protein AI models.
Put simply, the bioAI models that will truly be able to take advantage of all the latent information in our dataset haven’t yet been built.
This brings us to why Evo is so exciting
Evo is the first generative model to operate at genome scale. It’s very cool, very exciting and deserves a lot of attention. Why?
Because it shows for the first time, amongst other things, that a model that isn’t specifically designed to generate proteins (it can do much more) has the potential to outperform a protein-specific model. This is the equivalent of ChatGPT writing poetry better than a poetry-specific model: inconceivable to most, before it was achieved.
Evo-type models have the potential to design more complex proteins, with specific functions, specific expression characteristics and higher success rates than many of the protein-specific models.
So how can it do this, and why does this relate to Basecamp Research?
Evo is able to do all these fantastic things because it understands the wider context that a protein evolved in. By modelling the surrounding genes and regulatory elements, it can bring incredibly complex biological information to bear on a successful design output; the sketch below illustrates the difference in what the model sees.
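As a toy illustration (this is not Evo’s actual data pipeline, and the sequences and coordinates below are invented), the difference lies in what the model is shown: a protein-specific model sees only the coding sequence of the gene, while a genome-scale model sees that gene embedded in a window of surrounding DNA, including its regulatory elements and neighbours.

```python
# Toy example of 'protein-only' vs 'genomic context' inputs.
# All sequences and coordinates are made up for illustration.
genome = (
    "TTGACAATTAATCATCGGCTCGTATAATGTGTGGA"  # pretend promoter / regulatory DNA
    "ATGGCTAAAGGTGAAGAACTGTAA"             # gene of interest (coding sequence)
    "AGGAGGACAGCT"                         # pretend intergenic region
    "ATGAAACGCATTAGCACCACCTGA"             # a neighbouring gene
)
gene_start, gene_end = 35, 59  # invented coordinates of the gene of interest

# A protein-specific model only ever sees the (translated) coding sequence:
cds_only = genome[gene_start:gene_end]

# A genome-scale model sees the gene plus its neighbourhood, so it can learn
# from operon structure, regulatory elements and co-evolving neighbours:
flank = 35  # how much surrounding DNA to include (arbitrary here)
window = genome[max(0, gene_start - flank):gene_end + flank]

print(len(cds_only), len(window))  # the window is far larger than the gene
```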
This is exciting because it shows that the fundamental leaps forward in bioAI will come from going up a level in complexity and context (in Evo’s case, from protein to genome).
So what about all the signal-to-noise stuff?
So, if we want to achieve the most complex designs possible, it follows that we need to expand the biological context as much as we can. This makes biological sense (everything in evolution is connected), but from a data perspective it gets more and more challenging: the data to go deeper simply doesn’t exist.
The fundamental point here is that as you expand the context window (you don’t just look at single proteins but at the whole genome), the model takes in much more information. Not all of this information is equally relevant (think: every amino acid in a protein is relevant, but not every regulatory element in a genome is relevant to that protein), which means that, while Evo-type algorithms will be more powerful, they will also be much more sensitive to the noise in the data.
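A rough back-of-the-envelope calculation (the numbers below are illustrative, not taken from the Evo paper) makes the point: in a long genomic window, the gene you actually care about is a tiny fraction of the input, so the amount of noise the model is exposed to scales with the window rather than with the gene.

```python
# Illustrative numbers only -- not figures from the Evo paper.
protein_length_aa = 300                  # a typical-ish protein
cds_length_nt = protein_length_aa * 3    # its coding sequence in nucleotides
context_window_nt = 100_000              # a long genomic context window

# Fraction of the model's input that is the gene you actually care about:
fraction = cds_length_nt / context_window_nt
print(f"{fraction:.1%} of the window is the target gene")  # ~0.9%

# If each position is mis-called or mis-assembled with some small probability,
# the expected number of noisy positions grows with the window, not the gene:
per_base_error_rate = 1e-3               # purely illustrative
print(f"~{per_base_error_rate * context_window_nt:.0f} expected noisy positions per window")
```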
This may be one of the reasons why the Evo team used only 80k genomes (from the petabytes of available sequence information): very little prokaryotic sequence information has high enough quality (low enough noise) to be reliably used in training this kind of model.
The Evo authors say that their model will benefit from “additional scale, longer context length and more diverse pre training data”. Over time, Evo and other Evo-type models will get hungrier and hungrier for far more high-quality genomes. Of course, we can provide that, and, as with protein AI, we will see better model performance across the board.
But what if we could go one step further? What about reference-genome-quality data at the scale of global metagenomes: fully labelled, fully contextualised, fully traceable (with benefit sharing to host countries) and purpose-built to enable the latest generation of AI models?
Wouldn’t that be good?
So, when does it get really exciting?
Remember I said there were three levels of complexity? Evo was trained on ‘sterile’ genomes, mostly the genomes of culturable organisms isolated in a lab. These genomes therefore lack any of the ‘excitement’ that goes on when bacteria, fungi, viruses etc. are all fighting for survival in the real world.
To use the metagenome level of complexity to inform design would be another step change in bioAI: more complex designs with more success, and true foundational design for biotechnology. However, the signal-to-noise problem gets much greater here. The metagenomes in public databases are a complete mess, and any AI that tried to learn from them in this way would be little better than a random number generator.
So, what you’d need to unlock the pinnacle of AI design would be a vast metagenomic dataset, all collected with human-clinical-quality data curation. If only someone were building this…
To summarise, Evo is an important step forward for bioAI and for Basecamp Research. It marks the leap from our data being useful simply as ‘additional diversity’ to truly and practically demonstrating the power of biological context as a zero-to-one, step-change enabler in bioAI.