To succeed in biotech, you need more than just an AI model

Gideon Lapidoth
Enzymit.log — The Enzymit Blog

--

The recent launch of AI protein design startup EvolutionaryScale, backed by a staggering $142 million in seed investment, follows closely on the heels of the reveal of AlphaFold3 by DeepMind and Isomorphic Labs. AlphaFold3 not only surpasses its predecessor, AlphaFold2, in accurately modeling protein structures, but also extends its capabilities to modeling small molecules and their interactions with proteins. The latter is immensely important for tasks such as drug discovery and design: knowing how drug candidates interact with a protein target can predict a multitude of critical factors, from efficacy to cross-reactivity to mechanism of action and more. These advances have the potential to cut years of R&D time from the development of new drugs and, as such, to significantly cut costs.

Given this, it’s no wonder these companies are attracting so much public attention and, by extension, significant investor capital.

However, let’s take a step back and try to discern where the real (monetary) value is. Let’s examine other companies working in the generative AI space, and specifically language models. Roughly speaking, most AI models have two key elements: the model architecture, which you can think of as the virtual hardware (how the model receives and processes data to make predictions or generate new data), and the training, the process whereby data is fed into the model and, over many iterations, the model learns to perform the prediction task or to generate new data. The training outcome is saved in the model weights, which are stored separately from the model architecture. This is important to understand: even if you had the model architecture of ChatGPT, AlphaFold, or Claude, it would be useless without the weights. This separation between architecture and weights is where the true value and complexity lie.
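As a loose illustration (not how any of these specific models is actually packaged), here is a minimal PyTorch-style sketch of that split: the architecture is just code with randomly initialized parameters, while the learned weights live in a separate file; the file name and model class below are hypothetical.

```python
import torch
import torch.nn as nn

# The "architecture": a toy model definition. On its own this is just code,
# and its randomly initialized parameters predict nothing useful.
class TinyPredictor(nn.Module):
    def __init__(self, dim: int = 128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(dim, dim),
            nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.encoder(x)

model = TinyPredictor()

# The "weights": the outcome of (expensive) training, stored separately.
# Only after loading them does the architecture become a useful model.
# "trained_weights.pt" is a placeholder path for this sketch.
state_dict = torch.load("trained_weights.pt")
model.load_state_dict(state_dict)
model.eval()
```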

It cost OpenAI roughly $4M to train GPT-3, $100M to train GPT-4, and a predicted $2Bn (!) to reach GPT-5. By comparison, the estimated cost of training OpenFold2 (an open-source implementation of AlphaFold2) was $150K (10 hours on 2,080 NVIDIA H100 GPUs), more than an order of magnitude less than even the earliest GPT models. The enormous cost of training its models explains why a company like OpenAI requires such significant capital investment. It also creates a significant barrier for potential competitors, given the vast resources required to train a new LLM that can approach the performance of GPT-4.
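For context, the $150K figure is consistent with a simple back-of-envelope calculation; the per-GPU-hour rate below is my own assumption for illustration, not a number reported by the OpenFold developers.

```python
# Back-of-envelope check of the quoted training cost.
# The hourly GPU price is an assumed cloud rate and varies widely in practice.
gpus = 2080              # NVIDIA H100 GPUs
hours = 10               # wall-clock training time
usd_per_gpu_hour = 7.0   # assumed on-demand cloud rate (illustrative only)

gpu_hours = gpus * hours                       # 20,800 GPU-hours
estimated_cost = gpu_hours * usd_per_gpu_hour  # ~ $145,600
print(f"{gpu_hours:,} GPU-hours -> ~${estimated_cost:,.0f}")
```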

This is not the case when training protein prediction models. The relatively low computational resources required enable teams of developers to pool resources and build open-source alternatives that match, and at times even surpass, commercially developed models. Additionally, protein prediction and design are active fields of academic research, which means labs around the world are also developing new models and, by the nature of academic research, are obligated (and motivated) to release both the models and their weights. Lastly, one could argue that data will be the determining factor separating commercial models from open-source or academic ones. While this remains to be proven, models like OpenFold have already demonstrated performance parity with Google’s AlphaFold2, so I would argue that data will not turn out to be the winning factor. The task of designing or modeling proteins is much simpler than conducting human-like reasoning and conversation, and the rules that govern parameters like protein stability and function are much better understood. I argue that existing publicly available data is more than enough to train a model that will generalize to nearly any protein design challenge; improving future models will come not from additional scaling or data, as is the case with LLMs, but rather from different architectures.

So where is the monetary value? Don’t get me wrong: these models are incredibly important and they have real-world value. The question is how to monetize that value. These models are ultimately very sophisticated computer-aided design (CAD) software. They enable users to design products which, in turn, can be extremely valuable. But CAD developers generate revenue through software licensing, a field becoming ever more competitive and commoditized by the rapid advance of AI.

Protein design software is a crucial enabling technology, but the true value comes from the products we put in people’s hands. The competitive advantage is derived from protein patents, practical applications, and specialized expertise, not merely from the software used to create them.

--


Gideon is the co-founder of Enzymit and its Chief Executive Officer. Enzymit is developing cell-free bio-production technology.