Where generative protein design meets reality

Giulio Chiesa
11 min read · Dec 8, 2023


The other day I was talking to my friend E., an expert in that particular branch of machine learning where you take as much data as you can from patients with a specific disease and use The Algorithm to make sense of it. He has been in the Artificial Intelligence world since way before it was cool, and he is my go-to person for this kind of topic. So I was quite surprised that he had no idea there are already companies using generative models to design proteins entirely in silico — in other words, using AI to make new molecular machines. Last time he checked, it was some far-flung hypothesis, not really applicable yet. Instead, at least two companies are already deep into this process, Generate Biomedicines in Boston and Cradle Bio in Zurich, and things are accelerating. But actually… what are they doing? And how?

Ahead of the Fold

It’s true: in the past five years, protein design has leapt ahead at a speed I honestly find hard to keep pace with.

At the beginning, there was Protein Engineering: you take an already existing protein with a known, well-characterised function and you modify it, to make it better or to repurpose it, by introducing mutations, that is, by changing the nature of some of the amino acids that compose that protein. Protein Design takes a whole different approach: instead of adapting and repurposing, you combine all the available knowledge in the biophysics of protein folding and in biochemistry to design a new protein with specified functions. This is possible only if you have software that lets you visualise that protein before you make it in the lab.

Before the announcement of AlphaFold, the only group able to really do protein design was David Baker’s, thanks to their Rosetta software (RoseTTAFold, its deep-learning successor, came later). Rosetta condenses many physical principles that guide protein folding (hydrophobicity, steric clashes, charges…) and combines them with sampling algorithms, which are ways to choose subsets of solutions in a fashion that is representative of the whole population. To strip it to the bones, these algorithms are what we had before the invention of machine learning. A good analogy: you have an entry-level Japanese grammar book and a small Japanese-to-English dictionary, and you try to watch an anime without subtitles. Now imagine writing a new anime based on what you learned.
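If you wonder what “sampling” looks like in practice, here is a toy sketch of the Metropolis criterion, the kind of Monte Carlo trick that physics-based tools like Rosetta build on. It illustrates the principle only; it is not Rosetta’s actual code:

```python
import math
import random

def metropolis_search(energy, start, perturb, steps=10_000, temperature=1.0):
    """Toy Metropolis Monte Carlo: look for low-energy states of any system.

    energy:  function mapping a state to a scalar energy (lower is better)
    perturb: function proposing a small random change to a state
    """
    state, e = start, energy(start)
    best_state, best_e = state, e
    for _ in range(steps):
        candidate = perturb(state)
        e_new = energy(candidate)
        # Always accept downhill moves; accept uphill moves with Boltzmann
        # probability, so the search can escape local minima.
        if e_new < e or random.random() < math.exp(-(e_new - e) / temperature):
            state, e = candidate, e_new
            if e < best_e:
                best_state, best_e = state, e
    return best_state, best_e

# Toy usage: find the minimum of a bumpy one-dimensional "energy landscape".
bumpy = lambda x: (x - 2) ** 2 + math.sin(5 * x)
nudge = lambda x: x + random.uniform(-0.5, 0.5)
print(metropolis_search(bumpy, start=0.0, perturb=nudge))
```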

Things started to move really fast when Google DeepMind’s AlphaFold won the CASP13 structure-prediction competition in 2018.

This software is no longer programmed around physical concepts; instead it uses a neural network, trained on every protein whose structure is known, paired with its amino acid sequence. Returning to the anime analogy, AlphaFold learned Japanese by watching every anime ever made, with subtitles, from Astro Boy (1963) all the way to the latest season of Sword Art Online. Writing your own anime is a bit easier now.

From Protein Predictions to Protein Design

The same team that developed Rosetta was also the first to venture into computer-guided protein design, with rosetta.design. One could argue that, conceptually, this shouldn’t be that hard.

If you studied a bit of biochemistry in college, or even just in high school, you might remember that proteins are made of strings of amino acids, and that these building blocks are finite: biological systems use just 20 of them, and we identify them with single letters (A for Alanine, C for Cysteine, D for Aspartate, etc.). If you are a bit more familiar with proteins, you might know that certain structural shapes (i.e. folds, which have exotic names such as ɑ-helices, β-sheets or random coils) have preferences for subsets of amino acids and really dislike other subsets. The structural components (ɑ-helices, β-sheets) could then be seen as words composed of single amino acids. Different structures are organised hierarchically into motifs (small, repeatable functional units), which in turn compose domains (large functional units), always abiding by the same rules of physics and chemistry. This organisation reminds me a lot of an underlying grammar of protein folding, and in fact the word “grammar” has become common in protein biophysics too. Once you know the rules and you know the amino acids, you should be able to build back the shape you have in mind.
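To make the “grammar” idea concrete, here is a toy sketch using a few approximate Chou–Fasman helix propensities, an old empirical scale of how much each amino acid likes sitting in an ɑ-helix (the values below are rounded and only a subset; treat them as illustrative):

```python
# Approximate Chou-Fasman alpha-helix propensities (subset; >1 favours helix).
HELIX_PROPENSITY = {
    "E": 1.51, "M": 1.45, "A": 1.42, "L": 1.21, "K": 1.16,
    "V": 1.06, "D": 1.01, "S": 0.77, "N": 0.67, "P": 0.57, "G": 0.57,
}

def mean_helix_propensity(seq: str) -> float:
    """Average helix preference of a sequence: a very crude 'grammar check'."""
    scores = [HELIX_PROPENSITY[aa] for aa in seq if aa in HELIX_PROPENSITY]
    return sum(scores) / len(scores)

print(mean_helix_propensity("AEEALKKA"))  # helix-friendly "word": ~1.35
print(mean_helix_propensity("GPGPNSGP"))  # helix-breaking "word": ~0.61
```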

I remember, in a talk David Baker gave at MIT years ago — just at the onset of the AlphaFold phenomenon — somebody in the audience asked how easy it really is to go from a design to a real, functioning protein, and he replied outright: “Not easy. You need to make many proteins and test them all before finding one or two that work.”

This snippy answer is better unpacked in a very cool review from his lab, which summarises the state of the art in this field and sets some ground rules for designing proteins that perform tasks, such as enzymatic reactions. Enzymes are catalysts: they accelerate reactions that would otherwise occur very slowly or very infrequently, typically by physically forcing into close proximity the molecules that need to react together. Some of them help chemical reactions along by hosting, in an inner pocket called the catalytic site, single metal atoms, like zinc ions (Zn²⁺) or iron ions (Fe²⁺). The authors argue that, if you know the reaction you want to catalyse, you can build a protein around it, or adapt one that does something similar, using these modern computational tools. This sounds very easy, but it’s not: you need to transform the shape you have in mind into a series of helices, sheets and loops that resemble that shape.

The software helps you design them, but the output is still only a prediction. To compensate for the error in that prediction, you generate many different designs and test them all, which means producing all those proteins and devising experiments that test their activity.

Truth is, despite more than a century of research in structural biology, many of the rules that dictate how you go from a set of instructions (a protein sequence) to an object (a protein) that performs a specific function (a reaction) are still quite obscure, because complexity scales with the size of the protein, and you often need big proteins for complex tasks. To obtain one functional design, you had to screen hundreds to thousands of different designs. Then, maybe, you ended up with only a sub-optimal protein, and you had to use your ingenuity to identify smart mutations that made the protein better fit your purpose. Many of these designs failed, no matter what you did.

This explains why the Baker lab maintained supremacy in protein design for such a long time: to use rosetta.design you needed extensive knowledge of protein structures and had to conceive every backbone and side chain that would form the structure you wanted, or at least most of them. They were the only ones with the deep knowledge of structures, the computational tools, and the experimental expertise (and resources) to make real progress in this direction. Only they could speak protein.

“Protein Design takes teamwork!” — Image generated with Dall-E

First, you have to speak protein

Things were about to change: while AlphaFold was getting better and better at predicting structures, on the other side of the AI world ChatGPT and its spawn saw the light of day. And all of this happened in just the past two or three years.

You just saw how you can think of proteins as short essays in an alien language. So you shouldn’t be surprised that, from the day after ChatGPT was announced, the bioinformatics community produced a flurry of articles with better and better versions of transformers (the AI word for a model that generates an output sequence from an input sequence). I recently tweeted (Xed?) a very good technical review article about this, written by Sam Sinai and Eric Kelsic at Dyno Therapeutics.
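To see why the fit is so natural, here is a minimal sketch of how a protein sequence becomes transformer input, tokenized exactly like a sentence. The special tokens and vocabulary below are illustrative; real models differ in the details:

```python
# The 20 canonical amino acids plus a few special tokens: protein language
# models really do treat sequences as sentences in a 20-letter alphabet.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
VOCAB = {tok: i for i, tok in enumerate(["<cls>", "<eos>", "<pad>", *AMINO_ACIDS])}

def tokenize(seq: str, max_len: int = 16) -> list[int]:
    """Map a protein sequence to the integer IDs a transformer consumes."""
    ids = [VOCAB["<cls>"]] + [VOCAB[aa] for aa in seq] + [VOCAB["<eos>"]]
    return ids + [VOCAB["<pad>"]] * (max_len - len(ids))  # pad to fixed length

print(tokenize("MKTAYIAK"))
```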

My understanding is that we are getting to the end of this very fast race, with two — probably three — candidates.

The first one (at least chronologically) is Chroma, from Generate Bio. According to the article accompanying its release, Chroma generates sequences that explore structures never tried by evolution, but that are still functional (or at least properly folded). The prompts you can provide range from structural constraints to natural-language instructions (e.g. “design a bundle of six ɑ-helices”). They even made it freely available, so I tried to design a protein with the shape of my initial, G.
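Sampling a design looks roughly like this. The sketch follows the README of the open-source release (generatebio/chroma on GitHub) as I understand it at the time of writing, so check the current docs before trusting the exact calls:

```python
# A minimal sketch following the public Chroma README (generatebio/chroma);
# the exact API may have changed since this was written.
from chroma import Chroma

chroma = Chroma()                             # loads the pretrained models
protein = chroma.sample(chain_lengths=[100])  # unconditional 100-residue design
protein.to("my_design.pdb")                   # write the structure to a PDB file
```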

The second one comes from Cradle Bio. They provide a simple and elegant web application for designing proteins based on a set of customisable requirements, which then refines the sequences based on the data you provide. Their business model consists in offering this sleek service to companies (yes, it’s paywalled).

Third in this group, and currently the only fully academic one, is RFdiffusion (RoseTTAFold diffusion), again from the Baker lab. In this new iteration of RoseTTAFold, they combine their old approach with the “denoising” techniques of diffusion models, which improve the final structure: it’s like introducing a filter that knows what you should keep and what you should discard.
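The denoising idea fits in one toy picture: start from pure noise and repeatedly nudge points toward what a model considers plausible, plus a little fresh randomness. This is a cartoon of diffusion on 2-D points, not RFdiffusion itself:

```python
import numpy as np

rng = np.random.default_rng(0)

def denoise_step(coords, score, step=0.1, noise=0.05):
    """One reverse-diffusion step: move along the model's 'score'
    (the direction of higher plausibility) plus a little fresh noise."""
    return coords + step * score(coords) + noise * rng.standard_normal(coords.shape)

# Toy "model": plausible structures are points on a circle of radius 5.
def toy_score(coords):
    radii = np.linalg.norm(coords, axis=1, keepdims=True)
    return coords / radii * (5.0 - radii)  # push each point toward the circle

coords = rng.standard_normal((64, 2)) * 10      # start from pure noise
for _ in range(200):
    coords = denoise_step(coords, toy_score)
print(np.linalg.norm(coords, axis=1).round(1))  # radii now cluster near 5.0
```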

They explain how RFdiffusion works here and showcase its functionalities in this other article.

The next challenge in Protein Design is testing

OK, now you can use a few software tools to get better and better designs, but then what? Testing these designs is the next challenge.

People often argue that synthetic biology is not a real engineering discipline because, in most cases, you design something, you test it, and it turns out it does something else entirely or simply doesn’t work. This statement also holds true for protein design.

Let’s assume that we have solved the problem of speaking protein with these new tools and move on to the real design part, approaching it much as content creators approach AI-generated images. You write a prompt including all the details you want in the picture, the type of image (portrait or landscape), maybe even the style. You obtain 3–6 outcomes, but you don’t really like them yet. So you add more details and run it again. And again and again, until you reach something that reflects exactly what you had in mind, or is maybe even better than that.

In the case of a protein, the first few outcomes will be strings of amino acids that are fundamentally unintelligible to humans: you will need to use structure predictors to actually see (predictions of) the structure of each protein. This is both computationally demanding and imprecise, but it helps to exclude some clearly wrong solutions, and you can use this first set of data to slightly refine your designs. There is a risk here: if you do it too many times, the algorithm will start to identify underlying patterns that have nothing to do with the function of the protein but happen to emerge and “work well”. In the AI world, these are called hallucinations.
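In practice, this filtering step is a loop like the sketch below. Everything here is a placeholder: `mock_predictor` stands in for whatever structure predictor you use (AlphaFold, ESMFold…), and the confidence cutoff is an invented, illustrative threshold:

```python
from dataclasses import dataclass
import random

@dataclass
class PredictedModel:
    mean_plddt: float  # average per-residue confidence, on a 0-100 scale

def mock_predictor(seq: str) -> PredictedModel:
    """Stand-in for a real structure predictor (AlphaFold, ESMFold, ...)."""
    return PredictedModel(mean_plddt=random.uniform(40, 95))

def filter_designs(designs, predict_structure, min_confidence=80.0):
    """Keep only designs whose predicted structure looks confident enough;
    only the survivors go on to (expensive) wet-lab testing."""
    kept = []
    for seq in designs:
        model = predict_structure(seq)
        if model.mean_plddt >= min_confidence:
            kept.append((seq, model))
    return kept

designs = ["MKTAYIAK", "GPGPGPGP", "AEEALKKA"]
print(filter_designs(designs, mock_predictor))
```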

Cycle of design and testing for protein design: you set the preferences and run the software. Then you synthesize the DNA, clone it, and perform the experiments. Hopefully, one of the designs will give you something similar enough to what you are looking for, and you will be able to detect it in your experiment. Once you get the first set of results, you do it again, until you find the perfect G-shaped protein.
Figure generated with Illustrator and bioicon; protein structure generated with Chroma

You are soon forced to start screening those designs by performing experiments, the same way the Baker lab has been doing it all these years. This requires synthesizing long sequences of DNA, one coding for each protein (hundreds of them). Even with the ever-decreasing price of DNA synthesis, this can add up to $10,000 of gene synthesis. Then you wait for the DNA to arrive in your mailbox and start designing your experiments. Once you have the DNA, you have to adapt it for gene expression. This often requires intermediate steps in which your DNA sequences, which we call “a library”, are inserted into circular pieces of DNA that can be delivered into cells. We call these bigger circular pieces “plasmid vectors”, or just “vectors”.
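Going from a designed amino acid sequence back to orderable DNA is the one purely mechanical step in all this. A minimal sketch, using one common E. coli codon per amino acid (real codon optimisation juggles codon usage, GC content, secondary structure and more):

```python
# One frequently used E. coli codon per amino acid (illustrative only).
PREFERRED_CODON = {
    "A": "GCG", "C": "TGC", "D": "GAT", "E": "GAA", "F": "TTT",
    "G": "GGC", "H": "CAT", "I": "ATT", "K": "AAA", "L": "CTG",
    "M": "ATG", "N": "AAC", "P": "CCG", "Q": "CAG", "R": "CGT",
    "S": "AGC", "T": "ACC", "V": "GTG", "W": "TGG", "Y": "TAT",
}

def reverse_translate(protein: str) -> str:
    """Turn a designed protein back into a DNA sequence you can order."""
    return "".join(PREFERRED_CODON[aa] for aa in protein) + "TAA"  # stop codon

print(reverse_translate("MKTAYIAK"))  # ATGAAAACCGCGTATATTGCGAAATAA
```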

While designing the experiment, you were also deciding where you want this protein to be produced. Does it perform an enzymatic reaction with a specific substrate? You might want fully in vitro experiments (i.e. no living cells involved). Does the protein interfere with biological systems? Then maybe a cell-based assay is better, meaning you need to introduce it into the cells you want to test and then develop an assay that reads out your outcome. At the end of it all, you might obtain some good candidates, but most likely you will get something that kind of works but needs further improvement. The good news: now you can feed this data into the same algorithm and train it for a better outcome, get new sequences, dish out another $10K, and restart.
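The whole design-build-test-learn cycle fits in a few lines of sketch code; every name here (`generator`, `run_assay`, the batch size) is a placeholder for your own tools and budget:

```python
def design_test_learn(generator, run_assay, rounds=3, batch_size=200):
    """Sketch of the loop: generate, synthesize and test, feed back, repeat."""
    measurements = []
    for _ in range(rounds):
        designs = generator.sample(n=batch_size)  # ~$10K of gene synthesis
        results = run_assay(designs)              # weeks of wet-lab work
        measurements.extend(results)
        generator.finetune(measurements)          # feed the data back to the model
    # Return the best candidate found within budget.
    return max(measurements, key=lambda r: r.activity)
```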

No more free lunch

You can see how this can get very expensive very quickly. And there is more: in a recent post on the science blog In the Pipeline, the always-sharp Derek Lowe gives a reality check on AlphaFold’s ability to predict the binding of small molecules to larger proteins, making this tool less useful for drug discovery than people would like to think. This is mostly because there are not enough crystal structures of small molecules (i.e. drugs or substrates of enzymes) bound to proteins to represent the large, complex landscape of chemicals available in nature and generated by humans, so AlphaFold could not train on enough of them.

If this is true for AlphaFold, then designing enzymes that convert one organic molecule into another will run into similar problems. But there is a solution for that too: More Data. Then again, in biology, data — especially large amounts of comprehensive data — is the most expensive thing you can get, and the most valuable.

This revolution started with freely available databases generated by consortia (like the PDB, which AlphaFold was trained on), but now that these have already been mined, you need more data, and data more specific to your purpose.

This data can come from all the previous screenings your lab or company has been sitting on but could hardly make sense of before. Now you can use it to train better and better custom versions of those algorithms, turning all that garbage data and all those failed screenings into “valuable assets”, protected by intellectual property.

A tweet from Michael Baym that nails how to do biomedical research in an AI world.

Whereas if your lab doesn’t have previous data, you now have to hack away one screening at a time, at high financial cost. Of course, other academic groups have done similar screenings and their data is “available upon request”, but we all know that’s just the academic way of saying “fuhgeddaboudit!”. The only edge you have here is that you can start fresh, thinking of the best way to design these screenings to serve the purpose of training the AI.

I think this is the shape that innovation will take: inventing new experiments that provide rich information intelligible to an AI, rather than to a single scientist, and then training that AI to get better and better at giving the desired result. Some people call this “wet AI”.

This generally involves large, labor-intensive, and expensive rounds of experiments, but ones that can be rationalised down to tasks a robot can perform. This type of research is more common in companies than in academic labs, because of their larger cash flow and access to the expensive equipment needed for full automation and high-throughput screening.

So, all things considered, AI-driven protein design is not much more democratising than previous approaches. But maybe this won’t be forever: the constant decrease in the price of DNA synthesis propelled the kind of synthetic biology research we are doing now, which was impossible 10 years ago. Further decreases in the price of DNA synthesis and, hopefully, in the cost of automation could make space for smaller labs that do not sit on multi-million-dollar caches. But, for now, protein design is an expensive game.

Thanks to Dr. Tara MacDonald for her insights :)


Giulio Chiesa

Synthetic Biologist based in Boston, MA. Protein Biophysics, genetic circuits and creative writing, when I can. Twitter https://twitter.com/gchies1