Multi-Modal Protein Language Models: Protein Function Prediction

A simple implementation using amino-acid sequence and catalytic site information

Krishna Yerramsetty
Jul 10, 2024


EvolutionaryScale caused quite a stir with their newest protein language model (PLM), ESM3, in late June 2024. Even though the models themselves are not publicly available, the 40-page appendix detailing the model architecture, pre-training, and evaluation is much appreciated. Here, I want to briefly focus on how the ESM3 model tries to address predicting the function of a protein, which is a challenging biological problem. Inspired by the ESM3 model, I built a much simpler dual-modal protein language model that is trained on a) amino-acid sequences and b) catalytic site (residue) information. But first, a primer on protein function prediction:

The Need for Predicting Protein Function

The recent wave of protein language models like the ESM family, ProGen, SaProt, etc. can generate protein sequences that are very different from known natural proteins while still maintaining the general “grammar” of a protein. For example, protein language models like ProGen can generate functional proteins that share less than 40% sequence similarity with the starting protein while still retaining its function. However, at lower sequence-similarity thresholds, the number of functional proteins drops off significantly. A good example of this phenomenon was shown in the ProteinMPNN paper, where the activity of designed Tobacco Etch Virus (TEV) proteases is markedly diminished as their similarity to the original TEV protease drops below 70% (Figure 1 below). Therefore, augmenting the current protein language models (PLMs) with function information or annotations would make these models even more useful.

Figure 1: From https://pubs.acs.org/doi/10.1021/jacs.3c10941 (Figure 1 (g))

The most common methods for predicting protein function are based on sequence similarity, structure similarity, or text mining of published research articles and patents. The major challenge with these existing methods is their failure to extrapolate to new proteins that do not have adequate sequence or structure similarity to known, annotated proteins. And since current PLMs can generate proteins that are much more divergent from their natural counterparts, the existing protein function prediction methods might not be useful for them. Now that I hope you are convinced of the utility of predicting protein function, let’s look at how models like ESM3 predict protein function.

Multi-Modal Protein Language Models: Fusing Sequence and Function Prediction

One powerful way to leverage protein function information is to incorporate these data during the protein language model pre-training or fine-tuning stage, along with the usual protein sequence information. Models that train on multiple data types are referred to as multi-modal models (Figure 2 below):

Figure 2: From: https://www.nocode.ai/what-is-multimodal-ai/

The most common multi-modal models combine text and image data, since these are the most abundant types of unlabeled data. In the case of protein language models, the different data types could be sequence data, structure data, functional annotations, etc. At a high level, I see two major themes for multi-modal PLMs that combine amino-acid sequences with the corresponding protein function information:

Llama + Protein Language Models

The Llama family is one of the most powerful and widely applicable families of open-source language models. Not surprisingly, it has been adapted to protein prediction tasks as well, either on its own (ProLLaMA) or by pairing a general language model with a dedicated protein language model like ESM, as in the case of ProtST. For example, the architecture used for training ProtST is shown in Figure 3 below, where the amino-acid sequence of a protein and the corresponding text describing the protein’s function are used to pre-train the model.

Figure 3: From https://arxiv.org/abs/2301.12040 (Figure 1 (a))

ESM3

I am using ESM3 as the representative model for the second type of multi-modal protein language models. These models do not build on a pre-trained language model like Llama but instead pre-train a protein language model from scratch using all the available data types. The architecture of ESM3 is shown below in Figure 4:

Figure 4: From https://www.biorxiv.org/content/10.1101/2024.07.01.600583v1 (Figure S1)

Some interesting features of the ESM3 model pertaining to function prediction:

  • Multiple data tracks, such as structure, function, and solvent-accessible surface area (SASA), along with the amino-acid sequence.
  • The function track is based on InterPro data: TF-IDF and locality-sensitive hashing convert the keyword frequencies from InterPro into a function “vocabulary”.
  • Token embeddings are summed across all tracks before the sum is fed to the transformer stack (see the sketch after this list).
  • Regression heads parse the final-layer outputs into multiple output tracks corresponding to each input data modality.
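
To make the fusion idea concrete, here is a minimal, illustrative PyTorch sketch of the general pattern: one embedding table per track, a per-position sum of the track embeddings, a shared transformer trunk, and one output head per track. This is not the ESM3 code; the class names, vocabulary sizes, and dimensions are placeholders I chose for illustration.

```python
import torch
import torch.nn as nn

class MultiTrackEncoder(nn.Module):
    """Toy multi-track model: one embedding table per track, embeddings
    summed per position, a shared transformer trunk, and one head per track."""

    def __init__(self, vocab_sizes: dict, d_model: int = 256, n_layers: int = 4):
        super().__init__()
        # One embedding table per data track (sequence, function, SASA, ...)
        self.embeddings = nn.ModuleDict(
            {track: nn.Embedding(size, d_model) for track, size in vocab_sizes.items()}
        )
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, num_layers=n_layers)
        # One output head per track, projecting back to that track's vocabulary
        self.heads = nn.ModuleDict(
            {track: nn.Linear(d_model, size) for track, size in vocab_sizes.items()}
        )

    def forward(self, tokens: dict) -> dict:
        # tokens[track]: (batch, seq_len) token ids; all tracks share positions
        x = sum(self.embeddings[track](ids) for track, ids in tokens.items())
        hidden = self.trunk(x)
        return {track: head(hidden) for track, head in self.heads.items()}

# Illustrative vocabulary sizes: ~33 amino-acid tokens, a made-up function vocabulary
model = MultiTrackEncoder({"sequence": 33, "function": 260})
batch = {
    "sequence": torch.randint(0, 33, (2, 100)),
    "function": torch.randint(0, 260, (2, 100)),
}
outputs = model(batch)  # per-track logits, each of shape (2, 100, vocab_size)
```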

InterPro is great, but its active-site annotations are too broad and often include residues that are not actually catalytic. To keep things simple, I used the catalytic site database from the Mechanism and Catalytic Site Atlas (M-CSA).

Catalytic Site Prediction

Catalytic sites, or catalytic residues, are the residues in an enzyme that matter most for its catalytic activity. This information could be useful when designing new enzymes. A recent paper lays out a good model for predicting catalytic sites of enzymes and also provides curated training, validation, and test datasets for comparing different models. See Figure 5 below:

Figure 5: https://pubs.acs.org/doi/10.1021/acs.jcim.3c00273 (Figure 2)

The AEGAN model is trained on approximately 14K catalytic sites and over 6 million non-catalytic sites, and it does a much better job than existing methods in terms of F1-score across multiple enzyme families. I used the same dataset as the AEGAN paper but rolled my own dual-modal protein language model to predict the catalytic sites using both the full amino-acid sequence of each protein and its catalytic-site information.

Multi-Modal Protein Language Models

Using the same Llama 2-like architecture from my previous post, I added a catalytic-site track in addition to the original amino-acid sequence track. You can find the multi-modal branch of the repo here. For simplicity, I only predict the catalytic sites and did not use a regression head like in the ESM3 model; the model’s final logits layer is used to predict the catalytic-site tokens. A few more implementation details:

  1. Catalytic residues/sites are represented as “X” and non-catalytic sites as “Z”. A protein’s catalytic-site track is therefore a sequence of Zs and Xs of length equal to that of the protein’s amino-acid sequence (see the sketch after this list).
  2. Since there are far more non-catalytic sites than catalytic sites in a protein, the cross-entropy loss is replaced with a weighted cross-entropy loss: the weight for the X labels is 10 times higher than for the Z labels. This weight is a hyper-parameter that needs more tuning.
  3. I used Contextual Position Encoding (CoPE) instead of the original rotary positional encoding. See my other post for the motivation behind using CoPE embeddings for protein language models.
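
Here is a small sketch of what points 1 and 2 could look like in PyTorch. The token ids, helper function, and example tensors are my own illustrative choices; only the Z/X encoding and the 10x weight on the X class come from the description above.

```python
import torch
import torch.nn as nn

# Illustrative token ids for the catalytic-site track: "Z" = non-catalytic, "X" = catalytic
SITE_VOCAB = {"Z": 0, "X": 1}

def encode_site_track(catalytic_positions: set, seq_len: int) -> torch.Tensor:
    """Build the Z/X track as token ids, same length as the amino-acid sequence."""
    track = torch.zeros(seq_len, dtype=torch.long)        # every residue starts as "Z"
    track[list(catalytic_positions)] = SITE_VOCAB["X"]    # mark catalytic residues (0-indexed)
    return track

# Weighted cross-entropy: the rare "X" class is weighted 10x (a tunable hyper-parameter)
loss_fn = nn.CrossEntropyLoss(weight=torch.tensor([1.0, 10.0]))

# Example: logits over the 2-token site vocabulary for one protein of length 120
logits = torch.randn(1, 120, 2)                             # (batch, seq_len, vocab)
labels = encode_site_track({14, 52, 77}, 120).unsqueeze(0)  # (batch, seq_len)
loss = loss_fn(logits.transpose(1, 2), labels)              # CE expects (batch, vocab, seq_len)
```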

The cross-entropy losses for the 14,230 training sequences and the 3,175 validation sequences are shown below:

Figure 6: Cross-Entropy Loss for Training and Validation Data

Precision, sensitivity (recall), and F1-scores on the validation set are shown below:

Figure 7: More metrics on the Validation Data

This model achieves precision, sensitivity, and F1-scores similar to those reported in the original AEGAN paper, showing that the sequence features learned by this multi-modal model help it distinguish catalytic from non-catalytic sites.
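
For reference, metrics like those in Figure 7 can be computed per residue with scikit-learn, treating catalytic (“X”) residues as the positive class. The labels below are made up purely for illustration:

```python
from sklearn.metrics import precision_recall_fscore_support

# Toy flattened per-residue labels: 1 = catalytic ("X"), 0 = non-catalytic ("Z")
y_true = [0, 0, 1, 0, 1, 0, 0, 0, 1, 0]
y_pred = [0, 0, 1, 0, 0, 0, 1, 0, 1, 0]

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary", pos_label=1
)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```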

TO DO:

  1. Test the final trained model on the benchmark dataset from the AEGAN paper
  2. Test adding more features like solvent-accessible surface area (SASA) information and secondary structure (SS8)
  3. Train models with and without Contextual Position Encoding (CoPE)
