Training Protein Embeddings: A Simple Walkthrough Using the ProtBert Example

Abish Pius
Computational Biology Papers
8 min read · Jul 22, 2023
Protein Embedding Models

Full Article: ProtTrans: Towards Cracking the Language of Life’s Code Through Self-Supervised Deep Learning and High Performance Computing (biorxiv.org)

Layout

  • Quick Read
  • Full Article Review
  • Code Section

Quick Read

  1. Data Preparation: The first step is to gather a large dataset of protein sequences. This dataset may come from publicly available protein databases or from other sources like research papers. Each protein sequence is represented as a string of amino acids, where each amino acid is represented by a single letter code.
  2. Feature Representation: To convert the protein sequences into numerical representations suitable for deep learning models, we use one-hot encoding or learned embeddings. One-hot encoding represents each amino acid as a binary vector whose dimensionality equals the number of possible amino acids. Learned embeddings, on the other hand, are dense, fixed-size vector representations that are trained during the embedding training process (a minimal encoding-and-training sketch follows this list).
  3. Model Architecture: There are several architectures that can be used to train protein embeddings. Recurrent Neural Networks (RNNs) such as Long Short-Term Memory (LSTM) or Gated Recurrent Unit (GRU) can be used to process the sequential nature of protein sequences. Convolutional Neural Networks (CNNs) can also be applied to learn local features from the protein sequences. Transformer-based models, like the ones used in natural language processing, have also been adapted for protein sequences.
  4. Training Objective: The model is trained to predict certain properties of proteins. The choice of the training objective depends on the specific downstream task or the information we want the embeddings to capture. For instance, the model can be trained to predict protein function based on the sequence, or it can be trained to predict protein-protein interactions based on pairs of protein sequences.
  5. Loss Function: The loss function measures the error between the model’s predictions and the actual targets. The choice of the loss function depends on the task being solved. For example, for protein function prediction, cross-entropy loss might be used, while for protein-protein interaction prediction, a binary classification loss might be used.
  6. Training Process: The model is trained using a large dataset of labeled protein sequences. The training process involves iteratively presenting batches of protein sequences to the model, calculating the loss, and updating the model’s parameters using gradient-based optimization algorithms (e.g., stochastic gradient descent or Adam). The process continues until the model converges or reaches a predefined stopping criterion.
  7. Evaluation: Once the model is trained, it is evaluated on a separate validation or test dataset to assess its performance. The evaluation metrics depend on the specific task, but common metrics include accuracy, precision, recall, and F1 score.
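To make steps 2, 5, and 6 concrete, here is a minimal, self-contained sketch (not the ProtTrans training code): it one-hot encodes two toy sequences, feeds them to a small PyTorch classifier, and runs a few cross-entropy training steps. The sequences, labels, and model sizes are made up for illustration.

import torch
import torch.nn as nn

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard amino acids (single-letter codes)
AA_TO_IDX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot_encode(sequence, max_len=16):
    # Build a (max_len x 20) one-hot matrix, zero-padded to max_len
    encoding = torch.zeros(max_len, len(AMINO_ACIDS))
    for pos, aa in enumerate(sequence[:max_len]):
        encoding[pos, AA_TO_IDX[aa]] = 1.0
    return encoding

# Toy dataset: two sequences with made-up binary function labels
sequences = ["MKTAYIAK", "GSHMLEDP"]
labels = torch.tensor([0, 1])
inputs = torch.stack([one_hot_encode(s) for s in sequences])  # shape (2, 16, 20)

# Tiny classifier standing in for an RNN/CNN/Transformer encoder
model = nn.Sequential(nn.Flatten(), nn.Linear(16 * 20, 32), nn.ReLU(), nn.Linear(32, 2))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(10):  # forward pass, loss, backpropagation, parameter update
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), labels)
    loss.backward()
    optimizer.step()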

Background

Computational biology and bioinformatics have become treasure troves of data, especially in the form of protein sequences. In recent years, language models (LMs) from Natural Language Processing (NLP) have been harnessed to revolutionize prediction capabilities at low inference costs. This article explores the training of auto-regressive and auto-encoder models on massive protein sequence datasets, leveraging High-Performance Computing (HPC) for unprecedented performance gains. The study investigates the advantages of up-scaling LMs on larger models supported by vast protein data, enabling breakthroughs in predicting protein properties and uncovering biophysical properties governing protein shapes.

The Power of High-Performance Computing and Deep Learning

Advancements in HPC and deep learning have been instrumental in achieving scientific breakthroughs. More powerful supercomputers and advanced libraries, along with processing units like GPUs and TPUs, enable training complex models on larger datasets at remarkable speeds and efficiency. LMs, particularly Transformers, have excelled in various NLP tasks and have greatly benefited from HPC advancements.

Protein Sequences as Language

Protein research presents a promising application for transfer learning, where unlabeled protein sequence data, which grows exponentially, can complement the limited annotated datasets. The “sequence-structure gap” poses a challenge, as the number of proteins with known 3D structures is orders of magnitude smaller than those with known 1D sequences. Advanced LMs have demonstrated the potential to interpret protein sequences as sentences, amino acids as words, and grasp grammar-like constraints reflected in protein 3D structures, paving the way for efficient protein analysis.

The ProtTrans Project

The ProtTrans project seeks to explore two primary objectives. First, it aims to push the boundaries of up-scaling language models trained on proteins, as well as the protein sequence databases used for training. Second, it compares the effects of auto-regressive and auto-encoding pre-training on subsequent supervised training and compares the performance of LMs to existing solutions using evolutionary information.

The Promising Future of Protein Embeddings

The successful up-scaling of protein LMs through HPC and extensive data sets narrows the gap between models trained on evolutionary information and language models. These embeddings, derived from unlabeled protein sequences, capture essential biophysical properties governing protein shapes, indicating a grasp of the language of life encoded in protein sequences. By harnessing the power of LMs, computational biology and bioinformatics stand to revolutionize the prediction of protein properties and bridge the sequence-structure gap, making strides in understanding the language of life.

Results

Overall, the study shows that the protein LMs are capable of capturing important features of proteins and can be useful for various predictive tasks. However, they are still outperformed by methods using evolutionary information. The embeddings-based approach is computationally efficient and provides an alternative for fast predictions.

  1. Unsupervised Embeddings: The embeddings extracted from the LMs capture important information about individual amino acids, including biophysical features such as charge, polarity, and hydrophobicity.
  2. Capturing Protein Structure: The LMs are capable of capturing various aspects of protein structure, distinguishing between all-alpha, all-beta, alpha|beta, alpha&beta, multi-domain, membrane/cell surface, and small proteins.
  3. Capturing Protein Function: The embeddings also exhibit some ability to capture aspects of protein function, such as classifying proteins into transferases, hydrolases, and oxidoreductases.
  4. Capturing Domains of Life and Viruses: The LMs demonstrate the ability to separate proteins from different domains of life (archaea, bacteria, and eukarya), while viruses form less homogeneous clusters.
  5. Supervised Predictions using Embeddings: The embeddings are used as input for supervised predictions, including secondary structure prediction and protein localization (see the sketch after this list). While the embeddings-based approach outperforms uncontextualized methods, it still falls behind methods that use evolutionary information.
  6. Fast Predictions using Embeddings: One advantage of the protein embeddings is their speed compared to traditional database search methods (using evolutionary information). The embeddings-based approach is significantly faster in generating representations for proteins.

Discussion

Tackling HPC Challenges for Larger Protein LMs

The endeavor to up-scale language models to cope with vast protein databases presented unique challenges. Addressing these hurdles involved careful consideration of architecture, communication overhead, distributed training, file sharing, and pre-processing. Moreover, optimizing deep learning libraries, such as PyTorch and TensorFlow, facilitated efficient training and resource utilization on HPC systems like Summit.

Unsupervised Learning of Protein Features

Remarkably, unsupervised LMs demonstrated an ability to grasp fundamental features of proteins. These LMs extracted valuable information from amino acid building blocks, identified protein structures, and even discerned macroscopic features related to the domains of life. Notably, global structural and biochemical properties appeared most distinct, while local features were less separated.

Bi-Directional Models Take the Lead

While bi-directional context proved crucial for modeling the language of life encoded in proteins, the traditional parity between auto-regressive and auto-encoding models observed in NLP did not hold for proteins. Bi-directional models outperformed uni-directional ones, with TransformerXL (ProtTXL) performing less effectively than XLNet (ProtXLNet) due to their context-capturing mechanisms. Consideration of bi-directional context during training might be achieved through specific pre-training approaches, potentially offering even better results.

Bigger Data Not Always Better

The use of vast protein databases, such as BFD, resulted in a limited increase in performance for some second-stage predictions. While more data and larger models showed improvements, they did not lead to exponential gains. Nevertheless, the significant speed-up of inference with protein LMs compared to traditional evolutionary approaches offered practical advantages.

The Upper Limit of Protein LMs

While protein LMs present exciting opportunities for extracting valuable information from proteins, there might be an upper limit to their capabilities when relying solely on auto-regressive or auto-encoding techniques. Bi-directional models have shown superiority, and future research should explore additional approaches, such as incorporating auxiliary tasks and addressing model vs. data parallelism. Furthermore, full precision training might stabilize training and provide more informative representations.

CODE Walkthroughs

Training Protein Embeddings from Scratch in Python

Reference: GitHub — agemagician/ProtTrans: state-of-the-art pretrained language models for proteins, trained on thousands of GPUs from Summit and hundreds of Google TPUs using Transformer models.

Link to Colab for the code: https://colab.research.google.com/drive/1g1YZJuBh2QUDlDpy3Tp_VYsR2zrBPMWG?usp=sharing

  1. Import the necessary libraries (re is needed later, in step 6, to clean up the sequences):
from transformers import T5Tokenizer, T5EncoderModel
import torch
import re

2. Check for the available device (CPU or CUDA/GPU) and assign it to the variable device.
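For example, with a standard PyTorch availability check:

device = 'cuda:0' if torch.cuda.is_available() else 'cpu'  # use the GPU when one is available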

3. Load the pretrained T5 tokenizer for protein sequences:

tokenizer = T5Tokenizer.from_pretrained('Rostlab/prot_t5_xl_half_uniref50-enc', do_lower_case=False)

4. Load the pretrained T5EncoderModel for protein sequences and move it to the selected device:

model = T5EncoderModel.from_pretrained("Rostlab/prot_t5_xl_half_uniref50-enc").to(device)

5. Choose the appropriate precision for the model based on the device. The model is set to half-precision (float16) on GPU and full-precision (float32) on CPU:

model.float() if device == 'cpu' else model.half()  # half precision is only supported on GPU

6. Prepare a list of protein sequences and preprocess them. The code replaces rare/ambiguous amino acids (U, Z, O, B) with X and introduces white-spaces between amino acids:

sequence_examples = ["PRTEINO", "SEQWENCE"]
sequence_examples = [" ".join(list(re.sub(r"[UZOB]", "X", sequence))) for sequence in sequence_examples]

7. Tokenize the sequences and pad them up to the length of the longest sequence in the batch using the tokenizer:

ids = tokenizer.batch_encode_plus(sequence_examples, add_special_tokens=True, padding="longest")
input_ids = torch.tensor(ids['input_ids']).to(device)
attention_mask = torch.tensor(ids['attention_mask']).to(device)

8. Generate embeddings for the sequences using the T5EncoderModel:

with torch.no_grad():
    embedding_repr = model(input_ids=input_ids, attention_mask=attention_mask)

9. Extract the residue embeddings for each sequence in the batch and remove padded and special tokens:

emb_0 = embedding_repr.last_hidden_state[0, :7]  # shape (7 x 1024)
emb_1 = embedding_repr.last_hidden_state[1, :8] # shape (8 x 1024)

10. If you want to derive a single representation (per-protein embedding) for the whole protein, take the mean of the residue embeddings along the sequence length axis (dimension 0):

emb_0_per_protein = emb_0.mean(dim=0)  # shape (1024)

Embedding a Protein Sequence via Hugging Face Transformers

from transformers import BertModel, BertTokenizer
import re

# Load the ProtBert tokenizer and model
tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)
model = BertModel.from_pretrained("Rostlab/prot_bert")

# Space-separated amino acids; replace rare/ambiguous residues (U, Z, O, B) with X
sequence_Example = "A E T C Z A O"
sequence_Example = re.sub(r"[UZOB]", "X", sequence_Example)

# Tokenize and run a forward pass to obtain the embeddings
encoded_input = tokenizer(sequence_Example, return_tensors='pt')
output = model(**encoded_input)
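The residue-level embeddings can then be read off the model output. For example (a sketch, assuming ProtBert's 1024-dimensional hidden states and that the tokenizer wraps the residues in [CLS] and [SEP] special tokens):

residue_embeddings = output.last_hidden_state  # shape (1, number_of_tokens, 1024), incl. [CLS] and [SEP]
per_protein_embedding = residue_embeddings[0, 1:-1].mean(dim=0)  # mean over residues only, shape (1024)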
  • Parts of this article were written using Generative AI
  • Subscribe/leave a comment if you want to stay up-to-date with the latest AI trends and Computational Biology Research.

Plug: Please purchase my book ONLY if you have the means to do so. Imagination Unleashed: Canvas and Color, Visions from the Artificial: Compendium of Digital Art Volume 1 (Artificial Intelligence Draws Art) — Kindle edition by P, Shaxib, A, Bixjesh. Arts & Photography Kindle eBooks @ Amazon.com.

