ProteinBERT: Revolutionizing Protein Analysis with Deep Learning

Source Code: nadavbra/protein_bert (github.com)

Proteins are fundamental building blocks of life, performing crucial functions in various biological processes. Understanding protein structure, function, and interactions is essential for advancing fields such as drug discovery, bioinformatics, and molecular biology. To aid researchers in these endeavors, a powerful Python package called ProteinBERT has been developed, which leverages deep learning techniques to analyze proteins.

ProteinBERT is a protein language model pretrained on approximately 106 million protein sequences from UniRef90. Inspired by the BERT architecture, ProteinBERT introduces several innovations, including global-attention layers whose computational cost grows only linearly with sequence length. This design allows the model to process protein sequences of virtually any length, including extremely long sequences of tens of thousands of amino acids.
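
As a quick illustration of this length flexibility, the following sketch (based on the package's documented API; the specific sequence lengths are arbitrary choices for demonstration, not package defaults) builds Keras models for two very different input lengths from the same pretrained model generator:

from proteinbert import load_pretrained_model

# Download (or load from the local cache) the pretrained model generator and its input encoder.
pretrained_model_generator, input_encoder = load_pretrained_model()

# The same generator can instantiate a model for essentially any sequence length;
# 512 and 16384 are illustrative values only.
short_model = pretrained_model_generator.create_model(512)
long_model = pretrained_model_generator.create_model(16384)

short_model.summary()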

The primary goal of ProteinBERT is to provide a pretrained model that can be quickly fine-tuned on a wide variety of protein-related tasks. By fine-tuning the model on specific tasks, researchers can reach competitive, often state-of-the-art, performance across a broad set of benchmarks. The package integrates with Keras and TensorFlow, popular deep learning frameworks, ensuring ease of use and compatibility with existing workflows.

One notable feature of ProteinBERT is its ability to incorporate protein Gene Ontology (GO) annotations as additional inputs. GO annotations provide information about the function of a protein, allowing the model to update its internal representations and outputs accordingly. This capability enhances the model’s inference abilities and enables more accurate predictions about protein function.
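
To make this concrete, the sketch below follows the representation-extraction pattern shown in the project's README (the example sequence and the 512-residue length are placeholders chosen for illustration). The input encoder produces two inputs for the model, the tokenized sequence and a binary GO-annotation vector (left empty here), and the model returns per-residue (local) and whole-protein (global) representations:

from proteinbert import load_pretrained_model
from proteinbert.conv_and_global_attention_model import get_model_with_hidden_layers_as_outputs

# Placeholder sequence and length, for illustration only.
seqs = ['MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ']
seq_len = 512

pretrained_model_generator, input_encoder = load_pretrained_model()

# Expose the hidden-layer activations as additional model outputs.
model = get_model_with_hidden_layers_as_outputs(pretrained_model_generator.create_model(seq_len))

# encode_X returns the tokenized sequences together with the GO-annotation input.
encoded_x = input_encoder.encode_X(seqs, seq_len)

# local_representations: one feature vector per residue; global_representations: one vector per protein.
local_representations, global_representations = model.predict(encoded_x, batch_size=1)
print(local_representations.shape, global_representations.shape)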

To get started with ProteinBERT, the package provides straightforward installation instructions. It requires Python 3 and automatically installs the necessary dependencies, including TensorFlow, TensorFlow Addons, NumPy, Pandas, H5py, lxml, and pyfaidx. Once installed, researchers can explore the provided working examples in the accompanying Jupyter notebook.

For those interested in training ProteinBERT from scratch, the package includes scripts and guidelines for the entire process. However, it’s worth noting that training the model from scratch is a resource-intensive task. It requires substantial storage (>1TB) and can take a considerable amount of time (several weeks). Researchers are encouraged to carefully evaluate their needs and available resources before undertaking the training process.

The pretraining of ProteinBERT involves creating a dataset derived from UniRef90, a comprehensive protein sequence database. The package provides step-by-step instructions for acquiring the necessary data, including the UniRef90 XML and FASTA files. By using scripts provided by ProteinBERT, researchers can extract the relevant Gene Ontology annotations associated with UniRef’s records and create a final dataset in the H5 format.

Once the dataset is ready, researchers can initiate the pretraining process using the pretrain_proteinbert script. The script trains the ProteinBERT model on the prepared dataset, saving the model state at regular intervals. Researchers have the flexibility to adjust hyperparameters and explore various options provided by the script to fine-tune the training process to their specific needs.

After pretraining the model, researchers can utilize the pretrained model state when fine-tuning ProteinBERT on specific protein-related tasks. The package provides functions to load the pretrained model and resume training or use the model for inference and analysis.
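
The snippet below sketches that workflow under the same assumptions as the full example further down: it loads the default pretrained model, wraps it in a fine-tuning generator for a hypothetical per-protein binary task, and builds a concrete Keras model that could then be trained or used for prediction (the output specification and the 512-residue length are illustrative choices, not requirements):

from proteinbert import OutputType, OutputSpec, FinetuningModelGenerator, load_pretrained_model
from proteinbert.conv_and_global_attention_model import get_model_with_hidden_layers_as_outputs

# Load the default pretrained model state (downloaded on first use).
pretrained_model_generator, input_encoder = load_pretrained_model()

# Hypothetical per-protein binary task; adjust the OutputSpec to your own labels.
output_spec = OutputSpec(OutputType(False, 'binary'), [0, 1])

# Wrap the pretrained generator so it can be fine-tuned (or inspected as-is) for that task.
model_generator = FinetuningModelGenerator(pretrained_model_generator, output_spec,
        pretraining_model_manipulation_function=get_model_with_hidden_layers_as_outputs, dropout_rate=0.5)

# A standard Keras model for length-512 inputs, ready for training, inference, or inspection.
model = model_generator.create_model(512)
model.summary()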

ProteinBERT has emerged as a valuable tool for protein analysis, enabling researchers to leverage the power of deep learning for understanding protein structure and function. By combining the strengths of BERT architecture, global-attention layers, and incorporation of GO annotations, ProteinBERT offers a versatile and powerful solution for protein-related tasks.

With its ease of installation, integration with popular deep learning frameworks, and extensive documentation, ProteinBERT empowers researchers to achieve state-of-the-art results in protein analysis. By leveraging the pretrained model and fine-tuning it on specific tasks, researchers can expedite their experiments and make significant strides in understanding the complex world of proteins.

As the field of bioinformatics and protein research continues to evolve, ProteinBERT will undoubtedly play a significant role in advancing our understanding of proteins and their functions. The ability to process protein sequences of varying lengths opens up new possibilities for analyzing complex proteins, including those with extended regions or unique structural features.

ProteinBERT’s pretrained model serves as a valuable starting point for researchers working on protein-related tasks. By fine-tuning the model on specific datasets, researchers can achieve strong performance on a wide range of benchmarks, matching and in some cases exceeding previously reported results. This fine-tuning process is fast and efficient, allowing researchers to iterate quickly and explore different applications of ProteinBERT.

The integration of protein Gene Ontology (GO) annotations further enhances ProteinBERT’s capabilities. By incorporating additional information about protein function, the model can better understand the underlying biological context and make more accurate predictions. This integration opens up opportunities for researchers to delve into functional genomics, protein classification, and other tasks that heavily rely on GO annotations.

The availability of working examples and a comprehensive Jupyter notebook simplifies the usage of ProteinBERT. Researchers can easily grasp the implementation details and adapt the provided code to their specific requirements. The package’s compatibility with Keras and TensorFlow ensures seamless integration into existing deep learning workflows, allowing researchers to leverage their knowledge and expertise in these frameworks.

While pretraining ProteinBERT from scratch is a resource-intensive process, it provides an avenue for researchers to explore novel applications and datasets. The guidelines provided by the package help researchers navigate the complexities of data preparation, ensuring that the resulting dataset aligns with their research goals. With careful planning and adequate resources, researchers can embark on pretraining ProteinBERT to create customized models tailored to their specific protein-related tasks.

ProteinBERT’s impact extends beyond the research community. It has the potential to facilitate advancements in drug discovery, protein engineering, and personalized medicine. By enabling more accurate predictions of protein function and interactions, ProteinBERT can assist in identifying drug targets, predicting drug-protein interactions, and designing protein variants with desired properties. These applications hold promise for accelerating the development of new therapies and enhancing our understanding of complex diseases.

In conclusion, ProteinBERT is a powerful Python package that harnesses the capabilities of deep learning to revolutionize protein analysis. With its pretrained model, seamless integration with popular deep learning frameworks, and support for GO annotations, ProteinBERT empowers researchers to tackle complex protein-related tasks with ease. As the field of protein research continues to expand, ProteinBERT will serve as a vital tool in unraveling the mysteries of proteins and driving scientific advancements in various domains.

Code Example

Fine-tune the model for the signal peptide benchmark

import os
import pandas as pd
from IPython.display import display
from tensorflow import keras
from sklearn.model_selection import train_test_split
from proteinbert import OutputType, OutputSpec, FinetuningModelGenerator, load_pretrained_model, finetune, evaluate_by_len
from proteinbert.conv_and_global_attention_model import get_model_with_hidden_layers_as_outputs
BENCHMARK_NAME = 'signalP_binary'
# Directory containing the benchmark CSV files (not defined in the original snippet; adjust to your local setup)
BENCHMARKS_DIR = 'protein_benchmarks'

# A global (per-protein) binary output
OUTPUT_TYPE = OutputType(False, 'binary')
UNIQUE_LABELS = [0, 1]
OUTPUT_SPEC = OutputSpec(OUTPUT_TYPE, UNIQUE_LABELS)

# Loading the dataset
train_set_file_path = os.path.join(BENCHMARKS_DIR, '%s.train.csv' % BENCHMARK_NAME)
train_set = pd.read_csv(train_set_file_path).dropna().drop_duplicates()
train_set, valid_set = train_test_split(train_set, stratify = train_set['label'], test_size = 0.1, random_state = 0)
test_set_file_path = os.path.join(BENCHMARKS_DIR, '%s.test.csv' % BENCHMARK_NAME)
test_set = pd.read_csv(test_set_file_path).dropna().drop_duplicates()
print(f'{len(train_set)} training set records, {len(valid_set)} validation set records, {len(test_set)} test set records.')

# Loading the pre-trained model and fine-tuning it on the loaded dataset
pretrained_model_generator, input_encoder = load_pretrained_model()
# get_model_with_hidden_layers_as_outputs gives the model output access to the hidden layers (on top of the output)
model_generator = FinetuningModelGenerator(pretrained_model_generator, OUTPUT_SPEC,
        pretraining_model_manipulation_function = get_model_with_hidden_layers_as_outputs, dropout_rate = 0.5)

training_callbacks = [
    keras.callbacks.ReduceLROnPlateau(patience = 1, factor = 0.25, min_lr = 1e-05, verbose = 1),
    keras.callbacks.EarlyStopping(patience = 2, restore_best_weights = True),
]

finetune(model_generator, input_encoder, OUTPUT_SPEC, train_set['seq'], train_set['label'], valid_set['seq'], valid_set['label'],
        seq_len = 512, batch_size = 32, max_epochs_per_stage = 40, lr = 1e-04, begin_with_frozen_pretrained_layers = True,
        lr_with_frozen_pretrained_layers = 1e-02, n_final_epochs = 1, final_seq_len = 1024, final_lr = 1e-05,
        callbacks = training_callbacks)

# Evaluating the performance on the test-set
results, confusion_matrix = evaluate_by_len(model_generator, input_encoder, OUTPUT_SPEC, test_set['seq'], test_set['label'],
        start_seq_len = 512, start_batch_size = 32)
print('Test-set performance:')
display(results)
print('Confusion matrix:')
display(confusion_matrix)
