BERT Model Recovers Protein Structure

Researchers at Salesforce trained a language model that recovers the structure of proteins. The state-of-the-art BERT language model was used as the architecture. Along with the model, the developers released ProVis, a tool for three-dimensional visualization of the trained model's attention mechanism.

An example of a ProVis visualization: a protein sequence (helices) overlaid with the attention pattern of a trained Transformer (orange lines). Source: https://blog.einstein.ai/content/images/2020/06/blog2-1.gif

The model, trained to predict a hidden (masked) amino acid in a protein sequence, recovers high-level structural and functional characteristics of the protein (see the sketch after this list). The researchers found that the language model uses the attention mechanism for two purposes:

  1. To reveal the folding structure of proteins, linking regions that are distant in the sequence but close in three-dimensional space;
  2. To recognize binding sites, a key functional component of proteins.
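
The masked-amino-acid objective is easy to try directly. The minimal sketch below uses the publicly available Rostlab/prot_bert checkpoint from Hugging Face rather than the TAPE model discussed in this article, so it illustrates the same masked-token idea but is not the authors' exact setup; the sequence is an arbitrary example.

```python
from transformers import BertForMaskedLM, BertTokenizer, pipeline

# ProtBert expects amino acids separated by spaces; [MASK] hides one residue.
tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)
model = BertForMaskedLM.from_pretrained("Rostlab/prot_bert")

unmasker = pipeline("fill-mask", model=model, tokenizer=tokenizer)

# Ask the model which amino acid most likely fills the masked position.
for prediction in unmasker("D L I P T S S K L V V [MASK] D T S L Q V K K A F F A L V T"):
    print(prediction["token_str"], round(prediction["score"], 3))
```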

How it works

Proteins are complex molecules that play important functional and structural roles in living organisms. Despite this complex behavior, their representation is simple: each protein can be written as a chain of amino acids.
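
As a concrete illustration, here is a minimal sketch of treating a protein as a string of discrete characters. The alphabet is the 20 standard amino acids; the sequence is an arbitrary example.

```python
# The 20 standard amino acids, each written as a single letter
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def encode(sequence: str) -> list[int]:
    """Map a protein string to integer token ids, one per residue."""
    return [AMINO_ACIDS.index(aa) for aa in sequence]

protein = "MKTAYIAKQR"  # arbitrary example sequence
print(encode(protein))  # [10, 8, 16, 0, 19, 7, 0, 8, 13, 14]
```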

If we think of proteins as sequences of discrete characters (amino acids), then language models familiar from NLP become directly applicable to them. The researchers took a BERT model pre-trained on protein sequences from the TAPE benchmark. It exactly mirrors the BERT-Base architecture: 12 layers, each with 12 attention heads.
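
Loading the pre-trained model follows the usage example from the TAPE repository (installed via pip install tape_proteins); the sequence below is an arbitrary placeholder, not from the paper.

```python
import torch
from tape import ProteinBertModel, TAPETokenizer

# Pre-trained TAPE protein BERT: 12 layers, 12 attention heads each
model = ProteinBertModel.from_pretrained('bert-base')
tokenizer = TAPETokenizer(vocab='iupac')  # standard amino-acid vocabulary

sequence = 'GCTVEDRCLIGMGAILLNGCVIGSGSLVAAGALITQ'  # arbitrary example
token_ids = torch.tensor([tokenizer.encode(sequence)])

output = model(token_ids)
sequence_output = output[0]  # per-residue embeddings
pooled_output = output[1]    # whole-sequence embedding
```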
