Do you speak protein?
Natural Language Processing leads to breakthroughs in protein studies
Most human languages are represented in written form as sequences of letters. Letters combine into words, and words form sentences. Proteins are also assembled from “letters” — the 20 common amino acids (not counting unconventional ones). Amino acids form fairly stable modular patterns — protein motifs, secondary structure elements and domains — which roughly resemble the words, phrases and sentences of human language.
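The analogy can be made concrete in a few lines of code: treat each amino acid as a letter and overlapping k-mers as crude “words”. This is a minimal sketch; the sequence used is an arbitrary example, not a real protein.

```python
# The 20 common amino acids, one letter each -- the protein "alphabet".
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")

def kmer_words(sequence, k=3):
    """Split a protein sequence into overlapping k-mer 'words'."""
    assert set(sequence) <= AMINO_ACIDS, "unknown amino acid letter"
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

seq = "MKVLAT"
print(kmer_words(seq))  # → ['MKV', 'KVL', 'VLA', 'LAT']
```

Such k-mer tokenizations are a common starting point for applying text-processing tools to biological sequences.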
A sentence in a human language carries far more information than the sum of its individual symbols. The meaning of words and phrases depends on context and can be interpreted differently by different readers depending on their backgrounds and a priori knowledge. Similarly, a protein sequence is translated into the 3D shape of a functional protein, which is dynamic and context-dependent (it depends on the species, tissue, cell type, cellular state, interactions with other proteins, post-translational modifications, etc.).
It is thus very tempting to apply the methods used to parse and comprehend natural human languages to proteins.
A recent comprehensive review summarizes numerous attempts to apply Natural Language Processing (NLP) machine learning algorithms to various protein-related biological problems.
The most promising results come from deep learning models trained to predict the next token in a text given its context. The most successful of these are based on the Transformer architecture, which is now becoming a standard in the field.
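To illustrate what “predict the next token” means for a protein alphabet, here is a deliberately tiny stand-in: a bigram model that counts which amino acid tends to follow which. A Transformer does the same job, but conditions on the whole preceding context with learned attention; the training sequences below are made up for illustration.

```python
from collections import Counter, defaultdict

def train_bigram(sequences):
    """Count which amino acid follows which in the training set."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    return counts

def predict_next(counts, token):
    """Return the most likely next amino acid after `token`."""
    return counts[token].most_common(1)[0][0]

model = train_bigram(["MKVL", "MKIL", "MKVA"])
print(predict_next(model, "K"))  # → 'V' ('V' follows 'K' twice, 'I' once)
```

The step from this toy to a real protein language model is replacing the lookup table with a deep network, not changing the prediction task itself.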
In protein and genomics research, deep learning transformer models are used to reveal the taxonomic origin of proteins and to assess the stability of natural or artificial protein sequences. One of the most striking examples of protein language models is the detection of viral proteins in real-world metagenomic samples containing a mixture of species.
Another promising approach is attention-based models, in which the part of the protein sequence that the model “focuses on” is highlighted for a particular task (for example, which residues are important for interactions within the globule or between proteins). The largest model of this kind comes, quite surprisingly, from Facebook. Facebook’s Evolutionary Scale Model (ESM) is currently the largest protein language model, with 36 layers and 700 million parameters. This attention-based model was trained on 250 million protein sequences. As in conventional NLP with human languages, the larger the model, the better the benchmark results, with no sign of performance saturation in the near future.
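The “focusing” mechanism itself is simple enough to write out. Below is a minimal scaled dot-product attention sketch in pure Python: a query residue is compared against every key residue, and the resulting weights say which residues it attends to. Real models like ESM use learned query/key/value projections over many layers; the two-dimensional vectors here are hand-made toy embeddings.

```python
import math

def softmax(xs):
    """Turn raw scores into weights that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention_weights(query, keys):
    """Attention weights over residues for one query vector."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    return softmax(scores)

# Three toy residue embeddings; the query matches the first key best.
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
w = attention_weights([1.0, 0.0], keys)
print(max(range(3), key=lambda i: w[i]))  # → 0 (residue 0 gets most attention)
```

The weights form exactly the kind of residue-importance map the paragraph above describes: high weight means the model treats that residue as relevant context.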
NLP is particularly successful at predicting links in knowledge graphs. The latest models can take contextual information into account and predict cross-links between, for example, Wikipedia articles based on the text itself. In the protein universe, this task translates into predicting protein–protein interactions from sequences alone. However, this task turns out to be unexpectedly complex. First, the training dataset for proteins is many orders of magnitude smaller. Second, it is strongly evolutionarily biased — existing proteins are fine-tuned to either interact or not interact, with few examples of generic unoptimized sequences. Third, the known connections of the interaction graph are too sparse to allow efficient training.
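To make the link-prediction task concrete, here is a classical graph-based baseline (far simpler than the NLP models discussed): score a candidate protein pair by how many interaction partners they already share. The protein names and interactions are hypothetical.

```python
# A toy protein-protein interaction graph: protein -> set of partners.
interactions = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
    "E": {"F"},
    "F": {"E"},
}

def shared_partner_score(p1, p2):
    """Jaccard similarity of the two proteins' interaction partners."""
    n1, n2 = interactions[p1], interactions[p2]
    return len(n1 & n2) / len(n1 | n2)

print(shared_partner_score("B", "D"))  # → 0.5 (both interact with A)
print(shared_partner_score("B", "E"))  # → 0.0 (no shared partners)
```

The sparsity problem mentioned above is visible even here: a baseline like this can only score pairs that already sit near known edges, which is exactly where sequence-based models are supposed to do better.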
Some of the most famous NLP models of recent years are the generative models of the GPT family, which gained a lot of hype in the mass media for their ability to produce impressively realistic texts. Generative models also benefit strongly from larger training datasets and parameter counts: GPT-3 has 175 billion parameters, and the next generation of models of this kind is even larger. For the protein language, this means that the bottleneck is likely to be the number of available annotated sequences.
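The generation loop behind GPT-style models is itself very simple: repeatedly ask the model for the next token and append it until an end token appears. In the sketch below, `toy_next` is a hand-written lookup standing in for a real neural language model, purely for illustration.

```python
def toy_next(context):
    """Hypothetical 'model': picks the next amino acid from the last one."""
    table = {"M": "K", "K": "V", "V": "L", "L": "A", "A": "T"}
    return table.get(context[-1], "*")  # '*' = end-of-sequence token

def generate(prompt, max_len=6):
    """Autoregressive generation: extend the prompt one token at a time."""
    seq = prompt
    while len(seq) < max_len:
        nxt = toy_next(seq)
        if nxt == "*":
            break
        seq += nxt
    return seq

print(generate("M"))  # → 'MKVLAT'
```

A real generative protein model replaces the lookup with a network that outputs a probability distribution, from which the next residue is sampled rather than chosen greedily.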
An obvious area where generative models can be very useful is de novo protein design, including the construction of monoclonal antibodies and enzymes with tailored activity. The reverse problem can also be solved: predicting mutations that might help a virus evade the immune response of neutralizing antibodies. Such a study was recently performed for the SARS-CoV-2 virus.
The most famous protein-related model, AlphaFold2, which has recently de facto solved the problem of protein structure prediction, is not directly related to other “protein language” models and stands in a class of its own. The AlphaFold2 algorithm is complex and includes several modules based on a transformer design. One of them optimizes the links between pairs of amino acid residues (the residue–residue edges), while another operates on the different sequences in the input sequence alignment (the residue–sequence edges). Internally, these transformations contain “attention layers” that bring relevant data together and filter out irrelevant data in a context-dependent way learnt from the training data. These transformations are iterated multiple times. According to developer interviews, the “attention algorithm … mimics the way a person might assemble a jigsaw puzzle: first connecting pieces in small clumps — in this case clusters of amino acids — and then searching for ways to join the clumps in a larger whole”.
The output of these iterations is then fed into the final structure prediction module, which is also based on an iterative transformer architecture.
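The control flow described above can be sketched schematically: alternate updates of the residue-pair and alignment representations, then hand the result to a structure module. The update functions here are trivial placeholders that just count iterations; this is a sketch of the iteration pattern only, not AlphaFold2 code.

```python
def update_pairs(pair_repr, msa_repr):
    # Placeholder for the residue-residue (pair) attention block.
    return pair_repr + 1

def update_msa(msa_repr, pair_repr):
    # Placeholder for the residue-sequence (alignment) attention block.
    return msa_repr + 1

def structure_module(pair_repr, msa_repr):
    # Placeholder for the final iterative structure-prediction module.
    return ("structure", pair_repr, msa_repr)

def fold(pair_repr, msa_repr, n_iter=3):
    """Iterate the two transformer blocks, then predict the structure."""
    for _ in range(n_iter):
        pair_repr = update_pairs(pair_repr, msa_repr)
        msa_repr = update_msa(msa_repr, pair_repr)
    return structure_module(pair_repr, msa_repr)

print(fold(0, 0))  # → ('structure', 3, 3)
```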
For all their advantages, deep learning models have a number of drawbacks. They are slow to train, require large datasets, and often perform worse than algorithmic techniques on simple tasks and noisy inputs. NLP techniques are also better suited to long texts, while a typical protein is only a few hundred amino acid residues long. A major problem is overfitting, when the model captures irrelevant random noise present in the data. Deep learning models are also unstable with respect to their hyperparameters, whose tuning is often treated as a kind of “black magic”. Future research in this area should address these issues.
Receptor.AI uses NLP approaches at several points in our drug discovery pipeline. First, we use NLP to populate our in-house knowledge graph, which integrates data about diseases, targets, ligands and clinical trials. Second, we are developing state-of-the-art transformer-based generative models for molecular generation. In contrast to protein sequences, small drug-like molecules are not as easy to represent as a series of letters and words, but it is intuitively clear that there is a certain language of chemical groups that combine with each other according to well-established rules. We have developed a deep-learning model based on such a language that outperforms traditional molecular generators. This technique is currently being patented and will be provided as a service to our clients.
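Even without any proprietary representation, small molecules can already be read as text. The standard SMILES notation writes a molecule as a string, which can be split into chemically meaningful tokens (atoms, bonds, ring closures) with a regular expression. The sketch below shows this common tokenization scheme; it is a generic illustration, not Receptor.AI’s patented representation.

```python
import re

# Multi-character atoms (Cl, Br) and bracketed atoms must be matched
# before single-letter atoms; the rest are bonds, branches and ring digits.
SMILES_TOKEN = re.compile(
    r"Cl|Br|\[[^\]]+\]|[BCNOPSFI]|[bcnops]|[=#\-+()\\/@.%]|\d"
)

def tokenize_smiles(smiles):
    """Split a SMILES string into atom/bond/ring tokens."""
    tokens = SMILES_TOKEN.findall(smiles)
    assert "".join(tokens) == smiles, "untokenizable SMILES"
    return tokens

print(tokenize_smiles("CC(=O)O"))   # acetic acid → ['C', 'C', '(', '=', 'O', ')', 'O']
print(tokenize_smiles("c1ccccc1"))  # benzene, aromatic ring closure via '1'
```

Token sequences like these are what transformer-based molecular generators actually consume and emit.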